Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 4/28/2022 has been entered.
Response to Amendment
Applicant's amendments and remarks submitted 4/28/2022 have been entered and considered, but are not found convincing. Claims 1 and 22-23 have been amended. In summary, claims 1-25 are pending in the application.
Claim Rejections - 35 U.S.C. 102:
Applicant’s arguments filed 4/28/2022 regarding independent claim 1 have been fully considered but are moot because the rejection has been modified to address the newly added limitations. The Examiner relies on Mishra for the argued limitations.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 10-18 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.
Claim 10, which depends from claim 9, recites “wherein generating, using the neural network separately trained with the set of audio training data and the set of video training data, the set of characteristics for controlling the avatar representing the one or more movements of the user's face further comprises”, but claim 9 does not recite “generating, using the neural network separately trained with the set of audio training data and the set of video training data....” It is not clear from which claim claim 10 is intended to depend. For purposes of examination, the Examiner interprets claim 10 as depending from claim 1.
Note: If claim 10 depends from claim 1, claim 10 recites the limitations “the first movement” and “the second movement” in lines 6-7. There is insufficient antecedent basis for these limitations in the claim.
Claim 12, which depends from claim 11, recites “wherein the first set of data representing the first movement of the user's face and the first set of characteristics are determined based on the received audio data.”, but claim 11 does not recite “first set of data representing the first movement of the user’s face…”. It is not clear from which claim claim 12 is intended to depend: from claim 8, which recites “a first set of data representing a first movement of the user's face”; from claim 10, which recites “a first set of characteristics representing the first movement of the user’s face”; or from a different claim. For purposes of examination, the Examiner interprets claim 12 as depending from claim 8.
Note: If claim 12 depends from claim 8, claim 12 recites the limitation “the first set of characteristics” in line 2. There is insufficient antecedent basis for this limitation in the claim.
Claim 13, which depends from claim 11, recites “wherein the second set of data representing the second movement of the user's face and the second set of characteristics is based on the received video data separate from the audio data.”, but claim 11 does not recite “second set of data representing the second movement of the user’s face….” It is not clear from which claim claim 13 is intended to depend: from claim 8, which recites “a second set of data representing a second movement of the user's face”; from claim 10, which recites “a second set of characteristics representing the second movement of the user’s face”; or from a different claim. For purposes of examination, the Examiner interprets claim 13 as depending from claim 8.
Note: If claim 13 depends from claim 8, claim 13 recites the limitation “the second set of characteristics” in lines 2-3. There is insufficient antecedent basis for this limitation in the claim.
Claim 15, which depends from claim 14, recites “wherein the first set of data representing the first movement of the user's face and the first set of characteristics is based on the received audio data and the received video data.”, but claim 14 does not recite “first set of data representing the first movement of the user’s face…” It is not clear from which claim claim 15 is intended to depend: from claim 8, which recites “a first set of data representing a first movement of the user's face”; from claim 10, which recites “a first set of characteristics representing the first movement of the user’s face”; or from a different claim. For purposes of examination, the Examiner interprets claim 15 as depending from claim 8.
Note: If claim 15 depends from claim 8, claim 15 recites the limitation “the first set of characteristics” in line 2. There is insufficient antecedent basis for this limitation in the claim.
Claim 16, which depends from claim 14, recites “wherein the second set of data representing the second movement of the user's face and the second set of characteristics is based on the received video data”, but claim 14 does not recite “second set of data representing the second movement of the user’s face….” It is not clear from which claim claim 16 is intended to depend: from claim 8, which recites “a second set of data representing a second movement of the user's face”; from claim 10, which recites “a second set of characteristics representing the second movement of the user’s face”; or from a different claim. For purposes of examination, the Examiner interprets claim 16 as depending from claim 8.
Note: If claim 16 depends from claim 8, claim 16 recites the limitation “the second set of characteristics” in lines 2-3. There is insufficient antecedent basis for this limitation in the claim.
Claim 17, which depends from claim 10, recites “wherein the one or more programs further comprise instructions, which when executed by one or more processors of a first electronic device, cause the first electronic device to:”, but claim 10 does not recite “one or more programs further comprise instructions, which when executed by one or more processors of a first electronic device, cause the first electronic device”. It is unclear from which claim claim 17 is intended to depend. For purposes of examination, the Examiner interprets claim 17 as depending from claim 1.
Claim 14 is rejected based on the rejection of claim 10, and claim 18 is rejected based on the rejection of claim 17.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
1.  Claims 1-13 and 18-25 are rejected under 35 U.S.C. 103 as being unpatentable over Shin et al., U.S. Patent Application Publication No. 2020/0090393 (“Shin”) in view of Mishra et al., U.S. Patent Application Publication No. 2019/0172243 (“Mishra”).
Regarding independent claim 1, Shin teaches a non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of a first electronic device (¶0389 “The method of operating the robot and the robot system according to an example embodiment can be implemented as a code readable by a processor on a recording medium readable by the processor. The processor-readable recording medium includes all kinds of recording apparatuses in which data that can be read by the processor is stored. Examples of the recording medium that can be read by the processor include a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disk, an optical data storage apparatus, and/or the like, and may also be implemented in the form of a carrier wave such as transmission over the Internet. In addition, the processor-readable recording medium may be distributed over network-connected computer systems so that code readable by the processor in a distributed fashion can be stored and executed”), cause the first electronic device to:
receive an audio input (¶0082 “The robot 100 may include a voice input unit 125 for receiving a speech input of a user. The voice input unit may also be called a speech input unit or a voice/speech input device”);
receive a video input including at least a portion of a user's face, wherein the video input is separate from the audio input (¶0080 “The image acquisition unit 120 may photograph the front direction of the robot 100, and may photograph an image for user recognition”; ¶0158 “For example, the input data 590 may be moving image data photographed by the user, and the moving image data may include image data in which the user's face or the like is photographed and audio data including a speech uttered by a user.”);
determine, one or more movements of the user's face based on the received audio input and received video input (¶0170 “The face emotion recognizer 523 may recognize the facial expression of the user by detecting the facial area of the user in the input image data and recognizing facial expression landmark point information which is the feature points constituting the facial expression. The face emotion recognizer 523 may output the emotion class corresponding to the recognized facial expression or the probability value for each emotion class, and also output the facial feature point (facial expression landmark point) vector.”; ¶0231-0234 “Referring to FIG. 11, the robot 100 may acquire data related to a user (S1110). [0232] The data related to the user may be moving image data that photographed a user or real-time moving image data that is photographing the user. The robot 100 may use both the stored data and the data inputted in real time. [0233] The data related to the user may include image data (including the face of the user) and voice data (uttered by the user). The image data including the face of the user may be acquired through a camera of the image acquisition unit 120, and the voice data uttered by the user may be acquired through a microphone of the voice input unit 125. [0234] The emotion recognizer 74a may recognize the emotion information of the user based on the data related to the user (S1120).”; where the emotion information is based on the data related to the user, which includes image data and voice data); and
generate, using a neural network separately trained with a set of audio training data and a set of video training data (¶0178-0179 “The plurality of recognizers (or plurality of recognition processors) for each modal may include an artificial neural network corresponding to input characteristics of the unimodal input data that are inputted respectively. A multimodal emotion recognizer 511 may include an artificial neural network corresponding to characteristics of the input data. [0179] For example, the facial emotion recognizer 523 for performing image-based learning and recognition may include a Convolutional Neural Network (CNN), the other emotion recognizers 521 and 522 include a deep-network neural network (DNN), and the multimodal emotion recognizer 511 may include an artificial neural network of a Recurrent Neural Network (RNN)”; ¶0182 “The multimodal recognizer 510 may perform multimodal deep learning with the intermediate vector value of each voice, image, and text.”), a set of characteristics for controlling an avatar representing the one or more movements of the user's face (¶0190-0191 “The emotion recognizer 74a may output the plurality of unimodal emotion recognition results and one multimodal emotion recognition result as a level (probability) for each emotion class. [0191] For example, the emotion recognizer 74a may output the probability value for emotional classes of surprise, happiness, neutral, sadness, displeasure, anger, and fear, and there may be a higher probability of recognized emotional class as the probability value is higher. The sum of the probability values of seven emotion classes may be 100%.”; ¶0241 “The robot 100 may generate an avatar character by mapping emotion information of the recognized user to the face information of the user included in the data related to the user (S1130).”; ¶0210 “According to the embodiment, the robot 100 may generate an avatar character by synthesizing a facial expression landmark point image generated in correspondence with recognized emotion information on the face image data of the user as augmented reality. For example, the frowning eye, eyebrow, and forehead may cover the eye, eyebrow, and forehead of the user's face image in their own positions with augmented reality. Thus, an avatar character expressing the user's displeasure emotion may be generated.”; ¶0220 “Referring to FIG. 8, when the emotion of the user is recognized as neutrality (or neutral), the avatar character may be generated as a smiling neutral expression 8”; where the emotion is recognized as neutral and the avatar character may be generated with a smiling neutral expression, which is considered a set of characteristics (emotion information) for controlling an avatar representing the one or more movements of the user's face). Shin is understood to be silent on the remaining limitations of claim 1.
In the same field of endeavor, Mishra teaches a set of characteristics for controlling an avatar representing the one or more movements of the user's face, wherein the set of characteristics cause the avatar to perform the one or more movements of the user's face (¶0007 “The emotion metric input can be obtained from facial analysis of an individual. The facial analysis can be based on using classifiers, using a deep neural network, and so on. The animated avatar can represent facial expressions of the individual. The animated emoji, cartoon, morphed imaged, etc. can represent a smile, a smirk, a frown, a laugh, a yawn, etc. The facial expression can be identified using a software development kit (SDK). The software development kit can be provided by a vendor, obtained as shareware, and so on. The animated avatar can represent an empathetic mirroring of the individual. In embodiments, the empathetic mirroring can cause the avatar to have a similar expression to the individual. The similar expression can include a smile in reaction to a smile, a smirk in reaction to a smirk, and so on.”; ¶0028 “The animated avatar image can represent a mirroring of emotions. For example, in response to a person smiling, the animated avatar image can smile back. In response to a person laughing, the animated avatar image can laugh back, which includes both visual and vocal animation”; ¶0029 of Mishra “The animated avatar image can be altered after it is first generated. For example, a person may be sad for a long time, and an avatar may be generated based on facial images of the person while a sad face is exhibited. At a later time, one or more additional facial images of the person may be obtained, and the avatar image can be modified, altered, or changed based on the subsequent one or more additional facial images. For example, a later, deep smile for the person may generate a mirrored smiling avatar image that replaces or animates from the first generated avatar image. The person may use a self-avatar which is generated according to the current invention. Thus some embodiments comprise altering a self-avatar image of a person based on facial analysis of images of the person obtained after the self-avatar image was generated”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Shin's generation of a character expressing the emotion of a video call counterpart to include the mirroring of emotions taught by Mishra, because this modification would cause the avatar to have a similar expression to the individual (¶0007 of Mishra).
Thus, the combination of Shin and Mishra teaches a non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of a first electronic device, cause the first electronic device to: receive an audio input; receive a video input including at least a portion of a user's face, wherein the video input is separate from the audio input; determine, one or more movements of the user's face based on the received audio input and received video input; and generate, using a neural network separately trained with a set of audio training data and a set of video training data, a set of characteristics for controlling an avatar representing the one or more movements of the user's face, wherein the set of characteristics cause the avatar to perform the one or more movements of the user's face.
Regarding claim 2, Shin and Mishra teach the non-transitory computer-readable storage medium of claim 1, wherein the audio input is received by a microphone of the first electronic device (¶0233 of Shin “The data related to the user may include image data (including the face of the user) and voice data (uttered by the user). The image data including the face of the user may be acquired through a camera of the image acquisition unit 120, and the voice data uttered by the user may be acquired through a microphone of the voice input unit 125.”)
Regarding claim 3, Shin and Mishra teach the non-transitory computer-readable storage medium of claim 1, wherein the video input is received by a camera of the first electronic device (¶0233 of Shin “The data related to the user may include image data (including the face of the user) and voice data (uttered by the user). The image data including the face of the user may be acquired through a camera of the image acquisition unit 120, and the voice data uttered by the user may be acquired through a microphone of the voice input unit 125.”)
Regarding claim 4, Shin and Mishra teach the non-transitory computer-readable storage medium of claim 1, wherein the audio input and the video input are received from a second electronic device (¶0346-0347 of Shin “For example, the second robot 100b may receive, from the first robot 100a, image data photographed by the user of the first robot 100a, voice data uttered by the user of the first robot 100a, etc. (S1810). After that, the first robot 100a and the second robot 100b may transmit and receive data necessary for video call while continuously performing a video call. [0347] The second robot 100b, which received the image data and the voice data from the first robot 100a, may recognize the emotion of the user of the first robot 100a (i.e., the video call counterpart) based on the received image data and voice data (S1820).”).
Regarding claim 5, Shin and Mishra teach the non-transitory computer-readable storage medium of claim 1, wherein the video input includes at least a portion of a first user's face and wherein the audio input includes speech of a second user (¶0233 of Shin “The data related to the user may include image data (including the face of the user) and voice data (uttered by the user). The image data including the face of the user may be acquired through a camera of the image acquisition unit 120, and the voice data uttered by the user may be acquired through a microphone of the voice input unit 125.”; ¶0047 of Mishra “FIG. 7 is a diagram showing image and audio collection including multiple mobile devices. Data including image data and audio data can be collected using multiple mobile devices, where the data can be used for image generation for avatar image animation using translation vectors. The plurality of translation vectors can be identified using a bottleneck layer within an autoencoder such as a variational autoencoder and a generative autoencoder. In the diagram 700, the multiple mobile devices can be used separately or in combination to collect video data, audio data, or both video data and audio data on a user 710. While one person is shown, the video data and audio data can be collected on multiple people”; ¶0048 of Mishra “As noted before, video data and audio data can be collected on one or more users in substantially identical or different situations and viewing either a single media presentation or a plurality of presentations. The data collected on the user 710 can be analyzed and viewed for a variety of purposes including expression analysis, mental state analysis, emotional state analysis, and so on. The electronic display 712 can be on a laptop computer 720 as shown, a tablet computer 750, a cell phone 740, a television, a mobile monitor, or any other type of electronic device. In one embodiment, video data including expression data is collected on a mobile device such as a cell phone 740, a tablet computer 750, a laptop computer 720, or a watch 770. Similarly, the audio data including speech data and non-speech vocalizations can be collected on one or more of the mobile devices.”). In addition, the same motivation applies as in the rejection of claim 1.
Regarding claim 6, Shin and Mishra teach the non-transitory computer-readable storage medium of claim 1, wherein the one or more programs further comprise instructions, which when executed by one or more processors of a first electronic device, cause the first electronic device to:
provide a set of audio training data to the neural network; provide a set of video training data to the neural network (¶0179 of Shin “For example, the facial emotion recognizer 523 for performing image-based learning and recognition may include a Convolutional Neural Network (CNN), the other emotion recognizers 521 and 522 include a deep-network neural network (DNN), and the multimodal emotion recognizer 511 may include an artificial neural network of a Recurrent Neural Network (RNN).”; ¶0184 of Shin “The emotion recognizer 74a may use a total of four deep learning models including the deep learning model of three emotion recognizers for each modal 521, 522, 523 and the deep learning model of one multimodal recognizer 510.”); and
train the neural network using both the audio training data and the video training data (¶0184 of Shin “The emotion recognizer 74a may use a total of four deep learning models including the deep learning model of three emotion recognizers for each modal 521, 522, 523 and the deep learning model of one multimodal recognizer 510.”; ¶0185 of Shin “The multimodal recognizer 510 may include a merger 512 (or hidden state merger) for combining the feature point vectors outputted from the plurality of recognizers for each modal 521, 522, and 523, and a multimodal emotion recognizer 511 that is learned to recognize emotion information of the user included in the output data of the merger 512.”).
Regarding claim 7, Shin and Mishra teach the non-transitory computer-readable storage medium of claim 6, wherein training the neural network using both the audio training data and the video training data includes at least one of: training the neural network with the audio training data and the video training data concurrently; training the neural network with the audio training data and without the video training data; and training the neural network with the video training data and without the audio training data (¶0179 of Shin “For example, the facial emotion recognizer 523 for performing image-based learning and recognition may include a Convolutional Neural Network (CNN), the other emotion recognizers 521 and 522 include a deep-network neural network (DNN), and the multimodal emotion recognizer 511 may include an artificial neural network of a Recurrent Neural Network (RNN).”; ¶0184 of Shin “The emotion recognizer 74a may use a total of four deep learning models including the deep learning model of three emotion recognizers for each modal 521, 522, 523 and the deep learning model of one multimodal recognizer 510.”; ¶0185 of Shin “The multimodal recognizer 510 may include a merger 512 (or hidden state merger) for combining the feature point vectors outputted from the plurality of recognizers for each modal 521, 522, and 523, and a multimodal emotion recognizer 511 that is learned to recognize emotion information of the user included in the output data of the merger 512.”).
Regarding claim 8, Shin and Mishra teach the non-transitory computer-readable storage medium of claim 1, wherein determining, a set of data representing one or more movements of the user's face based on the received audio input and received video input further comprises: 
determining a first set of data representing a first movement of the user's face (¶0210 of Shin “According to the embodiment, the robot 100 may generate an avatar character by synthesizing a facial expression landmark point image generated in correspondence with recognized emotion information on the face image data of the user as augmented reality. For example, the frowning eye, eyebrow, and forehead may cover the eye, eyebrow, and forehead of the user's face image in their own positions with augmented reality. Thus, an avatar character expressing the user's displeasure emotion may be generated.”; ¶0220 of Shin “Referring to FIG. 8, when the emotion of the user is recognized as neutrality (or neutral), the avatar character may be generated as a smiling neutral expression 8”); and determining a second set of data representing a second movement of the user's face (¶0221 of Shin “When the emotion of the user is recognized as a surprise, the avatar character may be generated showing a surprise expression 820 of raising eyebrows and opening the mouth”; ¶0242 of Shin “The avatar character may express individuality of the user by a character reflecting at least one of the features extracted from the face information of the user. For example, the avatar character may be generated by reflecting at least one of the facial expression landmark point extracted from the face information of the user. If the facial expression landmark point of a specific user is an eye, various emotions can be expressed by keeping the eye as a feature point. Alternatively, if eyes and mouth are considered as landmark point, eyes and mouth to a plurality of sample characters, or to characterize only eyes and mouth shapes like a caricature.”; ¶0217 of Shin “If the recognized emotion level of the user is larger, the expression degree of specific emotion can be greatly changed in the default expression. For example, if the level of happiness is large, the degree of opening of the mouth, which is the landmark point included in the expression of the happiness emotion class, can be changed more widely.”; ¶0029 of Mishra “The animated avatar image can be altered after it is first generated. For example, a person may be sad for a long time, and an avatar may be generated based on facial images of the person while a sad face is exhibited. At a later time, one or more additional facial images of the person may be obtained, and the avatar image can be modified, altered, or changed based on the subsequent one or more additional facial images. For example, a later, deep smile for the person may generate a mirrored smiling avatar image that replaces or animates from the first generated avatar image. The person may use a self-avatar which is generated according to the current invention. Thus some embodiments comprise altering a self-avatar image of a person based on facial analysis of images of the person obtained after the self-avatar image was generated”). In addition, the same motivation applies as in the rejection of claim 1.
Regarding claim 9, Shin and Mishra teach the non-transitory computer-readable storage medium of claim 1, wherein the neural network further comprises a plurality of neural networks including a first neural network, a second neural network, and a third neural network (¶0239 of Shin “As described with reference to FIG. 5, the server 70 including the emotion recognizer 74a may include a plurality of artificial neural networks learned by the unimodal input, and may include an artificial neural network learned by the multi-modal input based on the plurality of unimodal inputs”; ¶0335 of Shin “As described with reference to FIG. 5, the emotion recognition server 70 may include a plurality of artificial neural networks 521, 522, and 523 learned by the unimodal input. The emotion recognition server 70 may include an artificial neural network 511 learned by the multimodal input based on the plurality of unimodal inputs. The neural networks 511, 521, 522, 523 included in the emotion recognition server 70 may be an artificial neural network suitable for respective input data”).
Regarding claim 10, Shin and Mishra teach the non-transitory computer-readable storage medium of claim 9 (1), wherein generating, using the neural network separately trained with the set of audio training data and the set of video training data, the set of characteristics for controlling an avatar representing the one or more movements of the user's face further comprises: 
generating, with the first neural network, a first set of characteristics representing the first movement of the user's face (¶0335 of Shin “As described with reference to FIG. 5, the emotion recognition server 70 may include a plurality of artificial neural networks 521, 522, and 523 learned by the unimodal input. The emotion recognition server 70 may include an artificial neural network 511 learned by the multimodal input based on the plurality of unimodal inputs. The neural networks 511, 521, 522, 523 included in the emotion recognition server 70 may be an artificial neural network suitable for respective input data”; ¶0164 of Shin “The sound unimodal input data 532 may be inputted, while being used as the speech learning data, to a speech emotion recognizer 522 (or speech emotion recognition processor) that performs deep learning”; ¶0168 of Shin “The speech emotion recognizer 522 may extract the feature points of the input speech data. The speech feature points may include tone, volume, waveform, etc. of speech. The speech emotion recognizer 522 may determine the emotion of the user by detecting a tone of speech or the like.”; ¶0169 of Shin “The speech emotion recognizer 522 may also output the emotion recognition result and the detected speech feature point vectors”; ¶0069 of Mishra “Deep learning applications include processing of image data, audio data, and so on. Image data applications include image recognition, facial recognition, etc. Image data applications can include differentiating dogs from cats, identifying different human faces, and the like. The image data applications can include identifying moods, mental states, emotional states, and so on, from the facial expressions of the faces that are identified. Audio data applications can include analyzing audio such as ambient room sounds, physiological sounds such as breathing or coughing, noises made by an individual such as tapping and drumming, voices, and so on. The voice data applications can include analyzing a voice for timbre, prosody, vocal register, vocal resonance, pitch, loudness, speech rate, or language content. The voice data analysis can be used to determine one or more moods, mental states, emotional states, etc.”);
generating, with the second neural network, a second set of characteristics representing the second movement of the user's face (¶0335 of Shin“ As described with reference to FIG. 5, the emotion recognition server 70 may include a plurality of artificial neural networks 521, 522, and 523 learned by the unimodal input. The emotion recognition server 70 may include an artificial neural network 511 learned by the multimodal input based on the plurality of unimodal inputs. The neural networks 511, 521, 522, 523 included in the emotion recognition server 70 may be an artificial neural network suitable for respective input data”; ¶0165 of Shin “The image unimodal input data 533 (including one or more face image data) may be inputted, while being used as the image learning data, to a face emotion recognizer 523 (or face emotion recognition processor) that performs deep learning.”; ¶0170 of Shin “The face emotion recognizer 523 may recognize the facial expression of the user by detecting the facial area of the user in the input image data and recognizing facial expression landmark point information which is the feature points constituting the facial expression. The face emotion recognizer 523 may output the emotion class corresponding to the recognized facial expression or the probability value for each emotion class, and also output the facial feature point (facial expression landmark point) vector.”; ¶0069 of Mishra “Deep learning applications include processing of image data, audio data, and so on. Image data applications include image recognition, facial recognition, etc. Image data applications can include differentiating dogs from cats, identifying different human faces, and the like. The image data applications can include identifying moods, mental states, emotional states, and so on, from the facial expressions of the faces that are identified. 
Audio data applications can include analyzing audio such as ambient room sounds, physiological sounds such as breathing or coughing, noises made by an individual such as tapping and drumming, voices, and so on. The voice data applications can include analyzing a voice for timbre, prosody, vocal register, vocal resonance, pitch, loudness, speech rate, or language content. The voice data analysis can be used to determine one or more moods, mental states, emotional states, etc.”); and 
generating, with the third neural network, a combined set of characteristics representing the first movement and the second movement of the user's face (¶0185 of Shin “The multimodal recognizer 510 may include a merger 512 (or hidden state merger) for combining the feature point vectors outputted from the plurality of recognizers for each modal 521, 522, and 523, and a multimodal emotion recognizer 511 that is learned to recognize emotion information of the user included in the output data of the merger 512”; ¶0186 of Shin “The merger 512 may synchronize the output data of the plurality of recognizers for each modal 521, 522, and 523, and may combine (vector concatenation) the feature point vectors to output to the multimodal emotion recognizer 511”; ¶0188 of Shin “For example, the multimodal emotion recognizer 511 may output the emotion class having the highest probability among a certain number of preset emotion classes as the emotion recognition result, and/or may output a probability value for each emotion class as the emotion recognition result.”). In addition, the same motivation is used as the rejection for claim 1.
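For illustration only, the merger and multimodal recognizer arrangement quoted above from ¶0185-0188 of Shin can be sketched as follows. This is a minimal sketch under stated assumptions and is not Shin's implementation: the function names, the vector sizes, and the single linear layer with softmax standing in for the learned multimodal emotion recognizer 511 are hypothetical. The emotion class names are those listed in ¶0191 of Shin, quoted in the rejection of claim 23.

```python
import math

# Emotion classes as listed in Shin ¶0191 (order here is arbitrary).
EMOTION_CLASSES = ["surprise", "happiness", "neutral", "sadness",
                   "displeasure", "anger", "fear"]


def merge_feature_vectors(*vectors):
    """Concatenate per-modality feature vectors (the 'vector concatenation' of ¶0186)."""
    merged = []
    for vector in vectors:
        merged.extend(vector)
    return merged


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def recognize_emotion(merged, weights):
    """Hypothetical stand-in for recognizer 511: one linear layer plus softmax.

    Returns the highest-probability class and the per-class probabilities,
    mirroring the two kinds of output described in ¶0188.
    """
    scores = [sum(w * x for w, x in zip(row, merged)) for row in weights]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return EMOTION_CLASSES[best], probs


# Illustrative (made-up) feature vectors for the text, speech, and face recognizers.
text_vec, speech_vec, face_vec = [0.1, 0.2], [0.3], [0.4, 0.5, 0.6]
merged = merge_feature_vectors(text_vec, speech_vec, face_vec)

# Made-up weights: only the "happiness" row responds to the merged vector.
weights = [[0.0] * len(merged) for _ in EMOTION_CLASSES]
weights[EMOTION_CLASSES.index("happiness")] = [1.0] * len(merged)

label, probs = recognize_emotion(merged, weights)
```

The sketch only shows the data flow relied upon in the rejection: separate per-modality outputs are combined into one vector, and a third network produces a result from that combined vector.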
Regarding claim 11, Shin and Mishra teach the non-transitory computer-readable storage medium of claim 10, wherein the first neural network is trained with the audio training data (¶0335 of Shin “As described with reference to FIG. 5, the emotion recognition server 70 may include a plurality of artificial neural networks 521, 522, and 523 learned by the unimodal input. The emotion recognition server 70 may include an artificial neural network 511 learned by the multimodal input based on the plurality of unimodal inputs. The neural networks 511, 521, 522, 523 included in the emotion recognition server 70 may be an artificial neural network suitable for respective input data”; ¶0164 of Shin “The sound unimodal input data 532 may be inputted, while being used as the speech learning data, to a speech emotion recognizer 522 (or speech emotion recognition processor) that performs deep learning”) and the second neural network is trained with the video training data (¶0335 of Shin “As described with reference to FIG. 5, the emotion recognition server 70 may include a plurality of artificial neural networks 521, 522, and 523 learned by the unimodal input. The emotion recognition server 70 may include an artificial neural network 511 learned by the multimodal input based on the plurality of unimodal inputs. The neural networks 511, 521, 522, 523 included in the emotion recognition server 70 may be an artificial neural network suitable for respective input data”; ¶0165 of Shin “The image unimodal input data 533 (including one or more face image data) may be inputted, while being used as the image learning data, to a face emotion recognizer 523 (or face emotion recognition processor) that performs deep learning.”).
Regarding claim 12, Shin and Mishra teach the non-transitory computer-readable storage medium of claim 11 (8), wherein the first set of data representing the first movement of the user's face and the first set of characteristics are determined based on the received audio data (¶0165 of Shin “The image unimodal input data 533 (including one or more face image data) may be inputted, while being used as the image learning data, to a face emotion recognizer 523 (or face emotion recognition processor) that performs deep learning.”; ¶0170 of Shin “The face emotion recognizer 523 may recognize the facial expression of the user by detecting the facial area of the user in the input image data and recognizing facial expression landmark point information which is the feature points constituting the facial expression. The face emotion recognizer 523 may output the emotion class corresponding to the recognized facial expression or the probability value for each emotion class, and also output the facial feature point (facial expression landmark point) vector.”; ¶0045 of Mishra “FIG. 6 is an example illustrating translation vectors. Avatar image animation can be based on translation vectors 600. A plurality of translation vectors can be identified using a bottleneck layer within an autoencoder. An autoencoder, such as a variational autoencoder, a generational autoencoder, and so on, can include an artificial neural network. The artificial neural network can include a convolutional neural network, a deep neural network, etc. The autoencoder can be trained to generate a synthetic emotive face, where the synthetic emotive face can be an emoji, a cartoon, an image of a person, a morphed image, and so on. The generating a synthetic emotive face can include generating a neutral avatar face, where the neutral avatar face can display a neutral avatar expression 610. 
The neutral avatar expression can be learned for a neutral facial expression of a person, for averaged facial expressions for a plurality of people, and so on. The learning can be based on using a convolutional neural network for which layers of the convolutional neural network can be generated. In embodiments, the learning can include generating a first set of bottleneck layer parameters, from the bottleneck layer, learned for a neutral face. The neutral face can be "translated" using translation vectors 620 to show other emotions, where determining the one or more emotions can be based on detecting laughter, cries, sighs, yawns, grunts, filled and unfilled pauses, and so on, from the video data and the audio data. In embodiments, the translating can be based on generating a second set of bottleneck layer parameters for an emotional face. In other embodiments, subtracting the first set of bottleneck layer parameters from the second set of bottleneck layer parameters can be used in the translation vectors.”). In addition, the same motivation is used as the rejection for claim 1.
Regarding claim 13, Shin and Mishra teach the non-transitory computer-readable storage medium of claim 11(8), wherein the second set of data representing the second movement of the user's face and the second set of characteristics is based on the received video data separate from the audio data (¶0165 of Shin “The image unimodal input data 533 (including one or more face image data) may be inputted, while being used as the image learning data, to a face emotion recognizer 523 (or face emotion recognition processor) that performs deep learning.”; ¶0170 of Shin “The face emotion recognizer 523 may recognize the facial expression of the user by detecting the facial area of the user in the input image data and recognizing facial expression landmark point information which is the feature points constituting the facial expression. The face emotion recognizer 523 may output the emotion class corresponding to the recognized facial expression or the probability value for each emotion class, and also output the facial feature point (facial expression landmark point) vector.”)
Regarding claim 15, Shin and Mishra teach the non-transitory computer-readable storage medium of claim 14 (8), wherein the first set of data representing the first movement of the user's face and the first set of characteristics is based on the received audio data and the received video data (¶0045-0046 of Mishra “FIG. 6 is an example illustrating translation vectors. Avatar image animation can be based on translation vectors 600. A plurality of translation vectors can be identified using a bottleneck layer within an autoencoder. An autoencoder, such as a variational autoencoder, a generational autoencoder, and so on, can include an artificial neural network. The artificial neural network can include a convolutional neural network, a deep neural network, etc. The autoencoder can be trained to generate a synthetic emotive face, where the synthetic emotive face can be an emoji, a cartoon, an image of a person, a morphed image, and so on. The generating a synthetic emotive face can include generating a neutral avatar face, where the neutral avatar face can display a neutral avatar expression 610. The neutral avatar expression can be learned for a neutral facial expression of a person, for averaged facial expressions for a plurality of people, and so on. The learning can be based on using a convolutional neural network for which layers of the convolutional neural network can be generated. In embodiments, the learning can include generating a first set of bottleneck layer parameters, from the bottleneck layer, learned for a neutral face. The neutral face can be "translated" using translation vectors 620 to show other emotions, where determining the one or more emotions can be based on detecting laughter, cries, sighs, yawns, grunts, filled and unfilled pauses, and so on, from the video data and the audio data” [0046] The avatar or bot, including an animated avatar, may display other empathetic data which is different from the empathetic data of a user. 
The displaying of empathetic data can be based on the translation vectors. In embodiments, the avatar can be based on "empathetic mirroring". For empathetic mirroring, the avatar might mirror back the same facial expression as seen on the face of a person, while at other times, the avatar might mirror back a different expression. In embodiments, the avatar expression might mirror laughter when the person is laughing, while the avatar might mirror a sad face when the person is crying, or a thinking face when the person is angry. The translation vectors can be used to generate a variety of facial expressions including a smile 630, a frown 632, a yawn 634, a smirk 636, a laugh 638, sadness, thinking, ennui, and so on”) In addition, the same motivation is used as the rejection for claim 1.
Regarding claim 16, Shin and Mishra teach the non-transitory computer-readable storage medium of claim 14 (8), wherein the second set of data representing the second movement of the user's face and the second set of characteristics is based on the received video data (¶0165 of Shin “The image unimodal input data 533 (including one or more face image data) may be inputted, while being used as the image learning data, to a face emotion recognizer 523 (or face emotion recognition processor) that performs deep learning.”; ¶0170 of Shin “The face emotion recognizer 523 may recognize the facial expression of the user by detecting the facial area of the user in the input image data and recognizing facial expression landmark point information which is the feature points constituting the facial expression. The face emotion recognizer 523 may output the emotion class corresponding to the recognized facial expression or the probability value for each emotion class, and also output the facial feature point (facial expression landmark point) vector.”; ¶0029 of Mishra “The animated avatar image can be altered after it is first generated. For example, a person may be sad for a long time, and an avatar may be generated based on facial images of the person while a sad face is exhibited. At a later time, one or more additional facial images of the person may be obtained, and the avatar image can be modified, altered, or changed based on the subsequent one or more additional facial images. For example, a later, deep smile for the person may generate a mirrored smiling avatar image that replaces or animates from the first generated avatar image. The person may use a self-avatar which is generated according to the current invention. Thus some embodiments comprise altering a self-avatar image of a person based on facial analysis of images of the person obtained after the self-avatar image was generated”). In addition, the same motivation is used as the rejection for claim 1.
Regarding claim 17, Shin and Mishra teach the non-transitory computer-readable storage medium of claim 10 (1), wherein the one or more programs further comprise instructions, which when executed by one or more processors of a first electronic device, cause the first electronic device to: 
generate an avatar representing the user (¶0211 of Shin “Alternatively, the robot 100 may first generate the animation character based on face information of the user. Such an animation character may also be generated by reflecting the detected facial expression landmark points of the user. For example, in the example of a user having a large nose, animation character having a large nose may be created”; ¶0029 of Mishra “The animated avatar image can be altered after it is first generated. For example, a person may be sad for a long time, and an avatar may be generated based on facial images of the person while a sad face is exhibited. At a later time, one or more additional facial images of the person may be obtained, and the avatar image can be modified, altered, or changed based on the subsequent one or more additional facial images. For example, a later, deep smile for the person may generate a mirrored smiling avatar image that replaces or animates from the first generated avatar image. The person may use a self-avatar which is generated according to the current invention. Thus some embodiments comprise altering a self-avatar image of a person based on facial analysis of images of the person obtained after the self-avatar image was generated”); and 
animate the avatar using the combined set of characteristics representing the first movement and the second movement of the user's face (¶0211 of Shin “Additionally, the robot 100 may change the facial expression landmark points of the generated animation character to correspond to the recognized emotion information, thereby generating an avatar character expressing the specific emotion of the user.”; ¶0220-0222 of Shin “Referring to FIG. 8, when the emotion of the user is recognized as neutrality (or neutral), the avatar character may be generated as a smiling neutral expression 810. The neutral expression 810 may be set to a default expression that is used when the robot 100 does not recognize a particular emotion.  [0221] When the emotion of the user is recognized as a surprise, the avatar character may be generated showing a surprise expression 820 of raising eyebrows and opening the mouth.  [0222] When the emotion of the user is recognized as a displeasure, the avatar character may be generated showing a displeasure expression 830 of dropping the corner of his mouth and frowning”; ¶0029 of Mishra “The animated avatar image can be altered after it is first generated. For example, a person may be sad for a long time, and an avatar may be generated based on facial images of the person while a sad face is exhibited. At a later time, one or more additional facial images of the person may be obtained, and the avatar image can be modified, altered, or changed based on the subsequent one or more additional facial images. For example, a later, deep smile for the person may generate a mirrored smiling avatar image that replaces or animates from the first generated avatar image. The person may use a self-avatar which is generated according to the current invention. Thus some embodiments comprise altering a self-avatar image of a person based on facial analysis of images of the person obtained after the self-avatar image was generated”). 
In addition, the same motivation is used as the rejection for claim 1.
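For illustration only, the mapping from a recognized emotion to an avatar expression quoted above from ¶0220-0222 of Shin can be sketched as a lookup with a neutral default. This is a minimal sketch and not Shin's implementation: the dictionary layout, the function name, and the handling of emotions outside the three quoted expressions are assumptions made for the example.

```python
# Expressions and reference numerals follow FIG. 8 of Shin as quoted above.
EXPRESSIONS = {
    "neutral": {"ref": 810, "description": "smiling neutral expression"},
    "surprise": {"ref": 820, "description": "raised eyebrows and open mouth"},
    "displeasure": {"ref": 830, "description": "dropped mouth corner and frown"},
}


def select_expression(emotion):
    """Return the avatar expression for a recognized emotion.

    Per ¶0220, the neutral expression 810 is the default when no particular
    emotion is recognized (emotion is None). Falling back to neutral for an
    emotion that is simply absent from this small table is an assumption of
    this sketch, not something the quoted passages state.
    """
    return EXPRESSIONS.get(emotion, EXPRESSIONS["neutral"])


surprised = select_expression("surprise")
fallback = select_expression(None)
```

The sketch shows only that the recognized emotion information, rather than raw image or audio data, selects how the avatar's face is rendered.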
Regarding claim 18, Shin and Mishra teach the non-transitory computer-readable storage medium of claim 17, wherein animating the avatar using the combined set of characteristics representing the first movement and the second movement of the user's face further comprises: 
animating a first portion of the avatar using the first set of characteristics representing the first movement of the user's face; and animating a second portion of the avatar using the second set of characteristics representing the second movement of the user's face (¶0220-0222 of Shin “Referring to FIG. 8, when the emotion of the user is recognized as neutrality (or neutral), the avatar character may be generated as a smiling neutral expression 810. The neutral expression 810 may be set to a default expression that is used when the robot 100 does not recognize a particular emotion.  [0221] When the emotion of the user is recognized as a surprise, the avatar character may be generated showing a surprise expression 820 of raising eyebrows and opening the mouth.  [0222] When the emotion of the user is recognized as a displeasure, the avatar character may be generated showing a displeasure expression 830 of dropping the corner of his mouth and frowning”; ¶0029 of Mishra “The animated avatar image can be altered after it is first generated. For example, a person may be sad for a long time, and an avatar may be generated based on facial images of the person while a sad face is exhibited. At a later time, one or more additional facial images of the person may be obtained, and the avatar image can be modified, altered, or changed based on the subsequent one or more additional facial images. For example, a later, deep smile for the person may generate a mirrored smiling avatar image that replaces or animates from the first generated avatar image. The person may use a self-avatar which is generated according to the current invention. Thus some embodiments comprise altering a self-avatar image of a person based on facial analysis of images of the person obtained after the self-avatar image was generated”). In addition, the same motivation is used as the rejection for claim 1.
Regarding claim 19, Shin and Mishra teach the non-transitory computer-readable storage medium of claim 1, wherein the one or more programs further comprise instructions, which when executed by one or more processors of a first electronic device, cause the first electronic device to: 
generate an avatar representing the user (¶0262 of Shin “According to the embodiment, the robot 100 may generate an avatar character by synthesizing a facial expression landmark point image generated in correspondence with recognized emotion information on the face image data of the user, with augmented reality.”); and 
animate the avatar using the set of characteristics representing the one or more movements of the user's face (¶0263 of Shin “Alternatively, the robot 100 may first generate the animation character based on the face information of the user. Such an animation character may also be generated by reflecting the detected landmark points of the user. The robot 100 may change the facial expression landmark points of the generated animation character to correspond to the recognized emotion information, thereby generating an avatar character expressing a specific emotion of the user”; ¶0029 of Mishra “The animated avatar image can be altered after it is first generated. For example, a person may be sad for a long time, and an avatar may be generated based on facial images of the person while a sad face is exhibited. At a later time, one or more additional facial images of the person may be obtained, and the avatar image can be modified, altered, or changed based on the subsequent one or more additional facial images. For example, a later, deep smile for the person may generate a mirrored smiling avatar image that replaces or animates from the first generated avatar image. The person may use a self-avatar which is generated according to the current invention. Thus some embodiments comprise altering a self-avatar image of a person based on facial analysis of images of the person obtained after the self-avatar image was generated”). In addition, the same motivation is used as the rejection for claim 1.
Regarding claim 20, Shin and Mishra teach the non-transitory computer-readable storage medium of claim 19, wherein animating the avatar using the set of characteristics representing the one or more movements of the user's face further comprises: 
animating a first portion of the avatar using a first portion of the set of characteristics representing a first movement of the user's face (¶0224 of Shin “FIG. 9 illustrates facial expressions of an avatar character expressing the emotion class of anger. Referring to FIGS. 9(a) and 9(b), a first anger expression 910 and a second anger expression 920 may express shapes of eyes and mouth differently.”, where the shape of the eyes is considered the first portion of the avatar); and 
animating a second portion of the avatar using a second portion of the set of characteristics representing a second movement of the user's face (¶0225 of Shin “FIG. 10 illustrates facial expressions of an avatar character expressing the emotion class of happiness. Referring to FIGS. 10(a), 10(b), and 10(c), a first happiness expression 1010, a second happiness expression 1020, and a third happiness expression 1030 may express shapes of the eyes and the mouth differently.”, where the shape of the mouth is considered the second portion of the avatar).
Regarding claim 21, Shin and Mishra teach the non-transitory computer-readable storage medium of claim 19, wherein the one or more programs further comprise instructions, which when executed by one or more processors of a first electronic device, cause the first electronic device to:
display the animated avatar on a screen of the electronic device (¶0066 of Shin “The robot 100 may include a head 110 disposed in the upper side of the main body. A display 182 for displaying an image may be disposed on the front surface of the head 110”; ¶0284 of Shin “The robot may recognize emotion such as happiness, sadness, anger, surprise, fear, neutrality, and displeasure of at least one of the video call participants, map the recognized emotion to the character, and display this during a call.”).
Regarding independent claim 22, Shin teaches an electronic device comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions (¶0389 “The method of operating the robot and the robot system according to an example embodiment can be implemented as a code readable by a processor on a recording medium readable by the processor. The processor-readable recording medium includes all kinds of recording apparatuses in which data that can be read by the processor is stored. Examples of the recording medium that can be read by the processor include a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disk, an optical data storage apparatus, and/or the like, and may also be implemented in the form of a carrier wave such as transmission over the Internet. In addition, the processor-readable recording medium may be distributed over network-connected computer systems so that code readable by the processor in a distributed fashion can be stored and executed”). The remainder of claim 22 is similar in scope to claim 1 and is therefore rejected under the same rationale. 
Regarding independent claim 23, Shin teaches a method, comprising: at an electronic device with one or more processors and memory (¶0389 “The method of operating the robot and the robot system according to an example embodiment can be implemented as a code readable by a processor on a recording medium readable by the processor. The processor-readable recording medium includes all kinds of recording apparatuses in which data that can be read by the processor is stored. Examples of the recording medium that can be read by the processor include a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disk, an optical data storage apparatus, and/or the like, and may also be implemented in the form of a carrier wave such as transmission over the Internet. In addition, the processor-readable recording medium may be distributed over network-connected computer systems so that code readable by the processor in a distributed fashion can be stored and executed”): 
receiving an audio input (¶0082 “The robot 100 may include a voice input unit 125 for receiving a speech input of a user. The voice input unit may also be called a speech input unit or a voice/speech input device”);
receiving a video input including at least a portion of a user's face, wherein the video input is separate from the audio input (¶0080 “The image acquisition unit 120 may photograph the front direction of the robot 100, and may photograph an image for user recognition”; ¶0158 “For example, the input data 590 may be moving image data photographed by the user, and the moving image data may include image data in which the user's face or the like is photographed and audio data including a speech uttered by a user.”); 
determining a set of data representing one or more movements of the user's face based on the received audio input and received video input (¶0231-0234 “Referring to FIG. 11, the robot 100 may acquire data related to a user (S1110). [0232] The data related to the user may be moving image data that photographed a user or real-time moving image data that is photographing the user. The robot 100 may use both the stored data and the data inputted in real time. [0233] The data related to the user may include image data (including the face of the user) and voice data (uttered by the user). The image data including the face of the user may be acquired through a camera of the image acquisition unit 120, and the voice data uttered by the user may be acquired through a microphone of the voice input unit 125. [0234] The emotion recognizer 74a may recognize the emotion information of the user based on the data related to the user (S1120).” where emotion information based on the data related to the user which include image data and voice data); and 
generating, using a neural network separately trained with a set of audio training data and a set of video training data (¶0178-0179 “The plurality of recognizers (or plurality of recognition processors) for each modal may include an artificial neural network corresponding to input characteristics of the unimodal input data that are inputted respectively. A multimodal emotion recognizer 511 may include an artificial neural network corresponding to characteristics of the input data. [0179] For example, the facial emotion recognizer 523 for performing image-based learning and recognition may include a Convolutional Neural Network (CNN), the other emotion recognizers 521 and 522 include a deep-network neural network (DNN), and the multimodal emotion recognizer 511 may include an artificial neural network of a Recurrent Neural Network (RNN)”; ¶0182 “The multimodal recognizer 510 may perform multimodal deep learning with the intermediate vector value of each voice, image, and text..”), a set of characteristics for controlling an avatar representing the one or more movements of the user's face (¶0190-0191 “The emotion recognizer 74a may output the plurality of unimodal emotion recognition results and one multimodal emotion recognition result as a level (probability) for each emotion class. [0191] For example, the emotion recognizer 74a may output the probability value for emotional classes of surprise, happiness, neutral, sadness, displeasure, anger, and fear, and there may be a higher probability of recognized emotional class as the probability value is higher. The sum of the probability values of seven emotion classes may be 100%.”; ¶0241 “The robot 100 may generate an avatar character by mapping emotion information of the recognized user to the face information of the user included in the data related to the user (S1130).”; ¶0210 “According to the embodiment, the robot 100 may generate an avatar character by synthesizing a facial expression landmark point image generated in correspondence with recognized emotion information on the face image data of the user as augmented reality. For example, the frowning eye, eyebrow, and forehead may cover the eye, eyebrow, and forehead of the user's face image in their own positions with augmented reality. Thus, an avatar character expressing the user's displeasure emotion may be generated.”; ¶0220 “Referring to FIG. 8, when the emotion of the user is recognized as neutrality (or neutral), the avatar character may be generated as a smiling neutral expression 810”, where the emotion is recognized as neutral and the avatar character may be generated with a smiling neutral expression, which is considered a set of characteristics (emotion information) for controlling an avatar representing the one or more movements of the user's face). Shin is understood to be silent on the remaining limitations of claim 23.
In the same field of endeavor, Mishra teaches a set of characteristics for controlling an avatar representing the one or more movements of the user's face, wherein the set of characteristics cause the avatar to perform the one or more movements of the user's face (¶0007 “The emotion metric input can be obtained from facial analysis of an individual. The facial analysis can be based on using classifiers, using a deep neural network, and so on. The animated avatar can represent facial expressions of the individual. The animated emoji, cartoon, morphed imaged, etc. can represent a smile, a smirk, a frown, a laugh, a yawn, etc. The facial expression can be identified using a software development kit (SDK). The software development kit can be provided by a vendor, obtained as shareware, and so on. The animated avatar can represent an empathetic mirroring of the individual. In embodiments, the empathetic mirroring can cause the avatar to have a similar expression to the individual. The similar expression can include a smile in reaction to a smile, a smirk in reaction to a smirk, and so on.”; ¶0028 “The animated avatar image can represent a mirroring of emotions. For example, in response to a person smiling, the animated avatar image can smile back. In response to a person laughing, the animated avatar image can laugh back, which includes both visual and vocal animation”; ¶0029 of Mishra “The animated avatar image can be altered after it is first generated. For example, a person may be sad for a long time, and an avatar may be generated based on facial images of the person while a sad face is exhibited. At a later time, one or more additional facial images of the person may be obtained, and the avatar image can be modified, altered, or changed based on the subsequent one or more additional facial images. For example, a later, deep smile for the person may generate a mirrored smiling avatar image that replaces or animates from the first generated avatar image. The person may use a self-avatar which is generated according to the current invention. Thus some embodiments comprise altering a self-avatar image of a person based on facial analysis of images of the person obtained after the self-avatar image was generated”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Shin's generation of a character expressing the emotion of a video call counterpart to include the mirroring of emotions taught by Mishra, because this modification would cause the avatar to have a similar expression to the individual (¶0007 of Mishra).
Thus, the combination of Shin and Mishra teaches a method, comprising: at an electronic device with one or more processors and memory: receiving an audio input; receiving a video input including at least a portion of a user's face, wherein the video input is separate from the audio input; determining a set of data representing one or more movements of the user's face based on the received audio input and received video input; and generating, using a neural network separately trained with a set of audio training data and a set of video training data, a set of characteristics for controlling an avatar representing the one or more movements of the user's face, wherein the set of characteristics cause the avatar to perform the one or more movements of the user's face.
Regarding claim 24, Shin and Mishra teach the non-transitory computer-readable storage medium of claim 1, wherein the set of characteristics for controlling the avatar cause the avatar to move (¶0210 of Shin “According to the embodiment, the robot 100 may generate an avatar character by synthesizing a facial expression landmark point image generated in correspondence with recognized emotion information on the face image data of the user as augmented reality. For example, the frowning eye, eyebrow, and forehead may cover the eye, eyebrow, and forehead of the user's face image in their own positions with augmented reality. Thus, an avatar character expressing the user's displeasure emotion may be generated.”; ¶0220 of Shin “Referring to FIG. 8, when the emotion of the user is recognized as neutrality (or neutral), the avatar character may be generated as a smiling neutral expression 8”; ¶0007 of Mishra “The emotion metric input can be obtained from facial analysis of an individual. The facial analysis can be based on using classifiers, using a deep neural network, and so on. The animated avatar can represent facial expressions of the individual. The animated emoji, cartoon, morphed imaged, etc. can represent a smile, a smirk, a frown, a laugh, a yawn, etc. The facial expression can be identified using a software development kit (SDK). The software development kit can be provided by a vendor, obtained as shareware, and so on. The animated avatar can represent an empathetic mirroring of the individual. In embodiments, the empathetic mirroring can cause the avatar to have a similar expression to the individual. The similar expression can include a smile in reaction to a smile, a smirk in reaction to a smirk, and so on.”; ¶0028 of Mishra “The animated avatar image can represent a mirroring of emotions. For example, in response to a person smiling, the animated avatar image can smile back. In response to a person laughing, the animated avatar image can laugh back, which includes both visual and vocal animation”; ¶0029 of Mishra “The animated avatar image can be altered after it is first generated. For example, a person may be sad for a long time, and an avatar may be generated based on facial images of the person while a sad face is exhibited. At a later time, one or more additional facial images of the person may be obtained, and the avatar image can be modified, altered, or changed based on the subsequent one or more additional facial images. For example, a later, deep smile for the person may generate a mirrored smiling avatar image that replaces or animates from the first generated avatar image. The person may use a self-avatar which is generated according to the current invention. Thus some embodiments comprise altering a self-avatar image of a person based on facial analysis of images of the person obtained after the self-avatar image was generated.”). In addition, the same motivation is used as in the rejection of claim 1.
Regarding claim 25, Shin and Mishra teach the non-transitory computer-readable storage medium of claim 1, wherein the set of characteristics for controlling the avatar cause the avatar to reflect the one or more movements of the user's face (¶0007 of Mishra “The emotion metric input can be obtained from facial analysis of an individual. The facial analysis can be based on using classifiers, using a deep neural network, and so on. The animated avatar can represent facial expressions of the individual. The animated emoji, cartoon, morphed imaged, etc. can represent a smile, a smirk, a frown, a laugh, a yawn, etc. The facial expression can be identified using a software development kit (SDK). The software development kit can be provided by a vendor, obtained as shareware, and so on. The animated avatar can represent an empathetic mirroring of the individual. In embodiments, the empathetic mirroring can cause the avatar to have a similar expression to the individual. The similar expression can include a smile in reaction to a smile, a smirk in reaction to a smirk, and so on.”; ¶0028 of Mishra “The animated avatar image can represent a mirroring of emotions. For example, in response to a person smiling, the animated avatar image can smile back. In response to a person laughing, the animated avatar image can laugh back, which includes both visual and vocal animation”; ¶0029 of Mishra “The animated avatar image can be altered after it is first generated. For example, a person may be sad for a long time, and an avatar may be generated based on facial images of the person while a sad face is exhibited. At a later time, one or more additional facial images of the person may be obtained, and the avatar image can be modified, altered, or changed based on the subsequent one or more additional facial images. For example, a later, deep smile for the person may generate a mirrored smiling avatar image that replaces or animates from the first generated avatar image. The person may use a self-avatar which is generated according to the current invention. Thus some embodiments comprise altering a self-avatar image of a person based on facial analysis of images of the person obtained after the self-avatar image was generated.”). In addition, the same motivation is used as in the rejection of claim 1.
2.  Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Shin et al., U.S. Patent Application Publication No. 2020/0090393 (“Shin”), in view of Mishra et al., U.S. Patent Application Publication No. 20190172243 (“Mishra”), further in view of el Kaliouby et al., U.S. Patent Application Publication No. 20190012599 (“el Kaliouby”).
Regarding claim 14, Shin and Mishra teach the non-transitory computer-readable storage medium of claim 10, wherein the second neural network is trained with the video training data (¶0335 of Shin “As described with reference to FIG. 5, the emotion recognition server 70 may include a plurality of artificial neural networks 521, 522, and 523 learned by the unimodal input. The emotion recognition server 70 may include an artificial neural network 511 learned by the multimodal input based on the plurality of unimodal inputs. The neural networks 511, 521, 522, 523 included in the emotion recognition server 70 may be an artificial neural network suitable for respective input data”; ¶0165 of Shin “The image unimodal input data 533 (including one or more face image data) may be inputted, while being used as the image learning data, to a face emotion recognizer 523 (or face emotion recognition processor) that performs deep learning.”). Shin and Mishra are understood to be silent on the remaining limitations of claim 14.
In the same field of endeavor, el Kaliouby teaches wherein the first neural network is trained with the audio training data and the video training data (¶0057 “FIG. 3 illustrates a high-level diagram for deep learning. Multimodal machine learning can be based on deep learning. A plurality of information channels is captured into a computing device such as a smartphone, personal digital assistant (PDA), tablet, laptop computer, and so on. The plurality of information channels includes contemporaneous audio information and video information from an individual. Trained weights are learned on a multilayered convolutional computing system. The trained weights are learned using the audio information and the video information from the plurality of information channels. The trained weights cover both the audio information and the video information and are trained simultaneously. The learning facilitates emotional analysis of the audio information and the video information. Further information is captured into a second computing device. The second computing device and the first computing device may be the same computing device. The further information can include physiological information, contextual information, and so on. The further information is analyzed using the trained weights to provide an emotion metric based on the further information.”; ¶0059 “Deep learning is a branch of machine learning which seeks to imitate in software the activity which takes place in layers of neurons in the neocortex of the human brain. Deep learning applications include processing of image data, audio data, and so on. FIG. 3 illustrates a high-level diagram for deep learning 300. The deep learning can be accomplished using a multilayered convolutional computing system, a convolutional neural network, or other techniques. The deep learning can accomplish image analysis, audio analysis, and other analysis tasks. A deep learning component 320 collects and analyzes various types of information from a plurality of information channels. The information channels can include video facial information 310, audio voice information 312, other information 314, and so on. In embodiments, the other information can include one or more of electrodermal activity, heart rate, heart rate variability, skin temperature, blood pressure, muscle movements, or respiration”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Shin and Mishra, in which the first neural network is trained with the audio training data, to train that network using both audio information and video information as taught by el Kaliouby, because this modification would facilitate emotional analysis of the audio information and the video information (¶0057 of el Kaliouby).
Thus, the combination of Shin, Mishra and el Kaliouby teaches wherein the first neural network is trained with the audio training data and the video training data and the second neural network is trained with the video training data.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Singh (U.S. Patent Application Publication No. 20200051565): a computing system for generating image data representing a speaker's face includes a detection device configured to route data representing a voice signal to one or more processors and a data processing device comprising the one or more processors configured to generate a representation of a speaker that generated the voice signal in response to receiving the voice signal. The data processing device executes a voice embedding function to generate a feature vector from the voice signal representing one or more signal features of the voice signal, maps a signal feature of the feature vector to a visual feature of the speaker by a modality transfer function specifying a relationship between the visual feature of the speaker and the signal feature of the feature vector, and generates a visual representation of at least a portion of the speaker based on the mapping, the visual representation comprising the visual feature.
Mosseri et al. (U.S. Patent Application Publication No. 20200335121): a method that includes obtaining, for each frame in a stream of frames from a video in which faces of one or more speakers have been detected, a respective per-frame face embedding of the face of each speaker; processing, for each speaker, the per-frame face embeddings of the face of the speaker to generate visual features for the face of the speaker; obtaining a spectrogram of an audio soundtrack for the video; processing the spectrogram to generate an audio embedding for the audio soundtrack; combining the visual features for the one or more speakers and the audio embedding for the audio soundtrack to generate an audio-visual embedding for the video; determining a respective spectrogram mask for each of the one or more speakers; and determining a respective isolated speech spectrogram for each speaker.
Mittal et al. (U.S. Patent Application Publication No. 20200135226): a computer-implemented technique for animating a visual representation of a face based on spoken words of a speaker. A computing device receives an audio sequence comprising content features reflective of spoken words uttered by a speaker. The computing device generates latent content variables and latent style variables based upon the audio sequence. The latent content variables are used to synchronize movement of lips on the visual representation to the spoken words uttered by the speaker. The latent style variables are derived from an expected appearance of facial features of the speaker as the speaker utters the spoken words and are used to synchronize movement of full facial features of the visual representation to the spoken words uttered by the speaker. The computing device causes the visual representation of the face to be animated on a display based upon the latent content variables and the latent style variables.

Contact
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SARAH LE whose telephone number is (571)270-7842. The examiner can normally be reached Monday: 8AM-4:30PM EST, Tuesday: 8 AM-3:30PM EST, Wednesday: 8AM-2:30PM EST, Thursday and Friday off.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kent Chang, can be reached at (571) 272-7667. The fax number for the organization to which this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/SARAH LE/Primary Examiner, Art Unit 2619