DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Preliminary Amendment
Applicant’s preliminary amendment filed 6/9/2020 has been entered. Claims 1-7 have been amended, claims 8-16 have been cancelled, and claims 17-29 have been newly added. Claims 1-7 and 17-29 are pending in the current application. 

Specification
The title of the invention is not descriptive.  A new title is required that is clearly indicative of the invention to which the claims are directed. 

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 24-29 are rejected under 35 U.S.C. 101 as being directed to non-statutory subject matter. Claim 24 recites “[a] computer readable storage medium having a computer program stored thereon”. 
Applicant’s specification discloses at Page 25 that “the computer readable medium shown in the present invention may be a computer readable signal medium….computer readable but is not limited to….The computer readable signal medium may also be any computer readable medium other than the computer readable storage medium”. 
The claimed computer readable storage medium, under the broadest reasonable interpretation (BRI), is therefore not necessarily a non-transitory computer readable storage medium and may encompass a transitory signal medium. The claimed invention of claim 24 is thus directed to non-statutory subject matter. Claims 25-29 are rejected under the same rationale as claim 24. 

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1, 2, 4-7, 17, 18, 20-25 and 27-29 are rejected under 35 U.S.C. 103 as being unpatentable over Nauseef et al. US-PGPUB No. 2016/0191958 (hereinafter Nauseef) in view of Sahn US-PGPUB No. 2015/0099946 (hereinafter Sahn).   
Re Claim 1: 
Nauseef teaches an augmented reality method comprising: 
acquiring video information of a target (Nauseef teaches at FIG. 1 and Paragraph 0032 that each of the first user 102 and the second user 104 may hold a user device in front of his or her face so that a camera 110, 112 included in each respective user device 106, 108 may capture a live video feed of each user’s face); and 
Nauseef teaches at FIG. 1 and Paragraph 0032 that each of the first user 102 and the second user 104 may hold a user device in front of his or her face so that a camera 110, 112 included in each respective user device 106, 108 may capture a live video feed of each user’s face. Audio of each user may also be captured by a microphone); 
using the real image information to determine at least one image-based target state data (Nauseef teaches at Paragraph 0033 that analyzing the live video and/or audio feeds may enable the server to detect facial features of each user 102, 104 as well as any speech characteristics of each user’s speech. The facial features and/or speech characteristics identified during analysis of the video communication connection may be used to identify emotional cues such as facial gestures or vocal inflections of each user 102, 104 that are associated with predetermined emotions); 
using the real sound information to determine at least one sound-based target state data (The sound-based target state data may include the locational cues and/or emotional cues and/or contextual features relating to the vocal pitch/tone shifting. 
Nauseef teaches at Paragraph 0122 that the first user may be presented with contextual sound clips and/or acoustic filters that the first user may apply to the conversation and at Paragraph 0056 that the features unit 322 may utilize the numerical values of identified emotional cues to identify one or more contextual features (icons, audio clips) to be presented to a user device. 
Nauseef teaches at Paragraph 0053 that recognition 320 may be utilized for identifying vocal inflections of users…the gesture analysis unit 320 may analyze vocal inflection identified by the facial/vocal recognition unit 318 to identify emotional cues of users….emotional cues may include….tongue movements, teeth movements, vocal pitch shifting, vocal tone shifting, changes in word delivery speed; at Paragraph 0099 that the facial/vocal recognition unit 318 may analyze any captured audio of each user to identify changes in vocal pitch and/or vocal tone…..to determine whether the user is laughing, crying, yelling, screaming, using sarcasm, and/or is otherwise displaying a particular emotion…contextual features may be presented to various users at relevant times during a conversation; and at Paragraph 0121 that audio data such as pitch and cadence may be analyzed by the facial/vocal recognition unit 318 and/or the features unit 322 to discern emotions and other contextual information);  
acquiring virtual information corresponding to the target portrait data (Nauseef teaches at Paragraph 0033 that analyzing the live video and/or audio feeds may enable the server to detect facial features of each user 102, 104 as well as any speech characteristics of each user’s speech. The facial features and/or speech characteristics identified during analysis of the video communication connection may be used to identify emotional cues such as facial gestures or vocal inflections of each user 102, 104 that are associated with predetermined emotions); and 
superimposing the virtual information on the video information (Nauseef teaches at Paragraph 0033 that analyzing the live video and/or audio feeds may enable the server to detect facial features of each user 102, 104 as well as any speech characteristics of each user’s speech. The facial features and/or speech characteristics identified during analysis of the video communication connection may be used to identify emotional cues such as facial gestures or vocal inflections of each user 102, 104 that are associated with predetermined emotions and at Paragraph 0109 that when a user smiles, an image of a dinosaur that has been overlaid on the image of the user in the live video feed of the user may smile as well using the user’s detected smile as a reference….a smiley face icon may follow the movements of a user’s face in the live video feed so that when a user moves his head within the frame of the live video feed, the smiley face icon stays overlain on the user’s face).  
Nauseef implicitly teaches the claim limitation: 
fusing the image-based target state data and the sound-based target state data of a same type to obtain target portrait data (Nauseef teaches at Paragraph 0122 that the first user may be presented with contextual sound clips and/or acoustic filters that the first user may apply to the conversation and at Paragraph 0056 that the features unit 322 may utilize the numerical values of identified emotional cues to identify one or more contextual features (icons, audio clips) to be presented to a user device. 
Nauseef teaches at Paragraph 0049 that contextual features may include icons, emotions, images, text, audio samples and/or video clips associated with one or more predetermined emotions; at Paragraph 0108 that the relevant contextual features may be presented to the user….selecting a contextual feature for incorporation may include overlaying a live video feed and/or a live audio feed with an image, text, an icon, an audio clip and the like…selecting a contextual feature for incorporation….may further include masking and/or modifying a live audio feed of a user by modulating the user’s voice…..augmenting a background image of the live video feed with a pattern with an image of a particular setting; and at Paragraph 0109 that a smiley face icon may follow the movements of a user’s face in the live video feed. 
Nauseef teaches at Paragraph 0038 that based on detection of a first user 102’s smile and raised eyebrows, the server may provide to the first user device 106 a set of contextual features 118 associated with happiness, such as smiley face icons, a party hat, and/or the like. The first user 102 may then select one or more of the provided contextual features 118 to overlay the first user 102’s face in the video communication connection to enhance the happy emotions currently being experienced by the first user 102; at Paragraph 0053 that emotional cues may include vocal pitch shifting and/or vocal tone shifting; and at Paragraph 0057 that the features unit 322 may identify one or more contextual features, e.g., icons, audio samples stored in the content storage unit 334 to be presented to a second user. The user may then select one or more of the contextual features such as a smiley face icon for overlay into the video communication connection. Accordingly, the audio samples and icons are fused to obtain target portrait data). 
Sahn teaches the claim limitation: fusing the image-based target state data and the sound-based target state data of a same type to obtain target portrait data (Sahn teaches at FIGS. 10A-10B and Paragraph 0231-0234 that audio-derived emotional cues are applied to the identified emotional states to refine the emotional state of at least one individual…audio-derived emotional cues may be used to promote or demote the various options to identify a most likely emotional state candidate…audio-derived emotional cues may be used as a primary reference…to determine the emotional state of at least one individual….a feedback algorithm may augment the video feed of a heads-up display of a data collection device to overlay a description of the emotional state of the individual…a term or sentence for the emotional state may be presented audibly to the user such as mom is happy. Further, audio or video feedback may spell out to the user the particular response behavior to invoke such as an audible cue directing the subject to smile now or a visual cue including the text nod your head and look concerned…the user may be presented with verbal and/or audible warnings such as may bite or back away). 
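It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to have combined the teachings of Sahn and Nauseef so as to augment an individual’s video data and audio data with visual cues, such as emotional icons, and audible feedback cues that indicate the emotional states of the individual detected based on the video data and the audio data of the individual. One of ordinary skill in the art would have been motivated to present visual/audio cues to the individual in order to identify a point at which the individual should pause or emphasize a word while presenting a conversation snippet or speech fed to the individual (Sahn Paragraph 0147). 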
Re Claim 2: 
Claim 2 encompasses the same scope of invention as claim 1 except for the additional claim limitation that the image-based target state data includes at least one of emotion data, age data, and gender data, wherein the sound-based target state data includes at least one of emotion data, age data and gender data, and wherein at least one of the image-based target state data and the sound-based target state data includes a judgment result and a confidence degree corresponding to the judgment result. 
However, Nauseef further teaches the claim limitation that the image-based target state data includes at least one of emotion data, age data, and gender data, wherein the sound-based target state data includes at least one of emotion data, age data and gender data (
Nauseef teaches at Paragraph 0122 that the first user may be presented with contextual sound clips and/or acoustic filters that the first user may apply to the conversation and at Paragraph 0056 that the features unit 322 may utilize the numerical values of identified emotional cues to identify one or more contextual features (icons, audio clips) to be presented to a user device. 
Nauseef teaches at Paragraph 0033 that analyzing the live video and/or audio feeds may enable the server to detect facial features of each user 102, 104 as well as any speech characteristics of each user’s speech. The facial features and/or speech characteristics identified during analysis of the video communication connection may be used to identify emotional cues such as facial gestures or vocal inflections of each user 102, 104 that are associated with predetermined emotions. Nauseef teaches at Paragraph 0104 that a live video feed and/or a live audio feed….may be analyzed for a particular locational cues such as landmarks…speech accents, dialects…the facial/vocal recognition unit 318 may identify one or more locational cues included in the live video feed of the user and determine at least a partial match between identified objects of interest, e.g., location cues, and predetermined landmarks…., accents, associated with a known location), and wherein at least one of the image-based target state data and the sound-based target state data includes a judgment result (Nauseef teaches at Paragraph 0114 that the feature unit may identify or select contextual features relating to a dominant content which may be perceived as more relevant or likely to contribute to the conversation based on a relevance score) and a confidence degree corresponding to the judgment result (Nauseef teaches at Paragraph 0089 and Paragraph 0118 that the facial/vocal recognition unit 318 may identify objects of interest and/or emotional cues in the image based on a comparison of pixel color values and/or locations in the image…..the facial/vocal recognition unit 318 may determine at least a partial match, e.g., a partial match that meets and/or exceeds a predetermined threshold of confidence between an identified object of interest and a known facial feature to thereby confirm that the object of interest in the image is indeed a facial feature of the user). 

Re Claim 4: 
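Claim 4 encompasses the same scope of invention as claim 1 except for the additional claim limitation that using the real sound information to determine the at least one sound-based target state data includes: extracting a plurality of audio feature parameters in the real sound information; performing clustering of the audio feature parameters; and inputting the clustered audio feature parameters into a pre-established sound classification model to obtain the at least one sound-based target state data. 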
Nauseef and Sahn further teach the claim limitation that using the real sound information to determine the at least one sound-based target state data includes: extracting a plurality of audio feature parameters in the real sound information (Nauseef teaches at Paragraph 0099 that the facial/vocal recognition unit 318 may analyze any captured audio of each user to identify changes in vocal pitch and/or vocal tone); 
performing clustering of the audio feature parameters; and inputting the clustered audio feature parameters into a pre-established sound classification model to obtain the at least one sound-based target state data (The sound-based target state data may include the locational cues and/or emotional cues and/or contextual features relating to the vocal pitch/tone shifting. 
Nauseef teaches at Paragraph 0122 that the first user may be presented with contextual sound clips and/or acoustic filters that the first user may apply to the conversation and at Paragraph 0056 that the features unit 322 may utilize the numerical values of identified emotional cues to identify one or more contextual features (icons, audio clips) to be presented to a user device. 
Nauseef teaches at Paragraph 0053 that recognition 320 may be utilized for identifying vocal inflections of users…the gesture analysis unit 320 may analyze vocal inflection identified by the facial/vocal recognition unit 318 to identify emotional cues of users….emotional cues may include….tongue movements, teeth movements, vocal pitch shifting, vocal tone shifting, changes in word delivery speed; at Paragraph 0099 that the facial/vocal recognition unit 318 may analyze any captured audio of each user to identify changes in vocal pitch and/or vocal tone…..to determine whether the user is laughing, crying, yelling, screaming, using sarcasm, and/or is otherwise displaying a particular emotion…contextual features may be presented to various users at relevant times during a conversation; and at Paragraph 0121 that audio data such as pitch and cadence may be analyzed by the facial/vocal recognition unit 318 and/or the features unit 322 to discern emotions and other contextual information). 
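For illustration only, the sequence recited in claim 4 (extracting audio feature parameters, clustering them, and inputting the clustered parameters into a pre-established classification model) may be sketched as follows. The sketch is not drawn from Nauseef, Sahn, or Applicant’s specification. The feature choices (RMS energy and zero-crossing rate), the plain k-means routine, the toy `classify` rule, its `energy_cutoff` parameter, and the labels "excited" and "calm" are all hypothetical assumptions chosen to keep the example self-contained.

```python
import math
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def frame_features(samples: Sequence[float], frame_len: int) -> List[Vector]:
    """Extract per-frame (RMS energy, zero-crossing rate) pairs.

    These two features are assumed stand-ins for real audio feature parameters.
    """
    feats: List[Vector] = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        feats.append((rms, crossings / (frame_len - 1)))
    return feats


def kmeans(points: List[Vector], k: int, iters: int = 10) -> List[Vector]:
    """Plain k-means with deterministic, evenly spaced initial centroids.

    Requires len(points) >= k. Returns the centroids in sorted order.
    """
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    for _ in range(iters):
        groups: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return sorted(centroids)


def classify(centroids: List[Vector], energy_cutoff: float = 0.5) -> Tuple[str, float]:
    """Hypothetical stand-in for a pre-established sound classification model.

    Labels the audio by the energy of its loudest cluster and reports a crude
    confidence in [0, 1] based on the distance from the cutoff.
    """
    peak = max(c[0] for c in centroids)
    label = "excited" if peak > energy_cutoff else "calm"
    confidence = min(1.0, abs(peak - energy_cutoff) / energy_cutoff)
    return label, confidence
```

Under these assumptions, a call such as `classify(kmeans(frame_features(samples, 8), 2))` yields a label together with a confidence value, which corresponds in form to a judgment result and its confidence degree.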
Re Claim 5: 
Claim 5 encompasses the same scope of invention as claim 2 except for the additional claim limitation wherein the image-based target state data includes first state data including a first judgment result and a first confidence degree and the sound-based target state data includes second state data including a second judgment result and a second confidence degree, and wherein fusing the image-based target state data and the sound-based target state data of the same type to obtain the target portrait data includes: comparing whether the first judgment result is identical with the second judgment result; when the comparison result indicates the first judgment result is identical to the second judgment result, detecting whether the sum of the first confidence degree and the second confidence degree is greater than a first confidence threshold, and when the sum of the first confidence degree and the second confidence degree is greater than the first confidence threshold, determining the first judgment result or the second judgment result as the target portrait data; and when the comparison result indicates the first judgment result is different from the second judgment result, detecting whether a greater one of the first confidence degree and the second confidence degree is greater than a second confidence threshold, and when the greater one of the first confidence degree and the second confidence degree is greater than the second confidence threshold, determining the judgment result corresponding to the greater one of the first confidence degree and the second confidence degree as the target portrait data. 
Nauseef further teaches the claim limitation wherein the image-based target state data includes first state data including a first judgment result and a first confidence degree and the sound-based target state data includes second state data including a second judgment result and a second confidence degree, and wherein fusing the image-based target state data and the sound-based target state data of the same type to obtain the target portrait data includes: comparing whether the first judgment result is identical with the second judgment result (Nauseef teaches at Paragraph 0114 that the features unit may identify or select contextual features relating to a dominant content which may be perceived as more relevant or likely to contribute to the conversation based on a relevance score, and at Paragraph 0118 that the features unit 322 may be configured to only select contextual features whose relevance score meets a predetermined threshold value. It is noted that a first relevance score meeting the predetermined threshold value is equal to a second relevance score meeting the predetermined threshold value. 
Nauseef teaches at Paragraph 0053 that the gestures analysis unit 320 may analyze changes in facial features and/or vocal inflection identified by the facial/vocal recognition unit 318 to identify emotional cues of users….emotional cues may include…vocal pitch shifting, vocal tone shifting and changes in word delivery speed and at Paragraph 0054 that the gesture analysis unit 320 may quantify identified emotional cues and/or intensity of identified emotional cues by assigning a numerical value to each identified emotional cue. 
Sahn teaches at Paragraph 0232-0233 that if the emotional state of the individual, based upon video analysis alone, suggested two or more potential emotional states, the audio-derived emotional cues may be used to promote or demote the various options to identify a most likely emotional state candidate. Where the audio-derived analysis does not promote or demote the various options, the judgment results of the audio-derived analysis are the same as the judgment results of the video analysis); when the comparison result indicates the first judgment result is identical to the second judgment result (Nauseef teaches at Paragraph 0118 that the features unit 322 may be configured to only select contextual features whose relevance score meets a predetermined threshold value. It is noted that a first relevance score meeting the predetermined threshold value is equal to a second relevance score meeting the predetermined threshold value. 
Nauseef teaches at Paragraph 0053 that the gestures analysis unit 320 may analyze changes in facial features and/or vocal inflection identified by the facial/vocal recognition unit 318 to identify emotional cues of users….emotional cues may include…vocal pitch shifting, vocal tone shifting and changes in word delivery speed and at Paragraph 0054 that the gesture analysis unit 320 may quantify identified emotional cues and/or intensity of identified emotional cues by assigning a numerical value to each identified emotional cue and at Paragraph 0056 that the features unit 322 may utilize the numerical values of identified emotional cues to identify one or more contextual descriptions of emotions to be presented to a user device. It is noted that if the numerical value for the emotional cue relating the vocal pitch shifting or vocal tone shifting is equal to the numerical value for the emotional cue relating to a tongue movement, teeth movement, cheek movements, forehead movements, the first confidence score is the same as the second confidence score. 
Sahn teaches at Paragraph 0232-0233 that if the emotional state of the individual, based upon video analysis alone, suggested two or more potential emotional states, the audio-derived emotional cues may be used to promote or demote the various options to identify a most likely emotional state candidate. Where the audio-derived analysis does not promote or demote the various options, the judgment results of the audio-derived analysis are the same as the judgment results of the video analysis); detecting whether the sum of the first confidence degree and the second confidence degree is greater than a first confidence threshold; and when the sum of the first confidence degree and the second confidence degree is greater than the first confidence threshold, determining the first judgment result or the second judgment result as the target portrait data (Nauseef teaches weighting the numerical values associated with the emotional cues relating to the facial features and/or the vocal pitch/tone features to determine the contextual icons to be overlaid, which involves a weighted summation of the confidence/numerical values; if a whole match is determined to meet or exceed a predetermined threshold of confidence, the weighted numerical confidence value is larger than the predetermined threshold of confidence. Nauseef teaches at Paragraph 0056 that the features unit 322 may utilize (a combination of) the numerical values of identified emotional cues….to identify one or more contextual features to be presented. 
Nauseef teaches at Paragraph 0117 that emotional cues….or identified contextual features may be prioritized by the features unit 322 and at Paragraph 0118 that the features unit 322 may generate a relevance score associated with each identified emotional cue or contextual feature, that the relevance score may communicate how strongly or intensely an emotion was sensed and/or perceived, and that, alternatively, the features unit 322 may be configured to only select contextual features whose relevance score meets and/or exceeds a predetermined threshold value. 
Nauseef teaches at Paragraph 0053 that the gestures analysis unit 320 may analyze changes in facial features and/or vocal inflection identified by the facial/vocal recognition unit 318 to identify emotional cues of users….emotional cues may include…vocal pitch shifting, vocal tone shifting and changes in word delivery speed and at Paragraph 0054 that the gesture analysis unit 320 may quantify identified emotional cues and/or intensity of identified emotional cues by assigning a numerical value to each identified emotional cue and at Paragraph 0056 that the features unit 322 may utilize the numerical values of identified emotional cues to identify one or more contextual descriptions of emotions to be presented to a user device and at Paragraph 0054 the gesture analysis unit 320 may assign a larger weight to an identified emotional cue lasting one minute than an identified emotional cue lasting thirty seconds and at Paragraph 0117 that the features unit 322 may update its numerical valuing and/or weighting techniques based on popularity, frequency of use, and/or other factors associated with predetermined emotions, gestures, facial features…locational cues, emotional cues and at Paragraph 0102 that a numerical value associated with an identified large smile gesture might be weighted by the gesture analysis unit 320 and/or the features unit 322 more heavily than a numerical value associated with an identified small smirk gesture. 
It is noted that if the numerical value for the emotional cue relating the vocal pitch shifting or vocal tone shifting is equal to the numerical value for the emotional cue relating to a tongue movement, teeth movement, cheek movements, forehead movements, the first confidence score is the same as the second confidence score. 
Sahn teaches at Paragraph 0232-0233 that if the emotional state of the individual, based upon video analysis alone, suggested two or more potential emotional states, the audio-derived emotional cues may be used to promote or demote the various options to identify a most likely emotional state candidate…audio-derived emotional cues may be used as a primary reference to determine the emotional state of at least one individual. Accordingly, the emotional states of the individual include the primary reference and the secondary reference, which are derived based on a combination of the audio-derived analysis and the video analysis); and when the comparison result indicates the first judgment result is different from the second judgment result, detecting whether a greater one of the first confidence degree and the second confidence degree is greater than a second confidence threshold; and when the greater one of the first confidence degree and the second confidence degree is greater than the second confidence threshold, determining the judgment result corresponding to the greater one of the first confidence degree and the second confidence degree as the target portrait data (Nauseef teaches at Paragraph 0114 that the features unit may identify or select contextual features relating to a dominant content which may be perceived as more relevant or likely to contribute to the conversation based on a relevance score. 
Nauseef teaches at Paragraph 0053 that recognition 320 may be utilized for identifying vocal inflections of users…the gesture analysis unit 320 may analyze vocal inflection identified by the facial/vocal recognition unit 318 to identify emotional cues of users….emotional cues may include….tongue movements, teeth movements, vocal pitch shifting, vocal tone shifting, changes in word delivery speed; at Paragraph 0099 that the facial/vocal recognition unit 318 may analyze any captured audio of each user to identify changes in vocal pitch and/or vocal tone…..to determine whether the user is laughing, crying, yelling, screaming, using sarcasm, and/or is otherwise displaying a particular emotion…contextual features may be presented to various users at relevant times during a conversation; and at Paragraph 0121 that audio data such as pitch and cadence may be analyzed by the facial/vocal recognition unit 318 and/or the features unit 322 to discern emotions and other contextual information. 
Nauseef teaches at Paragraph 0089 and Paragraph 0118 that the facial/vocal recognition unit 318 may identify objects of interest and/or emotional cues in the image based on a comparison of pixel color values and/or locations in the image…..the facial/vocal recognition unit 318 may determine at least a partial match, e.g., a partial match that meets and/or exceeds a predetermined threshold of confidence between an identified object of interest and a known facial feature to thereby confirm that the object of interest in the image is indeed a facial feature of the user. Nauseef further teaches at Paragraph 0118 that the features unit 322 may generate a relevance score associated with each identified emotional cue, locational cue and/or contextual feature. The relevance score may correspond to a level of confidence that each identified emotional cue, locational cue and/or contextual feature is indeed relevant to a conversation enabled by the video communication connection…the features unit 322 may be configured to only select contextual features whose relevance score meets and/or exceeds a predetermined threshold value. 
It is noted that when the matched emotion cue is identified based on the changes in the vocal tone and/or vocal pitch, the relevance score of the vocal feature is higher than the relevance score of other facial features. 
Sahn teaches at Paragraph 0232-0233 that if the emotional state of the individual based upon video analysis alone, suggested two or more potential emotional states, the audio-derived emotional cues may be used to promote or demote the various options to identify a most likely emotional state candidate…audio-derived emotional cues may be used as a sole reference to determine the emotional state of at least one individual. Accordingly, the second confidence threshold for the audio-derived emotional cue is larger than the first confidence threshold for the video analyzed emotional cues). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to have combined the teachings of Sahn and Nauseef so as to augment an individual’s video data and audio data with visual cues, such as emotional icons, and audible feedback cues that indicate the emotional states of the individual detected based on the video data and the audio data of the individual. One of ordinary skill in the art would have been motivated to present visual/audio cues to the individual in order to identify a point at which the individual should pause or emphasize a word while presenting a conversation snippet or speech fed to the individual (Sahn Paragraph 0147). 
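For illustration only, the two-branch fusion logic recited in claim 5 may be sketched as follows. The sketch restates the claim language and is not a characterization of Nauseef, Sahn, or Applicant’s implementation. The names `StateData` and `fuse`, the threshold values used in the usage note, and the choices to return `None` when neither condition is met and to prefer the image-based result on a confidence tie are hypothetical assumptions, since the claim as recited does not address those cases.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StateData:
    """One modality's judgment result together with its confidence degree."""
    judgment: str      # e.g. "happy"
    confidence: float  # assumed to lie in [0, 1]


def fuse(image: StateData, sound: StateData,
         first_threshold: float, second_threshold: float) -> Optional[str]:
    """Fuse image-based and sound-based state data of the same type.

    Returns the target portrait data, or None (an assumption) when the
    applicable threshold is not exceeded.
    """
    if image.judgment == sound.judgment:
        # Identical results: the SUM of the confidences is compared
        # with the first confidence threshold.
        if image.confidence + sound.confidence > first_threshold:
            return image.judgment
        return None
    # Differing results: the GREATER confidence is compared with the
    # second confidence threshold; ties favor the image (an assumption).
    best = image if image.confidence >= sound.confidence else sound
    if best.confidence > second_threshold:
        return best.judgment
    return None
```

With hypothetical thresholds of 0.6 and 0.8, `fuse(StateData("happy", 0.4), StateData("happy", 0.3), 0.6, 0.8)` takes the agreement branch, while `fuse(StateData("happy", 0.9), StateData("sad", 0.5), 0.6, 0.8)` takes the disagreement branch and selects the judgment result with the greater confidence degree.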

Re Claim 6: 
Claim 6 encompasses the same scope of invention as claim 5 except for the additional claim limitation that the second confidence threshold is greater than the first confidence threshold. 
Nauseef and Sahn further teach the claim limitation that the second confidence threshold is greater than the first confidence threshold (Sahn teaches at Paragraph 0232-0233 that if the emotional state of the individual based upon video analysis alone, suggested two or more potential emotional states, the audio-derived emotional cues may be used to promote or demote the various options to identify a most likely emotional state candidate…audio-derived emotional cues may be used as a sole reference to determine the emotional state of at least one individual. Accordingly, the second confidence threshold for the audio-derived emotional cue is larger than the first confidence threshold for the video analyzed emotional cues. 
Nauseef teaches at Paragraph 0114 that the feature unit may identify or select contextual features relating to a dominant content which may be perceived as more relevant or likely to contribute to the conversation based on a relevance score. 
Nauseef teaches at Paragraph 0053 that recognition 320 may be utilized for identifying vocal inflections of users…the gesture analysis unit 320 may analyze vocal inflection identified by the facial/vocal recognition unit 318 to identify emotional cues of users….emotional cues may include….tongue movements, teeth movements, vocal pitch shifting, vocal tone shifting, changes in word delivery speed. Nauseef teaches at Paragraph 0099 that the facial/vocal recognition unit 318 may analyze any captured audio of each user to identify changes in vocal pitch and/or vocal tone…..to determine whether the user is laughing, crying, yelling, screaming, using sarcasm, and/or is otherwise displaying a particular emotion…contextual features may be presented to various users at relevant times during a conversation. Nauseef teaches at Paragraph 0121 that audio data such as pitch and cadence may be analyzed by the facial/vocal recognition unit 318 and/or the features unit 322 to discern emotions and other contextual information. 
Nauseef teaches at Paragraph 0089 and Paragraph 0118 that the facial/vocal recognition unit 318 may identify objects of interest and/or emotional cues in the image based on a comparison of pixel color values and/or locations in the image…..the facial/vocal recognition unit 318 may determine at least a partial match, e.g., a partial match that meets and/or exceeds a predetermined threshold of confidence between an identified object of interest and a known facial feature to thereby confirm that the object of interest in the image is indeed a facial feature of the user. Nauseef further teaches at Paragraph 0118 that the features unit 322 may generate a relevance score associated with each identified emotional cue, locational cue and/or contextual feature. The relevance score may correspond to a level of confidence that each identified emotional cue, locational cue and/or contextual feature is indeed relevant to a conversation enabled by the video communication connection…the features unit 322 may be configured to only select contextual features whose relevance score meets and/or exceeds a predetermined threshold value. 
It is noted that when the matched emotion cue is identified based on the changes in the vocal tone and/or vocal pitch, the relevance score of the vocal feature is higher than the relevance score of other facial features). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to have combined the teachings of Sahn and Nauseef to augment an individual’s video data and audio data with visual cues, such as emotional icons, and audible feedback cues indicating the emotional states of the individual detected based on the video data and the audio data of the individual. One of ordinary skill in the art would have been motivated to present visual/audio cues of the individual to identify a point at which the individual should pause or emphasize a word while presenting a conversation snippet or speech fed to the individual (Sahn Paragraph 0147). 
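For illustration only, the threshold relationship addressed in the claim 6 may be sketched as follows. This is a minimal sketch under stated assumptions and is not drawn from Nauseef, Sahn, or the application: the names Cue and select_cues and the numeric threshold values are hypothetical. It shows a cue being accepted only when its confidence meets or exceeds the threshold for its source, with the second (audio) threshold required to be greater than the first (image) threshold.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """A detected emotional cue (hypothetical structure, for illustration)."""
    source: str        # "image" or "audio"
    label: str         # e.g. "happy"
    confidence: float  # score in [0, 1]


# Assumed example values; neither reference states specific numbers.
FIRST_THRESHOLD = 0.6   # applied to image-based cues
SECOND_THRESHOLD = 0.8  # applied to audio-based cues


def select_cues(cues, first=FIRST_THRESHOLD, second=SECOND_THRESHOLD):
    """Keep cues whose confidence meets or exceeds the threshold for their source."""
    if second <= first:
        raise ValueError("second threshold must be greater than first threshold")
    selected = []
    for cue in cues:
        threshold = first if cue.source == "image" else second
        if cue.confidence >= threshold:
            selected.append(cue)
    return selected
```

Under these assumed values, an image-based cue and an audio-based cue with the same confidence of 0.65 are treated differently: only the image-based cue is selected.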

Re Claim 7: 
The claim 7 encompasses the same scope of invention as that of the claim 1 except for the additional claim limitation that the virtual information includes at least one of visual information, acoustic information and effect information.
Nauseef further teaches the claim limitation that the virtual information includes at least one of visual information, acoustic information and effect information (Nauseef teaches at Paragraph 0033 that analyzing the live video and/or audio feeds may enable the server to detect facial features of each user 102, 104 as well as any speech characteristics of each user’s speech. The facial features and/or speech characteristics identified during analysis of the video communication connection may be used to identify emotional cues such as facial gestures or vocal inflections of each user 102, 104 that are associated with predetermined emotions and at Paragraph 0109 that when a user smiles, an image of a dinosaur that has been overlaid on the image of the user in the live video feed of the user may smile as well using the user’s detected smile as a reference….a smiley face icon may follow the movements of a user’s face in the live video feed so that when a user moves his head within the frame of the live video feed, the smiley face icon stays overlain on the user’s face).
Re Claim 17: 
The claim 17 is in parallel with the claim 1 in the form of an apparatus claim. The claim 17 is subject to the same rationale of rejection as the claim 1. 
The claim 17 further recites an electronic apparatus, comprising: one or more processors; and a storage device for storing one or more programs; wherein the one or more processors are configured, via execution of the one or more programs, to [perform the method of the claim 1]. 
However, Sahn further teaches the claim limitation of an electronic apparatus, comprising: one or more processors; and a storage device for storing one or more programs; wherein the one or more processors are configured, via execution of the one or more programs, to [perform the method of the claim 1] (Sahn Paragraph 0275-0277). 
Re Claim 18: 
The claim 18 is in parallel with the claim 2 in the form of an apparatus claim. The claim 18 is subject to the same rationale of rejection as the claim 2. 
Re Claim 20: 

Re Claim 21: 
The claim 21 is in parallel with the claim 5 in the form of an apparatus claim. The claim 21 is subject to the same rationale of rejection as the claim 5. 
Re Claim 22: 
The claim 22 is in parallel with the claim 6 in the form of an apparatus claim. The claim 22 is subject to the same rationale of rejection as the claim 6. 
Re Claim 23: 
The claim 23 is in parallel with the claim 7 in the form of an apparatus claim. The claim 23 is subject to the same rationale of rejection as the claim 7. 

Re Claim 24: 
The claim 24 is in parallel with the claim 1 in the form of a computer readable storage medium claim. The claim 24 is subject to the same rationale of rejection as the claim 1. 
The claim 24 further recites a computer readable storage medium having a computer program stored thereon executable by a processor to perform a set of functions, the set of functions [of the method of the claim 1]. 
However, Sahn further teaches the claim limitation of a computer readable storage medium having a computer program stored thereon executable by a processor to perform a set of functions, the set of functions [of the method of the claim 1] (Sahn Paragraph 0275-0277). 
Re Claim 25: 

Re Claim 27: 
The claim 27 is in parallel with the claim 4 in the form of a computer readable storage medium claim. The claim 27 is subject to the same rationale of rejection as the claim 4. 
Re Claim 28: 
The claim 28 is in parallel with the claim 5 in the form of a computer readable storage medium claim. The claim 28 is subject to the same rationale of rejection as the claim 5. 
Re Claim 29: 
The claim 29 is in parallel with the claim 6 in the form of a computer readable storage medium claim. The claim 29 is subject to the same rationale of rejection as the claim 6. 

Claims 3, 19 and 26 are rejected under 35 U.S.C. 103 as being unpatentable over Nauseef et al. US-PGPUB No. 2016/0191958 (hereinafter Nauseef) in view of Sahn US-PGPUB No. 2015/0099946 (hereinafter Sahn) and Wexler et al. US-PGPUB No. 2017/0061200 (hereinafter Wexler).  

Re Claim 3: 
The claim 3 encompasses the same scope of invention as that of the claim 1 except for the additional claim limitation that the real image information includes facial image information of the target; and using the real image information to determine the at least one image-based target state data includes: determining position information of a plurality of critical points from the facial image information; performing tilt correction on the facial image information using the 
With the exception of “tilt correction”, Nauseef teaches the claim limitation that the real image information includes facial image information of the target; and using the real image information to determine the at least one image-based target state data includes: determining position information of a plurality of critical points from the facial image information (Nauseef teaches at Paragraph 0104 that a live video feed and/or a live audio feed….may be analyzed for particular locational cues such as landmarks…speech accents, dialects…the facial/vocal recognition unit 318 may identify one or more locational cues included in the live video feed of the user and determine at least a partial match between identified objects of interest, e.g., location cues, and predetermined landmarks…., accents, associated with a known location); 
Nauseef at least implicitly teaches tilt adjustment of the facial image information. Nauseef teaches at Paragraph 0109 that when a user smiles, an image of a dinosaur that has been overlaid on the image of the user in the live video feed of the user may smile as well using the user’s detected smile as a reference….a smiley face icon may follow the movements of a user’s face in the live video feed so that when a user moves his head within the frame of the live video feed, the smiley face icon stays overlain on the user’s face. Nauseef teaches at Paragraph 0089 and Paragraph 0118 that the facial/vocal recognition unit 318 may identify objects of interest and/or emotional cues in the image based on a comparison of pixel color values and/or locations in the image…..the facial/vocal recognition unit 318 may determine at least a partial match, e.g., a partial match that meets and/or exceeds a predetermined threshold of confidence between an identified object of interest and a known facial feature to thereby confirm that the object of interest in the image is indeed a facial feature of the user. 
Wexler teaches the claim limitation: 
performing tilt correction on the facial image information using the position information; extracting a plurality of facial feature values in the corrected facial image information (Wexler teaches at Paragraph 0283 that the mood may be classified into emotional states and features of a facial expression may be used to determine the mood and at Paragraph 0287 that monitoring module 603 may interact with orientation identification module 601 and orientation adjustment module 602 to track the movement of the facial features of at least one person and to maintain those features in the field-of-view of image sensor 220). 
It would have been obvious to one of ordinary skill in the art to have adjusted the tilt or orientation of the face or the head before the face recognition module to have accurately detected facial features of the face. One of ordinary skill in the art would have been motivated to have provided facial feature detection.  
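For illustration only, one common way of performing tilt correction from critical-point positions may be sketched as follows. This is a minimal sketch under stated assumptions and is not drawn from Wexler, Nauseef, or the application: the function names are hypothetical, and the use of the two eye positions as the reference points is an assumption. The in-plane tilt is estimated from the line joining the eyes, and every critical point is rotated about the eye midpoint so that the eye line becomes horizontal.

```python
import math


def tilt_angle(left_eye, right_eye):
    """Return the in-plane tilt, in radians, of the line joining the two eyes."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.atan2(dy, dx)


def correct_tilt(points, left_eye, right_eye):
    """Rotate (x, y) points about the eye midpoint so the eye line is horizontal."""
    angle = tilt_angle(left_eye, right_eye)
    cx = (left_eye[0] + right_eye[0]) / 2.0
    cy = (left_eye[1] + right_eye[1]) / 2.0
    # Rotating by the negative of the measured tilt undoes it.
    cos_a = math.cos(-angle)
    sin_a = math.sin(-angle)
    corrected = []
    for x, y in points:
        tx, ty = x - cx, y - cy
        corrected.append((cx + tx * cos_a - ty * sin_a,
                          cy + tx * sin_a + ty * cos_a))
    return corrected
```

Facial feature values would then be extracted from the corrected points rather than from the tilted ones, so that the same expression yields comparable values regardless of head roll.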
Re Claim 19: 
The claim 19 is in parallel with the claim 3 in the form of an apparatus claim. The claim 19 is subject to the same rationale of rejection as the claim 3. 
Re Claim 26: 
The claim 26 is in parallel with the claim 3 in the form of a computer readable storage medium claim. The claim 26 is subject to the same rationale of rejection as the claim 3. 

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JIN CHENG WANG whose telephone number is (571)272-7665.  The examiner can normally be reached on Mon-Fri 8:00-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Xiao Wu, can be reached at 571-272-7761.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.