DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Reasons for Allowance
The following is an examiner’s statement of reasons for allowance: Claims 1, 3, 4, 7, 17, 19, 20, 23, 24, 26 and 27 of the claimed invention filed 9/29/2021 are allowed for the reasons set forth below. 

Re Claim 1: 
Nauseef et al. US-PGPUB No. 2016/0191958 (hereinafter Nauseef) teaches an augmented reality method comprising: 
acquiring video information of a target (Nauseef teaches at FIG. 1 and Paragraph 0032 that each of the first user 102 and the second user 104 may hold a user device in front of his or her face so that a camera 110, 112 included in each respective user device 106, 108 may capture a live video feed of each user’s face); and 
acquiring real image information of the target and real sound information of the target from the video information (Nauseef teaches at FIG. 1 and Paragraph 0032 that each of the first user 102 and the second user 104 may hold a user device in front of his or her face so that a camera 110, 112 included in each respective user device 106, 108 may capture a live video feed of each user’s face. Audio of each user may also be captured by a microphone); 
Nauseef teaches at Paragraph 0033 that analyzing the live video and/or audio feeds may enable the server to detect facial features of each user 102, 104 as well as any speech characteristics of each user’s speech. The facial features and/or speech characteristics identified during analysis of the video communication connection may be used to identify emotional cues such as facial gestures or vocal inflections of each user 102, 104 that are associated with predetermined emotions); 
using the real sound information to determine at least one sound-based target state data corresponding to each of the plurality of dimensions (The sound-based target state data may include the locational cues and/or emotional cues and/or contextual features relating to the vocal pitch/tone shifting. 
Nauseef teaches at Paragraph 0122 that the first user may be presented with contextual sound clips and/or acoustic filters that the first user may apply to the conversation and at Paragraph 0056 that the features unit 322 may utilize the numerical values of identified emotional cues to identify one or more contextual features (icons, audio clips) to be presented to a user device. 
Nauseef teaches at Paragraph 0053 that recognition 320 may be utilized for identifying vocal inflections of users…the gesture analysis unit 320 may analyze vocal inflection identified by the facial/vocal recognition unit 318 to identify emotional cues of users…emotional cues may include…tongue movements, teeth movements, vocal pitch shifting, vocal tone shifting, changes in word delivery speed and at Paragraph 0099 that the facial/vocal recognition unit 318 may analyze any captured audio of each user to identify changes in vocal pitch and/or vocal tone…to determine whether the user is laughing, crying, yelling, screaming, using sarcasm, and/or is otherwise displaying a particular emotion…contextual features may be presented to various users at relevant times during a conversation and at Paragraph 0121 that audio data such as pitch and cadence may be analyzed by the facial/vocal recognition unit 318 and/or the features unit 322 to discern emotions and other contextual information);  
acquiring virtual information corresponding to the target portrait data (Nauseef teaches at Paragraph 0033 that analyzing the live video and/or audio feeds may enable the server to detect facial features of each user 102, 104 as well as any speech characteristics of each user’s speech. The facial features and/or speech characteristics identified during analysis of the video communication connection may be used to identify emotional cues such as facial gestures or vocal inflections of each user 102, 104 that are associated with predetermined emotions); and 
superimposing the virtual information on the video information (Nauseef teaches at Paragraph 0033 that analyzing the live video and/or audio feeds may enable the server to detect facial features of each user 102, 104 as well as any speech characteristics of each user’s speech. The facial features and/or speech characteristics identified during analysis of the video communication connection may be used to identify emotional cues such as facial gestures or vocal inflections of each user 102, 104 that are associated with predetermined emotions and at Paragraph 0109 that when a user smiles, an image of a dinosaur that has been overlaid on the image of the user in the live video feed of the user may smile as well using the user’s detected smile as a reference….a smiley face icon may follow the movements of a user’s face in the live video feed so that when a user moves his head within the frame of the live video feed, the smiley face icon stays overlain on the user’s face).  
Nauseef implicitly teaches the claim limitation: 

Nauseef teaches fusing the image-based target state data (such as a smile emotion) and sound-based target state data (a laughing emotion in user’s speech) of the same contextual feature that conveys happiness. 
Nauseef teaches separately identifying a laughing speech based on changes in vocal pitch and/or vocal tone or keywords in a user’s speech based on the vocal recognition techniques of the facial/vocal recognition unit 318 (see Paragraph 0099: the facial/vocal recognition unit 318 may analyze any captured audio of the user…to determine whether that user is laughing…or is otherwise displaying a particular emotion. It is noted that the user’s speech is analyzed to determine a laughing speech) and separately identifying happy emotion based on the gesture analysis unit 320 (Paragraph 0101: the gesture analysis unit 320 may determine an amount of movement of one or more facial features based on pixel location of identified facial features…based on determining that both corners of the user’s lips moved upwards in relation to other identified facial features, the gesture analysis unit 320 may determine that the user is smiling; and Paragraph 0102: an identified smile gesture may be assigned a positive numerical value). Accordingly, an identified smile gesture (positive emotion) using the gesture analysis unit 320 has the same dimension as the identified laughing emotion in user’s speech (positive emotion) so that the features unit 322 may then identify and/or select one or more features relevant to the location cue in the content storage unit 334 for presentation to the user (Paragraph 0105) and a relevant contextual feature to the user emotion can be placed in the live video feed (Paragraphs 0104-0109: “when a user smiles, an image of a dinosaur that has been overlaid on the image of the user in the live video feed of the user may smile as well…a smiley face icon may follow the movements of a user’s face in live video feed”). 
Nauseef teaches at Paragraph 0109 a smiley face icon may follow the movements of a user’s face in the live video feed (which includes the user’s identified laughing speech). Nauseef teaches at Paragraph 0034 that these detected emotional cues (e.g., smile) convey happiness and at Paragraph 0035 that the server may identify in the database a set of images that are associated with positive, happy emotions and at Paragraph 0099 identifying objects of interest such as changes in vocal pitch and/or vocal tone or keywords in a user’s speech in this manner may enable the facial/vocal recognition unit 318 to determine whether the user is laughing, crying, yelling, screaming, using sarcasm and/or is otherwise displaying a particular emotion and at Paragraph 0102 that an identified smile gesture may be assigned a positive numerical value and at Paragraph 0108 that the relevant contextual features may be presented to the user in a toolbar, a menu and/or other portion of a user interface. Selecting a contextual feature for incorporation into the video communication may include overlaying a live video feed and/or a live audio feed with an image, text, an icon, an audio clip and at Paragraph 0109 that a smiley face icon may follow the movement of a user’s face in the live video feed. 
Nauseef teaches at Paragraph 0122 that the first user may be presented with contextual sound clips and/or acoustic filters that the first user may apply to the conversation and at Paragraph 0056 that the features unit 322 may utilize the numerical values of identified emotional cues to identify one or more contextual features (icons, audio clips) to be presented to a user device. 
Nauseef teaches at Paragraph 0049 that contextual features may include icons, emotions, images, text, audio samples and/or video clips associated with one or more predetermined emotions and at Paragraph 0108 the relevant contextual features may be presented to the user….selecting a contextual feature for incorporation may include overlaying a live video feed and/or a live audio feed with an image, text, an icon, an audio clip and the like…selecting a contextual feature for incorporation….may further include masking and/or modifying a live audio feed of a user by modulating the user’s voice…..augmenting a background image of the live video feed with a pattern with an image of a particular setting and at Paragraph 0109 a smiley face icon may follow the movements of a user’s face in the live video feed.  
Nauseef teaches at Paragraph 0038 that based on detection of a first user 102’s smile and raised eyebrows, the server may provide to the first user device 106 a set of contextual features 118 associated with happiness, such as smiley face icons, a party hat, and/or the like. The first user 102 may then select one or more of the provided contextual features 118 to overlay the first user 102’s face in the video communication connection to enhance the happy emotions currently being experienced by the first user 102 and at Paragraph 0053 that emotional cues may include vocal pitch shifting and/or vocal tone shifting and at Paragraph 0057 that the features unit 322 may identify one or more contextual features, e.g., icons, audio samples stored in the content storage unit 334 to be presented to a second user. The user may then select one or more of the contextual features such as a smiley face icon for overlay into the video communication connection. Accordingly, the audio samples and icons are fused to obtain target portrait data). 
Sahin US-PGPUB No. 2015/0099946 (hereinafter Sahin) teaches the claim limitation: Fusing, for each dimension of the plurality of dimensions, the image-based target state data corresponding to the dimension and the sound-based target state data corresponding to the same dimension to obtain target portrait data (
Sahin teaches at Paragraph 0198 that visual mechanisms may be used to trigger a desired response from the individual….a funny sound may be played to invoke a smile or giggle (a positive emotion) from the individual in response to a socially relevant event that normally invokes pleasure and at Paragraph 0210 positive reinforcement feedback may include an enjoyable or celebratory sound such as a fanfare, cheering or happy music. Verbal positive feedback such as the words success, hooray, good job, may be visually presented to the user and at Paragraph 0212 levels of pleasure (a numerical value of happiness) with the currently presented feedback may be derived from reviewing a subject-pointing video recording to review relative pupil dilation, eye moistness or eyebrow position. Further, levels of pleasure may be derived from reviewing subject physiological data and at Paragraph 0234 that a simplified cartoon icon representing an emotional state such as happy may supplant the individual’s face in the heads-up display.
It is understood that the pleasure/happy emotional state identified in Paragraph 0212 based on the subject video recording has the same type or the same dimension as the laughing emotional state identified based on the audio data at Paragraph 0231. 
Sahin teaches at FIGS. 10A-10B and Paragraphs 0231-0234 that the audio may be reviewed for tone, volume, pitch, patterns in pitch, e.g., sing-song, questioning, etc., vocal tremors, sobbing, hiccupping, laughing, giggling, snorting, sniffing, and other verbalizations and/or intonations that may be associated with emotional state. The emotional identification and training module may further identify one or more emotional words or phrases within the audio data…audio-derived emotional cues are applied to the identified emotional states to refine the emotional state of at least one individual…audio-derived emotional cues may be used to promote the various options to identify a most likely emotional state candidate…audio-derived emotional cues may be used as a primary reference…to determine the emotional state of at least one individual…a feedback algorithm may augment the video feed of a heads-up display of a data collection device to overlay a description of the emotional state of the individual…an icon 1028 representing the emotional state of the individual 1022, as well as a label 1029 (“happy”) are presented within the analysis pane 1026…a term or sentence for the emotional state may be presented audibly to the user such as mom is happy. Further, audio or video feedback may spell out to the user the particular response behavior to invoke such as an audible cue directing the subject to smile now or a visual cue including the text nod your head and look concerned…the user may be presented with verbal and/or audible warnings such as may bite or back away). 

The prior art references do not anticipate or suggest the new claim limitation of “wherein the image-based target state data includes at least one of emotion data, age data, and gender data, wherein the sound-based target state data includes at least one of emotion data, age data, and gender data, and wherein at least one of the image-based target state data and the sound-based target state data includes a judgment result and a confidence degree corresponding to the judgment result, wherein the image-based target state data includes first state data including a first judgment result and a first confidence degree and the sound-based target state data includes second state data including a second judgment result and a second confidence degree and wherein fusing the image-based target state data corresponding to the dimension and the sound-based target state data corresponding to the same dimension to obtain the target portrait data includes: comparing whether the first judgment result is identical with the second judgment result; when the comparison result indicates the first judgment result is identical to the second judgment result: detecting whether the sum of the first confidence degree and the .
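
For illustration only, the comparison recited in the quoted limitation may be sketched as follows. This is a minimal sketch under stated assumptions and is not the applicant's implementation: the names StateData and fuse_dimension, the sum_threshold parameter and its default value, and the handling of every branch after the confidence sum is examined are hypothetical, because the limitation as quoted above breaks off at that point.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StateData:
    """One judgment for a single dimension (e.g. emotion, age, or gender)."""
    judgment: str       # judgment result, e.g. "happy"
    confidence: float   # confidence degree, assumed here to lie in [0, 1]


def fuse_dimension(image: StateData, sound: StateData,
                   sum_threshold: float = 1.0) -> Optional[str]:
    """Fuse image-based and sound-based state data for one dimension.

    ASSUMPTION: the threshold test and the return convention are
    hypothetical. The quoted limitation only states that the two judgment
    results are compared and, when they are identical, that the sum of the
    confidence degrees is examined.
    """
    if image.judgment != sound.judgment:
        # The disagreement branch is not described in the quoted text;
        # this sketch simply reports that no fused result is available.
        return None
    if image.confidence + sound.confidence > sum_threshold:
        # Both modalities agree and their combined confidence is high enough.
        return image.judgment
    return None


# Hypothetical usage: a smile detected in the image and laughter in the audio.
fused = fuse_dimension(StateData("happy", 0.7), StateData("happy", 0.6))
```

In this sketch a fused judgment is produced only when both modalities agree and their combined confidence exceeds the assumed threshold; any other policy would be an equally valid reading of the truncated text.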

Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”

Conclusion

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Xiao Wu, can be reached at 571-272-7761. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/JIN CHENG WANG/Primary Examiner, Art Unit 2613