DETAILED ACTION

Notice of Pre-AIA or AIA Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority Acknowledgment
2.	Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119(a)-(d). The certified copy has been filed in Application 201811009994.8 on 08/31/2018 in the Chinese Patent Office.

Response to Arguments/Amendments
3. 	With respect to the claim rejections under 35 U.S.C. § 101, amended claim 20 overcomes the § 101 rejection. Thus, the rejection of claim 20 under 35 U.S.C. § 101 is withdrawn. 
	With respect to the 102/103 rejections, the Applicant has amended independent claims 1 and 10 by incorporating claims 2 or 4 and claims 11 or 13, respectively. Claims 2, 4, 11 and 13 were previously rejected under 35 U.S.C. 103 as being unpatentable over Negishi in view of Krupka et al.
	The Applicant argued on pages 3-4 of the Remarks that “according to the above disclosure, Krupka at most discloses that the voiceprint may be established based on a location where lip movement is detected but fails to disclose that a face image at the location where lip movement is detected is determined as a target face image. Thus, Krupa fails to disclose the step of determining a face image showing an opening and closing action of a lip as a target face image. Further, since Krupka fails to disclose the step of determining a face image showing an opening and closing action of a lip as a target face image, Krupka also fails to disclose the step of determining a portrait corresponding to the target face image as the speaking object corresponding to the voice information. Besides, the applicant submits that the paragraphs cited in the Office action (paragraphs [0028] and [0047]) at most disclose that the face identification machine in Krupka may be configured to determine an identity of each candidate face and the diarization machine in Krupka may output the person who is speaking, the time period when the speaker is speaking and the location where the speaker is speaking but fail to disclose the above method step recited in the distinguishing technical feature 1.”
	In response, the Examiner respectfully notes that Krupka et al. disclose a method for establishing a voiceprint of a human speaker. To do so, Krupka et al. first recognize at least one face image in the video information (Krupka et al. [0027] As shown in Fig. 4, face location machine 124 is configured to find candidate faces 166 in digital video 114. As an example, Fig. 4 shows face location machine 124 finding candidate FACE(1) at 23°, candidate FACE(2) at 178°, and candidate FACE(3) at 303°. The candidate faces 166 output by the face location machine 124 may include coordinates of a bounding box around a located face image, a portion of the digital image where the face was located, other location information (e.g., an angle such as 23°), and/or labels (e.g., “FACE(1)”)), second, determine a face image showing an opening and closing action of a lip as a target face image (Krupka et al. [0060] the system at some point may make a determination as to when to conduct sampling and/or, for a given amount of sampled audio, what portions of such audio are to be designated as sampled to be used for generating the voice print. Examples of such determination include (1) sampling and/or processing audio beginning at a point when the speaker lips are moving), and third, determine a portrait corresponding to the target face image as the speaking object corresponding to the voice information (Krupka et al. [0028] Face identification machine 164 optionally may be configured to determine an identity 168 of each candidate face 166 by analyzing just the portions of the digital video 114 where candidate faces 166 have been found, [0047] FIG. 7 is a visual representation of an example output of diarization machine 602. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking; the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking). 
	Krupka et al. select a sample of meeting audio according to a protocol, the sample representing an utterance made by one of the human speakers, and establish, based at least on the sample, a voiceprint of the human speaker. In order to establish the voiceprint of the human speaker based at least on the voice sample, Krupka et al. determine a speaking object corresponding to the voice sample. That is, Krupka et al. recognize faces in the video, determine which face shows lip opening and closing, and determine the speaking object corresponding to the selected voice sample. 
Claim 1 was amended to require one or the other of the limitations of original claims 2 and original claim 4, and so the previous rejection pertaining to claim 2 is sufficient to address the limitations of amended claim 1 (i.e. the original limitations of claim 1 and the original limitations of claim 2). The independent claims 1, 10 and 20 recite substantially the same concept but do so in the context of a method, an apparatus and a non-transitory readable storage medium.
For these reasons, Applicant’s arguments are not persuasive, and the Examiner respectfully disagrees.

Claim Rejections - 35 USC § 103
4.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.

5.	Claims 1, 3, 5-6, 8, 10, 12, 14-15, 17 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Negishi (US 2021/0150145 A1) in view of Krupka et al. (US 2019/0341055 A1).

 	With respect to Claim 1, Negishi discloses 
 	A sign language information processing method, comprising: 
 	obtaining voice information and video information collected by a user terminal in real time (Negishi [0040] the sensor section 110 may include a microphone that senses sound information, [0041] the sensor section 110 may include an image sensor that senses an image (still image or moving image)); 
 	determining, in the video information, a speaking object corresponding to the voice information (Negishi [0136] the device 100 first recognizes the context of the activity of a user (step S110). For example, the device 100 recognizes the context of activity by classifying the activity of a user into the three classes of gesturing, speaking, and the others on the basis of sensing information obtained by a camera); and 
 	superimposing and displaying an augmented reality (AR) sign language animation corresponding to the voice information on a gesture area corresponding to the speaking object to obtain a sign language video (Negishi [0123] in a case where the first action subject is a spoken language user and the second action subject is a sign language user, the device 100 may convert a message expressed by the first action subject using the spoken language into a moving image of a hand performing the sign language gesture corresponding to the message. The gesture is superimposed on the first action subject displayed on a transmissive display, and displayed in an AR manner. Watching the gesture performed by a hand, superimposed on the first action subject displayed on the transmissive display, and displayed in an AR manner allows the second action subject to recognize the message outputted by the first action subject. This allows the second action subject to recognize a message from the first action subject as if the first action subject actually made a remark in sign language, [0156] the second user 20 briefly speaks, to the first user 10, “Turn right at the second corner on this street.” The device 100 then performs speech recognition in real time, and superimposes and displays, in an AR manner, an arm 30 performing the corresponding sign language gesture on the second user 20 displayed on the see-through display), 
	Negishi fails to explicitly teach 
 	wherein the determining, in the video information, a speaking object corresponding to the voice information comprises: 
 	recognizing at least one face image in the video information; determining a face image showing an opening and closing action of a lip as a target face image; and determining a portrait corresponding to the target face image as the speaking object corresponding to the voice information, 
 	or, 
 	obtaining sound attribute information corresponding to the voice information; determining, in a pre-stored face set, a historical face image corresponding to the sound attribute information; searching, in the video information, for a target face image that matches the historical face image; and determining a portrait corresponding to the target face image as the speaking object corresponding to the voice information.  
	However, Krupka et al. teach 
 	wherein the determining, in the video information, a speaking object corresponding to the voice information comprises: 
 	recognizing at least one face image in the video information (Krupka et al. [0027] As shown in Fig. 4, face location machine 124 is configured to find candidate faces 166 in digital video 114. As an example, Fig. 4 shows face location machine 124 finding candidate FACE(1) at 23°, candidate FACE(2) at 178°, and candidate FACE(3) at 303°. The candidate faces 166 output by the face location machine 124 may include coordinates of a bounding box around a located face image, a portion of the digital image where the face was located, other location information (e.g., an angle such as 23°), and/or labels (e.g., “FACE(1)”)); determining a face image showing an opening and closing action of a lip as a target face image (Krupka et al. [0060] the system at some point may make a determination as to when to conduct sampling and/or, for a given amount of sampled audio, what portions of such audio are to be designated as sampled to be used for generating the voice print. Examples of such determination include (1) sampling and/or processing audio beginning at a point when the speaker lips are moving); and
determining a portrait corresponding to the target face image as the speaking object corresponding to the voice information (Krupka et al. [0028] Face identification machine 164 optionally may be configured to determine an identity 168 of each candidate face 166 by analyzing just the portions of the digital video 114 where candidate faces 166 have been found, [0047] FIG. 7 is a visual representation of an example output of diarization machine 602. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking; the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking), 
 	or, 
 	obtaining sound attribute information corresponding to the voice information; determining, in a pre-stored face set, a historical face image corresponding to the sound attribute information; searching, in the video information, for a target face image that matches the historical face image; and determining a portrait corresponding to the target face image as the speaking object corresponding to the voice information.  

 	Negishi and Krupka et al. are analogous art because they are from a similar field of endeavor in the Signal Processing techniques and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of converting the spoken language into sign language as taught by Negishi, using the teaching of detecting lip movement of the human and identifying the face of the human as taught by Krupka et al., for the benefit of determining who is speaking, when that speaker is speaking and where that speaker is speaking (Krupka et al. [0047] FIG. 7 is a visual representation of an example output of diarization machine 602. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking; the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking. Diarization machine 602 may use this WHO/WHEN/WHERE information to label corresponding segments 604 of the audio signal(s) 606 under analysis with labels 608. The segments 604 and/or corresponding labels may be output from the diarization machine 602 in any suitable format.)
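
For illustration only, the lip-movement alternative recited in Claim 1 (recognize faces, select the face showing an opening and closing action of a lip, treat that face as the speaking object) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Negishi, Krupka et al., or the application: the per-frame "mouth openness" values, the `threshold` and `min_cycles` parameters, and both function names are hypothetical.

```python
from typing import Dict, List, Optional


def count_lip_cycles(openness: List[float], threshold: float = 0.3) -> int:
    """Count open-to-closed transitions in a per-frame mouth-openness series.

    `openness` is a hypothetical normalized measure (0 = closed, 1 = wide open).
    """
    cycles = 0
    was_open = False
    for value in openness:
        is_open = value > threshold
        if was_open and not is_open:
            cycles += 1  # the lip just closed after having been open
        was_open = is_open
    return cycles


def pick_speaking_face(
    tracks: Dict[str, List[float]], min_cycles: int = 2
) -> Optional[str]:
    """Return the label of the face with the most lip cycles, or None.

    A face qualifies only if it shows at least `min_cycles` open/close cycles.
    """
    best_label: Optional[str] = None
    best_cycles = 0
    for label, openness in tracks.items():
        cycles = count_lip_cycles(openness)
        if cycles >= min_cycles and cycles > best_cycles:
            best_label, best_cycles = label, cycles
    return best_label


# Hypothetical openness series for three recognized faces.
tracks = {
    "FACE(1)": [0.1, 0.5, 0.1, 0.6, 0.1, 0.5, 0.1],
    "FACE(2)": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
    "FACE(3)": [0.1, 0.5, 0.5, 0.5, 0.1, 0.1, 0.1],
}
speaker = pick_speaking_face(tracks)
```

In this sketch a single prolonged mouth opening, as in the third series, does not qualify as speaking, because only repeated opening and closing is counted.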

	With respect to Claim 3, Negishi in view of Krupka et al. teach 
 	wherein, when the target face image is determined according to the face image showing the opening and closing action of the lip (Krupka et al. [0060] the system at some point may make a determination as to when to conduct sampling and/or, for a given amount of sampled audio, what portions of such audio are to be designated as sampled to be used for generating the voice print. Examples of such determination include (1) sampling and/or processing audio beginning at a point when the speaker lips are moving), after the determining a portrait corresponding to the target face image as the speaking object corresponding to the voice information, the method further comprises: obtaining sound attribute information corresponding to the voice information (Krupka et al. [0068] Detecting that the human speaker may include analyzing output from microphone(s) 108 (e.g., detecting output above a threshold magnitude), Fig. 14 element 1416 Establish based at least on sample, voiceprint of human speaker); and associating and storing the sound attribute information and the target face image (Krupka et al. Fig. 14 element 1430 Diarize other sample to associate each respective utterance with corresponding human speaker, [0069] For example, the voiceprint may be established based on data used by or output from face location machine 124, such as a location output from the machine of the human speaker's face that helps spatially associate utterances made by the human speaker with his or her location; and/or data used by or output from face identification machine(s) 126, such as an identity that can be associated with the voiceprint and audio samples representing utterances from the human speaker.)
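
For illustration only, the "associating and storing" step of Claim 3, together with the later lookup of a stored face from sound attribute information, can be sketched as a small in-memory store. This is a minimal sketch under stated assumptions: the two-value attribute tuple (amplitude, normalized pitch), the `tolerance` parameter, and the `FaceStore` class are hypothetical and are not drawn from the cited references.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Hypothetical sound attributes: (amplitude, normalized pitch), each in [0, 1].
SoundAttributes = Tuple[float, float]


@dataclass
class FaceStore:
    """Associates stored face identifiers with sound attribute information."""

    records: Dict[str, SoundAttributes] = field(default_factory=dict)

    def associate(self, face_id: str, attrs: SoundAttributes) -> None:
        """Store the sound attributes observed while this face was speaking."""
        self.records[face_id] = attrs

    def lookup(self, attrs: SoundAttributes, tolerance: float = 0.1) -> Optional[str]:
        """Return the stored face whose attributes are closest, within tolerance."""
        best_id: Optional[str] = None
        best_dist = tolerance
        for face_id, stored in self.records.items():
            # Largest per-attribute difference between query and stored record.
            dist = max(abs(a - b) for a, b in zip(attrs, stored))
            if dist <= best_dist:
                best_id, best_dist = face_id, dist
        return best_id


store = FaceStore()
store.associate("FACE(1)", (0.8, 0.2))
store.associate("FACE(2)", (0.3, 0.6))
match = store.lookup((0.78, 0.22))
no_match = store.lookup((0.0, 0.0))
```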

With respect to Claim 5, Negishi in view of Krupka et al. teach 
wherein the sound attribute information comprises: amplitude information (Negishi [0125] the messages in spoken language are text information. The meta-information in spoken language is information indicating speaking speed, voice volume), audio information, and/or accent cycle information. The Examiner notes that the term “or” in the claimed language connotes a disjunctive list.

With respect to Claim 6, Negishi in view of Krupka et al. teach 
 	wherein before the superimposing and displaying an augmented reality (AR) sign language animation corresponding to the voice information on a gesture area corresponding to the speaking object to obtain a sign language video, the method further comprises: 
 	performing a semantic recognition on the voice information to obtain voice text information (Negishi Fig. 4 element S126 Extract word through speech recognition); 
 	querying, in a pre-stored AR gesture animation, at least one AR gesture animation corresponding to the voice text information (Negishi Fig. 4 element S128 Translate into sign language); and 
 	obtaining a sign language AR animation corresponding to the voice information according to the at least one AR gesture animation (Negishi Fig. 4 element S132 Generate moving image of gesture.)  
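
For illustration only, the sequence mapped above for Claim 6 (recognized text, a query of pre-stored gesture animations, and assembly of the resulting animation) can be sketched as a dictionary lookup. This is a minimal sketch under stated assumptions: the `GESTURE_LIBRARY` contents, the `.anim` clip names, and the function name are hypothetical, and speech recognition itself is assumed to have already produced the text.

```python
from typing import Dict, List

# Hypothetical pre-stored mapping from a word to an AR gesture animation clip.
GESTURE_LIBRARY: Dict[str, str] = {
    "turn": "turn.anim",
    "right": "right.anim",
    "corner": "corner.anim",
    "street": "street.anim",
}


def text_to_gesture_sequence(
    text: str, library: Dict[str, str] = GESTURE_LIBRARY
) -> List[str]:
    """Return the ordered gesture clips for the words found in the library.

    Words without a stored gesture are skipped in this sketch.
    """
    words = [word.strip(".,!?").lower() for word in text.split()]
    return [library[word] for word in words if word in library]


# Text assumed to come from a prior speech recognition step.
clips = text_to_gesture_sequence("Turn right at the second corner on this street.")
```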

With respect to Claim 8, Negishi in view of Krupka et al. teach 
 	further comprising: 
 	obtaining gesture action information of a user himself in the video information (Negishi Fig. 4 element S122 Perform gesture recognition on sign language word); 
 	obtaining action text information of the gesture action information (Negishi [0145] For the sign language of the first user 10, the device 100 performs speech synthesis to generate speech indicating the spoken language converted from the sign language (step S130). For example, the device 100 performs speech synthesis by Text to Speech technology); 
 	searching, in a pre-stored voice information, for user voice information corresponding to the action text information (Negishi Fig. 4 element S124 Translate into spoken language); and 
 	playing the user voice information (Negishi Fig. 4 element S130 Perform speech synthesis.)

 	With respect to Claim 10, Negishi discloses 
 	A sign language information processing apparatus, comprising: 
 	a memory, a processor, and a computer program stored on the memory and operable on the processor (Negishi [0222] The storage unit 908 may include a storage medium, a recording unit that records data in the storage medium, a reading unit that reads data from the storage medium, a deletion unit that deletes data recorded in the storage medium, and the like. This storage unit 908 stores programs and various kinds of data to be executed by the CPU 901, various kinds of data acquired from the outside, and the like),
 	wherein the processor, when running the computer program (Negishi [0222] The storage unit 908 may include a storage medium, a recording unit that records data in the storage medium, a reading unit that reads data from the storage medium, a deletion unit that deletes data recorded in the storage medium, and the like. This storage unit 908 stores programs and various kinds of data to be executed by the CPU 901, various kinds of data acquired from the outside, and the like), is configured to:
 	obtain voice information and video information collected by a user terminal in real time (Negishi [0040] the sensor section 110 may include a microphone that senses sound information, [0041] the sensor section 110 may include an image sensor that senses an image (still image or moving image));
 	determine, in the video information, a speaking object corresponding to the voice information (Negishi [0136] the device 100 first recognizes the context of the activity of a user (step S110). For example, the device 100 recognizes the context of activity by classifying the activity of a user into the three classes of gesturing, speaking, and the others on the basis of sensing information obtained by a camera); and
 	superimpose and display an augmented reality (AR) sign language animation corresponding to the voice information on a gesture area corresponding to the speaking object to obtain a sign language video (Negishi [0123] in a case where the first action subject is a spoken language user and the second action subject is a sign language user, the device 100 may convert a message expressed by the first action subject using the spoken language into a moving image of a hand performing the sign language gesture corresponding to the message. The gesture is superimposed on the first action subject displayed on a transmissive display, and displayed in an AR manner. Watching the gesture performed by a hand, superimposed on the first action subject displayed on the transmissive display, and displayed in an AR manner allows the second action subject to recognize the message outputted by the first action subject. This allows the second action subject to recognize a message from the first action subject as if the first action subject actually made a remark in sign language, [0156] the second user 20 briefly speaks, to the first user 10, “Turn right at the second corner on this street.” The device 100 then performs speech recognition in real time, and superimposes and displays, in an AR manner, an arm 30 performing the corresponding sign language gesture on the second user 20 displayed on the see-through display.)
	Negishi fails to explicitly teach 
 	wherein the processor is further configured to: 
 	recognize at least one face image in the video information, determine a face image showing an opening and closing action of a lip as a target face image, and determine a portrait corresponding to the target face image as the speaking object corresponding to the voice information, 
 	or, 
 	obtain sound attribute information corresponding to the voice information, determine, in a pre-stored face set, a historical face image corresponding to the sound attribute information, search, in the video information, for a target face image that matches the historical face image, and determine a portrait corresponding to the target face image as the speaking object corresponding to the voice information.  
	However, Krupka et al. teach 
	wherein the processor (Krupka et al. [0078] one or more processor) is further configured to: 
 	recognize at least one face image in the video information (Krupka et al. [0027] As shown in Fig. 4, face location machine 124 is configured to find candidate faces 166 in digital video 114. As an example, Fig. 4 shows face location machine 124 finding candidate FACE(1) at 23°, candidate FACE(2) at 178°, and candidate FACE(3) at 303°. The candidate faces 166 output by the face location machine 124 may include coordinates of a bounding box around a located face image, a portion of the digital image where the face was located, other location information (e.g., an angle such as 23°), and/or labels (e.g., “FACE(1)”)), determine a face image showing an opening and closing action of a lip as a target face image (Krupka et al. [0060] the system at some point may make a determination as to when to conduct sampling and/or, for a given amount of sampled audio, what portions of such audio are to be designated as sampled to be used for generating the voice print. Examples of such determination include (1) sampling and/or processing audio beginning at a point when the speaker lips are moving), and determine a portrait corresponding to the target face image as the speaking object corresponding to the voice information (Krupka et al. [0028] Face identification machine 164 optionally may be configured to determine an identity 168 of each candidate face 166 by analyzing just the portions of the digital video 114 where candidate faces 166 have been found, [0047] FIG. 7 is a visual representation of an example output of diarization machine 602. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking; the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking), 
 	or, 
 	obtain sound attribute information corresponding to the voice information, determine, in a pre-stored face set, a historical face image corresponding to the sound attribute information, search, in the video information, for a target face image that matches the historical face image, and determine a portrait corresponding to the target face image as the speaking object corresponding to the voice information.  
	Negishi and Krupka et al. are analogous art because they are from a similar field of endeavor in the Signal Processing techniques and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of converting the spoken language into sign language as taught by Negishi, using the teaching of detecting lip movement of the human and identifying the face of the human as taught by Krupka et al., for the benefit of determining who is speaking, when that speaker is speaking and where that speaker is speaking (Krupka et al. [0047] FIG. 7 is a visual representation of an example output of diarization machine 602. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking; the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking. Diarization machine 602 may use this WHO/WHEN/WHERE information to label corresponding segments 604 of the audio signal(s) 606 under analysis with labels 608. The segments 604 and/or corresponding labels may be output from the diarization machine 602 in any suitable format.)

With respect to Claim 12, Negishi in view of Krupka et al. teach 
 	wherein, when the target face image is determined according to the face image showing the opening and closing action of the lip (Krupka et al. [0060] the system at some point may make a determination as to when to conduct sampling and/or, for a given amount of sampled audio, what portions of such audio are to be designated as sampled to be used for generating the voice print. Examples of such determination include (1) sampling and/or processing audio beginning at a point when the speaker lips are moving), the processor, after determining the portrait corresponding to the target face image as the speaking object corresponding to the voice information, is further configured to: 
 	obtain sound attribute information corresponding to the voice information (Krupka et al. [0068] Detecting that the human speaker may include analyzing output from microphone(s) 108 (e.g., detecting output above a threshold magnitude), Fig. 14 element 1416 Establish based at least on sample, voiceprint of human speaker); and 
 	associate and store the sound attribute information and the target face image (Krupka et al. Fig. 14 element 1430 Diarize other sample to associate each respective utterance with corresponding human speaker, [0069] For example, the voiceprint may be established based on data used by or output from face location machine 124, such as a location output from the machine of the human speaker's face that helps spatially associate utterances made by the human speaker with his or her location; and/or data used by or output from face identification machine(s) 126, such as an identity that can be associated with the voiceprint and audio samples representing utterances from the human speaker.)

With respect to Claim 14, Negishi in view of Krupka et al. teach 
wherein the sound attribute information comprises: amplitude information (Negishi [0125] the messages in spoken language are text information. The meta-information in spoken language is information indicating speaking speed, voice volume), audio information, and/or accent cycle information. The Examiner notes that the term “or” in the claimed language connotes a disjunctive list.

With respect to Claim 15, Negishi in view of Krupka et al. teach 
 	wherein the processor is further configured to: 
 	before superimposing and displaying the augmented reality (AR) sign language animation corresponding to the voice information on the gesture area corresponding to the speaking object to obtain the sign language video, 
 	perform a semantic recognition on the voice information to obtain voice text information (Negishi Fig. 4 element S126 Extract word through speech recognition); 
 	query, in a pre-stored AR gesture animation, at least one AR gesture animation corresponding to the voice text information (Negishi Fig. 4 element S128 Translate into sign language); and 
 	obtain a sign language AR animation corresponding to the voice information according to the at least one AR gesture animation (Negishi Fig. 4 element S132 Generate moving image of gesture.)  

With respect to Claim 17, Negishi in view of Krupka et al. teach 
 	wherein the processor is further configured to: 
 	obtain gesture action information of a user himself in the video information (Negishi Fig. 4 element S122 Perform gesture recognition on sign language word); 
 	obtain action text information of the gesture action information (Negishi [0145] For the sign language of the first user 10, the device 100 performs speech synthesis to generate speech indicating the spoken language converted from the sign language (step S130). For example, the device 100 performs speech synthesis by Text to Speech technology); 
 	search, in a pre-stored voice information, for user voice information corresponding to the action text information (Negishi Fig. 4 element S124 Translate into spoken language); and 
 	play the user voice information (Negishi Fig. 4 element S130 Perform speech synthesis.)

	With respect to Claim 20, Claim 20 recites “A non-transitory readable storage medium, wherein the non-transitory readable storage medium stores a computer program, and the computer program, when executed by a processor, implements the sign language information processing method according to claim 1.” Thus, Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Negishi (US 2021/0150145 A1) in view of Krupka et al. (US 2019/0341055 A1) for the same reasons as Claim 1. 

6.	Claims 7 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Negishi (US 2021/0150145 A1) in view of Krupka et al. (US 2019/0341055 A1) and Forutanpour et al. (US 2014/0081634 A1).

With respect to Claim 7, Negishi in view of Krupka et al. teach all the limitations of Claim 1 upon which Claim 7 depends. Negishi in view of Krupka et al. fail to explicitly teach  
wherein before the superimposing and displaying an augmented reality (AR) sign language animation corresponding to the voice information on a gesture area corresponding to the speaking object to obtain a sign language video, the method further comprises: 
 	determining, in the video information, an area around a face of the speaking object; and
determining, in the area around the face, the gesture area corresponding to the speaking object.  
	However, Forutanpour et al. teach 
wherein before the superimposing and displaying an augmented reality (AR) sign language animation corresponding to the voice information on a gesture area corresponding to the speaking object to obtain a sign language video, the method further comprises: 
 	determining, in the video information, an area around a face of the speaking object (Forutanpour et al. [0031] the location of a person's eyes is tracked. Tracking a person's eyes may be useful to determine: who they are speaking to, and for superimposing virtual objects over the person's face and/or eyes (such that when the virtual object is viewed by the user, the user at least appears to be maintaining eye contact with the person), [0038] Various arrangements may be used by display module 170 to present text to the user that is to be attributed to a particular person. Text to be presented to the user may be presented in the form of a virtual object such as a speech bubble. The speech bubble may be a graphical element that indicates to which person text within the speech bubble should be attributed. Speech bubbles may be superimposed on a real-world scene such that they appear near the person who spoke the speech represented by the text. The speech bubbles may be partially transparent such that the user may see what is "behind" the speech bubble in the real-world scene. Display module 170 may also be used to present additional information, such as a name and language of persons present within the scene. In other embodiments, text may be superimposed as a virtual object over the face of the person who spoke the speech corresponding to the text. As such, when the user is reading the text, the user will be looking at the person who spoke the speech); and
determining, in the area around the face, the gesture area corresponding to the speaking object (Forutanpour et al. [0042] Face superimposition module 180 may receive locations and identities associated with faces (and/or heads) from face identification and tracking module 120. Face superimposition module 180 may determine if the face (or, more specifically, the eyes and the facial region around the eyes) should be superimposed with a virtual object, such as text corresponding to speech spoken by the person. For example, based on input received from a user, face superimposition module 180 may not superimpose virtual objects on any face. (That is, the user may have the ability to turn on and off the superimposition of virtual objects on faces.) Face superimposition module 180 may determine which virtual object should be superimposed over the face. Determining which virtual object should be superimposed over the face may be based on the identity of the person associated with the face, whether the person associated with the face is talking, whether the user is looking at the person, whether the user is talking to the person, and/or a set of user preferences defined by the user. In some embodiments, rather than causing text to be superimposed over the face of the person, face superimposition module 180 may control the size, color, transparency, sharpness, and/or location of speech bubbles.)
 	Negishi, Krupka et al. and Forutanpour et al. are analogous art because they are from a similar field of endeavor in the Signal Processing techniques and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of converting the spoken language into sign language as taught by Negishi, using the teaching of detecting lip movement of the human and identifying the face of the human as taught by Krupka et al. for the benefit of determining who is speaking, when that speaker is speaking and where that speaker is speaking, and using the teaching of tracking the face and eyes of the speaking user as taught by Forutanpour et al. for the benefit of controlling the size and the location of the speech bubbles (Forutanpour et al. [0042] Face superimposition module 180 may receive locations and identities associated with faces (and/or heads) from face identification and tracking module 120. Face superimposition module 180 may determine if the face (or, more specifically, the eyes and the facial region around the eyes) should be superimposed with a virtual object, such as text corresponding to speech spoken by the person. For example, based on input received from a user, face superimposition module 180 may not superimpose virtual objects on any face. (That is, the user may have the ability to turn on and off the superimposition of virtual objects on faces.) Face superimposition module 180 may determine which virtual object should be superimposed over the face. Determining which virtual object should be superimposed over the face may be based on the identity of the person associated with the face, whether the person associated with the face is talking, whether the user is looking at the person, whether the user is talking to the person, and/or a set of user preferences defined by the user. In some embodiments, rather than causing text to be superimposed over the face of the person, face superimposition module 180 may control the size, color, transparency, sharpness, and/or location of speech bubbles. Examiner notes that Forutanpour et al. determines the location for speech bubbles before displaying the content in the speech bubbles.)

With respect to Claim 16, Negishi in view of Krupka et al. teach all the limitations of Claim 10 upon which Claim 16 depends. Negishi in view of Krupka et al. fail to explicitly teach
 	wherein the processor, before the superimposing and displaying an augmented reality (AR) sign language animation corresponding to the voice information on a gesture area corresponding to the speaking object to obtain a sign language video, is further configured to: 
 	determine, in the video information, an area around a face of the speaking object; and 
 	determine, in the area around the face, the gesture area corresponding to the speaking object.  
However, Forutanpour et al. teach
 	wherein the processor, before the superimposing and displaying an augmented reality (AR) sign language animation corresponding to the voice information on a gesture area corresponding to the speaking object to obtain a sign language video, is further configured to: 
 	determine, in the video information, an area around a face of the speaking object (Forutanpour et al. [0031] the location of a person's eyes is tracked. Tracking a person's eyes may be useful to determine: who they are speaking to, and for superimposing virtual objects over the person's face and/or eyes (such that when the virtual object is viewed by the user, the user at least appears to be maintaining eye contact with the person), [0038] Various arrangements may be used by display module 170 to present text to the user that is to be attributed to a particular person. Text to be presented to the user may be presented in the form of a virtual object such as a speech bubble. The speech bubble may be a graphical element that indicates to which person text within the speech bubble should be attributed. Speech bubbles may be superimposed on a real-world scene such that they appear near the person who spoke the speech represented by the text. The speech bubbles may be partially transparent such that the user may see what is "behind" the speech bubble in the real-world scene. Display module 170 may also be used to present additional information, such as a name and language of persons present within the scene. In other embodiments, text may be superimposed as a virtual object over the face of the person who spoke the speech corresponding to the text. As such, when the user is reading the text, the user will be looking at the person who spoke the speech); and
  	determine, in the area around the face, the gesture area corresponding to the speaking object (Forutanpour et al. [0042] Face superimposition module 180 may receive locations and identities associated with faces (and/or heads) from face identification and tracking module 120. Face superimposition module 180 may determine if the face (or, more specifically, the eyes and the facial region around the eyes) should be superimposed with a virtual object, such as text corresponding to speech spoken by the person. For example, based on input received from a user, face superimposition module 180 may not superimpose virtual objects on any face. (That is, the user may have the ability to turn on and off the superimposition of virtual objects on faces.) Face superimposition module 180 may determine which virtual object should be superimposed over the face. Determining which virtual object should be superimposed over the face may be based on the identity of the person associated with the face, whether the person associated with the face is talking, whether the user is looking at the person, whether the user is talking to the person, and/or a set of user preferences defined by the user. In some embodiments, rather than causing text to be superimposed over the face of the person, face superimposition module 180 may control the size, color, transparency, sharpness, and/or location of speech bubbles. Examiner notes that Forutanpour et al. determines the location for speech bubbles before displaying the content in the speech bubbles.)
 	Negishi, Krupka et al. and Forutanpour et al. are analogous art because they are from a similar field of endeavor in the Signal Processing techniques and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of converting the spoken language into sign language as taught by Negishi, using the teaching of detecting lip movement of the human and identifying the face of the human as taught by Krupka et al. for the benefit of determining who is speaking, when that speaker is speaking and where that speaker is speaking, and using the teaching of tracking the face and eyes of the speaking user as taught by Forutanpour et al. for the benefit of controlling the size and the location of the speech bubbles (Forutanpour et al. [0042] Face superimposition module 180 may receive locations and identities associated with faces (and/or heads) from face identification and tracking module 120. Face superimposition module 180 may determine if the face (or, more specifically, the eyes and the facial region around the eyes) should be superimposed with a virtual object, such as text corresponding to speech spoken by the person. For example, based on input received from a user, face superimposition module 180 may not superimpose virtual objects on any face. (That is, the user may have the ability to turn on and off the superimposition of virtual objects on faces.) Face superimposition module 180 may determine which virtual object should be superimposed over the face. Determining which virtual object should be superimposed over the face may be based on the identity of the person associated with the face, whether the person associated with the face is talking, whether the user is looking at the person, whether the user is talking to the person, and/or a set of user preferences defined by the user. In some embodiments, rather than causing text to be superimposed over the face of the person, face superimposition module 180 may control the size, color, transparency, sharpness, and/or location of speech bubbles.)

7.	Claims 9, 18 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Negishi (US 2021/0150145 A1) in view of Krupka et al. (US 2019/0341055 A1) and Shahar et al. (US 10,607,069 B2).

With respect to Claim 9, Negishi in view of Krupka et al. teach all the limitations of Claim 8 upon which Claim 9 depends. Negishi in view of Krupka et al. fail to explicitly teach  
 	wherein the obtaining gesture action information of a user himself in the video information comprises: 
 	obtaining a distance of a gesture-like image in the video information, wherein the distance indicates a distance between hands corresponding to the gesture-like image and a camera; and 
 	determining a gesture-like image whose distance is less than a threshold as the gesture action information of the user himself.  
	However, Shahar et al. teach
 	wherein the obtaining gesture action information of a user himself in the video information comprises: 
 	obtaining a distance of a gesture-like image in the video information, wherein the distance indicates a distance between hands corresponding to the gesture-like image and a camera (Shahar et al. col. 1 lines 31-35 Different camera systems obtain the depth information in different ways. One such camera system uses two or more cameras physically spaced apart and compares simultaneous images to determine a distance from the cameras to the hand); and 
 	determining a gesture-like image whose distance is less than a threshold as the gesture action information of the user himself (Shahar et al. col. 2 lines 40-47 the accuracy of a depth determination is limited by the distance between the depth camera and the hand. If the hand is too close or too far, then the depth determination will not be accurate enough to be useful; col. 3 lines 16-17 The maximum distance maxZ, past the second hand, is a distance beyond which the depth data is not accurate; col. 2 lines 15-34 In order to understand a pointing hand gesture, the camera and underlying processing system find the tip of the pointing finger on the pointing hand and then determine the relationship of the fingertip to the rest of the hand. The direction of pointing can be indicated as a vector where the user is pointing in the relevant space. For a virtual reality system, the vector may be in the virtual space. As described herein, a 3D camera may be used to determine a 3D direction vector from gestures performed in front of a camera. The direction vector may then be used to determine a virtual object that the user is attempting to touch or move or any of a variety of other commands and machine inputs. The described techniques may be applied to a multi-purpose 3D camera and can provide added functionality, such as collision avoidance. Using a mid-range multi-purpose 3D camera, precise pointing vector determinations may be made consistently for fingers that range in distance from a few centimeters up to 4 meters from the camera).
 	Negishi, Krupka et al. and Shahar et al. are analogous art because they are from a similar field of endeavor in the Signal Processing techniques and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of converting the spoken language into sign language as taught by Negishi, using the teaching of detecting lip movement of the human and identifying the face of the human as taught by Krupka et al. for the benefit of determining who is speaking, when that speaker is speaking and where that speaker is speaking, and using the teaching of a threshold for the distance between the hand and the camera as taught by Shahar et al. to obtain the gesture information of the user (Shahar et al. col. 2 lines 40-47 the accuracy of a depth determination is limited by the distance between the depth camera and the hand. If the hand is too close or too far, then the depth determination will not be accurate enough to be useful; col. 3 lines 16-17 The maximum distance maxZ, past the second hand, is a distance beyond which the depth data is not accurate.)

With respect to Claim 18, Negishi in view of Krupka et al. teach all the limitations of Claim 17 upon which Claim 18 depends. Negishi in view of Krupka et al. fail to explicitly teach
 	wherein the processor is further configured to: 
 	obtain a distance of a gesture-like image in the video information, wherein the distance indicates a distance between hands corresponding to the gesture-like image and a camera. 
  	However, Shahar et al. teach
wherein the processor is further configured to: 
 	obtain a distance of a gesture-like image in the video information, wherein the distance indicates a distance between hands corresponding to the gesture-like image and a camera (Shahar et al. col. 1 lines 31-35 Different camera systems obtain the depth information in different ways. One such camera system uses two or more cameras physically spaced apart and compares simultaneous images to determine a distance from the cameras to the hand).
 	Negishi, Krupka et al. and Shahar et al. are analogous art because they are from a similar field of endeavor in the Signal Processing techniques and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of converting the spoken language into sign language as taught by Negishi, using the teaching of detecting lip movement of the human and identifying the face of the human as taught by Krupka et al. for the benefit of determining who is speaking, when that speaker is speaking and where that speaker is speaking, and using the teaching of a threshold for the distance between the hand and the camera as taught by Shahar et al. to obtain the gesture information of the user (Shahar et al. col. 2 lines 40-47 the accuracy of a depth determination is limited by the distance between the depth camera and the hand. If the hand is too close or too far, then the depth determination will not be accurate enough to be useful; col. 3 lines 16-17 The maximum distance maxZ, past the second hand, is a distance beyond which the depth data is not accurate.)

With respect to Claim 19, Negishi in view of Krupka et al. and Shahar et al. teach
 	wherein the processor is further configured to: determine a gesture-like image whose distance is less than a threshold as the gesture action information of the user himself (Shahar et al. col. 2 lines 40-47 the accuracy of a depth determination is limited by the distance between the depth camera and the hand. If the hand is too close or too far, then the depth determination will not be accurate enough to be useful; col. 3 lines 16-17 The maximum distance maxZ, past the second hand, is a distance beyond which the depth data is not accurate; col. 2 lines 15-34 In order to understand a pointing hand gesture, the camera and underlying processing system find the tip of the pointing finger on the pointing hand and then determine the relationship of the fingertip to the rest of the hand. The direction of pointing can be indicated as a vector where the user is pointing in the relevant space. For a virtual reality system, the vector may be in the virtual space. As described herein, a 3D camera may be used to determine a 3D direction vector from gestures performed in front of a camera. The direction vector may then be used to determine a virtual object that the user is attempting to touch or move or any of a variety of other commands and machine inputs. The described techniques may be applied to a multi-purpose 3D camera and can provide added functionality, such as collision avoidance. Using a mid-range multi-purpose 3D camera, precise pointing vector determinations may be made consistently for fingers that range in distance from a few centimeters up to 4 meters from the camera).

Conclusion
8.	The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure. 
a.	VanBlon et al. (US 2015/0154983 A1). In this reference, VanBlon et al. disclose a method for detecting a pause in audio input based on determining whether the mouth of the user is moving. 
b.	Vasilieff et al. (US 2013/0021459 A1). In this reference, Vasilieff et al. disclose a method of determining whether the user is speaking based on movement of the user’s mouth. 
c.	Cowburn et al. (US 10,074,381 B1). In this reference, Cowburn et al. disclose a method for determining whether the user is speaking based on facial landmarks of the user. 

9.	THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 

10.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to THUYKHANH LE whose telephone number is (571)272-6429. The examiner can normally be reached Mon-Fri: 9am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew C. Flanders, can be reached on 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/THUYKHANH LE/Primary Examiner, Art Unit 2655