DETAILED ACTION
Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
In response to the office action from 9/29/2021, the applicant has submitted an amendment, filed 12/29/2021, amending claims 1, 5, 7, 8, 17, cancelling claim 20, while arguing to traverse the prior art and other rejections. Applicant’s arguments have been fully considered but are moot with respect to new grounds of rejections further in view of MADHVANATH (US 2012/0075184) mandated by the latest amendments and for the reasons explained in the response to arguments.
Response to Arguments
In what follows applicant’s arguments and comments will be addressed in the order presented with each argument presented in a given ¶, to be followed by one or more ¶’s of examiner’s responses.
Following a broad overview of the latest amendments in section “I” on page 9, section “II” discusses the double patenting rejection of the previous action.
Due to the latest amendments the said rejection is withdrawn.
Section “III” on pages 8 and 9 discusses the previous claim objections.
Due to the latest amendments the said objections are withdrawn.

Due to the latest amendments the said rejections are withdrawn.
Following a broad overview of the last office action on page 10, the entire page 11 provides arguments directed at the amended limitation not present in the last office action. Likewise, the last ¶ of page 12 also provides arguments directed at that limitation.
Please see the new office action, further in view of MADHVANATH, for that limitation.
The third ¶ of page 12 discusses the previous claim 20, now absorbed by claim 1. In particular, it is argued that: “Further, Lee is not teaching tracking motions of the lips themselves and correlating them to a key phrase, Lee teaches no key phrases or commands associated with lip motion in particular”.
As an initial matter, Lee does specifically teach tracking “lips” “movement” to identify a specific act associated with a mode switch; i.e., see Lee ¶ 0051 first column, last 9 lines: “lips” “can be detected for its movement” “between images to determine whether to activate the power saving mode”. Since the “activat[ion]” is triggered by analysis of “lip images”, it amounts to detection of a silent predetermined “lip movement” (e.g. a key phrase emulation) associated with that “activat[ion]”. Secondly the two limitations for which Lee was relied on did not require any “correlating” with a “key phrase”. They merely required an “associat[ion]” with a “key phrase”. Lee ¶ 0029 last two lines in relation to “power saving mode” which the office action relied on to 
Nonetheless, the other limitation added in this amendment, which specifically requires a “key phrase” detected by “lip” “images”, is now addressed by the new reference, to which the applicant is respectfully directed in the new office action.
Claim Objections
Claim 8 is objected to because of the following informalities: “determine at least some content is determined” contains a redundant and grammatically incorrect duplication of a verb. Appropriate correction is required.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-13 are rejected under 35 U.S.C. 103 as being unpatentable over MIYAMOTO (US 2020/0106884), in view of Prasad et al. (US Patent 5,680,481) and MADHVANATH (US 2012/0075184) and further in view of Lee et al. (US 2014/0043498).
Regarding claim 1, MIYAMOTO does teach a lip language recognition method, applied to a mobile terminal (¶ 0118 sentence 1: “The contents of the uttered sentence 
having a sound mode and a silent mode (“lip sync” according to ¶ 0103 line 2 only requires information associated with “motion of the mouth of the user on the basis of the image of the user” and according to ¶ 0146 “in case of getting on the train” “With respect to the audio it is difficult to speak” (a silent mode) because according to ¶ 0195 lines 3+: “in case of being in a train” “on the basis of the lip sync information” “the uttered sentence is also generatable”; ¶ 0147: “in case of getting on a car” “it is possible to freely speak” (this requires a sound mode)), 
the method comprising:
collecting a user's lip images in the silent mode (¶ 0195 lines 3+: “in case of being in a train” (in the silent mode) “on the basis of the lip sync information” (collecting a user’s lip images, because “lip sync” according to ¶ 0103 line 2 only requires information associated with “motion of the mouth of the user on the basis of the image of the user”)); and
identifying content corresponding to the user’s lip images with a deep neural network 
MIYAMOTO does not specifically disclose:
training a deep neural network in the sound mode.
Prasad et al. do teach:
training a deep neural network in the sound mode (Col. 2 lines 38-40: “using a time delay neural network” (training a deep neural network) “visual speech recognition system in conjunction with the acoustic speech recognition system” (in the sound mode using “acoustic” features which require verbal input, because according to Col. 11 lines 27-31: “Referring back to FIG. 1 of the combined acoustic and visual speech recognition system, the acoustic data signals occurred within a time window of one second duration and were taken simultaneously by a cardoid microphone 28” (the “acoustic data” required a “microphone” (sound mode); note that here also a “visual only” (silent mode) “(VO) multilayer TDNN” (deep neural network) according to Col. 3 lines 38-39 is supported))).
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the “lip” motion coupled with “acoustic” “data” “TDNN” of Prasad et al. into the machine learning of MIYAMOTO, because doing so would enable the combined systems and their associated methods to perform in 
MIYAMOTO in view of Prasad et al. do not specifically disclose:
Identifying content corresponding to the user’s lip images including at least one key phrase corresponding to an associated function performed by the mobile terminal.
MADHVANATH does teach:
Identifying content corresponding to the user’s lip images including at least one key phrase corresponding to an associated function performed by the mobile terminal (¶ 0009 sentence 2: “embodiments  propose using silent speech” “lip movement” (using lip images since it is “camera-based lip reading” (¶ 0020 last 2 lines)) “in a multimodal command scenario, where silent speech may act as one of the commands to a computer system” (to recognize a “command” ( i.e., key phrase: ¶ 0033 lines 5-6: “wherein a user may just mouth the command word(s) without actually speaking them”), which is associated with a function corresponding to the “computer system” (i.e., ¶ 0031 sentence 2: e.g.,  “a mobile device” (mobile terminal)); For example ¶ 0021 last 8 lines: “he or she may mouth the second command” “100%” (a key phrase by just mouth movement identified with a zooming content)); or ¶ 0023 last 6 lines: “A switch between silent speech mode and a voice based speech mode takes place seamlessly”).

MIYAMOTO in view of Prasad et al. and MADHVANATH do not specifically disclose:
starting the silent mode with a user input of a key phrase;
wherein the key phrase is recognized by the mobile terminal through the user's lip movements without associated voice.
LEE et al. do teach:
starting the silent mode with a user input of a key phrase; wherein the key phrase is recognized by the mobile terminal through the user's lip movements without associated voice (¶ 0050 lines 1-3: “intelligent mode changing module 167” which “checks whether a human face is captured by analyzing an image received from the camera” (i.e., the analysis of the face (e.g. lips included) is done without detection of any associated voice); ¶ 0051 1st column, last 9 lines: “lips” “can be detected for its 
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the function of the “intelligent mode changing module 167” of LEE et al. into the “smartphone” or “laptop” of MIYAMOTO in MIYAMOTO in view of Prasad et al. and MADHVANATH, because doing so would enable the combined systems and their associated methods to perform in combination as they do separately and would further enable MIYAMOTO in view of Prasad et al. and MADHVANATH to enjoy “an automatic power saving function” as disclosed in LEE et al. ¶ 0029 lines 8-9.

Regarding claim 2, MIYAMOTO does teach the method of claim 1, wherein the training comprises:
collecting lip images and corresponding voice data for training (FIG. 10 middle of lower row: “Received communication data” (data used for training comprises) “image” 
MIYAMOTO does not specifically disclose:
obtaining image data corresponding to the collected lip images for training, the image data comprising pixel information; and
training the deep neural network based on the image data and the voice data for training.
Prasad et al. do teach:
obtaining image data corresponding to the collected lip images for training, the image data comprising pixel information (Abstract last 8 lines: for “speech features such as lower and upper lip” “Time derivatives are estimated by pixel position” (image data comprises pixel information of the lip images)); and
training the deep neural network based on the image data and the voice data for training (col. 2 lines 41-43: “Another object is to provide the classifier” (training the deep neural network) “with a continuous stream related visual” (using image) “and acoustic data” (and voice data) “from which the acoustical utterance may be detected and classified”; note the “classifier” corresponds to the “time delay neural network classifier” (Col. 2 lines 48-49)).
For obviousness to combine MIYAMOTO and Prasad et al. see claim 1.


collecting lip images and corresponding voice data for training (FIG. 10 middle of lower row: “Received communication data” (data used for training comprises) “image” “audio: uttered sentence” (voice data), where the “image” is attributed to “lip sync” (lip images)).
MIYAMOTO does not specifically disclose:
obtaining image data corresponding to the collected lip images for training, the image data comprising pixel information;
obtaining text encoding corresponding to the voice data for training; and
training the deep neural network based on the image data and the text encoding for training.
Prasad et al. do teach:
obtaining image data corresponding to the collected lip images for training, the image data comprising pixel information (Abstract last 8 lines: for “speech features such as lower and upper lip” “Time derivatives are estimated by pixel position” (image data comprises pixel information of the lip images));
obtaining text encoding corresponding to the voice data for training (Col. 2 lines 62+: “The acoustic feature extraction apparatus converts acoustic speech signals representative of an utterance into a corresponding spectral feature vector set” 
training the deep neural network based on the image data and the text encoding for training (Col. 3 lines 1+: “The neural network classifying” (training the deep neural network) “converts the dynamic acoustic” (using text encoding and) “and visual feature vectors” (and the image data) “into a conditional probability distribution”; note: Col. 11 lines 17+: “pixel image” (i.e., the image data) “is then supplied to the visual feature vector generator”).
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the “lip” motion coupled with “acoustic” “data” “TDNN” of Prasad et al. into the machine learning of MIYAMOTO, because doing so would enable the combined systems and their associated methods to perform in combination as they do separately and would further enable the combination to determine a “conditional probability distribution that describes the probability of each candidate utterance having been spoken given the observed acoustic and visual data” as disclosed in Prasad et al. Col. 3 lines 1-5.

Regarding claim 4, MIYAMOTO does teach the method of claim 3, wherein the identifying the content corresponding to the user’s lip images with the deep neural network comprises:

MIYAMOTO does not specifically disclose:
identifying user text encoding corresponding to the user’s lip images by applying the deep neural network on the user image data.
Prasad et al. do teach:
identifying user text encoding corresponding to the user’s lip images by applying the deep neural network on the user image data (Col. 3 lines 1+: “The neural network classifying” (applying the deep neural network) “converts the dynamic acoustic and visual feature vectors” (on the image data) “into a conditional probability distribution”(to determine text encoding data corresponding to the lip images) “that describes the probability of each candidate utterance” “spoken”).
For obviousness to combine MIYAMOTO and Prasad et al. see claim 3.

Regarding claim 5, MIYAMOTO does teach the method of claim 2, further comprising extracting one or more user's voice features based on the voice data for training (¶ 0108 sentence 2: “Requesting at various reproduction levels, for example, the audio of the normal conversation, i.e., the audio in which the contents of the uttered sentence and the intonation (the speed, the pitch, the volume, the intonation, and the like) are not missing” (features of a user voice such as “pitch” “volume” “intonation” (tone) and “speed” are extracted)).

Regarding claim 6, MIYAMOTO does teach the method of claim 5, wherein the user’s voice features comprise at least one of tone color, pitch, or volume (¶ 0108 sentence 2: “Requesting at various reproduction levels, for example, the audio of the normal conversation, i.e., the audio in which the contents of the uttered sentence and the intonation (the speed, the pitch, the volume, the intonation, and the like) are not missing” (features of a user voice such as “pitch” “volume” “intonation” (tone) and “speed” are extracted)).

Regarding claim 7, MIYAMOTO does teach the method of claim 6, further comprising synthesizing user voice data having the user’s voice features based on the extracted user’s voice features and an associated function being associated with one or more items of  content corresponding to the user’s lip images (¶ 0121: “The audio synthesis unit 44 complements the input audio” “on the basis of auxiliary information” “Typically the complemented audio data is synthesized” (synthesizing based on) “audio to which the intonation” (voice features) “is added and  audio expressed by the contents of the uttered sentence” “as communication data”; ¶ 0125 last sentence: “The 

Regarding claim 8, MIYAMOTO does teach a mobile terminal (¶ 0118 sentence 1: “The contents of the uttered sentence including the backchannels can be sufficiently complemented on the basis of the lip sync” (using a lip language recognition to understand an utterance associated with a sentence) “information”; ¶ 0083 last sentence: as examples of the systems this technique is applied to is “laptop PC and smartphone” (a mobile terminal))
having a sound mode and a silent mode (“lip sync” according to ¶ 0103 line 2 only requires information associated with “motion of the mouth of the user on the basis of the image of the user” and according to ¶ 0146 “in case of getting on the train” “With respect to the audio it is difficult to speak” (a silent mode) because according to ¶ 0195 lines 3+: “in case of being in a train” “on the basis of the lip sync information” “the uttered sentence is also generatable”; ¶ 0147: “in case of getting on a car” “it is possible to freely speak” (this requires a sound mode)), 
comprising:
 acquires audio (input information) captured by the microphone”; ¶ 0092 line 1: “The video acquisition unit 32 acquires video (input information) captured by the camera”);
and a processing portion (¶ 0081 lines 1-4: “The controller 11” which “includes” a “CPU”);
wherein: the acquisition portion is configured to acquire a user's lip images (¶ 0195 lines 3+: “in case of being in a train” (in the silent mode) “on the basis of the lip sync information” (collecting a user’s lip images, because “lip sync” according to ¶ 0103 line 2 only requires information associated with “motion of the mouth of the user on the basis of the image of the user”));
and the processing portion being provided in communication with the acquisition portion and being configured to identify content corresponding to the user’s lip images by utilizing a deep neural network (¶ 0103: “mouth area recognition unit 42 detects a motion of the mouth of the user on the basis of the image of the user 5 output” “and generates words uttered” (identifying content based on) “by the user 5 as lip synchronization” “lip sync” (based on “mouth” and/or “lip” “image”) “utilizing machine learning” (using a deep neural network)).
MIYAMOTO does not specifically disclose:
the deep neural network established in the sound mode.
Prasad et al. do teach:

It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the “lip” motion coupled with “acoustic” “data” “TDNN” of Prasad et al. into the machine learning of MIYAMOTO, because doing so would enable the combined systems and their associated methods to perform in combination as they do separately and would further help “improve the performance of speech recognition systems that only use acoustic or visual lip position” as disclosed in Prasad et al. Col. 2 lines 30-33.
MIYAMOTO in view of Prasad et al. do not specifically disclose:
Wherein the processing portion is provided with a plurality of computer executable instructions to perform various steps, including the following:
Determine at least some content, wherein the content contains at least some information corresponding to the user’s lip images including at least one key phrase corresponding to an associated function performed by the mobile terminal.
MADHVANATH does teach:
Determine at least some content, wherein the content contains at least some information corresponding to the user’s lip images including at least one key phrase corresponding to an associated function performed by the mobile terminal (¶ 0009 sentence 2: “embodiments  propose using silent speech” “lip movement” (using lip images since it is “camera-based lip reading” (¶ 0020 last 2 lines)) “in a multimodal command scenario, where silent speech may act as one of the commands to a computer system” (to recognize a “command” ( i.e., key phrase: ¶ 0033 lines 5-6: “wherein a user may just mouth the command word(s) without actually speaking them”), which is associated with a function corresponding to the “computer system” (i.e., ¶ 0031 sentence 2: e.g.,  “a mobile device” (mobile terminal)); For example ¶ 0021 last 8 lines: “he or she may mouth the second command” “100%” (a key phrase by just mouth movement identified with a zooming content)); or ¶ 0023 last 6 lines: “A switch between silent speech mode and a voice based speech mode takes place seamlessly”).
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the methods associated with “lip movement” and analysis of MADHVANATH into the “lip sync” techniques of 
MIYAMOTO in view of Prasad et al. and MADHVANATH do not specifically disclose:
start the silent mode upon recognition of an associated key phrase;
wherein the associated key phrase is recognized by the mobile terminal through the user's lip movements without associated voice.
LEE et al. do teach:
start the silent mode upon recognition of an associated key phrase;
wherein the associated key phrase is recognized by the mobile terminal through the user's lip movements without associated voice (¶ 0050 lines 1-3: “intelligent mode changing module 167” which “checks whether a human face is captured by analyzing an image received from the camera” (i.e., the analysis of the face (e.g. lips included) is done without detection of any associated voice); ¶ 0051 1st column, last 9 lines: “lips” “can be detected for its movement” (user’s lip movement (associated with a key phrase) detected without any associated voice) “between images to determine whether to activate the power saving mode” (to start a silent mode by e.g. turning off of one of the 
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the function of the “intelligent mode changing module 167” of LEE et al. into the “smartphone” or “laptop” of MIYAMOTO in MIYAMOTO in view of Prasad et al. and MADHVANATH, because doing so would enable the combined systems and their associated methods to perform in combination as they do separately and would further enable MIYAMOTO in view of Prasad et al. and MADHVANATH to enjoy “an automatic power saving function” as disclosed in LEE et al. ¶ 0029 lines 8-9.

Regarding claim 9, MIYAMOTO does teach the mobile terminal of claim 8, wherein:
the acquisition portion is configured to collect, with an imaging device and a microphone device,  lip images and corresponding voice data for training in the sound mode (FIG. 10 middle of lower row: “Received communication data” (data used for training comprises) “image” “audio: uttered sentence” (voice data), where the “image” is attributed to “lip sync” (lip images)).

And the processing portion is configured to:
obtain image data including pixel information based on the collected lip images for training; and
train the deep neural network according to the image data and the voice data for training.
Prasad et al. do teach:
obtain image data including pixel information based on the collected lip images for training (Abstract last 8 lines: for “speech features such as lower and upper lip” “Time derivatives are estimated by pixel position” (image data comprises pixel information of the lip images)); and
train the deep neural network according to the image data and the voice data for training (col. 2 lines 41-43: “Another object is to provide the classifier” (training the deep neural network) “with a continuous stream related visual” (using image) “and acoustic data” (and voice data) “from which the acoustical utterance may be detected and classified”; note the “classifier” corresponds to the “time delay neural network classifier” (Col. 2 lines 48-49)).
For obviousness to combine MIYAMOTO and Prasad et al. see claim 8.


the acquisition portion is configured to acquire lip images and corresponding voice data for training  in the sound mode (FIG. 10 middle of lower row: “Received communication data” (data used for training comprises) “image” “audio: uttered sentence” (voice data), where the “image” is attributed to “lip sync” (lip images)).
MIYAMOTO does not specifically disclose:
The processing portion is configured to:
obtain image data corresponding to the collected lip images for training;
obtain text encoding for training corresponding to the voice data for training; and
train the deep neural network according to the image data and the text encoding for training.
Prasad et al. do teach:
obtain image data corresponding to the collected lip images for training (Abstract last 8 lines: for “speech features such as lower and upper lip” “Time derivatives are estimated by pixel position” (image data comprises pixel information of the lip images) “as inputs to a time-delay neural network” (for training by the neural network));
obtain text encoding for training corresponding to the voice data for training (Col. 2 lines 62+: “The acoustic feature extraction apparatus converts acoustic speech signals representative of an utterance into a corresponding spectral feature vector set” 
train the deep neural network according to the image data and the text encoding for training (Col. 3 lines 1+: “The neural network classifying” (training the deep neural network) “converts the dynamic acoustic” (using text encoding and) “and visual feature vectors” (and the image data) “into a conditional probability distribution”; note: Col. 11 lines 17+: “pixel image” (i.e., the image data) “is then supplied to the visual feature vector generator”).
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the “lip” motion coupled with “acoustic” “data” “TDNN” of Prasad et al. into the machine learning of MIYAMOTO, because doing so would enable the combined systems and their associated methods to perform in combination as they do separately and would further enable the combination to determine a “conditional probability distribution that describes the probability of each candidate utterance having been spoken given the observed acoustic and visual data” as disclosed in Prasad et al. Col. 3 lines 1-5.

Regarding claim 11, MIYAMOTO does not specifically disclose the mobile terminal of claim 10, wherein the processing portion is further configured to identify the text encoding for training using the deep neural network.

identify the text encoding for training using the deep neural network (Col. 18 lines 15-17: “a TDNN based AO acoustic processor has been described” (i.e., all “acoustic signals” responsible for obtaining “spectral feature vector set” (text encoding data (Col. 2 lines 62+)) are identified using the “TDNN” (the deep neural network))).
For obviousness to combine MIYAMOTO and Prasad et al. see claim 10.

Regarding claim 12, MIYAMOTO does teach the mobile terminal of claim 10, further comprising a feature extraction portion configured to obtain a user’s voice features according to the voice data for training; wherein the voice features comprise at least one of tone color, pitch, or volume (¶ 0108 sentence 2: “Requesting at various reproduction levels, for example, the audio of the normal conversation, i.e., the audio in which the contents of the uttered sentence and the intonation (the speed, the pitch, the volume, the intonation, and the like) are not missing” (features of a user voice such as “pitch” “volume” “intonation” (tone) and “speed” are extracted)).

Regarding claim 13, MIYAMOTO does teach the mobile terminal of claim 12, further comprising a speech synthesis portion configured to synthesize voice data with the user’s voice features according to the obtained voice features and the identified content (¶ 0121: “The audio synthesis unit 44 complements the input audio” “on the .

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over MIYAMOTO in view of Prasad et al., MADHVANATH and Lee et al., and further in view of Udodov (US 2017/0264830).
Regarding claim 14, MIYAMOTO in view of Prasad et al., MADHVANATH and Lee et al. do not specifically disclose the mobile terminal of claim 13, wherein the acquisition portion comprises an imaging device disposed at a bottom portion of the mobile terminal.
Udodov does teach a mobile terminal, wherein the acquisition portion comprises an imaging device disposed at a bottom portion of the mobile terminal (¶ 0038 sentence 1: a “smartphone” (a mobile terminal) where a “camera” (imaging device) “on the bottom surface of the device is/are located on the bottom side” (is at the bottom portion of the mobile terminal)).

Claims 15-17 are rejected under 35 U.S.C. 103 as being unpatentable over MIYAMOTO in view of Prasad et al., MADHVANATH, Lee et al., and Udodov, and further in view of Gardos (US 2004/0243416).
Regarding claim 15, MIYAMOTO does teach the mobile terminal according to claim 14, further comprising:
a sending portion configured to encode the synthesized voice data and send the encoded synthesized voice data to a communication station wirelessly (Fig. 10 shows the “Sender Side: Worker who mainly works outside office” (sending portion) to send “communication data” wirelessly and comprising of “Audio” (synthesized voice) to a “Receiver side: business partner” (a communication station), where the “communication data” is “compressed” (encoded) as shown in Fig. 7 steps “ST106” or 
a receiving portion configured to receive a signal from the communication station and perform decoding and conversion into user-recognizable voice data (the “Audio” (recognizable voice data) is intended for someone on the “Receiver Side” (receiving portion of the communication station) which is to be decoded for the recipient).
MIYAMOTO in view of Prasad et al., MADHVANATH, Lee et al., and Udodov do not specifically disclose:
an earpiece configured to play the user-recognizable voice data decoded and converted by the receiving portion.
Gardos does teach:
an earpiece configured to play the user-recognizable voice data decoded and converted by the receiving portion (¶ 0024 lines 1-3: “lip position sensor” “integrated into [an] earpiece” in a system that “recogniz[es] speech of a user based on images of lip of the user obtained by a camera” (Claim 31)).
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate an “earpiece” similar to that of Gardos into the “lip” sensor of MIYAMOTO in MIYAMOTO in view of Prasad et al., MADHVANATH, Lee et al., and Udodov, because doing so would enable the combined systems and their associated methods to perform in combination as they do separately and to 

Regarding claim 16, MIYAMOTO in view of Prasad et al., MADHVANATH and Lee et al. do teach a non-transitory computer-readable medium having instructions stored thereon for execution by the mobile terminal of claim 15 for lip language recognition (MIYAMOTO: ¶ 0222 sentence 1: “That is, the information processing method and the program according to the present technology may be executed not only in a computer system configured by a single computer but also in a computer system in which a plurality of computers cooperatively operate”),
the instructions comprising:
an imaging device capturing the lip images for training in a voice communication; a microphone collecting the voice data corresponding to the lip images for training (MIYAMOTO: FIG. 10 middle of lower row: “Received communication data” (data used for training comprises) “image” “audio: uttered sentence” (voice data), where the “image” is attributed to “lip sync” (lip images), and in so doing it utilizes a “Camera” (Unit “14” Fig. 2) as the imaging device, and a “Microphone” (microphone) for capturing voice data);

saving training results to guide the lip image recognition in the silent mode (Prasad et al.: Col. 20 lines 61+: “At every 50 epochs of training, the weights” (training results) “were recorded” (were saved) “training epochs used VO” (in silent mode since “VO” means “visual data only” (Col. 14 line 36) which does not require voice; “weights” according to Col. 8 lines 7-8 “[are] assigned to the” “pixels” (correspond to lip images)));
 intonation, and the like) are not missing” (features of a user voice such as “pitch” “volume” “intonation” (tone) and “speed” are extracted, and according to ¶ 0121 lines 5-6: “audio to which the intonation” (voice features) “is added” (i.e., “Intonation” was saved when extracted so that it can later be “added” to “audio”)).

Regarding claim 17, MIYAMOTO in view of Prasad et al., MADHVANATH, Lee et al., Udodov and Gardos do teach the non-transitory computer-readable medium of claim 16, wherein the instructions further comprise:
the processing portion identifying the text encoding from the user's image data using the trained deep neural network (Prasad et al.: Col. 3 lines 1+: “The neural network classifying” (applying the deep neural network) “converts the dynamic acoustic and visual feature vectors” (on the image data) “into a conditional probability distribution” (to identify text encoding data corresponding to the lip images) “that describes the probability of each candidate utterance” “spoken”), 
and transmitting the recognized text encoding to the speech synthesis portion;
the voice features being based on one or more voice features saved in the sound mode and the recognized text encoding (MIYAMOTO: ¶ 0121: “The audio synthesis unit 44 complements the input audio” “on the basis of auxiliary information” “Typically the complemented audio data is synthesized” (synthesizing based on) “audio to which the intonation” (voice features or text encoding) “is added and audio expressed by the contents of the uttered sentence” “as communication data”; ¶ 0125 last sentence: “The synthesis” (synthesizing also) “of those facial expression” (based on content corresponding to the user’s lip images) “and gesture are executed also in a case where the missing facial expression and gesture are to be complemented”);
the sending portion encoding and sending the voice data having the voice features to a communication station wirelessly (MIYAMOTO: Fig. 10 shows the “Sender Side: Worker who mainly works outside office” (sending portion) sending “communication data”, which comprises “Audio” (synthesized voice), wirelessly to a “Receiver side: business partner” (a communication station), where the “communication data” is “compressed” (encoded) as shown in Fig. 7 steps “ST106” or “ST107”; ¶ 0075: the “communication module” (e.g. the mobile terminal) communicates “wireless” only); and
the receiving portion receiving from the communication station the voice for decoding (the “Audio” (recognizable voice data) is intended for someone on the .

Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over MIYAMOTO in view of Prasad et al., MADHVANATH, Lee et al., Udodov and Gardos, and further in view of Freeland et al. (US 2003/0028380).
Regarding claim 18, MIYAMOTO in view of Prasad et al., MADHVANATH, Lee et al., Udodov and Gardos do not specifically disclose:
The non-transitory computer-readable medium of claim 17, wherein the instructions further comprise:
downloading sound recording;
the feature extraction portion extracting sound features from the downloaded sound recording; and
mixing the extracted sound features with the saved voice features prior to the synthesizing.
Freeland et al. do teach:
downloading sound recording (¶ 0343 page 21 lines 1-3: “a new character” (e.g. “Elvis” in ¶ 0111 (sound recording)) “can preferably be virtually downloaded” (downloaded));

mixing the extracted sound features with the saved voice features prior to the synthesizing (¶ 0111 sentence 1: “The text to audio conversion operation converts the text message” “to an audio format message representing” “one of several well known character voices (for example, Elvis Presley” and this is done using “character TTS” (¶ 0118) by “prosody adjustment algorithms” (¶ 0118 line 6: e.g. by changing the “prosody” (pitch speech feature) of the saved voice in accordance with the corresponding “prosody” of the downloaded character or the extracted features)).
It would therefore have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the “character TTS” function of Freeland et al. into the “Audio synthesis unit” (unit “44” Fig. 3) of MIYAMOTO in MIYAMOTO in view of Prasad et al., MADHVANATH, Lee et al., Udodov and Gardos, because doing so would enable the combined systems and their associated methods to perform in combination as they do separately and would further enable MIYAMOTO in view of Prasad et al., MADHVANATH, Lee et al., Udodov and Gardos to present their synthesized audio “into a voice representative of a particular character” as disclosed in Freeland et al.

Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over MIYAMOTO in view of Prasad et al., MADHVANATH, Lee et al., Udodov, Gardos and Freeland et al., and further in view of APPLEYARD et al. (US 2018/0205550).
Regarding claim 19, MIYAMOTO in view of Prasad et al., MADHVANATH, Lee et al., Udodov, Gardos and Freeland et al. do not specifically disclose the non-transitory computer-readable medium of claim 18, wherein the instructions further comprise:
obtaining user feedbacks on the text encoding for training; and
training the deep neural network with the obtained user feedbacks.
APPLEYARD et al. do teach:
obtaining user feedbacks on the text encoding for training; and training the deep neural network with the obtained user feedbacks (¶ 0017 lines 6+ do teach using a “neural network” to conduct e.g. “measurements between facial features, such as distances between” “lips”, and in so doing according to ¶ 0018 2nd column lines 11+ it tries “to validate the neural network training” (i.e. does train the neural network) “by prompting the individual” (obtaining a user feedback) “to indicate whether those photos were properly characterized by the neural network”).
Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARZAD KAZEMINEZHAD whose telephone number is (571)270-5860. The examiner can normally be reached 10:30 am to 11:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, DANIEL C WASHBURN, can be reached at (571)272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/Farzad Kazeminezhad/
Art Unit 2657
February 2nd 2022.