DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
In response to the office action from 10/4/2021, the applicant has submitted an amendment, filed 12/28/2021, amending claims 1, 7, 10, 18-20, cancelling claims 6 and 14, while arguing to traverse the prior art and other rejections. Applicant’s arguments have been fully considered but the previous grounds of rejections are maintained for the reasons explained in the response to arguments.
Response to Arguments
In what follows applicant’s arguments and comments will be addressed in the order presented with each argument presented in a given ¶, to be followed by one or more ¶’s of respective examiner’s responses.
Following a broad overview of the latest amendments on page 7 ¶s 1-2, on ¶ 3 the double patenting rejection is discussed; it is asserted that they “will consider filing a terminal disclaimer if and when the claims are otherwise allowed”.
Since original claims 6 and 14 (now absorbed by claims (1,18) and 10 respectively) were not subject to this double patenting rejection, therefore the 
Page 7 ¶ 4 discusses the previous 112(b) rejections.
Due to the latest amendments the said rejections are withdrawn.
Following a broad overview of the last office action from the last ¶ on page 7 through the first ¶ on page 8, the remainder of pages 8-9 provide some broad overviews of the applicant’s understandings of the teachings of Prasad et al. (US Patent 5,680,481). Using those then on page 10 it is concluded that: “whereas the position operation in Prasad is directed to the facial features such as eye, nose, lips. The positioning operation in claim 1 aims to acquire azimuth of the object to be identified, and further, determine a position of a face region of the object to be identified in each frame of image in the sequence of images according to the positioned azimuth of the object to be identified, whereas the positioning operation in Prasad aims to acquire visual feature vectors. And further, Prasad is completely silent about “cropping an image of the face region of the object to be identified from each frame of the images””.
RE “azimuth”:
Respectfully the claim’s “azimuth” was mapped to Prasad’s “angle”; i.e., according to Prasad Col. 10 lines 5+ (office action page 16 ¶ 1): considering “small angle” (azimuth) “of the vertical” considered with respect to the “speaker’s head axis of symmetry” (of the object acquired). This is consistent with how “azimuth” is defined in may be positioned by means of sound localization”;  “An angle between the position of the object to be identified and the central axis may be served as the azimuth of the object to be identified” (¶ 0079 lines 11+ specification). Also according to the plain meaning of the word “azimuth”: “horizontal direction expressed as an angular distance from a fixed point” (Merriam Webster). Based on these definitions the quoted part of Prasad et al. matches one on one with the claim’s azimuth and other details in that limitation pertaining to “azimuth” determination, acquiring and/or identification using a positioning of an object (in this case face of a speaker).
RE: “determine a position of a face region of the object to be identified in each frame of image in the sequence of images according to the positioned azimuth of the object to be identified, whereas the positioning operation in Prasad aims to acquire visual feature vectors”:
According to Prasad et al. Col. 9 lines 47+ (office action page 16 ¶ 3): “The pixels belonging to ROI” (“ROI”=”region of interest”, e.g. mouth or lip of the object “pixels” (positions)) “may be found by defining two coordinate systems (x,y) and (x’, y’)” (i.e. “(x,y)” or the position of pixels of the object are determined according to the equation in Col. 9 line 55 which in summary shows how “(x,y)” is obtained by applying a rotation matrix to “(x’,y’)” plus “(x0,y0)” (“mouth line”). Since the claim is silent on which “position” of which “face region” “of the object” it is concerned, therefore since 
This is further consistent with how the spec. defines these operations: spec. ¶ 0079 lines 9+: “the position of the object to be identified may take a central axis of the field of view range of the camera device as a reference position” “An angle between the position of the object to be identified and the central axis may be served as the azimuth of the object to be identified, and then the position of the face region of the object to be identified in the image may be further positioned according to the azimuth of the object to be identified”. Given these definitions, the above quoted teaching in Prasad maps one-to-one onto the disclosure and the claim.
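For illustration only, the coordinate relation discussed above (image coordinates (x,y) obtained by rotating region-of-interest coordinates (x’,y’) through an angle and offsetting by (x0,y0)) can be sketched as follows. This is a minimal sketch and not the reference's own formulation: the function names roi_to_image and roi_pixels, the counter-clockwise sign convention, and the integer ROI grid are assumptions made here.

```python
import math


def roi_to_image(xp, yp, theta, x0, y0):
    """Map a point (x', y') in the rotated ROI frame to image coordinates (x, y).

    Assumption: a standard counter-clockwise 2-D rotation by theta (radians)
    followed by a translation to (x0, y0). The exact sign convention of the
    cited equation is not reproduced here.
    """
    c, s = math.cos(theta), math.sin(theta)
    x = c * xp - s * yp + x0
    y = s * xp + c * yp + y0
    return x, y


def roi_pixels(width, height, theta, x0, y0):
    """List the image pixels covered by a hypothetical width x height ROI grid."""
    pixels = []
    for yp in range(height):
        for xp in range(width):
            x, y = roi_to_image(xp, yp, theta, x0, y0)
            # Round to the nearest integer pixel position.
            pixels.append((round(x), round(y)))
    return pixels
```

Under these assumptions, the pixel positions of the region depend directly on the angle: changing theta changes which image pixels are selected for the same ROI grid.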
RE: the “cropping” operation:
Here the claim is silent on the scope and even meaning of “cropping”. According to spec. ¶ 0078: “For example, for lip-language identification, only a region containing a face of an object to be identified is required. In order to further improve identification speed, a partial image of the face region of the object to be identified may be cropped from each frame of images so as to generate a sequence of face images. For example, each frame face image is a partial image taken from the entire image of an object to be identified, and the partial image includes a face region”.
Prasad’s Figs. 3, 4, 6, and 9 respectively show a sequence of images in time for different regions in the face, e.g., “MOUTH-LINE 51” (Fig. 3), “MOUTH-LINE 51” again but in Fig. 4, which corresponds to a different frame, likewise “MOUTH-LINE 51” in Fig. 6, which shows the same lip image at two different frames, and again “MOUTH LINE 51” in Fig. 9. These certainly show, at least with respect to the “lip” and “mouth”, a sequence of images corresponding to different “frames” (times) and identified by virtue of their positions with respect to an “AXIS-OF-SYMMETRY”. Therefore these images together show what amounts to “cropping an image of the face region” as defined in the disclosure quoted above.
Since these arguments did not address the office action at all, therefore: Applicant's arguments fail to comply with 37 CFR 1.111(b) because they amount to a general allegation that the claims define a patentable invention without specifically pointing out how the language of the claims patentably distinguishes them from the references.
On page 10 the last ¶ it is argued that “Claims 10 and 18 are similarly [sic] to Claim 1 and thus are patentable for at least the reasons discussed above”.
The same reasoning as above applies to these claims as well.

Since applicants have not argued the merits of these dependent claims, but assert patentability solely through their dependence on the allegedly patentable parent claims, they stand or fall with said parent claims and hence no further response to applicant’s arguments is necessary.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-5, 7-8, 11, 15, and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Vartanian et al. (US 2012/0242865) in view of Bailey et al. (US 2014/0129207) and further in view of Prasad et al. (US Patent 5,680,481).

Regarding claim 1, Vartanian et al. do teach a lip-language identification method (¶ 0054 lines 3-7: “detect” “lip, mouth” “movement” “for speech recognition” and used by  “device 100” to “automatically augment” according to ¶ 0045 lines 4-5), 

acquiring a sequence of face images for an object to be identified (¶ 0054 lines 5+: “Lip, mouth, or tongue movement may be detected when the user is speaking with sound or silently speaking without sound” “Images captured by camera in I/O devices” (acquiring a sequence of face images of user (object) being detected by the “device 100” camera));
performing lip-language identification based on the sequence of face images, so as to determine semantic information of speech content of the object to be identified corresponding to lip actions in the face images (¶ 0054 last 7 lines: “Images captured by camera” (the sequence of facial images) “processed” “to determine user input” “object device 100 may use lip or tongue movement” “for inputting text” “to assist with an existing speech or voice recognition system to interpret spoken language” (to determine, i.e., the “text” (semantic information) of the “input” (speech content) corresponding to the lip or mouth movements of the “user” (object) being identified));
wherein acquiring the sequence of face images for the object to be identified, comprises:
acquiring a sequence of images including the object to be identified (¶ 0054 lines 5+: “Lip, mouth, or tongue movement may be detected when the user is speaking with sound or silently speaking without sound” “Images captured by camera” (acquiring a sequence of images of the face of the “user” “lip, mouth” (object to be identified))).

Vartanian et al. do not specifically disclose:
outputting the semantic information.
Bailey et al. do teach:
Outputting the semantic information (Abstract lines 6+: “the commencement of lip movement by one of the potential speakers and reception of the utterance” “The utterance can be converted to text” (semantic information associated also with lip movement determined) “converted text can then be displayed to the user” (“text” (semantic information) is outputted) “in an augmented reality environment” (to the augmented reality device)).
It would therefore have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the methods associated with perceptions in the augmented device of Bailey et al. into the corresponding ones associated with the augmented object device of Vartanian et al., because doing so would enable the combined systems and their associated methods to perform in combination as they do separately and would further enable a user wearing the augmented device in Vartanian et al. to determine “which of” “potential speakers the converted text should be attributed” when there is a plurality of users in the field of view of the user wearing the augmented device, as disclosed in the last sentence of the Bailey et al. abstract.
Vartanian et al. in view of Bailey et al. do not specifically disclose:
positioning the object to be identified and acquiring the azimuth of the object to be identified; and
determining a position of a face region of the object to be identified in each frame of image in the sequence of images according to the positioned azimuth of the object to be identified; and generating the sequence of face images by cropping an image of the face region of the object to be identified from each frame of the images.
Prasad et al. do teach:
positioning the object to be identified and acquiring the azimuth of the object to be identified (Col. 10 lines 5+: “speakers’ head axis of symmetry is constrained to be within a small angle of the vertical” (determining an “angle” (azimuth) of a “speakers” “lips” object while he is speaking, see Fig. 3, 6, 9)); 
and determining a position of a face region of the object to be identified in each frame of image in the sequence of images according to the positioned azimuth of the object to be identified; and generating the sequence of face images by cropping an image of the face region of the object to be identified from each frame of the images (Col. 9 lines 47+: “The pixels belonging to ROI” (i.e., “region of interest”; see Fig. 3, e.g., mouth or lip (the object)) “may be found by defining two coordinate systems (x,y) and (x’,y’)” (the “(x,y)” position of pixels of the object is, according to the equation in Col. 9 lines 55+, determined in terms of “θ” or the “angle” (azimuth); furthermore as Figs. 3 
It would therefore have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the mathematical methods using the pixel analysis pertaining to lip and mouth positions in Prasad et al. into the lip image analysis and processing of Vartanian et al., as combined with Bailey et al., because doing so would enable the combined systems and their associated methods to perform in combination as they do separately and would further enable Vartanian et al. in view of Bailey et al. to benefit from a more “effective speech” “recognition” “by using the five points shown in Fig. 9” (i.e., based on the pixel analysis of the face region using the formalism thereon) as disclosed in Prasad et al. Col. 10 lines 47+.

Regarding claim 2, Vartanian et al. do teach the lip-language identification method according to claim 1, wherein the performing lip-language identification based on the sequence of face images, so as to determine the semantic information of the speech content of the object to be identified corresponding to the lip actions in the face image, comprises:
sending the sequence of face images to a server, and performing, by the server, the lip-language identification so as to determine the semantic information of the speech content of the object to be identified corresponding to the lip actions in the face 

Regarding claim 3, Vartanian et al. do teach the lip-language identification method according to claim 2, further comprising: 
receiving semantic information sent by the server, 
Vartanian et al. do not specifically disclose:
receiving semantic information, in prior to the outputting the semantic information.
Bailey et al. do teach:

For obviousness to combine Vartanian et al. and Bailey et al. see claim 1.

Regarding claim 4, Vartanian et al. do teach the lip-language identification method according to claim 1,
 wherein the semantic information is semantic text information (¶ 0054 last 7 lines: “Images captured by camera” (the sequence of facial images) “processed” “to determine user input” “object device 100 may use lip or tongue  movement” “for inputting text” (semantic information is textual) “to interpret spoken language”)
and/or semantic audio information.

Regarding claim 5, Vartanian et al. do not specifically disclose the lip-language identification method according to claim 4,  wherein outputting the semantic information comprises:
displaying the semantic text information within a visual field of a user wearing an augmented reality device ; 

Bailey et al. do teach:
displaying the semantic text information within a visual field of a user wearing an augmented reality device (¶ 0111 last sentence: “the destination language text” (the semantic text information) “can be displayed” (displayed) “on viewing surface 148 of prism 144 and superimposed on the user's field of view” (within user’s visual field) “thereby achieving augmented reality” (in an augmented reality device) “functionality” (see Fig. 9 the texts associated with each user); as Fig. 1 top left shows the device is wearable like a glass). 
For obviousness to combine Vartanian et al. and Bailey et al. see claim 1.

Regarding claim 7, Vartanian et al. in view of Bailey et al. do not specifically disclose the lip-language identification method according to claim 1, wherein positioning the azimuth of the object to be identified, comprises:
positioning the azimuth of the object to be identified according to a voice signal emitted when the object to be identified is speaking.
Prasad et al. do teach:
positioning the azimuth of the object to be identified according to a voice signal emitted when the object to be identified is speaking (Col. 10 lines 5+: “speakers’ head 
For obviousness to combine Vartanian et al. in view of Bailey et al. and Prasad et al. see claim 1.


Regarding claim 8, Vartanian et al. do teach the lip-language identification method according to claim 2, further comprising saving the sequence of face images, after acquiring the sequence of face images for the object to be identified (¶ 0042 last 3 lines: “The other user” “information” (e.g., his sequence of images) “may be stored” (saved) “and accessed on storage device 110”; furthermore, ¶ 0054 lines 4+: “read lip” by “Lip, mouth, or tongue movement” when “silently speaking” “Images captured by camera” (i.e., they are saved because) “processed by” (they are processed later by a) “software” “to determine user input” (by a program; i.e., speech here is recognized not by audio but by processing a sequence of images, which requires saving them to enable “read[ing]” “lips” by later processing)).

Regarding claim 11, Vartanian et al. do not specifically disclose the lip-language identification apparatus according to claim 10, further comprising:
an output unit, configured to output semantic information.
Bailey et al. do teach:

An output unit, configured to output semantic information (Abstract lines 6+: “the commencement of lip movement by one of the potential speakers and reception of the utterance” “The utterance can be converted to text” (semantic information associated also with lip movement determined) “converted text can then be displayed to the user” (“text” (semantic information) is outputted) “in an augmented reality environment” (to the augmented reality device (output unit))).
It would therefore have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the methods associated with perceptions in the augmented device of Bailey et al. into the corresponding ones associated with the augmented object device of Vartanian et al., because doing so would enable the combined systems and their associated methods to perform in combination as they do separately and would further enable a user wearing the augmented device in Vartanian et al. to determine “which of” “potential speakers the converted text should be attributed” when there is a plurality of users in the field of view of the user wearing the augmented device, as disclosed in the last sentence of the Bailey et al. abstract.

Regarding claim 15, Vartanian et al. do teach a lip-language identification apparatus, comprising:
a processor (¶ 0059 line 9: “processor”); 

a machine-readable storage medium, storing instructions that are executed by the processor (¶ 0059 line 5+: “The methods, processes, or flow charts provided herein may be implemented” “in a computer-readable storage medium for execution by a general purpose computer or a processor”) ;
for performing the lip-language identification method according to claim 1 (it is rejected under similar rationale as claim 1).

Regarding claim 17, Vartanian et al. do teach the augmented reality device according to claim 16, further comprising a camera device, a display device or a play device (¶ 0054 line 3 “camera in I/O devices 118” (a camera or display device); ¶ 0051 lines 1-3: “For” “augmented audio” “augmented reality” (a play device) “used to play a song associated with another user”);
wherein the camera device is configured to capture an image of the object to be identified (¶ 0054 lines 5+: “Lip, mouth, or tongue movement may be detected when the user is speaking with sound or silently speaking without sound” “Images captured by camera in I/O devices” (acquiring a sequence of face images of user (object) being detected by the “device 100” camera)).
Vartanian et al. do not specifically disclose:
the display device is configured to display semantic information; and

Bailey et al. do teach:
the display device is configured to display the semantic text information (¶ 0111 last sentence: “the destination language text” (the semantic text information) “can be displayed” (displayed) “on viewing surface 148 of prism 144 and superimposed on the user's field of view” (within user’s visual field) “thereby achieving augmented reality” (in an augmented reality device) “functionality” (see Fig. 9 the texts associated with each user));
and
the play subunit is configured to play the semantic information (¶ 0088 lines 1+: “the destination language text” (the semantic information) “can be converted to an audio signal” (converted to audio) “and output to the user via a speaker” (and played)).
It would therefore have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the methods associated with perceptions in the augmented device of Bailey et al. into the corresponding ones associated with the augmented object device of Vartanian et al., because doing so would enable the combined systems and their associated methods to perform in combination as they do separately and would further enable a user wearing the augmented device in Vartanian et al. to determine “which of” “potential speakers the converted text should be attributed” using display associated with the “text” and its associated converted audio 

Regarding claim 18, Vartanian et al. do teach a lip-language identification method (¶ 0054 lines 3-7: “detect” “lip, mouth” “movement” “for speech recognition” and used by  “device 100” to “automatically augment” according to ¶ 0045 lines 4-5), 
comprising:
receiving a sequence of face images for an object to be identified sent by an augmented reality device (¶ 0054 lines 5+: “Lip, mouth, or tongue movement may be detected when the user is speaking with sound or silently speaking without sound” “Images captured by camera in I/O devices” (acquiring a sequence of face images of user (object) being detected by the “device 100” camera), this corresponds to step “502” “Detect other user or object” in Fig. 5 which according to ¶ 0011 is associated with “an augmented reality environment” (received from an augmented reality device));
determining semantic information of speech content of the object to be identified corresponding to lip actions in the face images, by  performing lip-language identification based on the sequence of face images (¶ 0054 last 7 lines: “Images captured by camera” (the sequence of facial images) “processed” “to determine user input” “object device 100 may use lip or tongue  movement” “for inputting text” “to assist with an existing speech or voice recognition system to interpret spoken language” 
wherein receiving the sequence of face images for the object to be identified, comprises:
acquiring a sequence of images including the object to be identified (¶ 0054 lines 5+: “Lip, mouth, or tongue movement may be detected when the user is speaking with sound or silently speaking without sound” “Images captured by camera” (acquiring a sequence of images of the face of the “user” “lip, mouth” (object to be identified))).
Vartanian et al. do not specifically disclose:
Sending the semantic information to the augmented reality device.
Bailey et al. do teach:
Sending the semantic information to the augmented reality device (Abstract lines 6+: “the commencement of lip movement by one of the potential speakers and reception of the utterance” “The utterance can be converted to text” (semantic information associated also with lip movement determined) “converted text can then be displayed to the user” (“text” (semantic information) is outputted) “in an augmented reality environment” (to the augmented reality device)).
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the methods associated with perceptions in the augmented device of Bailey et al. into the corresponding ones 
Vartanian et al. in view of Bailey et al. do not specifically disclose:
positioning the object to be identified and acquiring the azimuth of the object to be identified; and
determining a position of a face region of the object to be identified in each frame of image in the sequence of images according to the positioned azimuth of the object to be identified; and generating the sequence of face images by cropping an image of the face region of the object to be identified from each frame of the images.
Prasad et al. do teach:
positioning the object to be identified and acquiring the azimuth of the object to be identified (Col. 10 lines 5+: “speakers’ head axis of symmetry is constrained to be within a small angle of the vertical” (determining an “angle” (azimuth) of a “speakers” “lips” object while he is speaking, see Fig. 3, 6, 9)); 
and determining a position of a face region of the object to be identified in each frame of image in the sequence of images according to the positioned azimuth of the object to be identified; and generating the sequence of face images by cropping an image of the face region of the object to be identified from each frame of the images (Col. 9 lines 47+: “The pixels belonging to ROI” (i.e., “region of interest”; see Fig. 3, e.g., mouth or lip (the object)) “may be found by defining two coordinate systems (x,y) and (x’,y’)” (the “(x,y)” position of pixels of the object is, according to the equation in Col. 9 lines 55+, determined in terms of “θ” or the “angle” (azimuth); furthermore, as Figs. 3 and/or 4, 6, 9 show, this corresponds to cropping an image of the face region (object) of the speaker to be identified in a given frame)).
It would therefore have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the mathematical methods using the pixel analysis pertaining to lip and mouth positions in Prasad et al. into the lip image analysis and processing of Vartanian et al., as combined with Bailey et al., because doing so would enable the combined systems and their associated methods to perform in combination as they do separately and would further enable Vartanian et al. in view of Bailey et al. to benefit from a more “effective speech” “recognition” “by using the five points shown in Fig. 9” (i.e., based on the pixel analysis of the face region using the formalism thereon) as disclosed in Prasad et al. Col. 10 lines 47+.

Regarding claim 19, Vartanian et al. in view of Bailey et al. do teach a storage medium that stores non-transitorily computer readable instructions that, when executed by a computer, the computer executes instructions for the lip-language identification method according to claim 1 (Vartanian et al.: ¶ 0059 line 5+: “The methods, processes, or flow charts provided herein may be implemented” “in a computer-readable storage medium for execution by a general purpose computer or a processor”, and rejected under similar rationale as claim 1).

Regarding claim 20, Vartanian et al. in view of Bailey et al. do teach a storage medium that stores non-transitorily computer readable instructions that, when executed by a computer, the computer executes instructions for the lip-language identification method according to claim 18 (Vartanian et al.: ¶ 0059 line 5+: “The methods, processes, or flow charts provided herein may be implemented” “in a computer-readable storage medium for execution by a general purpose computer or a processor”, and rejected under similar rationale as claim 18).

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Vartanian et al. in view of Bailey et al. and Prasad et al., and further in view of YOSHIGAHARA et al. (US 2016/0078318).

sending the saved sequence of face images to the server upon receiving a sending instruction.
YOSHIGAHARA et al. do teach:
sending the saved sequence of face images to the server upon receiving a sending instruction (¶ 0084 lines 4+: “Transmission of the input image” (sending an image) “in response to a user instruction” (by a sending instruction) so “an object displayed on the screen be identified or tracked” “when input image is transmitted in response” “a feature dictionary is provided from the dictionary server” (to a server), where the “instruction” “is from a user via the input unit 106” according to ¶ 0080; ¶ 0005 lines 4-6: “One application of such object identification is an augmented reality (AR) application”).
It would therefore have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the “user instruction” capability of YOSHIGAHARA et al. in his “augmented reality” device into the “augmented reality” device of Vartanian et al., as combined with Bailey et al. and Prasad et al., because doing so would enable the combined systems and their associated methods to perform in combination as they do separately and would further enable a user of Vartanian et al. in view .


Claims 12-13 are rejected under 35 U.S.C. 103 as being unpatentable over Vartanian et al. in view of Bailey et al. and Prasad et al., and further in view of Shpigelman (US 2015/0302651).
Regarding claim 12, Vartanian et al. in view of Bailey et al. do not specifically disclose the lip-language identification apparatus according to claim 11, wherein the output unit comprises:
an output mode instruction generation subunit, configured to generate a display mode instruction, wherein the output mode instruction includes a display mode instruction and an audio mode instruction.
Shpigelman does teach:
an output mode instruction generation subunit, configured to generate a display mode instruction, wherein the output mode instruction includes a display mode instruction and an audio mode instruction (¶ 0042 last 4 lines and ¶ 0043 lines 1-4: “the file or streamable content represents” “visual scenes” “as well as an audio stream, for playback on the VR/AR headset” (generating visual scenes and audio playback by an augmented reality headset) “These steps may be followed by user selection of a “PLAY” 
It would therefore have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the “PLAY” “button” functionality of the augmented reality (“AR”) device of Shpigelman into the augmented reality devices of Vartanian et al. in view of Bailey et al. and Prasad et al., because doing so would enable the combined systems and their associated methods to perform in combination as they do separately and would further enable Vartanian et al. in view of Bailey et al. and Prasad et al. to give a user of their augmented reality device the freedom to choose whether to utilize the augmented reality’s supplemental contributions for a given scene, thereby reducing the “AR” device’s resource utilization.

Regarding claim 13, Vartanian et al. in view of Bailey et al. and Prasad et al. do teach the lip-language identification apparatus according to claim 12, wherein the semantic information is semantic text information and/or semantic audio information, and the output unit further comprises:
a display subunit, configured to display the semantic text information within a visual field of a user wearing an augmented reality device 
and
a play subunit, configured to play the semantic audio information 
Vartanian et al. in view of Bailey et al. and Prasad et al. do not specifically disclose:
Displaying upon receiving the display mode instruction, and play upon receiving the audio mode instruction.
Shpigelman does teach:
Displaying upon receiving the display mode instruction, and play upon receiving the audio mode instruction (¶ 0042 last 4 lines and ¶ 0043 lines 1-4: “the file or streamable content represents” “visual scenes” “as well as an audio stream, for playback on the VR/AR headset” “These steps may be followed by user selection of a 
For obviousness to combine Vartanian et al. in view of Bailey et al. and Prasad et al. and Shpigelman see claim 12.

Claims 10 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Vartanian et al. and further in view of Prasad et al.
Regarding claim 10, Vartanian et al. do teach a lip-language identification apparatus (¶ 0054 lines 3-7: “detect” “lip, mouth” “movement” “for speech recognition” and used by “device 100” to “automatically augment” according to ¶ 0045 lines 4-5),
comprising:
a face image sequence acquiring unit, configured to acquire a sequence of face images for an object to be identified (¶ 0054 lines 5+: “Lip, mouth, or tongue movement may be detected when the user is speaking with sound or silently speaking without sound” “Images captured by camera in I/O devices” (acquiring a sequence of face images of the user (object) being detected by the “device 100” “camera” (face image sequence acquiring unit)));
a sending unit, configured to send the sequence of face images to a server (¶ 0046 lines 5+: “using well-known techniques for text recognition, character recognition, image recognition” (e.g. recognition of images comprising of lip movements above) 
wherein the server determines semantic information corresponding to lip actions in the face images by performing lip-language identification (¶ 0054 last 7 lines: “Images captured by camera” (the sequence of facial images) “processed” (i.e., by the “server” (¶ 0046 lines 5+)) “to determine user input” “object device 100” (or “server” (¶ 0046 lines 5+)) “may use lip or tongue movement” “for inputting text” “to assist with an existing speech or voice recognition system to interpret spoken language” (to determine, i.e., the “text” (semantic information) of the “input” (speech content) corresponding to the lip or mouth movements of the “user” (object) being identified)); and
a receiving unit, configured to receive semantic information from the server (¶ 0046 lines 7-8: “speech” “or” “voice recognition processed” “remotely by server” (determining “text” (semantic information) by the server, which is then sent back to the requesting party, namely the “device 100”, i.e., ¶ 0042 lines 1-3: “The other user” “information” (e.g., his “text” (semantic information)) “may be received” (received) “wirelessly using one or more network adapters 128 in a message from a server” (from the server), and this is received in the “object device 100” (a receiving unit))),
wherein the face image sequence acquiring unit comprises:
an image sequence acquiring subunit, configured to acquire a sequence of images for an object to be identified (¶ 0054 lines 5+: “Lip, mouth, or tongue movement may be detected when the user is speaking with sound or silently speaking without sound” “Images captured by camera” (acquiring a sequence of images of the face of the “user” “lip, mouth” (object to be identified))).
Vartanian et al. do not specifically disclose:
A positioning subunit, configured to position an azimuth of the object to be identified; and
A face image sequence generation subunit, configured to determine a position of a face region of the object to be identified in each frame of image in the sequence of images according to the positioned azimuth of the object to be identified; and crop an image of the face region of the object to be identified from each frame image so as to generate the sequence of face images.
Prasad et al. do teach:
A positioning subunit, configured to position an azimuth of the object to be identified (Col. 10 lines 5+: “speakers’ head axis of symmetry is constrained to be within a small angle of the vertical” (determining an “angle” (azimuth) of a “speakers” “lips” object while he is speaking, see Figs. 3, 6, 9));
and a face image sequence generation subunit, configured to determine a position of a face region of the object to be identified in each frame of image in the sequence of images according to the positioned azimuth of the object to be identified; and crop an image of the face region of the object to be identified from each frame image so as to generate the sequence of face images (Col. 9 lines 47+: “The pixels belonging to ROI” (i.e., “region of interest”, see Fig. 3, e.g., mouth or lip (the object)) “may be found by defining two coordinate systems (x,y) and (x’,y’)” (the “(x,y)” positions of the pixels of the object are, according to the equation in Col. 9 lines 55+, determined in terms of “θ” or the “angle” (azimuth), and each “θ” is associated with a given frame, and therefore this angle is a parameter defining a sequence of images; furthermore, as Figs. 3 and/or 4, 6, 9 show, this corresponds to cropping an image of the face region (object) of the speaker to be identified in a given frame)).
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the mathematical methods using the pixel analysis pertaining to lip and mouth positions in Prasad et al. into the lip image analysis and processing of Vartanian et al., which would enable the combined systems and their associated methods to perform in combination as they do separately, and would further enable Vartanian et al. to benefit from a more “effective speech” “recognition” “by using the five points shown in Fig. 9” (i.e., based on the pixel analysis of the face region using the formalism thereon) as disclosed in Prasad et al. Col. 10 lines 47+.
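
For illustration only, and not as part of the cited disclosure, the angle-dependent selection of region-of-interest pixels and the per-frame cropping discussed above may be sketched as follows. The sketch assumes a standard two-dimensional rotation between the (x,y) and (x’,y’) coordinate systems, which may differ from the exact equation in Prasad et al. Col. 9 lines 55+. The function names `crop_rotated_roi` and `crop_sequence`, the parameters `center`, `half_width` and `half_height`, and the use of NumPy are hypothetical choices made here and do not appear in any cited reference.

```python
import numpy as np


def crop_rotated_roi(frame, center, theta, half_width, half_height):
    """Select ROI pixels in a coordinate system tilted by theta, then crop.

    Hypothetical helper: `center` is an assumed (x, y) pixel position of the
    region, and `theta` is the assumed tilt angle in radians.
    """
    rows, cols = frame.shape[:2]
    ys, xs = np.mgrid[0:rows, 0:cols]

    # Offsets of every pixel from the assumed ROI center, in (x, y).
    dx = xs - center[0]
    dy = ys - center[1]

    # Assumed standard rotation from (x, y) into the tilted (x', y') system.
    xp = dx * np.cos(theta) + dy * np.sin(theta)
    yp = -dx * np.sin(theta) + dy * np.cos(theta)

    # A pixel belongs to the ROI if it lies inside the tilted rectangle.
    mask = (np.abs(xp) <= half_width) & (np.abs(yp) <= half_height)
    if not mask.any():
        raise ValueError("ROI does not overlap the frame")

    # Crop the smallest axis-aligned box that contains the ROI pixels.
    r = np.where(mask.any(axis=1))[0]
    c = np.where(mask.any(axis=0))[0]
    crop = frame[r[0]:r[-1] + 1, c[0]:c[-1] + 1]
    return mask, crop


def crop_sequence(frames, centers, thetas, half_width, half_height):
    """Apply the per-frame crop, one angle per frame, to build a sequence."""
    return [
        crop_rotated_roi(f, c, t, half_width, half_height)[1]
        for f, c, t in zip(frames, centers, thetas)
    ]
```

In this sketch each frame carries its own angle, so the output list plays the role of the cropped image sequence; how the angle and the center are actually obtained is outside the sketch.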



Regarding claim 16, Vartanian et al. in view of Prasad et al. do teach an augmented reality device (title and abstract: e.g. Abstract lines 1-2: “providing augmented or mixed reality environments based on other user or third party information”),
comprising the lip-language identification apparatus according to claim 10 (it is rejected under the same rationale as claim 10).


Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the 


Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARZAD KAZEMINEZHAD whose telephone number is (571)270-5860. The examiner can normally be reached 10:30 am to 11:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, DANIEL C WASHBURN, can be reached on (571)272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about 





/Farzad Kazeminezhad/
Art Unit 2657
February 16th 2022.