DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant's arguments filed 9/22/2021 have been fully considered but they are not persuasive.
Applicant argues that the combination of Li and Olszewski does not explicitly teach generating a first dataset comprising the first left image, the first right image and a first representation of a gaze-related parameter, the first representation being correlated with the first stimulus.
In response, the examiner respectfully disagrees.  Li teaches FIG. 2 is a functional diagram of a training system 200 for high-fidelity facial and speech animation for VR and AR head mounted displays. The training system 200 includes reference animation 201, reference data 202, training data 203, DTW alignment 204, viseme dataset 205, a convolutional neural network 206, and mouth and eye FACS datasets 207. The results of the training generates a mouth regression model 208 and an eye regression model 209.
The training functions occur before the system 300 (FIG. 3) operates to create animations in real-time. First, a set of synchronized audio and video recordings for a series of individuals reciting a list of predetermined training sentences is captured as training data 203. The predetermined training sentences are chosen to vary the approach and retreat from various phonemes to provide a broad and varied dataset of phoneme and visual correspondence in the training data 203.
Using the recorded audio and video recordings and the FACS expressions for a first performance, professional animators create a reference animation/facial position for each frame of the audio and video recordings as a reference animation 201. This animation takes time and is manually completed to ensure that it is of sufficient quality to accurately represent the facial features. This first reference animation 201 and training data 203 combination may be stored as a part of the reference data 202.
Once the reference animation 201 is complete, then other training data 203 for other subjects may be more-quickly added to the reference data 202 by exploiting the use of the same set of predetermined training sentences and FACS facial expressions that were used for the reference animation 201 and that were spoken and performed by each user.
The audio component of the training data 203 may be used to synchronize the reference data 202's professionally-generated animations to subsequent performances. Specifically, dynamic time warping (DTW) alignment 204 reliant upon the audio and/or video recordings may be used to fairly-precisely map different portions of the overall animation to portions of each performance with limited manual input. This process is dramatically faster than performing complete human-aided animations of each new 
The resulting animation, when combined with visual and audio reference data, the reference animation and the associated training data is used to create a viseme dataset 205. A “viseme” as used herein means the visual facial expressions of one pronouncing a given phoneme.
Next, the viseme dataset 205 can be provided to a convolutional neural network 206 along with mouth and eye FACS datasets 207 (which may be stored separately from one another) to generate a mouth regression model 208 and an eye regression model 209. These models 208 and 209 may be used, following training, by the convolutional neural network(s) to derive a probable animated character for a given set of input visual data from an eye image and/or mouth image that are a part of an overall video stream of a wearer of a head mounted display. In particular, the animation created by professional animators is used as “ground truth” for training the convolutional neural network(s) based upon the dynamically time scaled video data to teach the networks animations that should result from image data. Associated blendshape weightings for various animation parameters (for both eye and mouth) may also be derived from this training process. Over the course of multiple individuals and the same phonemes, the resulting training is quite robust.  [0039] – [0045].
Applicant argues that the combination of Li and Olszewski does not explicitly teach establishing a data connection between the head-wearable device and the database.

Applicant’s argument that there is a step of manual labor between the step of data recording and the step of data transfer in Li is unpersuasive.  The claim uses the transitional phrase “comprising,” which is open-ended and does not exclude additional, unrecited elements or method steps.  See MPEP 2111.03.  Thus, even though Li includes a step of manual labor between the step of data recording and the step of data transfer, Li still teaches the limitation.  Furthermore, the claim does not recite that the steps are carried out automatically.

Applicant argues that the combination of Li, Olszewski, and Furuta does not explicitly teach the first stimulus instructing the user to gaze at a respective given object defining a respective given gaze direction relative to a co-ordinate system fixed with the respective head-wearable device.
In response, the examiner respectfully disagrees.  Furuta teaches the image generating unit 15 generates an image on which the line-of-sight of the user object 21A controlled by the line-of-sight control unit 14 is reflected (an image displayed in each HMD 1).
Applicant argues that the combination of Li, Olszewski, and Furuta does not explicitly teach the respective corresponding gaze-related parameter being selected from a list consisting of: a gaze direction, a cyclopean gaze direction, a 3D gaze point, a 2D gaze point, a visual axis orientation, an optical axis orientation, a pupil axis orientation, a line of sight orientation and an eye vergence.
In response, the examiner respectfully disagrees.  Furuta teaches the image generating unit 15 generates an image on which the line-of-sight of the user object 21A controlled by the line-of-sight control unit 14 is reflected (an image displayed in each HMD 1). For example, the image generating unit 15 generates an image that is adjusted such that the line-of-sight (the direction of the line-of-sight and/or the viewpoint) of the user object 21A controlled by the line-of-sight control unit 14 coincides with a line-of-sight (the direction of a line-of-sight and/or a viewpoint) identified in accordance with the direction of the face, the eyes (the positions of irises), and the like of the user object.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 31-44, 47-48, and 52 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (US 2017/0243387 A1) in view of Olszewski, Kyle, et al. "High-fidelity facial and speech animation for VR HMDs." ACM Transactions on Graphics (TOG) 35.6 (2016): 1-14 (hereinafter “Olszewski”) and Furuta et al. (US 2020/0335065 A1).
Consider claim 31, Li teaches a method for creating and updating a database for training a neural network (FIG. 2 is a functional diagram of a training system 200 for high-fidelity facial and speech animation for VR and AR head mounted displays. The training system 200 includes reference animation 201, reference data 202, training data 203, DTW alignment 204, viseme dataset 205, a convolutional neural network 206, and mouth and eye FACS datasets 207. The results of the training generates a mouth regression model 208 and an eye regression model 209. [0039] and Fig. 2.  See also Fig. 5) the method comprising: presenting a first stimulus to a first user wearing a head-wearable device (The training functions occur before the system 300 (FIG. 3) operates to create animations in real-time. First, a set of synchronized audio and video recordings for a series of individuals reciting a list of predetermined training sentences is captured as training data 203. The predetermined training sentences are chosen to vary the approach and retreat from various phonemes to provide a broad and varied dataset of phoneme and visual correspondence in the training data 203. Likewise, more commonly-used phonemes are chosen for repetition to ensure more accuracy. The same set of individuals are also asked to perform a series of facial actions corresponding to the facial action coding system (FACS) for two iterations of each expression. Using this, the rest of the training system may operate [0040].  See also Fig. 
5), the head-wearable device comprising a first camera arranged next to a left eye of the first user and a second camera arranged next to a right eye of the first user when the first user is wearing the head-wearable device (The IR light and camera 112 may be mounted within a headset, such as the VR/ AR HMD 120 such that a wearer's eye movements, iris position, and other image ; using the first camera of the head-wearable device to generate, when the first user is expected to respond to the first stimulus or expected to have responded to the first stimulus, a first left image of at least a portion of the left eye of the first user, and using, when the first user is expected to respond to the first stimulus or expected to have responded to the first stimulus, the second camera of the head-wearable device to generate a first right image of at least a portion of the right eye of the first user (The training functions occur before the system 300 (FIG. 3) operates to create animations in real-time. First, a set of synchronized audio and video recordings for a series of individuals reciting a list of predetermined training sentences is captured as training data 203. The predetermined training sentences are chosen to vary the approach and retreat from various phonemes to provide a broad and varied dataset of phoneme and visual correspondence in the training data 203. Likewise, more commonly-used phonemes are chosen for repetition to ensure more accuracy. The ; establishing a data connection between the head-wearable device and the database (the training system includes reference animation, reference data, training data, etc.  [0039].  The first reference animation and training data combination may be stored as part of the reference data.  [0041].  The resulting animation, when combined with visual and audio reference data, the reference animation and the associated training data is used to create a viseme dataset.  [0044].  
The data captured is used to train the neural network.  Thus, the data is stored in a database and there is a connection between the database and the head-wearable device.  See also [0039] – [0045] and Fig. 5); generating a first dataset comprising the first left image, the first right image and a first representation of a gaze-related parameter, the first representation being correlated with the first stimulus (Using the recorded audio and video recordings and the FACS expressions for a first performance, professional animators create a reference animation/facial position for each frame of the audio and video recordings as a reference animation 201. This animation takes time and is manually completed to ensure that it is of sufficient quality to accurately represent the facial features. This first reference animation 201 and training data 203 combination may be stored as a part of the reference data 202. [0039] – [0045].  See also Fig. 5); and adding the first dataset to the database (The resulting animation, when combined with visual and audio reference data, the reference animation and the associated training data is used to create a viseme dataset 205. A "viseme" as used 
	However, Li does not explicitly teach using two cameras to generate a first left image of at least a portion of the left eye of the first user and a first right image of at least a portion of the right eye of the first user.
	Olszewski teaches using two cameras to generate a first left image of at least a portion of the left eye of the first user and a first right image of at least a portion of the right eye of the first user (Our system is based on a prototype of the FOVE VR HMD, with integrated eye tracking cameras and our custom mounted camera for mouth tracking. The HMD contains infrared (IR) cameras directed at the user’s eyes and 6 IR LEDs (940nm wavelength) surrounding each eye, allowing the cameras to observe the user’s eyes despite the occlusion from ambient illumination.  Section 3).
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the known elements of two cameras because such incorporation would allow for tracking movements such as blinks as well as movements of the region surrounding the eye.  Olszewski, Section 3.
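However, the combination of Li and Olszewski does not explicitly teach the first stimulus instructing the user to gaze at a respective given object defining a respective given gaze direction relative to a co-ordinate system fixed with the respective head-wearable device.
Furuta teaches the first stimulus instructing the user to gaze at a respective given object defining a respective given gaze direction relative to a co-ordinate system fixed with the respective head-wearable device ([0047] – [0050] and [0081] – [0088] of Furuta).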
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the known technique of causing the respective user to gaze at a respective given object because such incorporation would improve communication among users in the virtual space through the user object such that the communication is performed more smoothly.  [0008].
	Consider claim 32, Li teaches presenting a second stimulus to the first user wearing the head-wearable device (The training functions occur before the system 300 (FIG. 3) operates to create animations in real-time. First, a set of synchronized audio and video recordings for a series of individuals reciting a list of predetermined training sentences is captured as training data 203. The predetermined training sentences are chosen to vary the approach and retreat from various phonemes to provide a broad and varied dataset of phoneme and visual correspondence in the training data 203. Likewise, more commonly-used phonemes are chosen for repetition ; using the first camera of the head-wearable device to generate, when the first user is expected to respond to the second stimulus or expected to have responded to the second stimulus, a second left image of at least a portion of the left eye of the first user, and using, when the first user is expected to respond to the second stimulus or expected to have responded to the second stimulus, the second camera of the head-wearable device to generate a second right image of at least a portion of the right eye of the first user (The training functions occur before the system 300 (FIG. 3) operates to create animations in real-time. First, a set of synchronized audio and video recordings for a series of individuals reciting a list of predetermined training sentences is captured as training data 203. The predetermined training sentences are chosen to vary the approach and retreat from various phonemes to provide a broad and varied dataset of phoneme and visual correspondence in the training data 203. Likewise, more commonly-used phonemes are chosen for repetition to ensure more accuracy. The same set of individuals are also asked to perform a series of facial actions corresponding to the facial action coding system (FACS) for two iterations of each expression. Using this, the rest of the training system may operate [0040].  
See also Fig. 5.  Fig. 5 shows a loop that goes back to the beginning of the process); generating a second dataset comprising the second left image, the second right image and a second representation of the gaze-related parameter, the second representation being correlated with the second stimulus (Using the recorded audio and video recordings and the FACS expressions for a first performance, professional animators create a reference animation/facial position for each frame of the audio and video recordings as a reference animation 201. This animation takes time and is manually completed to ensure that it is of sufficient quality to accurately represent the facial features. This first reference animation 201 and training data 203 combination may be stored as a part of the reference data 202. [0039] – [0045].  See also Fig. 5); and adding the second dataset to the database (The resulting animation, when combined with visual and audio reference data, the reference animation and the associated training data is used to create a viseme dataset 205. A "viseme" as used herein means the visual facial expressions of one pronouncing a given phoneme. Next, the viseme dataset 205 can be provided to a convolutional neural network 206 along with mouth and eye FACS datasets 207 (which may be stored separately from one another) to generate a mouth regression model 208 and an eye regression model 209. These models 208 and 209 may be used, following training, by the convolutional neural network(s) to derive a probable animated character for a given set of input visual data from an eye image and/or mouth image that are a part of an overall video stream of a wearer of a head mounted display. In particular, the animation created by professional animators is used as "ground truth" for training the convolutional neural network(s) based upon the dynamically time scaled video data to teach the networks animations that should result from image data. 
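Associated blend shape weightings for various animation parameters (for both eye and mouth) may also be derived from this training process. Over the course of multiple individuals and the same phonemes, the resulting training is quite robust.  [0039] – [0045].  See also Fig. 5).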
	Olszewski teaches using two cameras to generate a first left image of at least a portion of the left eye of the first user and a first right image of at least a portion of the right eye of the first user (Our system is based on a prototype of the FOVE VR HMD, with integrated eye tracking cameras and our custom mounted camera for mouth tracking. The HMD contains infrared (IR) cameras directed at the user’s eyes and 6 IR LEDs (940nm wavelength) surrounding each eye, allowing the cameras to observe the user’s eyes despite the occlusion from ambient illumination.  Section 3).
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the known elements of two cameras because such incorporation would allow for tracking movements such as blinks as well as movements of the region surrounding the eye.  Olszewski, Section 3.
	Consider claim 33, Li teaches presenting a third stimulus to a second user wearing the head-wearable device (The training functions occur before the system 300 (FIG. 3) operates to create animations in real-time. First, a set of synchronized audio and video recordings for a series of individuals reciting a list of predetermined training sentences is captured as training data 203. The predetermined training sentences are chosen to vary the approach and retreat from various phonemes to provide a broad and varied dataset of phoneme and visual correspondence in the training data 203. Likewise, more commonly-used phonemes are chosen for repetition to ensure more accuracy. The same set of individuals are also asked to perform a series of facial actions corresponding to the facial action coding system (FACS) for two ; using the first camera of the head-wearable device to generate, when the second user is expected to respond to the third stimulus or expected to have responded to the third stimulus, a third left image of at least a portion of a left eye of the second user, and using, when the second user is expected to respond to the third stimulus or expected to have responded to the third stimulus, the second camera of the head-wearable device to generate a third right image of at least a portion of a right eye of the second user (The training functions occur before the system 300 (FIG. 3) operates to create animations in real-time. First, a set of synchronized audio and video recordings for a series of individuals reciting a list of predetermined training sentences is captured as training data 203. The predetermined training sentences are chosen to vary the approach and retreat from various phonemes to provide a broad and varied dataset of phoneme and visual correspondence in the training data 203. Likewise, more commonly-used phonemes are chosen for repetition to ensure more accuracy. 
The same set of individuals are also asked to perform a series of facial actions corresponding to the facial action coding system (FACS) for two iterations of each expression. Using this, the rest of the training system may operate [0040].  See also Fig. 5.  Fig. 5 shows a loop that goes back to the beginning of the process); generating a third dataset comprising the third left image, the third right image and a third representation of the gaze-related parameter, the third representation being correlated with the third stimulus (Using the recorded audio and video recordings and the FACS expressions for a first performance, professional ; and adding the third dataset to the databases (The resulting animation, when combined with visual and audio reference data, the reference animation and the associated training data is used to create a viseme dataset 205. A "viseme" as used herein means the visual facial expressions of one pronouncing a given phoneme. Next, the viseme dataset 205 can be provided to a convolutional neural network 206 along with mouth and eye FACS datasets 207 (which may be stored separately from one another) to generate a mouth regression model 208 and an eye regression model 209. These models 208 and 209 may be used, following training, by the convolutional neural network(s) to derive a probable animated character for a given set of input visual data from an eye image and/or mouth image that are a part of an overall video stream of a wearer of a head mounted display. In particular, the animation created by professional animators is used as "ground truth" for training the convolutional neural network(s) based upon the dynamically time scaled video data to teach the networks animations that should result from image data. Associated blend shape weightings for various animation parameters (for both eye and mouth) may also be derived from this training process. 
Over the course of multiple individuals and the same phonemes, the resulting training is quite robust.   [0039] – [0045].    See also Fig. 5).
Olszewski teaches using two cameras to generate a first left image of at least a portion of the left eye of the first user and a first right image of at least a portion of the right eye of the first user (Our system is based on a prototype of the FOVE VR HMD, with integrated eye tracking cameras and our custom mounted camera for mouth tracking. The HMD contains infrared (IR) cameras directed at the user’s eyes and 6 IR LEDs (940nm wavelength) surrounding each eye, allowing the cameras to observe the user’s eyes despite the occlusion from ambient illumination.  Section 3).
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the known elements of two cameras because such incorporation would allow for tracking movements such as blinks as well as movements of the region surrounding the eye.  Olszewski, Section 3.
Consider claim 34, Li teaches presenting a fourth stimulus to the first user or the second user wearing a further head-wearable device (The training functions occur before the system 300 (FIG. 3) operates to create animations in real-time. First, a set of synchronized audio and video recordings for a series of individuals reciting a list of predetermined training sentences is captured as training data 203. The predetermined training sentences are chosen to vary the approach and retreat from various phonemes to provide a broad and varied dataset of phoneme and visual correspondence in the training data 203. Likewise, more commonly-used phonemes are chosen for repetition to ensure more accuracy. The same set of individuals are also asked to perform a series of facial actions corresponding to the facial action coding system (FACS) for two iterations of each expression. Using this, the rest of the training system may operate [0040].  See also Fig. 5.  Fig. 5 shows a loop that goes back to the ; the further head-wearable device comprising a first camera arranged next to the left eye of the respective user and a second camera arranged next to the right eye of the respective user when the respective user is wearing the further head-wearable device (The IR light and camera 112 may be mounted within a headset, such as the VR/ AR HMD 120 such that a wearer's eye movements, iris position, and other image data related to the wearer's eye may be ascertained. The IR light may be used so as to maintain the visual darkness to the naked eye, while still enabling image processing of eye region images created by the associated IR camera to take place. The IR light and camera 112 are described as a single IR light and camera, but may be two or more IR lights and/or cameras, with at least one of each for each eye. In some VR/AR HMD's 120, the entire region of the eyes may be visible to a single camera within the VR/AR HMD 120. 
In others, individual IR lights and cameras, or multiple IR lights and/or IR cameras, may be necessary to enable adequate capture of both user's eye regions within the VR/AR HMD 120. Examples of IR images captured in the present system may be seen in FIG. 7  [0027].  See also Fig. 5); using the first camera of the further head-wearable device to generate, when the first user or the second user is expected to respond to the fourth stimulus or expected to have responded to the fourth stimulus, a fourth left image of at least a portion of the left eye of the respective user, and using, when the first user or the second user is expected to respond to the fourth stimulus or expected to have responded to the fourth stimulus, the second camera of the further head-wearable device to generate a fourth right image of at least a portion of the right eye of the respective user (The training functions occur before the system ; establishing a data connection between the further head-wearable device and the database (the training system includes reference animation, reference data, training data, etc.  [0039].  The first reference animation and training data combination may be stored as part of the reference data.  [0041].  The resulting animation, when combined with visual and audio reference data, the reference animation and the associated training data is used to create a viseme dataset.  [0044].  The data captured is used to train the neural network.  Thus, the data is stored in a database and there is a connection between the database and the head-wearable device.  See also [0039] – [0045] and Fig. 
5); generating a fourth dataset comprising the fourth left image, the fourth right image and a fourth representation of the gaze-related parameter, the fourth representation being correlated with the fourth stimulus (Using the recorded audio and video recordings and the FACS expressions for a first performance, professional animators create a reference animation/facial ; and adding the fourth dataset to the database (The resulting animation, when combined with visual and audio reference data, the reference animation and the associated training data is used to create a viseme dataset 205. A "viseme" as used herein means the visual facial expressions of one pronouncing a given phoneme. Next, the viseme dataset 205 can be provided to a convolutional neural network 206 along with mouth and eye FACS datasets 207 (which may be stored separately from one another) to generate a mouth regression model 208 and an eye regression model 209. These models 208 and 209 may be used, following training, by the convolutional neural network(s) to derive a probable animated character for a given set of input visual data from an eye image and/or mouth image that are a part of an overall video stream of a wearer of a head mounted display. In particular, the animation created by professional animators is used as "ground truth" for training the convolutional neural network(s) based upon the dynamically time scaled video data to teach the networks animations that should result from image data. Associated blend shape weightings for various animation parameters (for both eye and mouth) may also be derived from this training process. Over the course of multiple individuals and the same phonemes, the resulting training is quite robust.   [0039] – [0045].    See also Fig. 5).
Olszewski teaches using two cameras to generate a first left image of at least a portion of the left eye of the first user and a first right image of at least a portion of the right eye of the first user (Our system is based on a prototype of the FOVE VR HMD, with integrated eye tracking cameras and our custom mounted camera for mouth tracking. The HMD contains infrared (IR) cameras directed at the user’s eyes and 6 IR LEDs (940nm wavelength) surrounding each eye, allowing the cameras to observe the user’s eyes despite the occlusion from ambient illumination.  Section 3).
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the known elements of two cameras because such incorporation would allow for tracking movements such as blinks as well as movements of the region surrounding the eye.  Olszewski, Section 3.
	Consider claim 35, Li teaches using the first camera of the respective head-wearable device to generate a further left image of at least a portion of the left eye of the respective user and using a second camera of the head-wearable device to generate a further right image of at least a portion of the right eye of the respective user (The training functions occur before the system 300 (FIG. 3) operates to create animations in real-time. First, a set of synchronized audio and video recordings for a series of individuals reciting a list of predetermined training sentences is captured as training data 203. The predetermined training sentences are chosen to vary the approach and retreat from various phonemes to provide a broad and varied dataset of phoneme and visual correspondence in the training data 203. Likewise, more commonly-used phonemes are chosen for repetition to ensure more accuracy. The same set of individuals are also asked to perform a series of facial actions corresponding to the facial action coding system (FACS) for two iterations of each expression. Using this, the rest of the training system may operate [0040].  See also ; generating a further dataset comprising the further left image and the further right image (Using the recorded audio and video recordings and the FACS expressions for a first performance, professional animators create a reference animation/facial position for each frame of the audio and video recordings as a reference animation 201. This animation takes time and is manually completed to ensure that it is of sufficient quality to accurately represent the facial features. This first reference animation 201 and training data 203 combination may be stored as a part of the reference data 202. [0039] – [0045].  See also Fig. 5.  Fig. 5 shows a loop that goes back to the beginning of the process); and adding the further dataset to the database (The resulting animation, when combined with visual and audio reference data, the reference animation and the associated training data is used to create a viseme dataset 205. A "viseme" as used herein means the visual facial expressions of one pronouncing a given phoneme. Next, the viseme dataset 205 can be provided to a convolutional neural network 206 along with mouth and eye FACS datasets 207 (which may be stored separately from one another) to generate a mouth regression model 208 and an eye regression model 209. These models 208 and 209 may be used, following training, by the convolutional neural network(s) to derive a probable animated character for a given set of input visual data from an eye image and/or mouth image that are a part of an overall video stream of a wearer of a head mounted display. In particular, the animation created by professional animators is used as "ground truth" for training the convolutional neural network(s) based upon the dynamically time scaled video data to teach the networks animations that should result from image data. Associated blend shape weightings for various
Olszewski teaches using two cameras to generate a first left image of at least a portion of the left eye of the first user and a first right image of at least a portion of the right eye of the first user (Our system is based on a prototype of the FOVE VR HMD, with integrated eye tracking cameras and our custom mounted camera for mouth tracking. The HMD contains infrared (IR) cameras directed at the user’s eyes and 6 IR LEDs (940nm wavelength) surrounding each eye, allowing the cameras to observe the user’s eyes despite the occlusion from ambient illumination.  Section 3).
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the known elements of two cameras because such incorporation would allow for tracking movements such as blinking as well as movements of the region surrounding the eye.  Section 3 of Olszewski.
Consider claim 40, Li teaches the neural network is a convolutional neural network ([0039] – [0045]).
Consider claim 41, Li teaches generating or adding the respective dataset comprises storing a respective representation of a further gaze-related parameter different to the gaze-related parameter, a respective user ID, a respective user-group ID and/or a device ID of the respective head-wearable device, and/or wherein the respective representation comprises and/or is a respective value of the respective gaze-related parameter ([0039] – [0045]).
Consider claim 42, Olszewski teaches using a right IR-light source of the respective head-wearable device to illuminate the right eye of the respective user and a left IR-light source of the respective head-wearable device to illuminate the left eye of the respective user (Our system is based on a prototype of the FOVE VR HMD, with integrated eye tracking cameras and our custom mounted camera for mouth tracking. The HMD contains infrared (IR) cameras directed at the user’s eyes and 6 IR LEDs (940nm wavelength) surrounding each eye, allowing the cameras to observe the user’s eyes despite the occlusion from ambient illumination.  Section 3).
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the known elements of two cameras because such incorporation would allow for tracking movements such as blinking as well as movements of the region surrounding the eye.  Section 3 of Olszewski.
	Consider claim 43, the combination of Li and Olszewski teaches generating or adding the respective dataset comprises concatenating the respective left image and the respective right image, and/or wherein the respective left image and the respective right image are grayscale images, and/or wherein the respective left image and the respective right image are IR images, and/or wherein a pixel number of the respective left image and/or the respective right image is at most 40000 (The IR light and camera 112 may be mounted within a headset, such as the VR/ AR HMD 120 such that a wearer's eye movements, iris position, and other image data related to the wearer's eye may be ascertained. The IR light may be used so as to maintain the visual darkness to the naked eye, while still enabling image processing of eye region images created by the associated IR camera to take place. The IR light and 
Our system is based on a prototype of the FOVE VR HMD, with integrated eye tracking cameras and our custom mounted camera for mouth tracking. The HMD contains infrared (IR) cameras directed at the user’s eyes and 6 IR LEDs (940nm wavelength) surrounding each eye, allowing the cameras to observe the user’s eyes despite the occlusion from ambient illumination.  Section 3 of Olszewski).
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the known elements of two cameras because such incorporation would allow for tracking movements such as blinking as well as movements of the region surrounding the eye.  Section 3 of Olszewski.
Consider claim 44, Li teaches a method for training a neural network (FIG. 2 is a functional diagram of a training system 200 for high-fidelity facial and speech animation for VR and AR head mounted displays. The training system 200 includes reference animation 201, reference data 202, training data 203, DTW alignment 204, viseme dataset 205, a convolutional neural network 206, and mouth and eye FACS datasets 207. The results of the training generates a mouth regression model 208 and an eye regression model 209. [0039] and Fig. 2.  See also Fig. 5) the method providing a database comprising a plurality of datasets, the datasets comprising a respective left image, a respective right image and a respective corresponding representation of a gaze-related parameter, the database being created or updated (Using the recorded audio and video recordings and the FACS expressions for a first performance, professional animators create a reference animation/facial position for each frame of the audio and video recordings as a reference animation 201. This animation takes time and is manually completed to ensure that it is of sufficient quality to accurately represent the facial features. This first reference animation 201 and training data 203 combination may be stored as a part of the reference data 202. The resulting animation, when combined with visual and audio reference data, the reference animation and the associated training data is used to create a viseme dataset 205. A "viseme" as used herein means the visual facial expressions of one pronouncing a given phoneme. Next, the viseme dataset 205 can be provided to a convolutional neural network 206 along with mouth and eye FACS datasets 207 (which may be stored separately from one another) to generate a mouth regression model 208 and an eye regression model 209. 
These models 208 and 209 may be used, following training, by the convolutional neural network(s) to derive a probable animated character for a given set of input visual data from an eye image and/or mouth image that are a part of an overall video stream of a wearer of a head mounted display. In particular, the animation created by professional animators is used as "ground truth" for training the convolutional neural network(s) based upon the dynamically time scaled video data to teach the networks animations that should result from image data. Associated blend shape weightings for various animation parameters using a method comprising: presenting a first stimulus to a first user wearing a head-wearable device (The training functions occur before the system 300 (FIG. 3) operates to create animations in real-time. First, a set of synchronized audio and video recordings for a series of individuals reciting a list of predetermined training sentences is captured as training data 203. The predetermined training sentences are chosen to vary the approach and retreat from various phonemes to provide a broad and varied dataset of phoneme and visual correspondence in the training data 203. Likewise, more commonly-used phonemes are chosen for repetition to ensure more accuracy. The same set of individuals are also asked to perform a series of facial actions corresponding to the facial action coding system (FACS) for two iterations of each expression. Using this, the rest of the training system may operate [0040].  See also Fig. 5), the head-wearable device comprising a first camera arranged next to a left eye of the first user and a second camera arranged next to a right eye of the first user when the first user is wearing the head-wearable device (The IR light and camera 112 may be mounted within a headset, such as the VR/ AR HMD 120 such that a wearer's eye movements, iris position, and other image data related to the wearer's eye may be ascertained. 
The IR light may be used so as to maintain the visual darkness to the naked eye, while still enabling image processing of eye region images created by the associated IR camera to take place. The IR light and camera 112 are described as a single IR light and camera, but may be two or more IR lights and/or cameras, with at least one of each for ; using the first camera of the head-wearable device to generate, when the first user is expected to respond to the first stimulus or expected to have responded to the first stimulus, a first left image of at least a portion of the left eye of the first user, and using, when the first user is expected to respond to the first stimulus or expected to have responded to the first stimulus, the second camera of the head-wearable device to generate a first right image of at least a portion of the right eye of the first user (The training functions occur before the system 300 (FIG. 3) operates to create animations in real-time. First, a set of synchronized audio and video recordings for a series of individuals reciting a list of predetermined training sentences is captured as training data 203. The predetermined training sentences are chosen to vary the approach and retreat from various phonemes to provide a broad and varied dataset of phoneme and visual correspondence in the training data 203. Likewise, more commonly-used phonemes are chosen for repetition to ensure more accuracy. The same set of individuals are also asked to perform a series of facial actions corresponding to the facial action coding system (FACS) for two iterations of each expression. Using this, the rest of the training system may operate [0040].  See also Fig. 5);
establishing a data connection between the head-wearable device and the database (the training system includes reference animation, reference data, training ; generating a first dataset comprising the first left image, the first right image and a first representation of a gaze-related parameter, the first representation being correlated with the first stimulus (Using the recorded audio and video recordings and the FACS expressions for a first performance, professional animators create a reference animation/facial position for each frame of the audio and video recordings as a reference animation 201. This animation takes time and is manually completed to ensure that it is of sufficient quality to accurately represent the facial features. This first reference animation 201 and training data 203 combination may be stored as a part of the reference data 202. [0039] – [0045].  See also Fig. 5); and adding the first dataset to the database (The resulting animation, when combined with visual and audio reference data, the reference animation and the associated training data is used to create a viseme dataset 205. A "viseme" as used herein means the visual facial expressions of one pronouncing a given phoneme. Next, the viseme dataset 205 can be provided to a convolutional neural network 206 along with mouth and eye FACS datasets 207 (which may be stored separately from one another) to generate a mouth regression model 208 and an eye regression model 209. These models 208 and 209 may be used, following training, by the convolutional neural providing a neural network with a given architecture (a convolutional neural network [0039] – [0045]); and determining parameters of the neural network using the respective left images and the respective right images of a sub-set or of all datasets as input and the respective corresponding representations of the gaze-related parameter of the sub-set or of all datasets as desired output of the neural network ([0039] – [0046]).
	However, Li does not explicitly teach using two cameras to generate a first left image of at least a portion of the left eye of the first user and a first right image of at least a portion of the right eye of the first user.
	Olszewski teaches using two cameras to generate a first left image of at least a portion of the left eye of the first user and a first right image of at least a portion of the right eye of the first user (Our system is based on a prototype of the FOVE VR HMD, with integrated eye tracking cameras and our custom mounted camera for mouth tracking. The HMD contains infrared (IR) cameras directed at the user’s eyes
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the known elements of two cameras because such incorporation would allow for tracking movements such as blinking as well as movements of the region surrounding the eye.  Section 3 of Olszewski.
	However, the combination of Li and Olszewski does not explicitly teach the respective corresponding gaze-related parameter being selected from a list consisting of: a gaze direction, a cyclopean gaze direction, a 3D gaze point, a 2D gaze point, a visual axis orientation, an optical axis orientation, a pupil axis orientation, a line of sight orientation and an eye vergence.
Furuta teaches the respective corresponding gaze-related parameter being selected from a list consisting of: a gaze direction, a cyclopean gaze direction, a 3D gaze point, a 2D gaze point, a visual axis orientation, an optical axis orientation, a pupil axis orientation, a line of sight orientation and an eye vergence ([0047] – [0050] and [0081] – [0088] of Furuta).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the known technique of causing the respective user to gaze at a respective given object because such incorporation would improve communication among users in the virtual space through the user object such that the communication is performed more smoothly.  [0008].
	
	
Consider claim 47, Li teaches the method is at least partly performed and/or controlled by one or more processors of the respective head-wearable device, and/or wherein the respective head-wearable device is one of a spectacles device, a goggles, an AR head-wearable display, and a VR head-wearable display, or by one or more processors of a local computer connected with the head-wearable device ([0028] – [0031]).
Consider claim 36, the combination of Li and Olszewski teaches all the limitations in claim 31 but does not explicitly teach establishing the data connection comprises connecting the respective head-wearable device with at least one of a desktop computer, a tablet, a laptop, a server, and a smartphone.
Furuta teaches establishing the data connection comprises connecting the respective head-wearable device with at least one of a desktop computer, a tablet, a laptop, a server, and a smartphone ([0022] – [0023]; [0097] – [0098]).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the known technique of connecting the head-wearable device with a server because such incorporation would allow data transmission between the head-wearable device and the server.  [0022].
Consider claim 37, the combination of Li, Olszewski, and Furuta teaches the respective stimulus comprises a visual stimulus, and/or wherein the respective stimulus comprises an acoustical stimulus (The training functions occur before the system 300 (FIG. 3) operates to create animations in real-time. First, a set of synchronized audio and video recordings for a series of individuals reciting a list of predetermined training sentences is captured as training data 203. The predetermined , wherein the respective user is caused by the respective stimulus to gaze at a respective given gaze point in the co-ordinate system, and/or wherein the respective left image and the respective right image are generated when the respective user is expected to gaze at the respective given object, into the respective given direction and/or at the respective given gaze point ([0047] – [0050] and [0081] – [0088] of Furuta).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the known technique of causing the respective user to gaze at a respective given object because such incorporation would improve communication among users in the virtual space through the user object such that the communication is performed more smoothly.  [0008].
Consider claim 38, Furuta teaches the method, further comprising at least one of: using a gaze-related parameter determination unit of the respective head-wearable device to determine a respective given or resulting value of the gaze-related parameter as respective representation of the gaze-related parameter; capturing a field of view of the respective user wearing the respective head-wearable device using a scene camera of the respective gaze-related parameter determination unit; and displaying the respective given object on a display of the respective head-wearable device ([0047] – [0050] and [0081] – [0088] of Furuta).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the known technique of causing the respective user to gaze at a respective given object because such incorporation would improve communication among users in the virtual space through the user object such that the communication is performed more smoothly.  [0008].
Consider claim 39, Furuta teaches the respective given or resulting value is determined with respect to a co-ordinate system which is fixed with the respective head-wearable device and/or is at least one of a respective given or resulting gaze direction, and a respective given or resulting gaze point for the respective user ([0047] – [0050] and [0081] – [0088] of Furuta).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the known technique of causing the respective user to gaze at a respective given object because such incorporation would improve communication among users in the virtual space through the user object such that the communication is performed more smoothly.  [0008].
Consider claim 48, Furuta teaches the respective gaze-related parameter is related to at least one element of a list and/or selected from the list, the list consisting of: a gaze direction, a cyclopean gaze direction, a 3D gaze point, a 2D gaze point, a visual axis orientation, an optical axis orientation, a pupil axis orientation, a line of sight orientation, an orientation and/or a position and/or an eyelid closure, a pupil area, a pupil size, a pupil diameter, a sclera characteristic, an iris diameter, a characteristic of a blood vessel, a cornea characteristic of at least one eye, a cornea radius, an eyeball radius, a distance pupil-center to cornea-center, a distance cornea-center to eyeball-center, a distance pupil-center to limbus center, a cornea keratometric index of refraction, a cornea index of refraction, a vitreous humor index of refraction, a distance crystalline lens to eyeball-center, to cornea center and/or to corneal apex, a crystalline lens index of refraction, a degree of astigmatism, an orientation angle of a flat and/or a steep axis, a limbus major and/or minor axes orientation, an eye cyclo-torsion, an eye intra-ocular distance, an eye vergence, a statistics over eye adduction and/or eye abduction, a statistics over eye elevation and/or eye depression, a blink event, a drowsiness and/or awareness of the user, and, a parameter for the user iris verification and/or identification ([0047] – [0050] and [0081] – [0088] of Furuta).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the known technique of causing the respective user to gaze at a respective given object because such incorporation would improve communication among users in the virtual space through the user object such that the communication is performed more smoothly.  [0008].
Consider claim 50, Furuta teaches the method, further comprising at least one of: using the updated database for retraining the trained neural network to obtain or improve a user-specific neural network; uploading the user-specific neural network to the head-wearable device and/or a computing unit connectable with the head-wearable device; using a gaze-related parameter determination unit of the at least one head-wearable device to determine a given or resulting gaze direction of the user and/or a given or resulting gaze point as the desired value of the gaze-related parameter; capturing a field of view of the user wearing the head-wearable device using a scene camera of the gaze-related parameter determination unit; and displaying the given object on a display of the head-wearable device ([0047] – [0050] and [0081] – [0088] of Furuta).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the known technique of causing the respective user to gaze at a respective given object because such incorporation would improve communication among users in the virtual space through the user object such that the communication is performed more smoothly.  [0008].
Consider claim 52, Li teaches the database comprises at least one of: datasets from a plurality of different users, datasets referring to a particular device, datasets referring to a device class, datasets referring to a user ID, and datasets referring to a user group, and/or wherein the parameters of the neural network are specifically determined for one of the particular device, the device class, the user ID, and the user group ([0039] – [0045]).

Allowable Subject Matter
Claims 45, 50, and 51 are allowed.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP 
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TAT CHI CHIO whose telephone number is (571)272-9563. The examiner can normally be reached Monday-Thursday 10am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, JAMIE J ATALA can be reached on 571-272-7384. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, 





/TAT C CHIO/Primary Examiner, Art Unit 2486