DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. Claims 1-20 are pending in this Office action.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Lee et al. (US 20200126557 A1) in view of Benhaim et al. (US 20140343945 A1), further in view of Boegelund et al. (US 20150073803 A1).
Regarding claim 1, Lee teaches a computer-implemented method (See Lee: Figs. 1-4, and [0046], “FIG. 1 is a view illustrating a sensor part of a speech intention expression system according to a first embodiment of the present invention, FIG. 2 is a view illustrating a position of the sensor part of the speech intention expression system according to the first embodiment of the present invention, and FIG. 3 is a view illustrating the speech intention expression system according to the first embodiment of the present invention”) comprising:
receiving audio data of a user voice as the user vocalizes during a period of time (See Lee: Fig. 36, and [0208], “As illustrated in FIG. 36, the sensor part 100 is disposed at the actual articulators and measures physical characteristics of the articulators according to a speaker's speech and transmits the measured physical characteristics to the data interpretation part 200, and the data interpretation part 200 interprets the received physical characteristics as speech data. The interpreted speech data is transmitted to the data expression part 500. It can be seen that the database part 350 operates in linkage with the data interpretation part 200 and the data expression part 500 in the interpretation and expression processes for the speech data”);
receiving spatial data of a face of the user during the period of time (See Lee: Fig. 2, and [0148], “More specifically, the oral tongue sensor 110, the facial sensors 120, the voice acquisition sensor 130, the vocal cord sensor 140, and the teeth sensor 150, which are located in the head and neck, provide data related to a sensor part position 210 at which each sensor is disposed, articulatory features 220 according to speech of a speaker 10, a speaker's voice 230, speech history information 240, and articulatory variations 250”);
identifying, using the spatial data, positions of elements of the face relative to other elements of the face during the period of time (See Lee: Fig. 15, and [0166], “As illustrated in FIG. 15, in the speech intention expression system according to the second embodiment of the present invention, a sensor part 100 in the vicinity of head and neck articulators that includes an oral tongue sensor 110, facial sensors 120, a voice acquisition sensor 130, a vocal cord sensor 140, and a teeth sensor 150 grasps a sensor part position 210 at which each sensor is disposed, articulatory features 220 according to speech, a speaker's voice 230 according to speech, and speech history information 240 including a start of speech, a pause of speech, and an end of speech”), wherein relative positions of the elements cause a plurality of qualities of the user voice (See Lee: Figs. 16-18, and [0173], “FIG. 16 is a view illustrating a principle by which a data interpretation part of the speech intention expression system according to the second embodiment of the present invention grasps articulatory features, FIG. 17 is a view illustrating a principle by which the data interpretation part of the speech intention expression system according to the second embodiment of the present invention grasps measured physical characteristics of articulators as articulatory features, FIG. 18 is a view illustrating a standard articulatory feature matrix related to vowels that is utilized by the data interpretation part of the speech intention expression system according to the second embodiment of the present invention, and FIG. 19 is a view illustrating a standard articulatory feature matrix related to consonants that is utilized by the data interpretation part of the speech intention expression system according to the second embodiment of the present invention”);
identifying that a subset of the positions of one or more of the elements cause a detected first quality of the plurality of qualities during the period of time (See Lee: Figs. 46-48, and [0233], “Further, the ANN in which pieces of data, which are physical characteristics of articulators, are grouped according to the degree of similarity and prediction data is generated to classify the pieces of data was utilized. In this way, the speaker may grasp the degree of rightness/wrongness, degree of contiguity/similarity, and intention of speech regarding the initial speech of the speaker himself/herself, as compared with the standard speech. On the basis of the grasped degree of rightness/wrongness, degree of contiguity/similarity, and intention of speech, the speaker obtains feedback regarding his or her speech and continuously re-performs the speech for speech correction. By the method of repetitively inputting pieces of data on physical characteristics of articulators, the pieces of data on the physical characteristics of the articulators are gathered, and accuracy of the ANN is improved”);
determining alternate positions of the one or more of the elements that are determined to cause the user voice to have a second quality of the plurality of qualities rather than the first quality (See Lee: Figs. 2-4, and [0168], “The data interpretation part 200 grasps the articulatory variations 250, which occur according to the speaker's gender, race, age, and native language, from the physical characteristics of articulators of the speaker that are measured by the sensor part 100 in the vicinity of the head neck articulators that is formed of the oral tongue sensor 110, the facial sensors 120, the voice acquisition sensor 130, the vocal cord sensor 140, and the teeth sensor 150”); and
providing, to the user, a graphical representation of the face that depicts one or more adjustments from the subset of the positions to the alternate positions (See Lee: Fig. 3, and [0241], “In this case, the data expression part 300 provides speech guidance (image) in order to intuitively show the speaker how the speaker should manipulate which articulators. The speech guidance (image) proposed by the data expression part 300 performs speech correction and guidance on the basis of a sensor part which is attached to or adjacent to articulators for pronouncing the [custom-character]. For example, in the case of the [kicustom-character], in order to pronounce [k], the speaker should speak /custom-character(keu)/ through the mouth by producing a plosive sound by raising the tongue body or tongue root toward the soft palate and attaching and detaching the tongue body or tongue root to and from the soft palate and producing a voiceless sound without trembling of the vocal cords”; and [0075], “On the basis of at least one piece of information among time duration of speech, frequency according to the speech, amplitude of the speech, electromyogram of head-and-neck muscles according to the speech, a change in positions of the head-and-neck muscles according to the speech, and a change in a position of the oral tongue due to bending and rotation, the database part may, form at least one speech data index among a consonant-and-vowel phoneme unit index, a syllable unit index, a word unit index, a phrase unit index, a sentence unit index, a consecutive speech unit index, and a pronunciation height index”).
However, Lee fails to explicitly disclose spatial data of a face of the user, and a subset of the positions of one or more of the elements.
Benhaim, however, teaches spatial data of a face of the user (See Benhaim: Figs. 7-8, and [0082], “This approach has the advantage to combine both the local visual characteristics (those of the points of interest) and the spatial relations between the points of the considered tuple (i.e. the deformation of the figure formed by the pair of triplets, of quadruplets . . . of points of interest). The way to construct these tuples and to select the most discriminant ones for the visual voice analysis will be described hereinafter, in relation with FIGS. 7 and 8”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Lee to have spatial data of a face of the user as taught by Benhaim in order to allow a much simpler and more efficient analysis without critical information loss while maintaining the time consistency of the speech (See Benhaim: Fig. 1, and [0020], “They are characteristics about the way to describe the vicinity of a point chosen on the image of the speaker's mouth, hereinafter referred to as "point of interest" (a notion that is also known as "landmark" or "point of reference"). These structured characteristics (also known as features in the scientific community) are generally described by characteristic vectors or "feature vectors" of great size, which are complex to process. The invention proposes to apply to these vectors a transformation that makes it possible both to simplify the expression thereof and to efficiency encode the variability induced by the visual language, allowing a much simpler analysis, and yet as efficient, without critical information loss and keeping the time consistency of the speech”). Lee teaches a method and system that may convert the position and the speech characteristic of the sensor unit into language data based on the measurements of the sensors and the data representation, while Benhaim teaches a system and method that may recognize spoken language by analyzing visual voice activity using spatial data of the face, involving a local gradient descriptor calculation. Therefore, it would have been obvious to one of ordinary skill in the art to modify Lee by Benhaim to use the spatial data of the face to analyze the voice data. The motivation to modify Lee by Benhaim is “Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results”.
However, Lee, as modified by Benhaim, fails to explicitly disclose a subset of the positions of one or more of the elements.
Boegelund, however, teaches a subset of the positions of one or more of the elements (See Boegelund: Fig. 3, and [0027], “At 355, the phoneme at the earliest temporal position in the identified word that has not been previously assigned a significance factor is selected. This selected phoneme may be the phoneme at the beginning of the word. In some embodiments, this selected phoneme may be a different phoneme, such as the phoneme immediately following the beginning phoneme. The significance of the selected phoneme in the selected word may be based on the number of alternates in a subset of the set identified at 345, where each alternate in the subset has the identical phoneme at a similar temporal position. A larger subset may indicate that the phoneme is less significant, since there may be less opportunity to confuse the word with another valid word on the basis of the selected phoneme”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Lee to have a subset of the positions of one or more of the elements as taught by Boegelund in order to improve the intelligibility of speech, which provides a benefit when attempting to distinguish a word from its sound-alike words, while avoiding unnecessary playback delays that result from lengthening the duration of an entire audio signal (See Boegelund: Fig. 1, and [0016], “As shown in Table 1, information density may not be uniform across a spoken word. Smoothening the information density can be accomplished by lengthening the duration of information-dense phonemes while shortening the duration of information-sparse phonemes. Such smoothening may improve intelligibility of speech and may therefore provide an advantage when attempting to distinguish a word from its sound-alike words, while avoiding unnecessary playback delays that result from lengthening the duration of an entire audio signal. Improved intelligibility may be particularly useful, for example, in e-learning courses. Depending on the ratio of information-dense phonemes to information-sparse phonemes, a smoothened audio signal may be shorter than the original audio signal, thus providing numerous benefits in addition to improving word distinction. Such benefits may include shortening the time required to listen to or otherwise process the audio signal, as well as reducing the rate requirements for storing, transmitting, and processing the audio signal. Further benefits may include improving the accuracy of automatic speech recognition technologies, since providing each word with individual non-uniform time scaling based on sound-alike words, rather than scaling phonemes independently, may improve reliability in the pattern recognition process”).
Lee teaches a method and system that may convert the position and the speech characteristic of the sensor unit into language data based on the measurements of the sensors and the data representation, while Boegelund teaches a system and method that may generate a set and a subset of alternate spoken words satisfying phonetic similarity, adapted for use in modifying temporal duration in an audio signal based on the number of alternates in the subset and the total number. Therefore, it would have been obvious to one of ordinary skill in the art to modify Lee by Boegelund so that phonemes or words in the set or subsets are adjusted differently. The motivation to modify Lee by Boegelund is “Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results”.
Regarding claim 2, Lee, Benhaim, and Boegelund teach all the features with respect to claim 1 as outlined above. Further, Lee teaches the computer-implemented method of claim 1, wherein:
the identifying positions of elements includes identifying an initial vector diagram of the face of the user (See Lee: Figs. 22-23, and [0181], “For example, in FIGS. 22 and 23, when the oral tongue sensor 110 is driven as a sensor for grasping a change in vector quantity or a change in angle, a change in vector quantity and a change in angle are grasped by measuring speech of a speaker, and, in this way, the vowel having a high tongue height and tongue frontness is recognized”);
the determining alternate positions includes determining an adjusted vector diagram of the face of the user (See Lee: Fig. 2, and [0055], “The oral tongue sensor may be fixed to one side surface of the oral tongue, surround a surface of the oral tongue, or be inserted into the oral tongue and grasp a change in vector quantity with time based on x-axis, y-axis, and z-axis directions of the oral tongue according to speech so that at least one physical characteristic among the height, frontness or backness, degree of curve, degree of stretch, degree of rotation, degree of tension, degree of contraction, degree of relaxation, and degree of vibration of the oral tongue may be grasped”); and
the providing the graphical representation of the face includes providing both the initial vector diagram and the adjusted vector diagram (See Lee: Figs. 4-5, and [0151], “As illustrated in FIGS. 4 and 5, the oral tongue sensor 110 is fixed to one side surface of an oral tongue 12, surrounds a surface of the oral tongue 12, or is inserted into the oral tongue 12 and grasps one or more independent physical characteristics among the height, frontness or backness, degree of curve, degree of stretch, degree of rotation, degree of tension, degree of contraction, degree of relaxation, and degree of vibration of the oral tongue itself”). 
Regarding claim 3, Lee, Benhaim, and Boegelund teach all the features with respect to claim 1 as outlined above. Further, Benhaim teaches the computer-implemented method of claim 1, wherein the graphical representation is provided in real time (See Benhaim: Fig. 1, and [0008], “Therefore, there still exists a real need to have visual voice recognition algorithms that are both robust and calculation-resource saving for their implementation, especially when the matter is to be able to perform this voice recognition "on the fly", almost in real time”).
Regarding claim 4, Lee, Benhaim, and Boegelund teach all the features with respect to claim 3 as outlined above. Further, Benhaim teaches that the computer-implemented method of claim 3, further comprising using an augmented reality device to provide the graphical representation in real time over a current image of the face of the user (See Benhaim: Figs. 1-4, and [0117], “To characterize the variability of the movement of the lips, due to different articulations and to the different classes of the visual speech, it is proposed to perform a selection by observing the statistics of velocity of the points of interest of the face around the lips. This method of selection begins by the smallest order (i.e., among the set of tuples, the singletons) and follows an incremental "gluttonous approach" (greedy algorithm) to form new tuples of higher order by aggregating an additional tuple to the tuples of the current selection of tuples, and by operating a new selection based on a relevancy score calculation (block 34), for example by a Variance Maximization Criterion VMC, as will be described hereinafter, in particular in relation with FIG. 8”).
Regarding claim 5, Lee, Benhaim, and Boegelund teach all the features with respect to claim 1 as outlined above. Further, Lee teaches that the computer-implemented method of claim 1, further comprising gathering data on a position of a tongue of the user within a mouth of the user, wherein the positions of elements of the face includes elements on the tongue of the user and the one or more adjustments includes an alternate position of at least one element of the elements on the tongue (See Lee: Fig. 1, and [0075], “On the basis of at least one piece of information among time duration of speech, frequency according to the speech, amplitude of the speech, electromyogram of head-and-neck muscles according to the speech, a change in positions of the head-and-neck muscles according to the speech, and a change in a position of the oral tongue due to bending and rotation, the database part may, form at least one speech data index among a consonant-and-vowel phoneme unit index, a syllable unit index, a word unit index, a phrase unit index, a sentence unit index, a consecutive speech unit index, and a pronunciation height index”).
Regarding claim 6, Lee, Benhaim, and Boegelund teach all the features with respect to claim 5 as outlined above. Further, Lee teaches that the computer-implemented method of claim 5, wherein one of an ultrasound sensor, mouthguard, retainer, or tongue sleeve gathers the data on the position of the tongue (See Lee: Figs. 1-3, and [0147], “As illustrated in FIGS. 1, 2, and 3, in the speech intention expression system according to the first embodiment of the present invention, a sensor part 100 includes an oral tongue sensor 110, facial sensors 120, a voice acquisition sensor 130, a vocal cord sensor 140, and a teeth sensor 150 which are located in the head and neck”). 
Regarding claim 7, Lee, Benhaim, and Boegelund teach all the features with respect to claim 1 as outlined above. Further, Lee and Boegelund teach the computer-implemented method of claim 1, further comprising:
identifying an age and a language of the user (See Lee: Figs. 1-3, and [0168], “The data interpretation part 200 grasps the articulatory variations 250, which occur according to the speaker's gender, race, age, and native language, from the physical characteristics of articulators of the speaker that are measured by the sensor part 100 in the vicinity of the head neck articulators that is formed of the oral tongue sensor 110, the facial sensors 120, the voice acquisition sensor 130, the vocal cord sensor 140, and the teeth sensor 150”); and 
identifying a machine learning model that has been trained specifically for the age and the language (See Lee: Figs. 21-23, and [0180], “In this case, the algorithm may be based on one or more algorithms among the K-nearest neighbors (KNN) algorithm, the artificial neural network (ANN) algorithm, the convolutional neural network (CNN) algorithm, the recurrent neural network (RNN) algorithm, the restricted Boltzmann machine (RBM) algorithm, and the hidden Markov model (HMM) algorithm”), wherein the machine learning model does each of:
the receiving the audio data (See Lee: Fig. 36, and [0208], “As illustrated in FIG. 36, the sensor part 100 is disposed at the actual articulators and measures physical characteristics of the articulators according to a speaker's speech and transmits the measured physical characteristics to the data interpretation part 200, and the data interpretation part 200 interprets the received physical characteristics as speech data. The interpreted speech data is transmitted to the data expression part 500. It can be seen that the database part 350 operates in linkage with the data interpretation part 200 and the data expression part 500 in the interpretation and expression processes for the speech data”);
the receiving the spatial data (See Lee: Fig. 51, and [0249], “As illustrated in FIG. 51, on the basis of the sensor part position 210 measured by the sensor part 100 and the head-and-neck facial expression change information 162 obtained by the imaging sensor 160, the data conversion part 300 generates first base data 211 of object head-and-neck data 320. A data matching part 600 performs matching by generating static basic coordinates 611 at matching positions 610, where the object head-and-neck data may be matched, of one or more objects 20 of an image object's head-and-neck 21 and a robot object's head-and-neck 22”);
the identifying the positions of elements (See Lee: Figs. 52-53, and [0252], “As illustrated in FIGS. 52 and 53, in order to match the object head-and-neck data 320 to the image object's head-and-neck 21, the data matching part 600 generates the static basic coordinates 611 by utilizing the first base data 211 which indicates positions of the facial sensors 120 attached to the speaker's head and neck”);
the identifying how the subset of the positions cause the first quality (See Lee: Fig. 1, and [0256], “In this case, as described above, the facial sensors 120 measure an electromyogram of the head and neck which move according to the speech of the speaker to grasp physical characteristics of the head and neck articulators. The reference sensor 121, the positive electrode sensor 122, and the negative electrode sensor 123, which are the facial sensors 120, attached while the speaker speaks grasp the electromyogram of the head-and-neck muscles which changes according to speech such that the reference sensor 121, the positive electrode sensor 122, and the negative electrode sensor 123 each have variable positions, e.g., coordinates (0, −1), (−1, −1), and (1, −1). Such positions become the dynamic variable coordinates 621”); and
the determining the alternate positions of the one or more of the elements (See Boegelund: Fig. 3, and [0027], “At 355, the phoneme at the earliest temporal position in the identified word that has not been previously assigned a significance factor is selected. This selected phoneme may be the phoneme at the beginning of the word. In some embodiments, this selected phoneme may be a different phoneme, such as the phoneme immediately following the beginning phoneme. The significance of the selected phoneme in the selected word may be based on the number of alternates in a subset of the set identified at 345, where each alternate in the subset has the identical phoneme at a similar temporal position. A larger subset may indicate that the phoneme is less significant, since there may be less opportunity to confuse the word with another valid word on the basis of the selected phoneme”). 
Regarding claim 8, Lee, Benhaim, and Boegelund teach all the features with respect to claim 1 as outlined above. Further, Lee teaches the computer-implemented method of claim 1, further comprising identifying a face shape of the user by analyzing the spatial data, wherein the determining the alternate positions includes comparing the face against a corpus of faces with similar face shapes (See Lee: Fig. 1, and [0090], “Further, by imaging the exterior of the head and neck articulators of a speaker that change according to speech, correlation between the speech and external changes to the articulators according to the speech may be grasped, and, in this way, the present invention may be utilized in linguistics, complementary communication, and implementation of faces of humanoids”).
Regarding claim 9, Lee, Benhaim, and Boegelund teach all the features with respect to claim 1 as outlined above. Further, Lee teaches that the computer-implemented method of claim 1, further comprising determining a severity of the first quality, wherein the adjustments account for the determined severity (See Lee: Fig. 1, and [0040], “There are various types of speech disorders. Speech disorders may be mainly classified as functional speech disorders and organic speech disorders. In most of the types, an abnormality occurs in the vocal cords which are part of the larynx. In many cases, speech disorders are caused by swelling or tearing of the vocal cords or occurrence of abnormal substances in the vocal cords which occurs due to external environmental factors”).
Regarding claim 10, Lee, Benhaim, and Boegelund teach all the features with respect to claim 1 as outlined above. Further, Lee, Benhaim, and Boegelund teach a system (See Lee: Figs. 1-4, and [0046], “FIG. 1 is a view illustrating a sensor part of a speech intention expression system according to a first embodiment of the present invention, FIG. 2 is a view illustrating a position of the sensor part of the speech intention expression system according to the first embodiment of the present invention, and FIG. 3 is a view illustrating the speech intention expression system according to the first embodiment of the present invention”) comprising:
a processor (See Boegelund: Fig. 5, and [0034], “FIG. 5 depicts a high-level block diagram of an example system for implementing disclosed embodiments. The mechanisms and apparatus of embodiments apply equally to any appropriate computing system. The major components of the computer system 001 comprise one or more CPUs 002, a memory subsystem 004, a terminal interface 012, a storage interface 014, an I/O (Input/Output) device interface 016, and a network interface 018, all of which are communicatively coupled, directly or indirectly, for inter-component communication via a memory bus 003, an I/O bus 008, and an I/O bus interface unit 010”); and 
a memory in communication with the processor, the memory containing instructions that, when executed by the processor (See Boegelund: Fig. 5, and [0035], “The computer system 001 may contain one or more general-purpose programmable central processing units (CPUs) 002A, 002B, 002C, and 002D, herein generically referred to as the CPU 002. In an embodiment, the computer system 001 may contain multiple processors typical of a relatively large system; however, in another embodiment the computer system 001 may alternatively be a single CPU system. Each CPU 002 executes instructions stored in the memory subsystem 004 and may comprise one or more levels of on-board cache”), cause the processor to: 
receive audio data of a user voice as the user vocalizes during a period of time (See Lee: Fig. 36, and [0208], “As illustrated in FIG. 36, the sensor part 100 is disposed at the actual articulators and measures physical characteristics of the articulators according to a speaker's speech and transmits the measured physical characteristics to the data interpretation part 200, and the data interpretation part 200 interprets the received physical characteristics as speech data. The interpreted speech data is transmitted to the data expression part 500. It can be seen that the database part 350 operates in linkage with the data interpretation part 200 and the data expression part 500 in the interpretation and expression processes for the speech data”); 
receive spatial data of a face of the user (See Benhaim: Figs. 7-8, and [0082], “This approach has the advantage to combine both the local visual characteristics (those of the points of interest) and the spatial relations between the points of the considered tuple (i.e. the deformation of the figure formed by the pair of triplets, of quadruplets . . . of points of interest). The way to construct these tuples and to select the most discriminant ones for the visual voice analysis will be described hereinafter, in relation with FIGS. 7 and 8”) during the period of time (See Lee: Fig. 2, and [0148], “More specifically, the oral tongue sensor 110, the facial sensors 120, the voice acquisition sensor 130, the vocal cord sensor 140, and the teeth sensor 150, which are located in the head and neck, provide data related to a sensor part position 210 at which each sensor is disposed, articulatory features 220 according to speech of a speaker 10, a speaker's voice 230, speech history information 240, and articulatory variations 250”); 
identify, using the spatial data, positions of elements of the face relative to other elements of the face during the period of time (See Lee: Fig. 15, and [0166], “As illustrated in FIG. 15, in the speech intention expression system according to the second embodiment of the present invention, a sensor part 100 in the vicinity of head and neck articulators that includes an oral tongue sensor 110, facial sensors 120, a voice acquisition sensor 130, a vocal cord sensor 140, and a teeth sensor 150 grasps a sensor part position 210 at which each sensor is disposed, articulatory features 220 according to speech, a speaker's voice 230 according to speech, and speech history information 240 including a start of speech, a pause of speech, and an end of speech”), wherein relative positions of the elements cause a plurality of qualities of the user voice (See Lee: Figs. 16-18, and [0173], “FIG. 16 is a view illustrating a principle by which a data interpretation part of the speech intention expression system according to the second embodiment of the present invention grasps articulatory features, FIG. 17 is a view illustrating a principle by which the data interpretation part of the speech intention expression system according to the second embodiment of the present invention grasps measured physical characteristics of articulators as articulatory features, FIG. 18 is a view illustrating a standard articulatory feature matrix related to vowels that is utilized by the data interpretation part of the speech intention expression system according to the second embodiment of the present invention, and FIG. 19 is a view illustrating a standard articulatory feature matrix related to consonants that is utilized by the data interpretation part of the speech intention expression system according to the second embodiment of the present invention”); 
identify that a subset of the positions of one or more of the elements (See Boegelund: Fig. 3, and [0027], “At 355, the phoneme at the earliest temporal position in the identified word that has not been previously assigned a significance factor is selected. This selected phoneme may be the phoneme at the beginning of the word. In some embodiments, this selected phoneme may be a different phoneme, such as the phoneme immediately following the beginning phoneme. The significance of the selected phoneme in the selected word may be based on the number of alternates in a subset of the set identified at 345, where each alternate in the subset has the identical phoneme at a similar temporal position. A larger subset may indicate that the phoneme is less significant, since there may be less opportunity to confuse the word with another valid word on the basis of the selected phoneme”) cause a detected first quality of the plurality of qualities during the period of time (See Lee: Figs. 46-48, and [0233], “Further, the ANN in which pieces of data, which are physical characteristics of articulators, are grouped according to the degree of similarity and prediction data is generated to classify the pieces of data was utilized. In this way, the speaker may grasp the degree of rightness/wrongness, degree of contiguity/similarity, and intention of speech regarding the initial speech of the speaker himself/herself, as compared with the standard speech. On the basis of the grasped degree of rightness/wrongness, degree of contiguity/similarity, and intention of speech, the speaker obtains feedback regarding his or her speech and continuously re-performs the speech for speech correction. By the method of repetitively inputting pieces of data on physical characteristics of articulators, the pieces of data on the physical characteristics of the articulators are gathered, and accuracy of the ANN is improved”); 
determine alternate positions of the one or more of the elements that are determined to cause the user voice to have a second quality of the plurality of qualities rather than the first quality (See Lee: Figs. 2-4, and [0168], “The data interpretation part 200 grasps the articulatory variations 250, which occur according to the speaker's gender, race, age, and native language, from the physical characteristics of articulators of the speaker that are measured by the sensor part 100 in the vicinity of the head neck articulators that is formed of the oral tongue sensor 110, the facial sensors 120, the voice acquisition sensor 130, the vocal cord sensor 140, and the teeth sensor 150”); and 
provide, to the user, a graphical representation of the face that depicts one or more adjustments from the subset of the positions to the alternate positions (See Lee: Fig. 3, and [0241], “In this case, the data expression part 300 provides speech guidance (image) in order to intuitively show the speaker how the speaker should manipulate which articulators. The speech guidance (image) proposed by the data expression part 300 performs speech correction and guidance on the basis of a sensor part which is attached to or adjacent to articulators for pronouncing the [custom-character]. For example, in the case of the [kicustom-character], in order to pronounce [k], the speaker should speak /custom-character(keu)/ through the mouth by producing a plosive sound by raising the tongue body or tongue root toward the soft palate and attaching and detaching the tongue body or tongue root to and from the soft palate and producing a voiceless sound without trembling of the vocal cords”; and [0075], “On the basis of at least one piece of information among time duration of speech, frequency according to the speech, amplitude of the speech, electromyogram of head-and-neck muscles according to the speech, a change in positions of the head-and-neck muscles according to the speech, and a change in a position of the oral tongue due to bending and rotation, the database part may, form at least one speech data index among a consonant-and-vowel phoneme unit index, a syllable unit index, a word unit index, a phrase unit index, a sentence unit index, a consecutive speech unit index, and a pronunciation height index”).
Regarding claim 11, Lee, Benhaim, and Boegelund teach all the features with respect to claim 10 as outlined above. Further, Lee teaches that the system of claim 10, wherein:
the identifying positions of elements includes identifying an initial vector diagram of the face of the user (See Lee: Figs. 22-23, and [0181], “For example, in FIGS. 22 and 23, when the oral tongue sensor 110 is driven as a sensor for grasping a change in vector quantity or a change in angle, a change in vector quantity and a change in angle are grasped by measuring speech of a speaker, and, in this way, the vowel having a high tongue height and tongue frontness is recognized”);
the determining alternate positions includes determining an adjusted vector diagram of the face of the user (See Lee: Fig. 2, and [0055], “The oral tongue sensor may be fixed to one side surface of the oral tongue, surround a surface of the oral tongue, or be inserted into the oral tongue and grasp a change in vector quantity with time based on x-axis, y-axis, and z-axis directions of the oral tongue according to speech so that at least one physical characteristic among the height, frontness or backness, degree of curve, degree of stretch, degree of rotation, degree of tension, degree of contraction, degree of relaxation, and degree of vibration of the oral tongue may be grasped”); and 
the providing the graphical representation of the face includes providing both the initial vector diagram and the adjusted vector diagram (See Lee: Figs. 4-5, and [0151], “As illustrated in FIGS. 4 and 5, the oral tongue sensor 110 is fixed to one side surface of an oral tongue 12, surrounds a surface of the oral tongue 12, or is inserted into the oral tongue 12 and grasps one or more independent physical characteristics among the height, frontness or backness, degree of curve, degree of stretch, degree of rotation, degree of tension, degree of contraction, degree of relaxation, and degree of vibration of the oral tongue itself”).
Regarding claim 12, Lee, Benhaim, and Boegelund teach all the features with respect to claim 10 as outlined above. Further, Benhaim teaches that the system of claim 10, wherein the graphical representation is provided in real time (See Benhaim: Fig. 1, and [0008], “Therefore, there still exists a real need to have visual voice recognition algorithms that are both robust and calculation-resource saving for their implementation, especially when the matter is to be able to perform this voice recognition "on the fly", almost in real time”).
Regarding claim 13, Lee, Benhaim, and Boegelund teach all the features with respect to claim 12 as outlined above. Further, Benhaim teaches that the system of claim 12, the memory containing further instructions that, when executed by the processor, cause the processor to use an augmented reality device to provide the graphical representation in real time over a current image of the face of the user (See Benhaim: Figs. 1-4, and [0117], “To characterize the variability of the movement of the lips, due to different articulations and to the different classes of the visual speech, it is proposed to perform a selection by observing the statistics of velocity of the points of interest of the face around the lips. This method of selection begins by the smallest order (i.e., among the set of tuples, the singletons) and follows an incremental "gluttonous approach" (greedy algorithm) to form new tuples of higher order by aggregating an additional tuple to the tuples of the current selection of tuples, and by operating a new selection based on a relevancy score calculation (block 34), for example by a Variance Maximization Criterion VMC, as will be described hereinafter, in particular in relation with FIG. 8”).
Regarding claim 14, Lee, Benhaim, and Boegelund teach all the features with respect to claim 10 as outlined above. Further, Lee teaches that the system of claim 10, the memory containing further instructions that, when executed by the processor, cause the processor to gather data on a position of a tongue of the user within a mouth of the user, wherein the positions of elements of the face includes elements on the tongue of the user and the one or more adjustments includes an alternate position of at least one element of the elements on the tongue (See Lee: Fig. 1, and [0075], “On the basis of at least one piece of information among time duration of speech, frequency according to the speech, amplitude of the speech, electromyogram of head-and-neck muscles according to the speech, a change in positions of the head-and-neck muscles according to the speech, and a change in a position of the oral tongue due to bending and rotation, the database part may, form at least one speech data index among a consonant-and-vowel phoneme unit index, a syllable unit index, a word unit index, a phrase unit index, a sentence unit index, a consecutive speech unit index, and a pronunciation height index”).
Regarding claim 15, Lee, Benhaim, and Boegelund teach all the features with respect to claim 14 as outlined above. Further, Lee teaches that the system of claim 14, wherein one of an ultrasound sensor, mouthguard, retainer, or tongue sleeve gathers the data on the position of the tongue (See Lee: Figs. 1-3, and [0147], “As illustrated in FIGS. 1, 2, and 3, in the speech intention expression system according to the first embodiment of the present invention, a sensor part 100 includes an oral tongue sensor 110, facial sensors 120, a voice acquisition sensor 130, a vocal cord sensor 140, and a teeth sensor 150 which are located in the head and neck”). 
Regarding claim 16, Lee, Benhaim, and Boegelund teach all the features with respect to claim 1 as outlined above. Further, Lee, Benhaim, and Boegelund teach that a computer program product comprising one or more computer readable storage media, and program instructions collectively stored on the one or more computer readable storage media (See Lee: Figs. 1-4, and [0046], “FIG. 1 is a view illustrating a sensor part of a speech intention expression system according to a first embodiment of the present invention, FIG. 2 is a view illustrating a position of the sensor part of the speech intention expression system according to the first embodiment of the present invention, and FIG. 3 is a view illustrating the speech intention expression system according to the first embodiment of the present invention”), the program instructions executable by one or more processors to cause the one or more processors (See Boegelund: Fig. 5, and [0035], “The computer system 001 may contain one or more general-purpose programmable central processing units (CPUs) 002A, 002B, 002C, and 002D, herein generically referred to as the CPU 002. In an embodiment, the computer system 001 may contain multiple processors typical of a relatively large system; however, in another embodiment the computer system 001 may alternatively be a single CPU system. Each CPU 002 executes instructions stored in the memory subsystem 004 and may comprise one or more levels of on-board cache”) to:
receive audio data of a user voice as the user vocalizes during a period of time (See Lee: Fig. 36, and [0208], “As illustrated in FIG. 36, the sensor part 100 is disposed at the actual articulators and measures physical characteristics of the articulators according to a speaker's speech and transmits the measured physical characteristics to the data interpretation part 200, and the data interpretation part 200 interprets the received physical characteristics as speech data. The interpreted speech data is transmitted to the data expression part 500. It can be seen that the database part 350 operates in linkage with the data interpretation part 200 and the data expression part 500 in the interpretation and expression processes for the speech data”);
receive spatial data of a face of the user (See Benhaim: Figs. 7-8, and [0082], “This approach has the advantage to combine both the local visual characteristics (those of the points of interest) and the spatial relations between the points of the considered tuple (i.e. the deformation of the figure formed by the pair of triplets, of quadruplets . . . of points of interest). The way to construct these tuples and to select the most discriminant ones for the visual voice analysis will be described hereinafter, in relation with FIGS. 7 and 8”) during the period of time (See Lee: Fig. 2, and [0148], “More specifically, the oral tongue sensor 110, the facial sensors 120, the voice acquisition sensor 130, the vocal cord sensor 140, and the teeth sensor 150, which are located in the head and neck, provide data related to a sensor part position 210 at which each sensor is disposed, articulatory features 220 according to speech of a speaker 10, a speaker's voice 230, speech history information 240, and articulatory variations 250”);
identify, using the spatial data, positions of elements of the face relative to other elements of the face during the period of time (See Lee: Fig. 15, and [0166], “As illustrated in FIG. 15, in the speech intention expression system according to the second embodiment of the present invention, a sensor part 100 in the vicinity of head and neck articulators that includes an oral tongue sensor 110, facial sensors 120, a voice acquisition sensor 130, a vocal cord sensor 140, and a teeth sensor 150 grasps a sensor part position 210 at which each sensor is disposed, articulatory features 220 according to speech, a speaker's voice 230 according to speech, and speech history information 240 including a start of speech, a pause of speech, and an end of speech”), wherein relative positions of the elements cause a plurality of qualities of the user voice (See Lee: Figs. 16-18, and [0173], “FIG. 16 is a view illustrating a principle by which a data interpretation part of the speech intention expression system according to the second embodiment of the present invention grasps articulatory features, FIG. 17 is a view illustrating a principle by which the data interpretation part of the speech intention expression system according to the second embodiment of the present invention grasps measured physical characteristics of articulators as articulatory features, FIG. 18 is a view illustrating a standard articulatory feature matrix related to vowels that is utilized by the data interpretation part of the speech intention expression system according to the second embodiment of the present invention, and FIG. 19 is a view illustrating a standard articulatory feature matrix related to consonants that is utilized by the data interpretation part of the speech intention expression system according to the second embodiment of the present invention”);
identify that a subset of the positions of one or more of the elements (See Boegelund: Fig. 3, and [0027], “At 355, the phoneme at the earliest temporal position in the identified word that has not been previously assigned a significance factor is selected. This selected phoneme may be the phoneme at the beginning of the word. In some embodiments, this selected phoneme may be a different phoneme, such as the phoneme immediately following the beginning phoneme. The significance of the selected phoneme in the selected word may be based on the number of alternates in a subset of the set identified at 345, where each alternate in the subset has the identical phoneme at a similar temporal position. A larger subset may indicate that the phoneme is less significant, since there may be less opportunity to confuse the word with another valid word on the basis of the selected phoneme”) cause a detected first quality of the plurality of qualities during the period of time (See Lee: Figs. 46-48, and [0233], “Further, the ANN in which pieces of data, which are physical characteristics of articulators, are grouped according to the degree of similarity and prediction data is generated to classify the pieces of data was utilized. In this way, the speaker may grasp the degree of rightness/wrongness, degree of contiguity/similarity, and intention of speech regarding the initial speech of the speaker himself/herself, as compared with the standard speech. On the basis of the grasped degree of rightness/wrongness, degree of contiguity/similarity, and intention of speech, the speaker obtains feedback regarding his or her speech and continuously re-performs the speech for speech correction. By the method of repetitively inputting pieces of data on physical characteristics of articulators, the pieces of data on the physical characteristics of the articulators are gathered, and accuracy of the ANN is improved”);
determine alternate positions of the one or more of the elements that are determined to cause the user voice to have a second quality of the plurality of qualities rather than the first quality (See Lee: Figs. 2-4, and [0168], “The data interpretation part 200 grasps the articulatory variations 250, which occur according to the speaker's gender, race, age, and native language, from the physical characteristics of articulators of the speaker that are measured by the sensor part 100 in the vicinity of the head neck articulators that is formed of the oral tongue sensor 110, the facial sensors 120, the voice acquisition sensor 130, the vocal cord sensor 140, and the teeth sensor 150”); and
provide, to the user, a graphical representation of the face that depicts one or more adjustments from the subset of the positions to the alternate positions (See Lee: Fig. 3, and [0241], “In this case, the data expression part 300 provides speech guidance (image) in order to intuitively show the speaker how the speaker should manipulate which articulators. The speech guidance (image) proposed by the data expression part 300 performs speech correction and guidance on the basis of a sensor part which is attached to or adjacent to articulators for pronouncing the [custom-character]. For example, in the case of the [kicustom-character], in order to pronounce [k], the speaker should speak /custom-character(keu)/ through the mouth by producing a plosive sound by raising the tongue body or tongue root toward the soft palate and attaching and detaching the tongue body or tongue root to and from the soft palate and producing a voiceless sound without trembling of the vocal cords”; and [0075], “On the basis of at least one piece of information among time duration of speech, frequency according to the speech, amplitude of the speech, electromyogram of head-and-neck muscles according to the speech, a change in positions of the head-and-neck muscles according to the speech, and a change in a position of the oral tongue due to bending and rotation, the database part may, form at least one speech data index among a consonant-and-vowel phoneme unit index, a syllable unit index, a word unit index, a phrase unit index, a sentence unit index, a consecutive speech unit index, and a pronunciation height index”).
Regarding claim 17, Lee, Benhaim, and Boegelund teach all the features with respect to claim 16 as outlined above. Further, Lee teaches that the computer program product of claim 16, wherein:
the identifying positions of elements includes identifying an initial vector diagram of the face of the user (See Lee: Figs. 22-23, and [0181], “For example, in FIGS. 22 and 23, when the oral tongue sensor 110 is driven as a sensor for grasping a change in vector quantity or a change in angle, a change in vector quantity and a change in angle are grasped by measuring speech of a speaker, and, in this way, the vowel having a high tongue height and tongue frontness is recognized”);
the determining alternate positions includes determining an adjusted vector diagram of the face of the user (See Lee: Fig. 2, and [0055], “The oral tongue sensor may be fixed to one side surface of the oral tongue, surround a surface of the oral tongue, or be inserted into the oral tongue and grasp a change in vector quantity with time based on x-axis, y-axis, and z-axis directions of the oral tongue according to speech so that at least one physical characteristic among the height, frontness or backness, degree of curve, degree of stretch, degree of rotation, degree of tension, degree of contraction, degree of relaxation, and degree of vibration of the oral tongue may be grasped”); and
the providing the graphical representation of the face includes providing both the initial vector diagram and the adjusted vector diagram (See Lee: Figs. 4-5, and [0151], “As illustrated in FIGS. 4 and 5, the oral tongue sensor 110 is fixed to one side surface of an oral tongue 12, surrounds a surface of the oral tongue 12, or is inserted into the oral tongue 12 and grasps one or more independent physical characteristics among the height, frontness or backness, degree of curve, degree of stretch, degree of rotation, degree of tension, degree of contraction, degree of relaxation, and degree of vibration of the oral tongue itself”). 
Regarding claim 18, Lee, Benhaim, and Boegelund teach all the features with respect to claim 16 as outlined above. Further, Benhaim teaches that the computer program product of claim 16, wherein the graphical representation is provided in real time (See Benhaim: Fig. 1, and [0008], “Therefore, there still exists a real need to have visual voice recognition algorithms that are both robust and calculation-resource saving for their implementation, especially when the matter is to be able to perform this voice recognition "on the fly", almost in real time”).
Regarding claim 19, Lee, Benhaim, and Boegelund teach all the features with respect to claim 18 as outlined above. Further, Benhaim teaches that the computer program product of claim 18, the one or more computer readable storage media containing further instructions that, when executed by the one or more processors, cause the one or more processors to use an augmented reality device to provide the graphical representation in real time over a current image of the face of the user (See Benhaim: Figs. 1-4, and [0117], “To characterize the variability of the movement of the lips, due to different articulations and to the different classes of the visual speech, it is proposed to perform a selection by observing the statistics of velocity of the points of interest of the face around the lips. This method of selection begins by the smallest order (i.e., among the set of tuples, the singletons) and follows an incremental "gluttonous approach" (greedy algorithm) to form new tuples of higher order by aggregating an additional tuple to the tuples of the current selection of tuples, and by operating a new selection based on a relevancy score calculation (block 34), for example by a Variance Maximization Criterion VMC, as will be described hereinafter, in particular in relation with FIG. 8”).
Regarding claim 20, Lee, Benhaim, and Boegelund teach all the features with respect to claim 16 as outlined above. Further, Lee teaches that the computer program product of claim 16, the one or more computer readable storage media containing further instructions that, when executed by the one or more processors, cause the one or more processors to gather data on a position of a tongue of the user within a mouth of the user, wherein the positions of elements of the face includes elements on the tongue of the user and the one or more adjustments includes an alternate position of at least one element of the elements on the tongue (See Lee: Fig. 1, and [0075], “On the basis of at least one piece of information among time duration of speech, frequency according to the speech, amplitude of the speech, electromyogram of head-and-neck muscles according to the speech, a change in positions of the head-and-neck muscles according to the speech, and a change in a position of the oral tongue due to bending and rotation, the database part may, form at least one speech data index among a consonant-and-vowel phoneme unit index, a syllable unit index, a word unit index, a phrase unit index, a sentence unit index, a consecutive speech unit index, and a pronunciation height index”).


Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to GORDON G LIU whose telephone number is (571)270-0382. The examiner can normally be reached Monday - Friday 8:00-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jennifer Mehmood, can be reached at 571-272-2976. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/GORDON G LIU/Primary Examiner, Art Unit 2612