Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-14 are pending. Claims 1, 8 and 14 are independent.
This Application was published as U.S. 2022/0084522.
Apparent priority: 16 September 2020.
35 U.S.C. 112(f) Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 
The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. 

1. An electronic apparatus comprising: 
a communication device configured to receive a signal from each of a plurality of acceleration sensors attached to a face of a user; 
a memory configured to store a classification learning model that classifies a word based on a plurality of sensor output values; and 
a processor configured to determine a word corresponding to a mouth shape of the user by input a value of the received signal to the classification learning model, when the signal is received from each of the plurality of acceleration sensors. 

Such claim limitation is: “communication device” in Claim 1. This limitation is generic in the context of the art, does not refer to any specific structure, and serves only as a placeholder for the structure that performs the associated function, without providing any information about what that structure is. MPEP 2181 I A says:
For a term to be considered a substitute for "means," and lack sufficient structure for performing the function, it must serve as a generic placeholder and thus not limit the scope of the claim to any specific manner or structure for performing the claimed function. It is important to remember that there are no absolutes in the determination of terms used as a substitute for "means" that serve as generic placeholders. The examiner must carefully consider the term in light of the specification and the commonly accepted meaning in the technological art. Every application will turn on its own facts.
The Specification describes the corresponding structure as follows: “[0044] The communication device 110 is configured to connect the electronic apparatus 100 to an external apparatus, and not only connection to a mobile device through a local area network (LAN) and the Internet, but also connection through a universal serial bus (USB) port may be possible.”
PLEASE NOTE: This is NOT a rejection. Please do not address it as a rejection. If the Applicant does not agree with the INTERPRETATION, the Applicant may argue or amend to replace the term interpreted under 112(f) with structural terms such as “USB port,” “Ethernet port,” or “Wi-Fi chip,” as appropriately supported by the Specification. In the alternative, the Applicant may let the interpretation stand if the intent was to include a means-plus-function limitation in the Claim.
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-2, 4-9, and 11-14 are rejected under 35 U.S.C. 103 as being unpatentable over Rameau (U.S. 2022/0208194) in view of McVicker (U.S. 10621973).
Regarding Claim 1, Rameau teaches:
1. An electronic apparatus comprising: [Rameau, Figure 1, the “computing device 110,” which includes the model and is in communication with the “personal device 160,” which includes the sensors. The two may be integrated in one device: “[0046] Referring to FIG. 1, in various embodiments, a system 100 may include a computing device 110 (or multiple computing devices, co-located or remote to each other) and a personal device 160 (which may be, for example, a wearable device for sensing sEMG signals). In potential embodiments, the personal device 160 may be integrated with the computing device 110 or components thereof….”]
a communication device configured to receive a signal from each of a plurality of acceleration sensors attached to a face of a user; [Rameau, Figure 1, “transceiver 140” receiving signals from the “communicator 190” which communicates signals of the sensors to the “computing device 110.” The “cutaneous sensor unit 165” as further shown in Figure 5, teaches the “sensors attached to a face of a user.”  “[0048] Personal device 160 may include a cutaneous sensor unit 165 for detecting signals, and a control module for processing and/or transmitting signals to computing device 110. ….”  “[0057] Referring to FIGS. 5, 6, and 7, example embodiments of personal device 700, 800 according to various potential embodiments are illustrated. The personal device 700, 800, 900 may comprise or be, for example, a facial electrode tattoo. Personal device 700, 800, 900 may be a hemi-face device with sensors on one side of the subject's face, or may include sensors for both sides of the subject's face…..”  See [0007] for location of electrodes.]
a memory configured to store a classification learning model that classifies a word based on a plurality of sensor output values; and [Rameau, Figure 1, “predictive model training module 120” and “predictive model application module 130” stored on the “computing device 110.”  “[0046] … The computing device 110 may include one or more processors and one or more volatile and non-volatile memories for storing computing code and data that are captured, acquired, recorded, and/or generated….”]
a processor configured to determine a word corresponding to a mouth shape of the user by input a value of the received signal to the classification learning model, when the signal is received from each of the plurality of acceleration sensors. [Rameau, Figure 1, “controller 115.” Figure 4, “application model” branch of the Figure and “generate predicted words or phrases using the model 660.”  “10. …  the computing device comprises a processor configured to receive data from the control module and generate predictions of words uttered by the subject.”  The signals that are recorded by the sensor correspond to “mouthing” of “words” and therefore correspond to “a mouth shape” of the user:  “[0029] …In one example, using seven sEMG sensors on a subject's face and neck and two grounding electrodes, the system recorded EMG data while the subject was mouthing "Tedd" and "Ed." In example embodiments, 92% accuracy was achieved in the recognition of 10 digits, 100 utterance of each digit from 2 subjects. The patient's silent mouthed speech may be translated into text and synthesized speech as an alternative means of communication.”]
The sensors of Rameau are sEMG sensors: “[0006] … a system for recognizing speech by detecting surface electromyographic (sEMG) signals from a face and/or a neck of a subject….” Further, the sEMG sensors are attached to the face and neck of the speaker on a membrane surface such that they cover the face and can detect movements of the mouth: “[0008] … The housing or body may be, for example, a membrane. The electrodes and/or electrical pathways may be embedded in the membrane. The housing or body may be contoured so as to mate with a surface contour of the subject's face and/or neck.”
However, “acceleration sensors” are not taught by Rameau.
McVicker teaches:
a communication device configured to receive a signal from each of a plurality of acceleration sensors attached to a face of a user; [McVicker teaches the use of Inertial Measurement Units, which include accelerometers, to register silent speech, and the sensors of McVicker are in contact with the user’s skin around the chin and neck: “Fig. 4 is a perspective view of the headset 400 mounted on the user. … . In one embodiment the IMU is a motion-sensing system-in-a-chip, housing a 3-axis accelerometer, 3-axis gyroscope, and 3-axis magnetometer for nine degrees of freedom (9DOF) in a single IC.”  Col. 4 line 51 to col. 5 line 7.  See also Figure 4 and Abstract provided below.]
…
a processor configured to determine a word corresponding to a mouth shape of the user by input a value of the received signal to the classification learning model, when the signal is received from each of the plurality of acceleration sensors. [McVicker teaches the use of its sensors (including accelerometer sensors) to detect movement of the user’s muscles when a user “mouths words.”  The IMUs, which include the accelerometer sensor, detect muscle movement corresponding to words and thus teach determining words corresponding to the shape of the mouth.   “A sub-vocal speech recognition (SVSR) apparatus includes a headset that is worn over an ear and electromyography (EMG) electrodes and an Inertial Measurement Unit (IMU) in contact with a user's skin in a position over the neck, under the chin and behind the ear. When a user speaks or mouths words, the EMG and IMU signals are recorded by sensors and amplified and filtered, before being divided in multi-millisecond time windows. These time windows are then transmitted to the interface computing device for Mel Frequency Cepstral Coefficients (MFCC) conversion into aggregated vector representation (AVR). The AVR is the input to the SVSR system, which utilizes a neural network, CTC function, and language model to classify the phoneme. The phonemes are then combined into words and sent back to the interface computing device, where they are played either as audible output, such as from a speaker, or non-audible output, such as text.”  Abstract.]
Rameau and McVicker pertain to detection of silent speech using sensors that are placed on the face and neck of the user.  It would have been obvious to substitute the accelerometer sensors of McVicker for the sEMG sensors of Rameau to arrive at the claimed invention.  Further, as McVicker’s system of having both EMG electrodes and an IMU (accelerometer) demonstrates, having two types of sensors generates more accurate outcomes, and adding the sensors of McVicker to the sensors of Rameau would also have been obvious. This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
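
The kind of input contemplated by the combination, a signal from each of several 3-axis accelerometers divided into short time windows before classification (compare the multi-millisecond time windows in McVicker's Abstract), may be pictured with the following minimal sketch. It is illustrative only: the function names, the three-sensor layout, and the window and hop sizes are assumptions and are not taken from Rameau, McVicker, or the Application.

```python
from typing import List, Sequence, Tuple

Axes = Tuple[float, float, float]  # one (x, y, z) reading from one sensor


def flatten_sensors(readings: Sequence[Axes]) -> List[float]:
    """Concatenate the (x, y, z) readings of all sensors at one time step
    into a single feature vector: [s0_x, s0_y, s0_z, s1_x, ...]."""
    return [axis for sensor in readings for axis in sensor]


def split_into_windows(samples: Sequence[Sequence[float]],
                       window: int, hop: int) -> List[List[Sequence[float]]]:
    """Divide a stream of per-time-step feature vectors into fixed-length
    windows. Windows overlap when hop < window. Trailing samples that do
    not fill a complete window are dropped (an assumption of this sketch)."""
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    return [list(samples[start:start + window])
            for start in range(0, len(samples) - window + 1, hop)]


# Hypothetical stream: 10 time steps from 3 sensors with made-up values.
stream = [flatten_sensors([(float(t), 0.0, 1.0),
                           (0.0, float(t), 2.0),
                           (float(t), float(t), 3.0)])
          for t in range(10)]
windows = split_into_windows(stream, window=4, hop=2)
```

Each window in this sketch holds four time steps of nine values (three sensors by three axes), which is the shape a downstream classifier would consume.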

Regarding Claim 2, Rameau teaches:
2. The electronic apparatus as claimed in claim 1, 
wherein the classification learning model is a model trained by using the value of the signal received from each of the plurality of acceleration sensors in a process of uttering each of a plurality of predetermined words. [Rameau, Figures 2 and 4.  Figure 2 shows training for words that are similar such as Ted and Ed.  “11. …  apply a predictive machine learning model to the data received from the control module, the predictive machine learning model trained using recordings corresponding to discrete words or phrases spoken by one or more subjects.”  See [0017]-[0018] and [0051].]
Rameau does not teach the use of acceleration sensors; such sensors were brought in from McVicker under the substitution rationale provided for Claim 1.

Regarding Claim 4, Rameau teaches:
4. The electronic apparatus as claimed in claim 1, 
wherein the plurality of acceleration sensors are attached to different portions around a mouth that move the most at the time of speech utterance of the user. [Rameau, Figure 5 shows the positioning of the sensors on the face and around the mouth of the user.  See [0007]-[0009] and [0057]-[0058].]
Rameau does not teach the use of acceleration sensors; such sensors were brought in from McVicker under the substitution rationale provided for Claim 1.

Regarding Claim 5, Rameau teaches:
5. The electronic apparatus as claimed in claim 1, 
wherein the plurality of acceleration sensors include three to five acceleration sensors. [Rameau, Figures 5-8, teaches the use of 4 to 16 sensors, one or two of which may be reference sensors, which gives a range of 2 to 15 for the number of electrodes/sensors and teaches the “three to five … sensors” of the Claim.  “[0058] The personal device 700, 800, 900 comprises a set of sensors 720, 820, 920, which may be electrodes capable of detecting sEMG signals non-invasively. In the versions shown in FIGS. 5 and 7, seven and eight electrodes are depicted, respectively, although the number of electrodes (and/or other sensors) may vary from, for example, four to 16 in various embodiments, for one or both sides of a subject's face and/or neck. One or two of the electrodes may be reference electrodes which may also be secured to the subject's skin. The reference electrodes may be positioned in various locations away from the articulatory muscles of interest, such as areas with lower or undetectable muscle activity (such as the wrist or ear). Electrode positions may be custom-fabricated for each patient, or may be fabricated with more generic configurations suited to multiple patients.”]

Regarding Claim 6, McVicker teaches:
6. The electronic apparatus as claimed in claim 1, 
wherein each of the plurality of acceleration sensors is a 3-axis accelerometer. [McVicker teaches that one type of sensor used for detecting silent speech is a “3-axis accelerometer.”  “FIG. 4 is a perspective view of the headset 400 being mounted on the user. The headset compromises the upper section 400 that also is the mounting position for the upper EMG electrodes that mount behind the ear 404. The bend and stay wires 401 lead to both the chin electrode pair and the IMU 402 and the neck EMG electrode 403, holding these in place. In other embodiments, any other coupling can be used to hold the electrodes in place including adhesives. The bend and stay wires are flexible wires with adhesive on the chin electrode pair and the IMU 402 and the neck electrodes 403. …. In one embodiment the IMU (inertial measurement unit) is a motion-sensing system-in-a-chip, housing a 3-axis accelerometer, 3-axis gyroscope, and 3-axis magnetometer for nine degrees of freedom (9DOF) in a single IC.”  Col. 4 line 51 to col. 5 line 7.]

[Image: media_image1.png]

The rationale for combination is as provided for Claim 1.  The accelerometer was brought in from McVicker, and its type and specifics also come from McVicker.

Regarding Claim 7, Rameau teaches and suggests:
7. The electronic apparatus as claimed in claim 1, 
wherein the processor is configured to perform an operation corresponding to the determined word. [Rameau, Figure 1, teaches that the “personal device 160” may provide a “control command” to the “computing device 110.”  Rameau also teaches that the “transceiver 140” receives sensor data that correspond to “words” of silent speech by the user.  The two teachings together suggest that the words mouthed by the user may be “control commands” that would be subsequently executed.  “[0047] A transceiver 140 allows the computing device 110 to receive and/or exchange readings, control commands, and/or other data with personal device 160….”] 
(Note also Madhvanath [0009] in the Conclusion section for an express teaching of silent speech detected by sensors as command for execution of a function.)

Claim 8 is a method claim with limitations corresponding to the limitations of Claim 1 and is rejected under similar rationale.
8. A method for recognizing silent speech, the method comprising: 
receiving a signal from each of a plurality of acceleration sensors attached to a face of a user; and 
determining a word corresponding to a mouth shape of the user by input a value of the received signal to a classification learning model that classifies a word based on a plurality of sensor output values. 

Claim 9 is a method Claim with limitations similar to the limitations of Claim 2 and is rejected under similar rationale.
Claim 11 is a method Claim with limitations similar to the limitations of Claim 4 and is rejected under similar rationale.
Claim 12 is a method Claim with limitations similar to the limitations of Claim 5 and is rejected under similar rationale.
Claim 13 is a method Claim with limitations similar to the limitations of Claim 7 and is rejected under similar rationale.

Claim 14 is a computer program product claim with limitations corresponding to the limitations of method Claim 8 and is rejected under similar rationale. Additionally, Rameau teaches: 
14. A non-transitory computer-readable recording medium including a program for performing a method for recognizing silent speech, the method including:  [Rameau: “[0046] … The computing device 110 may include one or more processors and one or more volatile and non-volatile memories for storing computing code and data that are captured, acquired, recorded, and/or generated. The computing device 110 may include a controller 115 that may be configured to exchange control signals with personal device 160 and/or control the analysis of data and interaction with users (e.g., so as to provide text or synthesized speech)….”]
receiving a signal from each of a plurality of acceleration sensors attached to a face of a user; and 
determining a word corresponding to a mouth shape of the user by input a value of the received signal to a classification learning model that classifies a word based on a plurality of sensor output values.

Claims 3 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Rameau and McVicker and further in view of Wieman (U.S. 20210174783) and further in view of Joseph (U.S. 20210134062).
Regarding Claim 3, Rameau teaches:
3. The electronic apparatus as claimed in claim 1, 
wherein the classification learning model is a convolutional neural network-long short-term memory (1D CNN-LSTM) model. [Rameau, “13. T…  wherein the predictive model uses one or more artificial neural networks.”  “[0050] Feature engineering may be applied to the dataset (by or via, e.g., computing device 110, such as a predictive model training module 120).  … These artificial neural networks are multilayered, with convolutional layer combining an initial input from large databases and communicating an output to deep processing layers acting as filters. These filters recognize patterns in the original data, creating hierarchic estimations of patterns, called concepts. Voice and speech recognition may rely on artificial neural networks (ANNs) for further refinement of recognition capabilities. ANNs can create nonlinear models that best match nonlinear phenomena.”]
Rameau does not teach the use of a 1D CNN-LSTM model.
McVicker teaches that it trains a neural network model which may be a CNN or RNN but does not teach a combined CNN-LSTM model.  See Figure 9, and Col. 7, lines 32-50.
Wieman teaches and suggests:
wherein the classification learning model is a convolutional neural network-long short-term memory (1D CNN-LSTM) model. [Wieman teaches the use of a CNN-LSTM-DNN for speech recognition, which teaches combining a CNN and an LSTM and suggests the CNN-LSTM model of the Claim: “[0102] Some approaches to implementing neural speech-to-meaning recognizers are CNN-LSTM-DNN or seq-to-seq or RNN transducer model including attention. FIG. 8 shows an example that has 4 layers and can be used for an intent recognizer. More or fewer layers are possible. It has a lowest input layer that is a convolutional layer that operates on a frame of audio samples or a spectrogram of such. It computes a set of layer-1 feature probabilities. A second layer is a recurrent layer. Recurrent nodes are shown with a double circle. The recurrence may be a long short-term memory (LSTM) type. The layer 2 features are input to a smaller third layer that is also recurrent and also may be an LSTM layer. The layer 3 features are used by a feed-forward layer that takes an input from an external recognizer, shown with an X circle. The external recognition may be from one or more variable recognizers and/or a domain recognizer. In such an architecture, when used as an intent recognizer, the combination of the top layer nodes produces a final output that indicts that an API hit should occur.”]

[Image: media_image2.png]

Rameau, McVicker, and Wieman pertain to speech recognition.  It would have been obvious to use the combination of CNN and LSTM from Wieman as one type of model for speech recognition in place of the models of Rameau and McVicker, both of which teach the use of neural networks. This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Wieman teaches a CNN-LSTM-DNN model and not a CNN-LSTM model.

Joseph teaches:
wherein the classification learning model is a convolutional neural network-long short-term memory (1D CNN-LSTM) model. [Joseph teaches a model that classifies the mood and emotion of a speaker based on several types of data collected from the speaker including his speech and teaches that the model used is a CNN-LSTM model:  “[0037] These neural networks may include one or more CNN layers and, optionally, one or more fully connected layers that help reduce variations in dimensions of data fed to the LSTM layers that are effective for temporal modeling. The output of CNN layer is, thus, fed to the LSTM neural network (including one or more LSTM layers) that effectively disentangles underlying variations within the input data. The combination of CNN-LSTM layers into one unified framework facilitates achieving accuracy in extraction process; the extracted output then fed to the AI engine 130 for further analysis and correlations (as discussed in subsequent paragraphs).”]
Rameau/McVicker/Wieman and Joseph pertain to classification of data including speech.  It would have been obvious to use the combination of CNN and LSTM in a CNN-LSTM model as the classification model for the reasons and benefits stated by Joseph at [0037] for using this particular combination. This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
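
As an illustrative sketch only, the arrangement described by Joseph at [0037], in which the output of a convolutional layer is fed to LSTM layers, may be pictured as a forward pass of a small 1D CNN-LSTM classifier. The layer sizes, the random untrained weights, the three-word vocabulary, and all function and parameter names below are assumptions made for illustration; none is taken from a cited reference or from the Application.

```python
import math
import random
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]


def dot(a: Vector, b: Vector) -> float:
    return sum(p * q for p, q in zip(a, b))


def conv1d_relu(x: Matrix, kernels: List[Matrix], biases: Vector) -> Matrix:
    """Valid 1D convolution over time, then ReLU.
    x is [time][in_channels]; kernels is [out_channels][kernel][in_channels]."""
    k = len(kernels[0])
    out = []
    for t in range(len(x) - k + 1):
        out.append([max(0.0, b + sum(dot(kern[d], x[t + d]) for d in range(k)))
                    for kern, b in zip(kernels, biases)])
    return out


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_last_hidden(seq: Matrix, w: Dict[str, Tuple[Matrix, Matrix, Vector]],
                     hidden: int) -> Vector:
    """Single-layer LSTM; returns the hidden state after the last time step.
    w[gate] = (input weights, recurrent weights, bias) for gates i, f, o, g."""
    h, c = [0.0] * hidden, [0.0] * hidden
    for x_t in seq:
        pre = {gate: [dot(wx[j], x_t) + dot(wh[j], h) + b[j] for j in range(hidden)]
               for gate, (wx, wh, b) in w.items()}
        i = [sigmoid(z) for z in pre["i"]]    # input gate
        f = [sigmoid(z) for z in pre["f"]]    # forget gate
        o = [sigmoid(z) for z in pre["o"]]    # output gate
        g = [math.tanh(z) for z in pre["g"]]  # candidate cell state
        c = [f[j] * c[j] + i[j] * g[j] for j in range(hidden)]
        h = [o[j] * math.tanh(c[j]) for j in range(hidden)]
    return h


def softmax(z: Vector) -> Vector:
    m = max(z)
    e = [math.exp(v - m) for v in z]
    return [v / sum(e) for v in e]


def random_params(in_ch: int, conv_ch: int, kernel: int, hidden: int,
                  n_words: int, seed: int = 0) -> dict:
    """Untrained random weights, used only so the sketch can run."""
    rng = random.Random(seed)
    mat = lambda r, c: [[rng.uniform(-0.5, 0.5) for _ in range(c)] for _ in range(r)]
    return {"kernels": [mat(kernel, in_ch) for _ in range(conv_ch)],
            "conv_bias": [0.0] * conv_ch,
            "lstm": {g: (mat(hidden, conv_ch), mat(hidden, hidden), [0.0] * hidden)
                     for g in "ifog"},
            "dense": mat(n_words, hidden),
            "dense_bias": [0.0] * n_words}


def classify(x: Matrix, params: dict, vocab: List[str]) -> Tuple[str, Vector]:
    """CNN features -> LSTM over time -> dense layer -> most probable word."""
    feats = conv1d_relu(x, params["kernels"], params["conv_bias"])
    hidden = len(params["lstm"]["i"][2])
    h = lstm_last_hidden(feats, params["lstm"], hidden)
    probs = softmax([dot(row, h) + b
                     for row, b in zip(params["dense"], params["dense_bias"])])
    best = max(range(len(vocab)), key=probs.__getitem__)
    return vocab[best], probs


# Hypothetical input: 20 time steps of 9 channels (3 sensors x 3 axes).
signal = [[math.sin(0.1 * t * (ch + 1)) for ch in range(9)] for t in range(20)]
vocab = ["yes", "no", "stop"]  # placeholder vocabulary
params = random_params(in_ch=9, conv_ch=4, kernel=3, hidden=5, n_words=len(vocab))
word, probs = classify(signal, params, vocab)
```

Because the weights are random, the selected word carries no meaning; the sketch shows only the order of operations in which convolutional features are passed to a recurrent layer before classification.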

Claim 10 is a method Claim with limitations similar to the limitations of Claim 3 and is rejected under similar rationale.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 

Green (U.S. 20170263237) teaches: Use of tri-axial magnetic sensor on and around lips and articulators to detect silent speech.

Madhvanath (U.S. 20120075184): 
[0009] Embodiments of the present solution provide a method and system for executing a command on a computing device. More specifically, the embodiments propose using silent speech (lip movement) in a multimodal command scenario, where silent speech may act as one of the commands to a computer system. In an embodiment, silent speech (lip movement) may act as a qualifier to another command.

Ogawa (U.S. 20200357382):
[0071] In an example embodiment, a different example of a silent OCD 304 is used to record the language inputs of the user. The silent OCD 304 includes sensors that detects other user inputs, but which are not the voice. Examples of sensors in the silent OCD 304 include one or more of: brain signal sensors, nerve signal sensors, and muscle signal sensors. These sensors detect silent gestures, thoughts, micro movements, etc., which are translated to language (e.g. text data). In an example embodiment, these sensors include electrodes that touch parts of the face or head of the user. In other words, the user can provide language inputs without having to speaking into a microphone. The silent OCD 304, for example, is a wearable device that is worn on the head of the user. The silent OCD 304 is also sometimes called a silent speech interface or a brain computer interface. The silent OCD 304, for example, allows a user to interact with their device in a private manner while in a group setting (see FIG. 4A) or in public.
[0088] In another example aspect, private notes of a given user can be made using their own device (e.g. a device like the silent OCD 304 and the device 401), and public notes can be made based on the discussion recorded at threshold audible levels by the OCD 301. The private notes for example, can also be recorded orally or by silent speech using the silent OCD 304. For the given user, the data enablement platform, or their own user device, will compile and present a compilation of both the given user's private notes and public notes that are organized based on time. For example:
[0093] Also at Location A is a user 407 that is wearing another embodiment of an OCD 301a. This embodiment of the OCD 301a includes a microphone, audio speakers, a processor, a communication device, and other electronic devices to track gestures and movement of the user. For example, these electronic devices include one or more of a gyroscope, an accelerometer, and a magnetometer. These types of devices are all inertial measurement units, or sensors. However, other types of gesture and movement tracking can be used. In an example embodiment, the OCD 301a is trackable using triangulation computed from radio energy signals from the two OCD units 301 positioned at different locations (but both within Location A). In another example, image tracking from cameras is used track gestures.
[0289] In an example aspect, the oral computing device is a wearable device to dynamically interact with the data. For example, the wearable device includes inertial measurement sensors. In another example, the wearable device is a smart watch. In another example, the wearable device is a headset. In another example, the wearable device projects images to provide augmented reality.

Epstein (U.S. 2020/0234712):
[0031] Whether the invention is embodied in an earbud or other form factor, portable embodiments may allow the user to have easier access to silent or near-silent speech recognition or recognition in loud environments. The invention does not require that the close signal capture encode precisely mappable or discriminable “speech” information. For example, if there is ambiguity in the stream based on the mode of employment where mouth shapes are similar (p/b and t/d for voicing, s/sh for tongue shape), adaptive learning (or even just context) can often make heads or tails out of it. For example, such techniques are understood in the art of automated lip reading. Processing blocks based on multiple-possibility probability adaptation can retroactively update the posterior probabilities of ambiguous parts of the stream by surrounding (present or future) less ambiguous data. A detected stream that derives as “bodayduh” can be resolved as “potato” with nearly no necessary context, as there are few similar “shaped” words in the English language, such as when using motion or mouth shape detection from the visual or electrographic techniques described, so that even approximate pronunciation such as replacing the /oϑ/ sound at the end of the word with a schwa leaves little ambiguity. Multiple different close signal techniques together can help improve the accuracy of a given employment. And using syllable shape and sequence information through adaptive algorithms and probabilistic language tables are well known to the art.

Joseph (US 20210134062):
[0035] Next, the extraction module 120 extracts dynamically varying facial expressions, bodily expressions, aural and other symptomatic characteristics from the image data, speech data and the physiological signals received by the input module 110. In one aspect of present disclosure, the extraction module 120 is operable to provide image data, speech data and physiological signals related data to a 3-dimensional convolutional neural network (CNN) to extract facial expressions, bodily expressions, aural and other symptomatic characteristics therefrom.
[0036] Following from above, the extraction module 120 now tracks the progress of extraction of facial expressions, bodily expressions, aural and other symptomatic characteristics corresponding to the image data, speech data and physiological signals related data based on long short-term memory (LSTM) units over a period of time. The two different fusion network models vis-à-vis CNN and LSTM are directly processed based on length of time the network is trained to give the output from the analyzed input data. In one other exemplary embodiment, in order to process sequence or time varying data such as speech, audio or physiological signals, a neural network combination of recurrent neural network (RNN) and LSTM is deployed.
[0037] These neural networks may include one or more CNN layers and, optionally, one or more fully connected layers that help reduce variations in dimensions of data fed to the LSTM layers that are effective for temporal modeling. The output of CNN layer is, thus, fed to the LSTM neural network (including one or more LSTM layers) that effectively disentangles underlying variations within the input data. The combination of CNN-LSTM layers into one unified framework facilitates achieving accuracy in extraction process; the extracted output is then fed to the AI engine 130 for further analysis and correlations (as discussed in subsequent paragraphs).


Asaei (US 20170069306):
[0035] The speech processing may include an event analysis for analysing events in the signal. The event analysis may include a speech parametrization (such as formants, LPC, PLP, MFCC features) or visual clue extraction (such as a shape of mouths) or brain-computer interface feature extraction (such as electroencephalogram patterns) or ultrasound and optical camera and electromagnetic signals input of tongue and lip movements or electromyography of speech articulator muscles and the larynx.
[0053] The encoding device of FIG. 1 comprises an event analysis module 1 or a signal analysis module 1 for analysing a signal s, such as a speech signal, a video signal, a brain signal, an ultrasound and/or optical camera and electromagnetic signals representative of tongue and lip movements, or an electromyography signal representative of speech articulator muscles and of the larynx…..

Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499. The examiner can normally be reached on 9 to 5, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Fariba Sirjani/
Primary Examiner, Art Unit 2659