DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the
first inventor to file provisions of the AIA.

Response to Amendment
The amendment filed on July 7, 2022 has been entered. Claims 1-5, 7-14, and 16-22
remain pending. Claims 21-22 are newly added.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 13 July 2022 has been entered. 

Response to Arguments
Applicant's arguments filed July 7, 2022, with respect to claims 1-5, 7-14, and 16-22,
have been fully considered but they are not persuasive.
Applicant submits on para. 4 of pg. 12 – para. 1 of pg. 13, As understood by Applicant,
Amento discloses an Automatic Speech Recognizer (ASR) that uses a Word Confusion Network (WCN) to identify a confidence score for words in a transcript. Fig. 6 of Amento is cited as illustrating the use of a menu-type error correction tool that may be used to make corrections to a displayed transcript. A user may select a word having a visual indicator indicating that the word has a confidence score that is less than a predetermined threshold. However, the user in Amento selects the word having the visual indicator from the previously determined confidence score. In contrast to Amento, the claimed approach provides that the portion of the transcript is identified based on the gaze of the user and the output of one or more models that include classification functions trained with known gaze data. Therefore, Amento does not base the confidence score on gaze of the user and output of a model trained with known gaze data, nor does Amento display the text input interface responsive to identifying the portion of the transcript that has the likelihood of error. Rather, Amento relies on user input to select the word with the visual indicator.
Amento is relied upon only to cure the deficiencies of Thörn, as set out in the rejections under 35 U.S.C. 103 below, specifically for claims 3-4, 9, 12-13, and 18. Applicant's argument therefore views Amento through a 35 U.S.C. 102 lens rather than under 35 U.S.C. 103, which is the basis used in the Office action.

Applicant submits on para. 3 of pg. 12, Thörn fails to cure the deficiencies of Amento.
Thörn is cited as disclosing: "identifying a portion of the transcription that has a likelihood of transcription error ... based on a gaze of the user". In Thörn, as understood by Applicant, a user directs their eye gaze to a defined activation area to activate text editing functions for a word or interword space (see Fig. 7 and ll. 1-9 of Par. [0112] of Thörn, pp. 8-9 of Office action). However, Thörn does not disclose activating text editing functions responsive to identifying the portion of the transcription that has the likelihood of error. Par. [0095] of Thörn states that, "When the portable electronic equipment 1 determines that a word 42 may need to be edited, e.g. because there is an ambiguity in assigning the correct word to the received speech signal, the portable electronic equipment 1 may allow the user to activate a text editing function for editing the word by an eye gaze directed onto the word." However, Thörn does not disclose identifying a portion of the transcription that has a likelihood of transcription error based on the output of one or more models used in determining the transcription and based on the saccades going back-and-forth over the portion of the transcription that has the likelihood of transcription error while the gaze of the user dwells on the portion of the transcription that has the likelihood of transcription error, nor does Thörn display the text input interface responsive to identifying the portion of the transcript that has the likelihood of error.
Thörn, in its plain language, teaches the elements recited in independent claims 1, 10, and 19. Line 3 of paragraph 0095 through line 6 of paragraph 0097 indicates that a score may be associated with words generated by the speech-to-text module 123, where models are implied because they are needed for the speech to text conversion module to be operable and serve its purpose, and that the user may activate a text editing function by an eye gaze directed onto the word, hence identifying a particular error in the text. "Based on the gaze of the user" is interpreted as any gaze of the user that gives an indication of a likelihood of error. Thörn meets this limitation because paras. 95-97 describe an ambiguity in assigning the correct word to the received speech signal by the speech to text conversion, i.e. by the models used. Furthermore, para. 93 indicates that a score may be used to quantify whether the speech to text module determines that it is likely to have misinterpreted the word. As used herein, the term “score” refers to a numerical value which is a quantitative indication for a likelihood, e.g. for a likelihood of a spoken utterance being correctly converted into a word or for a likelihood that a special character has to be inserted at an interword space. The portable equipment 1 allows the user to activate a text editing function for editing the word by an eye gaze directed onto the word, i.e. responsive to and corresponding to the portion of the transcription that has a likelihood of error based on the output of one or more models used in determining the transcription (the scoring that determines ambiguity) and based on the gaze of the user (the focusing/dwelling that gives an indication that there is a likelihood of error).
Furthermore, para. 114 indicates that "[t]he user's eye gaze direction may move rapidly between words at which the user intends to perform a text editing operation and/or interword spaces at which the user intends to perform a text editing operation." Because the rapid eye movements go between words, they meet the "back-and-forth over the portion of the transcription that has the likelihood of transcription error" language, where the likelihood is based on the output of the model, while the gaze dwell time is greatest in the activation area, i.e. rapid eye movements occur while the gaze dwells. Therefore, Thörn teaches identifying a portion of the transcription that has a likelihood of transcription error based on the output of one or more models used in determining the transcription, by way of the scoring used in the ASR transcription, and based on the saccades, which under the broadest reasonable interpretation are rapid eye movements, going between words in the transcription, i.e. back-and-forth over the portion of the transcription that has a likelihood of transcription error as determined by the models and the user, while the gaze of the user, as determined by the gaze tracker, dwells on that portion.
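
For illustration only, the reading applied above (a recognizer score indicating ambiguity, combined with gaze dwell on the same word) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Thörn: the `Word` record, the function name, the threshold values, and the sample word "ran" are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    score: float     # hypothetical recognizer score in [0, 1]; lower = more ambiguous
    dwell_ms: float  # hypothetical accumulated gaze dwell time on this word


def select_word_for_editing(words, score_threshold=0.6, min_dwell_ms=300.0):
    """Return the index of the ambiguous word with the longest dwell, or None.

    A word qualifies only if BOTH conditions hold: its score is below the
    (assumed) threshold and the gaze has dwelled on it long enough.
    """
    candidates = [
        i for i, w in enumerate(words)
        if w.score < score_threshold and w.dwell_ms >= min_dwell_ms
    ]
    if not candidates:
        return None
    # Among qualifying words, pick the one with the greatest dwell time.
    return max(candidates, key=lambda i: words[i].dwell_ms)


words = [
    Word("Jack", 0.95, 120.0),
    Word("quickly", 0.40, 850.0),
    Word("ran", 0.55, 200.0),  # low score, but dwell too short to qualify
]
selected = select_word_for_editing(words)
```

In this sketch, a text editing interface would be opened only for the word at index `selected`; a word with a high score, or one the user barely looked at, is never selected.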

Applicant submits on para. 4 of pg. 12- para. 2 on pg. 13, With respect to claim 7, it is
alleged that ll. 1-9 of Par. [0114] of Thörn discuss that the user's eye gaze direction may move rapidly between words as rapid eye directions or movements (see p. 12 of Office action). Par. [0114] of Thörn states "FIG. 8 shows a path 80 of the user's eye gaze direction on the display. The user's eye gaze direction may move rapidly between words at which the user intends to perform a text editing operation and/or interword spaces at which the user intends to perform a text editing operation. In the illustrated example, the gaze dwell time is greatest in the activation area 71. The text editing function may be activated to enable a user to edit the word or interword space associated with the activation area 71." However, Thörn does not disclose saccades as defined in the medical literature. As understood in the medical literature, saccades are rapid eye movements used in repositioning the fovea to a new location in the visual environment. The term comes from an old French word meaning "flick of a sail". Saccadic movements are both voluntary and reflexive. The movements can be voluntarily executed or they can be invoked as a corrective optokinetic or vestibular measure. Saccades range in duration from 10 to 100 ms, which is a sufficiently short duration to render the executor effectively blind during the transition.[1] Applicant submits that the rapid movements between words of the user's eye gaze direction described by Thörn are not saccades. It will be appreciated that, while foveal fixation correlates with gaze direction and eye position, saccades are not rapid movements of the gaze direction. Indeed, saccades can happen in the eyes while the gaze direction is focused on one location.
Under its broadest reasonable interpretation, the term "saccades" covers rapid eye movements, and the specification of the instant application provides no medical-literature definition to which the term could be correlated. Even during a gaze dwell as described in Thörn, the gaze may dwell on a particular area while the saccades, which are rapid eye movements, move between the words, as described in the response above. The gaze dwell does not necessarily identify a single word; it may cover, for instance, a group of three words, with the saccades being rapid eye movements between those three words of the transcription.

Applicant submits on para. 3 of pg. 13, Thörn does not disclose the 'going
back-and-forth over the portion of the transcription' limitation, as Thörn merely discloses moving rapidly between words and/or interword spaces.
Thörn discusses the gaze moving rapidly between words, which may include moving back-and-forth between words such as “Jack quickly”, where those words are a portion of the overall transcription; therefore, the claim language corresponds to the teaching cited from para. 114.

Applicant submits on para. 4 of pg. 13 – para. 2 of pg. 14, Thomson is cited for reducing
the inaccuracy and time required to generate transcriptions with editing capabilities (see p. 25 of Office action). However, Thomson also does not disclose displaying a text input interface responsive to identifying a portion of the transcription that has a likelihood of transcription error based on the output of one or more models used in determining the transcription and based on the saccades going back-and-forth over the portion of the transcription that has the likelihood of transcription error while the gaze of the user dwells on the portion of the transcription that has the likelihood of transcription error (emphasis added). Prokofieva is cited as disclosing the one or more models executing classification functions trained with known gaze data (see pp. 9-10 of Office action). However, Prokofieva alone or in combination with the prior art references cited above, also does not disclose displaying a text input interface responsive to identifying a portion of the transcription that has a likelihood of transcription error based on the output of one or more models used in determining the transcription and based on the saccades going back-and-forth over the portion of the transcription that has the likelihood of transcription error while the gaze of the user dwells on the portion of the transcription that has the likelihood of transcription error (emphasis added).
[1] Andrew T. Duchowski. Eye Tracking Methodology - Theory and Practice. Springer-Verlag London, 2003.
Both Thomson and Proko are used to cure the deficiencies of Thörn. The argued elements are not relied upon as being taught by Proko or Thomson, but rather by Thörn. In paras. 95-97 of Thörn, activation of the text editing function, an interface, is conducted after the scoring and the focus/dwelling of the user’s gaze on a portion of text, i.e. after identifying the portion of the transcription that has the likelihood of error, where the text input interface corresponds to the portion selected by the output of one or more models used in determining the transcription and based on the gaze of the user. The saccades going back-and-forth over the portion of the transcription that has the likelihood of transcription error while the gaze of the user dwells on that portion are likewise taught by Thörn, as stated above and as set forth in the rejection under 35 U.S.C. 103 below.

Applicant submits on para. 3 of pg. 14, In view of the above, Applicant respectfully
submits that the subject matter of claim 1 would not have been obvious based on the cited references. Claims 10, and 19 are also amended similarly to claim 1 and thus are also believed to be patentable in view of the cited references for at least the reasons set forth above with respect to claim 1. Claims 2-5 and 7-9 depend from claim 1, claims 11-14 and 16-18 depend from claim 10, and claim 20 depends from claim 19. Thus, these dependent claims are patentable in view of the cited references at least for the reason of dependence from independent claims 1, 10, and 19. Applicant requests that the rejection of claims 1-5, 7-14, and 16-20 under 35 U.S.C. § 103 be withdrawn.
Independent claims 1, 10, and 19 are rejected under 35 U.S.C. 103 over
Thörn (US 2015/0364140 A1) in view of Prokofieva (WO 2016/049439 A1), hereinafter Proko; the rejections of the dependent claims likewise apply the factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103.

Applicant submits details regarding the new claims 21-22 with recited features.
Please refer to the Office action below regarding the factual inquiries for establishing
obviousness under 35 U.S.C. 103, specifically the rejection over Thörn in view of Proko, further in view of Amento, and further in view of Peters et al. (US Pub. No. 2015/0070262 A1), hereinafter Peters.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35
U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness
rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


The factual inquiries for establishing a background for determining obviousness under
35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-2, 7-8, 10-11, 16-17, and 19-20 are rejected under 35 U.S.C. 103 as being
unpatentable over Thörn (US 2015/0364140 A1) in view of Prokofieva (WO 2016/049439 A1) hereinafter Proko.
Regarding claim 1, Thörn teaches A method for revising a transcription output from an automatic speech recognition (ASR) system (Lines 1 on paragraph 0068- line 4 on paragraph 0069, Allows user to edit text generated by the speech to text conversion, interpreted as the preamble reciting an intended use), the method comprising:
receiving a voice input from a user (Lines 1-9 on paragraph 0068, Speech to text
conversion module may determine a textual representation of a spoken utterance, so it
receives voice input);
determining a transcription of the voice input (Lines 1-9 on paragraph 0068, Speech to
text conversion module may determine a textual representation of a spoken utterance);
displaying the transcription of the voice input (Lines 19-21 on paragraph 0074, Text from the speech to text conversion module is displayed on display 5);
determining saccades and a gaze of the user with a gaze tracker (A gaze tracker is interpreted as anything able to track eye direction and/or movement; Lines 4-6 on paragraph 0036, Gaze tracking device tracks eye gaze direction of a user; furthermore, Thörn determines rapid eye movements, which are saccades under the broadest reasonable interpretation, as shown in para. 114, where the user’s eye gaze direction may move rapidly between words);
identifying a portion of the transcription that has a likelihood of transcription error based on the output of one or more models used in determining the transcription and based on the saccades going back-and-forth over the portion of the transcription that has the likelihood of transcription error while the gaze of the user dwells on the portion of the transcription that has the likelihood of transcription error (Line 3 of paragraph 0095 through line 6 of paragraph 0097 indicates that a score may be associated with words generated by the speech-to-text module 123, where models are implied because they are needed for the speech to text conversion module to be operable and serve its purpose, and that the user may activate a text editing function by an eye gaze directed onto the word, hence identifying a particular error in the text. "Based on the gaze of the user" is interpreted as any gaze of the user that gives an indication of a likelihood of error. Thörn meets the limitation because paras. 95-97 describe an ambiguity in assigning the correct word to the received speech signal by the speech to text conversion, i.e. by the models used. Furthermore, para. 93 indicates that a score may be used to quantify whether the speech to text module determines that it is likely to have misinterpreted the word. As used herein, the term “score” refers to a numerical value which is a quantitative indication for a likelihood, e.g. for a likelihood of a spoken utterance being correctly converted into a word or for a likelihood that a special character has to be inserted at an interword space. The portable equipment 1 allows the user to activate a text editing function for editing the word by an eye gaze directed onto the word, i.e. responsive to and corresponding to the portion of the transcription that has a likelihood of error based on the output of one or more models used in determining the transcription (the scoring that determines ambiguity) and based on the gaze of the user (the focusing/dwelling that gives an indication that there is a likelihood of error). Furthermore, para. 114 indicates that the user's eye gaze direction may move rapidly between words at which the user intends to perform a text editing operation and/or interword spaces at which the user intends to perform a text editing operation. Because the rapid eye movements go between words, they meet the back-and-forth language over the portion of the transcription that has the likelihood of transcription error based on the output of the model, while the gaze dwell time is greatest in the activation area, i.e. rapid eye movements occur while the gaze dwells.);
responsive to identifying the portion of the transcription that has the likelihood of error, displaying a text input interface corresponding to the portion of the transcription that has the likelihood of error based on the output of one or more models used in determining the transcription and based on the gaze of the user (Paras. 95 – 97, Activation of the text editing function, an interface, is conducted after the scoring and focus/dwelling of the user’s gaze on a portion of text i.e. identifying the portion of the transcription that has the likelihood of error, where the text input interface corresponds to the portion selected as by the output of one or more models used in determining the transcription and based on the gaze of the user);
receiving a text input from the user via the text input interface indicating a revision to the identified portion of the transcription (Lines 1-8 on paragraph 0113, User may be allowed to edit the word by selecting among other candidate words and/or by using textual character input, where it may be the identified portion of the transcript); and
revising the transcription of the voice input in accordance with the text input (Lines 1-8
on paragraph 0113, User may select other candidate words and/or use textual character
input so as to edit the word or interword space, for example by inserting punctuation marks or other special characters).
	However, Thörn does not explicitly disclose:
the one or more models executing classification functions trained with known gaze data;
In a related field of endeavor (e.g. techniques described herein leverage gaze input with gestures and/or speech input to improve spoken language understanding in computerized conversational systems, see abstract), Proko discloses that user utterances can include input transcribed from speech input 110, and that the user 102 can interact with the visual context without constraints on vocabulary, grammar, and/or choice of intent that can make up the user utterance. In some examples, user utterances can include errors based on transcription errors and/or particular speech patterns that can cause an error, see para. 19; examples of the transcription errors may be found in para. 22. Furthermore, the extraction module 216 receives speech input that is transcribed into user utterances, gaze input 304, and/or other forms of user 102 input, where the extraction module can extract one or more lexical features. Lexical similarity describes a process for using words and associated semantics to determine a similarity between words in two or more word sets. Lexical features can determine lexical similarities between words that make up the text associated with one or more visual elements in a visual context and words in the speech input 302. The extraction module 216 can leverage automatic speech recognition ("ASR") models and/or general language models to compute the lexical features. The extraction module 216 can leverage various models and/or techniques depending on the visual context of the visual item, see para. 45. Lastly, the extraction module 216 can identify fixation points representing where a user’s 102 gaze lands in a visual context. The extraction module 216 can leverage a model to identify individual fixation points from the gaze input data 306.
In at least one example, the extraction module 216 can leverage models such as velocity-threshold identification algorithms, hidden Markov model fixation identification algorithms, dispersion-threshold identification algorithms, minimum spanning trees identification algorithms, area-of-interest identification algorithms, and/or velocity-based, dispersion-based, and/or area-based algorithms to identify the fixation points from the gaze input data 306, see para. 49. That is, the one or more models execute (i.e. work together and, in this case, leverage) the classification functions, as identified by the models and algorithms, trained with known gaze data.
Modifying Thörn to include techniques disclosed by Proko discloses:
the one or more models executing classification functions trained with known gaze data (e.g. Thörn’s method for revising a transcript, where it identifies a portion of the transcription that has a likelihood of transcription error based on at least the saccades going back-and-forth over the portion of the transcription that has the likelihood of transcription error while the gaze of the user dwells on the portion of the transcription that has the likelihood of transcription error determined by the gaze tracker, see paras. 95-97 now modified to include the feature that the one or more models executing classification functions trained with known gaze data as taught by Proko, see paras. 22, 45, and 49);
It would have been obvious to one of ordinary skill in the art before the effective filing
date of the claimed invention to apply the teachings of Proko to the method of Thörn. Doing so would have been predictable to one of ordinary skill in the art given the similar nature of the references, for example, text transcript revision with multi-modal inputs. Further, including Proko’s features would have benefited users of Thörn's method, providing techniques for improving accuracy in understanding and resolving references to visual elements in visual contexts associated with conversational computing systems, where tracking user gaze and leveraging gaze input based on the user gaze with gestures and/or speech input can improve spoken language understanding in conversational systems by improving the accuracy by which the system can understand and resolve references to visual elements in a visual context, as recognized by Proko, see para. 12.
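
For illustration only, a velocity-threshold identification algorithm of the general kind named in Proko para. 49 can be sketched as follows. This is a minimal sketch under stated assumptions, not Proko's or Thörn's implementation: the function names, the sample format (time in seconds, gaze position in degrees of visual angle), and the 100 deg/s threshold are hypothetical. The second function counts horizontal direction reversals between successive saccades, which is one simple way to express a "back-and-forth" pattern.

```python
import math


def classify_ivt(samples, velocity_threshold=100.0):
    """Label each inter-sample interval as 'fixation' or 'saccade'.

    samples: list of (t_seconds, x_deg, y_deg) gaze samples (assumed format).
    An interval is a saccade when its point-to-point angular velocity
    exceeds the (assumed) threshold in degrees per second.
    """
    labels = []
    for (t0, x0, y0), (t1, x1, y1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        velocity = math.hypot(x1 - x0, y1 - y0) / dt  # deg/s
        labels.append("saccade" if velocity > velocity_threshold else "fixation")
    return labels


def count_direction_reversals(samples, labels):
    """Count horizontal direction changes between successive saccades."""
    directions = []
    for s0, s1, label in zip(samples, samples[1:], labels):
        if label == "saccade":
            dx = s1[1] - s0[1]
            if dx != 0:
                directions.append(1 if dx > 0 else -1)
    return sum(1 for a, b in zip(directions, directions[1:]) if a != b)


# Hypothetical 100 Hz trace: the gaze jumps right, back left, then right again.
samples = [
    (0.00, 1.0, 0.0), (0.01, 1.1, 0.0), (0.02, 4.0, 0.0), (0.03, 4.1, 0.0),
    (0.04, 1.2, 0.0), (0.05, 1.1, 0.0), (0.06, 4.0, 0.0),
]
labels = classify_ivt(samples)
reversals = count_direction_reversals(samples, labels)
```

A fixed threshold is the simplest member of this family; the other algorithms listed in para. 49 (hidden Markov model, dispersion-threshold, and so on) would replace the single velocity comparison with a different classification rule.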

Regarding claim 2, Thörn in view of Proko teaches the method of claim 1 (see claim 1 above), in addition, Thörn teaches:
 further comprising:
displaying a graphical indication of the portion of the transcription that has the likelihood of transcription error (Lines 1-4 on paragraph 0107, broken line where the boundary of the activation area may be displayed).

Regarding claim 7, Thörn in view of Proko teaches the method of claim 1 (see claim 1 above), in addition, Thörn teaches:
wherein the gaze of the user is a plurality of saccades over a given text (Saccades are interpreted as rapid eye directions or movements; Lines 1-9 on paragraph 0114 discusses that the
user’s eye gaze direction may move rapidly between words), the method further comprising:
determining the given text as the portion of the transcription that has the likelihood of transcription error (Line 3 of paragraph 0095 through line 6 of paragraph 0097 indicates that a user
may activate a text editing function by an eye gaze directed to the word, hence identifying a
particular error in the text).

Regarding claim 8, Thörn in view of Proko teaches the method of claim 7 (see claim 7 above), in addition, Thörn teaches:
 further comprising:
using the one or more models, determining a plurality of text candidates as replacements for the portion of the transcription that has the likelihood of transcription error (Lines 1-8 on paragraph 0113, where models are implied as they are needed for the speech to
text conversion module to be operable and serve its purpose, User is allowed to select among
other candidate words for activation areas identified to have a likelihood of error);
displaying the plurality of text candidates (Lines 1-8 on paragraph 0113, where the
candidate words are displayed within the user interface mentioned previously as to perform
the text editing function); and
receiving an input from the user selecting one of the plurality of text candidates to replace the portion of the transcription that has the likelihood of transcription error (Lines 1-8
on paragraph 0113, The user is allowed to select among other candidate words for activation
areas identified to have a likelihood of error from the conversion module).

Claim 10 is directed to a system claim corresponding to the method claim presented in claim 1 and is rejected under the same grounds stated above regarding claim 1. Furthermore, Thörn teaches a computing system comprising (Lines 1-2 on paragraph 0068, Portable electronic equipment, where examples are a mobile phone, a cordless phone, a personal digital assistant (PDA) but not limited thereto, lines 1-3 on paragraph 0146):
a memory (e.g. portable electronic equipment comprises a non-volatile memory storing rules,
which are used by the processing device when the text editing function is activated, Lines 1-6
on paragraph 0077; additionally consider the implication of storage in memory by virtue of the teachings of known devices, "cellular telephone," which inherently include stored instructions for execution); and
a processor configured to execute software instructions embodied within the memory
(e.g. portable electronic equipment comprises...a processing device performs processing and
control operations... executes/activates functions, Lines 1-13 on paragraph 0076). 

Claim 11 is directed to a system claim corresponding to the method claim presented in claim 2 and is rejected under the same grounds stated above regarding claim 2.

Claim 16 is directed to a system claim corresponding to the method claim presented in claim 7 and is rejected under the same grounds stated above regarding claim 7.

Claim 17 is directed to a system claim corresponding to the method claim presented in claim 8 and is rejected under the same grounds stated above regarding claim 8.

Regarding claim 19, Thörn teaches A computing system (Lines 1-2 on paragraph 0068, Portable electronic equipment, where examples are a mobile phone, a cordless phone, a personal digital assistant (PDA) but not limited thereto, lines 1-3 on paragraph 0146) comprising:
a memory (e.g. portable electronic equipment comprises a non-volatile memory storing rules,
which are used by the processing device when the text editing function is activated, Lines 1-6
on paragraph 0077; additionally consider the implication of storage in memory by virtue of the teachings of known devices, "cellular telephone," which include stored instructions for execution);
a gaze tracker (Interpretation of gaze tracker is anything able to track eye direction
and/or movement, Lines 4-6 on paragraph 0036, Gaze tracking device tracks eye gaze direction of a user); and
a processor configured to execute software instructions embodied within the memory (e.g. portable electronic equipment comprises...a processing device performs processing and
control operations... executes/activates functions, Lines 1-13 on paragraph 0076) to:
receive a voice input from a user (Lines 1-9 on paragraph 0068, Speech to text
conversion module may determine a textual representation of a spoken utterance, so it
receives voice input);
determine a transcription of the voice input (Lines 1-9 on paragraph 0068, Speech to text conversion module may determine a textual representation of a spoken utterance);
display the transcription of the voice input (Lines 19-21 on paragraph 0074, Text from
the speech to text conversion module is displayed on display 5);
determine saccades and a gaze of the user with the gaze tracker (A gaze tracker is interpreted as anything able to track eye direction and/or movement; Lines 4-6 on paragraph 0036, Gaze tracking device tracks eye gaze direction of a user; furthermore, Thörn determines rapid eye movements, which are saccades under the broadest reasonable interpretation, as shown in para. 114, where the user’s eye gaze direction may move rapidly between words);
identifying a portion of the transcription that has a likelihood of transcription error based on the output of one or more models used in determining the transcription and based on the saccades going back-and-forth over the portion of the transcription that has the likelihood of transcription error while the gaze of the user dwells on the portion of the transcription that has the likelihood of transcription error (Line 3 of paragraph 0095 through line 6 of paragraph 0097 indicates that a score may be associated with words generated by the speech-to-text module 123, where models are implied because they are needed for the speech to text conversion module to be operable and serve its purpose, and that the user may activate a text editing function by an eye gaze directed onto the word, hence identifying a particular error in the text. "Based on the gaze of the user" is interpreted as any gaze of the user that gives an indication of a likelihood of error. Thörn meets the limitation because paras. 95-97 describe an ambiguity in assigning the correct word to the received speech signal by the speech to text conversion, i.e. by the models used. Furthermore, para. 93 indicates that a score may be used to quantify whether the speech to text module determines that it is likely to have misinterpreted the word. As used herein, the term “score” refers to a numerical value which is a quantitative indication for a likelihood, e.g. for a likelihood of a spoken utterance being correctly converted into a word or for a likelihood that a special character has to be inserted at an interword space. The portable equipment 1 allows the user to activate a text editing function for editing the word by an eye gaze directed onto the word, i.e. responsive to and corresponding to the portion of the transcription that has a likelihood of error based on the output of one or more models used in determining the transcription (the scoring that determines ambiguity) and based on the gaze of the user (the focusing/dwelling that gives an indication that there is a likelihood of error). Furthermore, para. 114 indicates that the user's eye gaze direction may move rapidly between words at which the user intends to perform a text editing operation and/or interword spaces at which the user intends to perform a text editing operation. Because the rapid eye movements go between words, they meet the back-and-forth language over the portion of the transcription that has the likelihood of transcription error based on the output of the model, while the gaze dwell time is greatest in the activation area, i.e. rapid eye movements occur while the gaze dwells.); and
revise the transcription of the voice input in accordance with the text input (Lines 1-8 on paragraph 0113, the user may select other candidate words and/or use textual character input to edit the word or interword space, for example by inserting items such as punctuation marks or other special characters).
	However, Thörn does not explicitly disclose:
identify a portion of the transcription that has a likelihood of transcription error based at least on the gaze of the user determined by the gaze tracker using one or more models executing classification functions trained with known gaze data;
Thörn identifies a portion of the transcription that has a likelihood of transcription error based at least on the gaze of the user determined by the gaze tracker (line 3 of paragraph 0095 – line 6 of paragraph 0097 indicate that a score may be associated with words generated by the speech-to-text module 123, where models are implied because they are needed for the speech to text conversion module to be operable and serve its purpose, and that the user may activate a text editing function by an eye gaze directed to the word, hence identifying a particular error in the text; “based on the gaze of the user” is interpreted as any gaze of the user that gives an indication of a likelihood of error. Thörn meets the limitations because paras. 95 – 97 indicate that there may be ambiguity in assigning the correct word to the received speech signal by the speech to text conversion, i.e. the models used; furthermore, para. 93 indicates that a score may be used to quantify whether the speech to text module determines that it is likely to have misinterpreted the word. As used herein, the term “score” refers to a numerical value which is a quantitative indication for a likelihood, e.g. for a likelihood of a spoken utterance being correctly converted into a word or for a likelihood that a special character has to be inserted at an interword space. The portable equipment 1 allows the user to activate a text editing function for editing the word by an eye gaze directed onto the word, i.e. responsive to and corresponding to the portion of the transcription that has a likelihood of error based on the output of one or more models used in determining the transcription, i.e. the scoring determines ambiguity, and the gaze of the user focusing/dwelling on the word indicates that there is a likelihood of error).
In a related field of endeavor (e.g. Techniques described herein leverage gaze input with gestures and/or speech input to improve spoken language understanding in computerized conversational systems, see abstract), Proko discloses that user utterances can include input transcribed from speech input 110. The user 102 can interact with the visual context without constraints on vocabulary, grammar, and/or choice of intent that can make up the user utterance. In some examples, user utterances can include errors based on transcription errors and/or particular speech patterns that can cause an error, see para. 19; examples of the transcription errors may be found in para. 22. Furthermore, the extraction module 216 receives speech input that is transcribed into user utterances, gaze input 304, and/or other forms of user 102 input, from which the extraction module can extract one or more lexical features. Lexical similarity describes a process for using words and associated semantics to determine a similarity between words in two or more word sets. Lexical features can determine lexical similarities between words that make up the text associated with one or more visual elements in a visual context and words in the speech input 302. The extraction module 216 can leverage automatic speech recognition ("ASR") models and/or general language models to compute the lexical features. The extraction module 216 can leverage various models and/or techniques depending on the visual context of the visual item, see para. 45. Lastly, the extraction module 216 can identify fixation points representing where a user’s 102 gaze lands in a visual context. The extraction module 216 can leverage a model to identify individual fixation points from the gaze input data 306. 
In at least one example, the extraction module 216 can leverage models such as velocity-threshold identification algorithms, hidden Markov model fixation identification algorithms, dispersion-threshold identification algorithms, minimum spanning trees identification algorithms, area-of-interest identification algorithms, and/or velocity-based, dispersion-based, and/or area-based algorithms to identify the fixation points from the gaze input data 306, see para. 49, i.e. the one or more models executing (here, working together and leveraging) the classification functions, as identified by the models and algorithms, trained with known gaze data. 
Modifying Thörn to include techniques disclosed by Proko discloses:
identify a portion of the transcription that has a likelihood of transcription error based at least on the gaze of the user determined by the gaze tracker using one or more models executing classification functions trained with known gaze data (e.g. Thörn’s method for revising a transcript, where it identifies a portion of the transcription that has a likelihood of transcription error based on at least the saccades going back-and-forth over the portion of the transcription that has the likelihood of transcription error while the gaze of the user dwells on the portion of the transcription that has the likelihood of transcription error determined by the gaze tracker, see paras. 95-97 now modified to include the feature using the one or more models executing classification functions trained with known gaze data as taught by Proko, see paras. 22, 45, and 49);
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teachings of Proko to the method of Thörn. Doing so would have been predictable to one of ordinary skill in the art given the similar nature of the references, for example, text transcript revision with multi-modal inputs. Further, including Proko’s features would have benefited users of Thörn by providing techniques for improving accuracy in understanding and resolving references to visual elements in visual contexts associated with conversational computing systems; tracking user gaze and leveraging gaze input based on the user gaze with gestures and/or speech input can improve spoken language understanding in conversational systems by improving the accuracy by which the system can understand and resolve references to visual elements in a visual context, as recognized by Proko, see para. 12.

Regarding claim 20, Thörn in view of Proko teaches the computing system of claim 19 (see claim 19 above), in addition Thörn teaches:
wherein the processor is configured to identify the portion of the transcription that has the likelihood of transcription error based on an output of one or more models and based on the gaze of the user comprising a plurality of saccades over a given text (Lines 1-9 on paragraph 0075 discuss that portable electronic equipment 1 comprises a processing device 4 coupled to the gaze tracking device, where the processing device may be one or more processors to perform processing and control operations; line 3 of paragraph 0095 – line 6 of paragraph 0097 indicate that a score may be associated with words generated by the speech-to-text module 123, where models are implied because they are needed for the speech to text conversion module to be operable and serve its purpose, and that the user may activate a text editing function by an eye gaze directed to the word, hence identifying a particular error in the text; saccades are interpreted as rapid eye directions or movements; Lines 1-9 on paragraph 0114 discuss that the user’s eye gaze direction may move rapidly between words); and
the processor is configured to determine the given text as the portion of the transcription that has the likelihood of transcription error (Lines 1-9 on paragraph 0095 indicate that the ambiguity, i.e. the scores for the text outputted by module 123, and the use of eye gaze activate the text editing function, hence determining that there is a likelihood of error).

Claims 3-4, 9, 12-13, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Thörn in view of Proko and further in view of Amento (WO 2007/101089 A1).
Regarding claim 3, Thörn in view of Proko teaches the method of claim 1 (see claim 1 above),
However, Thörn in view of Proko fails to explicitly disclose:
wherein the model is a general or specialized language model.
In a related field of endeavor (e.g. speech processing with confidence ranges, see abstract), Amento discloses wherein the model is a general or specialized language model (Lines 1-3 on paragraph 0032, ASR 202 may update its language and acoustical models to improve speech recognition accuracy; it is implied that combinations of the models are permissible depending on the particular application, as there are no limiting statements within the disclosure, and whether the language model is general or specialized is interpreted as a modification, omission, addition, and/or substitution relative to what is considered to be general).
Modifying Thörn in view of Proko to include the techniques disclosed by Amento discloses:
wherein the model is a general or specialized language model (e.g. Thörn’s method for revising a transcript, where it uses models in its speech to text now also including the feature where the model is a general or specialized language model as taught by Amento, see para. 32).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teachings of Amento to the method of Thörn in view of Proko. Doing so would have been predictable to one of ordinary skill in the art given the similar nature of the references, for example, speech processing with confidence values. Further, including Amento’s features would have benefited users of Thörn in view of Proko by improving speech processing accuracy, as recognized by Amento, see abstract. 

Regarding claim 4, Thörn in view of Proko teaches the method of claim 1 (see claim 1 above),
However, Thörn in view of Proko fails to explicitly disclose:
wherein the model is an acoustical language model.
In a related field of endeavor (e.g. speech processing with confidence ranges, see abstract), Amento discloses wherein the model is an acoustical language model (Lines 1-3 on paragraph 0032, interpretation of acoustical language model is both an acoustic and language model, where the ASR 202 may update its language and acoustical model to improve speech recognition accuracy).
Modifying Thörn in view of Proko to include the techniques disclosed by Amento discloses:
wherein the model is an acoustical language model (e.g. Thörn’s method for revising a transcript, where it uses models in its speech to text now also including the feature where the model is an acoustical language model as taught by Amento, see para. 32).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teachings of Amento to the method of Thörn in view of Proko. Doing so would have been predictable to one of ordinary skill in the art given the similar nature of the references, for example, speech processing with confidence values. Further, including Amento’s features would have benefited users of Thörn in view of Proko by improving speech processing accuracy, as recognized by Amento, see abstract. 

Regarding claim 9, Thörn in view of Proko teaches the method of claim 1 (see claim 1 above),
However, Thörn in view of Proko fails to explicitly disclose:
wherein identifying the portion of the transcription that has the likelihood of transcription error is further based on stylus input received from a stylus.
In a related field of endeavor (e.g. speech processing with confidence ranges, see abstract), Amento discloses wherein identifying the portion of the transcription that has the likelihood of transcription error is further based on stylus input received from a stylus (Lines 4-6 on paragraph 0030, selecting an input mechanism through a pointing device, which is essentially a stylus, and Lines 1-5 on paragraph 0031, stylus input for the select and replace tool).
Modifying Thörn in view of Proko to include the techniques disclosed by Amento discloses:
wherein identifying the portion of the transcription that has the likelihood of transcription error is further based on stylus input received from a stylus (e.g. Thörn’s method for revising a transcript, now also including the feature where identifying the portion of the transcription that has a likelihood of error is further based on stylus input received from a stylus as taught by Amento, see paras. 31-32).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teachings of Amento to the method of Thörn in view of Proko. Doing so would have been predictable to one of ordinary skill in the art given the similar nature of the references, for example, speech processing with confidence values. Further, including Amento’s features would have benefited users of Thörn in view of Proko by providing other forms of input for selection, such as a keyboard, a pointing device, a stylus, or a finger on a touchscreen, see paras. 31-32. Furthermore, Proko discusses input peripheral devices (e.g. a keyboard, a mouse, a pen, a game controller, a voice input device, a touch input device, gestural input device, eye and/or body tracking device and the like), i.e. a touch input device encompasses a stylus, as a person of ordinary skill in the art would recognize that a stylus is a touch input device; furthermore, “and the like” covers other similar devices and techniques by substitution. 

Claim 12 is directed to a system claim corresponding to the method claim presented in claim 3 and is rejected on the same grounds stated above regarding claim 3.

Claim 13 is directed to a system claim corresponding to the method claim presented in claim 4 and is rejected on the same grounds stated above regarding claim 4.

Claim 18 is directed to a system claim corresponding to the method claim presented in claim 9 and is rejected on the same grounds stated above regarding claim 9. Furthermore, a stylus operatively coupled to the processor is interpreted as a processor that is capable of detecting operation of a pointer.

Claims 5 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Thörn in view of Proko and further in view of Amento and further in view of Thomson (U.S. Patent No. 10,388,272 B1).
Regarding claim 5, Thörn in view of Proko teaches the method of claim 1 (see claim 1 above),
However, Thörn in view of Proko fails to explicitly disclose:
wherein the model is a character language model.
In a related field of endeavor (e.g. generating transcriptions from user inputs, see abstract), Thomson discloses systems and methods for reducing the inaccuracy and time required to generate transcriptions with editing capabilities (Lines 55-58 on column 4). Furthermore, Thomson teaches wherein the model used for ASR may be a language model including subword probabilities where subwords may be phonemes, syllables, characters, or other subword units (Lines 56-61 on column 41), where ASR system 520 may be an example of the ASR systems 120 of FIG. 1 (Lines 25-29 on column 41).
Modifying Thörn in view of Proko to include the techniques disclosed by Thomson discloses:
wherein the model is a character language model (e.g. Thörn’s method for revising a transcript, now also including the feature where the model is a character language model as taught by Thomson, see lines 55-58 on col. 4).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teachings of Thomson to the method of Thörn in view of Proko. Doing so would have been predictable to one of ordinary skill in the art given the similar nature of the references, for example, speech processing, i.e. generating transcriptions from user inputs. Further, including Thomson’s features would have benefited users of Thörn in view of Proko by providing a secondary language model that includes subword probabilities, i.e. a character language model, which can handle out-of-vocabulary words that were not previously present in the limited language model (Lines 56-67 on column 41) of Amento, thereby improving technology with respect to automatic speech recognition, audio transcriptions, and real-time generation, as recognized by Thomson (Lines 65 on column 5 – line on column 6).

Claim 14 is directed to a system claim corresponding to the method claim presented in claim 5 and is rejected on the same grounds stated above regarding claim 5.


Claims 21 and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Thörn in view of Proko and further in view of Amento and further in view of Peters et al. (US Pub. No. 2015/0070262 A1), hereinafter Peters. 
Regarding claim 21, Thörn in view of Proko teaches the computing system of claim 19 (see claim 19 above),
However, Thörn in view of Proko fails to explicitly disclose:
wherein the gaze tracker is configured to capture an image of the user including light reflected from eyes of the user along with detected or estimated pupil locations of the eyes of the user.
In a related field of endeavor (e.g. eye tracking associated with displaying an element of a message, see abstract), Peters teaches that the user may have a computing device (e.g. a tablet computer, a head mounted gaze tracking device (e.g. Google Glass®, etc.), a smart phone, and the like) that includes an outward-facing camera and/or a user-facing camera with an eye-tracking system, see para. 27. Para. 37 indicates that eye-tracking module 340 may utilize information from at least one digital camera 320 (outward and/or user-facing) and/or an accelerometer 350 (or similar device that provides positional information of user device 310) to track the user's gaze 360. Eye-tracking module 340 may map eye-tracking data to information presented on display 330, i.e. a camera captures an image of the user by capturing light reflected from the user, including from the user's eyes. Furthermore, para. 40 indicates that a high-resolution camera and other image processing tools may be used to detect the pupil, i.e. the pupil location can be detected, as further illustrated by figure 3. 
Modifying Thörn in view of Proko to include the features disclosed by Peters discloses:
wherein the gaze tracker is configured to capture an image of the user including light reflected from eyes of the user along with detected or estimated pupil locations of the eyes of the user (e.g. Thörn’s computerized method including the features of Proko now also including the features of the gaze tracker wherein the gaze tracker is configured to capture an image of the user including light reflected from eyes of the user along with detected or estimated pupil locations of the eyes of the user as taught by Peters, see para. 27, 37, and 40 relating to figure 3 and description of the gaze tracking device).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teachings of Peters to the method of Thörn in view of Proko. Doing so would have been predictable to one of ordinary skill in the art given the similar nature of the references, for example, gaze tracking with displayed messages. Further, including Peters’ features would have benefited users of Thörn in view of Proko by improving the relevance, timeliness, and overall quality of the results of parsing and annotating text messages using bioresponse data, see para. 7, as recognized by Peters. 

Regarding claim 22, Thörn in view of Proko teaches the method of claim 1 (see claim 1 above),
However, Thörn in view of Proko fails to explicitly disclose:
wherein an image of the user including light reflected from eyes of the user is captured by the gaze tracker along with detected or estimated pupil locations of the eyes of the user, to determine the saccades and the gaze of the user.
In a related field of endeavor (e.g. eye tracking associated with displaying an element of a message, see abstract), Peters teaches that the user may have a computing device (e.g. a tablet computer, a head mounted gaze tracking device (e.g. Google Glass®, etc.), a smart phone, and the like) that includes an outward-facing camera and/or a user-facing camera with an eye-tracking system, see para. 27. Para. 37 indicates that eye-tracking module 340 may utilize information from at least one digital camera 320 (outward and/or user-facing) and/or an accelerometer 350 (or similar device that provides positional information of user device 310) to track the user's gaze 360. Eye-tracking module 340 may map eye-tracking data to information presented on display 330, i.e. a camera captures an image of the user by capturing light reflected from the user, including from the user's eyes. Furthermore, para. 40 indicates that a high-resolution camera and other image processing tools may be used to detect the pupil, i.e. the pupil location can be detected, as further illustrated by figure 3. Furthermore, the pupil location determinations give insight as to the gaze direction and saccadic movements, as described in para. 28. 
Modifying Thörn in view of Proko to include the features disclosed by Peters discloses:
wherein an image of the user including light reflected from eyes of the user is captured by the gaze tracker along with detected or estimated pupil locations of the eyes of the user, to determine the saccades and the gaze of the user (e.g. Thörn’s method including the features of Proko now also including the features of wherein an image of the user including light reflected from eyes of the user is captured by the gaze tracker along with detected or estimated pupil locations of the eyes of the user, to determine the saccades and the gaze of the user as taught by Peters, see paras. 27-28, 37, and 40 relating to figure 3 and description of the gaze tracking device).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teachings of Peters to the method of Thörn in view of Proko. Doing so would have been predictable to one of ordinary skill in the art given the similar nature of the references, for example, gaze tracking with displayed messages. Further, including Peters’ features would have benefited users of Thörn in view of Proko by improving the relevance, timeliness, and overall quality of the results of parsing and annotating text messages using bioresponse data, see para. 7, as recognized by Peters. 

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure. 
DIAKO (WO 2016124668 A1) teaches a gaze tracker and a computer-implemented method for gaze tracking, comprising the steps of: recording video images of a being's eye such that an eye pupil and a glint on the eye ball caused by a light source are recorded; processing the video images to compute an offset between the position of the predetermined spatial feature and a predetermined position with respect to the glint; by means of the light source such as a display, emitting light from a light pattern at a location selected among a multitude of preconfigured locations of light patterns towards the being's eye; wherein the location is controlled by a feedback signal; controlling the location of the light pattern from one location to another location among the predefined locations of light patterns, in response to the offset, such that the predetermined position with respect to the glint caused by the light source tracks the predetermined spatial feature of the being's eye; wherein the above steps are repeated to establish a control loop with the location of the light pattern being controlled via the feedback signal, see abstract.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JONATHAN E AMAYA HERNANDEZ whose telephone number is (571)272-2484. The examiner can normally be reached Monday - Friday 7:30 am - 3:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andy Flanders, can be reached at 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/J.E.A./             Examiner, Art Unit 2655         

/ANDREW C FLANDERS/             Supervisory Patent Examiner, Art Unit 2655