DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
In response to the Final Office Action mailed 4/1/2022, applicant has submitted an After-Final amendment filed 7/1/2022.
Claims 5, 7, 9, 10, 23, and 29 have been amended.  Claim 36 has been cancelled.  New claim 41 has been added.
Response to Arguments
Applicant’s arguments have been acknowledged.
Claim Interpretation
The phrase “the word or phrase defining the second text based command” (in lines 4-5 of claim 5 and in line 6 of claim 7) is not ambiguous because, even though lines 3-4 of claim 5 and lines 3-4 of claim 7 recite “a word or phrase defining the second text based command”, “a word or phrase defining the second text based command” is part of the complete phrase “the text to speech rendering of a word or phrase defining a second text based command” (i.e. “a word or phrase defining the second text based command” in lines 4-5 of claim 5 is the same as “a word or phrase defining a second text based command” in lines 6-7 of claim 1).
Allowable Subject Matter
Claims 1, 5, 7, 9-10, 12, 17, 20, 23, 27, 29, 31, 33-35, and 37-41 are allowed.
The following is an examiner’s statement of reasons for allowance:

As per Claim 1 (and similarly claim 34, and consequently claims 5, 7, 9, 12, 23, 29, 33, 35-40 which depend on claims 1 and 34):
	The prior art of record does not teach or suggest the combination of all limitations in claim 1, including (i.e. in combination with the remaining limitations in claim 1) examining resources to generate a list that includes a plurality of resource names that can be referenced in text based commands that the computer system is configured to execute, wherein responsive to determining that a word or phrase of a second candidate directive invoking vocal utterance does not sound confusingly similar to electronically synthesized speech renderings of resource names of the plurality of resource names (where the plurality of resource names are resource names “that can be referenced in text based commands that the computer system is configured to execute”), storing voice process configuration data that establishes a directive invoking vocal utterance based on the second candidate directive invoking vocal utterance as a registered directive invoking vocal utterance (for claim 1, a file/directory name by itself cannot fairly be interpreted as “candidate invoking” because there are multiple possible functions that can be performed with regard to a file/directory [save, delete, open] and the name by itself does not clearly invoke any directive to perform any particular function).
The prior art of record does not teach or suggest the combination of all limitations in claim 34, including (i.e. in combination with the remaining limitations in claim 34) examining resources to generate a list that includes a plurality of resource names that can be referenced in text based commands that the computer system is configured to execute and wherein the determining includes determining whether the word or phrase of the candidate directive invoking vocal utterance (which is “for invoking a directive to execute a first text based command to perform a first computer function of a computer system”) sounds confusingly similar to a speech rendering of a resource name of the plurality of resource names (where the plurality of resource names are resource names “that can be referenced in text based commands that the computer system is configured to execute”).
Agapi suggests wherein the method includes examining resources to generate a…that includes a plurality of…names…and wherein the determining includes determining whether the word or phrase of the candidate directive invoking vocal utterance sounds confusingly similar to…a…name of the plurality of…names (Figures 1-2; paragraphs 8-9, 24, 27-29, 31-33, 35, 38-44, 46-51, 56-57;
The combination [thus far] is as discussed in the rejection of claim 1, above.
Paragraph 32 describes where “Before any new user-defined voice command… is accepted” the new command is compared against preexisting commands in the grammar data store, including by parsing the voice command and checking the entire command and each parsed piece for potential ambiguities with preexisting commands.  Paragraphs 27-29 similarly describe where voice commands are compared against a “set” of commands defined within the grammar data store, where the grammar data store can include, among other things, a user-defined grammar, and where the user-defined grammar can include a set of user-defined commands.  Paragraphs 39-44 similarly describe parsing a voice command and analyzing “each parsed portion” to determine potential recognition ambiguities, and also more particularly describe where the portion being parsed and checked for ambiguities is a NAME.  Paragraphs 47-50 similarly describe parsing a voice command and determining likelihood of confusion for components, and more particularly describe where a “new-user-defined voice command” is received and where the components are “compared against preexisting voice commands”.  Paragraph 42 further describes where a NAME can be a word, phrase, or sentence, parsed portions of which can be similar to other commands, and where parsed portions are words.  Paragraphs 49 and 51 further describe where a new user-defined command can be accepted and associated with “one or more” programmatic actions.
These portions further suggest “wherein the method includes examining resources to generate a… that includes a plurality of… names… and wherein the determining includes determining whether the word or phrase of the candidate directive invoking vocal utterance sounds confusingly similar to… a… name of the plurality of… names” [analyzing/”examining” previous new user-defined voice command NAMES which are used as “resources” used to define new user-defined voice commands for the system, and accepting, into the user-defined portion of the grammar data store, those NAMES that are not likely to be confused with preexisting commands, thereby contributing to the generation of a set of preexisting command NAMES that are compared to the “current” new user-defined voice command/”candidate directive invoking vocal utterance” to “determine whether the word or phrase of the candidate directive invoking vocal utterance sounds confusingly similar to” a NAME in the set of NAMES])
Agapi suggests wherein the determining includes determining whether the word or phrase of the candidate directive invoking vocal utterance sounds confusingly similar to…a…name of the plurality of…names.  Agapi, in view of Ittycheriah, does not, but Blandin suggests wherein the determining includes determining whether the word or phrase of the candidate directive invoking vocal utterance sounds confusingly similar to a speech rendering of a…name of the plurality of…names (Paragraphs 36 and 40;
Same combination as discussed in the rejection of claim 1, where performing TTS on the existing commands [i.e. the existing command NAMES] and comparing those NAMES to the new user-defined voice command determines whether the new user-defined voice command’s word[s] are confusingly similar to the TTS “speech rendering” of the NAMES in the set of NAMES that are compared to the new user-defined voice command)
Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing to perform a simple substitution of one type of acoustic similarity comparison with another because the prior art teaches the claimed invention except for the substitution of an acoustic similarity comparison which does not necessarily compare TTS-generated audio of a word with input audio with an acoustic similarity comparison which does.  Blandin teaches that an acoustic similarity comparison which does compare TTS-generated audio of a word with input audio was known in the art.  One of ordinary skill in the art could have substituted one type of acoustic similarity comparison with another to obtain the predictable results of a system which receives, from a user, a new user-defined voice NAME, compares input speech of a new user-defined voice NAME to a set of existing voice NAMEs, and provides the user a warning that the new user-defined voice NAME may be confused with an existing NAME (as per Agapi) where the warning identifies both the input and which word may be confused with the input (as per Ittycheriah) where the comparing compares the input speech to a TTS-generated audio of a word (as per Blandin).
	Agapi suggests examining resources to generate a… that includes a plurality of… names. Agapi, in view of Ittycheriah and Blandin, does not, but Gammel suggests examining resources to generate a list that includes a plurality of… names (“comparing the name to be enrolled to the names in the database to reject any name that is too similar”, col. 1, lines 36-41; “During similar name checking… match an existing name on the list… already on the list”, col. 5, lines 54-65; “If a third utterance is requested for enrollment, then that name is checked first to see if it is too similar to another name on the list”, col. 8, lines 49-53; “it is determined if that name is too similar… to a name already on the speed dial list”, Abstract; 
	Gammel, like Agapi, teaches comparing names to existing names in a database to determine if any are too similar.  
	Agapi teaches comparing a new user-defined voice command to a “set” [see Agapi, paragraph 27] but Gammel more specifically describes where the set of names compared to a name is more specifically a “list”.
Gammel thus suggests where the set of existing voice command NAMES which is compared to the new user-defined voice command in Agapi is more specifically a “list” of existing voice command NAMES [as opposed to a “set” which is not necessarily a list])
Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing to perform a simple substitution of one type of set which is compared to a name with another because the prior art teaches the claimed invention except for the substitution of a set which is compared to a name which is not necessarily a list with a set which is compared to a name which is.  Gammel teaches that a list which is compared to a name was known in the art.  One of ordinary skill in the art could have substituted one type of set which is compared to a name with another to obtain the predictable results of a system which receives, from a user, a new user-defined voice NAME, compares input speech of a new user-defined voice NAME to a set of existing voice NAMEs, and provides the user a warning that the new user-defined voice NAME may be confused with an existing NAME (as per Agapi) where the warning identifies both the input and which word may be confused with the input (as per Ittycheriah) where the comparing compares the input speech to a TTS-generated audio of a word (as per Blandin) where the set is a list (as per Gammel).
	Yaker suggests where a name entered by a user can be a spoken file name for a newly created file, and Patch suggests where a user can also speak to generate a list of file names and/or directory names.
	An additional reference, however, would be required to address where Agapi’s new user-defined command (which is for invoking the execution of a computer system function) is compared to a list of file names or directory names that can be referenced in text based commands that the computer system is configured to execute.
	Agapi does teach where a name can be parsed into different words (paragraph 42) and where a portion of one command can be confused with the entirety of another command (paragraphs 8-9); Bickley does teach where a command can include a function and a word (play message and save message in paragraph 7); and Gopinath (cited in the rejection of claims 4 and 15 in the Office Action mailed 10/21/2020) teaches where a command can have a function component and a reference to a name (e.g. Call Bob).  While these references may suggest where Agapi’s commands can be commands that include a function component and a person/object name component, where Agapi’s system parses the name component and compares the name component to existing commands, these references, at a minimum, do not reasonably suggest where an object/person name portion of a new user-defined voice command is compared to an object/person name in the existing voice commands.
	In contrast, the combination applied to reject claim 13 suggests claim 13 because, in the combination, an input is a file name which is compared to a list of file names (i.e. comparing two names of the same type, and not comparing an input command type input with a list of file name type names).
Also in contrast, Claim 23 does not specify that the resource names are resource names that can be referenced in text based commands that the computer system is configured to execute (such that the commands in Agapi, themselves, can be interpreted as resource names).

As per Claim 17 (and consequently its dependent claims 10, 20, 27, and 31), the prior art of record does not teach or suggest the combination of all limitations in claim 17, including (i.e. in combination with the remaining limitations in claim 17) receiving a candidate audio data set with the candidate audio data set including: (i) a candidate text proposed for association with a candidate text based command, and (ii) audio data corresponding to a candidate text to speech rendering of the candidate text, determining, using the candidate text to speech rendering (i.e. the text to speech rendering that is included in the candidate audio data set, along with the candidate text proposed for association with a candidate text based command) that speech recognition software is likely to misidentify utterances of the candidate text as corresponding to a text based command other than the candidate text based command.
Bickley et al. (US 2003/0069729), while teaching the use of text-to-speech conversion (i.e. “rendering” text into “speech”) as part of the confusability prediction (paragraph 61) specifically teaches away from the use of TTS to compare spoken phrases (paragraphs 14 and 48).  Therefore, one of ordinary skill in the art, reading the passages in Bickley which teach away from TTS, would not find obvious a combination which includes the use of TTS.  Paragraph 12 also appears to describe where text to speech causes a system to be speaker/system speech recognition dependent and that a more reliable method to predict acoustic confusability is needed when using a combination of a text phrase and an audio file, and therefore it would not be obvious to one of ordinary skill in the art to combine Bickley with electronic synthesis based on “the voice data” received from a user or based on a user’s vocal tendencies which is designed to make the speech rendering user dependent (as per claims 6-7, whereas Bickley appears to be directed to user/system independent).
The prior art also teaches where enrollment data can include both a speech recording (i.e. not a TTS rendering) and text of the utterance(s) spoken by the speaker.  In this reference, enrollment data to be used for adapting the trained neural network to a speaker is obtained, where enrollment data comprises speech data corresponding to one or more utterances spoken by the user, and enrollment data may comprise information indicating the content of the utterance(s) spoken by the speaker, such as the text of the utterances.
2017/0169815 “The data used for adapting an acoustic model to a speaker is referred to herein as "enrollment data." Enrollment data may include speech data obtained from the speaker, for example, by recording the speaker speak one or more utterances in a text. Enrollment data may also include information indicating the content of the speech data such as, for example, the text of the utterance(s) spoken by the speaker and/or a sequence of hidden Markov model output states corresponding to the content of the spoken utterances”, paragraph 4; “After the trained neural network acoustic model is accessed at act 102, process 100 proceeds to act 104, where enrollment data to be used for adapting the trained neural network to a speaker is obtained. The enrollment data comprises speech data corresponding to one or more utterances spoken by the speaker. The speech data may be obtained in any suitable way. For example, the speaker may provide the speech data in response to being prompted to do so by a computing device that the speaker is using (e.g., the speaker's mobile device, such as a mobile smartphone or laptop). To this end, the computing device may prompt the user to utter a predetermined set of one or more utterances that constitute at least a portion of enrollment text. Additionally, the enrollment data may comprise information indicating the content of the utterance(s) spoken by the speaker (e.g., the text of the utterance(s), a sequence of HMM output states corresponding to the utterance(s), etc.)”, paragraph 36;
2005/0071163 “For example, the user could input the text string "Welcome to the IBM text-to-speech system" in the text input field (42) and then click on the record button (43) to start recording as the user recites the same text string into the microphone in the manner in which the user wants the system to reproduce the synthesized speech. When the input utterance is complete, the user can click on the stop button (44) to stop the recording process”, paragraph 26; 
The prior art teaches “In some embodiments, details sub-node 414 includes text or recorded spoken words from the speech input, a digitized or text-to-speech version of a text input from the user, and/or the current location of user device 104 (FIG. 1) for inclusion in the automatic response” (which suggests where “details sub-node” includes text and a TTS version of a text input from a user)
2015/0045003 paragraphs 112-113
The prior art teaches determining similarity between activation words of voice recognition devices by converting words to phonetic symbol strings and then determining their edit distance, and if the activation words are the same or similar, then a warning is issued. (Comparing activation words of two different voice recognition devices)
2017/0053650 paragraphs 73-74;
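For illustration only, the comparison summarized above for 2017/0053650 (words converted to phonetic symbol strings, then compared by edit distance, with a warning when the words are the same or similar) can be sketched as follows. The function names, the `max_distance` threshold, and the hand-written phonetic strings are assumptions of this sketch and are not taken from the reference.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of phonetic symbols."""
    prev = list(range(len(b) + 1))
    for i, sym_a in enumerate(a, start=1):
        curr = [i]
        for j, sym_b in enumerate(b, start=1):
            cost = 0 if sym_a == sym_b else 1
            curr.append(min(prev[j] + 1,          # delete a symbol from a
                            curr[j - 1] + 1,      # insert a symbol from b
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def should_warn(word_a, word_b, max_distance=1):
    """Return True when two activation words are the same or similar.

    max_distance is a hypothetical threshold chosen for this illustration.
    """
    return edit_distance(word_a, word_b) <= max_distance


# Hand-written, ARPAbet-style symbol strings (assumed, for illustration only).
CAR = ["K", "AA", "R"]
CARD = ["K", "AA", "R", "D"]
HELLO = ["HH", "AH", "L", "OW"]
```

In this sketch the comparison operates on symbol strings only; no audio is synthesized or compared.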
David B. Roe and Michael D. Riley (“Prediction of Word Confusabilities for Speech Recognition”) teach determining phonetic pronunciation of words from text by performing TTS, where words that have similar phonetic pronunciations are likely to be confused.  Section 3.1 appears to describe limitations (which can be considered teaching away).  Section 2 also seems to say that the approach is a simpler calculation than comparison at the acoustic level (which seems to suggest that comparisons are not being made between speech renderings).
Roe et al. “basic idea behind predicting word confusability is simple.  Text-to-speech systems can determine the phonetic pronunciation of words from text.  Words that have similar phonetic pronunciations are likely to be confused by speech recognizers”, Section 1. Introduction; “Though there are several approaches for determining the acoustic similarity between words, we choose an approach based on phonetic pronunciation and a measure of confusability of the phonetic units rather than acoustic examples… of the words themselves.  Given two potentially similar words, we begin with their phonetic pronunciations from a text to speech synthesizer.  Then we estimate the probability that the phonetic pronunciation of the first word will be misrecognized as the second word, rather than the first… allows an estimate of confusability before recording speech utterances to find the actual pronunciations of the desired vocabulary… benefit of simplicity of calculation compared to estimates of similarity at the acoustic level… drawback that actual pronunciations may not be represented accurately by the phonetic pronunciations from a dictionary”, Section 2. Theory
The prior art describes performing speech rendering by converting an input text phrase to a synthesized speech rendering, performing text transcription and then comparing the text transcription with a list of test phrases (in order to identify acoustic similarity).  Paragraph 25 describes converting two text phrases into phoneme sequences then calculating phonetic distance. (where Roe describes phonetic distance between phonetic pronunciations determined by TTS).  Paragraph 18 has phonetic transcription corresponding to a symbolic representation of how a spoken rendering of the text should sound which seems to suggest that the phonetic transcription is not a spoken rendering, and all other instances seem to suggest the rendering is the actual audio sound, not the phonetic representation (which suggests that phonetic pronunciation is not a rendering)
Rao et al. (US 2019/0295531) “As one particular example, an input text phrase of "profit" can be input by a user. The input text phrase can be converted to an audio output corresponding to a synthesized speech rendering of the word "profit." A text transcription of the audio output can be determined. For instance, the text transcription can be a transcription that reads as "prophet," which is a homophone (e.g. phonetically similar) to the word "profit." The text transcription can be compared against a list of test phrases to identify a match between the text transcription and one or more of the test phrases. If the list of test phrases includes the word "prophet," a match can be found, and the input text phrase, "profit," can be identified as being phonetically similar to the word "prophet," as found in the list of test phrases”, paragraph 16 [supported by paragraph 16 of 62/410,564]; paragraphs 18-19
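For illustration only, the sequence quoted above from Rao (input text phrase, synthesized speech rendering, text transcription of that rendering, comparison of the transcription against a list of test phrases) can be sketched as follows. The synthesizer and recognizer are passed in as callables because Rao's actual components are not described here; `fake_synthesize` and `fake_transcribe` are stand-ins written for this sketch and perform no real speech processing.

```python
def find_phonetic_matches(input_phrase, test_phrases, synthesize, transcribe):
    """Return the test phrases that equal the transcription of the rendering.

    synthesize: callable taking text and returning audio (any object).
    transcribe: callable taking that audio and returning text.
    Both callables are assumptions of this sketch.
    """
    audio = synthesize(input_phrase)
    transcription = transcribe(audio).strip().lower()
    return [p for p in test_phrases if p.strip().lower() == transcription]


def fake_synthesize(text):
    """Stand-in for a TTS engine: carries the text through as bytes."""
    return text.encode("utf-8")


def fake_transcribe(audio):
    """Stand-in for a recognizer that hears "profit" as "prophet"."""
    heard_as = {"profit": "prophet"}
    text = audio.decode("utf-8")
    return heard_as.get(text, text)
```

With these stand-ins, `find_phonetic_matches("profit", ["prophet", "loss"], fake_synthesize, fake_transcribe)` returns `["prophet"]`, mirroring the "profit"/"prophet" example in paragraph 16.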
Another prior art reference also teaches receiving, from a user, an input nametag (e.g. a phrase) via microphone (obviously audio) or keyboard (obviously text) and applying TTS to text input in order to make confusability calculations.  In this reference, paragraph 58 describes calculating confusability using the TTS sequence of phonemes by comparing the “text entry sequence of phonemes” (at least suggested to be the TTS-generated sequence of phonemes) to “phonemes of entries already stored in at least one of the domains”.  TTS, in this case (paragraph 51), also converts the text entry into a sequence of phonemes which, similar to what was discussed above pertaining to Roe and Rao, is not necessarily a speech rendering.  Paragraph 59 further describes an example of phoneme comparison which looks like it compares data representations of phonemes (e.g. JH/IH/M and T/IH/M have a 2/3 overlap) which suggests that “sequence of phonemes” is not a speech rendering.  This reference also teaches where ASR detects presence of not just nametags but also spoken commands and numbers (paragraph 2).  Paragraph 4 also teaches where a user tries to store a nametag that sounds like an already-stored nametag, number or command, and where confusability between similar sounding words is known as a “substitution error”.  Paragraph 6 also teaches where confusability scores are calculated by comparing an uttered nametag with all previously stored nametags and commands combined, and prompting the user to use a different nametag when a confusability calculation is too high (this paragraph only teaches away from using this technique for numbers).  
This reference, however, does not specifically teach that TTS is also applied to the entity being compared to the input (paragraph 58 describes comparing phonemes generated by TTS to phonemes of entries already stored in at least one of the domains but does not specifically state that those phonemes of already-stored entries are also TTS electronically synthesized from the entries [i.e. “rendered” into “speech” from the entries]).  This reference is particularly directed towards storing nametags but may suggest (but does not specifically describe) where a user is trying to enter a new command (for claim 1).  
As per claim 17, this reference does not appear to teach where the text and the TTS speech rendering are part of the same entity.  Additionally, since the phonemes are not necessarily speech renderings (in the acoustic level audio signal sense) it is also not clear that a set including the text and a TTS speech rendering is received (as opposed to text and a phonetic text representation).
Additionally, in this reference, the TTS phoneme comparison is directed to the speaker-independent embodiment (paragraph 51), which involves inputting a new nametag by typing text and performing TTS on the typed text, whereas the speaker-dependent embodiment (paragraph 52), which involves entering a new nametag by microphone, makes confusability determinations by using SLMs and comparing confidence levels for a nametag/number/command domain to thresholds (i.e. the speaker-dependent embodiment appears to be based on speech recognition techniques and not TTS-based phoneme comparison, and thus this reference does not read on claim 1).
Chengalvarayan et al. (US 2011/0288867) “nametag input for the nametag is received from the user and processed… receive the nametag input via the microphone… other examples… alphabetical or alpha-numerical keyboard”, paragraph 50; “speaker-independent… input is a text entry from the user… TTS… converts the text entry to a sequence of phonemes”, paragraph 51; “speaker-dependent… nametag input is an utterance from the user”, paragraph 52; “nametag confusability”, paragraph 55; “confusability of the nametag input is calculated with previously stored nametags”, paragraph 56; “confusability calculation can be based on a comparison of the text entry sequence of phonemes to phonemes of entries already stored in at least one of the domains… TTS… convert the text entry into the sequence of phonemes… confusability score is maximum if the sequence of phonemes converted from the text entry corresponds identically to a sequence of phonemes of any stored entry in any of the domains”, paragraph 58; “nametag input can be ‘Kaushik’s Cell Phone’… recognition result can be ‘741.’”, paragraph 60;
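For illustration only, the phoneme comparison discussed above for Chengalvarayan (e.g. JH/IH/M and T/IH/M having a 2/3 overlap, with the score at its maximum when the sequences are identical) can be sketched as a position-by-position overlap ratio. The scoring formula is not given in the passages quoted above; the positional ratio below is an assumption chosen only because it reproduces the 2/3 figure, and the function names are hypothetical.

```python
def phoneme_overlap(new_entry, stored_entry):
    """Fraction of positions at which two phoneme sequences agree.

    Divides by the longer length so that only identical sequences score 1.0.
    This formula is an assumption of the sketch, not the reference's method.
    """
    longest = max(len(new_entry), len(stored_entry))
    if longest == 0:
        return 0.0
    matches = sum(1 for a, b in zip(new_entry, stored_entry) if a == b)
    return matches / longest


def max_confusability(new_entry, stored_entries):
    """Highest overlap between a new entry and any already-stored entry."""
    return max((phoneme_overlap(new_entry, s) for s in stored_entries),
               default=0.0)


# Phoneme sequences from the example discussed above.
JIM = ["JH", "IH", "M"]
TIM = ["T", "IH", "M"]
```

The inputs here are symbolic phoneme labels rather than audio, which is consistent with the observation above that a “sequence of phonemes” is not necessarily a speech rendering.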
The prior art teaches keyword spotting that performs text-to-speech on keywords and then correlates the text-to-speech audio signals of keywords to event audio using acoustic similarity measures to spot keyword occurrences.
Blandin (US 2017/0169816) paragraph 40;
The prior art teaches comparing a name and determining that a name is too similar to a name already on a speed dial list.
	Gammel et al. (US 5832429) “request for a new template is received it is determined if the list of speed dial names is full (Step 201) and is not it is determined if that name is too similar (Step 205) to a name already on the speed dial list. If so, that name is rejected but if not it is determined if the speed dial name is too short (Step 302), and if not; too short or if the user wants to enter the short name the system asks the user to repeat the speed dial name and if a match it is entered. If not a match the system will swap the first and second utterance and compare to see if a match”
The prior art teaches receiving an utterance and comparing the utterance with pre-existing commands in at least one speech recognition grammar (paragraph 26), determining if the provided utterance is potentially ambiguous or acoustically similar to a pre-existing command and, if so, determining a substitute (paragraph 27), and providing a substitute that is dissimilar to pre-existing commands and presenting a notice that the utterance is potentially confusing and the option to use the determined substitute instead of the utterance (paragraph 28).  Paragraph 21 describes where commands in the store can each be associated with a set of programmatic actions to be performed whenever a user issues the corresponding command, and determining whether an utterance is potentially ambiguous or acoustically similar to an entry in the command data store.  Paragraph 29 describes associating the utterance or selected substitute with a set of programmatic actions.  This reference teaches away from receiving a subsequent different voice command after a user-defined command is similar to an existing command (paragraph 6) and therefore cannot be applied to reject claim 1.
2008/0133244 paragraphs 10, 21, 26, 27, 28, 29
Paragraphs 20-21 describe where a speaker provides, via a microphone, a spoken utterance meant to be associated as a user-defined command, and where the spoken utterance that is meant to be associated as a user-defined command is analyzed to determine if the spoken utterance is acoustically similar to any existing commands contained in the command store, which can include user defined commands and/or system defined commands, and where the commands can each be associated with a set of programmatic actions to be performed whenever a user issues a corresponding command.  Paragraph 5 further describes examples of acoustically similar speech commands [including one user-defined speech command] which are at least suggested to perform a corresponding computer function [i.e. mail check or spell check].  Paragraphs 38-39 describe a computer system embodiment which is controlled by loading and executing a computer program, where computer programs are any expression, in any language, code, or notation, of a set of instructions intended to cause a system to perform a particular function.
These portions suggest “receiving, from a user, voice data defining a candidate directive invoking vocal utterance for invoking a directive to execute a first text based command to perform a first computer function of a computer system;” [receiving a spoken user-defined command/”voice data” that is a candidate to invoke a directive to execute a corresponding programmatic action that causes the computer system to perform a corresponding function, where the programmatic action and corresponding function are performed by executing computer program instructions/”commands” which are commonly/conventionally defined using computer program text]
“responsive to determining that a word or phrase of the candidate directive invoking vocal utterance sounds confusingly similar to…a word or phrase defining a second… command, communicating, to the user, information indicating that the word or phrase of the candidate directive invoking vocal utterance sounds confusingly similar to…the word or phrase defining the second… command”: Paragraphs 20-21 describe where a speaker provides, via a microphone, a spoken utterance meant to be associated as a user-defined command, and where the spoken utterance that is meant to be associated as a user-defined command is analyzed to determine if the spoken utterance is acoustically similar to any existing commands contained in the command store, which can include user defined commands and/or system defined commands, and where the commands can each be associated with a set of programmatic actions to be performed whenever a user issues a corresponding command.  Paragraph 5 further describes examples of acoustically similar speech commands [including one user-defined speech command] which are at least suggested to perform a corresponding computer function [i.e. mail check or spell check].  Paragraphs 38-39 describe a computer system embodiment which is controlled by loading and executing a computer program, where computer programs are any expression, in any language, code, or notation, of a set of instructions intended to cause a system to perform a particular function.  Paragraph 23 and Figure 1 describe where a new command is potentially ambiguous with an existing command, and where a user-defined command “Car” is acoustically similar to “Card”.
	The prior art teaches confusable commands, including where examples are “delete this voicemail” and “repeat this voicemail”, “read it” and “delete it” and “get rid of it”:
2011/0224972 “After selecting one of the remaining unprocessed menu elements in the FSM document 112, the build system 118 uses the language-neutral GRXML document 120 and the localized response document to identify responses for the selected menu element (608). After identifying the responses for the selected menu element, the build system 118 determines whether the responses for the selected menu element pass an acoustic confusability test (610). The acoustic confusability test determines whether two or more of the responses are acoustically confusable. Two responses are acoustically confusable when the IVR system 110 would perform different actions in response to the responses and there is a significant possibility that the IVR system 110 would confuse one of the responses for the other response. For example, the IVR system 110 could potentially confuse the words "delete" and "repeat" because these words have similar sounding endings. The responses for the selected menu element pass the acoustic confusability test if none of the responses for the selected menu element are acoustically confusable with any other one of the responses for the selected menu element”, paragraph 100; 
Doyle (US 2003/0125945) “For example, if "read it" is being confused with "delete it" because the two phrases are acoustically similar, then the system would, for example, remove "delete it" from the grammar's vocabulary and substitute it with "get rid of it". The phrase "get rid of it" is not acoustically similar to "read it" and therefore cannot be as easily confused by the system”, paragraph 99;
2008/0221896 “delete this voicemail… confused with ‘repeat this voicemail’”, paragraph 6
2019/0027138 teaches where the same word is already used to wake another command (“The predefined use can be determined by looking up existing commands. For example, if "Gort" has already been coded as a command for turning on a microwave, then using it as a wake-up utterance for the command hub 104 is likely to cause confusion”, paragraph 33;)
	10699706 appears to teach recognizing a list corresponding to input speech so that the system can disambiguate who the user is trying to refer to, and doesn’t seem to be generating speech renderings of the generated list.  This reference does include a command called Kitchen in Figure 1.  (“In the illustrative embodiment, individual 2 may seek to establish a communications session with a particular device associated with a user account for individual 2 (e.g., a device named "Kitchen") using voice activated electronic device 100a. In some embodiments, a user account associated with individual 2 may have a contact named "Kitchen" as well as a number of devices nicknamed "Kitchen." Thus, it may be necessary for computing system 300 to perform various processing methods to determine whether to establish a communications session between electronic device 100a and a contact named "Kitchen" or a device named "Kitchen." In the illustrative embodiment, the user account has one contact named "Bob Kitchen" and two devices called "Kitchen" associated with the user account. For instance, the user account may have two homes, with a device located in a respective kitchen of each home (e.g., kitchen device 100b and kitchen device 100c). Accordingly, computing system 300 may generate a list of contacts and devices associated with the user account. The contacts and devices may be represented by a name given to each contact and device, referred to herein as "entity names." The entity names on the list may include the contact name "Bob Kitchen," as well as devices names "Kitchen Echo Show" (corresponding to kitchen device 100b) and "Kitchen Echo" (corresponding to kitchen device 100c, which may be located in a third location 16c). Computing system 300 may then compare each entity name on the list to the word "Kitchen" to determine a confidence level for to each entity name. 
A confidence level may be a value that is determined by how closely a particular entity name matches the name of the target. For instance, in an embodiment, an entity name (i.e., text data) representing the contact named "Bob Kitchen" and text data representing the target named "Kitchen" are not identical, and based on statistical data, computing system 300 may determine a confidence level of "MEDIUM" with respect to the entity name "Bob Kitchen." Additionally, entity names "Kitchen Echo Show" and "Kitchen Echo" are not identical to text data representing the target, but may, based on historical data have a higher confidence level than "Bob Kitchen." For instance, computing system may determine a confidence level of "HIGH" with respect to the entity names "Kitchen Echo Show" and "Kitchen Echo."”)
7313525 teaches comparing a user-identified bookmark name to existing bookmark names and grammars, and also comparing a list of suggested bookmark names with the existing bookmark names and grammars (“In a further effort to improve the accuracy of bookmark recognition, the system may also include functionality that compares an elected bookmark name with existing bookmark names and grammars to ensure there is no confusion. In one example, once the user identifies a bookmark name the system provisionally accepts the bookmark name. However, before entering the bookmark name into the user profile, the system compares the provisionally accepted bookmark name with the existing bookmark names and grammars. If there is no conflict, then the bookmark name is finally accepted and added to the user profile. If, however, the system identifies a conflict with existing bookmark names or grammars the system may then prompt the user to select another bookmark name. Another example may have the system checking for conflicts before it presents the list of suggested bookmark names to the user. Specifically, the system after retrieving the list of suggested bookmark names from the application may compare the suggested list against the existing bookmark names and grammars in the user profile. If the system identifies a conflict, the conflicting bookmark name from the proposed list will not be presented to the user. There are other examples that can function individually or in combination to minimize potential conflict between bookmark names and the associated potential inaccurate recognition”;)
6535848 teaches listing a set of file names from which a user can select (“Display screen 700 is displayed on the transcription computer monitor. Display screen 700 desirably lists a set of file names 702 from which the user can select. Display screen 700 can include other information relating to the files, such as, for example, the date 704 the file was created or last edited, and information 706 describing the relationship of the file to other files”)
2007/0016420 (IBM reference) describes constructing a list of alternative letter sequences by replacing letters in a sequence with similar sounding letters. (“The probabilities of mistaking one letter for another are typically represented as a matrix, which is called a "confusion matrix." The probability of interchanging letters belonging to different letter classes is assumed to be small. When using letter classes, the post processor constructs the list of alternative letter sequences by replacing each letter of the best ranking sequence with similarly-sounding letters, according to the letter classes described above. The post processor typically ranks the list, for example by computing likelihood scores based on the confusion matrix”, paragraph 66;)
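For illustration only, the letter-class substitution and confusion-matrix ranking quoted above from 2007/0016420 can be sketched roughly as follows. The letter classes, the probability values, and the names LETTER_CLASSES, confusion_prob, and alternatives are assumptions made for this sketch and are not taken from the reference.

```python
from itertools import product

# Hypothetical letter classes: letters in the same class are assumed to sound alike.
LETTER_CLASSES = [set("BDEPT"), set("MN"), set("FS")]


def confusion_prob(spoken, heard):
    """Hypothetical confusion-matrix entry: chance that `spoken` is heard as `heard`."""
    if spoken == heard:
        return 0.8
    for cls in LETTER_CLASSES:
        if spoken in cls and heard in cls:
            # Spread the remaining probability over the other letters of the class.
            return 0.2 / (len(cls) - 1)
    # Cross-class confusion is assumed small and is treated as zero here.
    return 0.0


def alternatives(best):
    """List alternative letter sequences for `best`, ranked by likelihood score."""
    options = []
    for ch in best:
        cls = next((c for c in LETTER_CLASSES if ch in c), {ch})
        options.append(sorted(cls))
    scored = []
    for cand in product(*options):
        score = 1.0
        for spoken, heard in zip(cand, best):
            score *= confusion_prob(spoken, heard)
        scored.append(("".join(cand), score))
    # Highest score first; ties broken alphabetically.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored
```

Under these assumed values, calling alternatives("MA") would rank "MA" ahead of "NA", since "M" and "N" share a class while "A" has no similar-sounding alternative.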
2005/0203741 teaches comparing a letter sequence with a list of allowable words (“system compares each letter sequence with a list of allowable words and identifies the spelled identifier as soon as the list is reduced to a single identifier”, paragraph 3;)
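As a rough sketch of the list reduction quoted above from 2005/0203741, the following narrows a list of allowable words one letter at a time and stops as soon as a single identifier remains. The function name identify_spelled and the sample word list are hypothetical and are used only for illustration.

```python
def identify_spelled(letters, allowable_words):
    """Narrow `allowable_words` letter by letter.

    Returns (identifier, letters_consumed). The identifier is None when the
    spelled letters match no word or still match more than one word.
    """
    candidates = list(allowable_words)
    for i, letter in enumerate(letters):
        # Keep only the words whose i-th letter matches the spelled letter.
        candidates = [w for w in candidates if len(w) > i and w[i] == letter]
        if len(candidates) == 1:
            return candidates[0], i + 1
        if not candidates:
            return None, i + 1
    return None, len(letters)


# Hypothetical list of allowable words.
WORDS = ["BERLIN", "BERN", "BONN"]
```

With this sample list, the letters "BO" are enough to single out "BONN", while "BE" still leaves two candidates.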
5710864 teaches callers vocalizing utterances that are similar to an employee name (“Conventional recognizers often have difficulty verifying the occurrence of keywords in an unknown speech utterance. Typical automated inbound telephone applications, for example, route incoming calls to particular employees by recognizing a caller's utterance of an employee's name and associating that utterance with the employee's extension. Callers often pronounce names that are not in the directory, or worse, vocalize utterances that are phonetically similar to a different employee's name, causing the caller to be routed to the wrong party”)
8380514 (Figures 1-2) also more specifically teaches where a user is notified that an utterance is “potentially confusing” and provides a substitute, and receiving an input from a user that either refuses or accepts the substitute.  This reference, however, does not describe the manner in which ambiguity and acoustic similarity are determined.

Upon further search (in response to the amendment filed 9/27/2021):
6839670 teaches where a user/speaker can set up or edit a personal vocabulary in the form of name lists, function lists, etc., and suggests a user “enrolling” a spoken name as a means for dialing a particular phone number (col. 5, lines 11-32) and where there are a plurality of user-specific name lists which are set up, including a list for storing telephone numbers under predetermined name/abbreviations, and a list for storing function names for commands or command sequences (col. 20, lines 1-10) and where each user can set up his/her own name lists or abbreviation lists (col. 17, lines 32-34).  This reference appears to suggest where a user can have a personal list of function names (for executing commands) and a personal list of names, but this reference does not appear to specifically describe where a spoken name is compared to multiple lists to determine confusability.
7110948 teaches “A speech recognition system in a mobile telephone, the speech recognition system comprising: means for storing a word vocabulary in trellis tree structure, wherein words in the vocabulary are arranged in a plurality of different groups of words, word group selection means for enabling a user to speak via voice commands into the mobile telephone to select a first of said plurality of different groups of words, said first group of words being selected based upon at least a word spoken by the user, and speech recognition means for comparing input speech from a user to words in said selected first group of words, so that comparing of the input speech is performed relative to said selected first group of words prior to comparing the input speech with other of the plurality of different groups of words so that a limited number of groups of the entire vocabulary is searched via said comparing during speech recognition processes”.  This reference appears to describe comparing speech to multiple groups of words, but this comparison is for speech recognition (not to determine whether words are confusable).
6584439 teaches “The help function is context sensitive--whenever Help is requested, the voice controlled device responds with a description of the available options, given the current context of the voice controlled device. If Help is requested when the voice controlled device is listening for a command, the voice controlled device will respond with its state and the list the commands that it can respond to (e.g. "At Main menu. You can say . . . ") Further detail on any specific command can be obtained with the "Help <command>" syntax (e.g. "Help Dial", "Help Call", and even "Help Help"). If "Help" is requested while the voice controlled device is waiting for some type of non-command response (e.g. "Say the name"), then the voice controlled device will respond with a statement of the voice controlled device's current status, followed by a description of what it is waiting for (e.g. "Waiting for user response. Say the name of the person whose phonebook entry you wish to create, or say Nevermind to cancel.").”.  This reference describes where syntax for a command is a function followed by one of a plurality of command words that can be referenced by the function command.
5754977 teaches determining a list of words which are closest to a “current output” (Figure 3) which is the same word as a word that a user desires to add to the list of words available for subsequent recognition (col. 3, line 54 – col. 4, line 18) and at least suggests adding a word to a data base list if the word that the user desires to add is sufficiently “distant” from the words that are already in the list and where the system prompts for possible new input if the word that the user desires to add is insufficiently “distant” (Figure 4, col. 4, lines 19-59).  This reference also describes where words can be data representing pictures, phrases, graphics, charts, voice prints, etc. (col. 3, lines 48-53) which suggests where a newly added word is compared to different types of words.
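The add-if-sufficiently-distant behavior attributed above to 5754977 can be sketched roughly as follows. This is an assumption-laden illustration: edit distance over spellings stands in for whatever distance measure the reference actually uses, and the names edit_distance, try_add_word, and min_distance are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (stand-in distance measure)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def try_add_word(new_word, vocabulary, min_distance=2):
    """Add `new_word` only if it is far enough from every stored word.

    Returns (added, closest), where `closest` holds up to three stored words
    nearest to `new_word`, so the caller can prompt for new input if needed.
    """
    closest = sorted(vocabulary, key=lambda w: (edit_distance(new_word, w), w))[:3]
    if closest and edit_distance(new_word, closest[0]) < min_distance:
        return False, closest
    vocabulary.append(new_word)
    return True, closest
```

For example, with a vocabulary holding "card", an attempt to add "car" would be refused under the assumed threshold of 2, and the caller could then prompt the user for a different word.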
2007/0005372 teaches “In a very large vocabulary, such as for example a list of names of all cities in Germany, there is the problem that the addition of other words, for which recognition is activated parallel to this list, leads to a higher probability of a mix-up. This means, that supplemental commands, which are active in parallel, are often confused with city names. The recognition of larger vocabularies is particularly difficult with large dynamic loaded lists; these lists could be either static lists such as city names or also dynamic lists such as text or voice enrollments. It is here difficult to define in advance what size of resources the speech recognition system must have allocated to it in order to be able to evaluate sufficient numbers of alternatives in the case of similar words” (paragraph 5).  This reference describes where commands can be confused with city names, but does not appear to describe where the system determines that a candidate new command is confusable with a city name.
2012/0192096 teaches “The active command line driven user interface provides forgiveness in that each command name has a number of aliases which may be alternate names or phrases. A command alias may be shared between two or more command names. If multiple commands share the same alias, the multiple commands will all appear in the command list 512. This obviates the need for users to know the command name for a desired action, or to navigate through a list of available commands in a user interface menu. The user need only type what they want naturally. If the input provided by the user matches any one of the command aliases, that command is displayed in the command list 512. For example, as shown in FIG. 5, if the user types the command alias "Dial" in the command line 504 the proper command name "Call" is displayed in the list in the command list 512. As shown in FIG. 6, if the user types the command alias "Show", the proper command name "Browse" and "Map" commands are displayed in the list in the command list 512. Displaying the proper command name in response to entry of a command alias may help users learn the proper command names overtime as a result of being repeatedly presented with the proper command names in response to entry of a command alias. This may reduce the amount of alias handling over time, thereby reducing processing demands on the portable electronic device” (paragraph 113).  This reference describes where a word command can have an alias (a different word that is associated with that command).
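The alias handling quoted above from 2012/0192096 can be sketched as a simple lookup. Only the "Dial" and "Show" aliases come from the quoted passage; the table layout and the names COMMAND_ALIASES and commands_for are assumptions made for illustration.

```python
# Proper command names mapped to their aliases, per the quoted examples.
COMMAND_ALIASES = {
    "Call": {"Dial"},
    "Browse": {"Show"},
    "Map": {"Show"},
}


def commands_for(typed):
    """Return every proper command name matching the typed name or alias."""
    key = typed.strip().lower()
    matches = []
    for name, aliases in COMMAND_ALIASES.items():
        if key == name.lower() or key in {a.lower() for a in aliases}:
            matches.append(name)
    # An alias shared by several commands yields all of them.
    return sorted(matches)
```

Here commands_for("Dial") yields only "Call", while the shared alias "Show" yields both "Browse" and "Map".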
2006/0064177 teaches “Sample Phonebook Situation: A particular phonebook can include the names "Bill Clinton," "George Bush," "Tony Blair" and "Jukka Hakkinen." In the event that the user wishes to add the new name "John Smith," it may not be confused with any of the existing words due to the very low degree of similarity with the existing names. If, on the other hand, the user wants to add new name "Juha Hakkinen," then the present invention may report a possible confusion between "Juha Hakkinen" and "Jukka Hakkinen." If the user were to alter the new name, this could greatly reduce the likelihood of potential confusion. For example, the name dialing performance of the phonebook application could be greatly improved if the user altered the new name to "Juha Hakkinen Runner." Otherwise the system could undergo many errors because of the high similarity between "Jukka Hakkinen" and "Juha Hakkinen."” (paragraph 75).  This reference describes confusion between names and receiving a new name which is similar to a confusable name.
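The phonebook situation quoted above from 2006/0064177 can be approximated with a short sketch. The string similarity from Python's difflib is used here purely as a stand-in for an acoustic similarity measure, and the 0.8 threshold and the names similarity and confusable_entries are assumptions, not details from the reference.

```python
from difflib import SequenceMatcher

# Existing phonebook entries, taken from the quoted example.
PHONEBOOK = ["Bill Clinton", "George Bush", "Tony Blair", "Jukka Hakkinen"]


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (textual stand-in)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def confusable_entries(new_name, phonebook, threshold=0.8):
    """Return the existing entries that `new_name` could be confused with."""
    return [name for name in phonebook if similarity(new_name, name) >= threshold]
```

Under these assumptions, "John Smith" is reported as safe to add, "Juha Hakkinen" is flagged against "Jukka Hakkinen", and the altered name "Juha Hakkinen Runner" falls back below the threshold.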
8380758 teaches “In some environments, commands may be mapped to longer, more descriptive names to reduce the likelihood of confusion with other commands. For example, the UNIX command "ls-la", which instructs a command terminal to list the full details of all files (including hidden files) in a directory, could hypothetically be represented by the command string "list_directory_contents-all_files" instead. However, not only would such longer commands be more tedious and time-consuming for users to type, but their length might result in frequent mistakes”.  This reference describes associating a sequence of words with a UNIX command (a sequence of characters which can be interpreted as a “text-based command”).
5987411 teaches “FIG. 2 is a flow diagram showing an enrollment method implemented by VAD system 100 for testing the confusability and inconsistency of a candidate phrase. FIG. 2 shows that VAD system 100 first determines whether dictionary 150 is full (step 210). If dictionary 150 is full, interface unit 110 directs the user to delete an old phrase before adding a new one to dictionary 150, and then returns the user to VAD system 100's main menu (step 215). If a candidate phrase can be entered into dictionary 150, VAD system 100 will then determine whether dictionary 150 is empty (step 220). Since confusability is not an issue in this case, controller 140 will train an entry of dictionary 150 using the candidate phrase when inconsistency determining unit 130 determines that the first two utterances of the phrase said by the user are not inconsistent (steps 225 and 230). Inconsistency determination unit 130 enables VAD system 100 to determine whether the utterances are sufficiently consistent with one another in order to allow them to be used to train an entry of dictionary 150. If inconsistency determination unit 130 determines that the utterances are inconsistent, controller 140 will reject the candidate phrase for entry into dictionary 150 (step 235). As noted above, VAD system 100 does not test phrases for confusability in this case because there are no existing dictionary entries with which the phrase may be confused” and “In a VAD system, the voice command is typically the name of the party the user wishes to call, such as the phrase "Bob Johnson."”.    This reference describes testing for confusability and inconsistency (see also Figure 2).

Upon further search (in response to the amendment filed 3/15/2022):
2018/0166069 teaches an input module receiving a speech signal of a new word and phonetic symbols corresponding to the speech signal (i.e. a phonetic symbol “text”, see Figure 4A and paragraph 48).  The speech signal in this reference is not clearly a TTS rendering of the phonetic symbols, and is not clearly used to determine that the speech recognition system is likely to misidentify utterances of the candidate text as corresponding to a text based command other than the candidate text based command.
2020/0364067 (LATE filing date) teaches “The output module 111 further outputs the responsive content. In some implementations, the responsive content is assistant content, and the output module 111 causes the assistant content to be rendered via the assistant client application. The assistant content can include text, images, and/or other content to be visually rendered via a display of the feature phone 101, audio data (e.g., a text-to-speech version of text) to be rendered via speaker(s) of the feature phone, and/or other content to be rendered via a display, speaker(s), and/or other user interface output device(s)” (paragraph 70).  This reference does not qualify as prior art and appears to be directed to output of responsive content, and not to receiving candidate data for adding new commands/phrases to a speech recognition system.
2019/0129769 teaches “Alternately, the end user 230 may interact with the service 200 using a command line interface (CLI) 216. Like the GUI 214, the CLI 216 provides the end user with the ability to manipulate various portions of the service 200. However, per its name the CLI is used by manually typing a series of commands into a terminal or console, typically on a client device. For example, entering the command “projects create—name mRNA—description ‘mRNA Sequencing Analysis Project’” would create a workspace in the service 200 with that description and name. Similarly, the commands “files list” or “apps list” would present the end user 230 with a list of files and apps available to that user, respectively. One advantage of the CLI 216 is that a series of commands can be pre-written and entered in sequence, allowing for some automation of service 200 functions. Like the GUI 214, the CLI 216 is typically accessed by the end user 230 via a client device 220, such as the end user's own workstation or laptop, which may include certain libraries or applications developed for accessing the CLI 216 on the service 200. End users who prefer programmatic access to service 200 functions may prefer the CLI 216 over the GUI 214” (paragraph 38).  This reference describes a list of files/apps that are available to a user, but does not appear to describe where the list of files/apps are files/apps that can be referenced in text based commands, and does not appear to describe where speech renderings of the files/apps in the list of files/apps are compared to a word/phrase of a candidate new command.
2002/0046033 teaches “The memory 8 of the client computer 20 preliminarily stores standard recognition word sets to be used for voice recognition and interactive operational patterns correlated with the standard recognition word sets to be executed for required equipments. The user is able to execute registration by replacing (updating) the pre-registered standard word sets with new commonly used word set, or to newly add another new word set and its associated voice recognition operational pattern for a newly required function. Of course, particular recognition word sets for use in voice recognition operation and correlated particular operational pattern for a particular function may be originally and arbitrarily registered by the user himself.” (paragraph 32).  Operational patterns do not appear to refer to a TTS-synthesized version of new word(s), and appear to refer to functions performed in response to word(s).

Upon further search (in response to the amendment filed 7/1/2022):
4972485 teaches “informing the speaker that the word he/she is trying to train is too similar to previously trained words and that a new word should be selected” (col. 2, lines 54-65)

Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ERIC YEN whose telephone number is (571)272-4249. The examiner can normally be reached M-F 12:00PM -8:30PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, RICHEMOND DORVIL can be reached on (571)272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

EY 7/8/2022
/ERIC YEN/Primary Examiner, Art Unit 2658