DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
Claim 8 is objected to because of the following informalities:
Line 3 of claim 8 includes what appears to be a large amount of space between each word (likely due to word processor formatting).  No deletion markup is required in a future amendment but it would probably be useful for those spaces to be the size of a single space in future submitted claims.
Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-13 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.

As per Claim 1:

“the frequency weight” in “generate a speech recognition result corresponding to the speech data using the frequency weight” is ambiguous.  “determine a frequency weight for each word” determines one frequency weight per word and so there are multiple frequency weights and it is not clear which frequency weight is the one that “the frequency weight” in “generate a speech recognition result corresponding to the speech data using the frequency weight” is supposed to refer to.

Claims 12-13 include the same issues as claim 1.

As per Claim 2, “the converted text” lacks antecedent basis.  Applicant fairly clearly meant to refer to the text produced by “convert the speech data into text”, but as claimed there is no conversion of the text (the speech data is converted, not the text, such that there is no “converted text”)

As per Claim 3, “each section in the speech data” (line 4) lacks antecedent basis and is ambiguous.  No part of claims 1-2 describe where there are any sections in the speech data (even though speech data typically includes enough data to be partitioned into sections and is typically partitioned into frames).  Also, as claimed, it is not clear what qualifies as a section (a frame of data? a portion of the speech data containing an entire word?).
“calculate a word-by-word probability” is unclear because, as claimed, it appears to recite a singular probability that is “word-by-word” (where word-by-word seems to suggest multiple probabilities).  See also paragraphs 235-236 and 258 of the Specification, which describe respective probabilities for different words.  Applicant appears to be trying to claim where one probability is calculated for each of a plurality of words, but as claimed “calculate a word-by-word probability” refers to calculating a singular probability that is “word-by-word”.
“the word-by-word probability” in line 5 of claim 3 is unclear (same reasons as discussed in the previous paragraph), and assuming “calculate a word-by-word probability corresponding to each section in the speech data” refers to calculating one probability for each of a plurality of sections/words, “the word-by-word probability” is ambiguous (because each section has a word-by-word probability and it is not clear which section’s word-by-word probability is the one that “the word-by-word probability” in line 5 of claim 3 is supposed to refer to).
“the calculated word-by-word probability” in line 6 of claim 3 is also unclear and ambiguous (same issues as discussed in the previous paragraph). 
“the frequency weight” in line 6 of claim 3 is ambiguous (there is one weight per word in claim 1 and it is not clear which word’s frequency weight is the one that “the frequency weight” in line 6 of claim 3 is supposed to refer to)
“a highest calibrated word-by-word probability for each section” is unclear (same issue as discussed above pertaining to “a/the word-by-word probability”).
Also, while not linguistically improper (except for the “word-by-word probability” phrase), “selecting words having a highest calibrated word-by-word probability for each section” necessarily refers to where there are multiple words selected for every section, and not where one or more words are selected for each section.  If Applicant intended to claim where, for example, one or more words are selected per section, thereby selecting multiple words overall, this is not consistent with the existing language.

	As per Claim 4, “the calibrated word-by-word probability” includes the same issues as other recitations of “word-by-word probability” in claim 3.  
Also, if Applicant amends claim 3 to recite where one calibrated probability is calculated per word, “the calibrated… probability” would be ambiguous (currently there is no ambiguity because there is only one probability which is calibrated in claim 3)
“a highest normalized calibrated word-by-word probability for each section” is unclear (same issue as discussed above pertaining to “a/the word-by-word probability”).
Also, while not linguistically improper (except for the “word-by-word probability” phrase), “selecting words having a highest normalized calibrated word-by-word probability for each section” necessarily refers to where there are multiple words selected for every section, and not where one or more words are selected for each section.  If Applicant intended to claim where, for example, one or more words are selected per section, thereby selecting multiple words overall, this is not consistent with the existing language.

	As per Claim 5: 

“the frequency weight” in line 5 is ambiguous (multiple frequency weights determined in claim 1)
“check a usage frequency of each word” is ambiguous because it can refer to checking one frequency that corresponds to every word, or to one frequency per word.  If this phrase refers to checking one frequency for every word, this seems unusual because different words presumably have different frequencies, and if this phrase refers to one frequency per word, then “the usage frequency” in line 5 of claim 5 is ambiguous and “the usage frequency of each word” in lines 5-6 of claim 5 lacks antecedent basis.
Assuming “the usage frequency of each word” refers to one usage frequency per word, “the usage frequency” in line 5 is ambiguous (line 3 check[s] a usage frequency of each word [i.e., one usage frequency per word is checked]).
“the usage frequency of each word” lacks antecedent basis if “check a usage frequency of each word” refers to one usage frequency per word (as opposed to one frequency that corresponds to every word)

	As per Claim 6:
“each word” in line 3 lacks antecedent basis.
“each word” in line 5 lacks antecedent basis.
	“the word-by-word probability” lacks antecedent basis (claim 6 depends on claim 5, and not on claims 2, 3, or 4) and “the word-by-word probability” is also unclear (same issues as those discussed pertaining to “a/the word-by-word probability” in the rejection of claim 3, above).
per word, and not one weight that corresponds to every word)
	“the same type” in the 2nd to last line is unclear.  More specifically, it is not clear what similarities are required between two devices to qualify as a “same type”.  For example, PDAs and cell phones are both mobile handheld devices, but may not be “the same type” because PDAs don’t always have phone call functions.  As another example, PDAs, smart phones, laptops, and desktops all typically contain CPUs and memories, but also have various differences such that they could not be considered “the same type”.  Therefore, it is not clear what similarities would make two devices “the same type” and what differences would make two devices different types, such that it is not clear what the scope of “the same type” is supposed to be.

	As per Claim 7:
	“the private frequency weight” and “the public frequency weight” in lines 4-5 are ambiguous (assuming “determine a private frequency weight of each word” and “determine a public frequency weight of each word” in claim 6 each refer to determining one private/public frequency weight per word, and not one weight that corresponds to every word)
	“the word-by-word probability” lacks antecedent basis (claim 7 depends on claim 6 which depends on claim 5, and not on claims 2, 3, or 4, and since “the word-by-word 

	As per Claim 8, “the frequency weight” is ambiguous (multiple frequency weights determined in claim 1).

	As per Claim 9, “a main user or a device operation status” in the 4th to last line could be interpreted as either:
	1. a main user (i.e. an identity of a main user), or a status of device operation
	or
	2. a status of a main user, or a status of device operation
	It is not clear (as claimed) which interpretation Applicant intended to claim (probably interpretation 2).

As per Claim 11, “the frequency weight” (recited twice, once in each of the last two lines of claim 11) is ambiguous (multiple frequency weights determined in claim 1).

The dependent claims include the issues of their respective parent claims.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claim 13 is rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter.  The claim(s) does/do not fall within at least one of the four categories of patent eligible subject matter.

Claim 13 is directed to a recording medium which includes non-statutory transitory embodiments within its scope.  Paragraph 262 of the Specification only provides examples of a “processor-readable medium” which does not limit the scope of “recording medium” in claim 13 to only non-transitory embodiments.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 5, 8-10, and 12-13 are rejected under 35 U.S.C. 103 as being unpatentable over Sejnoha et al. (US 2012/0059658), hereafter Sejnoha, in view of Lloyd et al. (US 8,521,526), hereafter Lloyd.


As per Claims 1 and 12-13, Sejnoha suggests (along with its method and recording medium equivalents) An artificial intelligence apparatus for recognizing speech of a user, comprising: a microphone; and a processor configured to: obtain, via the microphone, speech data including speech of a user,… generate a speech recognition result corresponding to the speech data…, and perform control corresponding to the speech recognition result (paragraphs 39-40, 42, 46-50, 55, 58-60, 62-63, 129-132; Figure 6; [all paragraphs and Figures are cited for each limitation with “key” paragraphs and Figures pertaining to each limitation identified below, i.e. all other paragraphs and Figures not specifically referenced for any particular limitation are eligible to provide context and additional support]
“An artificial intelligence apparatus for recognizing speech of a user, comprising: a microphone; and a processor configured”: paragraphs 39, 46-48, 58, 129-132; a client device [“apparatus”] has an integrated microphone [paragraph 46] where the client device executes an automated speech recognizer that performs automated speech recognition on spoken query audio data received from the user via the integrated microphone [paragraphs 46-48, where user-supplied audio data which is recognized by the speech recognizer in paragraph 48 is at least suggested to be the spoken search query spoken into an integrated microphone in paragraphs 46-47, see also paragraph 58 which describes where the user supplies a search query by voice in the form of audio data].  Paragraphs 129-132 further describe where the client device can be a device which implements the functions described in Sejnoha by executing instructions via a processor included in the client device [i.e. where the “apparatus” comprises “a 
“to: obtain, via the microphone, speech data including speech of a user,…”: paragraphs 46-48; a user speaks a search query into the microphone integrated into the client device, thereby causing the client device to obtain user-supplied audio data that is at least suggested to be a data representation of the spoken search query [i.e. “speech data including speech of a user”].
“generate a speech recognition result corresponding to the speech data…,”: paragraphs 39-40, 42, 46-48, 58-60, 62-63; a highest-scoring/confidence recognition result is obtained/”generated” and corresponds to “the speech data” [the user-supplied audio data at least suggested to be a data representation of the spoken search query] because it is derived from performing speech recognition on “the speech data”.  Paragraph 63 particularly describes an embodiment where multiple recognition results are obtained, and a highest scoring/confidence recognition result is used as a basis for a search query for a search engine.
“and perform control corresponding to the speech recognition result”: paragraphs 39-40, 42, 46-48, 58-60, 62-63; the client device’s processor “controls” the client device to display search results received from search engines in response to issuing a query to the search engines [paragraph 39] where the “control” “correspond[s] to the speech recognition result” in the sense that displaying of the search results displays search 
Also of relevance is that Sejnoha describes [in paragraphs 49-50] where an alternative to client-side speech recognition is speech recognition on a server [suggesting that server-side speech recognition can also be performed client-side])
Sejnoha does not, but Lloyd suggests An artificial intelligence apparatus for recognizing speech of a user, comprising: a microphone; and a processor configured to: obtain, via the microphone, speech data including speech of a user, determine a frequency weight for each word using a speech recognition log, generate a speech recognition result corresponding to the speech data using the frequency weight, and perform control corresponding to the speech recognition result (Figures 1-2, 5; col. 2, lines 1-25; col. 4, line 33 – col. 5, line 46; col. 5, line 47 – col. 6, line 4; col. 6, lines 35-53; col. 7, line 50 – col. 9, line 29; col. 9, lines 47-61; col. 15, lines 22-54; col. 16, lines 34-42; col. 17, lines 3-37;
Lloyd describes performing speech recognition [server-side] on a user-spoken “query term”, determining a plurality of candidate transcriptions, obtaining search results based on a query based on the highest-scoring candidate transcription, and displaying the search results [similar to Sejnoha].  Lloyd further describes where scoring of the candidate transcriptions includes determining frequencies [of occurring in past queries] and corresponding weights for each of the plurality of individual words in the candidate transcriptions [see Figure 2 for the simplest depiction, and see cited passages for full detail, where, even though the n-grams of element 210 include one entry for the 
Lloyd thus suggests “An artificial intelligence apparatus for recognizing speech of a user, comprising: a microphone; and a processor configured to: obtain, via the microphone, speech data including speech of a user, determine a frequency weight for each word using a speech recognition log, generate a speech recognition result corresponding to the speech data using the frequency weight, and perform control corresponding to the speech recognition result”: where the speech recognition in Sejnoha’s client device generates the highest-scoring/confidence recognition result [i.e. “the speech recognition result”] by determining frequencies of occurrence in past queries for each of the words in the plurality of recognition results/candidate transcriptions, determining a corresponding weight for each of the words [i.e. “determining a frequency weight for each word”], and multiplying each recognition result’s score by the weights of the recognition result’s respective words, thereby generating weighted-score recognition results, the highest of which is determined to be the highest-scoring/confidence recognition result [i.e. “the speech recognition result”].  Col. 4, line 66 – col. 5, line 5 and col. 8, lines 1-7 describe where frequencies are “stored” in a table in association with their n-grams, which at least suggests where there is a data “log” that “logs” the n-grams and their frequencies, where the “log” can be interpreted as a “speech recognition log” in the sense that it is a “log” that is used to perform “speech 
Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing to perform a simple substitution of one type of speech recognition with another because the prior art teaches the claimed invention except for the substitution of speech recognition which does not determine a highest scoring recognition result based on weights that are determined based on counted frequencies of each of a plurality of words occurring in past search queries of a search history (where weights for higher frequencies have higher values than weights for lower frequencies) with speech recognition which does.  Lloyd teaches that speech recognition which determines a highest scoring recognition result based on weights that are determined based on counted frequencies of each of a plurality of words occurring in past search queries of a search history (where weights for higher frequencies have higher values than weights for lower frequencies) was known in the art.  One of ordinary skill in the art could have substituted one type of speech recognition with another to obtain the predictable results of a client device which receives a spoken search query from a user via a microphone of the client device, performs speech recognition on the spoken search query to generate a plurality of recognition results that each have a 
	
	As per Claim 2, Sejnoha does not, but Lloyd suggests wherein the processor is configured to: convert the speech data into text, and generate the speech recognition result based on the converted text (Figures 1-2, 5; col. 2, lines 1-25; col. 4, line 33 – col. 5, line 46; col. 5, line 47 – col. 6, line 4; col. 6, lines 35-53; col. 7, line 50 – col. 9, line 29; col. 9, lines 47-61; col. 15, lines 22-54; col. 16, lines 34-42; col. 17, lines 3-37;
	The combination [thus far] is as discussed above in the rejection of claim 1, where Lloyd further describes where the candidate transcriptions whose frequencies are determined are the output of performing speech recognition on the user’s spoken input [see Figures 1-2 and col. 6, lines 35-53].  Figure 2 also depicts where the candidates are a sequence of natural language words, which at least suggests where the candidates are “text” [see also col. 2, lines 1-25 which describes where speech recognition is performed to select two or more “textual, candidate transcriptions that 
	Lloyd thus suggests “wherein the processor is configured to: convert the speech data into text, and generate the speech recognition result based on the converted text”: the speech recognition performed by the processor of Sejnoha’s client device produces, from the user-supplied audio data that is at least suggested to be a data representation of the spoken search query [“the speech data”], a textual recognition result/candidate transcription [i.e. “text”] which is eventually the highest-scoring/confidence recognition result [“the speech recognition result”].  The speech recognition process that produces a candidate transcription from input speech can be interpreted as a process that “converts” speech into text [because it receives speech as an input and produces a textual recognition result/candidate transcription as an output].  The “speech recognition result” is generated “based on the converted text” because the textual recognition result/candidate transcription [“the converted text”] which is eventually the highest-scoring/confidence recognition result becomes “the speech recognition result” [after frequency weighting is applied to the candidate transcription scores].)
Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing to perform a simple substitution of one type of speech recognition with another because the prior art teaches the claimed invention except for the substitution of speech recognition which does not determine a highest scoring recognition result based on weights that are determined based on counted frequencies of each of a plurality of words occurring in past search queries of a search history (where weights for higher frequencies have higher values than weights for lower 
Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing to perform a simple substitution of one type of speech recognition result with another because the prior art teaches the claimed invention except for the substitution of speech recognition results which are not necessarily text with speech recognition results which are.  Lloyd teaches that speech recognition results which are text were known in the art.  One of ordinary skill in the art could have substituted one 

	As per Claim 5, Sejnoha does not, but Lloyd suggests wherein the processor is configured to: check a usage frequency of each word using the speech recognition log, and set the frequency weight higher as the usage frequency of each word increases (Figures 1-2, 5; col. 2, lines 1-25; col. 4, line 33 – col. 5, line 46; col. 5, line 47 – col. 6, line 4; col. 6, lines 35-53; col. 7, line 50 – col. 9, line 29; col. 9, lines 47-61; col. 15, lines 22-54; col. 16, lines 34-42; col. 17, lines 3-37;
	Same combination applied to reject claim 1, where, as discussed in the rejection of claim 1 [see particularly the portion based on Lloyd], the search history can also be interpreted as a “speech recognition log” because it is a history/”log” of past search queries which is used for “speech recognition”.

	Lloyd thus suggests “wherein the processor is configured to: check a usage frequency of each word using the speech recognition log, and set the frequency weight higher as the usage frequency of each word increases”: where the speech recognition processing performed by the processor of Sejnoha’s client device includes counting/”checking”, for “each” of a plurality of n-gram individual “words”, a respective “frequency” of the n-gram occurring/being-“used” in past search queries in the search history [“speech recognition log”], and setting weights for higher occurrence/”usage” frequencies to have higher values than weights for lower occurrence/”usage” frequencies.)
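For illustration only, the frequency weighting discussed above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of Lloyd, Sejnoha, or the present application: the function names `frequency_weights` and `rescore`, the formula `1 + count / total`, the neutral weight of 1.0 for unseen words, and the sample log and scores are all hypothetical. The only properties taken from the discussion above are that a weight increases with a word's usage frequency in the log and that a candidate's score is multiplied by the weights of its words.

```python
from collections import Counter


def frequency_weights(query_log):
    """Count word occurrences in past queries and map each count to a weight.

    Assumption: weight = 1 + count / total_words, so a more frequently
    used word always receives a higher weight than a less frequent one.
    """
    counts = Counter(word for query in query_log for word in query.lower().split())
    total = sum(counts.values())
    return {word: 1.0 + count / total for word, count in counts.items()}


def rescore(candidates, weights):
    """Multiply each candidate's recognizer score by the weights of its words.

    Words absent from the log get a neutral weight of 1.0 (an assumption).
    Returns the (text, weighted_score) pair with the highest weighted score.
    """
    best_text, best_score = None, float("-inf")
    for text, score in candidates:
        for word in text.lower().split():
            score *= weights.get(word, 1.0)
        if score > best_score:
            best_text, best_score = text, score
    return best_text, best_score


# Hypothetical log of past queries and hypothetical recognizer scores.
query_log = ["weather today", "weather tomorrow", "whether report"]
weights = frequency_weights(query_log)
candidates = [("whether today", 0.50), ("weather today", 0.48)]
best_text, best_score = rescore(candidates, weights)
```

In this sketch the candidate with the lower raw score can still be selected when its words occur more often in the log, which is the effect attributed to the weighting in the discussion above.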
Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing to perform a simple substitution of one type of speech recognition with another because the prior art teaches the claimed invention except for the substitution of speech recognition which does not determine a highest scoring recognition result based on weights that are determined based on counted frequencies of each of a plurality of words occurring in past search queries of a search history (where weights for higher frequencies have higher values than weights for lower frequencies) with speech recognition which does.  Lloyd teaches that speech recognition which determines a highest scoring recognition result based on weights that 

As per Claim 8, Sejnoha does not, but Lloyd suggests wherein the processor is configured to: determine utterance situation information corresponding to an utterance situation, and calibrate the frequency weight using the utterance situation information (Figures 1-2, 5; col. 2, lines 1-25; col. 4, line 33 – col. 5, line 46; col. 5, line 47 – col. 6, line 4; col. 6, lines 35-53; col. 7, line 50 – col. 9, line 29; col. 9, lines 47-61; col. 15, lines 22-54; col. 16, lines 34-42; col. 17, lines 3-37;

	The candidate transcriptions/recognition results [see Figure 2 of Lloyd] can be interpreted as “utterance situation information” in the sense that they are “information” about words that may have been spoken by a user in a “situation” where the user spoke an “utterance”.  Since the individual-words/n-grams in the candidate transcriptions/recognition results are used to determine the frequencies of each n-gram which are used to determine the weight values for each n-gram [see Figure 2 of Lloyd], the candidate transcriptions/recognition results can be interpreted as “utterance situation information” that is “used” to set/”calibrate” the “frequency weights” to be particular values corresponding to the frequencies [at least in the sense that, unless an n-gram exists in the candidate transcriptions, its weight value is not set/”calibrated”]
	Lloyd thus suggests “wherein the processor is configured to: determine utterance situation information corresponding to an utterance situation, and calibrate the frequency weight using the utterance situation information”: where the speech recognition processing performed by the processor of Sejnoha’s client device includes determining candidate transcriptions/recognition results [which can be interpreted as “utterance situation information”, as discussed in the previous paragraph, where the “utterance situation information” is “information” about/”corresponding to” a “situation” where the user spoke an “utterance”] and setting/”calibrating” the “frequency weights” for each individual-word/n-gram “using the utterance situation information” [at least in the sense that, unless an n-gram exists in the candidate transcriptions, its frequency and corresponding weight value are not set/”calibrated”]

Lloyd thus also suggests “wherein the processor is configured to: determine utterance situation information corresponding to an utterance situation, and calibrate the frequency weight using the utterance situation information” in the following alternative/additional: where the speech recognition processing performed by the processor of Sejnoha’s client device includes determining a set of individual-words/n-grams in the candidate transcriptions/recognition results [where the set of individual-words/n-grams can be interpreted as “utterance situation information”, as discussed in the previous paragraph, where the “utterance situation information” is “information” about/”corresponding to” a “situation” where the user spoke an “utterance”] and setting/”calibrating” the “frequency weights” for each individual-word/n-gram “using the utterance situation information” [the n-grams/individual-words are used to determine their respective frequencies which are used to set/”calibrate” their respective weights to be respective values]
Additionally/alternatively, one of the n-grams/individual-words in the candidate transcriptions can, by itself, be interpreted as the “utterance situation information” of claim 8 [as an individual word that may have been spoken by a user during a “situation” where the user spoke an “utterance”] and the one of the n-grams/individual-words in the 
Lloyd thus also suggests “wherein the processor is configured to: determine utterance situation information corresponding to an utterance situation, and calibrate the frequency weight using the utterance situation information” in the following alternative/additional: where the speech recognition processing performed by the processor of Sejnoha’s client device includes determining an n-gram/individual-word in the candidate transcriptions/recognition results [where the n-gram/individual-word can be interpreted as “utterance situation information”, as discussed in the previous paragraph, where the “utterance situation information” is “information” about/”corresponding to” a “situation” where the user spoke an “utterance”] and setting/”calibrating” the “frequency weight” for the individual-word/n-gram “using the utterance situation information” [the n-gram/individual-word is used to determine its respective frequency which is used to set/”calibrate” its respective weight to be a respective value])
Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing to perform a simple substitution of one type of speech recognition with another because the prior art teaches the claimed invention except for the substitution of speech recognition which does not determine a highest scoring recognition result based on weights that are determined based on counted frequencies of each of a plurality of words occurring in past search queries of a search history (where weights for higher frequencies have higher values than weights for lower 

As per Claim 8, Sejnoha does not, but Lloyd suggests wherein the processor is configured to: determine utterance situation information corresponding to an utterance situation, and calibrate the frequency weight using the utterance situation information (Figures 1-2, 5; col. 2, lines 1-25; col. 4, line 33 – col. 5, line 46; 
	The combination [thus far] is as discussed in the rejection of claim 1.
Lloyd further describes where frequency may be adjusted if an n-gram occurs in a query term that was included in a search query that is associated with a context which is similar/identical/dissimilar to the current context of the mobile device [adjusted down if dissimilar context, adjusted up if similar context, see col. 8, lines 22-34] and also where context data which affects frequency values that are used to determine weighting includes, among other things, device type and a time of day when a spoken query term [input spoken search query, see Figure 5, element 212] was spoken [see Figure 5 and col. 15, lines 35-54, col. 16, lines 34-42, col. 17, lines 3-37].  Col. 15, lines 35-54 also describes a particular embodiment of context data which is “the user’s location” [which suggests an embodiment where similarity/dissimilarity between the user’s location corresponding to the input spoken search query and a past search query leads to frequency being adjusted up/down]
Lloyd thus suggests “wherein the processor is configured to: determine utterance situation information corresponding to an utterance situation, and calibrate the frequency weight using the utterance situation information”: where the speech recognition processing performed by the processor of Sejnoha’s client device includes determining context information for the user’s spoken search query, where the context information includes user location at which the user entered/spoke the spoken search query, where the user location context information can be interpreted as “utterance situation information corresponding to an utterance situation” [in the sense that the user 
As a supplemental note [for future reference], it would not be clearly obvious to apply the “type of device” [mobile/non-mobile] in Lloyd to the combination of Sejnoha and Lloyd because, in the combination applied to reject claim 1, the speech recognition is performed client-side and the search history is also stored client-side [and so all the queries would presumably be invariably entered via a mobile device such that device type would presumably not be a useful distinguishing factor])
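For illustration only, the context-based frequency adjustment discussed above (frequency adjusted up for a similar context and down for a dissimilar context, with user location as the context) can be sketched as follows. This is a minimal sketch under stated assumptions, not Lloyd's actual implementation: the function name `context_adjusted_frequencies`, the exact-match test for location similarity, the `boost` and `penalty` factors, and the sample log are all hypothetical.

```python
from collections import defaultdict


def context_adjusted_frequencies(query_log, current_location, boost=1.5, penalty=0.5):
    """Accumulate per-word frequencies from (query, location) log entries.

    Assumption: each occurrence counts `boost` when the logged location
    matches the current location and `penalty` otherwise. Both factors and
    the exact-match similarity test are hypothetical.
    """
    freqs = defaultdict(float)
    for query, location in query_log:
        factor = boost if location == current_location else penalty
        for word in query.lower().split():
            freqs[word] += factor
    return dict(freqs)


# Hypothetical log of past queries, each tagged with the location where it was spoken.
log = [
    ("pizza near me", "home"),
    ("pizza delivery", "home"),
    ("piazza map", "office"),
]
freqs_home = context_adjusted_frequencies(log, current_location="home")
freqs_office = context_adjusted_frequencies(log, current_location="office")
```

Under these assumptions, the same log yields different adjusted frequencies (and therefore different weights) depending on where the current query is spoken.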
Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing to perform a simple substitution of one type of speech recognition with another because the prior art teaches the claimed invention except for the substitution of speech recognition which does not determine a highest scoring recognition result based on weights that are determined based on counted frequencies of each of a plurality of words occurring in past search queries of a search history (where the counted frequencies of at least some of the plurality of words are adjusted 

As per Claim 9, Sejnoha does not, but Lloyd suggests wherein the utterance situation information includes at least one of a type of the artificial intelligence apparatus, an installation location of the artificial intelligence apparatus, a location of the user, a main user or a device operation status, and wherein the device operation status includes operation status information of the artificial intelligence apparatus (Figures 1-2, 5; col. 2, lines 1-25; col. 4, line 33 – col. 5, line 46; col. 5, line 47 – col. 6, line 4; col. 6, lines 35-53; col. 7, line 50 – col. 9, line 29; col. 9, lines 47-61; col. 15, lines 22-54; col. 16, lines 34-42; col. 17, lines 3-37;
	Same combination as discussed in the rejection of claim 8.
	The user location context information [“utterance situation information”] can be interpreted as “a location of the user” [such that “the utterance situation information includes at least one of…a location of the user”]
The “wherein the device operation status includes operation status information of the artificial intelligence apparatus” limitation does not need to be addressed because it further narrows one of the alternatives that the utterance situation information can [but is not required to] include, and the combination addresses one of the other claimed alternatives [i.e. if a claim requires at least one of A, B, and C, and the prior art addresses A and the claim further narrows B, then the prior art still meets the claim language because the limitation further narrowing B, when added to “at least one of A, B, and C”, leads to “at least one of A, narrower B, and C” which still only requires one of 
Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing to perform a simple substitution of one type of speech recognition with another because the prior art teaches the claimed invention except for the substitution of speech recognition which does not determine a highest scoring recognition result based on weights that are determined based on counted frequencies of each of a plurality of words occurring in past search queries of a search history (where the counted frequencies of at least some of the plurality of words are adjusted based on user location context information corresponding to the user’s spoken query and where weights for higher frequencies have higher values than weights for lower frequencies) with speech recognition which does.  Lloyd teaches that speech recognition which determines a highest scoring recognition result based on weights that are determined based on counted frequencies of each of a plurality of words occurring in past search queries of a search history (where the counted frequencies of at least some of the plurality of words are adjusted based on user location context information corresponding to the user’s spoken query and where weights for higher frequencies have higher values than weights for lower frequencies) was known in the art.  
One of ordinary skill in the art could have substituted one type of speech recognition with another to obtain the predictable results of a client device which receives a spoken search query from a user via a microphone of the client device, performs speech recognition on the spoken search query to generate a plurality of recognition results that each have a corresponding score, uses the highest scoring recognition result to query a search engine, and displays search results for the spoken search query (as per Sejnoha) where the scores for the recognition results are calculated based on weights that are determined based on frequencies that are each a frequency that a respective one of a plurality of words in the recognition results occurs in past search queries, where the frequencies are stored, in a table, in association with their respective word, and where weights for higher frequencies have higher values than weights for lower frequencies, where context information indicating a user location where the spoken search query was/is spoken is determined and is used to adjust the frequency of at least one of the plurality of words, and where the weight(s) for the at least one of the plurality of words are determined based on the adjusted frequency (as per Lloyd)

	As per Claim 10, Sejnoha suggests further comprising a communication circuit configured to communicate with at least one external device, wherein the device operation status further includes operation status information of the at least one external device (paragraphs 39-40, 42, 46-50, 55, 58-60, 62-63, 129-132; Figure 6;
	The combination [thus far] is as discussed in the rejection of claim 9, where, as discussed in the rejection of claim 1, the client device of Sejnoha is interpreted as “The artificial intelligence apparatus”.  
	Paragraph 39 describes various embodiments of client devices like a mobile phone or smartphone which commonly/typically, if not inherently, includes “a communication circuit configured to communicate with at least one external device” [i.e. hardware/circuitry that transmits/receives telephone call data].  

The “wherein the device operation status further includes operation status information of the at least one external device” limitation does not need to be addressed because it further narrows one of the alternatives that the utterance situation information can [but is not required to] include, and the combination addresses one of the other claimed alternatives [i.e. if a claim requires at least one of A, B, and C, and the prior art addresses A and the claim further narrows B, then the prior art still meets the claim language because the limitation further narrowing B, when added to “at least one of A, B, and C”, leads to “at least one of A, narrower B, and C” which still only requires one of A, narrower B, and C, and can therefore still be addressed by the prior art that addresses A]).
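The “at least one of A, B, and C” reasoning above can be sketched as a simple membership check.  This is a minimal illustration only; the function name and the labels “A”, “narrower B”, and “C” are hypothetical placeholders that follow the bracketed example above and do not come from the claims or the references.

```python
def prior_art_meets_alternatives(alternatives, addressed):
    """Return True when the prior art addresses any one claimed alternative.

    Hypothetical model: a limitation reciting "at least one of" a list of
    alternatives is met as soon as a single alternative is addressed.
    """
    return any(alt in addressed for alt in alternatives)


# "at least one of A, B, and C"
broader_claim = ["A", "B", "C"]
# Same list after a dependent claim further narrows only alternative B.
narrowed_claim = ["A", "narrower B", "C"]

# Prior art that addresses alternative A meets both versions of the list.
addressed_by_prior_art = {"A"}
meets_broader = prior_art_meets_alternatives(broader_claim, addressed_by_prior_art)
meets_narrowed = prior_art_meets_alternatives(narrowed_claim, addressed_by_prior_art)
```

Under this model the narrowing of B would only matter if B were the sole alternative addressed by the prior art.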
	
Allowable Subject Matter
Claims 3-4, 6-7, and 11 would be allowable if rewritten to overcome the rejection(s) under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), 2nd paragraph, set forth in this Office action and to include all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter:  

	As per Claim(s) 3 (and consequently claim 4 which depends on claim 3), the prior art of record does not teach or suggest the combination of all limitations in claim(s) 1, 2, and 3 together, including (i.e. in combination with the remaining limitations in claim[s] 1, 2, and 3) wherein the processor is configured to: calculate a word-by-word probability corresponding to each section in the speech data, calibrate the word-by-word probability by multiplying the calculated word-by-word probability by the frequency weight, and convert the speech data into the text by selecting words having a highest calibrated word-by-word probability for each section.
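The limitation quoted above can be illustrated with a minimal sketch.  The function name, the dictionary layout, the `default_weight` fallback for words without a frequency weight, and the example probabilities and weights are all assumptions made for illustration; none of them are taken from the application or from the prior art of record.

```python
def calibrate_and_select(section_probs, frequency_weights, default_weight=1.0):
    """Select one word per section after word-by-word calibration.

    section_probs: list of dicts, one per section of the speech data,
        mapping each candidate word to its word-by-word probability.
    frequency_weights: dict mapping a word to its frequency weight.
    default_weight: assumed weight for words with no stored weight.
    """
    selected = []
    for candidates in section_probs:
        # Calibrate: multiply each word-by-word probability by the
        # frequency weight of that same word.
        calibrated = {
            word: prob * frequency_weights.get(word, default_weight)
            for word, prob in candidates.items()
        }
        # Keep the word with the highest calibrated probability.
        selected.append(max(calibrated, key=calibrated.get))
    return " ".join(selected)


# Hypothetical two-section utterance with two candidate words per section.
sections = [
    {"call": 0.6, "tall": 0.4},
    {"mum": 0.45, "mom": 0.55},
]
weights = {"call": 1.5, "tall": 1.0, "mum": 2.0, "mom": 1.0}

text_with_weights = calibrate_and_select(sections, weights)
text_without_weights = calibrate_and_select(sections, {})
```

In this sketch the weighting is applied to each word's own probability within each section, which is the word-by-word aspect that distinguishes the limitation from weighting the score of an entire transcription.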
2016/0005402 teaches “Automatic speech recognizers typically generate a set of alternative hypotheses 134 (i.e., candidate words) for each recognized region in an audio stream. For example, when the automatic transcription system 128a attempts to recognize the spoken word "knot," the system 128a may generate a list of alternative hypotheses 134 consisting of the words "knot," "not," "naught," and "nit," in that order. The system 128a typically associates a confidence measure with each hypothesis representing a degree of confidence that the hypothesis accurately represents the 
2014/0343935 teaches “Outputting the final recognition results may include determining the final scores of the one or more word candidates for the time span based on the probability values and reliabilities of the one or more word candidates for the time span, and outputting a word candidate having a highest value for the time span as one of the final recognition results” (paragraph 24) and “The final recognition result output unit 30 outputs final recognition results based on the results of the speech recognition and verification unit 32. The final recognition result output unit 30 determines final scores based on the probability values and reliabilities of the one or more word candidates for each time span. Furthermore, the final recognition result output unit 30 may output a word candidate having the highest value for each time span as a final recognition result. That is, the final recognition result output unit 30 may search all the paths of a word lattice, may determine a path having the highest value, and may present the determined path as a final recognition result” (paragraph 41).  This reference appears to describe generating final recognition results, one for each of a plurality of time spans, based on a highest scoring word candidate for each time span.  This reference does not appear to describe where a single result is generated by combining the highest scoring candidates for each time span (but this may not be required for claim 3 of this application)
It would not necessarily be obvious to combine the n-gram occurrence frequency weighting of Lloyd with the scoring of each candidate for each time span/region in the previous two references (and vice versa) because Lloyd suggests weighting an entire transcription’s score based on word-specific weights, not weighting each word’s score individually (i.e. not where the weighting is word-by-word).
10387568 teaches “The routine 400A then proceeds to operation 414, where a keyword score KS for the given candidate keyword 132 is calculated. In one implementation, the keyword score KS is calculated as KS = Σ_{i ∈ keyword} WS_i, where the summation runs through every word contained in the given candidate keyword 132. The keyword score 134 can also be calculated as a weighted sum of the word scores for the words contained in the candidate keyword 132 and the weight for each word score can be utilized to reflect the importance of the corresponding word. From operation 414, the routine 400A proceeds to operation 416, where it ends” and “A keyword can be a unigram, i.e. consisting of a single word, or a multi-gram, i.e. consisting of multiple words”.  This reference is directed to extracting keywords from documents, but does at least suggest where a score for a multi-word keyword is calculated based on a weighted sum of word scores for words in a keyword.  In this reference, weights are used to reflect importance of a corresponding word (not frequency).
2014/0278407 teaches “If the candidate transcription was found, the computing system 110 determines a probability score for the candidate transcription using the first component (308). If the candidate transcription was not found, the computing system determines a probability score for the candidate transcription using a second component of the language model (310). For example, the computing system 110 can "back off" to a generalized n-gram model. The computing system 110 normalizes the probability score from the second component (312), for example, by multiplying the 
2019/0258717 teaches “The curating of the knowledge continues at step 795 where the scores are interpreted to determine whether the initial interpretation is reliable and, when reliable, adds the initial interpretation to a knowledge database. For example, a weighting approach is utilized to aggregate scores to produce a confidence level. For instance, weighting factors are multiplied by each component of the scores (e.g., an alignment component, the source reliability component, an information age component) to produce intermediate scores for aggregation to produce the conference level. The confidence level indicates that the initial interpretation is reliable when the confidence level is greater than a confidence threshold” and “A third interpreting step includes determining the confidence level based on a weighting approach and scores for the confirming entigen groups and other scores for the disconfirming entigen groups. For example, the processing module multiplies each score by a weighting factors for confirming and disconfirming to produce an intermediate confidence level and aggregates the intermediate confidence levels to produce the confidence level”
10609037 teaches “the scores may be combined in a weighted combination, such as a weighted sum, by multiplying each score by a corresponding weight and summing the weighted values to determine an aggregate correspondence score. The weights may be determined with a variety of different techniques. In some cases, the weights may be hand tuned manually, for instance, by a system administrator or developer. In some cases, the weights may be changed dynamically based on feedback, such as feedback indicative of false positives or false negatives in the determination of block 34” and “In some cases, certain attributes are expected to be more indicative of attacks than others, as some attributes are expected to change relatively frequently and are less indicative of the second computing device being impersonated. Accordingly, some embodiments may determine a score for each pair of attributes compared between the known profile and the observe profile and that score may be weighted based on the likelihood of differences indicating an attack. In some cases, the score is a binary value of zero or one indicating whether to values match exactly, like a device maker name, or a binary value of zero or one indicating whether a given operating system version has decremented or incremented between capturing the known profile and the observed profile. In some cases, the score is a cardinal value indicating differences between known and observed attributes, like a number of increments between application versions, or an edit distance between strings”;
2020/0243092 (foreign priority date precedes effective date of this application) teaches “The converter 30H converts the speech data into a character string for every word string by adopting a word string having the highest appearance 
2008/0208582 teaches “in speech than simply the recognition of each word spoken. As a result, while conventional speech recognition techniques focus on identifying a set of alternatives for each spoken word, determining which alternative has the highest probability of actually being the word that was spoken and discarding the remainder of the alternatives, embodiments of the invention recognize that the discarded alternatives may have significant value in identifying”.  These references appear to teach (but also teach away from) word-by-word recognition.

As per Claim(s) 6 (and consequently claim 7 which depends on claim 6), the prior art of record does not teach or suggest the combination of all limitations in claim(s) 1, 5, and 6 together, including (i.e. in combination with the remaining limitations in claim[s] 1, 5, and 6) wherein the processor is configured to: determine a private frequency weight of each word using a private speech recognition log, determine a public frequency weight of each word using a public speech recognition log, and calibrate the word-by-word probability using the private frequency weight and the public frequency weight, wherein the private speech recognition log includes speech recognition logs in the artificial intelligence apparatus, and wherein the public speech recognition log includes speech recognition logs in an external apparatus of the same type as the artificial intelligence apparatus.


As per Claim(s) 11, the prior art of record does not teach or suggest the combination of all limitations in claim(s) 1, 8, 9, 10, and 11 together, including (i.e. in combination with the remaining limitations in claim[s] 1, 8, 9, 10, and 11) wherein the processor is configured to: determine a calibration weight with respect to words corresponding to the utterance situation information, and calibrate the frequency weight by multiplying the frequency weight by the calibration weight.
2011/0161077 teaches “At STEP 212, the MRP 110 processes the received n-best list from each selected SRE of the SREs 114, 116. Confidence-score values and word-score values in each received n-best list may be modified based on context information provided by the DDP 106 and a predefined weight value associated with the applicable SRE. For example, assume that a specific context of "find a restaurant" is received in the context information from the DDP 106 and that two types of static grammar are available, each in a separate SRE: 1) "the names of restaurant chains within large cities"; and 2) "specific restaurant names within each city." In a typical embodiment, names of restaurant chains are less likely to be found in small cities and are more likely to be found in large cities. In this case, a weight-multiplier value modifying the returned confidence-score values and word-score values in each returned n-best list can be established based on city size. A size or number of businesses in a city may be obtained, for example, from the Context Database 112” (paragraph 46) and “In this case, a city name from the context information received from the DDP 106 is what is being used with information from the Context Database 112 in order to determine city size and the weight multipliers to use to modify appropriate confidence-score values and word-score values. In a typical embodiment, after weighting weighted scores are modified, not where the weights to be multiplied with the scores are calibrated/adjusted/modified/changed.  
While multiplying a weighted score with a city-based weight may be mathematically equivalent to multiplying a weight used to generate the weighted score by the city-based weight and then multiplying the confidence/word score based on the result of multiplying the weight used to generate the weighted score by the city-based weight, the claim language specifically recites where the frequency weight is calibrated (not just where the score is weighted, and where the weight that is calibrated is specifically a frequency weight).
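The mathematical equivalence noted above, and the difference in what is actually modified, can be shown with a short sketch.  The function names and the numeric values are hypothetical; the first function only follows the ordering this Office action attributes to 2011/0161077, and the second follows the ordering recited in claim 11.

```python
import math


def weight_the_score(score, frequency_weight, situation_weight):
    """Ordering attributed to the reference: the already weighted score is modified."""
    weighted_score = score * frequency_weight
    return weighted_score * situation_weight


def calibrate_the_weight(score, frequency_weight, situation_weight):
    """Ordering recited in the claim: the frequency weight itself is calibrated first."""
    calibrated_frequency_weight = frequency_weight * situation_weight
    return score * calibrated_frequency_weight


# Hypothetical values: a word score, a frequency weight, a situation-based weight.
result_score_path = weight_the_score(0.4, 1.5, 2.0)
result_weight_path = calibrate_the_weight(0.4, 1.5, 2.0)

# Multiplication is associative, so both orderings give the same final value
# (compared with a tolerance because of floating-point rounding).
same_value = math.isclose(result_score_path, result_weight_path)
```

Both paths produce the same number, but only the second produces a calibrated frequency weight as an intermediate value, which is the object the claim language requires to be calibrated.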
8862467 teaches “With reference again to FIG. 3, at operation 308, the process uses context information received in the request from the client device to select one or 
2011/0231191 teaches “According to the present invention, since the weight coefficient generation device calculates the weight coefficient of the likelihood of the vocabulary to be recognized on the basis of the information quantity in the address database in the lower hierarchy of the place names of the recognition candidates stored in the address database, a larger weight can be given to the likelihood of the place names of which frequency in use is assumed to be high. This enables the generation of the weight coefficients of the place names for the purpose of improving the speech recognition performance” (paragraph 12).  Place names, in this reference, appears to refer to frequency with which place names are used, and does not appear to refer to a situation.
2005/0131699 teaches “Reference numeral 201 denotes a speech recognition module which recognizes speech input by the speech input device 105 or the like. More specifically, the speech recognition module 105 analyzes input speech, makes distance calculations with reference patterns, retrieval process, recognition result output process, and the like. The speech recognition dictionary 113 holds information of word IDs, notations, pronunciations, and word weights associated with words to be recognized. The acoustic model 112 holds models of phonemes, syllables, words, and the like, which are formed of, e.g., Hidden Markov Models: HMMs. A reference pattern of a word to be recognized is formed using models in the acoustic model 112 in accordance with word information and pronunciation information in the speech recognition dictionary 113. Reference numeral 202 denotes a frequency-of-occurrence update module which updates frequency-of-occurrence information of words to be recognized using the speech recognition result of the speech recognition module 201. The position/frequency-of-occurrence table 114 holds information associated with the positions and frequencies of occurrence of words to be recognized. Reference numeral 203 denotes a weight update module which calculates the weights of words to be recognized on the basis of the position/frequency-of-occurrence table 114, and changes information associated with weights in the speech recognition dictionary 113” (paragraph 36)
2019/0228763 teaches “Such performance may be tracked over time by controller 108 and used for a variety of purposes. In an embodiment, controller 108 selects one of intent classifications 114, 115 or final resultant intent 116 for implementation at all times or for implementation in certain environments or times based 
5802488 teaches “Speech recognition unit 5 fetches the weighting coefficient assigned to a recognition target speech according to time data by referencing coefficient setting unit 4. However, in Working example 2, coefficient storage unit 21 is connected to coefficient setting unit 4, and the content (weighting coefficients) stored in coefficient storage unit 21 is referenced by coefficient setting unit 4. As shown in FIG. 2B, coefficient storage unit 21 includes past time data storage unit 42 that stores the time data relating to past statistic data and coefficient table creation unit 44 that creates coefficient tables based on the statistic data from past time data storage unit 42. Based on coefficient tables created by coefficient table creation unit 42, coefficient setting unit 4 outputs a large weighting coefficient for multiplying to the recognition data of a phrase .

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
JP 2020/089641 teaches “To allow an operation instruction by an operator's voice to be recognized accurately. SOLUTION: Provided is a voice recognition input device in which a voice recognition data table is stored in which a plurality of voice commands are associated with one operation command and a weighting coefficient corresponding to the frequency of use of the voice command is recorded for each voice command, voice recognition processing is performed with voice input as a recognition target, the voice command is output as a result of the voice recognition processing, and the voice command is converted into the operation command recorded corresponding to the voice command referring to the voice recognition data table to output the operation command to an external device. In this case, a plurality of voice command candidates that can correspond to voice input are selected and each of the plurality of voice 
Dai et al. (US 2015/0325238) suggests An artificial intelligence apparatus for recognizing speech of a user comprising: a microphone; and a processor configured to: obtain, via the microphone, speech data including speech of a user, determine a frequency weight for each… using a speech recognition log, generate a speech recognition result corresponding to the speech data using the frequency weight, and perform control corresponding to the speech recognition result (Figure 1; paragraph 36, 38-41, 43-47, 49-52, 56, 88-89, 91;
“An artificial intelligence apparatus for recognizing speech of a user, comprising: a microphone; and a processor”: Figure 1; paragraph 36, 38-41, 49-52, 88-89, 91; an electronic apparatus that performs voice recognition [paragraph 36, 38-39, 49-52] which includes at least one microphone [paragraph 40] and a computer/data-processing-apparatus [“processor”] that is loaded with program instructions such that the computer/data-processing-apparatus is “configured” to perform steps [paragraphs 88-89, 91].  “voice recognition”, in this reference, is at least suggested to refer to determining what a user said [conventionally referred to as “speech recognition”, see paragraphs 49-52, where paragraphs 38 and 40 describes where voice information is “of a user”] such that the electronic apparatus can be interpreted as being “for recognizing speech of a user”, and voice recognition can be interpreted as an “artificial intelligence” process since it simulates the intelligence used by a human to listen to and 
“configured to: obtain, via the microphone, speech data including speech of a user,”: paragraphs 38, 40; obtaining first voice information of the user [“speech data including speech of a user”] by recording the first voice information by/”via” the microphone of the electronic apparatus.
“determine a frequency weight for each… using a speech recognition log,”: paragraphs 38-41, 43-47, 49-52, 56; maintaining a record [“log”] of how many times a user has spoken each of a plurality of possible recognition results [paragraphs 43-46, 51], where the record is used for performing speech recognition [i.e. the “log” is a “speech recognition log”] and where the record is used to “determine” “weights” to be applied to the matching scores “for each” possible recognition result, where the weights are based on user usage “frequency” [i.e. the weights are “frequency weights”].  The record/”log” is suggested to exist because computing devices typically do not inherently contain knowledge when they do not store data containing the knowledge [see also paragraph 56 which describes where “times of being used of each recognition entry in the N recognition entries” is “detected”, where the most intuitive way for a device to “detect” how many times a recognition entry has been used is to analyze data that indicates how many times each recognition entry has been used]
“generate a speech recognition result corresponding to the speech data using the frequency weight,”: paragraphs 38-41, 43-47, 49-52; the weights for the recognition results are used to determine a highest-weighted-score recognition result for the first voice information [thereby “generat[ing] a speech recognition result corresponding to the 
“and perform control corresponding to the speech recognition result”: paragraphs 49-52; executing an operational instruction corresponding to the first voice information to “control” the apparatus to perform a particular function that corresponds to the speech recognition result [e.g. finding contact Xiaoming and dialing Xiaoming’s phone number are collectively a function that implements what the words “call up Xiaoming” are asking for].  Additionally/alternatively, the operational instruction corresponding to the first voice information is at least suggested to be based on the determined recognition results [because if the system could go directly from the voice information to the operational instruction, then the speech recognition processing would be unnecessary])
	While “each word” may appear to be a trivial distinction at a glance, Dai describes commands that are implemented using phrases, and where the frequency weights are frequency weights for entire command phrases, not for the individual words in the command phrases.  As claimed, one frequency weight must be determined for each word (every word), and the word-specific frequency weight must be used to determine a speech recognition result that corresponds to a performed control.
2006/0173683 teaches “In general, in an additional aspect, the invention features a method of extending a speech vocabulary on a mobile device having a speech recognizer, the method involving: storing on the mobile device a lexicon for the speech recognizer, the lexicon including a plurality of words; storing a second plurality of words on the mobile device, the second set of text words being associated with an application 
Korfin et al. (US 2002/0095294) teaches “A verbal command can take the form of a single word, a short phrase, or a complex natural language sentence.” (paragraph 34).  While this suggests where the different commands/recognition-results in Dai can be single words, it is not obvious that there would be different single-word commands that correspond to the same input speech (i.e. in Dai, shortening “call up Xiaoming”, “call Xiaoming”, or “help me to call Xiaoming” into single word commands would lose at least some of the semantic meaning, because shortening to “call” would lose the call “target” Xiaoming, and shortening to “Xiaoming” would lose the function “call”, and if all 
2011/0015926 (LG reference) teaches “The storage unit 604 stores the word list. The word list can include key words (e.g., predetermined words) corresponding to various information. For example, the key words may correspond to an area name, a drama title, a multimedia title, the name of entertainers, or titles of data previously set by the user or downloaded via a wireless communication network. The storage unit 604 may be implemented as flash memory storing various storage data that can be updated together with a phonebook, origination messages, and reception messages in the general mobile communication terminal” (paragraph 86) and “FIG. 8 is an overview of a screen display illustrating detected words. As shown in FIG. 8, the controller 603 detects words consistent with word information which has been extracted by the voice analyzing unit 602 from among the key words stored in the storage unit 604, and when voice call communications are terminated, the controller 603 displays the detected words on the display unit 605. For example, if an area name such as `Hawaii`, the name of an entertainer (e.g., Matthew Fox), and a drama title (e.g., `LOST`) are detected, the controller 603 displays the detected words on the display unit 605. Here, the controller 603 may sequentially display the words according to the number of frequency of the word which has been spoken (or pronounced) by the user. Namely, if `Hawaii` is spoken by the user ten times, the controller displays `Hawaii ten times` on the display unit 605” (paragraph 100) and “The controller 1203 may detect words consistent with the word information which has been extracted by the voice analyzing unit 1202 from among the 
2013/0060561 teaches “As used herein, a “phrase” is a collection of one or more words delimited by one or more non-word characters; thus, a single word can be a “phrase” as defined herein”.  This reference describes where a phrase can be a word, not where the term “word” can refer to a multi-word phrase.
8150823 teaches “The noise word 36 can be one or more words”.
2020/0402516 teaches “prevention program 200 utilizes microphone 132 to detect a trigger word in an operating environment of listening device 130 and initiates beamforming module 134 to detect a voice command subsequent to the trigger word. For example, a trigger word can be one or more words that is a directive to a computer program to perform a specific task (e.g., initiate, run, wake-up, etc.).” (paragraph 37).  In this reference, the trigger word is one or more words preceding a command and does not appear to be the command itself, but this reference does appear to describe where one kind of “word” can be more than one word.
2017/0229115 teaches “The training data converter 110 may select the erroneous word from among a plurality of candidate words. The candidate words may be determined based on phonetic similarities between words. The candidate words may be phonetically similar to the word to be replaced with the erroneous word. For example, when the word to be replaced with the erroneous word is "write", the candidate words may be "wrote", "rewrite", "light", "right", and "lite" which are phonetically similar to "write". The training data converter 110 may select the erroneous word from among the candidate words "wrote", "rewrite", "light", "right", and "lite". A probability of each candidate word being selected as an erroneous word may be identical, or a predetermined candidate word may have a relatively high probability of being selected as an erroneous word. For example, when "write" is incorrectly recognized as "right" with a highest frequency among the candidate words "wrote", "rewrite", "light", "right", and "lite" with respect to "write", "right" may be set to have a relatively high probability of being selected as an erroneous word of "write", when compared to the other candidate words” (paragraph 66).  This reference suggests where single words can be phonetically similar, but this reference does not appear to describe where the single words are sufficient to be an entire utterance that causes a device to perform a function.
2011/0029301 teaches “Equation 2 may be used to rewrite the word sequence W composed of words w.sub.1, w.sub.2, . . . , w.sub.Q, and illustrates that a language model probability value for the word sequence W may be obtained by multiplying the language model probabilities of individual words by each other” (paragraph 77).  This 
9361289 teaches “The user may also utilize speech recognition services, such as those provided by a network-based ASR service (which may or may not be part of a larger spoken language processing system), an ASR service executing locally on the user's client device, or the like. Generally described, ASR systems use language models to help determine what a user has spoken. For example, a general language model may include a very large number of words and phrases in an attempt to cover most utterances made by a general population. The words may be ranked, scored, or weighted so that words which are used often by the general population are given more weight in ASR hypotheses than words which are rarely used. Such weights are typically based on speech patterns of a general population targeted by the general language model. However, because each user is unique and may, e.g., use certain words substantially more often than those same words are used by the general population, the general language model may not produce satisfactory results for utterances that include such words. For example, if a user has purchased albums by a particular artist who is not a mainstream artist, the user is more likely than a user in the general population to make utterances (e.g., spoken commands to play music) that include that artist's name. If the artists name is given very low weight in the general language model, or is omitted altogether, then the user's experience may be less than satisfactory”.  This reference appears to teach away from weighting words based on general population usage, but does describe weighting words.  This does not appear to describe weighting every word.
9514747 teaches “In another aspect, in the presence of undesired latency, certain paths or portions of a graph may be weighted to make those paths higher or lower scoring to speed up ASR processing. Certain regions of a graph may be tagged with different identifiers so that under certain conditions the scores associated with portions of the graph may be adjusted. Multiple different tags may be used to make the system more configurable and adjustable to respond to different latency/utterance conditions. In this manner the ASR processing may be more finely tuned than with the pruning techniques described above. If certain portions of the graph may take longer to process or are computationally expensive but should not be discarded entirely, those portions may be weighted lower under high latency conditions. Or, if those portions should be discarded, their weights may be set to zero. In some aspects, some arcs may be tagged based on `obscurity` measures to avoid those arcs when the speech recognition system is subjected to latency pressure. Similarly, if other portions of the graph are less computationally expensive, etc., those portions may be weighted higher under high latency conditions in an attempt to have the ASR system focus its processing more heavily on the most likely paths. These adjustable weights may be set for different words, word patterns, or any other configuration based on desired tuning and/or empirical experimentation for how well the weighting reduces latency. The weights may also be configured/adjusted based on the specific user who spoke the utterance, particular audio conditions, etc. For example, words that are commonly spoken by the user may receive higher weights than words rarely spoken by the particular user”.  This reference describes weighting words that are more commonly each/every word.
2007/0129942 teaches “Now, with reference to FIGS. 4 and 5, we will describe a process implemented by a program according to the present invention for the visualization, i.e. annotated graphing of the contents of the business meeting described with respect to FIGS. 1 through 3. At a business meeting, provision is made for the recording of the sequential audio content of the meeting, as illustrated in FIG. 1, and for the storage of the recorded audio file, step 60. Each speaker at the meeting is identified, step 61, e.g. by the triangulation, previously described with respect to FIG. 1. The audio file is then converted into the stored sequential text document of the complete content of the meeting, step 62. The stored audio file may be subsequently converted to the text of the audio content of the meeting or it may be directly converted into text on a real time basis as the speaking in the meeting continues. In either instance, conventional speech recognition techniques may be used, such as the conventional techniques described in U.S. Pat. No. 6,937,984 (filed Dec. 18, 1998). Next, the stored sequential text document of the full content is analyzed, step 63, so that a graphical outline may be created that visualizes and annotates the graphical content to provide sequential graphical annotated outline that is scrollable in synchronization with the scrolling of the sequential text document as was shown with respect to FIG. 3. In a computer controlled display terminal as described in FIG. 2, there is provided an operating system with a graphics engine, e.g. the graphics/text functions of Windows.sup.XP, which, in turn, translates the vectors provided for the areas in a stacked area graph into dynamic pixel arrays providing the annotated stacked graphs shown in FIG. 3. Some of the analytical each/every word/term.
2005/0192802 teaches “In one embodiment of the invention, the majority of the components for a disambiguating back end are shared among different input modalities e.g. for handwriting recognition and for speech recognition. The word list 214 comprises a list of known words in a language. The word list 214 may further comprise the information of usage frequencies for the corresponding words in the language. In one embodiment, a word not in the word list 214 for the language is considered to have a zero frequency. Alternatively, an unknown word may be assigned a very small frequency of usage. Using the assumed frequency of usage for the unknown words, the known and unknown words can be processed in a substantially same fashion. The word list 214 can be used with the word based disambiguating engine 216 to rank, eliminate, and/or select word candidates determined based on the result of the pattern recognition front end (e.g., the stroke/character recognition engine 212 or the phoneme recognition engine 213) and to predict words for word completion based on a portion of user inputs. Similarly, the phrase list 215 may comprise a list of phrases that includes two or more words, and the usage frequency information, which can be used by the phrase-based 
2014/0088961 teaches “The ASR engine analyzes the speech content in the audio track of the segment after having removed the background sounds/noises from consideration based on the background sounds/noise profiles for the segment. The ASR engine operates to recognize words in the speech content and generate textual equivalents of these words. The identification of words in speech input is generally known in the art and thus, a more detailed description is not provided herein. However, the automatic detection of words by the ASR engine in the illustrative embodiments described herein is augmented by the dynamic configuration of the ASR engine as discussed above. In addition, if the ASR determines that there are a plurality of possible textual equivalents to an spoken word in the speech content of the audio track of the segment, then the library of commonly used words for the speaker, and the corresponding weights determined by frequency of use of the words in social network service information” (paragraph 23) and “In addition, the text/video/audio postings of the user to the social network service may be analyzed to identify various characteristics of the user's speech. From an analysis of the text postings, a listing of frequently used words may be generated and ranked or weighted according to their frequency of use. From an analysis of the video/audio postings, a user's speech patterns, such as low pitch, high pitch, rapid speaking, slow speaking, frequent pauses, and the like may be determined” (paragraph 61).
2016/0098393 teaches “As discussed above, some embodiments employ an ASR engine front end as input to an NLU engine. An illustrative process for selecting an 
2009/0150156 teaches “The generated list of possible destinations may be post-processed in an operation 530 in order to assign weights or ranks to one or more of the entries in the N-best list. The post-processing may include analyzing the destination list generated in operation 520 according to various factors in order to determine a most likely intended destination from a full or partial voice destination input. For example, post-processing operation 530 may rank or weigh possible destinations according to shared knowledge about the user, domain-specific knowledge, dialogue history, or other factors. Furthermore, the post-processing operation 530 may analyze the full or partial destination input in order to identify an address to which a route can be calculated, for example, by resolving a closest address that makes "sense" relative to the input destination. For example, a user may specify a partial destination that identifies a broad 
2013/0325448 teaches “Referring again to FIG. 6G, operation 684 may include operation 686 depicting managing adaptation data comprising one or more words having a weight based upon a frequency of appearance of the one or more words. For example, FIG. 2, e.g., FIG. 2G, shows adaptation data comprising one or more words that are assigned a frequency of appearance based weight managing module 286 managing adaptation data comprising one or more words having a weight (e.g., a real number between zero and one hundred) based upon a frequency of appearance of the one or more words (e.g., if the user carries out a lot of transactions with automated teller machine devices, the word "four" may have a higher weight than the similarly-pronounced word "ford")” (paragraph 168).
2020/0357392 teaches “For example, if the word "street" occurs at least once in the candidate speech recognition result, then the controller 148 sets the value of the corresponding element in the feature vector as 1 during the feature extraction process. In another embodiment, the controller 148 identifies the frequency of each word, where "frequency" as used herein refers to the number of times that a single word occurs within a candidate speech recognition result. The controller 148 places the number of 
5241619 teaches “multiplying the different probability scores for each of said N different word sequences from step f so as to produce the combined probability scores of said models”
2006/0116997 teaches “wherein the query string comprises multiple words and determining a score for the query string comprises multiplying a product of probabilities of overlapping sequences of tokens associated with a first word by a product of probabilities of overlapping sequences of tokens associated with a second word in the query string, and further by a probability for a unigram of silence”
2018/0173494 teaches “The speech recognition apparatus 100 may receive information on speech commands that the speech recognition apparatus 100 receives from the user in a plurality of situations to store the plurality of candidate activation words. The speech recognition apparatus 100 may extract a plurality of words included in the speech commands. The speech recognition apparatus 100 may store at least one word as a candidate activation word corresponding to a specific situation, based on a frequency with which the plurality of words are included in the speech commands received in a specific situation among the plurality of situations” (paragraph 70).
5802488 teaches “Speech recognition unit 5 fetches the weighting coefficient assigned to a recognition target speech according to time data by referencing coefficient setting unit 4. However, in Working example 2, coefficient storage unit 21 is connected to coefficient setting unit 4, and the content (weighting coefficients) stored in coefficient storage unit 21 is referenced by coefficient setting unit 4. As shown in FIG. 2B, 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ERIC YEN whose telephone number is (571)272-4249.  The examiner can normally be reached on M-F 12:00PM - 8:30PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, RICHEMOND DORVIL, can be reached on (571)272-7602.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

EY 9/18/2021
/ERIC YEN/Primary Examiner, Art Unit 2658