Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-20 are pending. Claims 1, 15 and 18 are independent.
This Application was published as U.S. 2022/0301548.
            Apparent priority: 16 March 2021.

	Note 16/929,383 published as 20220020361.  No Obviousness Double Patenting present.
Claim Objections
Claim 4 is objected to because of informalities that may be addressed with the following suggested amendments: 
4. The system of claim [[2]] 3, wherein the bag of phone N-grams includes phone bigrams.
The bag of n-grams first appears in Claim 3.  As is, the Claim lacks antecedent basis for “the bag of …”

Claims 7-8 are objected to because of informalities that may be addressed with the following suggested amendments: 
7. The system of claim 1, wherein the first automatic [[search]] speech recognition engine is the same as the second automatic [[search]] speech recognition engine.
8. The system of claim 7, wherein the first automatic [[search]] speech recognition engine includes a deep neural network acoustic model.
No antecedent basis for “search recognition.”

Claim 12 is objected to because of informalities that may be addressed with the following suggested amendments: 
12. The system of claim 11, wherein the Naive Bayes bag of words are weighted by [[a]] lattice probabilities associated with the training set of the topic keyword-containing lattices.
Appropriate correction is required.
35 U.S.C. 112(f) Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 
The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. 
Such claim limitation(s) is/are: the various “modules” in Claims 1-14. These limitations are generic in the context of the art and don’t refer to any specific structure and only serve as placeholders for the structure that performs the associated function(s) without providing any information about what that structure is. MPEP 2181 I A says:
For a term to be considered a substitute for "means," and lack sufficient structure for performing the function, it must serve as a generic placeholder and thus not limit the scope of the claim to any specific manner or structure for performing the claimed function. It is important to remember that there are no absolutes in the determination of terms used as a substitute for "means" that serve as generic placeholders. The examiner must carefully consider the term in light of the specification and the commonly accepted meaning in the technological art. Every application will turn on its own facts.
Based on the ordinary skill in the art and description of functions of these components in the Specification, they refer to processors or a combination of processor and memory and possibly transducers such as microphones and displays or to a combination of software and hardware.
PLEASE NOTE: This is NOT a rejection. Please don’t address it as a rejection. If the Applicant does not agree with the INTERPRETATION, he may argue or amend to replace the terms interpreted under 112(f) with structural terms such as “microphone” or “processor” as appropriately supported by the Specification. In the alternative, he may let the interpretation stand if the intent was to include a means plus function limitation in the Claim.
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1, 6-10, and 13-20 are rejected under 35 U.S.C. 103 as being unpatentable over McDonough (U.S. 5626748) in view of Thomson (U.S. 20220059077).
Regarding Claim 1, McDonough teaches:
1. A voice topic spotting system comprising: 
a learning module arranged to: [ McDonough, Figures 3 and 5 pertain to the Training phase.]
i) receive a plurality of training audio segments, [ McDonough, Figure 3 teaches that the “Potential Speech Events 20” and “Topic K, Speech Message K 36” are input to the training process both of which teach the “plurality of audio segments” of the Claim which are used for the process of training.  Figure 5, “potential speech events 20” teach the “plurality of audio segments” of the Claim which are used for the process of training.  “FIG. 5 is a block diagram of further components that are used in a preliminary phase of training, i.e., how one obtains the potential speech events.”]
ii) receive segment topic labels associated with the each of the plurality of training audio segments, [ McDonough, Figure 3, “Topic K 36” associated with the “speech message K” teaches the “segment topic labels” of the Claim.  “FIG. 3 is a block diagram of the primary components used in the training procedure for training the system of FIG. 1. The input to the training procedure is either or both a transcribed training data 30 and untranscribed speech data 36. Transcribed training data corresponds to a set of text sequences each identified with the true topic. Untranscribed speech data corresponds to a set of recorded messages each labeled with its true topic but which are otherwise unannotated.”  Col. 6, 4-12.]
iii) execute a first automatic speech recognition engine to extract a plurality of topic keyword hypotheses associated with the plurality of training audio segments; [ McDonough, Figure 3, “speech event frequency detector 38” which can be implemented, see Figure 6A, by a  “speech recognizer 12, 38.”  “Each event frequency detector 12 and 38 can be constructed hypothesized string of events and then computing the event frequencies from the string of events. In this form of event frequency detector, the hypothesized string can be a hypothesized word sequence produced by a speech recognizer, a set of putative word or phrase occurrences produced by a word spotter, or a set of putative word or phrase occurrences with associated confidence scores. ….” Col. 7, 5-25.]
iv) apply a chi-squared test to select a set of topic-indicative words as a subset of the plurality of topic keyword hypotheses and [ McDonough, Figure 11, pertains to “hypothesis testing” and one of the methods of “hypothesis testing” used in McDonough is “Chi-squared.”  “Speech Event Subset Selection.  …. Hence, event selection is necessary to determine the subset of words or other acoustic events which, when observed or hypothesized in a speech message, best serve to indicate the topic membership of that message. …One preferred method of event selection (shown in FIG. 11) is founded on the concept of hypothesis testing. According to one aspect of the invention, (as shown in FIG. 12) hypothesis testing is used to determine whether or not there is sufficient evidence to establish that the occurrence pattern for a given keyword or event is dependent on the topic membership of the speech messages in which the event is observed. If such a dependence is established, the event is assumed to be a good indicator of topic membership. The χ²-test is well-known in the art … and useful for testing dependencies of this type.”  Col. 9, line 65 to Col. 10, line 26.  “12. The method according to claim 11, wherein said hypothesis test is a chi-squared test, and the step of obtaining topic-conditional significance or association scores by said chi-squared test includes the step of calculating the chi-squared (χ²) values as follows ….”] select a training set of topic keyword-containing lattices associated with the set of topic-indicative words; [ McDonough, Figures 10, 11 and 12 are used for selecting a set of topic keywords used for the training, but McDonough does not teach that a set of “topic-keyword containing lattices” is selected first.  (Note, however, that speech recognizers generally generate multiple speech recognition hypotheses in the form of a lattice or an N-best list.)   In Figure 11, the “words or event above threshold” are selected from the “hypothesis strings.”  FIG. 11 shows a preferred embodiment of event selection that is founded on the concept of hypothesis testing.]
iv) generate a fast keyword filter model based on the set of topic-indicative words, and [ McDonough, Figure 3, the “Topic K Event Frequencies K 40” teaches the “keyword filter.”  It can output a set of “events” where the “events” can be keywords:  “In the case where transcribed training data is available as indicated at output 40, each text sequence, provided from the transcribed training data 30, is converted into a set of event frequencies using the text event detector 32. For each of a set of potential text events 34, the text event detector scans the text and determines the frequency of occurrence of that event. As is the case with the potential speech events 20, potential text events can include individual words, multiword phrases, and complex phrases specified in a form such as a regular expression or a context-free grammar.” Col. 6, 13-22.]
v) generate a topic identification model based on the training set of topic keyword-containing lattices; and [ McDonough, Figure 3, “Topic Modeling 42” is the output of the training which begins by input of the training set and the trained model is used to identify topic keywords.  “The topic modeling component 42 uses as input the output 40 representative set of event frequencies along with the true topic labels. In the preferred embodiment shown in FIG. 4, topic modeling comprises first selection of a subset of the available events. Then, parametric probabilistic models for the event frequency of each of the selected events are estimated. In the preferred embodiment, the parametric models take the form of multinomial distributions or mixtures of multinomial distributions, although other distributions can be used as described in greater detail hereinafter. The topic model parameters 22 are then comprised of the selected subsets of events and the parameter values for the individual event frequency distributions.”  Col. 6, 30-42.]
a voice topic classifier module including: [ McDonough, Figure 1 shows the application of the model that has been trained by the process of Figure 3.]
a second automatic speech recognition engine arranged to identify one or more keywords included in a received audio segment and output the one or more keywords; [ McDonough, Figure 1, “speech event frequency detector 12” includes a speech recognizer. The input is shown as the “spoken message 10” and the output includes “Event Frequencies 14” which means the frequencies of the keywords and the “Topic Classifier Output 18.” “The event frequencies are processed by the topic classifier 16 to produce the topic classifier output 18. The output can take the form of a choice from a preselected set of known topics see (FIG. 2A) a choice of either presence or absence of a particular known topic (see FIG. 2B), or a confidence score that a particular known topic is present (see FIG. 2C). The topic classifier 16 makes use of topic model parameters 22 that are determined during a prior, or potentially ongoing, training procedure.”  Col. 5, line 62 to Col. 6, line 3.  “The speech event frequency detectors 12 and 38 of FIGS. 1 and 3, which are used either in processing new speech data or in training the system, are each designed to extract relevant features from the speech data…. Events include presence of individual words, multiword phrases, and complex phrases specified in a form such as a regular expression or a context-free grammar. An example of a multiword phrase would be "credit card" or a brand name card such as "American Express". An example of a complex phrase would be a syntactically correct flight identification in an air-traffic-control command, or a time including any form such as "twelve o'clock", "noon", or "five minutes to ten."”  Col. 6, line 66 to Col. 7, line 5.  See Col. 7, lines 5-25 provided above and Figure 6A for the teaching that speech event frequency detectors 12 and 38 include a speech recognizer.]
a fast keyword filter, implementing the fast keyword model, arranged to receive the one or more keywords and detect whether the received audio segment includes any keywords of the set of topic-indicative words and, if detected, output the received audio segment as a topic keyword-containing audio segment but, if not detected, not output the received audio segment; [ McDonough, Figure 2A, a topic is selected from a “set of known topics 22A.” “… The output can take the form of a choice from a preselected set of known topics see (FIG. 2A) a choice of either presence or absence of a particular known topic (see FIG. 2B), or a confidence score that a particular known topic is present (see FIG. 2C)….”  Col. 5, line 62 to Col. 6, line 3.]
a decoder arranged to, if the topic keyword-containing audio segment is outputted by the keyword filter, receive the topic keyword-containing audio segment and generate a topic keyword-containing lattice associated with the topic keyword-containing audio segment; and 
a voice topic classifier, implementing the voice topic identification model, arranged to receive the topic keyword-containing lattice and execute a machine learning technique to determine a topic associated with the topic keyword-containing audio segment. [ McDonough teaches that the training could be a “potentially ongoing, training procedure” (Col. 6, line 3) such that the topic classifier of Figure 1 and the training process of Figure 3 occur iteratively.]
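For illustration only, the chi-squared event selection and the keyword gating discussed in the mapping above can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from McDonough or from the Application: the function names, the toy data, the positive-association check, and the use of the 5% critical value for one degree of freedom (3.841) as the threshold are all hypothetical choices made for the example.

```python
from collections import Counter


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A row or column is empty, so no dependence can be measured.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_topic_words(segments, topic, threshold=3.841):
    """Select words whose presence depends on membership in `topic`.

    segments: list of (hypothesized_words, topic_label) pairs (assumed format).
    threshold: 3.841 is the 5% critical value for 1 degree of freedom.
    """
    in_topic = sum(1 for _, label in segments if label == topic)
    out_topic = len(segments) - in_topic
    seen_in, seen_out = Counter(), Counter()
    for words, label in segments:
        (seen_in if label == topic else seen_out).update(set(words))

    selected = set()
    for word in set(seen_in) | set(seen_out):
        a, b = seen_in[word], seen_out[word]  # segments containing the word
        c, d = in_topic - a, out_topic - b    # segments lacking the word
        # Keep only words positively associated with the topic (assumption).
        if a * d > b * c and chi_squared_2x2(a, b, c, d) >= threshold:
            selected.add(word)
    return selected


def passes_keyword_filter(recognized_words, topic_words):
    """Gate: pass a segment on only if it contains a topic-indicative word."""
    return not topic_words.isdisjoint(recognized_words)


# Toy training data (hypothetical): four "banking" and four "weather" segments.
SEGMENTS = [({"the", "credit"}, "banking")] * 4 + [({"the", "rain"}, "weather")] * 4
print(sorted(select_topic_words(SEGMENTS, "banking")))  # ['credit']
```

In this toy data, "the" occurs in every segment and so carries no topic information, while "credit" occurs only in the "banking" segments and is selected.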

McDonough does not teach using its hypothesis testing to “select a training set of topic keyword-containing lattices associated with the set of topic-indicative words”; instead, McDonough selects the keywords directly.
McDonough does not teach the use of an encoder/decoder system.
Thomson teaches:
1. A voice topic spotting system comprising: [Thomson, Figure 14, teaches that the analysis of the transcript may be used for topic spotting:  “[0302] In some embodiments, the audio processing operations of the audio processor 1426 may include causing transcription and/or analysis of part or all of the audio 1406 using a first ASR system different from the ASR system 1420. In these or other embodiments, the transcription and/or analysis may be used to identify the specific topic and corresponding language model for the ASR system 1420. In some embodiments, using all of the audio 1406 to determine the particular topic and/or language model may be more accurate than using only a portion of the audio 1406.”.]
a learning module arranged to: [Thomson, Figure 12 shows the training of the ASR which includes topic identification.  See [0301] and [0303].]
…
a voice topic classifier module including: [Thomson, Figure 12, the trained ASR can identify the topics.  See [0303].]
…
a decoder arranged to, if the topic keyword-containing audio segment is outputted by the keyword filter, receive the topic keyword-containing audio segment and generate a topic keyword-containing lattice associated with the topic keyword-containing audio segment; and [ (This limitation sets forth what the ASR does; when the ASR is implemented by a decoder (or in an encoder-decoder configuration), the decoder (ASR) generates the lattice of the recognized word hypotheses.)  Thomson, Figure 25, either of the first or second decoders 2510, 2520 teaches the “decoder” of the Claim because they receive the acoustic features that are extracted from the first or second audio input to the system of Figure 24, and generate the lattice hypotheses that could include the topic keyword and thus be “a topic keyword-containing lattice associated with the topic keyword-containing audio segment” of the Claim.  “[0514] In some embodiments, the first decoder system 2510 may be configured to generate and output a first word lattice. The first word lattice may be a directed acyclic graph with a single start point and edges labeled with a word and a score. The first word lattice may include multiple words.….”]
a voice topic classifier, implementing the voice topic identification model, arranged to receive the topic keyword-containing lattice and execute a machine learning technique to determine a topic associated with the topic keyword-containing audio segment. [Thomson, Figure 14, teaches an “audio processor 1426” that may be configured to detect topics in an audio input in order to select a topic-specific language model in the ASR.  Then, Thomson in Figure 12 shows the “training system 1212” and teaches that the “model 1214” that is being trained is implemented in DNN which is a machine learning model and also that this “model 1214” can be some type of model used by the ASR.  The combination of the two teachings when taken together suggests that the “model 1214” can be a topic detection model which is being trained in a machine learning process.  “[0301] In some embodiments, the audio processor 1426 may be configured to detect that the audio 1406 pertains to a particular topic. In these or other embodiments, the audio processor 1426 may load a topic-specific language model into the ASR system 1420 such that the ASR system 1402 may perform transcription operations based on the particular topic and words and phrases associated therewith.” “[0302] …  the transcription and/or analysis may be used to identify the specific topic and corresponding language model for the ASR system 1420. ….”   “[0253] In some embodiments, the model 1214 may be a deep neural network model or other type of machine learning model that may be trained based on providing parameters and a result. In some embodiments, the model may be a language model or an acoustic model that may be used by an ASR system to transcribe audio. Alternately or additionally, the model may be another type of model used by an ASR system to transcribe audio.”  “[0366] … For example, the fuser system 1620 may use a machine learning model to make a selection of a word from the fuser system 1620…..”  Classifier to select a word by machine learning in [0331].  See [0431], [0432] machine learning based classifier to classify audio or text.  Machine learning to determine difficulty of audio at [0076].]
McDonough and Thomson pertain to or include topic spotting and it would have been obvious to use the “decoder” of Thomson which is used for speech recognition and outputs a lattice of hypotheses and is trained via machine learning in place of the speech recognizer of McDonough as a more modern method of performing the same task.  This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.  (Note in the Conclusion and as further support that a lattice is one way of output of ASR hypotheses: Santos (U.S. 10,838,954) “… The ASR component 1208 may also output multiple ASR hypotheses in the form of a lattice or an N-best list ….” Col. 31, lines 23-43.)
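For illustration only, the word lattice described in Thomson at [0514] (a directed acyclic graph with a single start point and edges labeled with a word and a score) can be sketched as follows. This is a minimal sketch under stated assumptions and is not Thomson's implementation: the edge tuple layout, the use of log scores, the node numbering, and both function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical lattice: each edge is (from_node, to_node, word, log_score).
# Node 0 is the single start point and node 2 is the end point.
EDGES = [
    (0, 1, "credit", -0.2),
    (0, 1, "edit", -1.5),
    (1, 2, "card", -0.1),
    (1, 2, "cart", -2.0),
]


def best_path(edges, start, end):
    """Return (total_log_score, words) of the highest-scoring start-to-end path."""
    outgoing = defaultdict(list)
    for src, dst, word, score in edges:
        outgoing[src].append((dst, word, score))

    memo = {}

    def best_from(node):
        if node == end:
            return 0.0, []
        if node in memo:
            return memo[node]
        best = (float("-inf"), [])
        for dst, word, score in outgoing[node]:
            tail_score, tail_words = best_from(dst)
            if score + tail_score > best[0]:
                best = (score + tail_score, [word] + tail_words)
        memo[node] = best
        return best

    return best_from(start)


def lattice_contains_keyword(edges, topic_words):
    """True if any hypothesis edge in the lattice carries a topic-indicative word."""
    return any(word in topic_words for _, _, word, _ in edges)


score, words = best_path(EDGES, start=0, end=2)
print(words)  # ['credit', 'card']
```

The lattice retains the competing hypotheses ("edit", "cart") alongside the best path, which is why a lattice can contain a topic keyword even when the single best transcription does not.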


Regarding Claim 6, McDonough does not discuss the details of the models.
Thomson teaches and therefore suggests:
6. The system of claim 1, wherein plurality of topic keyword hypotheses are represented as phone N-grams. [Thomson teaches that instead of words, phonemes may be generated by the ASR which means that the n-gram used by the language model would be a phone n-gram.  See [0460] above.  Thomson also teaches that its ASR models include topic spotting models.  The two teachings when combined suggest that phoneme n-gram models of Thomson may be used for its topic spotting.]
Rationale for combination as provided for Claim 1 with addition that the n-gram models which are a well-known feature of language models may also be brought from Thomson for the ASR of McDonough.
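For illustration only, a bag of phone N-grams (phone bigrams when N is 2) can be sketched as follows. This is a minimal sketch under stated assumptions and is taken from neither reference: the function name and the ARPAbet-like phone symbols are hypothetical.

```python
from collections import Counter


def phone_ngrams(phones, n=2):
    """Count the N-grams in a phone sequence; n=2 gives a bag of phone bigrams."""
    return Counter(tuple(phones[i:i + n]) for i in range(len(phones) - n + 1))


# Hypothetical phone sequence for the word "credit".
bag = phone_ngrams(["k", "r", "eh", "d", "ih", "t"])
print(bag[("k", "r")])  # 1
```

A bag discards the position of each N-gram and keeps only its count, so two segments are compared by which phone sequences they contain and how often.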

Regarding Claim 7, McDonough teaches:
7. The system of claim 1, wherein the first automatic search recognition engine is the same as the second automatic search recognition engine. [McDonough teaches the topic classification in Figure 1 and the training of the topic classifier model in Figure 3 both of which include speech recognition as part of their “speech event frequency detection.”  McDonough teaches that the training can be done separately or during the execution in which case both would be using the same speech recognizer because the output of the system is fed back to the system as training data.  “The output can take the form of a choice from a preselected set of known topics see (FIG. 2A) a choice of either presence or absence of a particular known topic (see FIG. 2B), or a confidence score that a particular known topic is present (see FIG. 2C). The topic classifier 16 makes use of topic model parameters 22 that are determined during a prior, or potentially ongoing, training procedure.” Col. 5, line 64 to Col. 6, line 3.]

Regarding Claim 8, McDonough does not discuss the details of the models.
Thomson teaches:
8. The system of claim 7, wherein the first automatic search recognition engine includes a deep neural network acoustic model. [Thomson teaches that the models used for the ASR may be implemented in neural networks: “[0253] In some embodiments, the model 1214 may be a deep neural network model or other type of machine learning model that may be trained based on providing parameters and a result. In some embodiments, the model may be a language model or an acoustic model that may be used by an ASR system to transcribe audio. Alternately or additionally, the model may be another type of model used by an ASR system to transcribe audio.”]
Rationale for combination as provided for Claim 1.

Regarding Claim 9, McDonough teaches:
9. The system of claim 1, wherein the decoder includes a finite state transducer (FST) decoder. [McDonough:  “For this reason, the word recognizer comprising one component of the gisting system is able to make use of finite state networks specifically designed to model each of a number of commonly occurring words and phrases ….”  Col. 2, lines 60-65.  See also Col. 8, lines 8-11.]

Regarding Claim 10, McDonough teaches:
10. The system of claim 9, wherein the decoder implements a Hidden Markov Model (HMM). [McDonough does not teach the use of a “decoder.”  However, the decoder of Claim 1 is one implementation of the ASR (speech recognizer) which is part of the event frequency detector taught by McDonough.  McDonough teaches the use of HMMs for its ASR:  “In a preferred embodiment of the event frequency detector shown in FIG. 7, a hidden Markov model (HMM) based word and phrases spotter is used.”  Col. 7, lines 26-28.]

Regarding Claim 13, McDonough does not mention the duration of the input “spoken message 10.”
Thomson teaches and suggests:
13. The system of claim 1, wherein the audio segment has a duration greater than or equal to 10 seconds. [Thomson teaches that accuracy is estimated for periods of time including 10 seconds or greater:  “[0061] …For example, past accuracy estimates may include the estimated and/or calculated accuracy for a previous period of time (e.g., for the past 1, 5, 10, 20, 30, or 60 seconds), since the beginning of the communication session, or during at least part of a previous communication session with the same transcription party. ….”  Because the accuracy pertains to the audio segment, this suggests also the duration of audio segments.  See also “[0059] In some embodiments, the transcription system 120 may be configured to determine an accuracy of the transcriptions generated by the transcription system 120. The accuracy may be estimated for an entire communication session, a portion of a communication session, a phrase, or a word…”]
Rationale for combination as provided for Claim 1.
Regarding Claim 14, McDonough teaches in Figures 2A, 2B, and 2C that the topic identification is performed by choosing from a limited “set of known topics” (22A, 22B, 22C).  However, it does not specify a number of words for the topics.
Thomson teaches:
14. The system of claim 1, wherein the set of topic-indicative words is less than or equal to 1000 words.  [Thomson, Figure 24, teaches that instead of the “align system 2420,” a neural network including a CNN may be used:  “[0476] … For example, one or more of the layers of a neural network that is part of the environment 2400 may include a convolutional neural network (CNN) layer, which may include a pooling layer. Nodes in the CNN layer may include multiple inputs from a previous layer in each direction. For example, a node in a CNN layer may include 10, 20, 50, 100, 500, 1000, 2000, 3000, or more inputs in each direction. …”  See also [0475].  The inputs to the CNN include the topics/labels if the model is to be used for topic spotting.]
Rationale for combination as provided for Claim 1 with addition that the number of labels/topics is here correlated to the implementation method which is the use of a CNN and therefore the number of inputs to the CNN is a proxy for the number of topics/labels. 

Regarding Claim 15, McDonough teaches all of the limitations as shown with respect to Claim 1 except for the lattice, which is taught by Thomson.  
15. A voice topic spotting learning system comprising: 
a communications interface arranged to: 
i) receive a plurality of training audio segments and segment topic labels associated with each of the plurality of audio segments, 
ii) output a fast keyword filter model to a voice topic classifier, and 
iii) output a topic identification model to the voice topic classifier; and 
a processor, in electrical communication with the communications interface, arranged to: 
i) execute an automatic speech recognition engine to extract a plurality of topic keyword hypotheses associated with the plurality of audio speech segments; 
ii) apply a chi-squared test to select a set of topic-indicative words as a subset of the plurality of topic keyword hypotheses and select a training set of topic keyword- containing lattices associated with the set of topic-indicative words; 
iii) generate the fast keyword filter model based on the set of topic-indicative words, and 
iv) generate the topic identification model based on the training set of topic keyword-containing lattices.

Claim 16 has a limitation similar to the limitation of Claim 14 which is rejected under the same rationale.  The language appears different but different labels and different topics are the same thing.
16. The system of claim 15, wherein the segment topic labels include an amount of different labels that are less than or equal one of 10, 20, 40, 50, 100, and 200.  [Thomson, Figure 24, teaches that instead of the “align system 2420,” a neural network including a CNN may be used:  “[0476] … For example, one or more of the layers of a neural network that is part of the environment 2400 may include a convolutional neural network (CNN) layer, which may include a pooling layer. Nodes in the CNN layer may include multiple inputs from a previous layer in each direction. For example, a node in a CNN layer may include 10, 20, 50, 100, 500, 1000, 2000, 3000, or more inputs in each direction. …”  See also [0475].  The inputs to the CNN include the topics/labels if the model is to be used for topic spotting.]

Regarding Claim 17, McDonough teaches: 
17. The system of claim 15, wherein the fast keyword filter model and the topic identification model are generated offline before the voice topic classifier receives an audio segment for voice topic identification. [McDonough teaches that the training of the model, including the “Topic K, Event Frequencies K 40” detector of Figure 3, is separate from the execution of the model and its use to detect “Event Frequencies 14” in Figure 1 (event = keyword); the training could be “prior” to the execution.  “FIG. 1 is a block diagram of the components that are used to process a spoken message, or other speech data input, indicated at 10, and ultimately produce the topic classifier output 18. The spoken message 10 is processed by a speech event frequency detector 12, which in turn is coupled to a predetermined set of potential speech events 20, e.g., a vocabulary of predetermined words and phrases. The speech event frequency detector produces a signal representative of a set of event frequencies 14 for the potential speech events. The potential speech events 20 can include individual words, multiword phrases, and complex phrases specified in a form such as a regular expression or a context-free grammar. The event frequencies are preferably estimates of the frequency of occurrence of the potential speech events in the spoken data. The speech event frequency detector preferably includes a speech recognizer or word and phrase spotter. The frequency of occurrence of the specified potential speech events is determined by processing the output of the speech recognizer or word spotter although such processing could be integrated into the speech recognizer or word and phrase spotter. The event frequencies are processed by the topic classifier 16 to produce the topic classifier output 18. The output can take the form of a choice from a preselected set of known topics (see FIG. 2A), a choice of either presence or absence of a particular known topic (see FIG. 2B), or a confidence score that a particular known topic is present (see FIG. 2C). The topic classifier 16 makes use of topic model parameters 22 that are determined during a prior, or potentially ongoing, training procedure.”  Col. 5, line 41 to col. 6, line 3.]

Regarding Claim 18, McDonough and Thomson together teach all of the limitations as shown with respect to Claim 1.
18. A runtime voice topic spotting classifier system comprising: 
a communications interface arranged to: 
i) receive an audio segment, 
ii) receive a fast keyword filter model from a voice topic spotting learning system, and 
iii) receive a topic identification model from the voice topic spotting learning system; and 
an automatic speech recognition engine arranged to identify one or more keywords included in a received audio segment and output the one or more keywords; 
a fast keyword filter, implementing the received fast keyword model, arranged to receive the one or more keywords and detect whether the received audio segment includes any keywords of the set of topic-indicative words and, if detected, output the received audio segment as a topic keyword-containing audio segment but, if not detected, not output the received audio segment; 
a decoder arranged to, if the topic keyword-containing audio segment is outputted by the keyword filter, receive the topic keyword-containing audio segment and generate a topic keyword-containing lattice associated with the topic keyword-containing audio segment; and 
a voice topic classifier, implementing the received voice topic identification model, arranged to receive the topic keyword-containing lattice and execute a machine learning technique to determine a topic associated with the topic keyword-containing audio segment.
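
For illustration only, the gating arrangement recited in Claim 18 (recognize keywords, filter, and only then decode a lattice and classify) can be sketched as below. This is a minimal sketch under stated assumptions: the function `spot_topic` and its callable parameters `recognize`, `decode_lattice`, and `classify` are hypothetical stand-ins for the recited engine, decoder, and classifier, and are not taken from the Application or the cited references.

```python
def spot_topic(audio_segment, recognize, topic_words, decode_lattice, classify):
    """Run the filter-then-classify pipeline on one audio segment.

    recognize:      callable returning an iterable of keywords (stand-in for the ASR engine)
    topic_words:    set of topic-indicative words (stand-in for the fast keyword filter model)
    decode_lattice: callable producing a lattice (stand-in for the decoder)
    classify:       callable mapping a lattice to a topic (stand-in for the topic classifier)
    """
    keywords = recognize(audio_segment)
    # Fast keyword filter: pass the segment on only if it contains a topic-indicative word.
    if not set(keywords) & set(topic_words):
        return None  # segment is not output; no lattice is generated for it
    lattice = decode_lattice(audio_segment)
    return classify(lattice)
```

The point of the arrangement is that the comparatively expensive lattice decoding and classification run only on segments that pass the inexpensive filter.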

Claim 19 has a limitation similar to the limitation of Claim 13 which is rejected under the same rationale.

Regarding Claim 20, McDonough teaches and therefore suggests: 
20. The system of claim 19, wherein the audio segment includes an audio data file. [McDonough teaches that its method is used for improved access to “audio archives” and “audio archives” store “audio data files.”  Teaching of “audio archive” suggests that the input audio is “an audio data file.”  “The present invention can be used for sorting speech data in any one of a number of applications. For example, in addition to classifying recordings of air-traffic control dialogs, other examples include sorting of speech data, such as radio news recordings, by the topic of the news story. This type of sorting can be used for automatic detection of speech data of interest or can be used as part of an automatic indexing mechanism for improved access to audio archives. Still other examples include automatic response to or routing of phone calls based on the topic or subject matter in a spoken message from the caller. Still other applications similar to the automatic indexing and data detection applications described above include classification of stored or incoming voice messages in a voice mail system.”  Col. 12, lines 28-41.]

Claims 2-5 and 11-12 are rejected under 35 U.S.C. 103 as being unpatentable over McDonough and Thomson in view of Lisuk (U.S. 20160078022).
Regarding Claim 2, McDonough and Thomson do not mention the Naïve Bayes model. (Note the instant Application: “[0053] The fast keyword filter model 302 may include a direct classification technique such as a Naïve Bayes classifier with bag of phone N-grams….”)
Lisuk teaches:
2. The system of claim 1, wherein the fast keyword filter includes a direct classification technique. [Lisuk is directed to a classification system that predicts labels for documents.  Lisuk teaches that it uses different classification methods including Naïve Bayes on a document that is preprocessed into a bag of words.  Naïve Bayes is a direct classification method.  “[0067] In… For example, the classifier may utilize learning models including, but not limited to, support vector machine (SVM), naïve Bayes, neural network, Latent Dirichlet Allocation (LDA), Seam (Search and Learn), matrix factorization, ordinary least squares regression, weighted all pairs, contextual-bandit, and so forth. …”  “[0068] …  For example, a common representation of natural language documents used for this purpose is referred to as the “bag-of-words” model, where the order of the words is ignored and the frequency of each word is used as a feature input into the classifier….”]
McDonough/Thomson and Lisuk pertain to classification and topic detection, and it would have been obvious to use the classification techniques of Lisuk in the system of the combination as an equivalent method.  This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.


Regarding Claim 3, McDonough does not discuss the details of the models.
Thomson teaches:
3. The system of claim 2, wherein the fast keyword filter includes a Naive Bayes classifier with bag of phone N-grams. [Thomson teaches that instead of words, phonemes may be generated by the ASR which means that the n-gram used by the language model would be a phone n-gram.  “[460] … The language model of the decoder system 2330 may use statistics derived from n-grams to determine word probabilities. …”]
Thomson does not mention the Naïve Bayes classifier, which is one of the statistical methods used to provide the statistical framework of language models.
Lisuk teaches:
wherein the fast keyword filter includes a Naive Bayes classifier with bag of phone N-grams. [Lisuk is directed to a classification system that predicts labels for documents.  “[0067] In an embodiment, classification computer 301 represents a computing device that implements a machine learner (“classifier”). Depending on the embodiment, the classifier may be implemented via hardware, software, or a combination of hardware and software. For the purpose of illustrating clear examples, it is assumed that the classifier adheres to a supervised or semi-supervised learning model. For example, the classifier may utilize learning models including, but not limited to, support vector machine (SVM), naïve Bayes, neural network, Latent Dirichlet Allocation (LDA), Seam (Search and Learn), matrix factorization, ordinary least squares regression, weighted all pairs, contextual-bandit, and so forth. Thus, the classification computer 301 retrieves as input a labeled set of training documents from the labeled document database 303 and trains the classifier using the training documents.”  “[0068] …  For example, a common representation of natural language documents used for this purpose is referred to as the “bag-of-words” model, where the order of the words is ignored and the frequency of each word is used as a feature input into the classifier….”]
Rationale of combination as provided for Claim 2.

Regarding Claim 4, McDonough does not discuss the details of the models.
Thomson teaches and therefore suggests:
4. The system of claim 2 (should be 3), wherein the bag of phone N-grams includes phone bigrams. [Thomson teaches the use of n-grams and bi-gram is a type of n-gram.  Thus, the use of bi-grams is suggested.]
(Lisuk: “[0096] …  Thus, in other embodiments, the classification computer 301 identifies features related to “n-grams”, which are contiguous sequences of n items (e.g. syllables, words, phrases, sentences, and so forth) from a given sequence of text or speech. As a result, by utilizing “n-grams” at least some information regarding the word-order is presented as a feature to the classifier….”)
Rationale of combination as provided for Claim 1.
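
For illustration only, a bag of phone bigrams of the kind recited in Claim 4 can be sketched as below. This is a minimal sketch assuming the recognizer output is already available as a list of phone symbols; the function name `phone_bigram_bag` is hypothetical and does not appear in the Application or the cited references.

```python
from collections import Counter


def phone_bigram_bag(phones):
    """Count adjacent phone pairs; the order of the pairs themselves is discarded."""
    return Counter(zip(phones, phones[1:]))
```

For the phone sequence `["k", "ae", "t"]` the bag holds one count each for `("k", "ae")` and `("ae", "t")`.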

Regarding Claim 5, McDonough does not teach the use of neural networks. Thomson, Figure 24, teaches that instead of the “align system 2420,” a neural network including a CNN may be used:  “[0476] … For example, one or more of the layers of a neural network that is part of the environment 2400 may include a convolutional neural network (CNN) layer, which may include a pooling layer. Nodes in the CNN layer may include multiple inputs from a previous layer in each direction. For example, a node in a CNN layer may include 10, 20, 50, 100, 500, 1000, 2000, 3000, or more inputs in each direction. …”  See also [0475].
Lisuk teaches:
5. The system of claim 1, wherein the fast keyword filter includes a convolutional neural network. [Lisuk teaches the training and use of a neural network classifier for its topic spotter: “[0067] …  For example, the classifier may utilize learning models including, but not limited to, support vector machine (SVM), naïve Bayes, neural network, Latent Dirichlet Allocation (LDA), Seam (Search and Learn), matrix factorization, ordinary least squares regression, weighted all pairs, contextual-bandit, and so forth. Thus, the classification computer 301 retrieves as input a labeled set of training documents from the labeled document database 303 and trains the classifier using the training documents.”   “[0097] … In an embodiment, continuing from the previous example, the positive and negative examples are fed into the classifier which determines a general rule (e.g. via neural network, SVM, matrix factorization, ordinary least squares regression, or any other learning technique) that maps features to a corresponding label (“STUDY TARGET” or “NOT STUDY TARGET”).”]
Rationale of combination as provided for Claim 2.  Lisuk uses a neural network for filtering the keywords/topics as present or absent, and Thomson uses a CNN as part of its speech recognition; it would have been obvious to use a CNN in Lisuk as a type of NN.

Regarding Claim 11, McDonough and Thomson do not mention the Naïve Bayes model.
Lisuk teaches:
11. The system of claim 1, wherein the topic identification model includes a Naive Bayes bag of words. [Lisuk is directed to a classification system that predicts labels for documents and teaches the use of Naïve Bayes as a method of classification used on a bag of words representation of the document.  “[0067]… For the purpose of illustrating clear examples, it is assumed that the classifier adheres to a supervised or semi-supervised learning model. For example, the classifier may utilize learning models including, but not limited to, support vector machine (SVM), naïve Bayes, neural network ….”  “[0068] …  For example, a common representation of natural language documents used for this purpose is referred to as the “bag-of-words” model, where the order of the words is ignored and the frequency of each word is used as a feature input into the classifier….”]
Rationale of combination as provided for Claim 2.

Regarding Claim 12, McDonough and Thomson do not mention the Naïve Bayes model.
Lisuk teaches:
12. The system of claim 11, wherein the Naive Bayes bag of words are weighted by a lattice probabilities associated with the training set of the topic keyword-containing lattices. [Lisuk is directed to a classifier using the Naïve Bayes method of classification on a bag of words representation of a natural language document and teaches that the confidence scores of the training samples (i.e. probabilities associated with training set of the topic keyword-containing lattices) are weighted:  “[0102] In some embodiments, scores associated with different portions of the document are assigned different weights. For example, scores of training examples derived from the abstract may be multiplied by a different factor than training examples derived from other sections of the document. Thus, portions of the document assigned to higher weights are emphasized more during aggregation and more greatly impact the result of the classification”  “[0100] At block 505, the classification computer 301 classifies the unlabeled document using the trained classifier. In an embodiment, building upon the previous example, the classification computer 301 uses the binary classifier to determine a label (“STUDY TARGET” or “NOT STUDY TARGET”) for each protein for each linguistic structure. In an embodiment, for each protein, a score is assigned to each linguistic structure based on the determination that is then accumulated across the linguistic structures of the document. This accumulated score represents the classifier's confidence that a particular protein is the target protein discussed in the document. Thus, the protein with the highest accumulated score represents the predicted label for the document. In many cases, classifiers are designed to provide not just a classifier label, but also a score or probability associated with that classifier label. Thus, for each protein, the score assigned to each linguistic structure may be provided by the classifier itself, which is then accumulated across the entire document. However, in other embodiments specific scores or weights may be assigned to each outcome, such as (“STUDY TARGET=+1, NOT STUDY TARGET=−1). Thus, in an embodiment, at the end of block 505, the classification computer 301 has access to a score for each protein discussed in the document that indicates the likelihood that the protein is the target protein.”]
Rationale of combination as provided for Claim 2.  The classification method is from Lisuk which teaches weighting the results of topic detection and the entities subjected to classification are from the combination which takes the lattice hypotheses as input and performs ASR in the process of topic spotting of McDonough.
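
For illustration only, one way a Naive Bayes bag of words could be weighted by lattice probabilities is to accumulate fractional counts, as sketched below. This is a minimal sketch and is not asserted to be the technique of the Application, Lisuk, or any other cited reference: it assumes each lattice has been reduced to (word, posterior probability) pairs, uses add-alpha smoothing as an assumed design choice, and the names `train_weighted_nb` and `classify_topic` are hypothetical.

```python
import math
from collections import defaultdict


def train_weighted_nb(examples, alpha=1.0):
    """Train from a list of (topic, [(word, posterior_probability), ...]) pairs."""
    word_mass = defaultdict(lambda: defaultdict(float))
    topic_count = defaultdict(int)
    vocab = set()
    for topic, lattice_words in examples:
        topic_count[topic] += 1
        for word, prob in lattice_words:
            word_mass[topic][word] += prob  # fractional count instead of a hard count of 1
            vocab.add(word)
    return {"word_mass": word_mass, "topic_count": topic_count,
            "vocab": vocab, "alpha": alpha}


def classify_topic(model, lattice_words):
    """Return the topic with the highest probability-weighted log-likelihood."""
    total = sum(model["topic_count"].values())
    alpha = model["alpha"]
    best_topic, best_score = None, -math.inf
    for topic, count in model["topic_count"].items():
        mass = model["word_mass"][topic]
        denom = sum(mass.values()) + alpha * len(model["vocab"])
        score = math.log(count / total)  # log prior of the topic
        for word, prob in lattice_words:
            if word in model["vocab"]:  # words unseen in training are ignored
                score += prob * math.log((mass.get(word, 0.0) + alpha) / denom)
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic
```

With hard counts, every hypothesized word would contribute equally; with fractional counts, a word the recognizer is unsure of contributes proportionally less to both training and scoring.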
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Santos (U.S. 10,838,954) teaches:
iv) apply a chi-squared test to select a set of topic-indicative words as a subset of the plurality of topic keyword hypotheses and select a training set of topic keyword-containing lattices associated with the set of topic-indicative words; [Santos teaches “The different ways a speech utterance 1204 may be interpreted (i.e., the different hypotheses) may each be assigned an ASR probability or an ASR confidence score representing the likelihood that a particular set of words matches those spoken in the utterance. … Thus, each potential textual interpretation of the spoken utterance (hypothesis) is associated with an ASR confidence score. Based on the considered factors and the assigned ASR confidence score, the ASR component 1208 outputs the most expected text recognized in the audio data. The ASR component 1208 may also output multiple ASR hypotheses in the form of a lattice or an N-best list with each hypothesis corresponding to an ASR confidence score or other score (such as probability scores, etc.).” Col. 31, lines 23-43.]

Karlinsky (U.S. 20200175332) Figure 6 teaches the two steps of “training a classifier to recognize a class based on a training data set … 602” and using the trained classifier to conduct the classification as “receive a classification related to the feature vector as an output from the classifier 606.”  In Figure 4, “training data 402” could be audio, text, or other forms of content.  “[0015] … Server 120 is connected to a data store 130, which generally represents a data storage entity (e.g., database, repository, or the like) that stores content such as images, audio, video, text, and other content used in training synthesis models and classifiers according to embodiments described herein. …”  Classification includes topic identification: “[0026] Client 140 may use the classification in a variety of different ways. For example, the classification may be used to sort images, identify people, objects, or places present in images or videos, identify a song or a voice in an audio recording, identify a topic or other contents present in text, and for a variety of other purposes.”  


Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499. The examiner can normally be reached on 9 to 5, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir can be reached on 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 1800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Fariba Sirjani/
Primary Examiner, Art Unit 2659