Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-20 are pending. Claims 1, 13 and 20 are independent and have been amended.  Some of the dependent Claims have also been amended.
This Application was published as U.S. 20220254335.
            Apparent priority: 5 February 2021.
Figure 3 is representative of the invention:

    [Image: Figure 3 of the Application]

Applicant’s amendments and arguments are considered but are either unpersuasive or moot in view of the new grounds of rejection that, if presented, were necessitated by the amendments to the Claims.
This action is Final.
Allowable Subject Matter
Claims 4-6 and 16-18 are objected to as depending from rejected base claims, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is an examiner’s statement of reasons for allowance: In view of each of the particular limitations of the independent Claims when considered in the order established by the Claim language and in the context of the language of the independent Claims when each Claim is considered as a whole, the independent Claims of this Application were not found in the prior art that was viewed.
In particular, note the Response to Arguments below.  The second metric used to obtain the hyper interpolation weights cannot be based on the Expectation Maximization (EM) method, but Claim 1 does not recite this limitation.  Only Claims 4-6 (and 16-18) recite types of metrics that cannot be based on EM.  “[0043] … Of course, in other embodiments, other metrics can be used. In an embodiment, the first metric is limited to entropy based one (e.g., PPL). EM should be applicable in this estimation block. Otherwise, metrics such as e.g., % WER, do not work with EM.”
In Levit, both the first and second interpolation coefficients are based on the EM algorithm.
Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee. Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”

4. The computer-implemented method of claim 3, wherein the second metric specific to the application is a character error rate. 
5. The computer-implemented method of claim 3, wherein the second metric specific to the application is a word error rate. 
6. The computer-implemented method of claim 3, wherein the second metric specific to the application is a keyword error rate.

Claim 16 is a computer program product system claim with limitations corresponding to the limitations of method Claim 4.
Claim 17 is a computer program product system claim with limitations corresponding to the limitations of method Claim 5.
Claim 18 is a computer program product system claim with limitations corresponding to the limitations of method Claim 6.

Response to Arguments
Arguments are moot in view of the new grounds of rejection or modified mapping.
Claim 1 as amended provides:
1. A computer-implemented method for generating a language model for an application, the method comprising: 
estimating interpolation weights of each of a plurality of language models according to an Expectation Maximization (EM) algorithm based on an entropy-based first metric; 
separating language model components capable of being overestimated by the entropy-based first metric by 
classifying the plurality of language models into two or more sets based on characteristics of the two or more sets; 
estimating a hyper interpolation weight for the two or more sets based on a second metric specific to the application; and 
interpolating the plurality of language models using the interpolation weights and the hyper interpolation weight to generate a final language model.
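
For illustration only, the final interpolating step can be sketched under one possible reading of the Claim. The Claim does not specify how the interpolation weights and the hyper interpolation weight are combined, so the two-level mixture below is an assumption. The toy unigram dictionaries and the names `interpolate`, `sets`, and `hyper` are likewise hypothetical and are not taken from the Application or from Levit.

```python
from typing import Dict, List


def interpolate(
    models: List[Dict[str, float]],
    weights: List[float],
    sets: List[List[int]],
    hyper: List[float],
) -> Dict[str, float]:
    """Two-level linear interpolation (assumed form, not recited in the Claim).

    models  : toy unigram LMs, each mapping word -> probability
    weights : one interpolation weight per model (e.g., from an EM step)
    sets    : groups of model indices produced by the classifying step
    hyper   : one hyper interpolation weight per set, summing to 1
    """
    vocab = set().union(*models)
    final: Dict[str, float] = {}
    for word in vocab:
        prob = 0.0
        for indices, h in zip(sets, hyper):
            # Renormalize the weights so that they sum to 1 inside the set.
            z = sum(weights[i] for i in indices)
            in_set = sum(weights[i] / z * models[i].get(word, 0.0) for i in indices)
            prob += h * in_set
        final[word] = prob
    return final
```

Under this assumed form, the interpolation weights act within each set and the hyper interpolation weights act between sets, which matches the observation in the Reply below that the hyper interpolation weights appear to be used for interpolation between sets of LMs.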

Regarding the amended language: “separating language model components capable of being overestimated by the entropy-based first metric by classifying the plurality of language models into two or more sets based on characteristics of the two or more sets,” the Specification further provides:
[0047] At block 430A, renormalize the interpolation weights for each of the first 321 and the second set 322 using respective different objective functions for each of the two or more sets. This is described in further detail hereinbelow. However, it is to be appreciated that a goal here is to separate those corpuses and to process them separately, then apply the final metric. 
[0048] That is, the objective here is to separate LM components which could be overestimated by entropy based metric (i.e. PPL). A sample of general rules for the classification is as follows. The size of transcribed corpus usually is not so large because transcription cost is high. However, it is best matched. On the other hand, generated text by a RNN and pseudo truth text (ASR out corpus) are huge and easy to be overestimated even if it is not so well matched. They should be therefore processed separately. Medium size ones (not transcripts) and huge scraped ones (from Web sites) can be classified to another separate set (in total 3 groups). It is still possible to obtain the weights (by the final metric).

Applicant argues:

    [Image: excerpt from Applicant's Response]

Response p. 8.
In Reply: 
“Interpolation weights” are based on the perplexity metric of each LM.
“Hyper interpolation weights” are based on a second metric specific to the application for which the LM is being generated.
The Claim uses both the “interpolation weights” (IW) and the “hyper interpolation weights” (HIW) to generate the “final language model.”
The IW and HIW are defined differently.
The Claim does not expressly specify whether and how they are used differently for the interpolation of the LMs, although, based on the definition of the “second metric,” the HIW appear to be used for interpolation between sets of LMs.
The Specification provides:
[0043] At block 410, estimate an interpolation weight wi for each of the corpuses C1 through CN to obtain estimated interpolation weights w1 through wn using an Expectation Maximization (EM) algorithm based on a first metric. In an embodiment, the first metric can be perplexity. Of course, in other embodiments, other metrics can be used. In an embodiment, the first metric is limited to entropy based one (e.g., PPL). EM should be applicable in this estimation block. Otherwise, metrics such as e.g., % WER, do not work with EM.
…
[0045] At block 430, estimate a hyper interpolation weight for the first set 321 and the second set 322 based on a second metric specific to the application. In an embodiment, the second metric can be selected as any of a character error rate, a word error rate, a keyword error rate, and an empirical metric based on human perception. Examples include, but are not limited to, bilingual evaluation understudy (BLEU), slot error rate, and so forth. BLUE is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is"--this is the central idea behind BLEU.
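
The contrast between the two estimation blocks quoted above can be illustrated with a minimal sketch. It assumes that the EM step is the standard mixture-weight update that maximizes held-out likelihood (equivalently, minimizes perplexity), and that a weight tuned on an error-rate metric is found by a simple search because such a metric does not decompose into per-token probabilities. The function names, the grid search, and the `error_rate` callable are hypothetical and are not taken from the Specification or from Levit.

```python
from typing import Callable, List


def em_weights(probs: List[List[float]], iters: int = 50) -> List[float]:
    """EM estimate of linear interpolation weights.

    probs[t][i] is the probability that model i assigns to held-out token t
    (assumed strictly positive for at least one model on every token).
    """
    n = len(probs[0])
    lam = [1.0 / n] * n
    for _ in range(iters):
        resp = [0.0] * n
        for row in probs:
            mix = sum(l * p for l, p in zip(lam, row))
            for i in range(n):
                # E-step: responsibility of model i for this token.
                resp[i] += lam[i] * row[i] / mix
        # M-step: new weight is the average responsibility.
        lam = [r / len(probs) for r in resp]
    return lam


def search_hyper_weight(error_rate: Callable[[float], float], steps: int = 10) -> float:
    """Pick the hyper weight in [0, 1] with the lowest application metric.

    error_rate is a hypothetical callable, e.g., one that decodes a
    development set with the given hyper weight and returns the WER.
    """
    candidates = [k / steps for k in range(steps + 1)]
    return min(candidates, key=error_rate)
```

The first function needs only per-token probabilities, which is why an entropy-based metric fits the EM update. The second treats the metric as a black box, which is one way a metric such as WER could be accommodated.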

The Specification states: “metrics such as e.g., % WER, do not work with EM.”  Then, the Specification states that the second metric can be WER.  Accordingly, in the Specification, the IW and the HIW cannot be the same because the first is based on EM and the other is not.  Applicant has similarly argued that:
    [Image: excerpt from Applicant's Response]

However, the Claim, while basing the IW and HIW on two different things, does not say that they are mutually exclusive.  Figure 4 of Levit at 430 determines “first values for interpolation coefficients,” and after two more steps in between, determines “second values for interpolation coefficients.”  These are two different values, like the IW and HIW of the Claim, and Levit is NOT simply performing another iteration using the same metric.

The Rejection included and still includes with respect to the teachings of Levit that: “Steps 362 to 375 that perform this second round of optimization also used the EM approach ([0086]).”  This is where Levit is different from the Specification. This difference manifests in Claims 4-6 and 16-18, but not in Claim 1.

As to the relationship between “perplexity” and “entropy-based” Applicant argues:

    [Image: excerpt from Applicant's Response]

Response, p. 8.

In Reply, Claim 3 expressly states that the entropy-based metric is perplexity:

    [Image: text of Claim 3]

Therefore, a reference that teaches that perplexity is used as the metric also teaches that the metric is entropy based.
Further, Applicant admits that “perplexity” is “an example of a class” of “entropy-based” metrics.  It follows that “entropy-based metric” in a Claim is taught by a “perplexity-based metric” in a reference.  The teaching of a reference needs to be the same or narrower and here it is.  Explained another way:  say “entropy-based metrics” include 3 types: A, B, and C.  A reference that teaches any of A, B, or C teaches an “entropy-based metric.”  At the same time, if an “entropy-based metric” is itself a subset of “M-class metrics,” then a reference that teaches “M-class metrics” cannot teach an “entropy-based metric” because “M-Class metrics” include other types of metrics.  Narrow teaching teaches a broad claim; broad teaching does not teach a narrower claim.  

Applicant’s second argument follows from the first and is addressed by the above reply.  See Response, p. 9.

Note the Specification of the instant Application that itself describes “Perplexity” as a type of “Entropy Based Metric”:  “[0048] That is, the objective here is to separate LM components which could be overestimated by entropy based metric (i.e. PPL)….”
Note that the Specification of the instant Application admits that PPL is an entropy-based metric that has flaws, and addresses those flaws not by replacing this metric but by using a multi-step interpolation scheme:
[0003] PPL (perplexity) has been believed to be good for optimizing weights on LMs for a long time. This is partly because the calculation is quite fast by the Expectation Maximization (EM) algorithm. In addition, PPL is highly correlated with recognition accuracy (i.e., % WER). However, limitations of this have been pointed out when the LM is trained by out-of-domain text. The model for generating text is usually in-domain based. However, the reliability is unknown, where PPL would not be good for evaluating LMs.
…
[0017] In one or more embodiments of the present invention, it is proposed to combine Perplexity (PPL) based estimation and accuracy based on in multi-step interpolation, which allows us to calculate better weights at a reasonable computation cost. 
[0018] Perplexity is a way of evaluating language model. A language model is a probability distribution over entire sentences or texts. Perplexity is a measurement of how well a probability distribution or probability model predicts a sample. Perplexity can be used to compare probability models. A low perplexity indicates the probability distribution is good at predicting the sample. A high perplexity indicates the probability distribution is not good at predicting the sample.
Published Application.
In view of the above, it is apparent that the Inventors and their Representative are aware that Perplexity teaches an Entropy-Based metric and that the Inventors in fact had the metric of Perplexity (PPL) in mind when they refer to Entropy-Based Metrics in their Claims.  Accordingly, further evidence (i.e. Rastrow) is not necessary.

See also the references in the Conclusion section including:  Ash (U.S. 20180315420): “[0179] Perplexity is a standard measure in the field of speech recognition and entropy is the logarithm of perplexity (and is normally used as it is more convenient in many cases).” 
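
The relationship stated in the Ash quotation can be checked numerically with a short sketch. It assumes the usual definitions, with the per-token cross-entropy taken in natural-log units and perplexity as its exponential. The function names are illustrative and do not come from any cited reference.

```python
import math
from typing import List


def cross_entropy(token_probs: List[float]) -> float:
    """Average negative log-probability (in nats) of the observed tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs: List[float]) -> float:
    """Perplexity is the exponential of the cross-entropy."""
    return math.exp(cross_entropy(token_probs))
```

Because the logarithm is monotonic, a set of weights that minimizes one of the two quantities on a given sample also minimizes the other.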
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1, 3, 9-10, 13, 15, and 20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Levit (U.S. 20160267905).
Regarding Claim 1, Levit teaches:
1. A computer-implemented method for generating a language model for an application, [Levit, “Optimized language models are provided for in-domain applications through an iterative, joint-modeling approach that interpolates a language model (LM) from a number of component LMs according to interpolation weights optimized for a target domain…..”  Abstract.  Figure 6 showing the computer 600.]
the method comprising: 
estimating interpolation weights of each of a plurality of language models according to an Expectation Maximization (EM) algorithm based on an entropy-based first metric; [Levit, Figure 3, “Receive a Plurality of Component LMs … 310,”  “Determine Initial Context-Independent Weights 320” and then “Optimize Context-Independent Weights 330” which is performed by the optimization process of steps 332-345 which use the EM algorithm by optimizing “perplexity of a training set.”  The “first metric” of the Claim may be taught by the “perplexity” factor.  Perplexity is “an entropy-based first metric.”  The “interpolation weights” are estimated/optimized according to an EM algorithm with the goal to “minimize perplexity of the resultant interpolated LM on an in-domain sample corpus” / “first metric”: “[0017] …  A goal of the interpolation and optimization process is to determine optimized interpolation weights λm so as to minimize perplexity of the resultant interpolated LM on an in-domain sample corpus. One approach to this task uses Expectation Maximization (EM) to estimate interpolation weights as n-gram responsibilities of each model averaged over the entire training corpus …”

    [Images: two excerpts reproduced from Levit]

….”  “[0073] Steps 332 through 345 provide embodiments of an iterative optimization process for determining optimized values for the context-independent interpolation weights corresponding to step 330. Some embodiments of this iterative optimization process are further described in connection to FIGS. 1 and 2A. As described previously, a goal of the interpolation and optimization process is to determine optimized interpolation weights λm so as to minimize perplexity of the resultant interpolated LM on an in-domain sample corpus. Accordingly, some embodiments of steps 332 through 345 may be understood as applying an EM approach to the optimization, wherein linear interpolation optimizes perplexity of a training set with respect to the collection of component LMs linearly combined in the probability space.”]
separating language model components capable of being overestimated by the entropy-based first metric by classifying the plurality of language models into two or more sets based on characteristics of the two or more sets; [Levit, the component Language Models (LMs) are “class-based” which means that they are classified into “domains”:  “[0058] Suppose a second component LM (referred to as "LM2") understands US state names and has the same vocabulary as LM1 and does not include the word "carolina" as a name, since LM2 is a class-based LM for understanding states. …”   “[0004] … A number of component LMs may be interpolated and optimized for a target domain by reducing the perplexity of in-domain training material.  The component LMs may include class-based LMs, and the interpolation may be context-specific or context-independent. …These updated interpolation weights may be used to produce new (or updated) weighting coefficients for LM components, such that a combination or interpolation of component LMs is further optimized for the target domain. … These interpolated LMs therefore may provide adaptability, personalization, and dynamically defined classes, as well as offer significant improvements in speech recognition accuracy and understanding, machine translation, and other tasks where interpolated LMs are used.”   Note that this claimed classification step is not based on the previous step of estimating the interpolation weights and appears disconnected from it.  (Consistent with the Supporting Specification.)] [The added language at the beginning of the limitation, restates the “classifying” step because “classifying” also “separates.”  The added language claims the result of the operation of classifying.  See also the supporting Specification that is discussed in the Response to Arguments.]
estimating a hyper interpolation weight for the two or more sets based on a second metric specific to the application; and [Levit, Figure 3, “Determine initial Context-Dependent Weights 360” from the first optimization at 330 and then proceed to the second round of optimization which is now “context-dependent” at “Optimize context-dependent Weights 360.”  Steps 362 to 375 that perform this second round of optimization also used the EM approach ([0086]).  The second-time optimized as “context dependent weights” teach the “hyper interpolation weight” of the Claim.  The “second metric” may be taught by the “context” which is a history “h” of the word and depends on the “context” which in turn is specific to each application/domain (In the medical domain treatment and patient occur close together whereas in the chemical domain treatment occurs with material or compound but not patient or person).  The n-grams/history/context appearing in one domain/application may be different from n-grams/history/context in another domain (see [0091]).  “[0017] …In the case of the context-specific scenario, one vector of interpolation weights … may be optimized for each history h, with the overall probability of a word given word history …”  “[0085] At step 360, the context-dependent weights are optimized. Some embodiments of step 360 comprise an iterative optimization process for determining optimized interpolation weights (as provided in steps 362 through 375) similar to the iterative process of step 330, except that the weights are context specific.  ….”  “[0049] …The interpolation weights λ may be context independent λm or context specific λm(h) (wherein weights are optimized for a history h), as described previously.”  See also [0085] for step 360 of Figure 3 where the weights are optimized the second time / “hyper interpolation weights.” ]
interpolating the plurality of language models using the interpolation weights and the hyper interpolation weight to generate a final language model. [Levit, “[0018] The resulting, optimized LMs, including for example merged or interpolated LMs created by embodiments of the invention and component LMs optimized by embodiments of the invention ….”  Figure 3, 380: “[0097] At step 380, the optimized context-dependent weights and component LMs are provided. … In some embodiments, the component LMs may be interpolated (according to the optimized context-dependent weights) on the fly, as needed in a real-time application scenario. Alternatively, in some embodiments the optimized context-dependent weights and component LMs provided in step 380 may be combined into a single, unified LM (such as by merging the component LMs according their corresponding optimized interpolation weights). This unified LM, which is context dependent, may be formed from component LMs that include class-based LMs (or LMs compatible with the WPE framework, described previously).”] 

Also:  Figure 4, 460 and Figure 5, 550 teach the same merging or combining of the component LMs that may be “class-based” into one LM.  “[016] …Alternatively, in some embodiments the component LMs and interpolation coefficients determined in step 460 may be combined into a single, unified LM (such as by merging the component LMs according their corresponding interpolation coefficients).”  “[0120] A… Alternatively, in some embodiments the optimized interpolation weights and component LMs provided in step 550 may be combined into a single, unified LM (such as merging the component LMs according their corresponding optimized interpolation weights). This unified LM, which is context-independent or context-specific, may be formed from component LMs that include class-based LMs (or LMs compatible with the WPE framework, described previously).”

Regarding Claim 3, Levit teaches:
3. The computer-implemented method of claim 1, wherein the entropy-based first metric is perplexity. [Levit teaches that the weights are selected to minimize “perplexity” as the “first metric” of the Claim: “[0017] …  A goal of the interpolation and optimization process is to determine optimized interpolation weights λm so as to minimize perplexity of the resultant interpolated LM on an in-domain sample corpus….”]
(See also AAPA:  “[0002] Model interpolation is widely used for improving language model (LM) performance, where optimal weights on LM components are estimated by the condition that the perplexity (PPL) for the development set should be minimized….”)

Regarding Claim 9, Levit teaches:
9. The computer-implemented method of claim 1, further comprising transforming speech into text using the final language model in a speech recognition session. [Levit teaches the use of the interpolated LM for speech recognition in an ASR.  “[0018] The resulting, optimized LMs, including for example merged or interpolated LMs created by embodiments of the invention and component LMs optimized by embodiments of the invention, offer significant improvements in speech recognition and understanding, machine translation, and other tasks where LMs are used….”]

Regarding Claim 10, Levit teaches:
10. The computer-implemented method of claim 1, wherein the plurality of language models comprise n-gram language models. [Levit teaches that the component LMs and the interpolated resulting LM comprise n-grams:  “[0004] … The component LMs may include class-based LMs, and the interpolation may be context-specific or context-independent. In particular, by way of iterative processes, component LMs may be interpolated and used to express training material in terms of n-grams (basic units of language modeling) in a number of alternative ways….”  “[0005] … Using these posterior probabilities, updated interpolation coefficients are determined that reflect contribution by the component LMs for each token n-gram relative to the sum of contributions of all component LMs towards the probability of that particular n-gram….”  See also [0021] and [0023].]

Claim 13 is a computer program product system claim with limitations corresponding to the limitations of method Claim 1 and is rejected under similar rationale.  Additionally:
13. A computer program product for generating a language model for an application, the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform a method comprising:  [Levit, Figure 6 showing the computer 600 and “Memory 612.”]
…

Claim 15 is a computer program product system claim with limitations corresponding to the limitations of method Claim 3 and is rejected under similar rationale.

Claim 20 is a system claim with limitations corresponding to the limitations of Claim 1 and is rejected under similar rationale.  Additionally:
20. A computer processing system for generating a language model for an application, the system comprising: 
a memory device for storing program code; and  [Levit, Figure 6 showing the computer 600 and “Memory 612.”]
a processor device operatively coupled to the memory device for running the program code to: [Levit, Figure 6 showing the computer 600 and “Processors 614.”]
…
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 2 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Levit in view of Rastrow (U.S. 10,032,463). 
Regarding Claim 2, Levit teaches and therefore suggests:
2. The computer-implemented method of claim 1, 
wherein the estimating a hyper interpolation weight includes renormalizing the interpolation weights for each of the two or more sets using respective different objective functions for each of the two or more sets. [Levit, Figure 3 shows the two stages of optimizing the weights (context independent at 330 and then context dependent at 360) and both stages use a similar EM approach that is detailed in the telescoped flowcharts.  Each step, according to the written description of Levit, uses a corpus of sentences for each component LM (for each set or class) with normalized weights L’(w) for that class.  Accordingly, it is at the least suggested that the weights are “renormalized” for each set.  Figure 3, at 334:  “Determine Alternative Parses,”  “[0077] The alternative parsed representations of the corpus may be determined using the statistical data reflected in the component LMs, as described herein, with each parse comprising a sequence of one or more tokens that together represent the corpus. In particular, consider a corpus W of sentences w with normalized weights L'(w) …”  Then at step 364: “Determine Alternative Parses,” for the second round of optimizing weights:  “[0090] … In particular, consider again a corpus W of sentences w with normalized weights L'(w) ….” ]
Levit does not mention the use of “objective functions for … sets.”
Rastrow teaches:
wherein the estimating a hyper interpolation weight includes renormalizing the interpolation weights for each of the two or more sets using respective different objective functions for each of the two or more sets. [Rastrow, Figure 4, 410, “compute objective function” and “412: update neural networks ….”  The update of the neural networks is done by use of the “objective function” which is a function that represents the classification of the Language Models (in Rastrow classifications is based on users, user-specific, so each user would be a domain in the context of Claims of this Application).  Steps 408-410 (generation of the LM output to update of the NN) are repeated and the “parameters” are updated.  The repetition and update suggest the “renormalizing” of the Claim.   “At block 410, the computing device 500 or some other component of the spoken language processing system 100 can compute an objective function value for language model output generated above at block 410. Illustratively, the objective function may be a classification-based objective function, such as cross entropy. Generally described, computing the objective function value involves comparing, for a given training data input, the actual language model output with the expected or correct output to determine a difference or measure of error in the actual output. A NN is trained using such an objective function by modifying the parameters (e.g., weights) of the neural network to minimize the error.”  Col.11, lines 35-48.  “At block 412, the computing device 500 or some other component of the spoken language processing system 100 can perform back propagation to update parameters of both the language model decoder 118 and the interaction history encoder 116. The computing system 500 can compute updates to the parameters of the encoder and decoder NNs by computing the gradient of the objective function with respect to the individual parameters of the NNs. 
The gradient can be computed as the derivative of the objective function with respect to any weight in the NNs. This gradient can be used to modify the parameters (e.g., the weights) of the nodes of the NNs to reduce the error of the NNs (e.g., the difference between the actual output for a given input and the correct or preferred output for the input), as determined using the objective function.”  Col. 11, lines 48-62.  “In some embodiments, the parameter updaters for the entire set of training data, or for subsets of training data, may be aggregated before applying them to the model parameters. In these embodiments, blocks 408 and 410 may be repeated for each input vector of the training data or current subset before the aggregated updates are applied at block 412.”  Col. 12, lines 35-44.]  [Note that “objective function” occurs a single time in the Specification and it is not clear how it is used for or related to the normalizing or renormalizing of the weights that are shown in [0052] and [0053].]
Levit and Rastrow pertain to natural language processing and development of Language Models for uses including ASR.  It would have been obvious to combine the concept of objective function of an LM from Rastrow with the system of Levit to arrive at the concept of modifying a LM by changing its objective function that is claimed.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Claim 14 is a computer program product system claim with limitations corresponding to the limitations of method Claim 2 and is rejected under similar rationale.  


Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Levit in view of Dymetman (U.S. 20070033002) and Peng (U.S. 20220084510). 

Regarding Claim 7, Levit teaches:
7. The computer-implemented method of claim 1, wherein the classifying includes classifying the plurality of language models into a first set trained by written texts and manual transcriptions and a second set of generation-based models. [Levit, Figure 3, “Receive a Plurality of Component LMs and Training Material 310.”  The corpus includes text and LM from a “first set” of textual sources/ “written texts and manual transcriptions” and a “second set” of texts and LM that were artificially collected or created / “generation-based models.”  “[0068] Accordingly, at step 310, a plurality of component LMs and training material are received. In one embodiment, training material comprises a training corpus of textual information. …  In one embodiment, the training material may be received from one or more sources including user queries, SMS messages, web documents, electronic libraries, books, user input libraries, or artificially collected or created samples, for example. In one embodiment, the training material defines a domain and in one embodiment is related to a structured application domain; for example, a home entertainment system, personal device, or vehicle.”  See also Figure 4, 410, and [0098] and Figure 5, 510, and [0109].]
wherein the first set and the second set are divided based on corpus date, corpus style, and data generation model. [Levit teaches that types of corpora may be divided/classified according to date: “[0051] In some embodiments, each component LM 220 may reflect specific subdomains or certain types of corpora, such as certain classes (e.g. personal names, locations, dates/times, movies, games, etc.) words or dictionaries, phrases, or combinations of these, such as token-based component LMs….” ] (“Style” is a broad term; the instant Application defines it as:  “[0044] …These characteristics can include a data generation method of the sets, corpus date (recent, old), corpus style (colloquial, written, formal), and so forth.”)
Levit does not teach that the LMs or training corpora are divided according to style (colloquial, written, formal) or generation model.

Dymetman teaches:
wherein the first set and the second set are divided based on corpus date, corpus style, and data generation model. [Dymetman teaches: “[0026] … Other second language corpora are also contemplated. For example, if the author wants a text appropriate for modern colloquial usage, the second language corpus 28 can be built from a collection of recent newspaper stories. If the author is preparing a more formal paper, the language model might be built on the basis of a corpus of technical papers….”]
Levit and Dymetman pertain to natural language processing and development of corpora.  It would have been obvious to combine the concept of corpus style for dividing the corpora from Dymetman, as another category of classification, with the system of Levit, which provides a number of criteria that may actually be construed as style.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
Neither expressly mentions classifying the corpora according to the generation model that is used.
Peng teaches:
wherein the first set and the second set are divided based on corpus date, corpus style, and data generation model. [Peng, Figures 5A and 5B.  “[0061] …  The disclosed implementations provide a mechanism for training a natural language generation model that can be employed in such a dialog system, or that can produce training examples for training a natural language understanding model.”  “[0071] FIG. 5A illustrates an example processing flow 500 for synthetic corpus generation using paired-data only and/or rich-in-ontology data, and FIG. 5B illustrates an example processing flow 550 for synthetic corpus generation using rich-in-utterance data.”]
Levit/Dymetman and Peng pertain to natural language processing and development of corpora.  It would have been obvious to combine the concept of corpus generation model for dividing the corpora from Peng with the system of the combination, which provides a number of criteria and also mentions that the corpora may be generated.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.


Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Levit, Dymetman, and Peng in view of Kao (U.S. 2017/0300563). 
Regarding Claim 8, Levit does not specify the manner of generation of the training corpus.  Dymetman and Peng do not mention an RNN.
Kao teaches and suggests:
8. The computer-implemented method of claim 7, wherein the set of generation-based methods includes language models based on pseudo truths generated from a Recurrent Neural Network. [Kao teaches generating text snippets:  “[0001] … More specifically, the present disclosure relates to the generating of text snippets using a supervised machine learning algorithm.”  Figure 4, “Train Neural Network 434.”  “[0018] In an example embodiment, a neural network is constructed and trained by representing a job as several feature vectors, where each feature vector represents a single sentence or snippet from the job description. These snippets are then scored by the trained neural network, which outputs a ranking score where a higher score is better. A member profile and keywords may be used after the fact to boost and re-rank snippets. With this approach, the snippet generation pipeline can take advantage of additional sources of data (the member profile and keywords) to provide customized snippets tailored to the member's search query….”  [0037] for neural networks trained using a supervised learning algorithm.  See [0038] for use of a Latent Dirichlet Allocation (LDA) to learn topics based on the atomic text units: “[0039] The unsupervised machine learning algorithm first extracts atomic units of text from the documents that might be useful in learning relevant topics. Then a process called Latent Dirichlet Allocation (LDA) is used to learn latent topics based on the atomic text units, using the desired number of topics and desired granularity of the topic model as parameters to the LDA process. LDA is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. Specifically, if topics are correlated in a document, there is a tendency for the algorithm to deduce that they are similar to each other. The more correlations in documents, the more similar the topics. 
The LDA algorithm identifies the broad classes of topics, and with each topic it saves a list of terms along with weights for each term (the weights indicating how related the term is to the topic). For a given document, the distribution over the topics is obtained. The result is a topic model having a list of topics and, for each topic, the list of related terms with weights for each term.”  Kao does not specifically teach the use of an RNN.  However, RNNs are the more commonly used NNs in the area of speech and natural language and the use of an RNN is suggested by the teaching of the use of the NN.]
Levit/Dymetman/Peng and Kao pertain to natural language processing, as does the Claim, and it would have been obvious to combine or substitute the teachings of Kao with respect to snippet generation, as one manner of generating text, with the generated corpora of the combination, which does not specify the method of generation.  This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Levit. 
Regarding Claim 11, Levit teaches and therefore suggests:
11. The computer-implemented method of claim 10, wherein n is equal to 2 or more. [Levit as applied to Claim 10 teaches the use of n-grams and n-grams imply bigrams (n=2) or trigrams (n=3).  Levit also teaches the use of “context specific LMs” which means that n is at least 2 to provide context for each word.  See Abstract.]
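For illustration only, the relationship between n and word context noted above can be sketched as follows.  This sketch is not taken from Levit; the function name `ngrams` and the sample sentence are assumptions introduced here.  It shows that with n equal to 2 or 3 each item pairs a word with its neighboring word(s), whereas n equal to 1 carries no context.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous window of n tokens (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence shorter than n yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Assumed sample sentence, not drawn from any cited reference.
tokens = "play the next song".split()

unigrams = ngrams(tokens, 1)  # n = 1: single words, no context
bigrams = ngrams(tokens, 2)   # n = 2: each word with one neighbor
trigrams = ngrams(tokens, 3)  # n = 3: each word with two neighbors
```
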

Claims 12 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Levit in view of Brav (U.S. 2015/0312200). 
Regarding Claim 12, Levit classifies the sets by domain and not by date or style.
Brav teaches:
12. The computer-implemented method of claim 1, wherein the characteristics of the two or more sets comprise corpus date and corpus style. [Brav classifies a piece of text according to the corpus to which it may belong prior to submitting this piece of text to the proper destination.  Brav teaches corpus classification by style and date among other factors: “[0176] Referring again to FIG. 1L, in an embodiment, document outcome prediction assistance implementation 5900 may include a received source document structural analyzing module 5920, which, in an embodiment, may include one or more of a source document structure analyzing module 5922, a source document style analyzing module 5924, and a source document reading level analyzing module 5926. …”  “[0208] Referring now to FIG. 1AB, in an embodiment, similar works finding module 6500 may include a corpus comparison module 6530. Corpus comparison module 6530 may receive data set 4130 from the semantic corpus analyzer 4100 shown in FIG. 1K, or may obtain a corpus of texts, e.g., all the patents in a database, or all the articles from an article repository, e.g., the ACM document repository. Corpus comparison module 6530 may include the corpus obtaining module 6532 that obtains the corpus 5040, either from an internal source or an external source. Corpus comparison module 6530 also may include corpus filtering module 6534, which may filter out portions of the corpus (e.g., for a patent prior art search, it may filter by date, or may filter out certain references). Corpus comparison module 6530 also may include filtered corpus comparing module 6536, which may compare the filtered corpus to the source document.”  “[0211] Referring again to FIG. 1AB, received document to selected document mapping module 6540 may include an all-element mapping module 6542 for patent documents, a data/chart mapping module 6544 for research documents, and a style/structure mapping module 6546 for student paper documents. 
Any of these modules may be used to generate the mapped document 5060.”]
Levit and Brav pertain to natural language processing, as does the Claim, and it would have been obvious to combine or substitute the teachings of Brav with respect to filtering and classification of corpora according to a number of different factors such as date and style with the system of Levit, which classifies the corpora according to domain.  This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Claim 19 is a computer program product system claim with limitations corresponding to the limitations of method Claim 12 and is rejected under similar rationale.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 

Rastrow (U.S. 10,032,463): “At block 410, the computing device 500 or some other component of the spoken language processing system 100 can compute an objective function value for language model output generated above at block 410. Illustratively, the objective function may be a classification-based objective function, such as cross entropy. Generally described, computing the objective function value involves comparing, for a given training data input, the actual language model output with the expected or correct output to determine a difference or measure of error in the actual output. A NN is trained using such an objective function by modifying the parameters (e.g., weights) of the neural network to minimize the error.”  Col. 11, lines 37-45. 

See also: 
Ash (U.S. 20180315420): “[0179] Perplexity is a standard measure in the field of speech recognition and entropy is the logarithm of perplexity (and is normally used as it is more convenient in many cases).” 

Chen (U.S. 20120109651): “[0170] There are a number of ways of assessing the "goodness" of a LM. For the speech recognition purposes the recognition error rate achieved using a given LM is the most important criterion. Besides the recognition rate, the most common metric for evaluating a LM is cross-entropy or perplexity. Given a test set. W and a LM m, the cross-entropy between W and m is defined as …”

Ehsani (U.S. 20020128821): “[0099] Perplexity/Entropy [0100] Perplexity is a measure for determining the average branching factor of a recognition network and it is most often used as a measure for evaluating language models. It indicates the probability, computed over an entire network, that any given element can be followed by any other. For example, in a digit recognition system composed of 0-9 digits and two pronunciations for 0 ("oh" and "zero"), the perplexity of the recognition grammar exactly equals the number of elements, 11, because there are no constraining factors that favor certain digit sequences over others. Because word sequences underlie various kinds of constraints (imposed by syntax, morphology, idiomatic usage etc.) perplexity has been found useful in natural language processing to measure the strength of certain collocations (see, for example, Shimohata, S, T. Sugio, J. Nagata, "Retrieving Collocations by Co-occurrence and Word Order Constraints," Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, 1997, pp. 476-481.)”

Brav (U.S. 20150310128) teaches: Figure 1AB:  “[0208] Referring now to FIG. 1AB, in an embodiment, similar works finding module 6500 may include a corpus comparison module 6530….  Corpus comparison module 6530 also may include corpus filtering module 6534, which may filter out portions of the corpus (e.g., for a patent prior art search, it may filter by date, or may filter out certain references). Corpus comparison module 6530 also may include filtered corpus comparing module 6536, which may compare the filtered corpus to the source document.”
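
For illustration only, the relationship between entropy and perplexity described in the Ash, Chen, and Ehsani passages quoted above can be sketched as follows.  The function names and sample probabilities are assumptions introduced here and do not appear in any cited reference; base-2 logarithms are assumed.  The sketch computes cross-entropy as the average negative log probability that a model assigns to the observed tokens, and perplexity as 2 raised to that value, so that entropy is the logarithm of perplexity.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float]) -> float:
    """Average negative log2 probability assigned to each observed token."""
    return -sum(math.log2(p) for p in probs) / len(probs)


def perplexity(probs: Sequence[float]) -> float:
    """Perplexity is 2 raised to the cross-entropy (base-2 logs assumed)."""
    return 2.0 ** cross_entropy(probs)


# Uniform grammar over 11 elements (digits 0-9 with two pronunciations of 0),
# as in the Ehsani example: every observed token has probability 1/11.
uniform_probs = [1.0 / 11] * 5
uniform_ppl = perplexity(uniform_probs)  # approximately 11

# A model that favors some tokens over others has lower perplexity.
# These probabilities are assumed sample values.
skewed_probs = [0.5, 0.25, 0.5, 0.25]
skewed_ppl = perplexity(skewed_probs)
```

In the uniform case the perplexity equals the number of elements, consistent with the Ehsani passage, because no element is favored over another.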

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499. The examiner can normally be reached on 9 to 5, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached on 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Fariba Sirjani/
Primary Examiner, Art Unit 2659