DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
EXAMINER’S AMENDMENT
An examiner’s amendment to the record appears below. Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312. To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee.
Authorization for this examiner’s amendment was given in a telephone interview with Pengju Shang on 9/10/21.
The application has been amended as follows:
IN THE CLAIMS
The claims have been amended as follows:
Claim 1 (Currently amended): A non-transitory computer-readable storage medium for filtering sentence pairs in two languages, the storage medium storing instructions, when executed by one or more processors, cause the one or more processors to perform operations comprising:
receiving a corpus of sentence pairs, wherein a first half and a second half of each of the corpus of sentence pairs are associated with a first label of a first language and a second label of a second language, respectively;
determining a plurality of feature scores for each of the plurality of sentence pairs using:
dual monolingual cross-entropy deltas of (i) the first half of the sentence pair, with respect to a plurality of first training sentences, according to a first monolingual language model of the first language and a second monolingual language model of the first language and (ii) the second half of the sentence pair, with respect to a plurality of second training sentences, according to a first monolingual language model of the second language and a second monolingual language model of the second language, and
a first monolingual rank score for the first half of the sentence pair determined using a first ranking of the first half of the sentence pair when selecting the sentence pair to train a first language model[[s]] that models representative first sentences in the first language, and a second monolingual rank score for the second half of the sentence pair determined using a second ranking of the second half of the sentence pair when selecting the sentence pair to train a second language model[[s]], corresponding to the first language s representative second sentences in the second language;
determining an overall score for each of the plurality of sentence pairs using the plurality of feature scores for the sentence pair;
selecting a subcorpus of sentence pairs of the corpus of sentence pairs for performing a language processing task, wherein the selecting is based on [[using]] the overall score of each of the plurality of sentence pairs and a predetermined threshold of a number of different words in the first language comprised in the first halves of all sentence pairs of the subcorpus of sentence pairs


Claim 2 (Currently Amended) The non-transitory computer-readable storage medium of claim 1, wherein the operations further comprise 
	training a machine translation model using the subcorpus of sentence pairs, and wherein performing the language processing task comprises: performing the language processing task using the machine translation model.

Claim 3 (Currently Amended) A system for filtering sentence pairs in two languages comprising:
one or more processors; and

receiving a plurality of sentence pairs, wherein a first half and a second half of each of the plurality of sentence pairs are associated with a first label of a first language and a second label of a second language, respectively;
determining a plurality of feature scores for each of the plurality of sentence pairs according to a first monolingual language model of the first language, a second monolingual language model of the first language, a first monolingual language model of the second language, and a second monolingual language model of the second language, wherein the determining comprises:
determining a dual monolingual cross-entropy delta-score for each of the plurality of sentence pairs using dual monolingual cross-entropy deltas with respect to the first half and the second half of the sentence pair according to the first monolingual language model of the first language, the second monolingual language model of the first language, the first monolingual language model of the second language, and the second monolingual language model of the second language, and
determining a first monolingual rank score and a second monolingual rank score for the first half and the second half, respectively, of the sentence pair determined using a first ranking of the first half of the sentence pair and a second ranking of the second half of the sentence pair, respectively, when selecting the first half and the second half of the sentence pair to train a first language model and a second language model, respectively, that model representative first sentences in the first language and second sentences in the second language, respectively;
determining an overall score for each of the plurality of sentence pairs using the plurality of feature scores for the sentence pair; and
selecting a subset of sentence pairs of the plurality of sentence pairs using the overall score of each of the plurality of sentence pairs for performing a language processing task using the subset of sentence pairs, wherein the selecting is based on the overall score of each of the plurality of sentence pairs and a predetermined threshold of a number of different words in the first language comprised in the first halves of all sentence pairs of the subcorpus of sentence pairs.
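For illustration only, and not as part of the claim language, the selecting step recited above might be sketched as follows. The sketch assumes a greedy pass over the pairs in descending order of overall score, whitespace tokenization, and a stopping rule that halts before the number of different first-language words would exceed the threshold. None of these choices is specified in the claim, and the names `select_subset` and `max_distinct_words` are hypothetical.

```python
from typing import List, Tuple

ScoredPair = Tuple[float, str, str]  # (overall score, first half, second half)


def select_subset(scored_pairs: List[ScoredPair],
                  max_distinct_words: int) -> List[Tuple[str, str]]:
    """Greedily keep the best-scoring pairs while the first-language
    vocabulary of the kept pairs stays within max_distinct_words."""
    selected: List[Tuple[str, str]] = []
    vocabulary = set()
    # Assumption: a higher overall score means a more desirable pair.
    ranked = sorted(scored_pairs, key=lambda pair: pair[0], reverse=True)
    for _score, first_half, second_half in ranked:
        # Assumption: "different words" are whitespace-separated tokens.
        candidate_vocabulary = vocabulary | set(first_half.split())
        if len(candidate_vocabulary) > max_distinct_words:
            break  # adding this pair would exceed the threshold
        vocabulary = candidate_vocabulary
        selected.append((first_half, second_half))
    return selected
```

In this sketch the threshold bounds the vocabulary of the selected first halves rather than the number of pairs, which is one possible reading of the "predetermined threshold of a number of different words" limitation.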


Cancel claim 4.

	Claim 5 (Currently amended): The system of claim [[4]]3, wherein the dual monolingual cross-entropy[[-]]delta-score for each of the plurality of sentence pairs comprises an exponentiated version of the dual monolingual cross-entropy deltas 

	Claim 6 (Currently amended): The system of claim [[4]]3, wherein the memory further stores instructions, that when executed by the one or more processors, cause the system to perform:
.

	Claim 7 (Currently amended):  The system of claim [[6]]3, wherein the dual monolingual cross-entropy deltas of each of the plurality of sentence pairs are related to a first informativeness of the first half of the sentence pair with respect to the first monolingual language model in the first language and a second informativeness of the second half of the sentence pair with respect to the first monolingual language model in the second language.

	Claim 8 (Currently amended):  The system of claim [[4]]6, wherein determining the dual monolingual cross-entropy deltas with respect to the first half and the second half of each of the plurality of sentence pairs comprises:
for each of the plurality of sentence pairs:
determining (i) a first entropy, with respect to the first language, of [[the]] a plurality of first training sentences using the first language model in the first language;
determining (ii) a second entropy, with respect to the first language, of the plurality of first training sentences and the first half of the sentence pair using the second language model in the first language;

determining (i) a first entropy, with respect to the second language, of [[the]] a plurality of second training sentences using the first language model in the second language;
determining (ii) a second entropy, with respect to the second language, of the plurality of second training sentences and the second half of the sentence pair using the second language model in the second language;
determining (b) a second delta in entropies between (i) the first entropy, with respect to the second language and (ii) the second entropy, with respect to the second language;
wherein the dual monolingual cross-entropy deltas comprise a sum of 
the difference between (a) the first delta in entropies and (b) the second delta in entropies, and
the average of (a) the first delta in entropies and (b) the second delta in entropies.
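For illustration only, and not as part of the claim language, the combination recited at the end of claim 8 might be sketched as follows. The sketch assumes that each delta is the first entropy minus the second entropy and that "the difference" between the two deltas is taken as an absolute value. Both conventions are assumptions that the claim does not state, and the function names are hypothetical.

```python
def entropy_delta(first_entropy: float, second_entropy: float) -> float:
    """Delta between the entropy measured with the first language model
    and the entropy measured with the second language model.
    Assumption: the sign convention is first minus second."""
    return first_entropy - second_entropy


def dual_delta_score(first_delta: float, second_delta: float) -> float:
    """Sum of (1) the difference between the two deltas and
    (2) the average of the two deltas."""
    # Assumption: the difference is taken as an absolute value.
    difference = abs(first_delta - second_delta)
    average = (first_delta + second_delta) / 2.0
    return difference + average
```

For example, with a first delta of 0.5 and a second delta of 0.25, the difference is 0.25 and the average is 0.375, giving 0.625 under these assumptions.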

Claim 11 (Currently amended):  The system of claim 9, wherein a first monolingual corpus comprises the plurality of first training sentences, wherein a second monolingual corpus comprises the plurality of second training sentences, and wherein the first monolingual corpus and the second monolingual corpus 

Claim 15 (Currently amended):   The system of claim 3, wherein determining the plurality of feature scores for each of the plurality of sentence pairs comprises:
determining a length score using a ratio of a first length of the first half of the sentence pair and a second length of the second half of the sentence pair; and
determining a language identification score using a first confidence score and a second confidence score that words comprised in the first half and the second half of the sentence pair, respectively, are in the first language and the second language, respectively; and
determining a rank score comprising a product of [[a]]the first monolingual rank score and [[a]]the second monolingual rank score 
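For illustration only, and not as part of the claim language, the three feature scores of claim 15 might be sketched as follows. Only the rank score is recited as a product. The shorter-to-longer form of the length ratio, the whitespace word count, and the product used for the language identification score are assumptions, and all function names are hypothetical.

```python
def length_score(first_half: str, second_half: str) -> float:
    """Ratio of the two sentence lengths, measured in whitespace tokens.
    Assumption: shorter length divided by longer length, so the score
    lies in [0, 1] and equals 1 when the halves have equal length."""
    first_length = len(first_half.split())
    second_length = len(second_half.split())
    if first_length == 0 or second_length == 0:
        return 0.0
    return min(first_length, second_length) / max(first_length, second_length)


def language_id_score(first_confidence: float, second_confidence: float) -> float:
    """Assumption: the two confidence scores are combined by multiplication."""
    return first_confidence * second_confidence


def rank_score(first_rank_score: float, second_rank_score: float) -> float:
    """Product of the first and second monolingual rank scores, as recited."""
    return first_rank_score * second_rank_score
```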

	Cancel Claim 17.

Claim 18 (Currently amended): The system of claim [[17]]3, wherein the first predetermined threshold comprises a number of different words in the first language comprised in the first halves of all sentence pairs of the subset of sentence pairs.

	Cancel Claim 20.

Allowable Subject Matter
Claims 1-3, 5-16, and 18-19 are allowed.
The following is an examiner’s statement of reasons for allowance: Claims 1 and 3 and their dependent claims are allowed because the closest prior art, either alone or in combination, fails to anticipate or render obvious the claimed limitations of “receiving a corpus of sentence pairs, wherein a first half and a second half of each of the corpus of sentence pairs are associated with a first label of a first language and a second label of a second language, respectively; determining a plurality of feature scores for each of the plurality of sentence pairs using: dual monolingual cross-entropy deltas of (i) the first half of the sentence pair, with respect to a plurality of first training sentences, according to a first monolingual language model of the first language and a second monolingual language model of the first language and (ii) the second half of the sentence pair, with respect to a plurality of second training sentences, according to a first monolingual language model of the second language and a second monolingual language model of the second language, and a first monolingual rank score for the first half of the sentence pair determined using a first ranking of the first half of the sentence pair when selecting the sentence pair to train a first language model that models representative first sentences in the first language, and a second monolingual rank score for the second half of the sentence pair determined ...” in combination with all other limitations in the claim(s) as defined by applicant.
Consequently, the independent claims are allowed for the reasons discussed above. Because each dependent claim depends on one of the above independent claims, the dependent claims are also patentable.
The closest prior art:
Bojar et al. (US 11037028) discloses a computer-implemented method for creating a translation model for low-resource language pairs that is applicable to noisy inputs. The method uses several approaches: choosing particular input corpora covering in-domain noisy and clean texts as well as unrelated but larger general parallel texts; applying several chosen methods of creating synthetic parallel corpora; and filtering, pre-processing, deduplicating, and concatenating training corpora. However, Bojar fails to disclose the subject matter as recited above in the independent claims.
Hughes et al. (US 10318640) discloses techniques for evaluating when words or phrases of a translation were generated with a low degree of confidence, and conveying this information when the translation is presented. For example, if a source language word is encountered in source material for translation, but the source language word was only encountered a few times (or not at all) in the training data used to train the translation system, then the resulting translation may be flagged as being 
Axelrod et al. (US 2012/0203539) discloses an architecture that provides the capability to subselect the most relevant data from an out-domain corpus to use either in isolation or in conjunction with in-domain data. The architecture is a domain adaptation for machine translation that selects the most relevant sentences from a larger general-domain corpus of parallel translated sentences. The methods for selecting the data include a monolingual cross-entropy measure, monolingual cross-entropy difference, bilingual cross-entropy, and bilingual cross-entropy difference. A translation model is trained on both the in-domain data and an out-domain subset, and the models can be interpolated together to boost performance on in-domain translation tasks. However, Axelrod fails to disclose the subject matter as recited above in the independent claims.
However, none of the above references teaches or fairly suggests the combination of the limitations as recited in the claims listed above.
Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANTIM G SHAH whose telephone number is (571)270-5214.  The examiner can normally be reached on Mon-Fri 7:30am-4pm.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ANTIM G SHAH/Primary Examiner, Art Unit 2652