DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
The present application was filed on 07/16/2019. Claims 1-20 are pending and have been examined. Claims 1, 11, and 20 are independent claims.
The present application claims the benefit of provisional application 61/934,674 (filed on 01/31/2014) and claims priority to parent application no. 14/609,869 (filed on 01/30/2015).
Priority 
Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.
Information Disclosure Statement
The information disclosure statements (IDS) were submitted on 02/06/2020, 12/01/2020, 04/21/2021, and 08/06/2021. The submissions are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.
Specification
The specification is objected to as failing to provide proper antecedent basis for the claimed subject matter.  See 37 CFR 1.75(d)(1) and MPEP § 608.01(o).  Correction of the following is required: 
Claim 20 recites “computer-readable storage media”, but the Specification does not recite “computer-readable storage media”; therefore, the Specification is objected to.
Claim Objections
Claims 1-20 are objected to because of the following informalities:
In claim 1, line 10, “The sequence” should read “the sequence of words”.
In claim 5, line 2, “the neural network system” should read “the trained neural network system”.
In claim 7, line 2, “the plurality of sequences” should read “the plurality of sequences of words”.
In claim 7, line 3, “the sequence” should read “the sequence of words”.
In claim 7, line 5, “the sequence” should read “the sequence of words”.
In claim 7, line 9, “the sequence” should read “the sequence of words”.
In claim 8, line 2, “the sequence” should read “the sequence of words”.
In claim 9, line 2, “the sequence” should read “the sequence of words”.
In claim 11, line 14, “The sequence” should read “the sequence of words”.
In claim 15, line 2, “the neural network system” should read “the trained neural network system”.
In claim 17, line 2, “the plurality of sequences” should read “the plurality of sequences of words”.
In claim 17, line 3, “the sequence” should read “the sequence of words”.
In claim 17, line 5, “the sequence” should read “the sequence of words”.
In claim 17, line 9, “the sequence” should read “the sequence of words”.
In claim 18, line 2, “the sequence” should read “the sequence of words”.
In claim 19, line 2, “the sequence” should read “the sequence of words”.
In claim 20, line 13, “The sequence” should read “the sequence of words”.
Claims 12-19 depend on claim 11 and do not cure the deficiencies of claim 11; therefore, claims 12-19 are objected to for the same reasons.
Claims 2-10 depend on claim 1 and do not cure the deficiencies of claim 1; therefore, claims 2-10 are objected to for the same reasons.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 

This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitation(s) is/are: 
Claim 2:
	further comprising providing the vector representation of the new document as an input to a machine learning system configured to process the vector representation;
Claim 3:
	an embedding layer configured to map the input document and each word in the sequence of words from the input document to respective vector representations
	a combining layer configured to combine the vector representations into a combined representation
	a classifier layer configured to generate the word scores using the combined representation
Claim 8:
	the combining layer is configured to concatenate the vector representations of the words in the sequence with the vector representation of the input document.
Claim 9:
	the combining layer is configured to average the vector representations of the words in the sequence and the vector representation of the input document
Claim 12:
	a machine learning system configured to process the vector representation.
Claim 13:
	an embedding layer configured to map the input document and each word in the sequence of words from the input document to respective vector representation
	a combining layer configured to combine the vector representations into a combined representation,
	a classifier layer configured to generate the word scores using the combined representation.
Claim 18:
	the combining layer is configured to concatenate the vector representations of the words in the sequence with the vector representation of the input document.
Claim 19:
	the combining layer is configured to average the vector representations of the words in the sequence and the vector representation of the input document.
Upon review of the Specification, each of the bolded generic placeholders in the claims above is described in Fig. 1 of the Drawings and in the following paragraph:
[0028] “the document representation can be used as a feature of the input document 
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as failing to set forth the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
Each of the claim limitations in claims 2, 3, 8, 9, 12, 13, 18, and 19, as identified in section 6 of this Office Action, invokes 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. However, the written description fails to disclose the corresponding structure, material, or acts for performing the entire claimed function and to clearly link the structure, material, or acts to the function.
Therefore, the claims are indefinite and are rejected under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph.
Applicant may:
(a)	Amend the claim so that the claim limitation will no longer be interpreted as a limitation under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph;
(b)	Amend the written description of the specification such that it expressly recites what structure, material, or acts perform the entire claimed function, without introducing any new matter (35 U.S.C. 132(a)); or
(c)	Amend the written description of the specification such that it clearly links the structure, material, or acts disclosed therein to the function recited in the claim, without introducing any new matter (35 U.S.C. 132(a)).
If applicant is of the opinion that the written description of the specification already implicitly or inherently discloses the corresponding structure, material, or acts and clearly links them to the function so that one of ordinary skill in the art would recognize what structure, material, or acts perform the claimed function, applicant should clarify the record by either:
(a)	Amending the written description of the specification such that it expressly recites the corresponding structure, material, or acts for performing the claimed function and clearly links or associates the structure, material, or acts to the claimed function, without introducing any new matter (35 U.S.C. 132(a)); or
(b)	Stating on the record what the corresponding structure, material, or acts, which are implicitly or inherently set forth in the written description of the specification, perform the claimed function. For more information, see 37 CFR 1.75(d) and MPEP §§ 608.01(o) and 2181.
Claim 1 recites the limitation “the respective word scores” in line 9. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, the examiner has interpreted this limitation as “respective word scores”.
Claim 1 recites the limitation “the corresponding word” in line 10. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, the examiner has interpreted this limitation as “a corresponding word”.
Claim 4 recites the limitation “the words” in line 2. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, the examiner has interpreted this limitation as “a plurality of words”.
Claim 5 recites the limitation “the values” in line 1. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, the examiner has interpreted this limitation as “a plurality of values”.
Claim 11 recites the limitation “the respective word scores” in line 13. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, the examiner has interpreted this limitation as “respective word scores”.
Claim 11 recites the limitation “the corresponding word” in line 14. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, the examiner has interpreted this limitation as “a corresponding word”.
Claim 14 recites the limitation “the words” in line 2. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, the examiner has interpreted this limitation as “a plurality of words”.
Claim 15 recites the limitation “the values” in line 1. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, the examiner has interpreted this limitation as “a plurality of values”.
Claim 20 recites the limitation “the respective word scores” in line 12. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, the examiner has interpreted this limitation as “respective word scores”.
Claim 20 recites the limitation “the corresponding word” in line 13. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, the examiner has interpreted this limitation as “a corresponding word”.
Claims 2-10 depend on claim 1 and do not cure the deficiencies of claim 1; therefore, claims 2-10 are rejected for the same reasons.
Claims 12-19 depend on claim 11 and do not cure the deficiencies of claim 11; therefore, claims 12-19 are rejected for the same reasons.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Regarding Claim 20:
	The broadest reasonable interpretation of the claim covers a signal per se; the claim must therefore be rejected under 35 U.S.C. § 101 as covering non-statutory subject matter. See In re Nuijten, 500 F.3d 1346, 1356-57 (Fed. Cir. 2007) (transitory embodiments are not directed to statutory subject matter) and Interim Examination Instructions for Evaluating Subject Matter Eligibility Under 35 U.S.C. § 101, Aug. 24, 2009, p. 2; 1351 Off. Gaz. Pat. Off. 212 (2010). Under the broadest reasonable interpretation, “computer-readable storage media” recited in claim 20 encompasses any form of modulated data signal, carrier wave, and so forth. Nuijten, 500 F.3d at 1357. The claim covers materials “not found in any of the four statutory categories [and thus] falls outside the plainly expressed scope of § 101.” Id. at 1354. A recommended amendment is to recite “non-transitory computer-readable storage media” (emphasis added) without contradiction.
Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 1, 3, 4, and 6-10 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-6 of U.S. Patent No. US 10366327B2. Although the claims at issue are not identical, they are not patentably distinct from each other because all the claimed limitations recited in the present application are transparently found in U.S. Patent No. US 10366327B2 with obvious wording variations. The instant application’s claims are anticipated by the reference patent's claims.
Regarding Claims 1, 3, 4, and 6-10, the claims of Instant Application No. 16/523,766 are compared with the claims of U.S. Patent No. US 10366327B2 as follows:
Instant application, Claim 1:
A method comprising: 

obtaining a new document, wherein the new document includes a plurality of sequences of words, and, for each sequence of words, a word that follows a last word in the sequence of words in the new document; 

and determining a vector representation for the new document using a trained neural network system, 

wherein the trained neural network system has been trained on a plurality of unlabeled documents and has been trained to receive an input document and a sequence of words from the input document and to generate a respective word score for each word in a pre-determined set of words, wherein each of the respective word scores represents a predicted likelihood that the corresponding word follows a last word in the sequence in the input document, 

and wherein determining the vector representation for the new document using the trained neural network system comprises iteratively providing each of the plurality of sequences of words to the trained neural network system to determine the vector representation for the new document using gradient descent.
Reference patent, Claim 1:
A method comprising: 

obtaining a new document; extracting, from the new document, (i) a plurality of sequences of words that are each a pre-determined fixed length, and, (ii) for each sequence of words, a word that follows a last word in the sequence of words in the new document; 

and determining a vector representation for the new document using a trained neural network system, 

wherein the trained neural network system has been trained on a plurality of unlabeled documents and has been trained to: receive data identifying an input document and a sequence of words from the input document, generate, from the data identifying the input document, a vector representation of the input document, and process the vector representation of the input document and the sequence of words from the input document in accordance with trained values of a third set of parameters to generate a respective word score for each word in a pre-determined set of words, wherein each 

and wherein determining the vector representation for the new document using the trained neural network system comprises iteratively providing each of the plurality of sequences of words to the trained neural network system to determine the vector representation for the new document using gradient descent, comprising, for each sequence of words: providing the sequence of words to the trained neural network system to obtain a respective word score for each word in the pre-determined set of words generated using the vector representation of the new document and in accordance with the trained values of the third set of parameters, computing a gradient with respect to the vector representation of an error function that measures an error between the respective word scores and a target set of word scores that identifies the word that follows the last word in the sequence of words in the new document, and adjusting the vector representation for the new document based on the gradient using gradient descent while holding the trained values of the third set of parameters fixed.

Instant application, Claim 3:
    The method of claim 1, wherein the trained neural network system comprises an embedding layer configured to map the input document and each word in the sequence of words from the input document to respective vector representations, a combining layer configured to combine the vector 
Reference patent, Claim 2:
The method of claim 1, wherein the trained neural network system comprises an embedding layer configured to map the input document and each word in the sequence of words from the input document to respective vector representations, a combining layer configured to combine 
Instant application, Claim 4:
The method of claim 3, wherein the embedding layer maps the words in the sequence of words to vector representations in accordance with a first set of parameters, and wherein the classifier layer generates the word scores from the combined representation in accordance with a second set of parameters.
Reference patent, Claim 3:
The method of claim 2, wherein the third set of parameters includes a first set of parameters and a second set of parameters, and wherein the embedding layer maps the words in the sequence of words to vector representations in accordance with the first set of parameters, and wherein the classifier layer generates the word scores from the combined representation in accordance with the second set of parameters.
Instant application, Claim 6:
The method of claim 3, wherein determining the vector representation for the new document using the trained neural network system comprises performing a respective iteration of gradient descent for each of the plurality of sequences of words to adjust the vector representation of the new document from a previous iteration of gradient descent.

Reference patent, Claim 1:
A method comprising: obtaining a new document; extracting, from the new document, (i) a plurality of sequences of words that are each a pre-determined fixed length, and, (ii) for each sequence of words, a word that follows a last word in the sequence of words in the new document; and determining a vector representation for the new document using a trained neural network system, wherein the trained neural network system has been trained on a plurality of unlabeled documents and has been trained to: receive data identifying an input document and a sequence of words from the input document, generate, from the data identifying the input document, a vector representation of the input document, and process the vector representation of the input document and the sequence of words from the input document in accordance with trained values of a third set of parameters to generate a respective word score for each word in a pre-determined set of words, wherein each 

and wherein determining the vector representation for the new document using the trained neural network system comprises iteratively providing each of the plurality of sequences of words to the trained neural network system to determine the vector representation for the new document using gradient descent, 

comprising, for each sequence of words: providing the sequence of words to the trained neural network system to obtain a respective word score for each word in the pre-determined set of words generated using the vector representation of the new document and in accordance with the trained values of the third set of parameters, computing a gradient with respect to the vector representation of an error function that measures an error between the respective word scores and a target set of word scores that identifies the word that follows the last word in the sequence of words in the new document, and adjusting the vector representation for the new document based on the gradient using gradient descent while holding the trained values of the third set of parameters fixed.

Instant application, Claim 7:

The method of claim 6, 

wherein the performing the respective iteration of gradient descent for each of the plurality of sequences comprises: 

mapping each of the words in the sequence to a vector representation using the embedding layer; 

combining the vector representation for the words in the sequence with the vector representation for the new document from the previous iteration to generate a combined representation; 

generating word scores from the combined representation; 

computing a gradient using the word scores and the word that follows the sequence in the new document; 

and adjusting the vector representation for the new document from the previous iteration using the gradient.
Reference patent, Claim 4:

 The method of claim 2, 

wherein, for each of the plurality of sequences the trained neural network system is configured to: 

map each of the words in the sequence to a vector representation using the embedding layer; 

combine the vector representation for the words in the sequence with the vector representation for the new document from the previous sequence of words to generate a combined representation;

 and generate the word scores from the combined representation.

Reference patent, Claim 1:
A method comprising: obtaining a new document; extracting, from the new document, (i) a plurality of sequences of words that are each a pre-determined fixed length, and, (ii) for each sequence of words, a word that follows a last word in the sequence of words in the new document; and determining a vector representation for the new document using a trained neural network system, wherein the trained neural network system has been trained on a plurality of unlabeled documents and has been trained to: receive data identifying an input document and a sequence of words from the input document, generate, from the data identifying the input document, a vector representation of the input document, and process the vector representation of the input document and the sequence of words from the input document in accordance with trained values of a third set of parameters to generate a respective word score for each word in a pre-determined set of words, wherein each of the respective word scores represents a predicted likelihood that
the corresponding word follows a last 

computing a gradient with respect to the vector representation of an error function that measures an error between the respective word scores and a target set of word scores that identifies the word that follows the last word in the sequence of words in the new document, 

and adjusting the vector representation for the new document based on the gradient using gradient descent while holding the trained values of the third set of parameters fixed.

Instant application, Claim 8:
The method of claim 3, wherein the combining layer is configured to concatenate the vector representations of the words in the sequence with the vector representation of the input document.
Reference patent, Claim 5:
The method of claim 2, wherein the combining layer is configured to concatenate the vector representations of the words in the sequence with the vector representation of the input document.
Instant application, Claim 9:
The method of claim 3, wherein the combining layer is configured to average the vector representations of the words in 

Reference patent, Claim 6:
The method of claim 2, wherein the combining layer is configured to average the vector representations of 
Instant application, Claim 10:
The method of claim 1, wherein each of the plurality of sequences of words contains a fixed number of words.
Reference patent, Claim 1:
A method comprising: 

obtaining a new document; extracting, from the new document, (i) a plurality of sequences of words that are each a pre-determined fixed length, 

and, (ii) for each sequence of words, a word that follows a last word in the sequence of words in the new document; 

and determining a vector representation for the new document using a trained neural network system, 

wherein the trained neural network system has been trained on a plurality of unlabeled documents and has been trained to: receive data identifying an input document and a sequence of words from the input document, generate, from the data identifying the input document, a vector representation of the input document, and process the vector representation of the input document and the sequence of words from the input document in accordance with trained values of a third set of parameters to generate a respective word score for each word in a pre-determined set of words, wherein each of the respective word scores represents a predicted likelihood that the corresponding word follows a last word in the sequence in the input document, 

and wherein determining the vector representation for the new document using the trained neural network system comprises iteratively providing each of 

Claim 1 of the instant application differs from claim 1 of the reference patent in that claim 1 (instant) recites “wherein the new document includes” whereas claim 1 (reference) recites “extracting, from the new document”. The differences between the recitations are minor and do not distinguish the overall appearance of one over the other; the reference's “new document” reads on the instant's “new document”.
Claim 6 of the instant application differs from claim 1 of the reference patent in that claim 6 (instant) recites “performing a respective iteration of gradient descent…. to adjust the vector representation… from a previous iteration of gradient descent” whereas claim 1 (reference) recites “iteratively providing each of the plurality of sequences of words to the trained neural network system to determine the vector representation for the new document using gradient descent”. The differences between the recitations are minor and do not distinguish the overall appearance of one over the other; the reference's iterative providing of the sequences of words to determine the vector representation using gradient descent reads on the instant's respective iteration of gradient descent that adjusts the vector representation.
Claim 7 of the instant application differs from claims 4 and 1 of the reference patent in that claim 7 (instant) recites “wherein the performing the respective iteration of gradient descent for each of the plurality of sequences comprises…the previous iteration…using the word scores and the word that follows the sequence….the previous iteration using the gradient.” whereas claim 4 (reference) recites “wherein, for each of the plurality of sequences the trained neural network system is configured to:… the previous sequence of words to generate a combined representation;” and claim 1 (reference) recites “the respective word scores and a target set of word scores that identifies the word that follows the last word in the sequence of words in the new document… based on the gradient using gradient descent”. The differences between the recitations are minor and do not distinguish the overall appearance of one over the other; the reference's “plurality of sequences the trained neural network system” reads on the instant's “the respective iteration of gradient descent for each of the plurality of sequences”; the reference's “previous sequence of words” reads on the instant's “previous iteration”; the reference's “the respective word scores and a target set of word scores that identifies the word that follows the last word in the sequence of words” reads on the instant's “using the word scores and the word that follows the sequence”; and the
Dependent claims 3, 4, and 8-10 of the instant application are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-6 of U.S. Patent No. US 10366327B2 (reference application) for the same rationale as discussed with respect to instant application claim 1. 
Claims 11, 13, 14 and 16-20 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-6 of U.S. Patent No. US 10366327B2. The instant application's claims are obvious variations of the reference patent's claims.
Regarding Claims 11, 13, 14 and 16-20:

Claim 11 (instant application):
A system comprising: 

one or more computers and one or more storage devices on which are stored instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: 

obtaining a new document, wherein the new document includes a plurality of sequences of words, and, for each sequence of words, a word that follows a last word in the sequence of words in the new document;

and determining a vector representation for the new document using a trained neural network system, 

wherein the trained neural network system has been trained on a plurality of unlabeled documents and has been trained to receive an input document and a sequence of words from the input document and to generate a respective word score for each word in a pre-determined set of words, wherein each of the respective word scores represents a predicted likelihood that the corresponding word follows a last word in the sequence in the input document, and 

wherein determining the vector representation for the new document using the trained neural network system comprises iteratively providing each of the plurality of sequences of words to the trained neural network system to determine the vector representation for the new document using gradient descent.
Claim 1 (reference patent):
A method comprising: 

obtaining a new document; extracting, from the new document, (i) a plurality of sequences of words that are each a pre-determined fixed length, and, (ii) for each sequence of words, a word that follows a last word in the sequence of words in the new document; 

and determining a vector representation for the new document using a trained neural network system, 

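wherein the trained neural network system has been trained on a plurality of unlabeled documents and has been trained to: receive data identifying an input document and a sequence of words from the input document, generate, from the data identifying the input document, a vector representation of the input document, and process the vector representation of the input document and the sequence of words from the input document in accordance with trained values of a third set of parameters to generate a respective word score for each word in a pre-determined set of words, wherein each of the respective word scores represents a predicted likelihood that the corresponding word follows a last word in the sequence in the input document, 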

and wherein determining the vector representation for the new document using the trained neural network system comprises iteratively providing each of the plurality of sequences of words to the trained neural network system to determine the vector representation for the new document using gradient descent, comprising, for each sequence of words: providing the sequence of words to the trained neural network system to obtain a respective word score for each word in the pre-determined set of words generated using the vector representation of the new document and in accordance with the trained values of the third set of parameters, computing a gradient with respect to the vector representation of an error function that measures an error between the respective word scores and a target set of word scores that identifies the word that follows the last word in the sequence of words in the new document, and adjusting the vector representation for the new document based on the gradient using gradient descent while holding the trained values of the third set of parameters fixed.

Claim 13 (instant application):
  The system of claim 11, wherein the trained neural network system comprises an embedding layer configured to map the input document and each word in the sequence of words from the input document to respective vector representations, a combining layer configured to combine the vector representations into a combined representation, and a classifier layer configured to generate the word scores using the combined representation.
Claim 2 (reference patent):
The method of claim 1, wherein the trained neural network system comprises an embedding layer configured to map the input document and each word in the sequence of words from the input document to respective vector representations, a combining layer configured to combine the vector representations into a combined representation, and a classifier layer configured to generate the word scores using the combined representation
Claim 14 (instant application):
    The system of claim 13, wherein the embedding layer maps the words in the sequence of words to vector representations in accordance with a first set of parameters, and wherein the classifier layer generates the word scores from the combined representation in accordance with a second set of parameters.

Claim 3 (reference patent):
The method of claim 2, wherein the third set of parameters includes a first set of parameters and a second set of parameters, and wherein the embedding layer maps the words in the sequence of words to vector representations in accordance with the first set of parameters, and wherein the classifier layer generates the word scores from the combined representation in accordance with the second set of parameters.
Claim 16 (instant application):
The system of claim 13, wherein determining the vector representation for the new document using the trained neural network system comprises performing a respective iteration of gradient descent for each of the plurality of sequences of words to adjust the vector representation of the new document from a previous iteration of gradient descent.
Claim 1 (reference patent):
A method comprising: obtaining a new document; extracting, from the new document, (i) a plurality of sequences of words that are each a pre-determined fixed length, and, (ii) for each sequence of words, a word that follows a last word in the sequence of words in the new document; and determining a vector representation for the new document using a trained neural network system, wherein the trained neural network system has been trained on a plurality of unlabeled documents and has been trained to: receive data identifying an input document and a sequence of words from the input document, generate, from the data identifying the input document, a vector representation of the input document, and process the vector representation of the input document and the sequence of words from the input document in accordance with trained values of a third set of parameters to generate a respective word score for each word in a pre-determined set of words, wherein each of the respective word scores represents a predicted likelihood that the corresponding word follows a last word in the sequence in the input document, 

and wherein determining the vector representation for the new document using the trained neural network system comprises iteratively providing each of the plurality of sequences of words to the trained neural network system to determine the vector representation for the new document using gradient descent, 

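comprising, for each sequence of words: providing the sequence of words to the trained neural network system to obtain a respective word score for each word in the pre-determined set of words generated using the vector representation of the new document and in accordance with the trained values of the third set of parameters, computing a gradient with respect to the vector representation of an error function that measures an error between the respective word scores and a target set of word scores that identifies the word that follows the last word in the sequence of words in the new document, and adjusting the vector representation for the new document based on the gradient using gradient descent while holding the trained values of the third set of parameters fixed.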
Claim 17 (instant application):
The system of claim 16, 

wherein the performing the respective iteration of gradient descent for each of the plurality of sequences comprises: 

mapping each of the words in the sequence to a vector representation using the embedding layer; 

combining the vector representation for the words in the sequence with the vector representation for the new document from the previous iteration to generate a combined representation; 

generating word scores from the combined representation;

computing a gradient using the word scores and the word that follows the sequence in the new document; 

and adjusting the vector representation for the new document from the previous iteration using the gradient.
Claim 4 (reference patent):
 The method of claim 2, 

wherein, for each of the plurality of sequences the trained neural network system is configured to: 

map each of the words in the sequence to a vector representation using the embedding layer; 

combine the vector representation for the words in the sequence with the vector representation for the new document from the previous sequence of words to generate a combined representation;

 and generate the word scores from the combined representation.

Claim 1 (reference patent):
A method comprising: obtaining a new document; extracting, from the new document, (i) a plurality of sequences of words that are each a pre-determined fixed length, and, (ii) for each sequence of words, a word that follows a last word in the sequence of words in the new document; and determining a vector representation for the new document using a trained neural network system, wherein the trained neural network system has been trained on a plurality of unlabeled documents and has been trained to: receive data identifying an input document and a sequence of words from the input document, generate, from the data identifying the input document, a vector representation of the input document, and process the vector representation of the input document and the sequence of words from the input document in accordance with trained values of a third set of parameters to generate a respective word score for each word in a pre-determined set of words, wherein each of the respective word scores represents a predicted likelihood that the corresponding word follows a last word in the sequence in the input document, and wherein determining the vector representation for the new document using the trained neural network system comprises iteratively providing each of the plurality of sequences of words to the trained neural network system to determine the vector representation for the new document using gradient descent, comprising, for each sequence of words: providing the sequence of words to the trained neural network system to obtain a respective word score for each word in the pre-determined set of words generated using the vector representation of the new document and in accordance with the trained values of the third set of parameters, 

computing a gradient with respect to the vector representation of an error function that measures an error between the respective word scores and a target set of word scores that identifies the word that follows the last word in the sequence of words in the new document, 

and adjusting the vector representation for the new document based on the gradient using gradient descent while holding the trained values of the third set of parameters fixed.

Claim 18 (instant application):
The system of claim 13, wherein the combining layer is configured to concatenate the vector representations of the words in the sequence with the vector representation of the input document.
Claim 5 (reference patent):
The method of claim 2, wherein the combining layer is configured to concatenate the vector representations of the words in the sequence with the vector representation of the input document.
Claim 19 (instant application):
The system of claim 13, wherein the combining layer is configured to average the vector representations of the words in the sequence and the vector representation of the input document.
Claim 6 (reference patent):
The method of claim 2, wherein the combining layer is configured to average the vector representations of the words in the sequence and the vector representation of the input document.
Claim 20 (instant application):
One or more computer-readable storage media encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform operations comprising: 

obtaining a new document, wherein the new document includes a plurality of sequences of words, and, for each sequence of words, a word that follows a last word in the sequence of words in the new document;


 and determining a vector representation for the new document using a trained neural network system, 


wherein the trained neural network system has been trained on a plurality of unlabeled documents and has been trained to receive an input document and a sequence of words from the input document and to generate a respective word score for each word in a pre-determined set of words, wherein each of the respective word scores represents a predicted likelihood that the corresponding word follows a last word in the sequence in the input document, 

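and wherein determining the vector representation for the new document using the trained neural network system comprises iteratively providing each of the plurality of sequences of words to the trained neural network system to determine the vector representation for the new document using gradient descent.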
Claim 1 (reference patent):
A method comprising: 

obtaining a new document; extracting, from the new document, (i) a plurality of sequences of words that are each a pre-determined fixed length, and, (ii) for each sequence of words, a word that follows a last word in the sequence of words in the new document; 

and determining a vector representation for the new document using a trained neural network system, 

wherein the trained neural network system has been trained on a plurality of unlabeled documents and has been trained to: receive data identifying an input document and a sequence of words from the input document, generate, from the data identifying the input document, a vector representation of the input document, and process the vector representation of the input document and the sequence of words from the input document in accordance with trained values of a third set of parameters to generate a respective word score for each word in a pre-determined set of words, wherein each of the respective word scores represents a predicted likelihood that the corresponding word follows a last word in the sequence in the input document, 

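and wherein determining the vector representation for the new document using the trained neural network system comprises iteratively providing each of the plurality of sequences of words to the trained neural network system to determine the vector representation for the new document using gradient descent, comprising, for each sequence of words: providing the sequence of words to the trained neural network system to obtain a respective word score for each word in the pre-determined set of words generated using the vector representation of the new document and in accordance with the trained values of the third set of parameters, computing a gradient with respect to the vector representation of an error function that measures an error between the respective word scores and a target set of word scores that identifies the word that follows the last word in the sequence of words in the new document, and adjusting the vector representation for the new document based on the gradient using gradient descent while holding the trained values of the third set of parameters fixed.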

Claim 11 of the current application differs from claim 1 of the reference application in that claim 11 (instant) recites “A system comprising: one or more computers and one or more storage devices on which are stored instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising…wherein the new document includes” whereas claim 1 (reference) recites “A method comprising…extracting, from the new document”. The differences between the recitations are minor and do not distinguish the overall appearance of one over the other; the reference's “new document” 
Claim 16 of the current application differs from claim 1 of the reference application in that claim 16 (instant) recites “performing a respective iteration of gradient descent… to adjust the vector representation… from a previous iteration of gradient descent” whereas claim 1 (reference) recites “iteratively providing each of the plurality of sequences of words to the trained neural network system to determine the vector representation for the new document using gradient descent”. The differences between the recitations are minor and do not distinguish the overall appearance of one over the other; the reference's iteratively providing each sequence to determine the vector representation using gradient descent reads on the instant's performing an iteration of gradient descent to adjust the vector representation.
Claim 17 of the current application differs from claims 4 and 1 of the reference application in that claim 17 (instant) recites  “wherein the performing the respective iteration of gradient descent for each of the plurality of sequences comprises…the previous iteration…using the word scores and the word that follows the sequence….the previous iteration using the gradient” whereas Claim 4 (reference) recites “wherein, for each of the plurality of sequences the trained neural network system is configured to:… the previous sequence of words to generate a combined representation;” and Claim 1 (reference) recites “the respective word scores and a target set of word scores that identifies the word that follows the last word in the sequence of words in the new 
Claim 20 of the instant application differs from claim 1 of the reference application in that claim 20 (instant) recites “One or more computer-readable storage media encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform operations comprising…wherein the new document includes” whereas claim 1 (reference) recites “A method comprising…extracting, from the new document”. The differences between the recitations are minor and do not distinguish the overall appearance of one over the other; the reference's “new document” reads on the instant's “new document”. It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to implement the method of claim 1 of the reference application as instructions encoded on computer-readable storage media, with the trained neural network of the method run on a computer having generic computer components. 
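Dependent claims 13-14 and 16-19 of the instant application are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-6 of U.S. Patent No. US 10366327B2 (reference application) for the same rationale as discussed with respect to instant application claim 11. 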
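For orientation only, the inference step recited in claim 1 of the reference patent (adjusting the vector representation of the new document by gradient descent while the trained parameters stay fixed) can be sketched as follows. This is a minimal illustration and not the applicant's implementation: the function names, the averaging combiner (compare reference claim 6), the softmax classifier, the cross-entropy error function, the learning rate and the zero initialization are all assumptions made for the sketch.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def word_scores(doc_vec, context, word_emb, clf_w, clf_b):
    # Assumed combiner: average the document vector with the context word vectors.
    n = len(context) + 1
    dim = len(doc_vec)
    combined = [(doc_vec[k] + sum(word_emb[w][k] for w in context)) / n
                for k in range(dim)]
    # Assumed classifier: one linear row per vocabulary word, then softmax.
    logits = [sum(row[k] * combined[k] for k in range(dim)) + b
              for row, b in zip(clf_w, clf_b)]
    return softmax(logits)

def avg_loss(doc_vec, windows, word_emb, clf_w, clf_b):
    # Mean cross-entropy between the word scores and the actual next word.
    total = 0.0
    for context, target in windows:
        total -= math.log(word_scores(doc_vec, context, word_emb, clf_w, clf_b)[target])
    return total / len(windows)

def infer_doc_vector(windows, word_emb, clf_w, clf_b, dim, lr=0.1, epochs=20):
    # windows: list of (context_word_ids, next_word_id) pairs from the new document.
    # word_emb, clf_w and clf_b are the trained parameters; they are only read here.
    doc_vec = [0.0] * dim  # hypothetical initialization
    for _ in range(epochs):
        for context, target in windows:
            scores = word_scores(doc_vec, context, word_emb, clf_w, clf_b)
            # Gradient of cross-entropy with respect to the logits: scores - one_hot(target).
            err = [p - (1.0 if v == target else 0.0) for v, p in enumerate(scores)]
            n = len(context) + 1
            # Chain rule back to the document vector through the averaging combiner.
            grad = [sum(clf_w[v][k] * err[v] for v in range(len(clf_w))) / n
                    for k in range(dim)]
            # One gradient-descent iteration per sequence; only doc_vec changes.
            doc_vec = [d - lr * g for d, g in zip(doc_vec, grad)]
    return doc_vec
```

In this sketch each pass over a window corresponds to one iteration that adjusts the document vector from the previous iteration, and the word-embedding and classifier parameters are never written to, which mirrors the limitation of holding the trained values fixed.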
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. 
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-8, 10-18 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Zweig et al. (US 2014/0229158A1) in view of Collobert et al. (“Natural Language Processing (Almost) from Scratch”). 
Regarding Claim 1: 
Zweig et al. teach A method comprising: obtaining a new document (Pg. 6, Para [0084] “In block 606, the CIPM 114 receives a new document to analyze using the LDA technique” teaches that the CIPM obtains a new document), wherein the new document includes a plurality of sequences of words (Pg. 5, Para [0078] “the current document corresponds to a sliding window of words in a sequence of words (e.g., in FIG. 2, the block 212 of words w-w,)” teaches that the current document (corresponding to the new document) includes sequences of words), and, for each sequence of words, a word that follows a last word in the sequence of words in the new document (Pg. 3, Para [0046] “the sliding block 212 includes words w2-w7. Here, the block 212 includes the word w7 that is used to encode the input vector; but in other implementations, the block 212 can include just the words which precede the word w7, excluding the word w7.” teaches words w2-w7 (corresponding to the sequence of words) which precede the word w7 (the last word) in the document); and determining a vector representation for the new document using a trained neural network system (Pg. 6, Para [0084] “In block 608, the CIPM 114 computes a feature vector f(t) associated with the new document, which expresses a distribution of topics associated with the new document” teaches a vector representation for the new document obtained by computation (corresponding to training)), 
and has been trained to receive an input document and a sequence of words from the input document (Pg. 7, Para [0093] “the system 102 generates an input vector that represents a word or words in the sequence of words…. the system 102 can rely on any of the external sources described in Section A to obtain the context information. In block 908, the system 102 uses the neural network 104 to provide an output vector, based on the input vector and the feature vector” teaches that the system receives an input vector (corresponding to the input document) and a word of the sequence from the input vector; and Pg. 5, Para [0067] “The training system 108 of FIG. 1 may train the neural network 402 of FIG. 4 using stochastic gradient descent and back-propagation of errors, based on the training examples in the training corpus. Each example constitutes a case in which a particular input vector w(t) and a particular feature vector f(t) are mapped to a particular output vector y(t)” teaches training the neural network to receive vectors using gradient descent).
Zweig et al. do not teach wherein the trained neural network system has been trained on a plurality of unlabeled documents… and to generate a respective word score for each word in a pre-determined set of words, wherein each of the respective word scores represents a predicted likelihood that the corresponding word follows a last word in the sequence in the input document. 
However, Collobert et al. teach wherein the trained neural network system has been trained on a plurality of unlabeled documents (Pg. 2511 Section 4.2 Ranking Criterion versus Entropy Criterion “We used these unlabeled data sets to train language models that compute scores describing the acceptability of a piece of text. These language models are again large neural networks using the window approach described in Section 3.3.1 and in Figure 1” teaches a neural network trained on unlabeled data sets (corresponding to the unlabeled documents)),
and to generate a respective word score for each word in a pre-determined set of words (Pg. 2511 Section 4.2 Ranking Criterion versus Entropy Criterion “We used these unlabeled data sets to train language models that compute scores describing the acceptability of a piece of text. These language models are again large neural networks using the window approach described in Section 3.3.1 and in Figure 1” and Pg. 2512 Section 4.2 Ranking Criterion versus Entropy Criterion “We seek a network that computes a higher score when given a legal phrase than when given an incorrect phrase” teach computing a score for a legal phrase (corresponding to the set of words)),
wherein each of the respective word scores represents a predicted likelihood that the corresponding word follows a last word in the sequence in the input document (Pg. 2506, Section 3.4.2 Sentence-Level Log-Likelihood “given the predictions of all tags by our network for all words in a sentence, and given a score for going from one tag to another tag, we want to encourage valid paths of tags during training, while discouraging all other paths” teaches that the score represents a prediction based on the previous words in the document).
Zweig et al. and Collobert et al. are analogous art because both are directed to using a neural network trained on documents.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate, wherein the trained neural network system has been trained on a plurality of unlabeled documents… and to generate a respective word score for each word in a pre-determined set of words; wherein each of the respective word scores represents a predicted likelihood that the corresponding word follows a last word in the sequence in the input document as taught by Collobert et al. to the disclosed invention of Zweig et al.
One of ordinary skill in the arts would have been motivated to make this modification because of the following, “Table 8 clearly shows that this simple initialization significantly boosts the generalization performance of the supervised networks for each task. It is worth mentioning the larger language model led to even better performance. This suggests that we could still take advantage of even bigger unlabeled data sets.” (Collobert, Pg. 2514 Section 4.5 Semi-supervised Benchmark Results).  
Regarding Claim 2: 
Zweig et al. in view of Collobert et al.  teach The method of claim 1, 
Zweig et al. further teach providing the vector representation of the new document as an input to a machine learning system configured to process the vector representation (Pg. 5, Para [0067] “The training system 108 of FIG. 1 may train the neural network 402 of FIG. 4 using stochastic gradient descent and back-propagation of errors, based on the training examples in the training corpus. Each example constitutes a case in which a particular input vector w(t) and a particular feature vector f(t) are mapped to a particular output vector y(t)” teaches that training by gradient descent (corresponding to the machine learning system) processes the vector representation of the input vector (corresponding to the document)).
Regarding Claim 3: 
Zweig et al. in view of Collobert et al.  teach The method of claim 1, 
Zweig et al. further teach a combining layer configured to combine the vector representations into a combined representation, and a classifier layer configured to generate the word scores using the combined representation (Pg. 4-5, Para [0065] “The hidden layer 406 generates its output s(t) based on the input vector w(t) as modified by a first matrix U, the feature vector f(t) as modified by a second input matrix F….. The output layer 408 generates its output vector y(t) based on the output s(t) of the hidden layer 406 as modified by a fourth matrix V, and based on the feature vector f(t) as modified by a fifth matrix G” and Figure 4 teach a hidden layer (corresponding to the combining layer) whose output, derived from the input vector, is modified by matrix V (corresponding to combining the vector representations), and an output layer (corresponding to the classifier layer) that generates an output vector (corresponding to the word scores)).
Collobert et al. further teach wherein the trained neural network system comprises an embedding layer configured to map the input document and each word in the sequence of words from the input document to respective vector representations (Pg. 2513 Section 4.3 Training Language Models “Language model LM1 has a window size $d_{win} = 11$ and a hidden layer with $n^1_{hu} = 100$ units. The embedding layers were dimensioned like those of the supervised networks (Table 5). Model LM1 was trained on our first English corpus (Wikipedia) using successive dictionaries composed of the 5000, 10,000, 30,000, 50,000 and finally 100,000 most common WSJ words. The total training time was about four weeks” and Pg. 2501 Section 3.2 Transforming Words into Feature Vectors “a relevant representation of each word is then given by the corresponding lookup table feature vector” and Figure 2 teach using an embedding layer to map an English corpus (input ). 
Zweig et al. and Collobert et al. are analogous art because both are directed to using a neural network trained on documents.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate, wherein the trained neural network system comprises an embedding layer configured to map the input document and each word in the sequence of words from the input document to respective vector representations as taught by Collobert et al. to the disclosed invention of Zweig et al.
One of ordinary skill in the arts would have been motivated to make this modification because of the following, “Table 8 clearly shows that this simple initialization significantly boosts the generalization performance of the supervised networks for each task. It is worth mentioning the larger language model led to even better performance. This suggests that we could still take advantage of even bigger unlabeled data sets.” (Collobert, Pg. 2514 Section 4.5 Semi-supervised Benchmark Results).  
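The layer structure quoted above from Zweig et al., Para [0065], can be illustrated with a short sketch. It is a hedged reading of the quoted passage only: the portion of the paragraph elided in the quotation is left out, and the function names, the sigmoid and softmax activations, and the matrix shapes are assumptions rather than details taken from the reference.

```python
import math

def matvec(matrix, vector):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(w_t, f_t, U, F, V, G):
    # Hidden layer: input vector w(t) modified by U, feature vector f(t) modified by F.
    # The term elided in the quoted passage is intentionally not modeled here.
    hidden_in = [a + b for a, b in zip(matvec(U, w_t), matvec(F, f_t))]
    s_t = [1.0 / (1.0 + math.exp(-h)) for h in hidden_in]  # assumed sigmoid
    # Output layer: hidden output s(t) modified by V, feature vector f(t) modified by G.
    out_in = [a + b for a, b in zip(matvec(V, s_t), matvec(G, f_t))]
    y_t = softmax(out_in)  # assumed softmax, giving one score per vocabulary word
    return s_t, y_t
```

Here `w_t` plays the role of the word input, `f_t` the document-level feature vector, and `y_t` the word scores that the rejection maps to the classifier layer's output.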
Regarding Claim 4: 
Zweig et al. in view of Collobert et al. teach The method of claim 3,
Zweig et al. further teach and wherein the classifier layer generates the word scores from the combined representation in accordance with a second set of parameters (Pg. 4 Para [0065] “The output layer 408 generates its output vector y(t) based on the output s(t) of the hidden layer 406 as modified by a fourth matrix V, and based on the feature vector f(t) as modified by a fifth matrix G.” teach 
Collobert et al. further teach wherein the embedding layer maps the words in the sequence of words to vector representations in accordance with a first set of parameters (Pg. 2513 Section 4.3 Training Language Models “Language model LM1 has a window size $d_{win} = 11$ and a hidden layer with $n^1_{hu} = 100$ units. The embedding layers were dimensioned like those of the supervised networks (Table 5). Model LM1 was trained on our first English corpus (Wikipedia) using successive dictionaries composed of the 5000, 10,000, 30,000, 50,000 and finally 100,000 most common WSJ words. The total training time was about four weeks” and Pg. 2501 Section 3.2 Transforming Words into Feature Vectors “a relevant representation of each word is then given by the corresponding lookup table feature vector…..for each word $w \in \mathcal{D}$, an internal $d_{wrd}$-dimensional feature vector representation is given by the lookup table layer $LT_W(\cdot)$: $LT_W(w) = \langle W \rangle^1_w$, where $W \in \mathbb{R}^{d_{wrd} \times |\mathcal{D}|}$ is a matrix of parameters to be learned,” and Figure 2 teach using an embedding layer to map the words of an English corpus (corresponding to the words in the sequence) to feature vectors (corresponding to the vector representations) in accordance with a matrix of parameters (corresponding to the first set of parameters)).
Zweig et al. and Collobert et al. are analogous art because both are directed to using a neural network trained on documents.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate, wherein the embedding layer maps the words in the sequence of words to vector representations in accordance with a first set of parameters as taught by Collobert et al. to the disclosed invention of Zweig et al.
One of ordinary skill in the arts would have been motivated to make this modification because of the following, “Table 8 clearly shows that this simple initialization significantly boosts the generalization performance of the supervised networks for each task. It is worth mentioning the larger language model led to even better performance. This suggests that we could still take advantage of even bigger unlabeled data sets.” (Collobert, Pg. 2514 Section 4.5 Semi-supervised Benchmark Results).  
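The lookup-table layer quoted above from Collobert et al., Section 3.2, amounts to selecting one column of the parameter matrix W for each word. A minimal sketch, assuming W is stored as a list of rows (one row per feature dimension, one column per dictionary word) and that words have already been mapped to integer indices; the function names and the toy values are hypothetical:

```python
def lookup(W, word_index):
    # Return column `word_index` of W: the feature vector for that dictionary word.
    return [row[word_index] for row in W]

def embed_sequence(W, word_indices):
    # Map each word in a sequence to its feature vector, in order.
    return [lookup(W, idx) for idx in word_indices]

# Toy parameter matrix: 2 feature dimensions, 3 dictionary words.
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
vectors = embed_sequence(W, [2, 0])
```

Because each word's vector is simply a column of W, the entries of W are the parameters in accordance with which the mapping is performed, which is the sense in which the rejection reads the lookup table on the first set of parameters.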
Regarding Claim 5: 
Zweig et al. in view of Collobert et al.  teach The method of claim 4, 
Collobert et al. further teach wherein the values of the first set of parameters and the values of the second set of parameters are fixed from training the neural network system to generate the word scores (Pg. 5, Para [0075] “word to tag, we consider a fixed size $k_{sz}$ (a hyper-parameter) window of words around this word. Each word in the window is first passed through the lookup table layer (1) or (2)…. where $W^l \in \mathbb{R}^{n^l_{hu} \times n^{l-1}_{hu}}$ and $b^l \in \mathbb{R}^{n^l_{hu}}$ are the parameters to be trained” and Pg. 2517 Section 5.2 Multi-Task Benchmark Results “In both cases, all models share the lookup table parameters (2). The parameters of the first linear layers (4) were shared in the window approach case (see Figure 5), and the first the convolution layer parameters (6) were shared in the sentence approach networks… best results were obtained by enlarging the first hidden layer size to $n^1_{hu} = 500$ (chosen by validation) in order to account for its shared responsibilities” teach that the parameters of the lookup table (corresponding to the first set of parameters) and the parameters of the first linear layers are fixed and that the trained neural network provides the output).
Zweig et al. and Collobert et al. are analogous art because both are directed to using a neural network trained on documents.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate, wherein the values of the first set of parameters and the values of the second set of parameters are fixed from training the neural network system to generate the word scores as taught by Collobert et al. to the disclosed invention of Zweig et al.
One of ordinary skill in the arts would have been motivated to make this modification because of the following, “Table 8 clearly shows that this simple initialization significantly boosts the generalization performance of the supervised networks for each task. It is worth mentioning the larger language model led to even better performance. This suggests that we could still take advantage of even bigger unlabeled data sets.” (Collobert, Pg. 2514 Section 4.5 Semi-supervised Benchmark Results).  
Regarding Claim 6: 
Zweig et al. in view of Collobert et al.  teach The method of claim 3, 
Zweig et al. further teach wherein determining the vector representation for the new document using the trained neural network system comprises performing a respective iteration of gradient descent for each of the plurality of sequences of words to adjust the vector representation of the new document from a previous iteration of gradient descent (Pg. 5, Para [0067] “The training system 108 of FIG. 1 may train the neural network 402 of FIG. 4 using stochastic gradient descent and back-propagation of errors, based on the training examples in the training corpus. Each example constitutes a case in which a particular input vector w(t) and a particular feature vector f(t) are mapped to a particular output vector y(t)” teach training neural network to receive vector using gradient descent).
Regarding Claim 7: 
Zweig et al. in view of Collobert et al.  teach The method of claim 6, 
Zweig et al. further teach wherein the performing the respective iteration of gradient descent for each of the plurality of sequences comprises (Pg. 5, Para [0067] “The training system 108 of FIG. 1 may train the neural network 402 of FIG. 4 using stochastic gradient descent and back-propagation of errors, based on the training examples in the training corpus. Each example constitutes a case in which a particular input vector w(t) and a particular feature vector f(t) are mapped to a particular output vector y(t)” and Pg. 6, Para [0082] “the generation module 508 can optionally compute a next feature vector in an incremental fashion based on the previous feature vector, together with a decay factor γ. One manner of performing this incremental update is expressed” teach computing a feature vector in an incremental manner (iteration) when training using gradient descent):
combining the vector representation for the words in the sequence with the vector representation for the new document from the previous iteration to generate a combined representation (Pg. 4-5, Para [0065] “The hidden layer 406 generates its output s(t) based on the input vector w(t) as modified by a first matrix U, the feature vector f(t) as modified by a second input matrix F” and Pg. 6, Para [0088] “the generation module 712 can compute a feature vector in an incremental manner based on the previously-computed feature vector at time t−1, together with a decay factor γ” and Figure 4 teach matrix V (combining the vector representation), which is incremented based on the previously computed vector);
generating word scores from the combined representation (Pg. 4-5, Para [0065] “The output layer 408 generates its output vector y(t) based on the output s(t) of the hidden layer 406 as modified by a fourth matrix V, and based on the feature vector f(t) as modified by a fifth matrix G” and Figure 4 teach generating the output vector (word scores) from matrix V (combined representation));
computing a gradient using the word scores and the word that follows the sequence in the new document (Pg. 5, Para [0067] “The training system 108 of FIG. 1 may train the neural network 402 of FIG. 4 using stochastic gradient descent and back-propagation of errors, based on the training examples in the training corpus. Each example constitutes a case in which a particular input vector w(t) and a particular feature vector f(t) are mapped to a particular output vector y(t)” teach training using gradient descent at the output); and
adjusting the vector representation for the new document from the previous iteration using the gradient (Pg. 6, Para [0082] “the generation module 508 can optionally compute a next feature vector in an incremental fashion based on the previous feature vector, together with a decay factor γ. One manner of performing this incremental update is expressed” teach updating incrementally based on the previous feature vector).
Collobert et al further teach mapping each of the words in the sequence to a vector representation using the embedding layer (Pg.2513 Section 4.3 Training Language Models “The embedding layers were dimensioned like those of the supervised networks (Table 5). Model LM1 was trained on our first English corpus (Wikipedia) using successive dictionaries composed of the 5000, 10,000, 30,000, 50,000 and finally 100,000 most common WSJ words. The total training time was about four weeks” teach using embedding layer to map each of the words). 
Zweig et al. and Collobert et al. are analogous art because both are directed to training neural networks on documents.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate, mapping each of the words in the sequence to a vector representation using the embedding layer as taught by Collobert et al. to the disclosed invention of Zweig et al.
One of ordinary skill in the arts would have been motivated to make this modification because of the following, “Table 8 clearly shows that this simple initialization significantly boosts the generalization performance of the supervised networks for each task. It is worth mentioning the larger language model led to even better performance. This suggests that we could still take advantage of even bigger unlabeled data sets.” (Collobert, Pg. 2514 Section 4.5 Semi-supervised Benchmark Results).  
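For illustration only, the sequence of steps recited in claim 7 (combine, generate word scores, compute a gradient, adjust the document vector) can be sketched in Python as follows. This is a minimal sketch under stated assumptions and is not taken from the claims, Zweig et al., or Collobert et al.: the function names `softmax`, `word_scores`, and `infer_document_vector`, the concatenation-based combination, the linear classifier layer, the cross-entropy loss, and the learning rate are all hypothetical choices made for the example. The word vectors and classifier weights are held fixed, and only the document vector is updated.

```python
import math


def softmax(scores):
    """Convert raw word scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_scores(doc_vec, context_ids, word_vecs, weights, bias):
    """Concatenate the document vector with the context word vectors,
    then apply an assumed linear classifier layer to get one score per word."""
    combined = list(doc_vec)
    for word_id in context_ids:
        combined.extend(word_vecs[word_id])
    return [
        sum(w * x for w, x in zip(row, combined)) + b
        for row, b in zip(weights, bias)
    ]


def infer_document_vector(windows, word_vecs, weights, bias, dim,
                          learning_rate=0.1, passes=20):
    """Run one gradient-descent step per (context, next word) pair.

    word_vecs, weights, and bias stay fixed; only doc_vec changes.
    """
    doc_vec = [0.0] * dim
    for _ in range(passes):
        for context_ids, next_id in windows:
            probs = softmax(
                word_scores(doc_vec, context_ids, word_vecs, weights, bias)
            )
            # Gradient of the cross-entropy loss with respect to the scores.
            d_scores = list(probs)
            d_scores[next_id] -= 1.0
            # Back-propagate to the document slice (the first `dim` inputs).
            grad_doc = [
                sum(d_scores[v] * weights[v][j] for v in range(len(weights)))
                for j in range(dim)
            ]
            # Adjust the document vector from the previous iteration.
            doc_vec = [d - learning_rate * g for d, g in zip(doc_vec, grad_doc)]
    return doc_vec
```

In this sketch each row of `weights` must have length `dim` plus the total length of the concatenated context word vectors, and each entry of `windows` pairs a list of context word indices with the index of the word that follows them.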
Regarding Claim 8: 
Zweig et al. in view of Collobert et al.  teach The method of claim 3,
Zweig et al. further teach wherein the combining layer is configured to concatenate the vector representations of the words in the sequence with the vector representation of the input document (Pg.4-5 Para [0065] “The hidden layer 406 generates its output s(t) based on the input vector w(t) as modified by a first matrix U, the feature vector f(t) as modified by a second input matrix F” teach hidden layer (combining layer) generate output based on the input vector (vector representation of input document) and feature vector (vector representation of word)).
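For illustration only, the difference between a combining layer that concatenates (claim 8) and one that averages (claim 9) can be sketched in Python as follows. The helper names `concatenate` and `average` are hypothetical and appear in neither the claims nor the cited references; the sketch assumes that the document vector and every word vector have the same length when averaging.

```python
def concatenate(doc_vec, word_vecs):
    """Place the document vector and each word vector end to end."""
    combined = list(doc_vec)
    for vec in word_vecs:
        combined.extend(vec)
    return combined


def average(doc_vec, word_vecs):
    """Average the document vector and the word vectors element by element.

    Assumes all vectors share the same length.
    """
    vectors = [doc_vec] + list(word_vecs)
    return [sum(column) / len(vectors) for column in zip(*vectors)]
```

Concatenation yields a combined representation whose length grows with the number of words in the sequence, whereas averaging keeps the length of a single vector.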
Regarding Claim 10:
Zweig et al. in view of Collobert et al.  teach The method of claim 1, 
Zweig et al. further teach wherein each of the plurality of sequences of words contains a fixed number of words (Pg. 5, Para [0078] “the current document corresponds to a sliding window of words in a sequence of words (e.g., in FIG. 2, the block 212 of words w2-w7)” teach that each sequence of words contains six words (a fixed number of words)).
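For illustration only, a sliding window that yields sequences with a fixed number of words, each paired with the word that follows it, can be sketched in Python as follows. The function name `sliding_windows` and the parameter `size` are hypothetical and are not drawn from the claims or from Zweig et al.

```python
def sliding_windows(words, size):
    """Return (sequence, next_word) pairs, where every sequence has
    exactly `size` words and next_word is the word that follows it."""
    return [
        (words[start:start + size], words[start + size])
        for start in range(len(words) - size)
    ]
```

A document shorter than or equal to `size` words produces no pairs under this sketch, because no sequence would have a following word.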
Regarding Claim 11:
Zweig et al. teach A system comprising: one or more computers and one or more storage devices on which are stored instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising (Pg. 9, Para [0115] “instructions and other information can be stored on any computer readable medium 2010” and Pg. 9, Para [0114] “The computing functionality 2000 can perform various operations identified above when the processing device(s) 2006 execute instructions that are maintained by memory (e.g., RAM 2002, ROM 2004, or elsewhere)” teach computers and storage that store and execute instructions):
obtaining a new document (Page 6, Para [0084] “In block 606, the CIPM 114 receives a new document to analyze using the LDA technique” teaches that the CIPM obtains a new document),
wherein the new document includes a plurality of sequences of words (Pg. 5, Para [0078] “the current document corresponds to a sliding window of words in a sequence of words (e.g., in FIG. 2, the block 212 of words w2-w7)” teaches that the current document (corresponding to the new document) includes sequences of words),
and, for each sequence of words, a word that follows a last word in the sequence of words in the new document (Pg. 3, Para [0046] “the sliding block 212 includes words w2-w7. Here, the block 212 includes the word w7 that is used to encode the input vector; but in other implementations, the block 212 can include just the words which precede the word w7, excluding the word w7.” teaches words w2-w7 (corresponding to the sequence of words) which precede the word w7 (last word) in the document); and
determining a vector representation for the new document using a trained neural network system (Pg. 6, Para [0084] “In block 608, the CIPM 114 computes a feature vector f(t) associated with the new document, which expresses a distribution of topics associated with the new document” teach a vector representation for the new document using computation (corresponding to training)),
and has been trained to receive an input document and a sequence of words from the input document (Pg. 7, Para [0093] “the system 102 generates an input vector that represents a word or words in the sequence of words…. the system 102 can rely on any of the external sources described in Section A to obtain the context information. In block 908, the system 102 uses the neural network 104 to provide an output vector, based on the input vector and the feature vector” teach system receive input vector (corresponding to document) and word of sequence from the input vector (corresponding to document))
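and wherein determining the vector representation for the new document using the trained neural network system comprises iteratively providing each of the plurality of sequences of words to the trained neural network system to determine the vector representation for the new document using gradient descent (Pg. 5, Para [0067] “The training system 108 of FIG. 1 may train the neural network 402 of FIG. 4 using stochastic gradient descent and back-propagation of errors, based on the training examples in the training corpus. Each example constitutes a case in which a particular input vector w(t) and a particular feature vector f(t) are mapped to a particular output vector y(t)” teach training neural network to receive vector using gradient descent).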
Zweig et al. do not teach wherein the trained neural network system has been trained on a plurality of unlabeled documents … and to generate a respective word score for each word in a pre-determined set of words, wherein each of the respective word scores represents a predicted likelihood that the corresponding word follows a last word in the sequence in the input document.
However, Collobert et al. teach wherein the trained neural network system has been trained on a plurality of unlabeled documents (Pg. 2511 Section  4.2 Ranking Criterion versus Entropy Criterion “We used these unlabeled data sets to train language models that compute scores describing the acceptability of a piece of text. These language models are again large neural networks using the window approach described in Section 3.3.1 and in Figure 1” teach neural network trained on unlabeled data sets (corresponding to unlabeled documents))
and to generate a respective word score for each word in a pre-determined set of words (Pg. 2511 Section 4.2 Ranking Criterion versus Entropy Criterion “We used these unlabeled data sets to train language models that compute scores describing the acceptability of a piece of text. These language models are again large neural networks using the window approach described in Section 3.3.1 and in Figure 1” and Pg. 2512 Section 4.2 Ranking Criterion versus Entropy Criterion “We seek a network that computes a higher score when given a legal phrase than when given an incorrect phrase” teach computing score for legal phrase (set of words)).
wherein each of the respective word scores represents a predicted likelihood that the corresponding word follows a last word in the sequence in the input document (Pg. 2506, Section 3.4.2 sentence-level log-likelihood “given the predictions of all tags by our network for all words in a sentence, and given a score for going from one tag to another tag, we want to encourage valid paths of tags during training, while discouraging all other paths” teach that the score represents a prediction based on the previous words in the document).
Zweig et al. and Collobert et al. are analogous art because both are directed to training neural networks on documents.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate, wherein the trained neural network system has been trained on a plurality of unlabeled documents … and to generate a respective word score for each word in a pre-determined set of words, wherein each of the respective word scores represents a predicted likelihood that the corresponding word follows a last word in the sequence in the input document as taught by Collobert et al. to the disclosed invention of Zweig et al.
One of ordinary skill in the arts would have been motivated to make this modification because of the following, “Table 8 clearly shows that this simple initialization significantly boosts the generalization performance of the supervised networks for each task. It is worth mentioning the larger language model led to even better performance. This suggests that we could still take advantage of even bigger unlabeled data sets.” (Collobert, Pg. 2514 Section 4.5 Semi-supervised Benchmark Results).  
Regarding Claim 12:
Zweig et al. in view of Collobert et al.  teach The system of claim 11,
Zweig et al. further teach further comprising providing the vector representation of the new document as an input to a machine learning system configured to process the vector representation (Pg. 5, Para [0067] “The training system 108 of FIG. 1 may train the neural network 402 of FIG. 4 using stochastic gradient descent and back-propagation of errors, based on the training examples in the training corpus. Each example constitutes a case in which a particular input vector w(t) and a particular feature vector f(t) are mapped to a particular output vector y(t)” teach that gradient descent (machine learning) processes the vector representation of the input vector (document) into an output vector).
Regarding Claim 13:
Zweig et al. in view of Collobert et al.  teach The system of claim 11, 
Zweig et al. further teach a combining layer configured to combine the vector representations into a combined representation, and a classifier layer configured to generate the word scores using the combined representation (Pg. 4-5, Para [0065] “The hidden layer 406 generates its output s(t) based on the input vector w(t) as modified by a first matrix U, the feature vector f(t) as modified by a second input matrix F….. The output layer 408 generates its output vector y(t) based on the output s(t) of the hidden layer 406 as modified by a fourth matrix V, and based on the feature vector f(t) as modified by a fifth matrix G” and Figure 4 teach a hidden layer (corresponding to the combining layer) that modifies the input vector to matrix V (corresponding to combining the vectors) and an output layer (corresponding to the classifier layer) that generates the output vector (word scores)).
Collobert et al. further teach wherein the trained neural network system comprises an embedding layer configured to map the input document and each word in the sequence of words from the input document to respective vector representations (Pg. 2513 Section 4.3 Training Language Models “Language model LM1 has a window size dwin = 11 and a hidden layer with n1hu = 100 units. The embedding layers were dimensioned like those of the supervised networks (Table 5). Model LM1 was trained on our first English corpus (Wikipedia) using successive dictionaries composed of the 5000, 10,000, 30,000, 50,000 and finally 100,000 most common WSJ words. The total training time was about four weeks” and Page 2501 Section 3.2 Transforming Words into Feature Vectors “a relevant representation of each word is then given by the corresponding lookup table feature vector” and Figure 2 teach using an embedding layer to map an English corpus (input document)).
Zweig et al. and Collobert et al. are analogous art because both are directed to training neural networks on documents.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate, wherein the trained neural network system comprises an embedding layer configured to map the input document and each word in the sequence of words from the input document to respective vector representations as taught by Collobert et al. to the disclosed invention of Zweig et al.
One of ordinary skill in the arts would have been motivated to make this modification because of the following, “Table 8 clearly shows that this simple initialization significantly boosts the generalization performance of the supervised networks for each task. It is worth mentioning the larger language model led to even better performance. This suggests that we could still take advantage of even bigger unlabeled data sets.” (Collobert, Pg. 2514 Section 4.5 Semi-supervised Benchmark Results). 
Regarding Claim 14:
Zweig et al. in view of Collobert et al. teach The system of claim 13,
Zweig et al. further teach and wherein the classifier layer generates the word scores from the combined representation in accordance with a second set of parameters (Pg. 4, Para [0065] “The output layer 408 generates its output vector y(t) based on the output s(t) of the hidden layer 406 as modified by a fourth matrix V, and based on the feature vector f(t) as modified by a fifth matrix G.” teach that the output layer (corresponding to the classifier layer) generates the output vector (word scores) as modified by matrices V and G).
Collobert et al. further teach wherein the embedding layer maps the words in the sequence of words to vector representations in accordance with a first set of parameters (Pg.2513 Section 4.3 Training Language Models “Language model LM1 has a window size dwin = 11 and a hidden layer with n1hu = 100 units. The embedding layers were dimensioned like those of the supervised networks (Table 5). Model LM1 was trained on our first English corpus (Wikipedia) using successive dictionaries composed of the 5000, 10,000, 30,000, 50,000 and finally 100,000 most common WSJ words. The total training time was about four weeks” and Page 2501 Section 3.2 Transforming Words into Feature Vectors “a relevant representation of each word is then given by the corresponding lookup table feature vector…..for each word w ∈ D, an internal dwrd-dimensional feature vector representation is given by the lookup table layer LTW (·): LTW (w) = hWi1w, where W ∈ R dwrd×|D| is a matrix of parameters to be learned,” and Figure 2 teach using embedding layer to maps English corpus (corresponding to words) into words wherein words representation given by feature vector (vector representation) in accordance with parameter).
Zweig et al. and Collobert et al. are analogous art because both are directed to training neural networks on documents.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate, wherein the embedding layer maps the words in the sequence of words to vector representations in accordance with a first set of parameters as taught by Collobert et al. to the disclosed invention of Zweig et al.
One of ordinary skill in the arts would have been motivated to make this modification because of the following, “Table 8 clearly shows that this simple initialization significantly boosts the generalization performance of the supervised networks for each task. It is worth mentioning the larger language model led to even better performance. This suggests that we could still take advantage of even bigger unlabeled data sets.” (Collobert, Pg. 2514 Section 4.5 Semi-supervised Benchmark Results).  
Regarding Claim 15:
Zweig et al. in view of Collobert et al. teach The system of claim 14,
Collobert et al. further teach wherein the values of the first set of parameters and the values of the second set of parameters are fixed from training the neural network system to generate the word scores (Pg. 5, Para [0075] “word to tag, we consider a fixed size ksz (a hyper-parameter) window of words around this word. Each word in the window is first passed through the lookup table layer (1) or (2)…. where Wl ∈ Rnlhu×nl−1hu and bl ∈ Rnlhu are the parameters to be trained” and Page 2517 Section 5.2 Multi-Task Benchmark Results “In both cases,
all models share the lookup table parameters (2). The parameters of the first linear layers (4) were shared in the window approach case (see Figure 5), and the first the convolution layer parameters (6) were shared in the sentence approach networks… best results were obtained by enlarging the first hidden layer size to n1hu = 500 (chosen by validation) in order to account for its shared responsibilities” teach that the parameters of the lookup table (first set of parameters) and the parameters of the first linear layers are fixed and that the neural network is trained to provide output).
Zweig et al. and Collobert et al. are analogous art because both are directed to training neural networks on documents.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate, wherein the values of the first set of parameters and the values of the second set of parameters are fixed from training the neural network system to generate the word scores as taught by Collobert et al. to the disclosed invention of Zweig et al.
One of ordinary skill in the arts would have been motivated to make this modification because of the following, “Table 8 clearly shows that this simple initialization significantly boosts the generalization performance of the supervised networks for each task. It is worth mentioning the larger language model led to even better performance. This suggests that we could still take advantage of even bigger unlabeled data sets.” (Collobert, Pg. 2514 Section 4.5 Semi-supervised Benchmark Results).  
Regarding Claim 16:
Zweig et al. in view of Collobert et al.  teach The system of claim 13, 
Zweig et al. further teach wherein determining the vector representation for the new document using the trained neural network system comprises performing a respective iteration of gradient descent for each of the plurality of sequences of words to adjust the vector representation of the new document from a previous iteration of gradient descent (Pg. 5, Para [0067] “The training system 108 of FIG. 1 may train the neural network 402 of FIG. 4 using stochastic gradient descent and back-propagation of errors, based on the training examples in the training corpus. Each example constitutes a case in which a particular input vector w(t) and a particular feature vector f(t) are mapped to a particular output vector y(t)” teach training neural network to receive vector using gradient descent).
Regarding Claim 17:
Zweig et al. in view of Collobert et al.  teach The system of claim 16, 
Zweig et al. further teach wherein the performing the respective iteration of gradient descent for each of the plurality of sequences comprises (Pg. 5, Para [0067] “The training system 108 of FIG. 1 may train the neural network 402 of FIG. 4 using stochastic gradient descent and back-propagation of errors, based on the training examples in the training corpus. Each example constitutes a case in which a particular input vector w(t) and a particular feature vector f(t) are mapped to a particular output vector y(t)” and Pg. 6, Para [0082] “the generation module 508 can optionally compute a next feature vector in an incremental fashion based on the previous feature vector, together with a decay factor γ. One manner of performing this incremental update is expressed” teach computing a feature vector in an incremental manner (iteration) when training using gradient descent):
combining the vector representation for the words in the sequence with the vector representation for the new document from the previous iteration to generate a combined representation (Pg. 4-5, Para [0065] “The hidden layer 406 generates its output s(t) based on the input vector w(t) as modified by a first matrix U, the feature vector f(t) as modified by a second input matrix F” and Pg. 6, Para [0088] “the generation module 712 can compute a feature vector in an incremental manner based on the previously-computed feature vector at time t−1, together with a decay factor γ” and Figure 4 teach matrix V (combining the vector representation), which is incremented based on the previously computed vector);
generating word scores from the combined representation (Pg. 4-5, Para [0065] “The output layer 408 generates its output vector y(t) based on the output s(t) of the hidden layer 406 as modified by a fourth matrix V, and based on the feature vector f(t) as modified by a fifth matrix G” and Figure 4 teach generating the output vector (word scores) from matrix V (combined representation));
computing a gradient using the word scores and the word that follows the sequence in the new document (Pg. 5, Para [0067] “The training system 108 of FIG. 1 may train the neural network 402 of FIG. 4 using stochastic gradient descent and back-propagation of errors, based on the training examples in the training corpus. Each example constitutes a case in which a particular input vector w(t) and a particular feature vector f(t) are mapped to a particular output vector y(t)” teach training using gradient descent at the output); and
adjusting the vector representation for the new document from the previous iteration using the gradient (Pg. 6, Para [0082] “the generation module 508 can optionally compute a next feature vector in an incremental fashion based on the previous feature vector, together with a decay factor γ. One manner of performing this incremental update is expressed” teach updating incrementally based on the previous feature vector).
Collobert et al further teach mapping each of the words in the sequence to a vector representation using the embedding layer (Pg.2513 Section 4.3 Training Language Models “The embedding layers were dimensioned like those of the supervised networks (Table 5). Model LM1 was trained on our first English corpus (Wikipedia) using successive dictionaries composed of the 5000, 10,000, 30,000, 50,000 and finally 100,000 most common WSJ words. The total training time was about four weeks” teach using embedding layer to map each of the words).
Zweig et al. and Collobert et al. are analogous art because both are directed to training neural networks on documents.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate, mapping each of the words in the sequence to a vector representation using the embedding layer as taught by Collobert et al. to the disclosed invention of Zweig et al.
One of ordinary skill in the arts would have been motivated to make this modification because of the following, “Table 8 clearly shows that this simple initialization significantly boosts the generalization performance of the supervised networks for each task. It is worth mentioning the larger language model led to even better performance. This suggests that we could still take advantage of even bigger unlabeled data sets.” (Collobert, Pg. 2514 Section 4.5 Semi-supervised Benchmark Results).  
Regarding Claim 18:
Zweig et al. in view of Collobert et al.  teach The system of claim 13, 
Zweig et al. further teach wherein the combining layer is configured to concatenate the vector representations of the words in the sequence with the vector representation of the input document (Pg.4-5 Para [0065] “The hidden layer 406 generates its output s(t) based on the input vector w(t) as modified by a first matrix U, the feature vector f(t) as modified by a second input matrix F” teach hidden layer (combining layer) generate output based on the input vector (vector representation of input document) and feature vector (vector representation of word)).
Regarding Claim 20:
Zweig et al. teach One or more computer-readable storage media encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform operations comprising (Pg. 9, Para [0115] “instructions and other information can be stored on any computer readable medium 2010…. The term computer readable medium also encompasses propagated signals” teach computer readable storage media):
obtaining a new document (Page 6, Para [0084] “In block 606, the CIPM 114 receives a new document to analyze using the LDA technique” teaches that the CIPM obtains a new document),
wherein the new document includes a plurality of sequences of words (Pg. 5, Para [0078] “the current document corresponds to a sliding window of words in a sequence of words (e.g., in FIG. 2, the block 212 of words w2-w7)” teaches that the current document (corresponding to the new document) includes sequences of words),
and, for each sequence of words, a word that follows a last word in the sequence of words in the new document (Pg. 3, Para [0046] “the sliding block 212 includes words w2-w7. Here, the block 212 includes the word w7 that is used to encode the input vector; but in other implementations, the block 212 can include just the words which precede the word w7, excluding the word w7.” teaches words w2-w7 (corresponding to the sequence of words) which precede the word w7 (last word) in the document); and
determining a vector representation for the new document using a trained neural network system (Pg. 6, Para [0084] “In block 608, the CIPM 114 computes a feature vector f(t) associated with the new document, which expresses a distribution of topics associated with the new document” teach a vector representation for the new document using computation (corresponding to training)),
and has been trained to receive an input document and a sequence of words from the input document (Pg. 7, Para [0093] “the system 102 generates an input vector that represents a word or words in the sequence of words…. the system 102 can rely on any of the external sources described in Section A to obtain the context information. In block 908, the system 102 uses the neural network 104 to provide an output vector, based on the input vector and the feature vector” teach system receive input vector (corresponding to document) and word of sequence from the input vector (corresponding to document))
and wherein determining the vector representation for the new document using the trained neural network system comprises iteratively providing each of the plurality of sequences of words to the trained neural network system to determine the vector representation for the new document using gradient descent (Pg. 5, Para [0067] “The training system 108 of FIG. 1 may train the neural network 402 of FIG. 4 using stochastic gradient descent and back-propagation of errors, based on the training examples in the training corpus. Each example constitutes a case in which a particular input vector w(t) and a particular feature vector f(t) are mapped to a particular output vector y(t)” teach training neural network to receive vector using gradient descent).
Zweig et al. do not teach wherein the trained neural network system has been trained on a plurality of unlabeled documents … and to generate a respective word score for each word in a pre-determined set of words, wherein each of the respective word scores represents a predicted likelihood that the corresponding word follows a last word in the sequence in the input document.
However, Collobert et al. teach wherein the trained neural network system has been trained on a plurality of unlabeled documents (Pg. 2511 Section  4.2 Ranking Criterion versus Entropy Criterion “We used these unlabeled data sets to train language models that compute scores describing the acceptability of a piece of text. These language models are again large neural networks using the window approach described in Section 3.3.1 and in Figure 1” teach neural network trained on unlabeled data sets (corresponding to unlabeled documents))
and to generate a respective word score for each word in a pre-determined set of words(Pg. 2511 Section 4.2 Ranking Criterion versus Entropy Criterion “We used these unlabeled data sets to train language models that compute scores describing the acceptability of a piece of text. These language models are again large neural networks using the window approach described in Section 3.3.1 and in Figure 1” and Pg. 2512 Section 4.2 Ranking Criterion versus Entropy Criterion “We seek a network that computes a higher score when given a legal phrase than when given an incorrect phrase” teach computing score for legal phrase (set of words)).
wherein each of the respective word scores represents a predicted likelihood that the corresponding word follows a last word in the sequence in the input document (Pg. 2506, Section 3.4.2 sentence-level log-likelihood “given the predictions of all tags by our network for all words in a sentence, and given a score for going from one tag to another tag, we want to encourage valid paths of tags during training, while discouraging all other paths” teach that the score represents a prediction based on the previous words in the document).
Zweig et al. and Collobert et al. are analogous art because both are directed to training neural networks on documents.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate, wherein the trained neural network system has been trained on a plurality of unlabeled documents……and to generate a respective word score for each word in a pre-determined set of words, wherein each of the respective word scores represents a predicted likelihood that the corresponding word follows a last word in the sequence in the input document as taught by Collobert et al. to the disclosed invention of Zweig et al.
One of ordinary skill in the art would have been motivated to make this modification for the following reason: “Table 8 clearly shows that this simple initialization significantly boosts the generalization performance of the supervised networks for each task. It is worth mentioning the larger language model led to even better performance. This suggests that we could still take advantage of even bigger unlabeled data sets.” (Collobert, Pg. 2514, Section 4.5 Semi-supervised Benchmark Results).
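For illustration only, the claimed limitation of generating a respective word score for each word in a pre-determined set of words, where each score represents a predicted likelihood, can be sketched as below. This is a minimal sketch under stated assumptions and is not the implementation of the applicant, Zweig et al., or Collobert et al.: the use of a softmax normalization, the function name word_scores, and the three-word vocabulary with its raw values are all hypothetical.

```python
import math

def word_scores(logits):
    """Map raw per-word scores to probabilities with a numerically stable softmax.

    Assumption: softmax is used here only as one common way to turn scores
    into predicted likelihoods; the cited references may normalize differently.
    """
    peak = max(logits.values())  # subtract the maximum to avoid overflow in exp
    exps = {word: math.exp(value - peak) for word, value in logits.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

# Hypothetical raw scores for a toy pre-determined set of words.
raw = {"cat": 2.0, "dog": 1.0, "car": -1.0}
scores = word_scores(raw)

# The word with the highest score is the most likely next word.
best = max(scores, key=scores.get)
```

In this sketch the scores sum to one, so each value can be read as the predicted likelihood that the corresponding word follows the last word in the sequence.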
Claims 9 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Zweig et al. (US 2014/0229158A1) in view of Collobert et al. (“Natural Language Processing (Almost) from Scratch”) and further in view of Mikolov et al. (“Efficient Estimation of Word Representations in Vector Space”).
Regarding Claim 9: 
Zweig et al. in view of Collobert et al. teach the method of claim 3.
Zweig et al. in view of Collobert et al. do not teach wherein the combining layer is configured to average the vector representations of the words in the sequence and the vector representation of the input document.
However, Mikolov et al. teach wherein the combining layer is configured to average the vector representations of the words in the sequence and the vector representation of the input document (Pg. 4, Section 3.1 Continuous Bag-of-Words Model, "where the non-linear hidden layer is removed and the projection layer is shared for all words (not just the projection matrix); thus, all words get projected into the same position (their vectors are averaged)" and Figure 1 teach that the projection layer (corresponding to the combining layer) averages the vectors of the words).
Zweig et al., Collobert et al., and Mikolov et al. are analogous art because they are directed to providing vector representations of the words of a document in a neural network.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the feature wherein the combining layer is configured to average the vector representations of the words in the sequence and the vector representation of the input document, as taught by Mikolov et al., into the disclosed invention of Zweig et al. in view of Collobert et al.
One of ordinary skill in the art would have been motivated to make this modification for the following reason: “The CBOW architecture works better than the ...” (Mikolov, Pg. 7, Section 4.3 Comparison of Model Architectures).
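For illustration only, the averaging operation recited in claims 9 and 19 can be sketched as below. This is a minimal sketch under stated assumptions and is not the implementation of the applicant or of Mikolov et al.: the function name average_vectors, the two-dimensional toy vectors, and the choice to weight the document vector equally with each word vector are hypothetical.

```python
def average_vectors(word_vectors, document_vector):
    """Return the element-wise mean of the word vectors and the document vector.

    Assumption: the document vector is treated as one more vector in the
    average, with the same weight as each individual word vector.
    """
    vectors = list(word_vectors) + [document_vector]
    dim = len(document_vector)
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must share one dimensionality")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Hypothetical vector representations for three words and one document.
words = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
doc = [1.0, 1.0]

# Combined representation that a later layer could use to score candidate words.
combined = average_vectors(words, doc)
```

The output has the same dimensionality as each input vector, which is what distinguishes averaging from concatenation as a combining operation.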
Regarding Claim 19:
Zweig et al. in view of Collobert et al. teach the system of claim 13.
Zweig et al. in view of Collobert et al. do not teach wherein the combining layer is configured to average the vector representations of the words in the sequence and the vector representation of the input document.
However, Mikolov et al. teach wherein the combining layer is configured to average the vector representations of the words in the sequence and the vector representation of the input document (Pg. 4, Section 3.1 Continuous Bag-of-Words Model, "where the non-linear hidden layer is removed and the projection layer is shared for all words (not just the projection matrix); thus, all words get projected into the same position (their vectors are averaged)" and Figure 1 teach that the projection layer (corresponding to the combining layer) averages the vectors of the words).
Zweig et al., Collobert et al., and Mikolov et al. are analogous art because they are directed to providing vector representations of the words of a document in a neural network.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the feature wherein the combining layer is configured to average the vector representations of the words in the sequence and the vector representation of the input document, as taught by Mikolov et al., into the disclosed invention of Zweig et al. in view of Collobert et al.
One of ordinary skill in the art would have been motivated to make this modification for the following reason: “The CBOW architecture works better than the ...” (Mikolov, Pg. 7, Section 4.3 Comparison of Model Architectures).
Prior Art
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure:
Schwenk et al. (“Connectionist language modeling for large vocabulary continuous speech recognition”) teach a language model trained on documents.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LOKESHA G PATEL whose telephone number is (571)272-6267. The examiner can normally be reached Monday-Friday 8am-5pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached at (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and 





/LOKESHA G PATEL/Examiner, Art Unit 2125

/KAMRAN AFSHAR/Supervisory Patent Examiner, Art Unit 2125