Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
1.	This action is responsive to remarks filed 12/7/22.
	No amendments have been filed.
Response to Arguments
2.	Applicant’s arguments filed 12/7/22 have been fully considered but are not persuasive.
	Regarding claim 1, Applicant argues on pages 8-9 that the cited prior art, Nunez, does not teach:
based on an offset value, determine a set of successive position values for the set of tokens, wherein each position value in the set of successive position values represents a position of a token in the set of tokens relative to other tokens in the set of tokens.

	Examiner respectfully disagrees.
Regarding claim 1 Vaswani teaches A system (5.2 Hardware and Schedule: We trained our models on one machine with 8 NVIDIA P100 GPUs) comprising: 
a set of processing units (5.2); and 
a non-transitory machine-readable medium storing instructions that when executed by at least one processing unit in the set of processing units cause the at least one processing unit (5.2) to: 
receive a set of input data for training a transformer model, the set of input data comprising a set of tokens (5.1 Training; 3.5: inject some information about the relative or absolute position of the tokens in the sequence.); 
[based on an offset value,] determine a set of [successive] position values for the set of tokens, [wherein each position value in the set of successive position values represents a position of a token in the set of tokens relative to other tokens in the set of tokens] (3.5: inject some information about the relative or absolute position of the tokens in the sequence.); 
generate a set of training data to comprise the set of tokens and the set of [successive] position values (3.5: inject some information about the relative or absolute position of the tokens in the sequence; 5.1 training); and 
train the transformer model using the set of training data (5.1 training 
where Vaswani teaches
Abstract: We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
Introduction: In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
3.5 Positional Encoding Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension dmodel as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9].
5.1 Training Data and Batching);

Vaswani does not specifically teach the bracketed limitations, which Nunez teaches:
based on an offset value, determine a set of successive position values for the set of tokens, wherein each position value in the set of successive position values represents a position of a token in the set of tokens relative to other tokens in the set of tokens (fig 3; 31 token, offset, number, sentence).  
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Nunez to better represent the positions of the tokens for improved transformer training.

Nunez teaches systems and methods analyze text ([0006]); text is received and parsed by tokenizer to generate a plurality of tokens ([0028]); tokenizer sends tokens to indexer which stores them within an index ([0029]).
Nunez teaches an offset, and determining position of a token relative to other tokens.  Nunez shows an example sentence, generated tokens for each sentence, and offset, number, and sentence number for an index, which allows for organized storage of locations of terms presented within a document (fig 3; 28; 31).  The offset, number, and sentence all work together to provide the location value, which provides specific information for each token.  The offset of the application merely appears to be a number that helps to identify position of tokens relative to other tokens, and the location identification values of Nunez teach values that help to categorize and identify positions of tokens relative to other tokens.
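
For illustration only, an index of the kind discussed above (token, offset, number, sentence) can be sketched as follows. This is a minimal sketch and not the implementation of Nunez: the function name build_index, the reading of "offset" as a character offset into the document, and the reading of "number" as the word's position within its sentence are all assumptions made for the example.

```python
import re

def build_index(document):
    """Return one record per word token.

    Hypothetical illustration: 'offset' is assumed to be the character
    offset in the document, 'number' the word's position in its sentence,
    and 'sentence' the 1-based sentence number.
    """
    records = []
    sentence_no = 1
    word_no = 0
    # Match either a word or a single sentence-ending punctuation mark.
    for match in re.finditer(r"\w+|[.!?]", document):
        text = match.group()
        if text in ".!?":
            # Sentence boundary: the next word starts a new sentence.
            sentence_no += 1
            word_no = 0
            continue
        word_no += 1
        records.append({
            "token": text,
            "offset": match.start(),
            "number": word_no,
            "sentence": sentence_no,
        })
    return records

index = build_index("The cat sat. The dog ran.")
# index[3] == {"token": "The", "offset": 13, "number": 1, "sentence": 2}
```

In this sketch the three values together locate each token, and the "number" field restarts at 1 for each sentence, so the same number can recur across sentences while the offset keeps increasing.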

Thus, the claim as currently recited does not overcome the current art of record and the rejection is maintained.

	The rejections are maintained for the other additional claims, based on arguments presented above and art rejections below.

Claim Rejections - 35 USC § 103
3.	In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

4.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

5.	Claims 1-9, 11-14, 16-19 are rejected under 35 U.S.C. 103 as being unpatentable over Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017) in view of Nunez et al (2014/0358923).

Regarding claim 1 Vaswani teaches A system (5.2 Hardware and Schedule: We trained our models on one machine with 8 NVIDIA P100 GPUs) comprising: 
a set of processing units (5.2); and 
a non-transitory machine-readable medium storing instructions that when executed by at least one processing unit in the set of processing units cause the at least one processing unit (5.2) to: 
receive a set of input data for training a transformer model, the set of input data comprising a set of tokens (5.1 Training; 3.5: inject some information about the relative or absolute position of the tokens in the sequence.); 
[based on an offset value,] determine a set of [successive] position values for the set of tokens, [wherein each position value in the set of successive position values represents a position of a token in the set of tokens relative to other tokens in the set of tokens] (3.5: inject some information about the relative or absolute position of the tokens in the sequence.); 
generate a set of training data to comprise the set of tokens and the set of [successive] position values (3.5: inject some information about the relative or absolute position of the tokens in the sequence; 5.1 training); and 
train the transformer model using the set of training data (5.1 training 
where Vaswani teaches
Abstract: We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
Introduction: In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
3.5 Positional Encoding Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension dmodel as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9].
5.1 Training Data and Batching);

Vaswani does not specifically teach the bracketed limitations, which Nunez teaches:
based on an offset value, determine a set of successive position values for the set of tokens, wherein each position value in the set of successive position values represents a position of a token in the set of tokens relative to other tokens in the set of tokens (fig 3; 31 token, offset, number, sentence).  
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Nunez to better represent the positions of the tokens for improved transformer training.
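
For illustration only, the limitations of claim 1 as mapped above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the implementation of Applicant, Vaswani, or Nunez. The function names successive_positions and sinusoidal_encoding are hypothetical. The sinusoidal formula follows the fixed positional encoding of Vaswani section 3.5, which is not reproduced in the quotation above, and the pairing of each token with an offset-shifted position is an assumption about how the combination might look.

```python
import math

def successive_positions(tokens, offset):
    """Assign position values offset, offset + 1, ... to the tokens in order."""
    return [offset + i for i in range(len(tokens))]

def sinusoidal_encoding(position, d_model):
    """Fixed sinusoidal encoding: sin on even dimensions, cos on odd ones."""
    encoding = []
    for k in range(d_model):
        # Dimensions 2i and 2i + 1 share the exponent 2i / d_model.
        angle = position / (10000 ** ((k - k % 2) / d_model))
        encoding.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return encoding

tokens = ["the", "cat", "sat"]
positions = successive_positions(tokens, offset=5)  # [5, 6, 7]

# Training data pairs each token with its position value and encoding.
training_data = [
    (token, pos, sinusoidal_encoding(pos, d_model=4))
    for token, pos in zip(tokens, positions)
]
```

The position values remain successive and preserve the order of the tokens relative to one another regardless of the offset chosen.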


Regarding claim 2 Vaswani teaches generating the set of training data to comprise tokens, but does not specifically teach, where Nunez teaches, The system of claim 1, wherein the set of tokens is a first set of tokens, wherein the set of input data further comprises a second set of tokens, wherein the offset value is a first offset value, wherein the set of successive position values is a first set of successive position values (fig 3; 28: text is received and parsed by tokenizer to generate plurality of tokens; 31: sentence – where tokenizer goes through a document to create tokens for each sentence of document), 
wherein the instructions further cause the at least one processing unit to, based on a second offset value, determine a second set of successive position values for the second set of tokens, wherein each position value in the second set of successive position values represents a position of a token in the second set of tokens relative to other tokens in the second set of tokens (fig 3; 28; 31 – where tokenizer goes through a document to create tokens for each sentence of document), 
and Vaswani and Nunez would teach
wherein generating the set of training data comprises generate the set of training data to comprise the first set of tokens, the first set of successive position values, the second set of tokens, and the second set of successive position values.
Rejected for similar rationale and reasoning as claim 1

Regarding claim 3 Nunez teaches The system of claim 2, wherein the first offset value and the second offset value are the same (Nunez fig 3; 31
where the index would be created for multiple sentences; i.e. sentence 2, with successive numbers for word positions of that sentence, and offsets; where some offset values may be the same, while others may be different).  
Rejected for similar rationale and reasoning as claim 1

Regarding claim 4 Nunez teaches The system of claim 3, wherein the first offset value is the value zero (Nunez fig 3 offset 0).  
Rejected for similar rationale and reasoning as claim 1

Regarding claim 5 Nunez teaches The system of claim 2, wherein the first offset value and the second offset value are different (Nunez fig 3; 31
where the index would be created for multiple sentences; i.e. sentence 2, with successive numbers for word positions of that sentence, and offsets; where some offset values may be the same, while others may be different; and the offsets represent different sentences).  
Rejected for similar rationale and reasoning as claim 1

Regarding claim 6 Nunez teaches The system of claim 2, wherein a position value in the first set of successive position values has the same value as a position value in the second set of successive position values (Nunez fig 3
where the index would be created for multiple sentences; i.e. sentence 2, with successive numbers for word positions of that sentence, and offsets). 
Rejected for similar rationale and reasoning as claim 1 
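
For illustration only, the relationships between offsets and position values recited in claims 2-6 can be sketched as follows. This is a minimal sketch, not taken from Nunez or from the application. The function name positions_for and the example token sets are hypothetical.

```python
def positions_for(tokens, offset):
    """Successive position values for one set of tokens, starting at offset."""
    return list(range(offset, offset + len(tokens)))

first_tokens = ["the", "cat", "sat"]
second_tokens = ["a", "dog", "ran", "off"]

# Same offset for both sets, here the value zero.
same_first = positions_for(first_tokens, 0)    # [0, 1, 2]
same_second = positions_for(second_tokens, 0)  # [0, 1, 2, 3]

# Different offsets for the two sets.
diff_first = positions_for(first_tokens, 0)    # [0, 1, 2]
diff_second = positions_for(second_tokens, 2)  # [2, 3, 4, 5]

# A position value can appear in both sets even when the offsets differ.
shared = set(diff_first) & set(diff_second)    # {2}
```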

Regarding claim 7 Vaswani and Nunez teach The system of claim 1, wherein the set of training data is a first set of training data, wherein the offset value is a first offset value, wherein the set of successive position values is a first set of successive position values (Nunez fig 3; 31 tokens, offset, position, sentence number), wherein the instructions further cause the at least one processing unit to: 
based on a second offset value, determine a second set of successive position values for the set of tokens, wherein each position value in the second set of successive position values represents a position of a token in the set of tokens relative to other tokens in the set of tokens (Nunez fig 3; 28: text is received and parsed by tokenizer to generate plurality of tokens; 31: sentence – where tokenizer goes through a document to create tokens for each sentence of document; where the index would be created for multiple sentences; i.e. sentence 2, with successive numbers for word positions of that sentence, and offsets); 
generate a second set of training data to comprise the set of tokens and the second set of successive position values (Nunez fig 3; 28; 31); and 
train the transformer model using the second set of training data (where Vaswani teaches training the transformer with tokens (3.5; 5.1) and Nunez teaches the first and second set of training data).  
Rejected for similar rationale and reasoning as claim 1

Regarding claim 8 Nunez teaches The system of claim 7, wherein a difference between the first offset value and the second offset value is a defined difference value (fig 3 – multiple offset values for each sentence of document; where differences would correspond to some value).  
Rejected for similar rationale and reasoning as claim 1

Regarding claim 9 Nunez teaches The system of claim 1, wherein the instructions further cause the at least one processing unit to determine the offset value by randomly selecting the offset value from a range of candidate offset values (fig 3 where offset values can be assigned to any set of values).  
Rejected for similar rationale and reasoning as claim 1
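
For illustration only, the limitation of claim 9 (randomly selecting the offset value from a range of candidate offset values) can be sketched as follows. This is a minimal sketch of the claim language, not a description of Nunez. The function name pick_offset, the candidate range 0-99, and the fixed seed are assumptions made for the example.

```python
import random

def pick_offset(candidates, rng):
    """Randomly select one offset value from the candidate offset values."""
    return rng.choice(list(candidates))

rng = random.Random(0)  # seeded only so the example is repeatable
offset = pick_offset(range(0, 100), rng)

# Successive position values for four tokens, starting at the chosen offset.
positions = [offset + i for i in range(4)]
```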



Regarding claim 11 Vaswani and Nunez teach A method comprising: 
receiving a set of input data for training a transformer model, the set of input data comprising a set of tokens; 
based on an offset value, determining a set of successive position values for the set of tokens, wherein each position value in the set of successive position values represents a position of a token in the set of tokens relative to other tokens in the set of tokens; 
generating a set of training data to comprise the set of tokens and the set of successive position values; and 
training the transformer model using the set of training data.  
	Recites limitations similar to claim 1 and is rejected for similar rationale and reasoning 

	Claim 12 recites limitations similar to claim 2 and is rejected for similar rationale and reasoning
Claim 13 recites limitations similar to claim 7 and is rejected for similar rationale and reasoning
Claim 14 recites limitations similar to claim 9 and is rejected for similar rationale and reasoning


Regarding claim 16 Vaswani and Nunez teach A non-transitory machine-readable medium storing a program executable by at least one processing unit of a computer system, the program comprising sets of instructions for: 
receiving a set of input data for training a transformer model, the set of input data comprising a set of tokens; 
based on an offset value, determining a set of successive position values for the set of tokens, wherein each position value in the set of successive position values represents a position of a token in the set of tokens relative to other tokens in the set of tokens; 
generating a set of training data to comprise the set of tokens and the set of successive position values; and 
training the transformer model using the set of training data.  
Recites limitations similar to claim 1 and is rejected for similar rationale and reasoning

Claim 17 recites limitations similar to claim 2 and is rejected for similar rationale and reasoning
Claim 18 recites limitations similar to claim 7 and is rejected for similar rationale and reasoning
Claim 19 recites limitations similar to claim 9 and is rejected for similar rationale and reasoning


6.	Claims 10, 15, 20 are rejected under 35 U.S.C. 103 as being unpatentable over Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017) in view of Nunez et al (2014/0358923) in further view of Dean (2007/0220023).

Regarding claim 10 Vaswani and Nunez do not specifically teach where Dean teaches The system of claim 1, wherein the transformer model is configured to train on training data comprising a sequence of tokens that is less than a defined maximum number of tokens, wherein each position value in the set of successive position values is less than the defined maximum number of tokens (Dean 0052 tokens).  
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Dean for improved tokenization.
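
For illustration only, the constraint recited in claim 10 (each position value is less than the defined maximum number of tokens) can be sketched as follows. This is a minimal sketch of the claim language and is not drawn from Dean. The function names max_offset and bounded_positions are hypothetical.

```python
def max_offset(num_tokens, max_tokens):
    """Largest offset for which every position value stays below max_tokens."""
    if num_tokens > max_tokens:
        raise ValueError("sequence is longer than the defined maximum")
    return max_tokens - num_tokens

def bounded_positions(num_tokens, offset, max_tokens):
    """Successive position values, rejecting offsets that exceed the bound."""
    if offset < 0 or offset > max_offset(num_tokens, max_tokens):
        raise ValueError("offset would produce an out-of-range position value")
    return list(range(offset, offset + num_tokens))

positions = bounded_positions(num_tokens=3, offset=5, max_tokens=8)  # [5, 6, 7]
```

With three tokens and a defined maximum of eight, an offset of 5 is the largest permitted, because an offset of 6 would yield the position value 8.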


	Claim 15 recites limitations similar to claim 10 and is rejected for similar rationale and reasoning 

Claim 20 recites limitations similar to claim 10 and is rejected for similar rationale and reasoning

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHAUN A ROBERTS whose telephone number is (571)270-7541.  The examiner can normally be reached Monday-Friday 9-5 EST.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders can be reached on 571-272-7516.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/SHAUN ROBERTS/
Primary Examiner, Art Unit 2655