DETAILED ACTION
This Office Action is in response to the correspondence filed by the applicant on 5/22/2020.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Regarding the rejection of claims under 35 U.S.C. 112(b), Applicant has amended claims 3-6, 10, 11, and 16-19.  The rejections are withdrawn in light of the amendment.

Regarding the rejection of claims under 35 U.S.C. 101, Applicant’s arguments have been fully considered, but they are not persuasive.
The analysis of patent eligibility is performed under the 2019 Revised Patent Subject Matter Eligibility Guidance (2019 PEG) (https://www.federalregister.gov/documents/2019/01/07/2018-28282/2019-revised-patent-subject-matter-eligibility-guidance).
For Step 1, the Examiner determines that the claims fall within a statutory category. For example, independent claim 1 recites a series of steps and is therefore a process.

For Step 2A Prong One, the Examiner determines that the claims recite a judicial exception. Independent claims 1, 8, and 14 recite the limitations of “receive a plurality of data sets for training …; generate a set of training data ….; determining, for the sequence of tokens of each dataset in the plurality of datasets, a set of groups of tokens ….; determining .. a length of the group tokens …; identifying a group of tokens … having a longest length …; packing the data structure; and train the sequence model using the set of training data.”  These limitations, as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitations in the mind but for the recitation of generic computer components.  That is, other than reciting “a set of processing units; and a non-transitory machine-readable medium …”, nothing in the claim elements precludes the steps from practically being performed in the mind.  For example, the claims encompass a person gathering data, formulating or arranging the data, and studying the data.  The mere nominal recitation of “a set of processing units; and a non-transitory machine-readable medium …” does not take the claim limitations out of the mental processes grouping. Thus, the claims recite a mental process.

For Step 2A Prong Two, the Examiner determines that the judicial exception is not integrated into a practical application.  The claims recite the additional elements: “a set of processing units; and a non-transitory machine-readable medium …”
Each of the additional elements, and the combination of the additional elements, is no more than mere instructions to apply the exception using generic computer components.  Accordingly, even in combination, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.  The claims are thus directed to the abstract idea.

For Step 2B, the Examiner determines that the claims do not provide an inventive concept.  As discussed with respect to Step 2A Prong Two, the additional elements in the claims amount to no more than mere instructions to apply the exception using a generic computer component.  The same analysis applies here in Step 2B, i.e., mere instructions to apply an exception on a generic computer cannot integrate a judicial exception into a practical application at Step 2A or provide an inventive concept in Step 2B.  The claims are ineligible.




The Examiner suggests further amending the independent claims to specify how the trained model is implemented and applied in the field of natural language processing in order to overcome the current rejections under 35 U.S.C. 101 (e.g., receiving an input text and interpreting the input text or finding an intent of the input text using the model).

Regarding the rejection of claims under 35 U.S.C. 102 / 103, Applicant’s arguments have been fully considered, but they are not persuasive.
Applicant asserts “the tensor in Ward is not used for training a sequence model … combining the tensor of audio files in Ward with the sequence of words in Devlin would not work as the training data in Ward and the training data Ward are different.”  The Examiner respectfully disagrees.  Devlin already teaches packing the data for training, wherein the data is a sequence of tokens, and the Examiner relies on Ward for identifying the longest data for packing.  Moreover, both Devlin and Ward prepare data for training by “padding” the data in the field of machine learning.  One of ordinary skill in the art would recognize the “padding” technique of Ward and apply the technique (identifying the longest data for padding) to word data.  For at least the reasons above, the Examiner maintains the rejections.  Please see the rejections below for more details.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The independent claims 1, 8, and 14 recite, “receive a plurality of datasets for training …; generate a set of training data …; and train the sequence model using the set of training data.”
The limitations “receive a plurality of data sets for training …; generate a set of training data ….; determining, for the sequence of tokens of each dataset in the plurality of datasets, a set of groups of tokens ….; determining .. a length of the group tokens …; identifying a group of tokens … having a longest length …; packing the data structure; and train the sequence model using the set of training data.”, as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitations in the mind but for the recitation of generic computer components. That is, other than reciting “a set of processing units …” in claims 1, 8 and 14, nothing in the claim elements precludes the steps from practically being performed in the mind. For example, a person can gather data, formulate or arrange the data, and study the data.
If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas.  Accordingly, the claim recites an abstract idea.
This judicial exception is not integrated into a practical application. In particular, the claims recite only the additional elements “a set of processing units; and a non-transitory machine-readable medium …”. The additional elements are recited at a high level of generality (i.e., as a generic processor performing the generic computer functions of “receive a plurality of data sets for training …; generate a set of training data ….; determining, for the sequence of tokens of each dataset in the plurality of datasets, a set of groups of tokens ….; determining .. a length of the group tokens …; identifying a group of tokens … having a longest length …; packing the data structure; and train the sequence model using the set of training data.”) such that they amount to no more than mere instructions to apply the exception using a generic computer component.  Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.  The claims are directed to an abstract idea.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using a set of processors to perform the recited steps amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept.  The claim is not patent eligible.

Regarding the dependent claims, Claims 2 and 15 recite a set of position values; Claims 3 and 16 recite an embedding for each position value; Claims 4 and 17 recite iteratively packing the data structure; Claims 5 and 18 recite packing the second data structure with the longest length; Claims 6 and 19 recite adding label data; Claims 7 and 20 recite a first paragraph and a second paragraph; Claim 9 recites copies of a second group of tokens; Claim 10 recites packing first and second rows; Claim 11 recites combining the first and the second embeddings; Claim 12 recites repeating the copies; and Claim 13 recites packing the data structure with the defined length.
Even though the disclosed invention is described in the specification as improving computer technology, the claims provide no meaningful limitations such that this improvement is realized. Therefore, the claims do not amount to significantly more than the abstract idea itself.
Accordingly, the limitations of the Claims, whether considered individually or as an ordered combination, are not sufficient to add significantly more to improve technological functionality. As such, Claims 1-20 are rejected under 35 U.S.C. 101 as being directed to non-statutory subject matter.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.



Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over DEVLIN (Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2019), and further in view of WARD (US 2020/0035224 A1).

REGARDING CLAIM 1, DEVLIN discloses a system comprising: 
a set of processing units (DEVLIN Pg. 13 2nd Col – “Training of BERT_BASE was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total). Training of BERT_LARGE was performed on 16 Cloud TPUs (64 TPU chips total). Each pretraining took 4 days to complete.”); and 
a non-transitory machine-readable medium storing instructions that when executed by at least one processing unit in the set of processing units cause the at least one processing unit (DEVLIN pg. 2 1st Col – “BERT advances the state of the art for eleven NLP tasks. The code and pre-trained models are available at https://github.com/google-research/bert.”) to: 
receive a plurality of datasets for training a sequence model (DEVLIN pg. 4 1st Col – “To make BERT handle a variety of down-stream tasks, our input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., ⟨Question, Answer⟩) in one token sequence. Throughout this work, a “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together. We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary.”), each dataset in the plurality of datasets comprising a sequence of correlated tokens (DEVLIN Fig. 1; Pg. 4 1st Col – “To make BERT handle a variety of down-stream tasks, our input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., ⟨Question, Answer⟩) in one token sequence. Throughout this work, a “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.”; In other words, the tokens are the words in a same sentence (i.e., token sequence); thus, they are correlated.); 
generate a set of training data comprising a subset of a sequence of tokens from a first dataset in the plurality of datasets (DEVLIN pg. 13 1st Col – “To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as “sentences” even though they are typically much longer than single sentences (but can be shorter also). The first sentence receives the A embedding and the second receives the B embedding. 50% of the time B is the actual next sentence that follows A and 50% of the time it is a random sentence, which is done for the “next sentence prediction” task. They are sampled such that the combined length is ≤ 512 tokens. The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%, and no special consideration given to partial word pieces.”) and a subset of a sequence of tokens from a second (DEVLIN pg. 13 1st Col – “To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as “sentences” even though they are typically much longer than single sentences (but can be shorter also). The first sentence receives the A embedding and the second receives the B embedding.”) different dataset in the plurality of datasets (DEVLIN pg. 13 1st Col – “To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as “sentences” even though they are typically much longer than single sentences (but can be shorter also). The first sentence receives the A embedding and the second receives the B embedding. 50% of the time B is the actual next sentence that follows A and 50% of the time it is a random sentence, which is done for the “next sentence prediction” task.”; Pg. 15 Fig. 4 – “Paragraph: Tok1 … Tok M”; pg. 6 Section 4.2 – “The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100k crowdsourced question/answer pairs (Rajpurkar et al., 2016). Given a question and a passage from Wikipedia containing the answer, the task is to predict the answer text span in the passage.”; Pg. 7 1st Col – “The TriviaQA data we used consists of paragraphs from TriviaQA-Wiki formed of the first 400 tokens in documents, that contain at least one of the provided possible answers.”), wherein the set of training data comprises a data structure having a defined length (DEVLIN Pg. 5 Fig. 2 – “BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.”; pg. 13 – “They are sampled such that the combined length is ≤ 512 tokens. The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%, and no special consideration given to partial word pieces. We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus…. Longer sequences are disproportionately expensive because attention is quadratic to the sequence length. To speed up pretraing in our experiments, we pre-train the model with sequence length of 128 for 90% of the steps. Then, we train the rest 10% of the steps of sequence of 512 to learn the positional embeddings.”), wherein generating the set of training data (DEVLIN pg. 13 1st Col – “To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as “sentences” even though they are typically much longer than single sentences (but can be shorter also). The first sentence receives the A embedding and the second receives the B embedding. 50% of the time B is the actual next sentence that follows A and 50% of the time it is a random sentence, which is done for the “next sentence prediction” task. They are sampled such that the combined length is ≤ 512 tokens. The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%, and no special consideration given to partial word pieces.”) comprises:
determining, for the sequence of tokens of each dataset in the plurality of datasets, a set of groups of tokens in the sequence of tokens (DEVLIN pg. 3 Fig. 1 – “Tok1 … Tok N … Tok1 … Tok M”; Pg. 5 Fig. 2 – “BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings”; pg. 4 1st Col – “To make BERT handle a variety of down-stream tasks, our input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., ⟨Question, Answer⟩) in one token sequence. Throughout this work, a “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.”),
determining, for each group of tokens in the set of groups of tokens of each sequence of tokens, a length of the group of tokens (DEVLIN pg. 13 1st Col – “The first sentence receives the A embedding and the second receives the B embedding. 50% of the time B is the actual next sentence that follows A and 50% of the time it is a random sentence, which is done for the “next sentence prediction” task. They are sampled such that the combined length is ≤ 512 tokens. The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%, and no special consideration given to partial word pieces.”), wherein generating the set of training data is based on the lengths of the groups of tokens (DEVLIN pg. 13 1st Col – “They are sampled such that the combined length is ≤ 512 tokens. The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%, and no special consideration given to partial word pieces.”; pg. 13 2nd Col – “Longer sequences are disproportionately expensive because attention is quadratic to the sequence length. To speed up pretraing in our experiments, we pre-train the model with sequence length of 128 for 90% of the steps. Then, we train the rest 10% of the steps of sequence of 512 to learn the positional embeddings.”),
[identifying] using a group of tokens in the plurality of datasets having a longest length equal to or less than the defined length of the data structure (DEVLIN Pg. 5 Fig. 2 – “BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.”; pg. 13 – “They are sampled such that the combined length is ≤ 512 tokens. … To speed up pretraing in our experiments, we pre-train the model with sequence length of 128 for 90% of the steps. Then, we train the rest 10% of the steps of sequence of 512 to learn the positional embeddings.”), and
packing the data structure with the identified group of tokens (DEVLIN pg. 4 1st col – “A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.”); and 
train the sequence model using the set of training data (DEVLIN pg. 13 1st Col – “We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus.”).

DEVLIN does not explicitly teach the [square-bracketed] limitations and teaches the underlined features. In other words, DEVLIN teaches the length of the input embeddings used for training is less than or equal to 512 tokens, but does not explicitly teach identifying the group of tokens (i.e., a sentence) whose length is less than or equal to the defined length.

WARD discloses the [square-bracketed] limitations.  WARD discloses a method/system for generating training data for machine learning comprising: [identifying] a group of tokens in the plurality of datasets having the longest length equal to or less than the defined length of the data structure (WARD Par 98 – “The length of the tensor 1000 in the time dimension is determined by the length of the longest sample 1002 in terms of the number of frames. Longest sample 1002 is not changed.”); and 
packing the data structure with the identified group of tokens (WARD Par 98 – “Each of the shorter samples 1001, 1003, 1004, 1005, 1006 in the batch are repeated until they are the same length as the longest sample 1002, so that every row of the tensor has the same length. The shorter samples are repeated exactly in all of their elements starting from the first element through the last element of the sample.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of DEVLIN to include identifying a group of tokens with the longest length, as taught by WARD.
One of ordinary skill would have been motivated to include identifying a group of tokens with the longest length, in order to achieve faster convergence and learning (WARD Par 101).
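For illustration only, the repetition-based padding quoted from WARD Par 98 can be sketched as follows. This is a minimal sketch under stated assumptions and not a characterization of the actual implementation of either reference: the function name pad_by_repetition and the word samples are hypothetical, and word tokens are used in place of WARD's audio frames solely to show how the technique would carry over to token sequences.

```python
from itertools import cycle, islice


def pad_by_repetition(samples):
    """Repeat each shorter sample, from its first element onward, until it
    is as long as the longest sample; the longest sample is left unchanged.

    Hypothetical helper written for illustration; not taken from WARD.
    """
    if not samples:
        return []
    if any(len(s) == 0 for s in samples):
        raise ValueError("an empty sample cannot be repeated")
    # The row length is set by the longest sample in the batch.
    target = max(len(s) for s in samples)
    # cycle() restarts each sample at its first element; islice() cuts at target.
    return [list(islice(cycle(s), target)) for s in samples]


# Hypothetical batch of token sequences of unequal length.
batch = [["my", "dog", "is", "cute"], ["he", "likes"], ["play"]]
padded = pad_by_repetition(batch)
for row in padded:
    print(row)
```

Every row of the result has the length of the longest sample, which corresponds to the statement in WARD Par 98 that "every row of the tensor has the same length."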


REGARDING CLAIM 2, DEVLIN discloses the system of claim 1, wherein generating the set of training data (DEVLIN pg. 13 1st Col – “To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as “sentences” even though they are typically much longer than single sentences (but can be shorter also). The first sentence receives the A embedding and the second receives the B embedding. 50% of the time B is the actual next sentence that follows A and 50% of the time it is a random sentence, which is done for the “next sentence prediction” task. They are sampled such that the combined length is ≤ 512 tokens. The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%, and no special consideration given to partial word pieces.”) further comprises: 
determining a set of position values for the identified group (DEVLIN Pg. 5 Fig. 2 – “Position Embeddings: E0 E1 E2 … E10 ..  BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.”; Pg. 4 1st Col – “For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. A visualization of this construction can be seen in Figure 2.”), and
packing the data structure with the set of position values (DEVLIN Pg. 4 1st Col – “A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together. … Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B. As shown in Figure 1, we denote input embedding as E, the final hidden vector of the special [CLS] token as C ∈ ℝ^H, and the final hidden vector for the i-th input token as T_i ∈ ℝ^H.”).

REGARDING CLAIM 3, DEVLIN discloses the system of claim 2, wherein generating the set of training data (DEVLIN pg. 13 1st Col – “To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as “sentences” even though they are typically much longer than single sentences (but can be shorter also)….”) further comprises: 
	determining a set of embeddings comprising an embedding for each position value in the set of position values (DEVLIN Pg. 5 Fig. 2 – “Position Embeddings: E0 E1 E2 … E10 ..  BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.”; Pg. 4 1st Col – “For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. A visualization of this construction can be seen in Figure 2.”).
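For illustration only, the summation of token, segment, and position embeddings quoted above from DEVLIN Pg. 4 and Fig. 2 can be sketched as follows. This is a minimal sketch under stated assumptions: the function name input_representation, the two-dimensional integer tables, and the example ids are all hypothetical and chosen only so the sums are easy to follow; they are not values from the reference.

```python
def input_representation(token_ids, segment_ids, tok_table, seg_table, pos_table):
    """Sum the token, segment, and position embeddings for each token.

    The position value of a token is its index in the packed sequence
    (0, 1, 2, ...), mirroring E0, E1, E2, ... in DEVLIN Fig. 2.
    Hypothetical helper written for illustration.
    """
    if len(token_ids) != len(segment_ids):
        raise ValueError("token_ids and segment_ids must have equal length")
    rows = []
    for position, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        rows.append([
            t + s + p
            for t, s, p in zip(tok_table[tok], seg_table[seg], pos_table[position])
        ])
    return rows


# Hypothetical toy embedding tables with hidden size 2.
tok_table = [[1, 0], [0, 1], [2, 2]]              # one row per token id
seg_table = [[10, 10], [20, 20]]                  # segment A, segment B
pos_table = [[100, 100], [200, 200], [300, 300]]  # one row per position value

representation = input_representation(
    token_ids=[2, 0, 1],
    segment_ids=[0, 0, 1],
    tok_table=tok_table,
    seg_table=seg_table,
    pos_table=pos_table,
)
print(representation)
```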

REGARDING CLAIM 4, DEVLIN in view of WARD discloses the system of claim 1, wherein generating the set of training data further comprises 
iteratively packing the data structure with remaining groups of tokens in the plurality of datasets having the longest length that is equal to or less than a remaining length in the data structure (WARD Par 98 – “Each of the shorter samples 1001, 1003, 1004, 1005, 1006 in the batch are repeated until they are the same length as the longest sample 1002, so that every row of the tensor has the same length. The shorter samples are repeated exactly in all of their elements starting from the first element through the last element of the sample.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of DEVLIN to include iteratively packing the data with remaining groups of tokens, as taught by WARD.
One of ordinary skill would have been motivated to include iteratively packing the data with remaining groups of tokens, in order to achieve faster convergence and learning (WARD Par 101).
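For illustration only, the limitation of claim 4 as recited (iteratively packing the data structure with the remaining group having the longest length that is equal to or less than the remaining length) can be read as a greedy longest-fit procedure. The following is a minimal sketch of that reading under stated assumptions; it is not taken from DEVLIN, WARD, or the specification, and the function name pack_longest_first and the integer token values are hypothetical.

```python
def pack_longest_first(groups, defined_length):
    """Greedily fill a fixed-length structure with the longest group that fits.

    Returns the packed tokens and the groups that were not packed.
    Hypothetical helper written for illustration.
    """
    remaining = defined_length
    packed = []
    # Longest groups first, so the first group that fits is the longest that fits.
    leftover = sorted(groups, key=len, reverse=True)
    while True:
        fit = next((g for g in leftover if len(g) <= remaining), None)
        if fit is None:
            break  # nothing left that fits in the remaining length
        packed.extend(fit)
        remaining -= len(fit)
        leftover.remove(fit)
    return packed, leftover


# Hypothetical groups of tokens (e.g., sentences) and a defined length of 8.
groups = [[1, 2, 3, 4, 5], [6, 7, 8], [9, 10], [11]]
packed, leftover = pack_longest_first(groups, defined_length=8)
print(packed)
print(leftover)
```

In this example the five-token group is packed first, leaving a remaining length of three, which the three-token group then fills exactly; the two shorter groups remain for a subsequent data structure.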

REGARDING CLAIM 5, DEVLIN in view of WARD discloses the system of claim 1, wherein the data structure is a first data structure (DEVLIN Pg. 5 Fig. 2 – “BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.”; pg. 13 1st Col – “The first sentence receives the A embedding and the second receives the B embedding.”), wherein the set of training data further comprises a second data structure having the defined length (DEVLIN pg. 13 1st Col – “The first sentence receives the A embedding and the second receives the B embedding. 50% of the time B is the actual next sentence that follows A and 50% of the time it is a random sentence, which is done for the “next sentence prediction” task. They are sampled such that the combined length is ≤ 512 tokens. The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%, and no special consideration given to partial word pieces. We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus.”; In other words, data structures (i.e., input embeddings) are formed together as a batch.  Thus, 256 sequences (i.e., input embeddings) are processed.), wherein generating the set of training data (DEVLIN pg. 13 1st Col – “To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as “sentences” even though they are typically much longer than single sentences (but can be shorter also)….”) comprises: 
[identifying] using a remaining group of tokens in the plurality of datasets having the longest length equal to or less than the defined length of the data structure (DEVLIN pg. 13 1st Col – “They are sampled such that the combined length is ≤ 512 tokens. The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%, and no special consideration given to partial word pieces. We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus.”; In other words, data structures (i.e., input embeddings) are formed together as a batch.  Thus, 256 sequences (i.e., input embeddings), whose lengths are less than or equal to 512, are processed.); and 
packing the second data structure with the identified group of tokens (DEVLIN pg. 4 1st col – “A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.”).
DEVLIN does not explicitly teach the [square-bracketed] limitations and teaches the underlined features. In other words, DEVLIN teaches the length of the input embeddings used for training is less than or equal to 512 tokens, but does not explicitly teach identifying the group of tokens (i.e., a sentence) whose length is less than or equal to the defined length.

WARD discloses the [square-bracketed] limitations.  WARD discloses a method/system for generating training data for machine learning comprising: [identifying] a remaining group of tokens in the plurality of datasets having the longest length equal to or less than the defined length of the data structure (WARD Par 98 – “The length of the tensor 1000 in the time dimension is determined by the length of the longest sample 1002 in terms of the number of frames. Longest sample 1002 is not changed.”); and 
packing the second data structure with the identified group of tokens (WARD Par 98 – “Each of the shorter samples 1001, 1003, 1004, 1005, 1006 in the batch are repeated until they are the same length as the longest sample 1002, so that every row of the tensor has the same length. The shorter samples are repeated exactly in all of their elements starting from the first element through the last element of the sample.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of DEVLIN to include identifying a group of tokens with the longest length, as taught by WARD.


REGARDING CLAIM 6, DEVLIN discloses the system of claim 1, wherein generating the set of training data (DEVLIN pg. 13 1st Col – “To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as “sentences” even though they are typically much longer than single sentences (but can be shorter also). The first sentence receives the A embedding and the second receives the B embedding. 50% of the time B is the actual next sentence that follows A and 50% of the time it is a random sentence, which is done for the “next sentence prediction” task. They are sampled such that the combined length is ≤ 512 tokens. The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%, and no special consideration given to partial word pieces.”) comprises adding label data to the set of training data (DEVLIN Pg. 5 Fig. 2 – “BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.”) indicating that (1) the subset of the sequence of tokens from the first dataset and (2) the subset of the sequence of tokens from the second different dataset (DEVLIN pg. 5 Fig. 2 – “Segment Embedding: EA EA EA … EB EB EB”; pg. 13 1st Col – “The first sentence receives the A embedding and the second receives the B embedding. 50% of the time B is the actual next sentence that follows A and 50% of the time it is a random sentence, which is done for the “next sentence prediction” task.”;  In other words, the two sentences, “my dog is cute” and “he likes to play ##ing”, are labeled with the segment embeddings indicating which sequence/sentence each token belongs to.) are not correlated (DEVLIN pg. 13 1st Col – “The first sentence receives the A embedding and the second receives the B embedding. 50% of the time B is the actual next sentence that follows A and 50% of the time it is a random sentence, which is done for the “next sentence prediction” task.”).
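For illustration only, the sampling quoted above from DEVLIN pg. 13 (sentence B is the actual next sentence 50% of the time and a random sentence otherwise) can be sketched as follows, with a label recording which case applies. This is a minimal sketch under stated assumptions and not the reference's code: the function name make_nsp_example and the toy documents are hypothetical, each document is assumed to hold at least two sentences, at least two documents are assumed, and the random sentence is drawn from a different document so that it is unrelated to sentence A.

```python
import random


def make_nsp_example(documents, rng):
    """Return (sentence_a, sentence_b, label) for next sentence prediction.

    label is "IsNext" when sentence_b actually follows sentence_a and
    "NotNext" when sentence_b is drawn at random from another document.
    Hypothetical helper written for illustration.
    """
    doc_index = rng.randrange(len(documents))
    doc = documents[doc_index]
    i = rng.randrange(len(doc) - 1)  # leave room for a following sentence
    sentence_a = doc[i]
    if rng.random() < 0.5:
        return sentence_a, doc[i + 1], "IsNext"
    # Pick a sentence from a different document as the uncorrelated pair.
    others = [d for j, d in enumerate(documents) if j != doc_index]
    return sentence_a, rng.choice(rng.choice(others)), "NotNext"


# Hypothetical corpus: two documents, each a list of sentences.
documents = [
    ["my dog is cute", "he likes playing", "he sleeps a lot"],
    ["the store opens at nine", "it closes at five"],
]
rng = random.Random(0)
example = make_nsp_example(documents, rng)
print(example)
```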


REGARDING CLAIM 7, DEVLIN discloses the system of claim 1, wherein the sequence of correlated tokens in the first dataset is a first set of sentences from a first paragraph of text (DEVLIN Pg. 15 Fig. 4 – “Question: Tok1 … Tok N”), wherein the sequence of correlated tokens in the second dataset is a second set of sentences from a second paragraph of text (DEVLIN Pg. 15 Fig. 4 – “Paragraph: Tok1 … Tok M”; pg. 6 Section 4.2 – “The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100k crowdsourced question/answer pairs (Rajpurkar et al., 2016). Given a question and a passage from Wikipedia containing the answer, the task is to predict the answer text span in the passage.”; Pg. 7 1st Col – “The TriviaQA data we used consists of paragraphs from TriviaQA-Wiki formed of the first 400 tokens in documents, that contain at least one of the provided possible answers.”).

REGARDING CLAIM 8, DEVLIN discloses a system comprising: a set of processing units (DEVLIN Pg.13 2nd Col – “Training of BERTBASE was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total).13 Training of BERTLARGE was performed on 16 Cloud TPUs (64 TPU chips total). Each pretraining took 4 days to complete.”); and a non-transitory machine-readable medium storing instructions that when executed by at least one processing unit in the set of processing units cause the at least one processing unit (DEVLIN pg.2 1st Col – “BERT advances the state of the art for eleven NLP tasks. The code and pre-trained models are available at https://github.com/google-research/bert.”) to: 
receive a set of input data for training a sequence model (DEVLIN pg. 4 1st Col – “To make BERT handle a variety of down-stream tasks, our input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., ⟨Question, Answer⟩) in one token sequence. Throughout this work, a “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together. We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary.”), the input data comprising a sequence of tokens (DEVLIN Fig. 1; Pg. 4 1st Col – “To make BERT handle a variety of down-stream tasks, our input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., ⟨Question, Answer⟩) in one token sequence. Throughout this work, a “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.”); 
group the sequence of tokens into a set of groups of tokens (DEVLIN pg.3 Fig. 1 – “Tok1 … Tok N … Tok1 … Tok M”; Pg. 5 Fig. 2 – “BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings”; pg. 4 1st Col – “To make BERT handle a variety of down-stream tasks, our input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., ⟨Question, Answer⟩) in one token sequence. Throughout this work, a “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.”); 
generate a set of training data comprising the set of groups of tokens (DEVLIN pg. 13 1st Col – “To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as “sentences” even though they are typically much longer than single sentences (but can be shorter also). The first sentence receives the A embedding and the second receives the B embedding. 50% of the time B is the actual next sentence that follows A and 50% of the time it is a random sentence, which is done for the “next sentence prediction” task. They are sampled such that the combined length is ≤ 512 tokens. The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%, and no special consideration given to partial word pieces.”) [and copies of at least a portion of a group of tokens in the set of groups of tokens] with a fixed length (DEVLIN Pg. 5 Fig. 2 – “BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.”; pg. 13 – “They are sampled such that the combined length is ≤ 512 tokens. The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%, and no special consideration given to partial word pieces. We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus…. Longer sequences are disproportionately expensive because attention is quadratic to the sequence length. To speed up pretraing in our experiments, we pre-train the model with sequence length of 128 for 90% of the steps. Then, we train the rest 10% of the steps of sequence of 512 to learn the positional embeddings.”); and 
train the sequence model using the set of training data (DEVLIN pg. 13 1st Col – “We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus.”).
DEVLIN does not teach the [square-bracketed] limitations, but teaches the underlined features instead.  In other words, DEVLIN teaches a set of training data comprising multiple embeddings with a fixed size (e.g., 512). The size of the set of training data (i.e., batch) is 256 sequences x 512 tokens. However, DEVLIN does not explicitly teach filling in the embeddings so that the length is 512.  One of ordinary skill in the art would know that a “zero” padding method (i.e., filling in the data with dummy values) is used when the size/length of the embedding is fixed. However, DEVLIN is silent as to filling in the embeddings with “copies of at least a portion of a group of tokens” as recited in the claim.

WARD discloses the [square-bracketed] limitations.  WARD discloses a method/system for generating training data for machine learning comprising:
generate a set of training data comprising the set of groups of tokens [and copies of at least a portion of a group of tokens in the set of groups of tokens] (WARD Fig. 10; Par 101 – “Padding the shorter training samples by repetition has several advantages over padding with zeros or special characters indicating no data. When zeros or other meaningless data is used, no information is encoded and computation time is wasted in processing that data leading to slower learning or model convergence. By repeating the input sequence, the neural network can learn from all elements of the input, and there is no meaningless or throw-away padding present. The result is faster convergence and learning, better computational utilization, and better behaved and regularized models.”; Par 102 – “Although looping of shorter samples in a batch was described above with reference to training, the repetition of shorter samples to be the same length as a longest sequence may also be performed during inference. In some embodiments, inference is performed on a tensor similar to tensor 1000 with multiple samples obtained by splitting an audio file. Each sample may be stored in a row of the tensor. The same process described above for training may be applied during inference. A longest sample may be unchanged, and each of the shorter samples may be repeated until they are the same length as the longest sample so that every row of the tensor is the same length. The tensor, with the repetitions of shorter samples, may be input to the neural network for inference.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of DEVLIN to include copies of the input sequence, as taught by WARD.
One of ordinary skill would have been motivated to include copies of the input sequence, in order to achieve faster convergence and learning (WARD Par 101).
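
For illustration only, the difference between conventional padding with dummy values and the padding by repetition described in WARD Par 101 might be sketched as follows. The function names, the “[PAD]” placeholder, and the example length of 6 are hypothetical and are not taken from either reference.

```python
def pad_with_dummy(tokens, length, pad="[PAD]"):
    """Fill the row up to `length` with a dummy value that carries no information.

    Assumes len(tokens) <= length; longer inputs are returned unchanged.
    """
    return tokens + [pad] * (length - len(tokens))

def pad_by_repetition(tokens, length):
    """Fill the row up to `length` by cycling through the tokens again.

    Assumes `tokens` is non-empty.
    """
    return [tokens[i % len(tokens)] for i in range(length)]

sample = ["my", "dog", "is", "cute"]
dummy_padded = pad_with_dummy(sample, 6)
repeat_padded = pad_by_repetition(sample, 6)
print(dummy_padded)   # ['my', 'dog', 'is', 'cute', '[PAD]', '[PAD]']
print(repeat_padded)  # ['my', 'dog', 'is', 'cute', 'my', 'dog']
```

Both rows have the same fixed length; only the second fills the remaining positions with copies of a portion of the original tokens.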

REGARDING CLAIM 9, DEVLIN in view of WARD discloses the system of claim 8, wherein the copies of at least the portion of the group of tokens are copies of at least the portion of a first group of tokens (WARD Fig. 10; Par 101 – “Padding the shorter training samples by repetition has several advantages over padding with zeros or special characters indicating no data. When zeros or other meaningless data is used, no information is encoded and computation time is wasted in processing that data leading to slower learning or model convergence. By repeating the input sequence, the neural network can learn from all elements of the input, and there is no meaningless or throw-away padding present. The result is faster convergence and learning, better computational utilization, and better behaved and regularized models.”), wherein the set of training data further comprises copies of at least a portion of a second group of tokens in the set of groups of tokens (WARD Fig. 10; Par 98 – “Each of the training samples may be stored as a row of tensor 1000 to create a training batch. The length of the tensor 1000 in the time dimension is determined by the length of the longest sample 1002 in terms of the number of frames. Longest sample 1002 is not changed. Each of the shorter samples 1001, 1003, 1004, 1005, 1006 in the batch are repeated until they are the same length as the longest sample 1002, so that every row of the tensor has the same length.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of DEVLIN to include copies of the input sequence, as taught by WARD.
One of ordinary skill would have been motivated to include copies of the input sequence, in order to achieve faster convergence and learning (WARD Par 101).


REGARDING CLAIM 10, DEVLIN in view of WARD discloses the system of claim 9, wherein generating the set of training data comprises: 
generating a data structure having a defined length (WARD Par 98 – “Each of the training samples may be stored as a row of tensor 1000 to create a training batch. The length of the tensor 1000 in the time dimension is determined by the length of the longest sample 1002 in terms of the number of frames. Longest sample 1002 is not changed. Each of the shorter samples 1001, 1003, 1004, 1005, 1006 in the batch are repeated until they are the same length as the longest sample 1002, so that every row of the tensor has the same length.”); 
packing a first row of the data structure with the first group of tokens and the copies of at least the portion of the first group of tokens until the length of the first row of the data structure is filled up with tokens from the first group of tokens  (WARD Par 98 – “Each of the training samples may be stored as a row of tensor 1000 to create a training batch. The length of the tensor 1000 in the time dimension is determined by the length of the longest sample 1002 in terms of the number of frames. Longest sample 1002 is not changed. Each of the shorter samples 1001, 1003, 1004, 1005, 1006 in the batch are repeated until they are the same length as the longest sample 1002, so that every row of the tensor has the same length.”); and 
packing the second row of the data structure with the second group of tokens and the copies of at least the portion of the second group of tokens until the length of the second row of the data structure is filled up with tokens from the second group of tokens (WARD Par 98 – “Each of the training samples may be stored as a row of tensor 1000 to create a training batch. The length of the tensor 1000 in the time dimension is determined by the length of the longest sample 1002 in terms of the number of frames. Longest sample 1002 is not changed. Each of the shorter samples 1001, 1003, 1004, 1005, 1006 in the batch are repeated until they are the same length as the longest sample 1002, so that every row of the tensor has the same length.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of DEVLIN to include copies of the input sequence, as taught by WARD.
One of ordinary skill would have been motivated to include copies of the input sequence, in order to achieve faster convergence and learning (WARD Par 101).
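
For illustration only, the row-by-row packing described in WARD Par 98 might be sketched as follows, with a list of lists standing in for tensor 1000. The function name build_batch and the integer tokens are hypothetical.

```python
from itertools import cycle, islice

def build_batch(groups):
    """Return one row per group, each repeated out to the longest group's length.

    The longest group is left unchanged; every shorter group is cycled
    from its first element until its row is as long as the longest one.
    """
    longest = max(len(group) for group in groups)
    return [list(islice(cycle(group), longest)) for group in groups]

batch = build_batch([[1, 2, 3, 4, 5], [6, 7], [8, 9, 10]])
for row in batch:
    print(row)
# [1, 2, 3, 4, 5]
# [6, 7, 6, 7, 6]
# [8, 9, 10, 8, 9]
```

Every row ends up with the same length, and each row contains only tokens from its own group.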

REGARDING CLAIM 11, DEVLIN in view of WARD discloses the system of claim 10, wherein the instructions further cause the at least one processing unit to: 
determine a first set of embeddings comprising an embedding for each token in the first group of tokens and the copies of at least the portion of the first group of tokens (WARD Par 98 – “Each of the training samples may be stored as a row of tensor 1000 to create a training batch. The length of the tensor 1000 in the time dimension is determined by the length of the longest sample 1002 in terms of the number of frames. Longest sample 1002 is not changed. Each of the shorter samples 1001, 1003, 1004, 1005, 1006 in the batch are repeated until they are the same length as the longest sample 1002, so that every row of the tensor has the same length.”); 
determine a second set of embeddings comprising an embedding for each token in the second group of tokens and the copies of at least the portion of the second group of tokens (WARD Par 98 – “Each of the training samples may be stored as a row of tensor 1000 to create a training batch. The length of the tensor 1000 in the time dimension is determined by the length of the longest sample 1002 in terms of the number of frames. Longest sample 1002 is not changed. Each of the shorter samples 1001, 1003, 1004, 1005, 1006 in the batch are repeated until they are the same length as the longest sample 1002, so that every row of the tensor has the same length.”); and 
add the first set of embeddings to the second set of embeddings (WARD Fig. 10 Par 98 – “FIG. 10 illustrates an example of looping each of the shorter training samples in a training batch so that the shorter training samples are repeated until they are the same length as the longest training sample. … In an embodiment, shorter sample 1001 is repeated k times where k = floor ( N/M ) where N is the length of the longest sample and M is the length of shorter sample 1001, and the last repetition of shorter sample 1001 is of length Z=N mod M. Although only two dimensions of the tensor 1000 are illustrated, the tensor 1000 may have many more dimensions. For example, each row may be a multi-dimensional tensor, such as when the frames in the rows are multi-dimensional tensors.”; Note that Fig. 10 shows 6 rows of inputs stacked together as a batch.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of DEVLIN to include copies of the input sequence, as taught by WARD.
One of ordinary skill would have been motivated to include copies of the input sequence, in order to achieve faster convergence and learning (WARD Par 101).
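
For illustration only, the repetition counts quoted from WARD Par 98 (k = floor(N/M) full repetitions followed by a partial repetition of length Z = N mod M) might be worked through as follows. The function names and the example values N = 10 and M = 4 are hypothetical.

```python
def loop_counts(n, m):
    """Return (k, z) for a longest length n and a shorter sample length m.

    k is the number of full repetitions, z the length of the final
    partial repetition, so that k * m + z == n.
    """
    return n // m, n % m

def loop_sample(sample, n):
    """Repeat `sample` k full times, then append its first z elements."""
    k, z = loop_counts(n, len(sample))
    return sample * k + sample[:z]

k, z = loop_counts(10, 4)
row = loop_sample(list("abcd"), 10)
print(k, z)          # 2 2
print("".join(row))  # abcdabcdab
```

With N = 10 and M = 4, the shorter sample appears twice in full and is followed by its first two elements, giving a row of exactly 10 elements.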

REGARDING CLAIM 12, DEVLIN in view of WARD discloses the system of claim 9, wherein generating the set of training data comprises repeating the copies of at least the portion of the first group of tokens and the copies of at least the portion of the second group of tokens (WARD Par 98 – “Each of the training samples may be stored as a row of tensor 1000 to create a training batch. The length of the tensor 1000 in the time dimension is determined by the length of the longest sample 1002 in terms of the number of frames. Longest sample 1002 is not changed. Each of the shorter samples 1001, 1003, 1004, 1005, 1006 in the batch are repeated until they are the same length as the longest sample 1002, so that every row of the tensor has the same length.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of DEVLIN to include copies of the input sequence, as taught by WARD.
One of ordinary skill would have been motivated to include copies of the input sequence, in order to achieve faster convergence and learning (WARD Par 101).


REGARDING CLAIM 13, DEVLIN in view of WARD discloses the system of claim 8, wherein generating the set of training data comprises: 
generating a data structure having a defined length (WARD Par 96 – “In an embodiment, training is performed on a batch of utterances at a time. In some embodiments, the utterances in a training batch must be of the same length. Having samples of the same length may simplify tensor operations performed in the forward propagation and backward propagation stages, which may be implemented in part through matrix multiplications with matrices of fixed dimension. For the matrix operations to be performed, it may be necessary that each of the training samples have the same length”); and 
packing the data structure with the set of groups of tokens and the copies of the at least one portion of the group of tokens in the set of groups of tokens so that a total number of tokens packed into the data structure is equal to the defined length (WARD Par 98 – “The length of the tensor 1000 in the time dimension is determined by the length of the longest sample 1002 in terms of the number of frames. Longest sample 1002 is not changed. Each of the shorter samples 1001, 1003, 1004, 1005, 1006 in the batch are repeated until they are the same length as the longest sample 1002, so that every row of the tensor has the same length. The shorter samples are repeated exactly in all of their elements starting from the first element through the last element of the sample. When the length of a sample does not divide evenly into the length of the tensor, the last repetition of the sample may only be a partial repetition until the desired length is reached. The partial repetition is a repetition of the shorter sample starting from the first element and iteratively repeating through subsequent elements of the sample until the desired length is reached.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of DEVLIN to include copies of the input sequence, as taught by WARD.
One of ordinary skill would have been motivated to include copies of the input sequence, in order to achieve faster convergence and learning (WARD Par 101).


CLAIM 14 is a method performing the steps of Claim 1; thus, it is rejected under the same rationale.

CLAIM 15 is a method performing the steps of Claim 2; thus, it is rejected under the same rationale.

CLAIM 16 is a method performing the steps of Claim 3; thus, it is rejected under the same rationale.

CLAIM 17 is a method performing the steps of Claim 4; thus, it is rejected under the same rationale.

CLAIM 18 is a method performing the steps of Claim 5; thus, it is rejected under the same rationale.

CLAIM 19 is a method performing the steps of Claim 6; thus, it is rejected under the same rationale.

CLAIM 20 is a method performing the steps of Claim 7; thus, it is rejected under the same rationale.

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JONATHAN C KIM whose telephone number is (571)272-3327. The examiner can normally be reached Monday to Friday 8:00 AM thru 4:00 PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew C Flanders, can be reached on 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/JONATHAN C KIM/Primary Examiner, Art Unit 2655