Detailed Action
This action is in response to Applicant's communications filed 10 May 2021.  
Claims 1, 9, and 17 were amended.  No claims were cancelled.  No claims were withdrawn.  Therefore, claims 1, 3-4, 6-9, 11-12, 14-17, and 19-25 are pending in this Application.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendments/Arguments
Applicant's arguments/amendments, filed 10 May 2021, regarding the rejections of claims 1, 3-4, 6-9, 11-12, 14-17, and 19-25 under 35 USC 103 have been fully considered but are not persuasive.
Applicant argues that the prior art (Flake, Yaghoobzadeh, Ma, and Dredze) does not either alone or in combination disclose or suggest amended independent claims 1, 9, and 17.  The amended claim language from claim 1 is reproduced below:
"in response to determining that the service provider is not known to the information extraction system, as indicated by the generation of the predetermined class label by the CNN processing module, triggering a recurrent neural network (RNN) processing module of the information extraction system, comprising


Applicant argues that the cited portions of Dredze and Ma do not disclose the amended limitations.  Applicant argues (Remarks, p. 11-12) that while Dredze discloses a NIL label regarding unknown entities, Dredze only discusses the use of a single model, and so does not disclose a second model that can be triggered in response to the output of the first model.  While Dredze discloses using SVM models for ranking terms, it uses several variations of the model with different features, thus teaching multiple models.  For example, Dredze treats known entities one way (sec. 5, p. 280-282) and unknown entities a different way (sec. 6, p. 282).  In sec. 6, p. 282, Dredze teaches a modified model that specifically incorporates features for unknown entities ("We learn when to predict NIL using the SVM ranker by augmenting Y to include NIL, which then has a single feature unique to NIL answers.  It can be shown that (modulo slack variables) this is equivalent to learning a single threshold τ for NIL predictions as in Bunescu and Pasca (2006).  Incorporating NIL into the ranker has several advantages. First, the ranker can set the threshold optimally without hand tuning. Second, since the SVM scores are relative within a single example and cannot be compared across examples, setting a single threshold is difficult. Third, a threshold sets a uniform standard across all examples, whereas in practice we may have reasons to favor a NIL prediction in a given example. We design features for NIL prediction that cannot be captured in a single parameter." sec. 6, p. 282).  Therefore, Dredze teaches that, in response to entities being unknown, a second model is used, and thus teaches the argued limitations.
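For illustration only, the NIL handling quoted from Dredze (sec. 6) can be sketched as a ranker whose candidate set is augmented with a NIL entry, with the top-ranked candidate returned. The linear scoring function, the feature names (`name_match`, `is_nil`), and the weights below are hypothetical placeholders and are not taken from Dredze's trained SVM ranker.

```python
from typing import Dict

NIL = "NIL"  # marker meaning "no matching KB entry"

def score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Linear score standing in for a trained ranker (assumption)."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def link(candidates: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> str:
    """Rank the KB candidates together with NIL and return the top-ranked entry."""
    augmented = dict(candidates)
    # NIL carries one feature unique to NIL answers; its weight behaves like a threshold.
    augmented[NIL] = {"is_nil": 1.0}
    return max(augmented, key=lambda entry: score(augmented[entry], weights))

# Hypothetical weights: a candidate must outscore the NIL entry to be returned.
weights = {"name_match": 2.0, "is_nil": 1.0}
known = link({"Acme Corp": {"name_match": 0.9}}, weights)
unknown = link({"Acme Corp": {"name_match": 0.2}}, weights)
```

With these placeholder weights, the strong match is linked to its entry and the weak match falls below the NIL score, so NIL is returned.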
Applicant argues (Remarks, p. 12-13) that while Ma teaches a CNN in combination with a BLSTM network, it does not disclose or suggest triggering the BLSTM network in response to a determination that the sequence or words in the sequence are not known to the system.  Examiner notes, as discussed above, that Dredze teaches the limitations of triggering a model in response to determining that the service provider is not known to the information extraction system.  Dredze does not disclose that the model is an RNN.  Instead, Ma is relied on to teach using an RNN as the second model.  It would have been obvious to one of ordinary skill in the art at the time of filing to replace the second model of Dredze with the RNN of Ma, as Ma's architecture taught a comprehensive solution with high accuracy for word embeddings.  Therefore, the combination of Dredze and Ma teaches the limitations of the claim.
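The disputed control flow, in which a second model runs only when the first model emits the predetermined "unknown" label, can be sketched as follows. This is a minimal sketch under stated assumptions: `UNKNOWN`, `extract_provider`, and both stub models are hypothetical names introduced here, and the stubs merely stand in for the CNN classifier and RNN locator recited in the claim.

```python
from typing import Callable, List, Optional

UNKNOWN = "<UNKNOWN>"  # hypothetical predetermined class label

def extract_provider(
    tokens: List[str],
    classify: Callable[[List[str]], str],
    locate: Callable[[List[str]], Optional[int]],
) -> Optional[str]:
    """Return the provider name, triggering the second stage only on UNKNOWN."""
    label = classify(tokens)
    if label != UNKNOWN:
        return label  # provider is known: the class label is the name
    index = locate(tokens)  # second stage runs only for unknown providers
    return tokens[index] if index is not None else None

second_stage_calls: List[List[str]] = []

def stub_classify(tokens: List[str]) -> str:
    # Stand-in for the first-stage classifier (assumption).
    return "ACME" if "ACME" in tokens else UNKNOWN

def stub_locate(tokens: List[str]) -> Optional[int]:
    # Stand-in for the second-stage locator: picks the longest token (assumption).
    second_stage_calls.append(tokens)
    return max(range(len(tokens)), key=lambda i: len(tokens[i]))

result_known = extract_provider(["POS", "ACME", "0042"], stub_classify, stub_locate)
result_unknown = extract_provider(["POS", "ZETAMART", "0042"], stub_classify, stub_locate)
```

Only the second record reaches the locator, which is the conditional triggering at issue in the amended claim language.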
The rejection of the dependent claims for depending from rejected claims is maintained.
For the aforementioned reasons, claims 1, 3-4, 6-9, 11-12, 14-17, and 19-25 are rejected under 35 USC 103.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any 
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to 
Claims 1, 3, 6-9, 11-12, 14-17, 19-25 are rejected under 35 U.S.C. 103 as being unpatentable over Flake et al. (US 2008/0154704, hereinafter "Flake") in view of Yaghoobzadeh et al. (Multi-level Representations for Fine-Grained Typing of Knowledge Base Entities, hereinafter "Yaghoobzadeh"), Ma et al. (End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF, hereinafter "Ma"), and Dredze et al. (Entity Disambiguation for Knowledge Base Population, hereinafter "Dredze").

Regarding Claim 1,
Flake teaches a method comprising: 
receiving, by an information extraction system (FIG. 1, Extraction component 110) including one or more processors ("processor" [0072]), a transaction record (FIG. 1, Receipt 104), the transaction record including a plurality of tokens (FIG. 2; "descriptor 210 will contain abbreviations" [0037]; "Legend 308 can provide vendor-specific information as to all or portions of the data included in receipt 104 such as, e.g. abbreviations used" [0042]), the transaction record describing a transaction served by a service provider ("Generally, system 100 can include scanning component 102 that can read receipt 104, wherein receipt 104 can relate to a transaction between vendor 106 and consumer 108... Appreciably, vendor 106 is intended to include retailers, advertisers, or agents thereof, or substantially any business establishment that solicits transactions from consumers and/or customer 108." [0025]);
determining a classification of a service provider ("Identification of relevant transaction data 114 can be accomplished in a general manner in which all or substantially all data included in receipt 104 that can potentially be relevant for transaction verification or feedback purposes can be identified and extracted... As another example, a third application (e.g., an application directed to engagement-based rewards) might consider identification of vendor 106 to be relevant" [0028])
using neural networks ("Various classification (explicitly and/or implicitly trained) schemes and/or systems (e.g. support vector machines, neural networks, expert systems, Bayesian belief networks, fuzzy logic, data fusion engines . . . ) can be employed in connection with performing automatic and/or inferred action in connection with the claimed subject matter." [0055]), and
wherein the particular token ("It should be appreciated that in most cases, descriptor 210 will contain abbreviations for the item and/or brand information that can be expanded, augmented, or translated to full item descriptions by extraction component 110. For example, the initial item listed in block 210 is “H C SAL MIX”. Extraction component 110 can translate this to text-based data from an image of receipt 104, and can further expand this particular descriptor 210" [0037]) identifies the name of the service provider ("Generally, system 100 can include scanning component 102 that can read receipt 104, wherein receipt 104 can relate to a transaction between vendor 106 and consumer 108... Appreciably, vendor 106 is intended to include retailers, advertisers, or agents thereof, or substantially any business establishment that solicits transactions from consumers and/or customer 108." [0025]); and
generating, by the information extraction system, a report specifying the name of the service provider in the transaction record ("For example, extraction component 110 can communicate with a data store maintained by vendor 106 such as vendor data store 310 in order to retrieve legend 308. Legend 308 can provide vendor-specific information as to all or portions of the data included in receipt 104 such as, e.g. abbreviations used" [0042]).

Flake does not explicitly teach determining, by a convolutional neural network (CNN) processing module of the information extraction system, a classification of the transaction record based on a collection of parameters that the CNN processing module learned from first training data, and if determining that the service provider is known to the information extraction system, generating a class label identifying a name of the service provider.
Flake also does not explicitly teach the determining comprising determining whether the service provider is known to the information extraction system or is not known to the information extraction system; and if determining that the service provider is not known to the information extraction system, generating a predetermined class label indicating that the service provider is not a known service provider, wherein the CNN processing module has been configured through training to generate the predetermined class label in response to processing a transaction record corresponding to a service 
Flake also does not explicitly teach in response to the CNN processing module, triggering a recurrent neural network (RNN) processing module of the information extraction system, comprising locating, by the RNN processing module and based at least in part on a character embedding model and a word embedding model both of which the RNN processing module learned from second training data, a particular token of the plurality of tokens of the transaction record.

Yaghoobzadeh teaches determining, by a convolutional neural network (CNN) processing module of the information extraction system ("We experiment with four architectures to produce character-level representations in this paper: FORWARD (direct forwarding of character embeddings), CNNs, LSTMs and BiLSTMs." sec. 3.3, p. 4), a classification of the transaction record (Table 4, "Table 4 shows that the CNN works only slightly better... on known entities, but works much better on unknown entities" sec. 4.3, p. 8) based on a collection of parameters that the CNN processing module learned from first training data ("we divide test entities into known entities – at least one word of the entity’s name appears in a train entity – and unknown entities (the complement). There are 45,000 (resp. 15,000) known (resp. unknown) test entities." sec. 4.3, p. 8); and
if determining that the service provider is known to the information extraction system, generating a class label identifying a name of the service provider ("We can take the standard method of learning embeddings for words and extend it to learning embeddings for entities. This requires the use of an entity linker and can be implemented by replacing all occurrences of the entity by a unique token." sec. 1, p. 1), and 
Flake and Yaghoobzadeh are analogous art because both are directed to using neural networks for automatic information processing of entities. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the information processing system of Flake with the convolutional neural network of Yaghoobzadeh.  The modification would have been obvious because one of ordinary skill would be motivated to use models proven more effective, as suggested by Yaghoobzadeh (Yaghoobzadeh: sec. 4.2, p.7).
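The known/unknown partition quoted from Yaghoobzadeh (sec. 4.3), under which a test entity is "known" when at least one word of its name appears in a train entity, can be sketched as below. Lowercasing and whitespace tokenization are assumptions made for the sketch, and the entity names are invented examples.

```python
from typing import List, Tuple

def split_known_unknown(
    train_names: List[str], test_names: List[str]
) -> Tuple[List[str], List[str]]:
    """Split test entities by whether any word of the name occurs in a train name."""
    # Assumption: words are compared case-insensitively after whitespace splitting.
    train_words = {word for name in train_names for word in name.lower().split()}
    known: List[str] = []
    unknown: List[str] = []
    for name in test_names:
        if any(word in train_words for word in name.lower().split()):
            known.append(name)
        else:
            unknown.append(name)
    return known, unknown

known_entities, unknown_entities = split_known_unknown(
    ["Acme Hardware", "Blue Cafe"],
    ["Acme Foods", "Zeta Mart"],
)
```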

Dredze teaches the determining comprising determining whether the service provider is known to the information extraction system or is not known to the information extraction system ("We define entity linking as matching a textual entity mention, possibly identified by a named entity recognizer, to a KB entry, such as a Wikipedia page that is a canonical entry for that entity. An entity linking query is a request to link a textual entity mention in a given document to an entry in a KB. The system can either return a matching entry or NIL to indicate there is no matching entry." sec. 3, p. 278);
if determining that the service provider is not known to the information extraction system, generating a predetermined class label indicating that the service provider is not a known service provider ("We just recently became aware of a system fielded by Li et al. at the TAC-KBP 2009 evaluation (2009). Their approach bears a number of similarities to ours; both systems create candidate sets and then rank possibilities using differing learning methods, but the principal difference is in our approach to NIL prediction. Where we simply consider absence (i.e., the NIL candidate) as another entry to rank, and select the top-ranked option, they use a separate binary classifier to decide whether their top prediction is correct, or whether NIL should be output. We believe relying on features that are designed to inform whether absence is correct is the better alternative." sec. 2, p. 278; NIL teaches the predetermined class label; "We learn when to predict NIL using the SVM ranker by augmenting Y to include NIL, which then has a single feature unique to NIL answers.  It can be shown that (modulo slack variables) this is equivalent to learning a single threshold τ for NIL predictions as in Bunescu and Pasca (2006).  Incorporating NIL into the ranker has several advantages." sec. 6, p. 282),
wherein the processing module has been configured through training ("We learned a new model on the training data above using a reduced feature set to increase speed." sec. 7.3, p. 284) to generate the predetermined class label in response to processing a transaction record corresponding to a service provider that is not known to the information extraction system ("The system can either return a matching entry or NIL to indicate there is no matching entry." sec. 3, p. 278); and
determining that the service provider is not known to the information extraction system ("We define entity linking as matching a textual entity mention, possibly identified by a named entity recognizer, to a KB entry, such as a Wikipedia page that is a canonical entry for that entity. An entity linking query is a request to link a textual entity mention in a given document to an entry in a KB. The system can either return a matching entry or NIL to indicate there is no matching entry." sec. 3, p. 278), as indicated by the generation of the predetermined class label ("We just recently became aware of a system fielded by Li et al. at the TAC-KBP 2009 evaluation (2009). Their approach bears a number of similarities to ours; both systems create candidate sets and then rank possibilities using differing learning methods, but the principal difference is in our approach to NIL prediction. Where we simply consider absence (i.e., the NIL candidate) as another entry to rank, and select the top-ranked option, they use a separate binary classifier to decide whether their top prediction is correct, or whether NIL should be output. We believe relying on features that are designed to inform whether absence is correct is the better alternative." sec. 2, p. 278; NIL teaches the predetermined class label; "We learn when to predict NIL using the SVM ranker by augmenting Y to include NIL, which then has a single feature unique to NIL answers.  It can be shown that (modulo slack variables) this is equivalent to learning a single threshold τ for NIL predictions as in Bunescu and Pasca (2006).  Incorporating NIL into the ranker has several advantages." sec. 6, p. 282).
Flake and Dredze are analogous art because both are directed to identifying text entities using machine learning. It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the information processing system of the Flake/Yaghoobzadeh combination with the NIL identifier of Dredze.  The modification would have been obvious because one of ordinary skill in the art would be motivated to incorporate NIL to optimally set thresholds without hand tuning and handling variations 

Dredze further teaches in response to determining that the service provider is not known to the information extraction system ("We define entity linking as matching a textual entity mention, possibly identified by a named entity recognizer, to a KB entry, such as a Wikipedia page that is a canonical entry for that entity. An entity linking query is a request to link a textual entity mention in a given document to an entry in a KB. The system can either return a matching entry or NIL to indicate there is no matching entry." sec. 3, p. 278), as indicated by the generation of the predetermined class label by the CNN processing module ("We just recently became aware of a system fielded by Li et al. at the TAC-KBP 2009 evaluation (2009). Their approach bears a number of similarities to ours; both systems create candidate sets and then rank possibilities using differing learning methods, but the principal difference is in our approach to NIL prediction. Where we simply consider absence (i.e., the NIL candidate) as another entry to rank, and select the top-ranked option, they use a separate binary classifier to decide whether their top prediction is correct, or whether NIL should be output. We believe relying on features that are designed to inform whether absence is correct is the better alternative." sec. 2, p. 278; NIL teaches the predetermined class label; "We learn when to predict NIL using the SVM ranker by augmenting Y to include NIL, which then has a single feature unique to NIL answers.  It can be shown that (modulo slack variables) this is equivalent to learning a single threshold τ for NIL predictions as in Bunescu and Pasca (2006).  Incorporating NIL into the ranker has several advantages." sec. 6, p. 282), triggering a classifier ("So far we have assumed that each example has a correct KB entry; however, when run over a large corpus, such as news articles, we expect a significant number of entities will not appear in the KB. Hence it will be useful to predict NILs. We learn when to predict NIL using the SVM ranker by augmenting Y to include NIL, which then has a single feature unique to NIL answers.  It can be shown that (modulo slack variables) this is equivalent to learning a single threshold τ for NIL predictions as in Bunescu and Pasca (2006).  Incorporating NIL into the ranker has several advantages. First, the ranker can set the threshold optimally without hand tuning. Second, since the SVM scores are relative within a single example and cannot be compared across examples, setting a single threshold is difficult. Third, a threshold sets a uniform standard across all examples, whereas in practice we may have reasons to favor a NIL prediction in a given example. We design features for NIL prediction that cannot be captured in a single parameter." sec. 6, p. 282) based at least in part on a character embedding model ("We represent each query by a D dimensional vector x" sec. 5, p. 280; "We added features for character Dice, skip bigram Dice, and left and right Hamming distance scores.  Features were set based on quantized scores.  These were useful for detecting minor spelling variations or mistakes") and a word embedding model ("Another measure of surface similarity between a query and a candidate was computed by training finite-state transducers similar to those described in Dreyer et al. (2008). These transducers assign a score to any string pair by summing over all alignments and scoring all contained character n-grams; we used n-grams of length 3 and less. The scores are combined using a global log-linear model. Since different spellings of a name may vary considerably in length (e.g., J Miller vs. Jennifer Miller) we eliminated the limit on consecutive insertions used in previous applications." sec. 5.2, p. 281).
However, Dredze does not teach that the classifier is a recurrent neural network (RNN) processing module of the information extraction system, comprising locating, by the RNN processing module and based at least in part on a character embedding model and a word embedding model both of which the RNN processing module learned from second training data, a particular token of the plurality of tokens of the transaction record.
Ma discloses CNN-RNN combinations for named entity recognition.  Ma teaches triggering a recurrent neural network (RNN) processing module of the information extraction system ("For each word, the character-level representation is computed by the CNN in Figure 1 with character embeddings as inputs. Then the character-level representation vector is concatenated with the word embedding vector to feed into the BLSTM network" sec. 2.4, p. 3), comprising
locating, by the RNN processing module ("LSTMs are variants of RNNs" sec. 2.2.1, p. 2; "bi-directional LSTM" sec. 2.2.2, p.3) and based at least in part on a character embedding model and a word embedding model ("For each word, the character-level representation is computed by the CNN in Figure 1 with character embeddings as inputs. Then the character-level representation vector is concatenated with the word embedding vector to feed into the BLSTM network. Finally, the output vectors of BLSTM are fed to the CRF layer to jointly decode the best label sequence." sec. 2.4, p.3) both of which the RNN processing module learned (Network Training, sec. 3, p. 4-5), a particular token of the plurality of tokens of the transaction record (Table 2, tokens; Table 8, number of tokens).
Flake and Ma are analogous art because both are directed to using neural networks for automatic information processing of named entities. It would have been obvious to one of ordinary skill in the art before the effective filing date to replace the classifier taught by Dredze in the Flake/Yaghoobzadeh/Dredze combination with the RNN-based structure of Ma.  The modification would have been obvious because one of ordinary skill would be motivated to have a universal system applicable to many tasks (Ma: "Our system is truly end-to-end, requiring no feature engineering or data pre-processing, thus making it applicable to a wide range of sequence labeling tasks on different languages." Abstract, p.1) that achieves high accuracy (Ma: "We obtain state-of-the-art performance on both the two data – 97.55% accuracy for POS tagging and 91.21% F1 for NER" Abstract, p. 1).

Regarding Claim 3,
The Flake/Yaghoobzadeh/Ma/Dredze combination teaches the method of claim 1.  Flake, Yaghoobzadeh, and Ma combined further teach wherein:
each service provider that is known to the information extraction system serves more transactions than each service provider that is not known to the information extraction system (Yaghoobzadeh: "We take the most frequent name for dev and test entities and three most frequent names for train (each one tagged with entity types)." sec. 4.1, p. 5; Flake: "Generally, system 100 can include scanning component 102 that can read receipt 104, wherein receipt 104 can relate to a transaction between vendor 106 and consumer 108... Appreciably, vendor 106 is intended to include retailers, advertisers, or agents thereof, or substantially any business establishment that solicits transactions from consumers and/or customer 108." [0025]; it would have been obvious to one of ordinary skill in the art that the most frequent vendors on receipts are serving more transactions),
each token is a character sequence having an arbitrary length (Yaghoobzadeh: Figure 1, Size of output layer is |T|; "We implement the following two feature sets from the literature as a hand-crafted baseline for our character and word level models. (i) BOW: individual words of entity name (both as-is and lowercased); (ii) NSL (ngram-shape-length): shape and length of the entity name (cf. Ling and Weld (2012)), character n-grams, 1 ≤ n ≤ nmax, nmax = 5 (we also tried nmax = 7, but results were worse on dev) and normalized character n-grams" sec. 4.1, p. 5),
the first training data includes a plurality of first examples, wherein each first example includes a class number associated with a name (Flake: "Receipt 104 can include information 202 associated with vendor 106 such as name, location or mailing address, phone number or other contact information, a universal resource locator, and so forth." [0035]; phone numbers teach a class number associated with a name), and
the second training data includes a plurality of second examples, wherein each second example includes character begin and end positions of one or more entities (Ma: Figure 1, char representation; Figure 3, char representation).
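As a purely illustrative reading of the limitation "each second example includes character begin and end positions of one or more entities," the sketch below converts character spans into per-token labels. The record text, the `PROVIDER`/`O` label names, and the end-exclusive span convention are assumptions introduced here and do not come from the claims or from Ma.

```python
from typing import List, Tuple

def label_tokens(
    text: str, entity_spans: List[Tuple[int, int]]
) -> List[Tuple[str, str]]:
    """Label each whitespace token PROVIDER if it lies inside an entity span."""
    labels: List[Tuple[str, str]] = []
    position = 0
    for token in text.split():
        begin = text.index(token, position)  # character offset where the token starts
        end = begin + len(token)             # end offset, exclusive (assumption)
        position = end
        inside = any(start <= begin and end <= stop for start, stop in entity_spans)
        labels.append((token, "PROVIDER" if inside else "O"))
    return labels

# Hypothetical training example: the entity occupies characters 10 to 19.
example = label_tokens("POS DEBIT ZETA MART 0042", [(10, 19)])
```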



Regarding Claim 6,
The Flake/Yaghoobzadeh/Ma/Dredze combination teaches the method of claim 1.  Flake and Yaghoobzadeh further teach wherein the training data comprises a sample of transactions served by a subset of all service providers, the subset being selected from all service providers based on a ratio between a number of transactions served by the subset of service providers over a number of transactions served by all service providers (Yaghoobzadeh: "We use the same train (50%), dev (20%) and test (30%) partitions ... and extract the names from mentions of dataset entities in the corpus." sec. 4.1, p. 5; this teaches using a subset of entities for training that has the same frequency as the real data; Flake: "Generally, system 100 can include scanning component 102 that can read receipt 104, wherein receipt 104 can relate to a transaction between vendor 106 and consumer 108... Appreciably, vendor 106 is intended to include retailers, advertisers, or agents thereof, or substantially any business establishment that solicits transactions from consumers and/or customer 108." [0025]; this teaches that the entities are service providers).
The motivation to combine Flake and Yaghoobzadeh is the same as the motivation for claim 1.
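One possible reading of the ratio-based subset selection recited in claim 6 is sketched below: providers are taken in order of transaction count until the selected subset accounts for a target share of all transactions. The greedy most-frequent-first strategy, the function name, and the 0.8 target are assumptions for illustration; neither the claim nor the cited references prescribe this procedure.

```python
from collections import Counter
from typing import List

def select_known_providers(transactions: List[str], target_ratio: float) -> List[str]:
    """Pick the most frequent providers until they cover target_ratio of transactions."""
    counts = Counter(transactions)
    total = sum(counts.values())
    selected: List[str] = []
    covered = 0
    for provider, count in counts.most_common():
        if covered / total >= target_ratio:
            break  # the subset already serves the required share
        selected.append(provider)
        covered += count
    return selected

# Hypothetical sample: one provider name per transaction record.
sample = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
subset = select_known_providers(sample, target_ratio=0.8)
```

Here providers A and B together serve 9 of 10 transactions, so C is left outside the subset.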

Regarding Claim 7,
Yaghoobzadeh and Ma combined further teach: wherein locating the particular token comprises:
scanning each token in the transaction record by a character-level model to generate a character-level output (Yaghoobzadeh: "For computing the character level representation (CLR), we design models that try to type an entity based on the sequence of characters of its name." sec. 3.3, p. 4; Ma: "For each word, the character-level representation is computed by the CNN in Figure 1 with character embeddings as inputs" sec. 2.4, p. 3), wherein the scanning is based on character embedding instances corresponding to characters in the transaction record (Yaghoobzadeh: "The first layer of the character-level models is a lookup table that maps each character to an embedding of size dc. These embeddings capture similarities between characters, e.g., similarity in type of phoneme encoded (consonant/vowel) or similarity in case (lower/upper)." sec. 3.3, p. 4; Ma: "For each word, the character-level representation is computed by the CNN in Figure 1 with character embeddings as inputs" sec. 2.4, p. 3);
generating a respective initial representation of each token in the transaction record from the character-level output (Yaghoobzadeh: "The BiLSTM entity representation is the concatenation of last states of forward and backward LSTMs, i.e., v(e) ∈ R^(2·d_h)." sec. 3.3, p. 4; Ma: "Then the character-level representation vector is concatenated with the word embedding vector to feed into the BLSTM network.", sec. 2.4, p. 3);
(Ma: "An elegant solution whose effectiveness has been proven by previous work ... is bi-directional LSTM (BLSTM). The basic idea is to present each sequence forwards and backwards to two separate hidden states to capture past and future information, respectively. Then the two hidden states are concatenated to form the final output." sec. 2.2.2, p.3), the scanning by the token-level model including a token-level forward scan (Ma: Figure 3: Forward LSTM) and a token-level backward scan (Ma: Figure 3: Backward LSTM);
generating a respective final representation of each token in the transaction record from the token-level output (Ma: Figure 3, "Then the character-level representation vector is concatenated with the word embedding vector to feed into the BLSTM network.  Finally, the output vectors of BLSTM are fed to the CRF layer to jointly decode the best label sequence." sec. 3.3, p. 3);
classifying each token by feeding final representations of the tokens to a softmax layer (Yaghoobzadeh: softmax, sec. 4.1, p. 6) that produces a respective probability vector for each token (Ma: "The probabilistic model for sequence CRF defines a family of conditional probability" sec. 2.3, p. 3); and
determining that the particular token identifies the name of the service provider based on results of the classifying (Ma: "Decoding is to search for the label sequence y* with the highest conditional probability" sec. 2.3, p. 3; the entity with the highest probability is chosen; in this case, entities represent service providers as taught in Flake).
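The last two steps of claim 7, a softmax layer producing a probability vector per token followed by selection of the provider token, can be sketched in simplified form. The sketch uses an independent per-token softmax and an argmax, which is an assumption made for brevity and is not Ma's CRF sequence decoding; the scores and the two-class layout (class 1 meaning "provider name") are invented.

```python
import math
from typing import List, Tuple

def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one token's class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pick_provider_token(
    tokens: List[str], token_scores: List[List[float]], provider_class: int = 1
) -> Tuple[str, List[List[float]]]:
    """Return the token with the highest provider-class probability."""
    probabilities = [softmax(scores) for scores in token_scores]
    best = max(range(len(tokens)), key=lambda i: probabilities[i][provider_class])
    return tokens[best], probabilities

# Hypothetical final-layer scores for three tokens over classes [other, provider].
name, probs = pick_provider_token(
    ["POS", "ZETAMART", "0042"],
    [[2.0, 0.1], [0.2, 3.0], [1.5, 0.3]],
)
```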


Regarding Claim 8,
The Flake/Yaghoobzadeh/Ma/Dredze combination teaches the method of claim 7.  Yaghoobzadeh and Ma combined further teach: wherein the character-level model is a long short-term memory (LSTM) model (Yaghoobzadeh: LSTM, sec. 3.3 Character-level representation, p. 4; it is noted that the claim language only requires one model.  However, the references teach additional models), a many-to-one bidirectional LSTM (BI-LSTM) model (Yaghoobzadeh: BiLSTM, sec. 3.3 Character-level representation, p. 4), or a one-dimensional CNN model (Yaghoobzadeh: Figure 2, CNN, sec. 3.3 Character-level representation, p. 4), and 
the token-level model is a many-to-many BI-LSTM model (Ma: Figure 3, main architecture of the depicted BLSTM shows many input and many output).
The motivation to combine Flake, Yaghoobzadeh, and Ma is the same as the motivation for claim 1.

Regarding Claim 21,
The Flake/Yaghoobzadeh/Ma/Dredze combination teaches the method of claim 1.  Flake further teaches wherein the particular token includes a shortened name of the service provider (Flake: "It should be appreciated that in most cases, descriptor 210 will contain abbreviations" [0037]; it is noted that the claim language only requires one feature).

Regarding Claim 22,
	The Flake/Yaghoobzadeh/Ma/Dredze combination teaches the method of claim 7.  Yaghoobzadeh and Ma further teach wherein the character-level model is a many-to-one bi-directional long short-term memory (BI-LSTM) model (Yaghoobzadeh: BiLSTM, sec. 3.3 Character-level representation, p. 4) and the token-level model is a many-to-many BI-LSTM (Ma: Figure 3, main architecture of the depicted BLSTM shows many input and many output);
	scanning each token in the transaction record by the character-level model (Ma: "An elegant solution whose effectiveness has been proven by previous work ... is bi-directional LSTM (BLSTM). The basic idea is to present each sequence forwards and backwards to two separate hidden states to capture past and future information, respectively. Then the two hidden states are concatenated to form the final output." sec. 2.2.2, p.3) comprises, for each token in the transaction record,
	performing a character-level forward scan of the character embedding instances of the token to generate a forward character-level representation of the token (Ma: Figure 3: Forward LSTM), and
performing a character-level backward scan of the character embedding instances of the token to generate a backwards character-level representation of the token (Ma: Figure 3: Backward LSTM); and
generating a respective initial representation of each token in the transaction record comprises, for each token in the transaction record, concatenating i) the forward character-level representation of the token, ii) the backward character-level (Yaghoobzadeh: "The BiLSTM entity representation is the concatenation of last states of forward and backward LSTMs, i.e., v(e) ∈ R2∗dh." sec. 3.3, p. 4; Ma: "Then the character-level representation vector is concatenated with the word embedding vector to feed into the BLSTM network.", sec. 2.4, p. 3).
The motivation to combine Flake, Yaghoobzadeh, and Ma is the same as the motivation for claim 1.
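
For illustration only, the forward scan, backward scan, and concatenation discussed above for claim 22 may be sketched as follows. This is a minimal sketch under stated assumptions: a plain tanh recurrence stands in for an LSTM cell, and the function names scan, char_level_representation, and initial_token_representation are hypothetical. The final concatenation with a token (word) embedding follows the passage quoted from Ma, sec. 2.4.

```python
import math


def scan(embeddings, weight=0.5):
    """Many-to-one scan: fold a sequence of vectors into one final state.

    Assumption: a simple tanh recurrence replaces the LSTM cell of the references.
    """
    state = [0.0] * len(embeddings[0])
    for vector in embeddings:
        state = [math.tanh(weight * s + x) for s, x in zip(state, vector)]
    return state


def char_level_representation(char_embeddings):
    """Concatenate the last states of the forward and backward scans."""
    forward = scan(char_embeddings)
    backward = scan(list(reversed(char_embeddings)))
    return forward + backward


def initial_token_representation(char_embeddings, token_embedding):
    """Concatenate the character-level representation with the token embedding."""
    return char_level_representation(char_embeddings) + token_embedding


chars = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]  # hypothetical character embeddings
token = initial_token_representation(chars, [1.0, 2.0, 3.0])
```

With a hidden size of 2, the character-level representation has 2 * 2 = 4 entries, consistent with the dimension R2∗dh quoted from Yaghoobzadeh, and the token embedding is appended after it.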

Regarding Claims 9, 11-12, 14-16, 23-24,
Claims 9, 11-12, 14-16, 23-24 recite a system including a processor (Flake: "processor" [0072]) and memory (Flake: "computer-readable media" [0070]) comprising instructions for performing functions corresponding to the method steps performed by a processor recited in claims 1, 3-4, 6-8, 21-22, respectively.  The Flake/Yaghoobzadeh/Ma/Dredze combination teaches the limitations of claims 9, 11-12, 14-16, 23-24 as set forth above in connection with claims 1, 3-4, 6-8, 21-22.  Therefore, claims 9, 11-12, 14-16, 23-24 are rejected under the same rationale as respective claims 1, 3-4, 6-8, 21-22.

Regarding Claims 17, 19-20, 25,
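Claims 17, 19-20, 25 recite a non-transitory computer readable medium (Flake: "computer-readable media" [0070]) storing instructions that, when executed by a processor (Flake: "processor" [0072]), perform functions corresponding to the method steps performed by a processor as recited in claims 1, 3-4, 21, respectively.  The Flake/Yaghoobzadeh/Ma/Dredze combination teaches the limitations of claims 17, 19-20, 25 as set forth above in connection with claims 1, 3-4, 21.  Therefore, claims 17, 19-20, 25 are rejected under the same rationale as respective claims 1, 3-4, 21.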

Claims 4, 12, 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Flake et al. (US 2008/0154704, Hereinafter "Flake") in view of Yaghoobzadeh et al. (Multi-level Representations for Fine-Grained Typing of Knowledge Base Entities, Hereinafter "Yaghoobzadeh"), Ma et al. (End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF, Hereinafter "Ma"), Dredze et al. (Entity Disambiguation for Knowledge Base Population, hereinafter "Dredze"), and Yao et al. (US2018/0032844, Hereinafter "Yao").

Regarding Claim 4,
The Flake/Yaghoobzadeh/Ma/Dredze combination teaches the method of claim 1.  Yaghoobzadeh further teaches wherein determining the classification of the transaction record comprises: transforming a representation of the transaction record through a series of convolutional layers and pooling layers (Yaghoobzadeh: Figure 2, Convolution layer, max pooling).

The Flake/Yaghoobzadeh/Ma/Dredze combination does not explicitly teach each layer being generated by sliding one or more kernels over output of a previous convolutional layer.

Yao teaches each layer being generated by sliding one or more kernels over output of a previous convolutional layer (FIG. 3; "Also as shown, convolutional layer 302 may receive input layer 301 (e.g., having an input size of 225×225×3) and convolution kernels applied via convolutional layer 302 and ReLU, max pooling, and LRN 312 may provide feature maps 313 having an output size of 55×55×96. For example, at convolutional layer 302, multiple convolution kernels such as convolution kernel 311 may be applied to input layer 301. Such convolution kernels may be convolved with input layer 301 for example. In some instances, such convolution kernels may be characterized as filters, convolution filters, color filters, or the like. For example, the multiple convolution kernels applied at convolutional layer 302 may include 96 7×7 convolution kernels (e.g., with each convolution kernel associated with one of the 96 55×55 feature maps 313) having a stride of 2. For example, subsequent to applying convolution kernels such as convolutional kernels 311, ReLU, max pooling, and LRN 312 may be applied to generate feature maps 313." [0033]); and
determining the classification of the transaction record based on a final pooling layer of the pooling layers by feeding an output of the transforming to a fully connected feed forward network (FIG. 3; "In the illustrated example, deep CNN 206 includes 5 convolutional layers 302-306 and 3 fully connected layers 308-310." [0031]).
(Yao: [0004]).
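
For illustration only, the limitations discussed above for claim 4 (convolutional layers generated by sliding kernels over the output of the previous layer, pooling layers, and a fully connected feed forward network applied to the final pooled output) may be sketched as follows. This is a one-dimensional sketch under stated assumptions; the function names, kernel values, and weights are hypothetical and are not taken from Yaghoobzadeh or Yao.

```python
def conv1d(signal, kernel):
    """Slide the kernel over the signal and take a dot product at each position."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def relu(values):
    return [max(0.0, v) for v in values]


def max_pool(values, size=2):
    """Keep the maximum of each non-overlapping window of the given size."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


def fully_connected(features, weights, biases):
    return [sum(f * w for f, w in zip(features, row)) + b
            for row, b in zip(weights, biases)]


def classify(record_repr, kernels, weights, biases):
    """Apply conv + pool stages in series, then score classes from the final pooling layer."""
    layer = record_repr
    for kernel in kernels:
        # Each convolution operates on the output of the previous stage.
        layer = max_pool(relu(conv1d(layer, kernel)))
    scores = fully_connected(layer, weights, biases)
    return max(range(len(scores)), key=scores.__getitem__)


record = [float(v) for v in range(1, 11)]  # hypothetical record representation
label = classify(record, [[-1.0, 1.0], [0.5, 0.5]], [[1.0], [-1.0]], [0.0, 0.0])
```

The returned value is the index of the highest-scoring class; a dedicated index could serve as the predetermined class label for an unknown service provider, although that mapping is an assumption of this sketch.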

Regarding Claim 12,
Claim 12 recites a system including a processor (Flake: "processor" [0072]) and memory (Flake: "computer-readable media" [0070]) comprising instructions for performing functions corresponding to the method steps performed by a processor recited in claim 4.  The Flake/Yaghoobzadeh/Ma/Dredze/Yao combination teaches the limitations of claim 12 as set forth above in connection with claim 4.  Therefore, claim 12 is rejected under the same rationale as claim 4.

Regarding Claim 20,
Claim 20 recites a non-transitory computer readable medium (Flake: "computer-readable media" [0070]) storing instructions that, when executed by a processor (Flake: "processor" [0072]), perform functions corresponding to the method steps performed by a processor as recited in claim 4.  The Flake/Yaghoobzadeh/Ma/Dredze/Yao combination teaches the limitations of claim 20 as set forth above in connection with claim 4.  Therefore, claim 20 is rejected under the same rationale as claim 4.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CHARLES C KUO whose telephone number is (571)270-7477.  The examiner can normally be reached on M-F: 9:00 a.m. - 6:00 p.m.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.







/CHARLES C KUO/Examiner, Art Unit 2126 
/ANN J LO/Supervisory Patent Examiner, Art Unit 2126