DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 2019-09-09 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
Priority
Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.
Specification
The disclosure is objected to because of the following informality:  In Spec [0080], the following clause is repeated:  “the word BORN may have an attention distribution of 0.23”.  Appropriate correction is required.
Claim Objections
Claims 5 and 13 are objected to because of the following informality:  at the end of the claims, “plurality of sentence” should be changed to read “plurality of sentences”.  Appropriate correction is required.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
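A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.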

Claims 1-17 are rejected under 35 U.S.C. 103 as being unpatentable over See et al. (“Get To The Point: Summarization with Pointer-Generator Networks”; hereinafter See) in view of Zeng et al. (“Extracting Relational Facts by an End-to-End Neural Model with Copy Mechanism”; hereinafter Zeng).
As per Claim 1, See teaches receiving, by a processor, a plurality of sentences comprising a plurality of words, wherein the plurality of sentences comprises numerical data and textual data (See, Section 1 Intro Para 3, discloses:  “In this paper we present an architecture that addresses these three issues in the context of multi-sentence summaries. While most recent abstractive work has focused on headline generation tasks (reducing one or two sentences to a single headline), we believe that longer-text summarization is both more challenging (requiring higher levels of abstraction while avoiding repetition) and ultimately more useful. Therefore we apply our model to the recently-introduced CNN/Daily Mail dataset (Hermann et al., 2015; Nallapati et al., 2016), which contains news articles (39 sentences on average) paired with multi-sentence summaries, and show that we outperform the state-of-the-art abstractive system by at least 2 ROUGE points.” Here, See discloses receiving a plurality of sentences (“news articles” that are “39 sentences on average”).  See, Section 7.2 Figure 7, discloses that the plurality of sentences comprises a plurality of words, numerical data, and textual data.

[Image: See, Figure 7]

See, Section 2.1 Lines 2-4, discloses:  “The tokens of the article wi are fed one-by-one into the encoder (a single-layer bidirectional LSTM)”.  Here, See discloses using an LSTM, which cannot realistically be performed by a human with pen and paper, and thus necessitates the use of a processor.)
generating, by the processor, a plurality of encoded hidden state vectors associated with the plurality of sentences using a single layer bi-directional Long Short Term Memory (LSTM) neural network (See, Section 2.1, discloses:  “Our baseline model is similar to that of Nallapati et al. (2016), and is depicted in Figure 2. The tokens of the article wi are fed one-by-one into the encoder (a single-layer bidirectional LSTM), producing a sequence of encoder hidden states hi.”  Here, See discloses, generating a plurality of encoded hidden state vectors (“producing a sequence of encoder hidden states hi”, which are well known in the art to be a plurality of vector quantities) associated with the plurality of sentences (“The tokens of the article wi”, where in the article is the plurality of sentences), using a single layer bi-directional Long Short Term Memory (LSTM) neural network (“a single-layer bidirectional LSTM”)).
generating, by the processor, a plurality of current hidden state vectors based on word embedding associated with each word in the plurality of sentences at a time stamp 't' (See, Section 2.1 Continues:  “On each step t, the decoder (a single-layer unidirectional LSTM) receives the word embedding of the previous word (while training, this is the previous word of the reference summary; at test time it is the previous word emitted by the decoder), and has decoder state st.” Here, See discloses, at a time stamp 't' (“On each step t”), generating a plurality of current hidden state vectors (“decoder state st”, which is well known in the art to be a plurality of vector quantities), based on word embedding associated with each word in the plurality of sentences (“receives the word embedding of the previous word”)).
computing, by the processor, an attention distribution of each word in the plurality of sentences based on the plurality of encoded hidden state vectors and plurality of current hidden state vectors, wherein the attention distribution is indicative of importance of each word in the plurality of sentences (See, Section 2.1 Continues:  “The attention distribution at is calculated as in Bahdanau et al. (2015): 

[Image: See, Eq. 1, annotated to mark hi and st]

where v, Wh, Ws and battn are learnable parameters. The attention distribution can be viewed as a probability distribution over the source words, that tells the decoder where to look to produce the next word”.  Here, See discloses computing an attention distribution of each word in the plurality of sentences (“The attention distribution can be viewed as a probability distribution over the source words”), wherein the attention distribution is indicative of importance of each word in the plurality of sentences (“tells the decoder where to look to produce the next word”), which is based on the plurality of encoded hidden state vectors and plurality of current hidden state vectors (See annotations on Eq 1, hi and st)).
computing, by the processor, a context vector of the plurality of sentences based on the attention distribution of each word in the plurality of sentences and the plurality of encoded hidden state vectors (See, Section 2.1, continues:  “Next, the attention distribution is used to produce a weighted sum of the encoder hidden states, known as the context vector ht*:

[Image: See, Eq. 3, annotated to mark ait and hi]

Here, See discloses computing a context vector of the plurality of sentences (“produce…the context vector”) based on the attention distribution of each word in the plurality of sentences and the plurality of encoded hidden state vectors (See annotations of Eq 3, ait and hi))
(See, Section 2.1, continues:  “The context vector, which can be seen as a fixed-size representation of what has been read from the source for this step, is concatenated with the decoder state st and fed through two linear layers to produce the vocabulary distribution Pvocab”.  

[Image: See, Eq. 4, annotated to mark subscript t, st, and ht*]

Here, See discloses computing a vocabulary distribution (“produce the vocabulary distribution”) at the time stamp "t" based on the context vector and the plurality of current hidden state vectors (See annotations of Eq 4, subscript t, st, ht*))
computing, by the processor, a probability distribution of each of the plurality of words in the plurality of sentences based on the plurality of encoded hidden state vectors, the plurality of current hidden state vectors, and the vocabulary distribution (See, Section 2.1, continues:  “Pvocab is a probability distribution over all words in the vocabulary, and provides us with our final distribution from which to predict words w”.  Here, See discloses a probability distribution of each of the plurality of words in the plurality of sentences (“over all words in the vocabulary”).  See goes on to build on Pvocab.  See, Section 2.2, continues: “Our pointer-generator network is a hybrid between our baseline and a pointer network (Vinyals et al., 2015), as it allows both copying words via pointing, and generating words from a fixed vocabulary. In the pointer-generator model (depicted in Figure 3) the attention distribution at and context vector ht* are calculated as in section 2.1. In addition, the generation probability pgen ∈ [0,1] for timestep t is calculated from the context vector ht*, the decoder state st and the decoder input xt:”

[Image: See, Eq. 8, annotated]

where vectors wh*, ws, wx and scalar bptr are learnable parameters and σ is the sigmoid function. Next, pgen is used as a soft switch to choose between generating a word from the vocabulary by sampling from Pvocab, or copying a word from the input sequence by sampling from the attention distribution at. For each document let the extended vocabulary denote the union of the vocabulary, and all words appearing in the source document. We obtain the following probability distribution over the extended vocabulary:

[Image: See, Eq. 9, annotated]

Note that if w is an out-of-vocabulary (OOV) word, then Pvocab(w) is zero; similarly if w does not appear in the source document, then Σ(i: wi = w) ait is zero. The ability to produce OOV words is one of the primary advantages of pointer-generator models; by contrast models such as our baseline are restricted to their pre-set vocabulary.”
Here, See discloses computing a probability distribution of each of the plurality of words in the plurality of sentences (Eq. 9, “For each document let the extended vocabulary denote the union of the vocabulary, and all words appearing in the source document. We obtain the following probability distribution over the extended vocabulary”) based on the plurality of encoded hidden state vectors, the plurality of current hidden state vectors, and the vocabulary distribution (See annotations for Eq 8 and 9.  Eq 9 is the probability distribution, which is based on the vocabulary distribution, which in turn is based on the plurality of current hidden state vectors and the context vector, which in turn is based on the plurality of encoded hidden state vectors)).
	However, See does not teach generating, by the processor, an output comprising a plurality of structured relations between the plurality of words based on the probability 
	Zeng teaches generating, by the processor, an output comprising a plurality of structured relations between the plurality of words based on the probability distribution (Zeng, Section 3.1.2 “Predict Relation” Below Eq. 8, discloses:  “We then concatenate qr and qNA to form the confidence vector of all relations (including the NA-relation) and apply softmax to obtain the probability distribution pr = [pr,…,pm+1r] as: pr = softmax([qr; qNA]).  We select the relation with the highest probability as the predict relation and use it’s embedding as the next time step input vt+1.”  Here, Zeng discloses generating structured relations between the plurality of words (“We select the relation”) based on the probability distribution (“with the highest probability”), which is based on probability distribution (pr).  Zeng, Section 3.1.2 “Decoder” discloses:  “The decoder is used to generate triplets directly. Firstly, the decoder generates a relation for the triplet. Secondly, the decoder copies an entity from the source sentence as the first entity of the triplet. Lastly, the decoder copies the second entity from the source sentence. Repeat this process, the decoder could generate multiple triplets.”  Here, Zeng discloses that the relations are structured (“triplets”), and generating a plurality of them (“multiple triplets”)).
and rendering, by the processor, a knowledge graph depicting the plurality of structured relations between the plurality of words (Zeng, Figure 1, discloses:

[Image: Zeng, Figure 1]

See and Zeng are analogous art because they are both in the field of endeavor of natural language processing.
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the pointer-generator network of See, with the extraction of relations neural model of Zeng. The modification would have been obvious because one of ordinary skill in the art would be motivated to achieve improved accuracy when generating overlapping relations (Zeng, Section 4.4 Last Paragraph: “We can also observe that, in both NYT and WebNLG dataset, the NovelTagging model achieves the highest precision value and lowest recall value. By contrast, our models are much more balanced. We think that the reason is in the structure of the proposed models. The NovelTagging method finds triplets through tagging the words. However, they assume that only one tag could be assigned to just one word. As a result, one word can participate at most one triplet. Therefore, the NovelTagging model can only recall a 

As per Claim 2, the combination of See and Zeng teaches the method of claim 1 as shown above.  See teaches further comprising computing a coverage vector based on the probability distribution and the attention distribution, wherein the coverage vector avoids duplicate generation [of relations] within the generated plurality [of structured relations] (See, Section 2.3 discloses:  “Repetition is a common problem for sequence-to-sequence models (Tu et al., 2016; Mi et al., 2016; Sankaran et al., 2016; Suzuki and Nagata, 2016), and is especially pronounced when generating multi-sentence text (see Figure 1). We adapt the coverage model of Tu et al. (2016) to solve the problem. In our coverage model, we maintain a coverage vector ct, which is the sum of attention distributions over all previous decoder timesteps”.  

[Image: See, coverage vector equation]

Here, See discloses computing a coverage vector (“we maintain a coverage vector ct”) wherein the coverage vector avoids duplicate generation within the generated plurality (“Repetition is a common problem… We adapt the coverage model…to solve the problem”).  See also discloses that this is based on the attention distribution (“the sum of attention distributions over all previous decoder timesteps”).  This is also based on the probability distribution, as the attention distribution is also part of the probability distribution as shown in Eq. 9:

[Image: See, Eq. 9, annotated]

Thus, the coverage vector is based on the probability distribution and the attention distribution.)
However, See does not teach generation of structured relations.
Zeng teaches generation of structured relations (Zeng, Section 3.1.2 “Decoder” discloses:  “The decoder is used to generate triplets directly. Firstly, the decoder generates a relation for the triplet. Secondly, the decoder copies an entity from the source sentence as the first entity of the triplet. Lastly, the decoder copies the second entity from the source sentence. Repeat this process, the decoder could generate multiple triplets.”  Here, Zeng discloses generation of structured relations (“generates a relation for the triplet”)).
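
For illustration only, the coverage computation quoted above from See, Section 2.3 (“the sum of attention distributions over all previous decoder timesteps”) can be sketched as follows. This is a minimal sketch and is not code from See, Zeng, or the application; the function name coverage_vectors, the list-of-lists input format, and the sample values are assumptions made for the example.

```python
from typing import List


def coverage_vectors(attention_steps: List[List[float]]) -> List[List[float]]:
    """Return, for each decoder step t, the coverage vector available at that step.

    The coverage vector at step t is the element-wise sum of the attention
    distributions from all PREVIOUS steps, so it is all zeros at the first step.
    (Hypothetical helper; input is one attention distribution per decoder step.)
    """
    if not attention_steps:
        return []
    num_source_words = len(attention_steps[0])
    running = [0.0] * num_source_words
    per_step = []
    for a_t in attention_steps:
        # Record coverage BEFORE adding the current step's attention.
        per_step.append(list(running))
        running = [c + a for c, a in zip(running, a_t)]
    return per_step


if __name__ == "__main__":
    # Three decoder steps attending over two source words (made-up values).
    steps = [[0.5, 0.5], [0.25, 0.75], [1.0, 0.0]]
    for t, c in enumerate(coverage_vectors(steps)):
        print(t, c)
```

A source word that has already received a large share of attention accumulates a large coverage value, which is the quantity a coverage mechanism can use to discourage attending to, and thus regenerating, the same content.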

As per Claim 3, the combination of See and Zeng teaches the method of claim 1 as shown above.  See teaches further comprising selecting one of: generation [of structured relations] or sampling of the plurality of words from the plurality of sentences based on the probability distribution.  (See, Section 2.2 below Eq 8, discloses:  “Next, pgen is used as a soft switch to choose between generating a word from the vocabulary by sampling from Pvocab, or copying a word from the input sequence by sampling from the attention distribution at.”  Here, See discloses selecting (“switch”), based on the probability distribution (“pgen is used”), one of generation (“generating a word from the vocabulary by sampling from Pvocab”) or sampling of the plurality of words (“copying a word from the input sequence by sampling”)).
However, See does not teach generation of structured relations.
Zeng teaches generation of structured relations (Zeng, Section 3.1.2 “Decoder” discloses:  “The decoder is used to generate triplets directly. Firstly, the decoder generates a relation for the triplet. Secondly, the decoder copies an entity from the source sentence as the first entity of the triplet. Lastly, the decoder copies the second entity from the source sentence. Repeat this process, the decoder could generate multiple triplets.”  Here, Zeng discloses generation of structured relations (“generates a relation for the triplet”)).

As per Claim 4, the combination of See and Zeng teaches the method of claim 1 as shown above.  See teaches wherein the probability distribution includes a pointer mechanism to control when to generate a new [relation] or to copy the received words to the output, wherein the probability distribution is further indicative of a location of a word in the plurality of sentences, and wherein based on the pointer mechanism words in the plurality of sentences are directly copied from the received plurality of sentences to the output, and wherein the output is independent of an output length.  (See, Section 2.2, discloses:  “Our pointer-generator network is a hybrid between our baseline and a pointer network (Vinyals et al., 2015), as it allows both copying words via pointing, and generating words from a fixed vocabulary. In the pointer-generator model (depicted in Figure 3) the attention distribution at and context vector ht* are calculated as in section 2.1. In addition, the generation probability pgen ∈ [0,1] for timestep t is calculated from the context vector ht*, the decoder state st and the decoder input xt:”

[Image: See, Eq. 8]

where vectors wh*, ws, wx and scalar bptr are learnable parameters and σ is the sigmoid function. Next, pgen is used as a soft switch to choose between generating a word from the vocabulary by sampling from Pvocab, or copying a word from the input sequence by sampling from the attention distribution at. For each document let the extended vocabulary denote the union of the vocabulary, and all words appearing in the source document. We obtain the following probability distribution over the extended vocabulary:

[Image: See, Eq. 9]

Note that if w is an out-of-vocabulary (OOV) word, then Pvocab(w) is zero; similarly if w does not appear in the source document, then Σ(i: wi = w) ait is zero. The ability to produce OOV words is one of the primary advantages of pointer-generator models; by contrast models such as our baseline are restricted to their pre-set vocabulary.”  Here, See discloses that the probability distribution (P(w)) includes a pointer mechanism (second term of Eq 9, see above) to control when to generate a new [relation] or to copy the received words to the output (“it allows both copying words via pointing, and generating words from a fixed vocabulary”), wherein the probability distribution is further indicative of a location of a word in the plurality of sentences (this is the meaning of a “pointer” mechanism), and wherein based on the pointer mechanism words in the plurality of sentences are directly copied from the received plurality of sentences to the output (“copying a word from the input sequence by sampling”).  See also discloses wherein the output is independent of an output length, as the probability distribution comprises pgen, Pvocab, and at, none of which depends on the output length:

[Images: See, equations for pgen, Pvocab, and at]

No output length term is included in these equations, and thus the output is independent of an output length.)
However, See does not teach generation of structured relations.
Zeng teaches generation of structured relations (Zeng, Section 3.1.2 “Decoder” discloses:  “The decoder is used to generate triplets directly. Firstly, the decoder generates a relation for the triplet. Secondly, the decoder copies an entity from the source sentence as the first entity of the triplet. Lastly, the decoder copies the second entity from the source sentence. Repeat this process, the decoder could generate multiple triplets.”  Here, Zeng discloses generation of structured relations (“generates a relation for the triplet”)).
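
For illustration only, the soft switch quoted above from See, Section 2.2 can be sketched as a mixture over the extended vocabulary. The sketch assumes the mixture form P(w) = pgen * Pvocab(w) + (1 - pgen) * Σ(i: wi = w) ait, which is consistent with the quoted statements that Pvocab(w) is zero for an out-of-vocabulary word and that the attention sum is zero for a word absent from the source. It is not code from See, Zeng, or the application; the function name final_distribution, its argument names, and the sample values are hypothetical.

```python
from typing import Dict, List


def final_distribution(
    p_gen: float,
    p_vocab: Dict[str, float],
    source_words: List[str],
    attention: List[float],
) -> Dict[str, float]:
    """Mix a fixed-vocabulary distribution with a copy (pointer) distribution.

    p_gen        -- generation probability in [0, 1] for the current step
    p_vocab      -- probability of each fixed-vocabulary word
    source_words -- the source tokens, in order
    attention    -- attention weight for each source position (same length)
    """
    # Extended vocabulary: fixed vocabulary plus every word in the source.
    extended = dict.fromkeys(list(p_vocab) + list(source_words), 0.0)
    # Generation term: zero for words outside the fixed vocabulary.
    for word, prob in p_vocab.items():
        extended[word] += p_gen * prob
    # Copy term: attention mass summed over every position holding the word;
    # zero for words that do not appear in the source.
    for word, weight in zip(source_words, attention):
        extended[word] += (1.0 - p_gen) * weight
    return extended


if __name__ == "__main__":
    dist = final_distribution(
        p_gen=0.5,
        p_vocab={"was": 0.5, "born": 0.5},
        source_words=["obama", "born", "obama"],  # "obama" is out-of-vocabulary
        attention=[0.25, 0.5, 0.25],
    )
    print(dist)
```

In the sample call the out-of-vocabulary source word receives probability only through the copy term, which corresponds to the pointing behavior relied upon in the mapping above.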

As per Claim 5, the combination of See and Zeng teaches the method of claim 1 as shown above.  See teaches wherein the plurality of current hidden state vectors is used to generate a plurality of output words that is indicative of mapping of the plurality of words in the plurality of sentence. (See, Section 2.1, discloses:  “Our baseline model is similar to that of Nallapati et al. (2016), and is depicted in Figure 2. The tokens of the article wi are fed one-by-one into the encoder (a single-layer bidirectional LSTM), producing a sequence of encoder hidden states hi. On each step t, the decoder (a single-layer unidirectional LSTM) receives the word embedding of the previous word (while training, this is the previous word of the reference summary; at test time it is the previous word emitted by the decoder), and has decoder state st. The attention distribution at is calculated as in Bahdanau et al. (2015)”:

[Image: See, attention distribution equation]

where v, Wh, Ws and battn are learnable parameters. The attention distribution can be viewed as a probability distribution over the source words, that tells the decoder where to look to produce the next word.
Here, See discloses that the plurality of current hidden state vectors (“decoder state st”, for each t) is used to generate a plurality of output words (“decoder…look to produce the next word”) that is indicative of mapping of the plurality of words in the plurality of sentence (“receives the word embedding of the previous word”)).

As per Claim 6, the combination of See and Zeng teaches the method of claim 1 as shown above.  See teaches wherein the context vector is a weighted sum between the attention distribution and the plurality of encoded hidden state vectors.  (See, Section 2.1, continues:  “Next, the attention distribution is used to produce a weighted sum of the encoder hidden states, known as the context vector ht*”.  Here, See discloses context vector (“context vector ht*”) is a weighted sum (“weighted sum”) between the attention distribution (“attention distribution”) and the plurality of encoded hidden state vectors (“encoder hidden states”)

[Image: See, Eq. 3]

As per Claim 7, the combination of See and Zeng teaches the method of claim 1 as shown above.  See teaches wherein the context vector is further concatenated with the plurality of current hidden state vectors at the time stamp "t".  (See, Section 2.1, continues:  “The context vector, which can be seen as a fixed-size representation of what has been read from the source for this step, is concatenated with the decoder state st and fed through two linear layers to produce the vocabulary distribution Pvocab”.  Here, See discloses context vector is further concatenated (“context vector…is concatenated”) with the plurality of current hidden state vectors at the time stamp "t" (“with the decoder state st”)).
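
For illustration only, the operations relied upon for Claims 6 and 7 (a weighted sum of the encoder hidden states under the attention distribution, followed by concatenation with the decoder state) can be sketched as follows. The sketch takes the raw attention scores as given and normalizes them with a softmax; it does not reproduce the learned scoring function of See. It is not code from See, Zeng, or the application, and all function names and sample values are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Normalize raw attention scores into a probability distribution."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(attention: List[float], encoder_states: List[List[float]]) -> List[float]:
    """Weighted sum of encoder hidden states, one weight per source position."""
    dim = len(encoder_states[0])
    return [
        sum(a * h[k] for a, h in zip(attention, encoder_states))
        for k in range(dim)
    ]


def decoder_features(context: List[float], decoder_state: List[float]) -> List[float]:
    """Concatenate the context vector with the decoder state for step t."""
    return list(context) + list(decoder_state)


if __name__ == "__main__":
    attention = softmax([0.0, 0.0])            # equal scores give equal weights
    encoder_states = [[1.0, 0.0], [0.0, 1.0]]  # two source positions, dimension 2
    context = context_vector(attention, encoder_states)
    print(decoder_features(context, [2.0]))
```

The concatenated vector is what would then be passed to the layers that produce the vocabulary distribution; those layers are omitted here.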

As per Claim 8, the combination of See and Zeng teaches the method of claim 1 as shown above.  Zeng teaches wherein the plurality of words in the plurality of sentences correspond to plurality of annotated entities. (Zeng, Section 2, discloses:  “By giving a sentence with annotated entities, Hendrickx et al. (2010); Zeng et al. (2014); Xu et al. (2015a,b) treat identifying relations in sentences as a multi-class classification problem”.  Here, Zeng discloses that one can use annotated entities for identifying relations.)

As per Claim 9, Claim 9 is an application server claim corresponding to method Claim 1.  The difference is that it recites a processor and a memory.  (See, Section 2.1 Lines 2-4, discloses:  “The tokens of the article wi are fed one-by-one into the encoder (a single-layer bidirectional LSTM)”.  Here, See discloses using an LSTM, which cannot realistically be performed by a human with pen and paper, and thus necessitates the use of a processor and a memory.)  Claim 9 is rejected for the same reasons as Claim 1.

As per Claim 10, Claim 10 is an application server claim corresponding to method Claim 2.  The difference is that it recites a processor and a memory.  Claim 10 is rejected for the same reasons as Claim 2.

As per Claim 11, Claim 11 is an application server claim corresponding to method Claim 3.  The difference is that it recites a processor and a memory.  Claim 11 is rejected for the same reasons as Claim 3.

As per Claim 12, Claim 12 is an application server claim corresponding to method Claim 4.  The difference is that it recites a processor and a memory.  Claim 12 is rejected for the same reasons as Claim 4.

As per Claim 13, Claim 13 is an application server claim corresponding to method Claim 5.  The difference is that it recites a processor and a memory.  Claim 13 is rejected for the same reasons as Claim 5.

As per Claim 14, Claim 14 is an application server claim corresponding to method Claim 6.  The difference is that it recites a processor and a memory.  Claim 14 is rejected for the same reasons as Claim 6.

As per Claim 15, Claim 15 is an application server claim corresponding to method Claim 7.  The difference is that it recites a processor and a memory.  Claim 15 is rejected for the same reasons as Claim 7.

As per Claim 16, Claim 16 is an application server claim corresponding to method Claim 8.  The difference is that it recites a processor and a memory.  Claim 16 is rejected for the same reasons as Claim 8.

As per Claim 17, Claim 17 is a non-transitory computer-readable storage medium claim corresponding to method Claim 1.  The difference is that it recites a non-transitory computer-readable storage medium.  (See, Section 2.1 Lines 2-4, discloses:  “The tokens of the article wi are fed one-by-one into the encoder (a single-layer bidirectional LSTM)”.  Here, See discloses using an LSTM, which cannot realistically be performed by a human with pen and paper, and thus necessitates the use of a non-transitory computer-readable storage medium.)  Claim 17 is rejected for the same reasons as Claim 1.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Zheng et al. (“Joint entity and relation extraction based on a hybrid neural network”) discloses generating relations using LSTM encoder-decoders and CNN.
Liu et al. (“Neural Networks Models for Entity Discovery and Linking”) discloses using attention-based encoder-decoder networks for entity linking.
Miwa et al. (“End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures”) discloses relation extraction using LSTMs.
Zhang et al. (“End-to-End Neural Relation Extraction with Global Optimization”) discloses relation extraction using LSTMs.
Su et al. (“Exploring Encoder-Decoder Model for Distant Supervised Relation Extraction”) discloses relation extraction using encoder-decoders.
Zhou et al. (“Trigger Words Detection by Integrating Attention Mechanism into Bi-LSTM Neural Network - A Case Study in PubMED-Wide Trigger Words Detection for Pancreatic Cancer”) discloses attention-based LSTM encoder-decoder networks for trigger words detection.
Tu et al. (“Modeling Coverage for Neural Machine Translation”) discloses coverage vectors.
Vinyals et al. (“Pointer Networks”) discloses pointer networks.
Quirk et al. (US 2018/0189269 A1) discloses using LSTMs to determine relationships between words in a piece of content.
Kaiser et al. (US 10,268,671 B2) discloses using attention-based LSTM to generate a parse tree for an input segment.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEONARD A SIEGER whose telephone number is (571)272-9710.  The examiner can normally be reached on M-F 8:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is 
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached on (571) 272-9767.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/L.A.S./Examiner, Art Unit 2126   
/ANN J LO/Supervisory Patent Examiner, Art Unit 2126