DETAILED ACTION
Introduction
This office action is in response to Applicant’s submission filed on December 2nd, 2020. Claims 1-20 are pending in the application. As such, claims 1-20 have been examined.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on December 2nd, 2020 and January 25th, 2022 were filed.  The submissions are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statements are being considered by the examiner.
However, the IDS submitted on December 2nd, 2020 cites a non-patent literature document (Citation number 3) for which no fully legible copy has been furnished. The submitted copy appears to be missing a portion of text on page 5. As such, the information therein has not been considered.
Specification
The disclosure is objected to because of the following informalities:
Page 7, line 15 reads: “correspond a parsing tree associated with the sentence”. It appears to the examiner that it should instead read “correspond to a parsing tree associated with the sentence”.
Page 42, lines 18-19 read: “Control may pass to end.” It is unclear what is meant by this statement.
Similarly, page 62, line 20 reads: “Control may pass to end.” It is unclear what is meant by this statement.
Similarly, page 75, line 13 reads: “Control may pass to end.” It is unclear what is meant by this statement.
Appropriate correction is required.
The lengthy specification has not been checked to the extent necessary to determine the presence of all possible minor errors. Applicant’s cooperation is requested in correcting any errors of which applicant may become aware in the specification.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-4, 6-7, 10-14, 16, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Zheng et al. (2020, “Document Modeling with Graph Attention Networks for Multi-grained Machine Reading Comprehension”, hereinafter “Zheng”) in view of Hunter (U.S. Patent Application Publication 2021/0073287 A1).
In regards to claim 1, Zheng teaches:
A method, comprising: 
constructing a hierarchal graph associated with a document, the hierarchal graph includes a plurality of nodes including a document node, a set of paragraph nodes connected to the document node, a set of sentence nodes each connected to a corresponding one of the set of paragraph nodes, and a set of token nodes each connected to a corresponding one of the set of sentence nodes (Section 3.2: the document structure is treated as a tree (i.e. hierarchal graph) which has token nodes, sentence nodes, paragraph nodes, and a document node; see also Figure 4); 
determining, based on a language attention model (Section 3.3.1: Zheng describes a multi-head attention mechanism (i.e. language attention model)), a set of weights associated with a set of edges between a first node and each of a second set of nodes connected to the first node in the constructed hierarchal graph (Section 3.3.1: the attention coefficient eij (i.e. the weight associated with the edge between node j and node i) is calculated), the language attention model corresponds to a model to assign a contextual significance to each of a plurality of words in a sentence of the document (Fig. 4: Each of the nodes in the bottom most layer corresponds to a token (i.e. word); Section 3.3.1: the attention coefficient (i.e. weight) is calculated based on both the feature of node i and the feature of node j – that is, the context of node i); 
applying a graph neural network (GNN) model on the constructed hierarchal graph based on at least one of: a set of first features associated with each of the set of token nodes, and the determined set of weights (Section 3.3.1: a graph attention network (i.e. GNN model) is applied to model the information flow between nodes; Equation 2: the attention head takes in the normalized attention coefficients (i.e. weights) and node features (i.e. features associated with each of the set of token nodes) as input; see also Section 3.3.5); 
updating a set of features associated with each of the plurality of nodes based on the application of the GNN model on the constructed hierarchal graph (Section 3.3.1, equation 2: the output of the attention network is zi, which is used to update the node features; see also Section 3.3.5); 
generating a document vector for a natural language processing (NLP) task, based on the updated set of features associated with each of the plurality of nodes (Fig. 3: the output (i.e. document vector) is based on the output from the previous steps; see also Section 3.4, which describes a series of probabilities and scores (i.e. vectors) that identify areas of the document that are relevant to the NLP task); and 
displaying an output of the NLP task for the document, based on the generated document vector (Section 3.4: Zheng notes that they use the output scores, e.g. g(c, S, l) and g(c, S) in order to select candidates; in addition, Table 1 notes a comparison of their results (i.e. outputs) to other previous systems, which suggests that the results must have been displayed to them at some point).
However, while Zheng does direct their teachings to machine reading comprehension and provide their models as code (Section 1), Zheng fails to explicitly teach the use of a processor.
In a related art, Hunter teaches a method for determining a set of features associated with a set of vertices (i.e. nodes) of a directed graph (Abstract). Hunter also teaches that their described graph system may be used, for example, to analyze text associated with vertices to perform NLP operations (Paragraph 358). Notably, Hunter teaches that their system may include processors to perform the processes described (Paragraph 7). Furthermore, Hunter teaches using graph neural networks to e.g. classify whether one or more vertices should be prioritized (Paragraph 373: i.e. considered a key node; see also Fig. 24, elements 3412, 3413, 3414, and 3420: several nodes in a hierarchical graph are highlighted (i.e. indicated as key nodes)). In addition, Hunter teaches prioritizing certain features and vertices based on e.g. a vertex’s relationship with other vertices in the directed graph and visually indicating the prioritized vertices (Paragraph 338). Hunter teaches that the prioritization and display of certain vertices may increase the interpretability of directed graphs (Paragraph 339).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Zheng to incorporate the teachings of Hunter to include the use of processors. Doing so would have been an example of applying a known technique to a known device ready for improvement to yield predictable results (see MPEP § 2143(I)(D)).
Zheng teaches a method and provides code for Natural Language Processing utilizing a graph neural network on a hierarchal graph representing a text document in a manner similar to the instant application. However, Zheng does not teach the use of a processor to perform the method.
Hunter describes a system that utilizes a processor for a natural language process that involves utilizing a graph neural network on a hierarchal graph representing a text document (Paragraph 7).
One of ordinary skill in the art would have recognized that utilizing the processor to perform the method of Zheng would have yielded predictable results, as Zheng already provided code that may have been executed by a processor.
Thus, the combination of Zheng and Hunter teaches:
A method, comprising: 
In a processor (Hunter, Paragraph 7):
constructing a hierarchal graph associated with a document, the hierarchal graph includes a plurality of nodes including a document node, a set of paragraph nodes connected to the document node, a set of sentence nodes each connected to a corresponding one of the set of paragraph nodes, and a set of token nodes each connected to a corresponding one of the set of sentence nodes (Zheng, Section 3.2: the document structure is treated as a tree (i.e. hierarchal graph) which has token nodes, sentence nodes, paragraph nodes, and a document node; see also Zheng, Figure 4); 
determining, based on a language attention model (Zheng, Section 3.3.1: Zheng describes a multi-head attention mechanism (i.e. language attention model)), a set of weights associated with a set of edges between a first node and each of a second set of nodes connected to the first node in the constructed hierarchal graph (Zheng, Section 3.3.1: the attention coefficient eij (i.e. the weight associated with the edge between node j and node i) is calculated), the language attention model corresponds to a model to assign a contextual significance to each of a plurality of words in a sentence of the document (Zheng, Fig. 4: Each of the nodes in the bottom most layer corresponds to a token (i.e. word); Zheng, Section 3.3.1: the attention coefficient (i.e. weight) is calculated based on both the feature of node i and the feature of node j – that is, the context of node i); 
applying a graph neural network (GNN) model on the constructed hierarchal graph based on at least one of: a set of first features associated with each of the set of token nodes, and the determined set of weights (Zheng, Section 3.3.1: a graph attention network (i.e. GNN model) is applied to model the information flow between nodes; Zheng, Equation 2: the attention head takes in the normalized attention coefficients (i.e. weights) and node features (i.e. features associated with each of the set of token nodes) as input; see also Zheng, Section 3.3.5); 
updating a set of features associated with each of the plurality of nodes based on the application of the GNN model on the constructed hierarchal graph (Zheng, Section 3.3.1, equation 2: the output of the attention network is zi, which is used to update the node features; see also Zheng, Section 3.3.5); 
generating a document vector for a natural language processing (NLP) task, based on the updated set of features associated with each of the plurality of nodes (Zheng, Fig. 3: the output (i.e. document vector) is based on the output from the previous steps; see also Zheng, Section 3.4, which describes a series of probabilities and scores (i.e. vectors) that identify areas of the document that are relevant to the NLP task); and 
displaying an output of the NLP task for the document, based on the generated document vector (Zheng, Section 3.4: Zheng notes that they use the output scores, e.g. g(c, S, l) and g(c, S) in order to select candidates; in addition, Zheng, Table 1 notes a comparison of their results (i.e. outputs) to other previous systems, which suggests that the results must have been displayed to them at some point).
In regards to claim 2, Zheng and Hunter further teach:
The method according to claim 1, wherein the displayed output includes an indication of at least one of: one or more first words, one or more first sentences, or one or more first paragraphs in the document (Zheng, Figure 4: Zheng describes a graph structure comprising Paragraph nodes, Sentence nodes, and Token (i.e. word) nodes), and each of the one or more first words corresponds to a key word in the document, each of the one or more first sentences corresponds to a key sentence in the document, and each of the one or more first paragraphs corresponds to a key paragraph in the document (Hunter, Paragraph 373: certain vertices (i.e. nodes) may be prioritized (i.e. correspond to a key item in the document); see also Hunter, Paragraph 338: some embodiments may visually indicate prioritized features or vertices; see also Hunter, Fig. 24, elements 3412, 3413, 3414, and 3420: examples of a hierarchical graph with indicated prioritized vertices (i.e. key nodes)).
In regards to claim 3, Zheng and Hunter further teach:
The method according to claim 1, wherein the displayed output includes a representation of the constructed hierarchal graph or a part of the constructed hierarchal graph (Hunter, Paragraph 338), and an indication of important nodes in the represented hierarchal graph or in the part of the hierarchal graph (Hunter, Paragraph 338) based on the determined set of weights (Hunter, Paragraph 338: certain vertices (i.e. nodes) may be prioritized (i.e. indicated as important) based on e.g. a vertex’s relationship with other vertices in the graph; Zheng, Section 3.3.1: Zheng teaches calculating (i.e. determining) attention coefficients (i.e. weights) that correspond to edges between nodes (i.e. a node’s relationship with another node in the graph)).
In regards to claim 4, Zheng further teaches:
The method according to claim 1, wherein the constructing of the hierarchal graph associated with the document further comprising: 
segmenting the document to identify a set of paragraphs (Section 3.2, a document is decomposed (i.e. segmented) to a list of paragraphs); 
parsing each paragraph from the set of paragraphs to identify a set of sentences (Section 3.2: a paragraph is decomposed to (i.e. parsed to identify) a list of sentences); 90FPC.20-00986.ORD 
parsing each sentence from the set of sentences to determine a parsing tree associated with a set of tokens associated with the parsed sentence (Section 3.2: a sentence is decomposed to (i.e. parsed to determine) a list of tokens; the document structure is treated as a tree; thus the set of document nodes could be considered a subtree (i.e. a parsing tree)); and 
assembling the hierarchal graph based on the document, the identified set of paragraphs, the identified set of sentences, and the determined parsing tree for each of the identified sentences (Fig. 4).
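For illustration only, the decomposition mapped above (document to paragraphs, paragraphs to sentences, sentences to tokens, then assembly into a tree) may be sketched as follows. This sketch is taken neither from Zheng nor from the instant application. The function name build_hierarchical_graph, the blank-line paragraph split, the punctuation-based sentence split, the whitespace tokenization, and the edge-type labels are all assumptions chosen for brevity, and the flat token list stands in for whatever parsing tree an actual implementation would produce.

```python
import re

def build_hierarchical_graph(document: str):
    """Return (nodes, edges) for a document -> paragraph -> sentence -> token tree.

    nodes: list of (kind, text) tuples; index 0 is the document node.
    edges: list of (parent_index, child_index, edge_type) tuples.
    """
    nodes = [("document", document)]
    edges = []

    def add_node(kind, text, parent, edge_type):
        nodes.append((kind, text))
        child = len(nodes) - 1
        edges.append((parent, child, edge_type))
        return child

    # Assumption: paragraphs are separated by blank lines.
    for para in filter(None, (p.strip() for p in document.split("\n\n"))):
        p_id = add_node("paragraph", para, 0, "document-paragraph")
        # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
        for sent in filter(None, re.split(r"(?<=[.!?])\s+", para)):
            s_id = add_node("sentence", sent, p_id, "paragraph-sentence")
            # Assumption: tokens are whitespace-separated words.
            for tok in sent.split():
                add_node("token", tok, s_id, "sentence-token")
    return nodes, edges
```

Because every non-document node is added together with exactly one edge to its parent, the result is a tree with one fewer edge than nodes.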
In regards to claim 6, Zheng further teaches:
The method according to claim 1, wherein the constructing of the hierarchal graph associated with the document, further comprising: 
adding, in the hierarchal graph, a first set of edges between the document node and one or more of the set of token nodes (Section 3.2: “we further add edges… between tokens and documents”); 
adding, in the hierarchal graph, a second set of edges between the document node and one or more of the set of sentence nodes (Section 3.2: “we further add edges… between sentences and the document”); 
adding, in the hierarchal graph, a third set of edges between each of the set of paragraph nodes and each associated token node from the set of token nodes (Section 3.2: “we further add edges between tokens and paragraphs”), the set of edges comprises at least one of: the first set of edges, the second set of edges, or the third set of edges; and 
labeling each edge in the hierarchal graph based on a type of the edge (Fig. 4: the graph illustrates (i.e. labels) edges with different colors and solidness to indicate information about them, such as dashed lines indicating that the edge was additionally added).
In regards to claim 7, Zheng further teaches:
The method according to claim 1, further comprising determining the set of first features for each of the set of token nodes to represent each word associated with the set of token nodes as a vector (Section 3.3.6 notes that a BERT model is used to provide a token-level representation (i.e. a set of first features); Figure 3 shows the BERT encoder being used to encode the data before graph initialization; Section 2.2 and Section 4.1 describe how the documents are tokenized; Section 3.2 describes how the document structure is then treated as a tree that includes token nodes; these token nodes initially contain a set of first features as a result of the tokenization process, which are later used in e.g. Section 3.3.1, equation 1).
In regards to claim 10, Zheng further teaches:
The method according to claim 7, further comprising: 
determining a set of second features for each of the set of sentence nodes based on an average value or aggregate value of the determined set of first features for corresponding token nodes from the set of token nodes (Section 3.3.6: Zheng teaches using a bottom-up average-pooling strategy to initialize (i.e. determine a set of features for) the nodes other than the token level nodes, and describes an equation wherein the feature of a node h is determined by the nodes that are on hierarchal level below it; in this case, the sentence nodes are determined based on the corresponding token nodes); 
determining a set of third features for each of the set of paragraph nodes based on an average value or aggregate value of the determined set of second features for corresponding sentence nodes from the set of sentence nodes (Section 3.3.6: Zheng teaches using a bottom-up average-pooling strategy to initialize (i.e. determine a set of features for) the nodes other than the token level nodes, and describes an equation wherein the feature of a node h is determined by the nodes that are on hierarchal level below it; in this case, the paragraph nodes are determined based on the corresponding sentence nodes); and 
determining a set of fourth features for the document node based on an average value or aggregate value of the determined set of third features for each of the set of the paragraph nodes (Section 3.3.6: Zheng teaches using a bottom-up average-pooling strategy to initialize (i.e. determine a set of features for) the nodes other than the token level nodes, and describes an equation wherein the feature of a node h is determined by the nodes that are on hierarchal level below it; in this case, the document node is determined based on the corresponding paragraph nodes), wherein the applying the GNN model on the constructed hierarchal graph is further based on at least one of: the determined set of second features, the determined set of third features, or the determined set of fourth features (Fig. 3: graph initialization, which is described in Section 3.3.6 and includes the bottom-up average-pooling strategy, occurs prior to the graph encoding step, which is where the GNN model is applied; see e.g. Section 3.3 and Section 3.3.1).
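For illustration only, a bottom-up average-pooling initialization of the kind mapped above may be sketched as follows. This is not Zheng's code. The names mean and init_features, the dictionary-based tree representation, and the use of plain Python lists as feature vectors are assumptions, and the sketch presumes that every non-token node has at least one child.

```python
def mean(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def init_features(children, token_features, root):
    """Initialize every non-token node as the average of its children.

    children: dict mapping a node to the list of nodes one level below it.
    token_features: dict mapping each token node to its feature vector.
    root: the document node.
    """
    features = dict(token_features)

    def visit(node):
        # Token nodes already carry features; other nodes are pooled bottom-up.
        if node not in features:
            features[node] = mean([visit(child) for child in children[node]])
        return features[node]

    visit(root)
    return features
```

Under this sketch a sentence node receives the average of its token features, a paragraph node the average of its sentence features, and the document node the average of its paragraph features.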
In regards to claim 11, Zheng further teaches:
The method according to claim 1, further comprising: 
encoding first positional information associated with relative positions of each of a set of tokens associated with each of a set of words in each of the set of sentences (Section 3.3.5: Zheng discusses modeling the relative position information between nodes with a relational embedding (i.e. encoding) for each edge in the graph; notably, when the relational embedding is between different layers, e.g. sentence and its paragraph, the relational embedding represents the relative position of the sentence in the paragraph. Zheng also notes that it is the same for other types of edges; because there is a relational embedding for each edge in the graph, and there are edges between tokens and sentences (see Section 3.2, Paragraph 2), there are thus relational embeddings for (i.e. encoded first positional information associated with) the edges between tokens and sentences); 
encoding second positional information associated with relative positions of each of the set of sentences in each of a set of paragraphs in the document (Section 3.3.5: Zheng discusses modeling the relative position information between nodes with a relational embedding (i.e. encoding) for each edge in the graph; notably, when the relational embedding is between different layers, e.g. sentence and its paragraph, the relational embedding represents the relative position of the sentence in the paragraph); 
encoding third positional information associated with relative positions of each of the set of paragraphs in the document (Section 3.3.5: Zheng discusses modeling the relative position information between nodes with a relational embedding (i.e. encoding) for each edge in the graph; notably, when the relational embedding is between different layers, e.g. sentence and its paragraph, the relational embedding represents the relative position of the sentence in the paragraph. Zheng also notes that it is the same for other types of edges; because there is a relational embedding for each edge in the graph, and there are edges between paragraphs and documents (see Section 3.2, Paragraph 2), there are thus relational embeddings for (i.e. encoded third positional information associated with) the edges between paragraphs and documents); and 
determining a token embedding associated with each of the set of token nodes based on at least one of: the set of first features associated with each of the set of token nodes, the encoded first positional information, the encoded second positional information, and the encoded third positional information, wherein the applying the GNN model on the hierarchal graph is further based on the determined token embedding associated with each of the set of token nodes (Section 3.3.5: the attention result (i.e. token embedding) takes into account the newly encoded relational embedding).
In regards to claim 12, Zheng further teaches:
The method according to claim 1, further comprising: determining a scalar dot product between a first vector associated with the first node and a second vector associated with a second node from the second set of nodes, wherein the first node is connected with the second node through a first edge from the set of edges, wherein the first vector is scaled based on a query weight-matrix and the second vector is scaled based on a key weight-matrix (Section 3.3.1, Equation 1: hi (i.e. the first vector associated with the first node) is multiplied by hj (i.e. the second vector associated with a second node, wherein the first node is connected with the second node through a first edge from the set of edges), that is, a scalar dot product is taken, wherein the first vector is scaled by WQ (i.e. a query weight-matrix) and the second vector is scaled by WK (i.e. a key weight-matrix)); and 
determining a first weight of the first edge between the first node and the second node based on the determined scalar dot product (Section 3.3.1, Equation 1: the attention coefficient (i.e. the weight between the first and second node) is based on the scalar dot product).
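For illustration only, the computation mapped above (projecting the first vector with a query weight-matrix, projecting the second vector with a key weight-matrix, and taking the scalar dot product as the edge weight) may be sketched as follows. The names matvec and edge_weight are assumptions, and the sketch deliberately omits any activation function or scaling factor that an actual attention formulation, including Zheng's Equation 1, may apply to the dot product.

```python
def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def edge_weight(h_i, h_j, W_q, W_k):
    """Unnormalized weight of the edge between node i and node j.

    h_i, h_j: feature vectors of the first and second node.
    W_q, W_k: query and key weight matrices (hypothetical values in practice
    would be learned parameters).
    """
    query = matvec(W_q, h_i)   # first vector scaled by the query weight-matrix
    key = matvec(W_k, h_j)     # second vector scaled by the key weight-matrix
    return sum(q * k for q, k in zip(query, key))  # scalar dot product
```

For example, with W_q = [[1, 0], [0, 2]], W_k = [[0, 1], [1, 0]], h_i = [1, 2] and h_j = [3, 4], the projected vectors are [1, 4] and [4, 3], so the edge weight is 16.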
In regards to claim 13, Zheng further teaches:
The method according to claim 1, further comprising: 
normalizing each of the set of weights to obtain a set of normalized weights (Section 3.3.1: Zheng teaches normalizing the attention coefficients (i.e. weights) using the softmax function); 
scaling each of a second set of vectors associated with a corresponding node from the second set of nodes (Section 3.3.1, Equation 2: hj) based on a value weight-matrix (Section 3.3.1, Equation 2: WV) and a corresponding normalized weight of the set of normalized weights (Section 3.3.1, Equation 2: aij); and 
aggregating each of the scaled second set of vectors to obtain an updated first vector associated with the first node (Section 3.3.1, Equation 2: the sum is taken to get the attention result (i.e. updated first vector)).
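For illustration only, the three steps mapped above (softmax normalization of the weights, scaling of each neighbor vector by a value weight-matrix and its normalized weight, and summation into an updated vector) may be sketched for a single attention head as follows. The names softmax, matvec, and aggregate are assumptions, and the sketch is not Zheng's implementation.

```python
import math

def softmax(scores):
    """Normalize raw weights so that they are positive and sum to 1."""
    peak = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def aggregate(raw_weights, neighbor_vectors, W_v):
    """Updated vector for the first node from its connected nodes.

    raw_weights: one unnormalized edge weight per neighbor.
    neighbor_vectors: one feature vector per neighbor.
    W_v: value weight matrix.
    """
    alphas = softmax(raw_weights)
    updated = [0.0] * len(W_v)
    for alpha, h_j in zip(alphas, neighbor_vectors):
        value = matvec(W_v, h_j)                 # scale by the value weight-matrix
        for d, component in enumerate(value):
            updated[d] += alpha * component      # scale by normalized weight and sum
    return updated
```

With two neighbors of equal raw weight, each normalized weight is 0.5, so the updated vector is simply the average of the two projected neighbor vectors.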
In regards to claim 14, Zheng further teaches:
The method according to claim 13, further comprising determining an updated second vector associated with the first node based on a concatenation of the updated first vector and one or more updated third vectors associated with the first node, wherein each of the updated first vector and the one or more updated third vectors are determined based on the application of the GNN model by use of the language attention model (Section 3.3.1: the multi-head attention result (i.e. updated second vector associated with the first node) is obtained by concatenating the outputs of m individual attention heads (i.e. the updated first and third vectors associated with the first node, which are determined based on the application of the GNN model by use of the language attention model)).
In regards to claim 16, Zheng further teaches:
The method according to claim 1, wherein the GNN corresponds to a Graph Attention Network (GAT) (Section 3.3.1: A Graph Attention Network is used to model the information flow).
In regards to claim 19, Zheng teaches:
constructing a hierarchal graph associated with a document, the hierarchal graph includes a plurality of nodes including a document node, a set of paragraph nodes connected to the document node, a set of sentence nodes each connected to a corresponding one of the set of paragraph nodes, and a set of token nodes each connected to a corresponding one of the set of sentence nodes (Section 3.2: the document structure is treated as a tree (i.e. hierarchal graph) which has token nodes, sentence nodes, paragraph nodes, and a document node; see also Figure 4); 
determining, based on a language attention model (Section 3.3.1: Zheng describes a multi-head attention mechanism (i.e. language attention model)), a set of weights associated with a set of edges between a first node and each of a second set of nodes connected to the first node in the constructed hierarchal graph (Section 3.3.1: the attention coefficient eij (i.e. the weight associated with the edge between node j and node i) is calculated), the language attention model corresponds to a model to assign a contextual significance to each of a plurality of words in a sentence of the document (Fig. 4: Each of the nodes in the bottom most layer corresponds to a token (i.e. word); Section 3.3.1: the attention coefficient (i.e. weight) is calculated based on both the feature of node i and the feature of node j – that is, the context of node i); 
applying a graph neural network (GNN) model on the constructed hierarchal graph based on at least one of: a set of first features associated with each of the set of token nodes, and the determined set of weights (Section 3.3.1: a graph attention network (i.e. GNN model) is applied to model the information flow between nodes; Equation 2: the attention head takes in the normalized attention coefficients (i.e. weights) and node features (i.e. features associated with each of the set of token nodes) as input; see also Section 3.3.5); 
updating a set of features associated with each of the plurality of nodes based on the application of the GNN model on the constructed hierarchal graph (Section 3.3.1, equation 2: the output of the attention network is zi, which is used to update the node features; see also Section 3.3.5); 
generating a document vector for a natural language processing (NLP) task, based on the updated set of features associated with each of the plurality of nodes (Fig. 3: the output (i.e. document vector) is based on the output from the previous steps; see also Section 3.4, which describes a series of probabilities and scores (i.e. vectors) that identify areas of the document that are relevant to the NLP task); and 
displaying an output of the NLP task for the document, based on the generated document vector (Section 3.4: Zheng notes that they use the output scores, e.g. g(c, S, l) and g(c, S) in order to select candidates; in addition, Table 1 notes a comparison of their results (i.e. outputs) to other previous systems, which suggests that the results must have been displayed to them at some point).
However, while Zheng does direct their teachings to machine reading comprehension and provide their models as code (Section 1; i.e. instructions), Zheng fails to explicitly teach the use of non-transitory computer-readable storage media configured to store this code.
In a related art, Hunter teaches a method for determining a set of features associated with a set of vertices (i.e. nodes) of a directed graph (Abstract). Hunter also teaches that their described graph system may be used, for example, to analyze text associated with vertices to perform NLP operations (Paragraph 358). Notably, Hunter teaches that their system may include non-transitory machine-readable media storing instructions to perform the processes described (Paragraph 6). Furthermore, Hunter teaches using graph neural networks to e.g. classify whether one or more vertices should be prioritized (Paragraph 373: i.e. considered a key node; see also Fig. 24, elements 3412, 3413, 3414, and 3420: several nodes in a hierarchical graph are highlighted (i.e. indicated as key nodes)). In addition, Hunter teaches prioritizing certain features and vertices based on e.g. a vertex’s relationship with other vertices in the directed graph and visually indicating the prioritized vertices (Paragraph 338). Hunter teaches that the prioritization and display of certain vertices may increase the interpretability of directed graphs (Paragraph 339).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Zheng to incorporate the teachings of Hunter to include the use of non-transitory machine-readable media storing instructions. Doing so would have been an example of applying a known technique to a known device ready for improvement to yield predictable results (see MPEP § 2143(I)(D)).
Zheng teaches a method and provides code for Natural Language Processing utilizing a graph neural network on a hierarchal graph representing a text document in a manner similar to the instant application. However, Zheng does not teach the use of non-transitory computer-readable storage media to store the code.
Hunter describes a system that utilizes non-transitory machine-readable media storing instructions for a natural language process that involves utilizing a graph neural network on a hierarchal graph representing a text document (Paragraph 6).
One of ordinary skill in the art would have recognized that storing the code of Zheng on non-transitory machine-readable media would have yielded predictable results, as Zheng already provided code that may have been stored on such media and executed by a system.
Thus, the combination of Zheng and Hunter teaches:
One or more non-transitory computer-readable storage media configured to store instructions that, in response to being executed, cause a system to perform operations, the operations comprising (Hunter, Paragraph 6):
constructing a hierarchal graph associated with a document, the hierarchal graph includes a plurality of nodes including a document node, a set of paragraph nodes connected to the document node, a set of sentence nodes each connected to a corresponding one of the set of paragraph nodes, and a set of token nodes each connected to a corresponding one of the set of sentence nodes (Zheng, Section 3.2: the document structure is treated as a tree (i.e. hierarchal graph) which has token nodes, sentence nodes, paragraph nodes, and a document node; see also Zheng, Figure 4); 
determining, based on a language attention model (Zheng, Section 3.3.1: Zheng describes a multi-head attention mechanism (i.e. language attention model)), a set of weights associated with a set of edges between a first node and each of a second set of nodes connected to the first node in the constructed hierarchal graph (Zheng, Section 3.3.1: the attention coefficient eij (i.e. the weight associated with the edge between node j and node i) is calculated), the language attention model corresponds to a model to assign a contextual significance to each of a plurality of words in a sentence of the document (Zheng, Fig. 4: Each of the nodes in the bottom most layer corresponds to a token (i.e. word); Zheng, Section 3.3.1: the attention coefficient (i.e. weight) is calculated based on both the feature of node i and the feature of node j – that is, the context of node i); 
applying a graph neural network (GNN) model on the constructed hierarchal graph based on at least one of: a set of first features associated with each of the set of token nodes, and the determined set of weights (Zheng, Section 3.3.1: a graph attention network (i.e. GNN model) is applied to model the information flow between nodes; Zheng, Equation 2: the attention head takes in the normalized attention coefficients (i.e. weights) and node features (i.e. features associated with each of the set of token nodes) as input; see also Zheng, Section 3.3.5); 
updating a set of features associated with each of the plurality of nodes based on the application of the GNN model on the constructed hierarchal graph (Zheng, Section 3.3.1, equation 2: the output of the attention network is zi, which is used to update the node features; see also Zheng, Section 3.3.5); 
generating a document vector for a natural language processing (NLP) task, based on the updated set of features associated with each of the plurality of nodes (Zheng, Fig. 3: the output (i.e. document vector) is based on the output from the previous steps; see also Zheng, Section 3.4, which describes a series of probabilities and scores (i.e. vectors) that identify areas of the document that are relevant to the NLP task); and 
displaying an output of the NLP task for the document, based on the generated document vector (Zheng, Section 3.4: Zheng notes that they use the output scores, e.g. g(c, S, l) and g(c, S) in order to select candidates; in addition, Zheng, Table 1 notes a comparison of their results (i.e. outputs) to other previous systems, which suggests that the results must have been displayed to them at some point).
In regards to claim 20, Zheng teaches: 
constructing a hierarchal graph associated with a document, the hierarchal graph includes a plurality of nodes including a document node, a set of paragraph nodes connected to the document node, a set of sentence nodes each connected to a corresponding one of the set of paragraph nodes, and a set of token nodes each connected to a corresponding one of the set of sentence nodes (Section 3.2: the document structure is treated as a tree (i.e. hierarchal graph) which has token nodes, sentence nodes, paragraph nodes, and a document node; see also Figure 4); 
determining, based on a language attention model (Section 3.3.1: Zheng describes a multi-head attention mechanism (i.e. language attention model)), a set of weights associated with a set of edges between a first node and each of a second set of nodes connected to the first node in the constructed hierarchal graph (Section 3.3.1: the attention coefficient eij (i.e. the weight associated with the edge between node j and node i) is calculated), the language attention model corresponds to a model to assign a contextual significance to each of a plurality of words in a sentence of the document (Fig. 4: Each of the nodes in the bottom most layer corresponds to a token (i.e. word); Section 3.3.1: the attention coefficient (i.e. weight) is calculated based on both the feature of node i and the feature of node j – that is, the context of node i); 
applying a graph neural network (GNN) model on the constructed hierarchal graph based on at least one of: a set of first features associated with each of the set of token nodes, and the determined set of weights (Section 3.3.1: a graph attention network (i.e. GNN model) is applied to model the information flow between nodes; Equation 2: the attention head takes in the normalized attention coefficients (i.e. weights) and node features (i.e. features associated with each of the set of token nodes) as input; see also Section 3.3.5); 
updating a set of features associated with each of the plurality of nodes based on the application of the GNN model on the constructed hierarchal graph (Section 3.3.1, equation 2: the output of the attention network is zi, which is used to update the node features; see also Section 3.3.5); 
generating a document vector for a natural language processing (NLP) task, based on the updated set of features associated with each of the plurality of nodes (Fig. 3: the output (i.e. document vector) is based on the output from the previous steps; see also Section 3.4, which describes a series of probabilities and scores (i.e. vectors) that identify areas of the document that are relevant to the NLP task); and 
displaying an output of the NLP task for the document, based on the generated document vector (Section 3.4: Zheng notes that they use the output scores, e.g. g(c, S, l) and g(c, S) in order to select candidates; in addition, Table 1 notes a comparison of their results (i.e. outputs) to other previous systems, which suggests that the results must have been displayed to them at some point).
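For illustration only, the limitations mapped above (a document/paragraph/sentence/token hierarchy, per-edge attention weights, a feature update, and a document vector) can be sketched in a few lines. This is not Zheng's code or the applicant's implementation: the function names, the toy document, the two-dimensional features, the dot-product scoring function, and the mean pooling are all assumptions chosen for brevity. A graph attention network as referenced in Zheng would use learned projections rather than a raw dot product.

```python
import math

def build_hierarchy(paragraphs):
    """Build a document -> paragraph -> sentence -> token tree.

    paragraphs: list of paragraphs; each paragraph is a list of sentences;
    each sentence is a list of token strings.
    Returns (nodes, edges): nodes maps node id -> level name,
    edges is a list of (parent id, child id) pairs.
    """
    nodes = {0: "document"}
    edges = []

    def add(level, parent):
        node_id = len(nodes)
        nodes[node_id] = level
        edges.append((parent, node_id))
        return node_id

    for sentences in paragraphs:
        p = add("paragraph", 0)
        for tokens in sentences:
            s = add("sentence", p)
            for _ in tokens:
                add("token", s)
    return nodes, edges

def neighbors_of(nodes, edges):
    """Treat each edge as undirected and list the neighbors of every node."""
    nbrs = {i: [] for i in nodes}
    for parent, child in edges:
        nbrs[parent].append(child)
        nbrs[child].append(parent)
    return nbrs

def attention_weights(features, i, nbrs):
    """Softmax-normalized weights on the edges between node i and its neighbors.

    Assumption: the raw score is a plain dot product of the two feature vectors.
    """
    scores = [sum(a * b for a, b in zip(features[i], features[j])) for j in nbrs]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def update_features(features, neighbors):
    """Replace each node's features with the attention-weighted sum of its neighbors'."""
    updated = {}
    for i, nbrs in neighbors.items():
        if not nbrs:
            updated[i] = list(features[i])
            continue
        w = attention_weights(features, i, nbrs)
        dim = len(features[i])
        updated[i] = [sum(wj * features[j][k] for wj, j in zip(w, nbrs)) for k in range(dim)]
    return updated

def document_vector(features):
    """Assumption: pool the updated node features by simple averaging."""
    dim = len(next(iter(features.values())))
    return [sum(f[k] for f in features.values()) / len(features) for k in range(dim)]

# Hypothetical one-paragraph, two-sentence document with made-up features.
nodes, edges = build_hierarchy([[["graphs", "help"], ["really"]]])
neighbors = neighbors_of(nodes, edges)
features = {i: [float(i % 2), 1.0] for i in nodes}
updated = update_features(features, neighbors)
doc_vec = document_vector(updated)
```

The sketch keeps the four steps as separate functions so that each one lines up with a single claim limitation.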
However, while Zheng does direct their teachings to machine reading comprehension and provide their models as code (Section 1), Zheng fails to explicitly teach the use of a processor.
In a related art, Hunter teaches a method for determining a set of features associated with a set of vertices (i.e. nodes) of a directed graph (Abstract). Hunter also teaches that their described graph system may be used, for example, to analyze text associated with vertices to perform NLP operations (Paragraph 358). Notably, Hunter teaches that their system may include processors to perform the processes described (Paragraph 7). Furthermore, Hunter teaches using graph neural networks to e.g. classify whether one or more vertices should be prioritized (Paragraph 373: i.e. considered a key node; see also Fig. 24, elements 3412, 3413, 3414, and 3420: several nodes in a hierarchical graph are highlighted (i.e. indicated as key nodes)). In addition, Hunter teaches prioritizing certain features and vertices based on e.g. a vertex’s relationship with other vertices in the directed graph and visually indicating the prioritized vertices (Paragraph 338). Hunter teaches that the prioritization and display of certain vertices may increase the interpretability of directed graphs (Paragraph 339).
It would have been obvious to one of ordinary skill in the art at the time of filing to modify Zheng to incorporate the teachings of Hunter to include the use of processors. Doing so would have been an example of applying a known technique to a known device ready for improvement to yield predictable results (see MPEP 2143(I)(D)).
Zheng teaches a method and provides code for Natural Language Processing utilizing a graph neural network on a hierarchal graph representing a text document in a manner similar to the instant application. However, Zheng does not teach the use of a processor to perform the method.
Hunter describes a system that utilizes a processor for a natural language process that involves utilizing a graph neural network on a hierarchal graph representing a text document (Paragraph 7).
One of ordinary skill in the art would have recognized that utilizing the processor to perform the method of Zheng would have yielded predictable results, as Zheng already provided code that may have been executed by a processor.
Thus, the combination of Zheng and Hunter teaches:
An electronic device, comprising (Hunter, Paragraph 138: the system may be e.g. a type of mobile computing device, fixed computing device, or other electronic device): 
A processor configured to (Hunter, Paragraph 7):
construct a hierarchal graph associated with a document, the hierarchal graph includes a plurality of nodes including a document node, a set of paragraph nodes connected to the document node, a set of sentence nodes each connected to a corresponding one of the set of paragraph nodes, and a set of token nodes each connected to a corresponding one of the set of sentence nodes (Zheng, Section 3.2: the document structure is treated as a tree (i.e. hierarchal graph) which has token nodes, sentence nodes, paragraph nodes, and a document node; see also Zheng, Figure 4); 
determine, based on a language attention model (Zheng, Section 3.3.1: Zheng describes a multi-head attention mechanism (i.e. language attention model)), a set of weights associated with a set of edges between a first node and each of a second set of nodes connected to the first node in the constructed hierarchal graph (Zheng, Section 3.3.1: the attention coefficient eij (i.e. the weight associated with the edge between node j and node i) is calculated), the language attention model corresponds to a model to assign a contextual significance to each of a plurality of words in a sentence of the document (Zheng, Fig. 4: Each of the nodes in the bottom most layer corresponds to a token (i.e. word); Zheng, Section 3.3.1: the attention coefficient (i.e. weight) is calculated based on both the feature of node i and the feature of node j – that is, the context of node i); 
apply a graph neural network (GNN) model on the constructed hierarchal graph based on at least one of: a set of first features associated with each of the set of token nodes, and the determined set of weights (Zheng, Section 3.3.1: a graph attention network (i.e. GNN model) is applied to model the information flow between nodes; Zheng, Equation 2: the attention head takes in the normalized attention coefficients (i.e. weights) and node features (i.e. features associated with each of the set of token nodes) as input; see also Zheng, Section 3.3.5); 
update a set of features associated with each of the plurality of nodes based on the application of the GNN model on the constructed hierarchal graph (Zheng, Section 3.3.1, equation 2: the output of the attention network is zi, which is used to update the node features; see also Zheng, Section 3.3.5); 
generate a document vector for a natural language processing (NLP) task, based on the updated set of features associated with each of the plurality of nodes (Zheng, Fig. 3: the output (i.e. document vector) is based on the output from the previous steps; see also Zheng, Section 3.4, which describes a series of probabilities and scores (i.e. vectors) that identify areas of the document that are relevant to the NLP task); and 
display an output of the NLP task for the document, based on the generated document vector (Zheng, Section 3.4: Zheng notes that they use the output scores, e.g. g(c, S, l) and g(c, S) in order to select candidates; in addition, Zheng, Table 1 notes a comparison of their results (i.e. outputs) to other previous systems, which suggests that the results must have been displayed to them at some point).
Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Zheng and Hunter as applied to claim 4 above, and further in view of Marneffe et al. (2006, Generating Typed Dependency Parses from Phrase Structure Parses, hereinafter “Marneffe”).
In regards to claim 5, Zheng in view of Hunter does not explicitly teach that the parsing of each sentence from the set of sentences further comprises: constructing a dependency parse tree associated with a set of words in the parsed sentence, wherein the dependency parse tree indicates a dependency relationship between each of the set of words in the parsed sentence; and 
constructing a constituent parse tree associated with the set of words based on the constructed dependency parse tree, wherein the constituent parse tree is a representative of parts of speech associated with each of the set of words in the parsed sentence.
	In a related art, Marneffe teaches a system for extracting typed dependency parses of English sentences from phrase structure parses (Abstract). Marneffe notes that dependency parses can be useful for a range of NLP tasks, which benefit from having access to dependencies between words typed with grammatical relations (Section 1). Notably, Marneffe teaches two phases to their method: dependency extraction, which involves parsing a phrase and identifying dependencies between words (i.e. constructing a dependency parse tree that indicates a dependency relationship between each of the set of words in the parsed sentence), and dependency typing, which involves labelling each of the dependencies with a grammatical relation (Section 3; i.e. constructing a constituent parse tree that is representative of parts of speech associated with each of the set of words in the parsed sentence; see also Fig. 4 for a “Typed dependency parse”, i.e. constituent parse tree). Marneffe teaches that their system facilitates the rapid extraction of grammatical relations from phrase structure parses, which may increase robustness and accuracy relative to other, similar systems (Section 1, Paragraph 2).
 	It would have been obvious to one of ordinary skill in the art at the time of filing to modify the combination of Zheng and Hunter to incorporate the teachings of Marneffe to include the typed dependency parse tree. Doing so may have increased robustness and accuracy of parsing, while also benefiting the NLP task with access to dependencies between words typed with grammatical relations, as taught by Marneffe (Section 1, Paragraphs 1 and 2).
	Thus, the combination of Zheng, Hunter, and Marneffe teaches:
The method according to claim 4, wherein the parsing of each sentence from the set of sentences further comprising: 
constructing a dependency parse tree associated with a set of words in the parsed sentence, wherein the dependency parse tree indicates a dependency relationship between each of the set of words in the parsed sentence (Marneffe, Section 3: dependency extraction, which involves parsing a phrase and identifying dependencies between words); and 
constructing a constituent parse tree associated with the set of words based on the constructed dependency parse tree, wherein the constituent parse tree is a representative of parts of speech associated with each of the set of words in the parsed sentence (Marneffe, Section 3: dependency typing, which involves labelling each of the dependencies with a grammatical relation; see also Fig. 4 for a “Typed dependency parse”, i.e. constituent parse tree).
Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Zheng and Hunter as applied to claim 7 above, and further in view of Cothenet (2020, “Short technical information about Word2Vec, GloVe, and Fasttext”).
In regards to claim 8, Zheng in view of Hunter fails to explicitly teach that the set of first features for each of set of token nodes is determined based on a token embedding technique including at least one of: a word2vec technique, a Fastext technique, or a Glove technique.
In a related art, Cothenet discusses the use of each of these techniques for determining word (i.e. token) embeddings (Section “Embeddings”). Cothenet notes that each of the three models can still be considered useful, and advocates trying out each one of them, and keeping the one with which the model achieves the best score on the final task (Section “Conclusion”). It should be noted that Zheng teaches the use of a BERT encoder for determining the set of first features (see e.g. Fig. 3), which fulfills a very similar function in determining token embeddings from text data.
Thus, it would have been obvious to one of ordinary skill in the art at the time of filing to modify the combination of Zheng and Hunter to incorporate the teachings of Cothenet to try using Word2Vec, FastText, or GloVe in order to determine word embeddings. Doing so would have been the application of a known technique to a known device ready for improvement to yield predictable results (see MPEP 2143(I)(D)).
Zheng teaches a system for performing a natural language processing task using a hierarchal graph structure, but does not teach the use of Word2Vec, Fasttext, or GloVe for determining word embeddings, like the claimed invention.
Cothenet teaches that Word2Vec, Fasttext, and GloVe are all comparable and viable methods of determining word embeddings (Section “Conclusion”) for natural language processing tasks.
One of ordinary skill in the art would have recognized that using any of the three methods as a replacement method for determining word embeddings may have resulted in an improved system, and that each of them may have been potentially viable and worth trying (as taught by Cothenet).
Thus, the combination of Zheng, Hunter, and Cothenet teaches:
The method according to claim 7, wherein the set of first features for each of set of token nodes is determined based on a token embedding technique including at least one of: a word2vec technique, a Fastext technique, or a Glove technique (Cothenet, Section “Embeddings”, Page 3: Cothenet discusses GloVe, FastText, and Word2Vec; also Cothenet, Section “Conclusion”, Page 9: Cothenet notes that each of the models can be useful, and suggests trying them all).
Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Zheng and Hunter as applied to claim 7 above, and further in view of Jain et al. (2019, “Estimating Distributed Representation Performance in Disaster-Related Social Media Classification”, hereinafter “Jain”).
In regards to claim 9, while Zheng teaches the use of BERT for determining the initial word embeddings (Section 3.3.6; see also Fig. 3; i.e. set of first features), Zheng in view of Hunter fails to explicitly teach the use of ELMo. 
In a related art, Jain discusses using a variety of different representations for word embeddings for a natural language processing task (Abstract). Notably, Jain discusses using both BERT and ELMo interchangeably in their experiment (Abstract: “Models that are built from pre-trained word embeddings from Word2Vec, GloVe, ELMo and BERT are used for performance evaluation.”; also Section III(B): “we selected a single classification architecture that could take input from the various embedding options”), and found that, in some cases, the experiments with ELMo returned the best results (Section IV; see also Table III).
It would have been obvious to one of ordinary skill in the art at the time of filing to modify the combination of Zheng and Hunter to incorporate the teachings of Jain to use ELMo in lieu of BERT. This would have been an example of simple substitution of one known element for another to obtain predictable results (see MPEP 2143(I)(B)).
Zheng teaches a system for performing a natural language processing task using a hierarchal graph structure, but does not teach the use of ELMo for determining word embeddings, like the claimed invention.
However, Jain teaches using ELMo as a substitute for BERT (Abstract: “Models that are built from pre-trained word embeddings from Word2Vec, GloVe, ELMo and BERT are used for performance evaluation.”; also Section III(B): “we selected a single classification architecture that could take input from the various embedding options”) with a finding that ELMo may even perform better than BERT for a natural language processing task (Section IV; see also Table III).
One of ordinary skill in the art could have substituted ELMo for BERT in a manner similar to Jain in the system taught by Zheng, as they both perform a similar function of creating contextualized embeddings of some input natural language text.
Thus, the combination of Zheng, Hunter, and Jain teaches:
The method according to claim 7, wherein the set of first features for each of set of token nodes are determined based on a pre-trained contextual model including at least one of: an Embeddings from Language Models (ELMo) model (Jain, Abstract), or a Bidirectional Encoder Representations from Transformer (BERT) model (Zheng, Fig. 3).
Claims 15 and 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Zheng and Hunter as applied to claim 1 above, and further in view of Fang et al. (2020, “Hierarchical Graph Network for Multi-hop Question Answering”, hereinafter “Fang”).
In regards to claim 15, Zheng in view of Hunter fails to explicitly teach that generating the document vector for the NLP task further comprises at least one of: averaging or aggregating the updated set of features associated with each of the plurality of nodes of the constructed hierarchal graph, determining a multi-level clustering of the plurality of nodes, or applying a multi-level selection of a pre-determined number of top nodes from the plurality of nodes.
In a related art, Fang teaches a system for NLP processing using a hierarchical graph network (Abstract). Fang’s hierarchical graph network comprises paragraph-level, sentence-level, and entity-level nodes (Fig. 2). Notably, Fang teaches generating a document vector (Section 3.3, “Gated Attention”: G is a representation which is aggregated from the contextual representation and graph representation (i.e. constructed hierarchal graph)) that is used for answer span extraction. Fang’s system determines a span (i.e. clustering of nodes – because the nodes are a representation of the text, a span of the text would hence refer to a plurality of nodes) where the answer may appear by defining three different sub-tasks, divided across the paragraph, sentence, and entity levels (i.e. multi-level). Furthermore, Fang also teaches selecting only a pre-determined number of nodes (Fang, Section 3.3: the graph may have a pre-determined number of paragraph/sentence/entity nodes in a graph). Fang notes that their system is effective for multi-hop question answering problems (Abstract).
It would have been obvious to one of ordinary skill in the art at the time of filing to modify the combination of Zheng and Hunter to incorporate the teachings of Fang to include aspects of their “hierarchical graph network”. Doing so may have improved the system’s effectiveness for multi-hop question answering problems, as taught by Fang (Abstract).
Thus, the combination of Zheng, Hunter, and Fang teaches:
The method according to claim 1, wherein the generating the document vector for the NLP task further comprising at least one of: averaging or aggregating the updated set of features associated with each of the plurality of nodes of the constructed hierarchal graph (Fang, Section 3.4: G (i.e. document vector) is aggregated from the contextual representation and the graph representation (i.e. constructed hierarchal graph) and used for answer span extraction (i.e. the NLP task)), determining a multi-level clustering of the plurality of nodes (Fang, Section 3.4: Fang’s system determines a span (i.e. clustering of nodes – because the nodes are a representation of the text, a span of the text would hence refer to a plurality of nodes) where the answer may appear by defining three different sub-tasks, divided across the paragraph, sentence, and entity levels (i.e. multi-level)), or applying a multi-level selection of a pre-determined number of top nodes from the plurality of nodes (Fang, Section 3.3: the graph may have a pre-determined number of paragraph/sentence/entity (i.e. multi-level) nodes in a graph; Section 3.1, “Paragraph Selection”, describes how the top-N paragraphs may be selected based on ranking scores).
In regards to claim 17, Zheng in view of Hunter fails to explicitly teach applying the generated document vector on a feedforward layer associated with the neural network model trained for the NLP task; generating a prediction result associated with the NLP task based on the application of the generated document vector on the feedforward layer associated with the neural network model; and displaying the output of the NLP task for the document, based on the generated prediction result.
In a related art, Fang teaches a system for NLP processing using a hierarchical graph network (Abstract). Fang’s hierarchical graph network comprises paragraph-level, sentence-level, and entity-level nodes (Fig. 2). Notably, Fang teaches generating a document vector (Section 3.3, “Gated Attention”: G is a representation that is used for answer span extraction (i.e. an NLP task)) that is applied on a feedforward layer associated with a neural network model trained for an NLP task (Section 3.4: several Multilayer Perceptrons (i.e. feedforward neural network models trained for the NLP task) are described) and generating a prediction result (Section 3.4: the Multilayer Perceptron is a classifier (that is, it generates a prediction result associated with the NLP task)). Fang notes that their system is effective for multi-hop question answering problems (Abstract).
It would have been obvious to one of ordinary skill in the art at the time of filing to modify the combination of Zheng and Hunter to incorporate the teachings of Fang to include aspects of their “hierarchical graph network”. Doing so may have improved the system’s effectiveness for multi-hop question answering problems, as taught by Fang (Abstract).
Thus, the combination of Zheng, Hunter, and Fang teaches:
The method according to claim 1, further comprising: 
applying the generated document vector on a feedforward layer associated with the neural network model trained for the NLP task (Fang, Section 3.4: Several Multilayer Perceptrons (i.e. feedforward neural network models trained for the NLP task) are described, such as in equation 9, which describes a Multilayer Perceptron that uses matrix G as its input (i.e. the generated document vector)); 
generating a prediction result associated with the NLP task based on the application of the generated document vector on the feedforward layer associated with the neural network model (Fang, Section 3.4: the Multilayer Perceptron is a classifier (that is, it generates a prediction result associated with the NLP task)); and 
displaying the output of the NLP task for the document, based on the generated prediction result (Fang, Section 4.2: The results (i.e. output) of the classifiers are compared to the results of other, similar systems, indicating that the output was displayed in some way).
In regards to claim 18, Zheng in view of Hunter and further in view of Fang teaches:
The method according to claim 17, further comprising re-training the neural network model for the NLP task based on the document vector and the generated prediction result (Fang, Section 3.4: Fang describes several loss functions, as well as corresponding ground-truth facts, indicating that the MLPs are being updated based on their results (i.e. re-trained)).
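For illustration only, the steps recited in claims 17 and 18 (applying a document vector to a feedforward layer, generating a prediction, and computing a loss against a ground-truth label that could drive re-training) can be sketched as follows. This is not Fang's implementation: the single linear layer, the softmax output, the hand-picked weights, and the function names are assumptions, and a real Multilayer Perceptron would have learned parameters and at least one hidden layer.

```python
import math

def feedforward_predict(doc_vector, weights, biases):
    """Apply one linear layer plus softmax to a document vector.

    weights: one row of coefficients per output class.
    Returns (index of the most probable class, list of class probabilities).
    """
    logits = [sum(w * x for w, x in zip(row, doc_vector)) + b
              for row, b in zip(weights, biases)]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)), probs

def cross_entropy(probs, target):
    """Loss against a ground-truth class index; lower means a better prediction."""
    return -math.log(probs[target])

# Hypothetical two-dimensional document vector and two-class layer.
doc_vector = [1.0, 0.0]
weights = [[2.0, 0.0], [0.0, 2.0]]
biases = [0.0, 0.0]

predicted, probs = feedforward_predict(doc_vector, weights, biases)
loss = cross_entropy(probs, target=0)
print("predicted class:", predicted)
```

In a training loop, the loss value would be used to adjust `weights` and `biases`. That update step is omitted here because the office action does not describe a particular optimizer.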
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Xie et al. (U.S. Patent Application Publication 2018/0300314 A1, hereinafter “Xie”) teaches a neural architecture for reading comprehension (Abstract). Notably, Xie teaches parsing a text passage into a constituent parse tree (Paragraph 18; see also Fig. 4) before calculating attention scores between constituents in a hierarchal manner (Paragraph 31).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALEXANDER J KIM whose telephone number is (571)272-4442. The examiner can normally be reached M-F 7:30 AM - 5:30 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached on (571) 272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ALEXANDER JOONGIE KIM/Examiner, Art Unit 2655
/ANDREW C FLANDERS/Supervisory Patent Examiner, Art Unit 2655