DETAILED ACTION
Introduction
This office action is in response to Applicant’s submission filed on May 6, 2021. 
Claims 1-20 are pending in the application. As such, claims 1-20 have been examined. 
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Drawings
The drawings were received on May 6, 2021.  These drawings have been accepted and considered by the Examiner.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-2, 4, 6-8, 10-11, 13, 15-17 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Larlus-Larrondo et al. (US Patent Pub. No. 2021/0312628), hereinafter Larlus, in view of Ho et al. (US Patent Pub. No. 2017/0286835), hereinafter Ho, in view of Frieder et al. (US Patent Pub. No. 2021/0134418), hereinafter Frieder.

Regarding claim 1, Larlus teaches a method for natural language processing (Larlus [0027] In other domain-specific proxy tasks utilizing self-supervised learning, a “pretext” task is solved to learn an implicit prior knowledge about the structure in the input space. The prior knowledge can be utilized in the target tasks, as discussed above. For computer vision applications, colorizing a gray-scale image, predicting image rotations, or clustering image embeddings provide useful priors to downstream vision problems. Similarly, solving next sentence prediction and masked language modeling tasks enables a language model to perform substantially better on a diverse set of natural language processing target tasks), 
the method comprising: 
receiving a first text corpus (Larlus [0038] Masked language modeling is a self-supervised proxy task to pre-train a language model over large-scale text corpora. This type of pre-training scheme enables the language model to learn efficient language priors so that simply fine-tuning the language model achieves significant improvements over the state-of-the-art on a wide range of natural language processing target tasks)
masking some of the [nodes] (Larlus [0039] In this pre-training task, (i) a sequence of words are tokenized, (ii) a random subset of the tokens are selected to be either masked, replaced by other tokens, or kept intact, (iii) all of the tokens are given as input to a language model (a bidirectional transformer encoder model), and (iv) the language model is trained to correctly predict the ground-truth labels of the selected tokens (before the tokens are masked or replaced, in case they are changed). FIG. 4 illustrated the architecture to carry out this pre-training task);
a bi-directional transformer model (Larlus [0039] In this pre-training task, (i) a sequence of words are tokenized, (ii) a random subset of the tokens are selected to be either masked, replaced by other tokens, or kept intact, (iii) all of the tokens are given as input to a language model (a bidirectional transformer encoder model), and (iv) the language model is trained to correctly predict the ground-truth labels of the selected tokens (before the tokens are masked or replaced, in case they are changed). FIG. 4 illustrated the architecture to carry out this pre-training task);
and training the bi-directional transformer model on the first text corpus (Larlus [0039] In this pre-training task, (i) a sequence of words are tokenized, (ii) a random subset of the tokens are selected to be either masked, replaced by other tokens, or kept intact, (iii) all of the tokens are given as input to a language model (a bidirectional transformer encoder model), and (iv) the language model is trained to correctly predict the ground-truth labels of the selected tokens (before the tokens are masked or replaced, in case they are changed). FIG. 4 illustrated the architecture to carry out this pre-training task)
by reducing loss from the bi-directional transformer model (Larlus [0094] To evaluate the representations learned by both models, the convolutional layers of an AlexNet are taken and a generalized-mean pooling, L2 normalization, and fully-connected layers are appended. The parameters of the fully-connected layer are trained for 300 epochs by minimizing the AP Loss over the clean version of the Landmarks dataset. The complete model is tested on the revisited Oxford Buildings and Paris datasets by computing mean-average-precision scores. The image representations that are produced by solving the image conditioned masked language modeling task outperforms the counterparts obtained by the RotNet model on this task)
predicting the masked [nodes] (Larlus [0035] More specifically, given an image-caption pair, a word in the caption is masked, and the image conditioned masked language modeling tries to predict the label of the masked word by using the representation of the image).
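For illustration only, the masking steps (i) through (iv) quoted from Larlus [0039] can be sketched as follows. This is a minimal sketch and not Larlus's implementation: the function name `mask_tokens`, the 15% selection rate, and the 80/10/10 split among masking, replacing, and keeping are assumptions borrowed from common BERT-style practice and do not appear in the cited paragraph.

```python
import random

MASK = "[MASK]"  # placeholder token; the name is an assumption


def mask_tokens(tokens, vocab, select_prob=0.15, seed=0):
    """Return (inputs, labels) for masked language modeling.

    labels[i] holds the original token at each selected position and
    None elsewhere, so a model is only scored on selected positions.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < select_prob:
            labels.append(tok)  # ground-truth label to be predicted
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK)               # masked
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # replaced by another token
            else:
                inputs.append(tok)                # kept intact
        else:
            labels.append(None)
            inputs.append(tok)
    return inputs, labels
```

A language model would then be trained to recover `labels` from `inputs` at the selected positions, which corresponds to the "predicting" limitation mapped above.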
Larlus does not teach
that comprises semi-structured content comprising hierarchical nodes; 
masking some of the hierarchical nodes 
generating node embeddings and level embeddings from the semi-structured content of the first text corpus and from the masked hierarchical nodes; 
including the node embeddings and the level embeddings in a bi-directional transformer model
predicting the masked hierarchical nodes. 
Ho teaches
that comprises semi-structured content (Ho [0020] In the QA system 100, the knowledge manager 104 may be configured to receive inputs from various sources. For example, knowledge manager 104 may receive input from the network 102, one or more knowledge bases or corpora of electronic documents 106 which stores electronic documents 107, semantic data 108, or other possible sources of data input. In selected embodiments, the knowledge database 106 may include structured, semi-structured, and/or unstructured content in a plurality of documents that are contained in one or more large knowledge databases or corpora. The various computing devices (e.g., 110, 120, 130) on the network 102 may include access points for content creators and content users. Some of the computing devices may include devices for a database storing the corpus of data as the body of information used by the knowledge manager 104 to generate answers to questions. The network 102 may include local network connections and remote connections in various embodiments, such that knowledge manager 104 may operate in environments of any size, including local and global, e.g., the Internet. Additionally, knowledge manager 104 serves as a front-end system that can make available a variety of knowledge extracted from or represented in documents, network-accessible sources and/or structured data sources. In this manner, some processes populate the knowledge manager, with the knowledge manager also including input interfaces to receive knowledge requests and respond accordingly)
comprising hierarchical nodes (Ho [0100] By now, it will be appreciated that there is disclosed herein a system, method, apparatus, and computer program product for generating concept hierarchies with an information handling system having a processor and a memory. As disclosed, the system, method, apparatus, and computer program product generate at least a first concept set comprising one or more concepts extracted from one or more content sources. At the system, a user request is received to produce a hierarchy of concepts from the first concept set using one or more specified hierarchy parameters, which may be default parameters or parameters specified in the user request. A vector representation of each of the concepts in the first concept set is generated, retrieved, constructed, or otherwise obtained. The vectors are processed by performing a natural language processing (NLP) analysis comparison of the vector representation of each of the concepts in the first concept set to determine a similarity measure for each pair of distinct concepts Ci and Cj in the first concept set. The similarity measure may be defined on a selected subset of dimensions of the concept vectors with uniform or non-uniform weights, where the selected dimensions and their weights can be modified in each iterative step of hierarchy construction. In selected embodiments, the NLP analysis includes analyzing a vector similarity function sim(Vi, Vj) between vectors Vi, Vj representing each pair of distinct concepts Ci and Cj in the first concept set. In selected embodiments, analysis of the vector similarity function sim(Vi, Vj) includes computing, for each concept Ci for i=1 . . . N, the similarity measure corresponding to said concept Ci as a cosine distance measure between each vector pair Vi, Vj for j=1 . . . N, i≠j, and then selecting a distinct, unconnected concept Cj having a maximum cosine distance measure with the concept Ci. 
A concept hierarchy is constructed based on the one or more specified hierarchy parameters and the similarity measure for each pair of distinct concepts Ci and Cj in the first concept set. In selected embodiments, the concept hierarchy is constructed using a bottom-up method to iteratively build a concept graph by selecting distinct, unconnected concepts Ci, Cj from the first concept set based on a maximal similarity measure and identifying a first concept as a hierarchy root which has a maximal number of occurrences in the first concept set. In other embodiments, the concept hierarchy is constructed using a top-down/frequency method to sort the one or more concepts in the first concept set into a sorted concept sequence based on frequency of occurrence, select a root node C1 that has maximum frequency of occurrence, and sequentially add each concept from the sorted concept sequence to the root node C1 in the concept hierarchy based on a maximal similarity measure between a selected concept from the sorted concept sequence and the root node C1 in the concept hierarchy, or to another existing node Ci in the concept hierarchy based on a maximal similarity measure between a selected concept in the sorted concept sequence and that other existing node Ci in the concept hierarchy. 
In other embodiments, the concept hierarchy is constructed by generating a first sequence over a set of abstract concepts C1, . . . , Ck by simulating a random walk on a first hierarchical structure defined by a first branching factor and specified depth; generating a second sequence over a set of regular concepts D1, . . . , Dk, where the sequence extracted from a corpus; generating or retrieving a vector representation for each of the concepts in the first sequence of abstract concepts and the second sequence of regular concepts; and identifying one or more pairs of regular concepts to approximate corresponding pairs of abstract concepts based on analogies of relationships between the abstract concepts and the regular concepts. In addition, the system may display the concept hierarchy to visually present inter-relations between concepts from the first concept set, such as by visually presenting a hierarchical structure conveying concept grouping of concepts from the first concept set to enable user navigation over the first concept set. In other embodiments, the system may iteratively select a concept from the first concept set; identify an associated neighborhood for each selected concept in the first concept set using iterative clustering and probability flow-based traversals to identify, for each concept in the first concept set, an associated neighborhood and corresponding strength measure; and create a hierarchy of associated neighborhoods, each of which comprises a representative concept to enable a human user to easily identify the neighborhood).
Ho is considered to be analogous to the claimed invention because it is in the same field of using natural language processing for text analytics (Ho [0100]). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Larlus further in view of Ho to allow for using semi-structured content having hierarchical nodes. Doing so would allow better presentation and visualization of concepts and their inter-relations (e.g., hierarchical presentation) (Ho [0051]).
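For illustration only, the vector similarity function sim(Vi, Vj) described in Ho [0100] can be sketched as a cosine measure followed by selection of the most similar other concept. This is a minimal sketch under stated assumptions: the function names, the three-dimensional vectors, and the concept labels are hypothetical, and Ho's "cosine distance measure" is treated here as cosine similarity, since the passage selects the maximum value.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(concept, vectors):
    """Return the other concept Cj whose vector is most similar to Ci."""
    return max(
        (c for c in vectors if c != concept),
        key=lambda c: cosine_similarity(vectors[concept], vectors[c]),
    )


# Hypothetical concept vectors, for illustration only.
vectors = {
    "web design": [1.0, 0.2, 0.0],
    "HTML": [0.9, 0.3, 0.1],
    "visual impairment": [0.0, 0.1, 1.0],
}
```

Repeating the selection over unconnected concepts is one way the bottom-up construction described in the passage could proceed; the sketch does not attempt the full hierarchy construction.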
Larlus in view of Ho does not teach
generating node embeddings and level embeddings from the semi-structured content of the first text corpus and from the masked hierarchical nodes; 
including the node embeddings and the level embeddings in a bi-directional transformer model
Frieder teaches
node embeddings (Frieder [0082] In embodiments, a Cross-Global Attention Graph Kernel Network 1100 may learn an end-to-end deep graph kernel on a batch of graphs, as shown in FIG. 11. At 1102, a plurality of patient graphs may be received/input. The node level embedding 1104 and node clusters 1106 may be determined first by, at 1108 determining shared weight Graph Convolutional Networks (GCNs) to form node level embedding 1104, and at 1110, determining learning node clusters with reconstruction loss to form node clusters 1106. The graph level embedding may be derived 1112 from node matching 1114 based attention pooling 1115. The loss 1120 may be calculated 1116 by the resulting distance and kernel matrix 1118 and backpropagation 1122 may be performed to update all model parameters)
and level embeddings (Frieder [0082] In embodiments, a Cross-Global Attention Graph Kernel Network 1100 may learn an end-to-end deep graph kernel on a batch of graphs, as shown in FIG. 11. At 1102, a plurality of patient graphs may be received/input. The node level embedding 1104 and node clusters 1106 may be determined first by, at 1108 determining shared weight Graph Convolutional Networks (GCNs) to form node level embedding 1104, and at 1110, determining learning node clusters with reconstruction loss to form node clusters 1106. The graph level embedding may be derived 1112 from node matching 1114 based attention pooling 1115. The loss 1120 may be calculated 1116 by the resulting distance and kernel matrix 1118 and backpropagation 1122 may be performed to update all model parameters).
Frieder is considered to be analogous to the claimed invention because it is in the same field of using a deep neural network which uses embeddings (Frieder [0082]). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Larlus in view of Ho further in view of Frieder to allow for using node embeddings and level embeddings. Doing so would allow for an opportunity to model electronic health records (EHRs) in a compact structure with high interpretability (Frieder [0010]).
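For illustration only, one way that node embeddings and level embeddings could be included in a transformer input is to look each up in a table and sum them per position, in the manner BERT sums its token, segment, and positional embeddings. This is a hypothetical sketch: neither the claims nor the cited references state that the embeddings are summed, and the table contents, the dimension, and the names `make_table` and `embed` are assumptions.

```python
import random


def make_table(keys, dim, rng):
    """Build a lookup table of randomly initialized embedding vectors."""
    return {k: [rng.uniform(-1, 1) for _ in range(dim)] for k in keys}


def embed(nodes, node_table, level_table):
    """Map (node_type, depth) pairs to vectors by elementwise summation."""
    return [
        [n + l for n, l in zip(node_table[node_type], level_table[depth])]
        for node_type, depth in nodes
    ]


rng = random.Random(0)
DIM = 4  # assumed embedding size
node_table = make_table(["html", "body", "p", "[MASK]"], DIM, rng)
level_table = make_table(range(3), DIM, rng)

# A masked node at depth 2 still receives its level embedding.
sequence = [("html", 0), ("body", 1), ("[MASK]", 2)]
vectors = embed(sequence, node_table, level_table)
```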

Regarding claim 2, Larlus in view of Ho in view of Frieder teaches the method of claim 1.
Larlus teaches
wherein the training produces a trained model (Larlus [0039] In this pre-training task, (i) a sequence of words are tokenized, (ii) a random subset of the tokens are selected to be either masked, replaced by other tokens, or kept intact, (iii) all of the tokens are given as input to a language model (a bidirectional transformer encoder model), and (iv) the language model is trained to correctly predict the ground-truth labels of the selected tokens (before the tokens are masked or replaced, in case they are changed). FIG. 4 illustrated the architecture to carry out this pre-training task).
and wherein the method further comprises: 
inputting one or more terms into the trained model so that the trained model predicts a node type of the terms (Larlus [0035] More specifically, given an image-caption pair, a word in the caption is masked, and the image conditioned masked language modeling tries to predict the label of the masked word by using the representation of the image).
Larlus teaches a node type; however, Larlus does not teach
wherein the node type is from the semi-structured content.
Ho teaches
wherein the node type is from the semi-structured content (Ho [0020] In the QA system 100, the knowledge manager 104 may be configured to receive inputs from various sources. For example, knowledge manager 104 may receive input from the network 102, one or more knowledge bases or corpora of electronic documents 106 which stores electronic documents 107, semantic data 108, or other possible sources of data input. In selected embodiments, the knowledge database 106 may include structured, semi-structured, and/or unstructured content in a plurality of documents that are contained in one or more large knowledge databases or corpora. The various computing devices (e.g., 110, 120, 130) on the network 102 may include access points for content creators and content users. Some of the computing devices may include devices for a database storing the corpus of data as the body of information used by the knowledge manager 104 to generate answers to questions. The network 102 may include local network connections and remote connections in various embodiments, such that knowledge manager 104 may operate in environments of any size, including local and global, e.g., the Internet. Additionally, knowledge manager 104 serves as a front-end system that can make available a variety of knowledge extracted from or represented in documents, network-accessible sources and/or structured data sources. In this manner, some processes populate the knowledge manager, with the knowledge manager also including input interfaces to receive knowledge requests and respond accordingly).
Ho is considered to be analogous to the claimed invention because it is in the same field of using natural language processing for text analytics (Ho [0100]). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Larlus further in view of Ho to allow for using semi-structured content having hierarchical nodes. Doing so would allow better presentation and visualization of concepts and their inter-relations (e.g., hierarchical presentation) (Ho [0051]).


Regarding claim 4, Larlus in view of Ho in view of Frieder teaches the method of claim 1.
Larlus teaches
wherein the bi-directional transformer model generates and uses bi-directional embeddings from the first text corpus (Larlus [0055] The above process is carried out to train the F-CNN (660 of FIG. 5) by providing the F-CNN (660 of FIG. 5) with reliable side information extracted from textual data. To carry out the training, one can use a pre-trained bidirectional transformer encoder model such as BERT as language model. Other language models could have been used. To benefit from the language prior learned by BERT while training the F-CNN: (i) the parameters of BERT (θ.sub.LM) are frozen, (ii) the pooled visual embedding vector are mapped to the token vocabulary space using a context filter (630 of FIG. 5) and the token embeddings that are parts of the pre-trained BERT model), 
and wherein the bi-directional embeddings are selected from the group consisting of 
token embeddings (Larlus [0055] The above process is carried out to train the F-CNN (660 of FIG. 5) by providing the F-CNN (660 of FIG. 5) with reliable side information extracted from textual data. To carry out the training, one can use a pre-trained bidirectional transformer encoder model such as BERT as language model. Other language models could have been used. To benefit from the language prior learned by BERT while training the F-CNN: (i) the parameters of BERT (θ.sub.LM) are frozen, (ii) the pooled visual embedding vector are mapped to the token vocabulary space using a context filter (630 of FIG. 5) and the token embeddings that are parts of the pre-trained BERT model),
segment embeddings, 
and positional embeddings (Larlus [0088] ρ_θK and ρ_θV blocks are built by using two Conv2D-BatchNorm2D-ReLU layers and a linear Conv2D layer afterwards. Each Conv2D layer has 3×3 kernels and 512 channels, except the last linear Conv2D where it has 768 channels which is the dimension of the token representations in BERT model. Besides, in order for ρ_θK and ρ_θV to understand the spatial configuration of the visual feature vectors, one-hat positional embeddings are concatenated to the visual feature vectors Φ_θCNN(I_i·) before feeding them into ρ_θK and ρ_θV blocks. All trainable parameters in the model are tuned by performing 100k SGD updates with batches of size 256, using ADAM optimizer with learning rates 5×10^-5 and 5×10^-4 for the parameters in Φ_θCNN and [ρ_θK ρ_θV] networks, respectively. Linear learning rate decay is applied during training).


Regarding claim 6, Larlus in view of Ho in view of Frieder teaches the method of claim 1.
Larlus does not teach
wherein the semi-structured content comprises at least one content type selected from the group consisting of 
hypertext markup language, 
extensible markup language, 
JavaScript Object Notation, 
and Markdown markup language.
Ho teaches
semi-structured content (Ho [0020] In the QA system 100, the knowledge manager 104 may be configured to receive inputs from various sources. For example, knowledge manager 104 may receive input from the network 102, one or more knowledge bases or corpora of electronic documents 106 which stores electronic documents 107, semantic data 108, or other possible sources of data input. In selected embodiments, the knowledge database 106 may include structured, semi-structured, and/or unstructured content in a plurality of documents that are contained in one or more large knowledge databases or corpora. The various computing devices (e.g., 110, 120, 130) on the network 102 may include access points for content creators and content users. Some of the computing devices may include devices for a database storing the corpus of data as the body of information used by the knowledge manager 104 to generate answers to questions. The network 102 may include local network connections and remote connections in various embodiments, such that knowledge manager 104 may operate in environments of any size, including local and global, e.g., the Internet. Additionally, knowledge manager 104 serves as a front-end system that can make available a variety of knowledge extracted from or represented in documents, network-accessible sources and/or structured data sources. In this manner, some processes populate the knowledge manager, with the knowledge manager also including input interfaces to receive knowledge requests and respond accordingly)
wherein the semi-structured content comprises at least one content type selected from the group consisting of hypertext markup language, extensible markup language, JavaScript Object Notation, and Markdown markup language (Ho [0060] To provide another illustrative example application for processing concept vectors 13A to compute concept hierarchies, a vector processing application 14 may be configured to build or discover a hierarchy of concept neighborhoods using iterative clustering and probability flow-based traversals. For example, after a user explores one or more concept graphs 18 having nodes which represent concepts (e.g., Wikipedia concepts), the user may request the user's browser to produce a hierarchy of concept neighborhoods, where each neighborhood can be understood by the user as a topic or common theme between the concepts that are measured to be connected to the neighborhood. In response to the neighborhood hierarchy request, the vector processing application 14 may process the extracted concept vectors 13A to identify, for each concept (node in the graph), the neighborhood of the graph that is associated with the concept along with a computed strength or similarity metric. In addition, the vector processing application 14 may process the concepts belonging to each neighborhood to identify a single concept that can represent the neighborhood. 
The construction of neighborhood hierarchies can be done by iteratively selecting nodes from a starting concept graph G for processing to identify the most similar non-selected node in concept graph G for combination with each selected node, for removal from concept graph G, and for transfer to a graph N_i, and edges from the two combined nodes are transferred to N_i to point to appropriate representative nodes of N_i, until all nodes are removed from concept graph G, at which point the graph N_i is stored, the concept graph G is updated with the graph N_i, and the process is repeated for i=i+1 until the number of nodes in the graph N_i is less than a specified number of neighborhoods. For any graph N_i resultant from the iterations node similarity may be computed using a standard node similarity metric, such as SimRank and Jaccard similarity coefficients, on the nodes created by the joining of the two nodes in graph N_(i−1). The vector processing application 14 may also create a hierarchy of neighborhoods for display, allowing the user to efficiently identify how a collection of concepts relate to each other, and whether the collection can be partitioned into specific themes that are low in the hierarchy. To provide an illustrative example of usage for the proposed hierarchical clustering and mapping, the corpus may include a first document (A) containing concepts regarding “web design” and “web accessibility.” Given a second document (B) which contains concepts regarding “javascript” and “HTML,” a neighborhood hierarchy discovery algorithm can efficiently classify that concepts in the documents A and B are under a “web development” theme. 
Now, given a third document (C) which refers to the concept of “visual impairment,” the neighborhood hierarchy discovery algorithm would be configured to efficiently identify that the theme in common between A and C is “web accessibility.” Identifying whether to classify A, B and C into a single, higher-level theme, or whether to classify A, B, and C into multiple categories can be done by traversing the N_i graphs generated from the multiple iterations of the algorithm, and determining in which N_i graphs do the A, B and C nodes appear in the same supernode (neighborhood). Due to the logarithmic decrease in nodes after each iteration of the algorithm, graphs generated through fewer iterations starting from G are composed of more nodes (more neighborhoods) than graphs generated from more iterations. If nodes A, B and C are found to be in the same neighborhood in a low iteration N_i, they are considered to be highly connected, whereas if A, B, and C are only found in the same neighborhood in a high iteration N_i, they are considered to be lowly connected).
Ho is considered to be analogous to the claimed invention because it is in the same field of using natural language processing for text analytics (Ho [0100]). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Larlus further in view of Ho to allow for using semi-structured content having hierarchical nodes and HTML or JavaScript. Doing so would allow better presentation and visualization of concepts and their inter-relations (e.g., hierarchical presentation) (Ho [0051]).
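For illustration only, the sense in which hypertext markup language is semi-structured content with hierarchical nodes can be sketched with Python's standard `html.parser` module, which reports each start tag in document order. This is a minimal sketch, not drawn from any cited reference: the class name `NodeLister` is hypothetical, and the depth counter assumes every start tag has a matching end tag (unclosed void elements such as a bare `<br>` would skew it).

```python
from html.parser import HTMLParser


class NodeLister(HTMLParser):
    """Collect (tag, depth) for each element, assuming balanced tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        self.nodes.append((tag, self.depth))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


def list_nodes(html):
    """Return the (tag, depth) pairs of an HTML string in document order."""
    parser = NodeLister()
    parser.feed(html)
    parser.close()
    return parser.nodes
```

Each returned pair names a node and its level in the hierarchy, which is the kind of structure the claimed semi-structured content types share.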

Regarding claim 7, Larlus in view of Ho in view of Frieder teaches the method of claim 1.
Larlus teaches
further comprising inputting a [text corpus] to a machine learning model so that the bi-directional transformer model is formed (Larlus [0039] In this pre-training task, (i) a sequence of words are tokenized, (ii) a random subset of the tokens are selected to be either masked, replaced by other tokens, or kept intact, (iii) all of the tokens are given as input to a language model (a bidirectional transformer encoder model), and (iv) the language model is trained to correctly predict the ground-truth labels of the selected tokens (before the tokens are masked or replaced, in case they are changed). FIG. 4 illustrated the architecture to carry out this pre-training task).
Larlus does not teach, but Ho teaches
a non-semi-structured text corpus (Ho [0020] In the QA system 100, the knowledge manager 104 may be configured to receive inputs from various sources. For example, knowledge manager 104 may receive input from the network 102, one or more knowledge bases or corpora of electronic documents 106 which stores electronic documents 107, semantic data 108, or other possible sources of data input. In selected embodiments, the knowledge database 106 may include structured, semi-structured, and/or unstructured content in a plurality of documents that are contained in one or more large knowledge databases or corpora. The various computing devices (e.g., 110, 120, 130) on the network 102 may include access points for content creators and content users. Some of the computing devices may include devices for a database storing the corpus of data as the body of information used by the knowledge manager 104 to generate answers to questions. The network 102 may include local network connections and remote connections in various embodiments, such that knowledge manager 104 may operate in environments of any size, including local and global, e.g., the Internet. Additionally, knowledge manager 104 serves as a front-end system that can make available a variety of knowledge extracted from or represented in documents, network-accessible sources and/or structured data sources. In this manner, some processes populate the knowledge manager, with the knowledge manager also including input interfaces to receive knowledge requests and respond accordingly).
Ho is considered to be analogous to the claimed invention because it is in the same field of using natural language processing for text analytics (Ho [0100]). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Larlus further in view of Ho to allow for using semi-structured content having hierarchical nodes. Doing so would allow better presentation and visualization of concepts and their inter-relations (e.g., hierarchical presentation) (Ho [0051]).

Regarding claim 8, Larlus in view of Ho in view of Frieder teaches the method of claim 1.
Larlus does not teach
wherein the level embeddings are based on a number of node depth levels in the semi-structured content.
Ho teaches
semi-structured content (Ho [0020] In the QA system 100, the knowledge manager 104 may be configured to receive inputs from various sources. For example, knowledge manager 104 may receive input from the network 102, one or more knowledge bases or corpora of electronic documents 106 which stores electronic documents 107, semantic data 108, or other possible sources of data input. In selected embodiments, the knowledge database 106 may include structured, semi-structured, and/or unstructured content in a plurality of documents that are contained in one or more large knowledge databases or corpora. The various computing devices (e.g., 110, 120, 130) on the network 102 may include access points for content creators and content users. Some of the computing devices may include devices for a database storing the corpus of data as the body of information used by the knowledge manager 104 to generate answers to questions. The network 102 may include local network connections and remote connections in various embodiments, such that knowledge manager 104 may operate in environments of any size, including local and global, e.g., the Internet. Additionally, knowledge manager 104 serves as a front-end system that can make available a variety of knowledge extracted from or represented in documents, network-accessible sources and/or structured data sources. In this manner, some processes populate the knowledge manager, with the knowledge manager also including input interfaces to receive knowledge requests and respond accordingly).
Ho is considered to be analogous to the claimed invention because it is in the same field of using natural language processing for text analytics (Ho [0100]). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Larlus further in view of Ho to allow for using semi-structured content having hierarchical nodes with specified depths. Doing so would allow better presentation and visualization of concepts and their inter-relations (e.g., hierarchical presentation) (Ho [0051]).
Larlus in view of Ho does not teach
wherein the level embeddings are based on a number of node depth levels in the semi-structured content.
Frieder teaches
level embeddings (Frieder [0082] In embodiments, a Cross-Global Attention Graph Kernel Network 1100 may learn an end-to-end deep graph kernel on a batch of graphs, as shown in FIG. 11. At 1102, a plurality of patient graphs may be received/input. The node level embedding 1104 and node clusters 1106 may be determined first by, at 1108 determining shared weight Graph Convolutional Networks (GCNs) to form node level embedding 1104, and at 1110, determining learning node clusters with reconstruction loss to form node clusters 1106. The graph level embedding may be derived 1112 from node matching 1114 based attention pooling 1115. The loss 1120 may be calculated 1116 by the resulting distance and kernel matrix 1118 and backpropagation 1122 may be performed to update all model parameters)
node depth levels (Frieder [0068] To define a topological sequence 506-1, 506-2, let T be a topological ordering of graph G=(V, E) such that T={n_i | i=1, . . . , |V|}, the topological sequence S is defined as S={n_i·label+level | i=1, . . . , |V|, and n_i∈T}  (9) where + represents the string concatenation and level denotes the order of occurrence of label associated to node n in T. Namely, every node in the topological sequence has an attached number to indicate the level. The level indicates the order of occurrence of the same node label in the topological ordering).
Frieder is considered to be analogous to the claimed invention because it is in the same field of using a deep neural network which uses embeddings (Frieder [0082]). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Larlus in view of Ho further in view of Frieder to allow for using node embeddings and level embeddings. Doing so would allow for an opportunity to model the EHRs in a compact structure with high interpretability (Frieder [0010]).
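For illustration only, the level labeling described in the quoted Frieder [0068] passage (each node label in a topological ordering is concatenated with a number giving the order of occurrence of that label) might be sketched as follows. The function name, the example labels, and the choice to start counting at 1 are assumptions made for this sketch and are not taken from Frieder.

```python
from collections import defaultdict


def topological_sequence(ordered_labels):
    """Append an occurrence index ("level") to each label of a topological ordering.

    ordered_labels: node labels listed in topological order (hypothetical input).
    Counting from 1 is an assumption; the quoted passage does not fix the base.
    """
    occurrences = defaultdict(int)
    sequence = []
    for label in ordered_labels:
        occurrences[label] += 1
        # String concatenation of the label and its order of occurrence.
        sequence.append(f"{label}{occurrences[label]}")
    return sequence


# Hypothetical ordering in which the label "fever" occurs twice.
example = topological_sequence(["fever", "aspirin", "fever", "rest"])
```

Under these assumptions, a repeated label receives increasing level numbers while a label seen only once receives level 1.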


Regarding claim 10, Larlus teaches a computer system for natural language processing (Larlus [0027] In other domain-specific proxy tasks utilizing self-supervised learning, a “pretext” task is solved to learn an implicit prior knowledge about the structure in the input space. The prior knowledge can be utilized in the target tasks, as discussed above. For computer vision applications, colorizing a gray-scale image, predicting image rotations, or clustering image embeddings provide useful priors to downstream vision problems. Similarly, solving next sentence prediction and masked language modeling tasks enables a language model to perform substantially better on a diverse set of natural language processing target tasks), 
the computer system comprising: 
one or more processors (Larlus [0020] The server 100 is typically connected to an extended network 200 such as the Internet for data exchange. The server 100 comprises a data processor 110 and memory 120, such as a hard disk), 
one or more computer-readable memories (Larlus [0020] The server 100 is typically connected to an extended network 200 such as the Internet for data exchange. The server 100 comprises a data processor 110 and memory 120, such as a hard disk), 
one or more computer-readable tangible storage media (Larlus [0020] The server 100 is typically connected to an extended network 200 such as the Internet for data exchange. The server 100 comprises a data processor 110 and memory 120, such as a hard disk), 
and program instructions stored on at least one of the one or more computer-readable tangible storage media for execution by at least one of the one or more processors via at least one of the one or more computer-readable memories (Larlus [0090] To compute the performance on this task, a publicly shared repository is used. All the approaches that are compared use an AlexNet-like architecture like the image conditioned masked language modeling), 
wherein the computer system is capable of performing a method comprising: 
receiving a first text corpus (Larlus [0038] Masked language modeling is a self-supervised proxy task to pre-train a language model over large-scale text corpora. This type of pre-training scheme enables the language model to learn efficient language priors so that simply fine-tuning the language model achieves significant improvements over the state-of-the-art on a wide range of natural language processing target tasks)
masking some of the [nodes] (Larlus [0039] In this pre-training task, (i) a sequence of words are tokenized, (ii) a random subset of the tokens are selected to be either masked, replaced by other tokens, or kept intact, (iii) all of the tokens are given as input to a language model (a bidirectional transformer encoder model), and (iv) the language model is trained to correctly predict the ground-truth labels of the selected tokens (before the tokens are masked or replaced, in case they are changed). FIG. 4 illustrated the architecture to carry out this pre-training task);
a bi-directional transformer model (Larlus [0039] In this pre-training task, (i) a sequence of words are tokenized, (ii) a random subset of the tokens are selected to be either masked, replaced by other tokens, or kept intact, (iii) all of the tokens are given as input to a language model (a bidirectional transformer encoder model), and (iv) the language model is trained to correctly predict the ground-truth labels of the selected tokens (before the tokens are masked or replaced, in case they are changed). FIG. 4 illustrated the architecture to carry out this pre-training task);
and training the bi-directional transformer model on the first text corpus (Larlus [0039] In this pre-training task, (i) a sequence of words are tokenized, (ii) a random subset of the tokens are selected to be either masked, replaced by other tokens, or kept intact, (iii) all of the tokens are given as input to a language model (a bidirectional transformer encoder model), and (iv) the language model is trained to correctly predict the ground-truth labels of the selected tokens (before the tokens are masked or replaced, in case they are changed). FIG. 4 illustrated the architecture to carry out this pre-training task)
by reducing loss from the bi-directional transformer model (Larlus [0094] To evaluate the representations learned by both models, the convolutional layers of an AlexNet are taken and a generalized-mean pooling, L2 normalization, and fully-connected layers are appended. The parameters of the fully-connected layer are trained for 300 epochs by minimizing the AP Loss over the clean version of the Landmarks dataset. The complete model is tested on the revisited Oxford Buildings and Paris datasets by computing mean-average-precision scores. The image representations that are produced by solving the image conditioned masked language modeling task outperforms the counterparts obtained by the RotNet model on this task)
predicting the masked [nodes] (Larlus [0035] More specifically, given an image-caption pair, a word in the caption is masked, and the image conditioned masked language modeling tries to predict the label of the masked word by using the representation of the image).
Larlus does not teach
that comprises semi-structured content comprising hierarchical nodes; 
masking some of the hierarchical nodes 
generating node embeddings and level embeddings from the semi-structured content of the first text corpus and from the masked hierarchical nodes; 
including the node embeddings and the level embeddings in a bi-directional transformer model
predicting the masked hierarchical nodes. 
Ho teaches
that comprises semi-structured content (Ho [0020] In the QA system 100, the knowledge manager 104 may be configured to receive inputs from various sources. For example, knowledge manager 104 may receive input from the network 102, one or more knowledge bases or corpora of electronic documents 106 which stores electronic documents 107, semantic data 108, or other possible sources of data input. In selected embodiments, the knowledge database 106 may include structured, semi-structured, and/or unstructured content in a plurality of documents that are contained in one or more large knowledge databases or corpora. The various computing devices (e.g., 110, 120, 130) on the network 102 may include access points for content creators and content users. Some of the computing devices may include devices for a database storing the corpus of data as the body of information used by the knowledge manager 104 to generate answers to questions. The network 102 may include local network connections and remote connections in various embodiments, such that knowledge manager 104 may operate in environments of any size, including local and global, e.g., the Internet. Additionally, knowledge manager 104 serves as a front-end system that can make available a variety of knowledge extracted from or represented in documents, network-accessible sources and/or structured data sources. In this manner, some processes populate the knowledge manager, with the knowledge manager also including input interfaces to receive knowledge requests and respond accordingly)
comprising hierarchical nodes (Ho [0100] By now, it will be appreciated that there is disclosed herein a system, method, apparatus, and computer program product for generating concept hierarchies with an information handling system having a processor and a memory. As disclosed, the system, method, apparatus, and computer program product generate at least a first concept set comprising one or more concepts extracted from one or more content sources. At the system, a user request is received to produce a hierarchy of concepts from the first concept set using one or more specified hierarchy parameters, which may be default parameters or parameters specified in the user request. A vector representation of each of the concepts in the first concept set is generated, retrieved, constructed, or otherwise obtained. The vectors are processed by performing a natural language processing (NLP) analysis comparison of the vector representation of each of the concepts in the first concept set to determine a similarity measure for each pair of distinct concepts Ci and Cj in the first concept set. The similarity measure may be defined on a selected subset of dimensions of the concept vectors with uniform or non-uniform weights, where the selected dimensions and their weights can be modified in each iterative step of hierarchy construction. In selected embodiments, the NLP analysis includes analyzing a vector similarity function sim(Vi, Vj) between vectors Vi, Vj representing each pair of distinct concepts Ci and Cj in the first concept set. In selected embodiments, analysis of the vector similarity function sim(Vi, Vj) includes computing, for each concept Ci for i=1 . . . N, the similarity measure corresponding to said concept Ci as a cosine distance measure between each vector pair Vi, Vj for j=1 . . . N, i≠j, and then selecting a distinct, unconnected concept Cj having a maximum cosine distance measure with the concept Ci. 
A concept hierarchy is constructed based on the one or more specified hierarchy parameters and the similarity measure for each pair of distinct concepts Ci and Cj in the first concept set. In selected embodiments, the concept hierarchy is constructed using a bottom-up method to iteratively build a concept graph by selecting distinct, unconnected concepts Ci, Cj from the first concept set based on a maximal similarity measure and identifying a first concept as a hierarchy root which has a maximal number of occurrences in the first concept set. In other embodiments, the concept hierarchy is constructed using a top-down/frequency method to sort the one or more concepts in the first concept set into a sorted concept sequence based on frequency of occurrence, select a root node C1 that has maximum frequency of occurrence, and sequentially add each concept from the sorted concept sequence to the root node C1 in the concept hierarchy based on a maximal similarity measure between a selected concept from the sorted concept sequence and the root node C1 in the concept hierarchy, or to another existing node Ci in the concept hierarchy based on a maximal similarity measure between a selected concept in the sorted concept sequence and that other existing node Ci in the concept hierarchy. In other embodiments, the concept hierarchy is constructed by generating a first sequence over a set of abstract concepts C1, . . . , Ck by simulating a random walk on a first hierarchical structure defined by a first branching factor and specified depth; generating a second sequence over a set of regular concepts D1, . . . , Dk, where the sequence extracted from a corpus; generating or retrieving a vector representation for each of the concepts in the first sequence of abstract concepts and the second sequence of regular concepts; and identifying one or more pairs of regular concepts to approximate corresponding pairs of abstract concepts based on analogies of relationships between the abstract concepts and the regular concepts. In addition, the system may display the concept hierarchy to visually present inter-relations between concepts from the first concept set, such as by visually presenting a hierarchical structure conveying concept grouping of concepts from the first concept set to enable user navigation over the first concept set. In other embodiments, the system may iteratively select a concept from the first concept set; identify an associated neighborhood for each selected concept in the first concept set using iterative clustering and probability flow-based traversals to identify, for each concept in the first concept set, an associated neighborhood and corresponding strength measure; and create a hierarchy of associated neighborhoods, each of which comprises a representative concept to enable a human user to easily identify the neighborhood).
Ho is considered to be analogous to the claimed invention because it is in the same field of using natural language processing for text analytics (Ho [0100]). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Larlus further in view of Ho to allow for using semi-structured content having hierarchical nodes. Doing so would allow better presentation and visualization of concepts and their inter-relations (e.g., hierarchical presentation) (Ho [0051]).
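For illustration only, the pairwise comparison described in the quoted Ho [0100] passage (computing a cosine measure between each vector pair Vi, Vj and selecting, for each concept Ci, the distinct concept Cj with the maximum measure) might be sketched as follows. The sketch assumes the measure is the ordinary cosine similarity, and the concept names and vectors are hypothetical; none of the identifiers come from Ho.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(vectors):
    """Map each concept Ci to the distinct concept Cj with the highest cosine measure.

    vectors: dict of concept name -> vector; must hold at least two concepts.
    """
    result = {}
    for ci, vi in vectors.items():
        candidates = [cj for cj in vectors if cj != ci]
        result[ci] = max(candidates, key=lambda cj: cosine(vi, vectors[cj]))
    return result


# Hypothetical two-dimensional concept vectors.
concepts = {
    "html": [1.0, 0.0],
    "javascript": [0.9, 0.1],
    "visual impairment": [0.0, 1.0],
}
pairs = most_similar(concepts)
```

The sketch omits the "unconnected" constraint and the iterative hierarchy construction described in the passage; it shows only the per-concept selection step.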
Larlus in view of Ho does not teach
generating node embeddings and level embeddings from the semi-structured content of the first text corpus and from the masked hierarchical nodes; 
including the node embeddings and the level embeddings in a bi-directional transformer model
Frieder teaches
node embeddings (Frieder [0082] In embodiments, a Cross-Global Attention Graph Kernel Network 1100 may learn an end-to-end deep graph kernel on a batch of graphs, as shown in FIG. 11. At 1102, a plurality of patient graphs may be received/input. The node level embedding 1104 and node clusters 1106 may be determined first by, at 1108 determining shared weight Graph Convolutional Networks (GCNs) to form node level embedding 1104, and at 1110, determining learning node clusters with reconstruction loss to form node clusters 1106. The graph level embedding may be derived 1112 from node matching 1114 based attention pooling 1115. The loss 1120 may be calculated 1116 by the resulting distance and kernel matrix 1118 and backpropagation 1122 may be performed to update all model parameters)
and level embeddings (Frieder [0082] In embodiments, a Cross-Global Attention Graph Kernel Network 1100 may learn an end-to-end deep graph kernel on a batch of graphs, as shown in FIG. 11. At 1102, a plurality of patient graphs may be received/input. The node level embedding 1104 and node clusters 1106 may be determined first by, at 1108 determining shared weight Graph Convolutional Networks (GCNs) to form node level embedding 1104, and at 1110, determining learning node clusters with reconstruction loss to form node clusters 1106. The graph level embedding may be derived 1112 from node matching 1114 based attention pooling 1115. The loss 1120 may be calculated 1116 by the resulting distance and kernel matrix 1118 and backpropagation 1122 may be performed to update all model parameters).
Frieder is considered to be analogous to the claimed invention because it is in the same field of using a deep neural network which uses embeddings (Frieder [0082]). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Larlus in view of Ho further in view of Frieder to allow for using node embeddings and level embeddings. Doing so would allow for an opportunity to model the EHRs in a compact structure with high interpretability (Frieder [0010]).
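For illustration only, the token selection step described in the quoted Larlus [0039] passage (a random subset of tokens is selected to be masked, replaced by other tokens, or kept intact, and the model is trained to predict the ground-truth labels of the selected tokens) might be sketched as follows. The 15% selection rate and the 80/10/10 split are assumptions borrowed from common BERT practice; Larlus [0039] does not state them, and the function and parameter names are hypothetical.

```python
import random


def select_and_mask(tokens, vocab, select_prob=0.15, seed=0):
    """Return (inputs, labels) for a masked language modeling step.

    labels[i] holds the ground-truth token where position i was selected,
    and None elsewhere. Probabilities are assumed values, not from Larlus.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < select_prob:
            labels.append(token)  # ground truth recorded before any change
            roll = rng.random()
            if roll < 0.8:
                inputs.append("[MASK]")           # masked
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # replaced by another token
            else:
                inputs.append(token)              # kept intact
        else:
            labels.append(None)
            inputs.append(token)
    return inputs, labels


tokens = "a dog runs across the green field".split()
inputs, labels = select_and_mask(tokens, vocab=tokens, select_prob=0.5)
```

A training loop would then feed `inputs` to the language model and compute the loss only at positions where `labels` is not `None`.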

Regarding claim 11, Larlus in view of Ho in view of Frieder teaches the computer system of claim 10.
Larlus teaches
wherein the training produces a trained model (Larlus [0039] In this pre-training task, (i) a sequence of words are tokenized, (ii) a random subset of the tokens are selected to be either masked, replaced by other tokens, or kept intact, (iii) all of the tokens are given as input to a language model (a bidirectional transformer encoder model), and (iv) the language model is trained to correctly predict the ground-truth labels of the selected tokens (before the tokens are masked or replaced, in case they are changed). FIG. 4 illustrated the architecture to carry out this pre-training task).
and wherein the method further comprises: 
inputting one or more terms into the trained model so that the trained model predicts a node type of the terms (Larlus [0035] More specifically, given an image-caption pair, a word in the caption is masked, and the image conditioned masked language modeling tries to predict the label of the masked word by using the representation of the image).
Larlus teaches a node type; however, Larlus does not teach
wherein the node type is from the semi-structured content.
Ho teaches
wherein the node type is from the semi-structured content (Ho [0020] In the QA system 100, the knowledge manager 104 may be configured to receive inputs from various sources. For example, knowledge manager 104 may receive input from the network 102, one or more knowledge bases or corpora of electronic documents 106 which stores electronic documents 107, semantic data 108, or other possible sources of data input. In selected embodiments, the knowledge database 106 may include structured, semi-structured, and/or unstructured content in a plurality of documents that are contained in one or more large knowledge databases or corpora. The various computing devices (e.g., 110, 120, 130) on the network 102 may include access points for content creators and content users. Some of the computing devices may include devices for a database storing the corpus of data as the body of information used by the knowledge manager 104 to generate answers to questions. The network 102 may include local network connections and remote connections in various embodiments, such that knowledge manager 104 may operate in environments of any size, including local and global, e.g., the Internet. Additionally, knowledge manager 104 serves as a front-end system that can make available a variety of knowledge extracted from or represented in documents, network-accessible sources and/or structured data sources. In this manner, some processes populate the knowledge manager, with the knowledge manager also including input interfaces to receive knowledge requests and respond accordingly).
Ho is considered to be analogous to the claimed invention because it is in the same field of using natural language processing for text analytics (Ho [0100]). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Larlus further in view of Ho to allow for using semi-structured content having hierarchical nodes. Doing so would allow better presentation and visualization of concepts and their inter-relations (e.g., hierarchical presentation) (Ho [0051]).


Regarding claim 13, Larlus in view of Ho in view of Frieder teaches the computer system of claim 10.
Larlus teaches
wherein the bi-directional transformer model generates and uses bi-directional embeddings from the first text corpus (Larlus [0055] The above process is carried out to train the F-CNN (660 of FIG. 5) by providing the F-CNN (660 of FIG. 5) with reliable side information extracted from textual data. To carry out the training, one can use a pre-trained bidirectional transformer encoder model such as BERT as language model. Other language models could have been used. To benefit from the language prior learned by BERT while training the F-CNN: (i) the parameters of BERT (θ_LM) are frozen, (ii) the pooled visual embedding vector are mapped to the token vocabulary space using a context filter (630 of FIG. 5) and the token embeddings that are parts of the pre-trained BERT model), 
and wherein the bi-directional embeddings are selected from the group consisting of 
token embeddings (Larlus [0055] The above process is carried out to train the F-CNN (660 of FIG. 5) by providing the F-CNN (660 of FIG. 5) with reliable side information extracted from textual data. To carry out the training, one can use a pre-trained bidirectional transformer encoder model such as BERT as language model. Other language models could have been used. To benefit from the language prior learned by BERT while training the F-CNN: (i) the parameters of BERT (θ_LM) are frozen, (ii) the pooled visual embedding vector are mapped to the token vocabulary space using a context filter (630 of FIG. 5) and the token embeddings that are parts of the pre-trained BERT model),
segment embeddings, 
and positional embeddings (Larlus [0088] ρ_θK and ρ_θV blocks are built by using two Conv2D-BatchNorm2D-ReLU layers and a linear Conv2D layer afterwards. Each Conv2D layer has 3×3 kernels and 512 channels, except the last linear Conv2D where it has 768 channels which is the dimension of the token representations in BERT model. Besides, in order for ρ_θK and ρ_θV to understand the spatial configuration of the visual feature vectors, one-hat positional embeddings are concatenated to the visual feature vectors Φ_θCNN(I_i·) before feeding them into ρ_θK and ρ_θV blocks. All trainable parameters in the model are tuned by performing 100k SGD updates with batches of size 256, using ADAM optimizer with learning rates 5×10^-5 and 5×10^-4 for the parameters in Φ_θCNN and [ρ_θK ρ_θV] networks, respectively. Linear learning rate decay is applied during training).


Regarding claim 15, Larlus in view of Ho in view of Frieder teaches the computer system of claim 10.
Larlus does not teach
wherein the semi-structured content comprises at least one content type selected from the group consisting of 
hypertext markup language, 
extensible markup language, 
JavaScript Object Notation, 
and Markdown markup language.
Ho teaches
semi-structured content (Ho [0020] In the QA system 100, the knowledge manager 104 may be configured to receive inputs from various sources. For example, knowledge manager 104 may receive input from the network 102, one or more knowledge bases or corpora of electronic documents 106 which stores electronic documents 107, semantic data 108, or other possible sources of data input. In selected embodiments, the knowledge database 106 may include structured, semi-structured, and/or unstructured content in a plurality of documents that are contained in one or more large knowledge databases or corpora. The various computing devices (e.g., 110, 120, 130) on the network 102 may include access points for content creators and content users. Some of the computing devices may include devices for a database storing the corpus of data as the body of information used by the knowledge manager 104 to generate answers to questions. The network 102 may include local network connections and remote connections in various embodiments, such that knowledge manager 104 may operate in environments of any size, including local and global, e.g., the Internet. Additionally, knowledge manager 104 serves as a front-end system that can make available a variety of knowledge extracted from or represented in documents, network-accessible sources and/or structured data sources. In this manner, some processes populate the knowledge manager, with the knowledge manager also including input interfaces to receive knowledge requests and respond accordingly)
wherein the semi-structured content comprises at least one content type selected from the group consisting of hypertext markup language, extensible markup language, JavaScript Object Notation, and Markdown markup language (Ho [0060] To provide another illustrative example application for processing concept vectors 13A to compute concept hierarchies, a vector processing application 14 may be configured to build or discover a hierarchy of concept neighborhoods using iterative clustering and probability flow-based traversals. For example, after a user explores one or more concept graphs 18 having nodes which represent concepts (e.g., Wikipedia concepts), the user may request the user's browser to produce a hierarchy of concept neighborhoods, where each neighborhood can be understood by the user as a topic or common theme between the concepts that are measured to be connected to the neighborhood. In response to the neighborhood hierarchy request, the vector processing application 14 may process the extracted concept vectors 13A to identify, for each concept (node in the graph), the neighborhood of the graph that is associated with the concept along with a computed strength or similarity metric. In addition, the vector processing application 14 may process the concepts belonging to each neighborhood to identify a single concept that can represent the neighborhood. 
The construction of neighborhood hierarchies can be done by iteratively selecting nodes from a starting concept graph G for processing to identify the most similar non-selected node in concept graph G for combination with each selected node, for removal from concept graph G, and for transfer to a graph N_i, and edges from the two combined nodes are transferred to N_i to point to appropriate representative nodes of N_i, until all nodes are removed from concept graph G, at which point the graph N_i is stored, the concept graph G is updated with the graph N_i, and the process is repeated for i=i+1 until the number of nodes in the graph N_i is less than a specified number of neighborhoods. For any graph N_i resultant from the iterations node similarity may be computed using a standard node similarity metric, such as SimRank and Jaccard similarity coefficients, on the nodes created by the joining of the two nodes in graph N_(i−1). The vector processing application 14 may also create a hierarchy of neighborhoods for display, allowing the user to efficiently identify how a collection of concepts relate to each other, and whether the collection can be partitioned into specific themes that are low in the hierarchy. To provide an illustrative example of usage for the proposed hierarchical clustering and mapping, the corpus may include a first document (A) containing concepts regarding “web design” and “web accessibility.” Given a second document (B) which contains concepts regarding “javascript” and “HTML,” a neighborhood hierarchy discovery algorithm can efficiently classify that concepts in the documents A and B are under a “web development” theme. 
Now, given a third document (C) which refers to the concept of “visual impairment,” the neighborhood hierarchy discovery algorithm would be configured to efficiently identify that the theme in common between A and C is “web accessibility.” Identifying whether to classify A, B and C into a single, higher-level theme, or whether to classify A, B, and C into multiple categories can be done by traversing the N_i graphs generated from the multiple iterations of the algorithm, and determining in which N_i graphs do the A, B and C nodes appear in the same supernode (neighborhood). Due to the logarithmic decrease in nodes after each iteration of the algorithm, graphs generated through fewer iterations starting from G are composed of more nodes (more neighborhoods) than graphs generated from more iterations. If nodes A, B and C are found to be in the same neighborhood in a low iteration N_i, they are considered to be highly connected, whereas if A, B, and C are only found in the same neighborhood in a high iteration N_i, they are considered to be lowly connected).
Ho is considered to be analogous to the claimed invention because it is in the same field of using natural language processing for text analytics (Ho [0100]). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Larlus further in view of Ho to allow for using semi-structured content having hierarchical nodes and HTML or JavaScript. Doing so would allow better presentation and visualization of concepts and their inter-relations (e.g., hierarchical presentation) (Ho [0051]).
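For illustration only, hierarchical nodes and their depth levels in one of the recited content types (JavaScript Object Notation) might be enumerated as in the sketch below. This is a generic traversal written for this illustration; it is not drawn from Larlus, Ho, or Frieder, nor from the application, and the document contents and the "root" name are hypothetical.

```python
import json


def node_depths(value, name="root", depth=0):
    """Yield (node name, depth level) pairs for a parsed JSON value."""
    yield name, depth
    if isinstance(value, dict):
        for key, child in value.items():
            yield from node_depths(child, key, depth + 1)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from node_depths(child, f"{name}[{index}]", depth + 1)


# Hypothetical semi-structured document.
document = json.loads('{"title": "Report", "sections": [{"heading": "Intro"}]}')
levels = list(node_depths(document))
max_depth = max(depth for _, depth in levels)
```

Here each object key or array element is treated as a node, and its depth level is its distance from the document root.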

Regarding claim 16, Larlus teaches a computer program product for natural language processing (Larlus [0027] In other domain-specific proxy tasks utilizing self-supervised learning, a “pretext” task is solved to learn an implicit prior knowledge about the structure in the input space. The prior knowledge can be utilized in the target tasks, as discussed above. For computer vision applications, colorizing a gray-scale image, predicting image rotations, or clustering image embeddings provide useful priors to downstream vision problems. Similarly, solving next sentence prediction and masked language modeling tasks enables a language model to perform substantially better on a diverse set of natural language processing target tasks), 
the computer program product comprising a computer-readable storage medium having program instructions embodied therewith (Larlus [0090] To compute the performance on this task, a publicly shared repository is used. All the approaches that are compared use an AlexNet-like architecture like the image conditioned masked language modeling), 
wherein the program instructions are executable by a processor to cause the processor to perform a method comprising: 
receiving a first text corpus (Larlus [0038] Masked language modeling is a self-supervised proxy task to pre-train a language model over large-scale text corpora. This type of pre-training scheme enables the language model to learn efficient language priors so that simply fine-tuning the language model achieves significant improvements over the state-of-the-art on a wide range of natural language processing target tasks)
masking some of the [nodes] (Larlus [0039] In this pre-training task, (i) a sequence of words are tokenized, (ii) a random subset of the tokens are selected to be either masked, replaced by other tokens, or kept intact, (iii) all of the tokens are given as input to a language model (a bidirectional transformer encoder model), and (iv) the language model is trained to correctly predict the ground-truth labels of the selected tokens (before the tokens are masked or replaced, in case they are changed). FIG. 4 illustrated the architecture to carry out this pre-training task);
a bi-directional transformer model (Larlus [0039] In this pre-training task, (i) a sequence of words are tokenized, (ii) a random subset of the tokens are selected to be either masked, replaced by other tokens, or kept intact, (iii) all of the tokens are given as input to a language model (a bidirectional transformer encoder model), and (iv) the language model is trained to correctly predict the ground-truth labels of the selected tokens (before the tokens are masked or replaced, in case they are changed). FIG. 4 illustrated the architecture to carry out this pre-training task);
and training the bi-directional transformer model on the first text corpus (Larlus [0039] In this pre-training task, (i) a sequence of words are tokenized, (ii) a random subset of the tokens are selected to be either masked, replaced by other tokens, or kept intact, (iii) all of the tokens are given as input to a language model (a bidirectional transformer encoder model), and (iv) the language model is trained to correctly predict the ground-truth labels of the selected tokens (before the tokens are masked or replaced, in case they are changed). FIG. 4 illustrated the architecture to carry out this pre-training task)
by reducing loss from the bi-directional transformer model (Larlus [0094] To evaluate the representations learned by both models, the convolutional layers of an AlexNet are taken and a generalized-mean pooling, L2 normalization, and fully-connected layers are appended. The parameters of the fully-connected layer are trained for 300 epochs by minimizing the AP Loss over the clean version of the Landmarks dataset. The complete model is tested on the revisited Oxford Buildings and Paris datasets by computing mean-average-precision scores. The image representations that are produced by solving the image conditioned masked language modeling task outperforms the counterparts obtained by the RotNet model on this task)
predicting the masked [nodes] (Larlus [0035] More specifically, given an image-caption pair, a word in the caption is masked, and the image conditioned masked language modeling tries to predict the label of the masked word by using the representation of the image).
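For illustration only, the token selection and masking steps quoted from Larlus [0039] can be sketched as follows. The sketch is not taken from Larlus: the function name select_and_mask, the 15% selection rate, and the 80/10/10 split between masking, replacing, and keeping a selected token are assumptions borrowed from common BERT-style practice, and the prediction step (iv) is omitted.

```python
import random

MASK = "[MASK]"  # assumed placeholder symbol for a masked token


def select_and_mask(tokens, vocab, select_prob=0.15, seed=0):
    """Return (corrupted, labels) for a tokenized sequence.

    labels[i] holds the original token at each selected position and
    None elsewhere, so a model is only asked to predict selected tokens.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < select_prob:
            labels.append(tok)  # ground-truth label kept before any change
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)               # masked
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # replaced by another token
            else:
                corrupted.append(tok)                # kept intact
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels


tokens = "the patient record lists a diagnosis".split()
corrupted, labels = select_and_mask(tokens, sorted(set(tokens)), select_prob=0.5)
```

Positions whose label is None are passed through unchanged; only the selected positions would contribute to a training loss.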
Larlus does not teach
that comprises semi-structured content comprising hierarchical nodes; 
masking some of the hierarchical nodes 
generating node embeddings and level embeddings from the semi-structured content of the first text corpus and from the masked hierarchical nodes; 
including the node embeddings and the level embeddings in a bi-directional transformer model
predicting the masked hierarchical nodes. 
Ho teaches
that comprises semi-structured content (Ho [0020] In the QA system 100, the knowledge manager 104 may be configured to receive inputs from various sources. For example, knowledge manager 104 may receive input from the network 102, one or more knowledge bases or corpora of electronic documents 106 which stores electronic documents 107, semantic data 108, or other possible sources of data input. In selected embodiments, the knowledge database 106 may include structured, semi-structured, and/or unstructured content in a plurality of documents that are contained in one or more large knowledge databases or corpora. The various computing devices (e.g., 110, 120, 130) on the network 102 may include access points for content creators and content users. Some of the computing devices may include devices for a database storing the corpus of data as the body of information used by the knowledge manager 104 to generate answers to questions. The network 102 may include local network connections and remote connections in various embodiments, such that knowledge manager 104 may operate in environments of any size, including local and global, e.g., the Internet. Additionally, knowledge manager 104 serves as a front-end system that can make available a variety of knowledge extracted from or represented in documents, network-accessible sources and/or structured data sources. In this manner, some processes populate the knowledge manager, with the knowledge manager also including input interfaces to receive knowledge requests and respond accordingly)
comprising hierarchical nodes (Ho [0100] By now, it will be appreciated that there is disclosed herein a system, method, apparatus, and computer program product for generating concept hierarchies with an information handling system having a processor and a memory. As disclosed, the system, method, apparatus, and computer program product generate at least a first concept set comprising one or more concepts extracted from one or more content sources. At the system, a user request is received to produce a hierarchy of concepts from the first concept set using one or more specified hierarchy parameters, which may be default parameters or parameters specified in the user request. A vector representation of each of the concepts in the first concept set is generated, retrieved, constructed, or otherwise obtained. The vectors are processed by performing a natural language processing (NLP) analysis comparison of the vector representation of each of the concepts in the first concept set to determine a similarity measure for each pair of distinct concepts Ci and Cj in the first concept set. The similarity measure may be defined on a selected subset of dimensions of the concept vectors with uniform or non-uniform weights, where the selected dimensions and their weights can be modified in each iterative step of hierarchy construction. In selected embodiments, the NLP analysis includes analyzing a vector similarity function sim(Vi, Vj) between vectors Vi, Vj representing each pair of distinct concepts Ci and Cj in the first concept set. In selected embodiments, analysis of the vector similarity function sim(Vi, Vj) includes computing, for each concept Ci for i=1 . . . N, the similarity measure corresponding to said concept Ci as a cosine distance measure between each vector pair Vi, Vj for j=1 . . . N, i≠j, and then selecting a distinct, unconnected concept Cj having a maximum cosine distance measure with the concept Ci. 
A concept hierarchy is constructed based on the one or more specified hierarchy parameters and the similarity measure for each pair of distinct concepts Ci and Cj in the first concept set. In selected embodiments, the concept hierarchy is constructed using a bottom-up method to iteratively build a concept graph by selecting distinct, unconnected concepts Ci, Cj from the first concept set based on a maximal similarity measure and identifying a first concept as a hierarchy root which has a maximal number of occurrences in the first concept set. In other embodiments, the concept hierarchy is constructed using a top-down/frequency method to sort the one or more concepts in the first concept set into a sorted concept sequence based on frequency of occurrence, select a root node C1 that has maximum frequency of occurrence, and sequentially add each concept from the sorted concept sequence to the root node C1 in the concept hierarchy based on a maximal similarity measure between a selected concept from the sorted concept sequence and the root node C1 in the concept hierarchy, or to another existing node Ci in the concept hierarchy based on a maximal similarity measure between a selected concept in the sorted concept sequence and that other existing node Ci in the concept hierarchy. In other embodiments, the concept hierarchy is constructed by generating a first sequence over a set of abstract concepts C1, . . . , Ck by simulating a random walk on a first hierarchical structure defined by a first branching factor and specified depth; generating a second sequence over a set of regular concepts D1, . . . , Dk, where the sequence extracted from a corpus; generating or retrieving a vector representation for each of the concepts in the first sequence of abstract concepts and the second sequence of regular concepts; and identifying one or more pairs of regular concepts to approximate corresponding pairs of abstract concepts based on analogies of relationships between the abstract concepts and the regular concepts. In addition, the system may display the concept hierarchy to visually present inter-relations between concepts from the first concept set, such as by visually presenting a hierarchical structure conveying concept grouping of concepts from the first concept set to enable user navigation over the first concept set. In other embodiments, the system may iteratively select a concept from the first concept set; identify an associated neighborhood for each selected concept in the first concept set using iterative clustering and probability flow-based traversals to identify, for each concept in the first concept set, an associated neighborhood and corresponding strength measure; and create a hierarchy of associated neighborhoods, each of which comprises a representative concept to enable a human user to easily identify the neighborhood).
Ho is considered to be analogous to the claimed invention because it is in the same field of using natural language processing for text analytics (Ho [0100]). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Larlus further in view of Ho to allow for using semi-structured content having hierarchical nodes. Doing so would allow better presentation and visualization of concepts and their inter-relations (e.g., hierarchical presentation) (Ho [0051]).
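As a non-authoritative sketch of the top-down/frequency construction quoted from Ho [0100], the following builds a parent map over concepts. The concept names, frequencies, and two-dimensional vectors are hypothetical, the function names are illustrative, and plain cosine similarity is assumed as the similarity measure; Ho's weighted or dimension-selected variants are not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def build_hierarchy(freq, vectors):
    """Return {concept: parent}; the root maps to None.

    Concepts are sorted by descending frequency, the most frequent becomes
    the root, and each later concept attaches to the existing node with
    which it has maximal similarity.
    """
    order = sorted(freq, key=lambda c: (-freq[c], c))
    parent = {order[0]: None}
    for concept in order[1:]:
        best = max(parent, key=lambda node: cosine(vectors[concept], vectors[node]))
        parent[concept] = best
    return parent


# Hypothetical concept set for illustration.
freq = {"disease": 10, "diabetes": 6, "insulin": 3, "fracture": 2}
vectors = {
    "disease": (1.0, 1.0),
    "diabetes": (1.0, 0.2),
    "insulin": (1.0, 0.0),
    "fracture": (0.1, 1.0),
}
hierarchy = build_hierarchy(freq, vectors)
```

In this toy input, "insulin" attaches under "diabetes" rather than the root because that pair has the higher similarity, which is the kind of hierarchical node relationship relied upon in the rejection.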
Larlus in view of Ho does not teach
generating node embeddings and level embeddings from the semi-structured content of the first text corpus and from the masked hierarchical nodes; 
including the node embeddings and the level embeddings in a bi-directional transformer model
Frieder teaches
node embeddings (Frieder [0082] In embodiments, a Cross-Global Attention Graph Kernel Network 1100 may learn an end-to-end deep graph kernel on a batch of graphs, as shown in FIG. 11. At 1102, a plurality of patient graphs may be received/input. The node level embedding 1104 and node clusters 1106 may be determined first by, at 1108 determining shared weight Graph Convolutional Networks (GCNs) to form node level embedding 1104, and at 1110, determining learning node clusters with reconstruction loss to form node clusters 1106. The graph level embedding may be derived 1112 from node matching 1114 based attention pooling 1115. The loss 1120 may be calculated 1116 by the resulting distance and kernel matrix 1118 and backpropagation 1122 may be performed to update all model parameters)
and level embeddings (Frieder [0082] In embodiments, a Cross-Global Attention Graph Kernel Network 1100 may learn an end-to-end deep graph kernel on a batch of graphs, as shown in FIG. 11. At 1102, a plurality of patient graphs may be received/input. The node level embedding 1104 and node clusters 1106 may be determined first by, at 1108 determining shared weight Graph Convolutional Networks (GCNs) to form node level embedding 1104, and at 1110, determining learning node clusters with reconstruction loss to form node clusters 1106. The graph level embedding may be derived 1112 from node matching 1114 based attention pooling 1115. The loss 1120 may be calculated 1116 by the resulting distance and kernel matrix 1118 and backpropagation 1122 may be performed to update all model parameters).
Frieder is considered to be analogous to the claimed invention because it is in the same field of using a deep neural network which uses embeddings (Frieder [0082]). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Larlus in view of Ho further in view of Frieder to allow for using node embeddings and level embeddings. Doing so would provide an opportunity to model electronic health records (EHRs) in a compact structure with high interpretability (Frieder [0010]).
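To make the combined limitation concrete, the following is a minimal sketch of deriving one input vector per hierarchical node from semi-structured content. It is not drawn from Larlus, Ho, or Frieder: the nested-dictionary representation, the function names, the random initialization, and the element-wise summation of a node embedding with a level (depth) embedding are all assumptions made for illustration, since the passages quoted above do not specify how the two embeddings are combined.

```python
import random


def flatten(content, level=0):
    """Yield (node_name, level) pairs from nested dicts in document order."""
    for key, value in content.items():
        yield key, level
        if isinstance(value, dict):
            yield from flatten(value, level + 1)


def make_table(keys, dim, rng):
    """Create one randomly initialized vector per key (stand-in for learned weights)."""
    return {k: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for k in keys}


def embed_nodes(content, dim=4, seed=0):
    """Return (name, level, vector) per node, vector = node emb + level emb."""
    rng = random.Random(seed)
    nodes = list(flatten(content))
    node_table = make_table(sorted({name for name, _ in nodes}), dim, rng)
    level_table = make_table(sorted({lvl for _, lvl in nodes}), dim, rng)
    return [
        (name, lvl, [a + b for a, b in zip(node_table[name], level_table[lvl])])
        for name, lvl in nodes
    ]


# Hypothetical semi-structured record.
record = {"patient": {"name": "A", "visit": {"diagnosis": "B"}}, "notes": "C"}
rows = embed_nodes(record)
```

Under these assumptions, nodes at the same depth share a level vector, so the resulting sequence carries both the identity of each node and its position in the hierarchy.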

Regarding claim 17, Larlus in view of Ho in view of Frieder teaches the computer program product of claim 16.
Larlus teaches
wherein the training produces a trained model (Larlus [0039] In this pre-training task, (i) a sequence of words are tokenized, (ii) a random subset of the tokens are selected to be either masked, replaced by other tokens, or kept intact, (iii) all of the tokens are given as input to a language model (a bidirectional transformer encoder model), and (iv) the language model is trained to correctly predict the ground-truth labels of the selected tokens (before the tokens are masked or replaced, in case they are changed). FIG. 4 illustrated the architecture to carry out this pre-training task).
and wherein the method further comprises: 
inputting one or more terms into the trained model so that the trained model predicts a node type of the terms (Larlus [0035] More specifically, given an image-caption pair, a word in the caption is masked, and the image conditioned masked language modeling tries to predict the label of the masked word by using the representation of the image).
Larlus teaches a node type; however, Larlus does not teach
wherein the node type is from the semi-structured content.
Ho teaches
wherein the node type is from the semi-structured content (Ho [0020] In the QA system 100, the knowledge manager 104 may be configured to receive inputs from various sources. For example, knowledge manager 104 may receive input from the network 102, one or more knowledge bases or corpora of electronic documents 106 which stores electronic documents 107, semantic data 108, or other possible sources of data input. In selected embodiments, the knowledge database 106 may include structured, semi-structured, and/or unstructured content in a plurality of documents that are contained in one or more large knowledge databases or corpora. The various computing devices (e.g., 110, 120, 130) on the network 102 may include access points for content creators and content users. Some of the computing devices may include devices for a database storing the corpus of data as the body of information used by the knowledge manager 104 to generate answers to questions. The network 102 may include local network connections and remote connections in various embodiments, such that knowledge manager 104 may operate in environments of any size, including local and global, e.g., the Internet. Additionally, knowledge manager 104 serves as a front-end system that can make available a variety of knowledge extracted from or represented in documents, network-accessible sources and/or structured data sources. In this manner, some processes populate the knowledge manager, with the knowledge manager also including input interfaces to receive knowledge requests and respond accordingly).
Ho is considered to be analogous to the claimed invention because it is in the same field of using natural language processing for text analytics (Ho [0100]). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Larlus further in view of Ho to allow for using semi-structured content having hierarchical nodes. Doing so would allow better presentation and visualization of concepts and their inter-relations (e.g., hierarchical presentation) (Ho [0051]).



Regarding claim 19, Larlus in view of Ho in view of Frieder teaches the computer program product of claim 16.
Larlus teaches
wherein the bi-directional transformer model generates and uses bi-directional embeddings from the first text corpus (Larlus [0055] The above process is carried out to train the F-CNN (660 of FIG. 5) by providing the F-CNN (660 of FIG. 5) with reliable side information extracted from textual data. To carry out the training, one can use a pre-trained bidirectional transformer encoder model such as BERT as language model. Other language models could have been used. To benefit from the language prior learned by BERT while training the F-CNN: (i) the parameters of BERT (θ.sub.LM) are frozen, (ii) the pooled visual embedding vector are mapped to the token vocabulary space using a context filter (630 of FIG. 5) and the token embeddings that are parts of the pre-trained BERT model), 
and wherein the bi-directional embeddings are selected from the group consisting of 
token embeddings (Larlus [0055] The above process is carried out to train the F-CNN (660 of FIG. 5) by providing the F-CNN (660 of FIG. 5) with reliable side information extracted from textual data. To carry out the training, one can use a pre-trained bidirectional transformer encoder model such as BERT as language model. Other language models could have been used. To benefit from the language prior learned by BERT while training the F-CNN: (i) the parameters of BERT (θ.sub.LM) are frozen, (ii) the pooled visual embedding vector are mapped to the token vocabulary space using a context filter (630 of FIG. 5) and the token embeddings that are parts of the pre-trained BERT model),
segment embeddings, 
and positional embeddings (Larlus [0088] ρ.sub.θK and ρ.sub.θV blocks are built by using two Conv2D-BatchNorm2D-ReLU layers and a linear Conv2D layer afterwards. Each Conv2D layer has 3×3 kernels and 512 channels, except the last linear Conv2D where it has 768 channels which is the dimension of the token representations in BERT model. Besides, in order for ρ.sub.θK and ρ.sub.θV to understand the spatial configuration of the visual feature vectors, one-hat positional embeddings are concatenated to the visual feature vectors Φ.sub.θCNN(I.sub.i.Math.) before feeding them into ρ.sub.θK and ρ.sub.θV blocks. All trainable parameters in the model are tuned by performing 100k SGD updates with batches of size 256, using ADAM optimizer with learning rates 5×10-.sup.5 and 5×10-.sup.4 for the parameters in Φ.sub.θCNN and [ρ.sub.θK ρ.sub.θV] networks, respectively. Linear learning rate decay is applied during training).
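For context on the three embedding types recited in claim 19, a BERT-style encoder conventionally forms each input vector as the element-wise sum of a token embedding, a segment embedding, and a positional embedding. The sketch below illustrates that convention only; the function names, dimensions, and random initialization are assumptions, and it does not reproduce the architecture of Larlus [0055] or [0088].

```python
import random


def init_table(rows, dim, rng):
    """Randomly initialized lookup table (stand-in for learned parameters)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]


def bert_style_input(token_ids, segment_ids, vocab_size, dim=8, max_len=16, seed=0):
    """Sum token, segment, and positional embeddings for each position."""
    rng = random.Random(seed)
    tok = init_table(vocab_size, dim, rng)  # one row per vocabulary entry
    seg = init_table(2, dim, rng)           # two segments assumed (A and B)
    pos = init_table(max_len, dim, rng)     # one row per position index
    return [
        [t + s + p for t, s, p in zip(tok[tid], seg[sid], pos[i])]
        for i, (tid, sid) in enumerate(zip(token_ids, segment_ids))
    ]


# The same token id (5) appears at positions 0 and 2 in different segments.
inputs = bert_style_input([5, 7, 5], [0, 0, 1], vocab_size=10)
```

Because the segment and positional terms differ, the repeated token receives two different input vectors, which is what lets a bidirectional encoder distinguish occurrences of the same token.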

Claims 3, 12 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Larlus in view of Ho in view of Frieder in view of Higgins (US Patent Pub. No. 2009/0190839).

Regarding claim 3, Larlus in view of Ho in view of Frieder teaches the method of claim 1.
Larlus teaches
wherein the training produces a trained model (Larlus [0039] In this pre-training task, (i) a sequence of words are tokenized, (ii) a random subset of the tokens are selected to be either masked, replaced by other tokens, or kept intact, (iii) all of the tokens are given as input to a language model (a bidirectional transformer encoder model), and (iv) the language model is trained to correctly predict the ground-truth labels of the selected tokens (before the tokens are masked or replaced, in case they are changed). FIG. 4 illustrated the architecture to carry out this pre-training task).
Larlus in view of Ho in view of Frieder does not teach
and wherein the method further comprises: 
inputting a second text corpus and a third text corpus into the trained model, 
wherein the second text corpus and the third text corpus each comprises semi-structured content, respectively; 
and receiving as output from the trained model a similarity score indicating a similarity between the second text corpus and the third text corpus.
Higgins teaches
obtaining a similarity score between two documents (Higgins [0023] Disclosed herein is a computer-implemented method, system, and computer program product for generating vector-based similarity scores in text document comparisons considering the confounding effect of document length. Embodiments of the present invention input two text documents to be compared, and compute the geometric mean of the number of word types of the two text documents. A similarity score is then computed, preferably with a Random Indexing model though a Content-Vector Analysis model or Latent Semantic Analysis model or other vector-based similarity model may also be used. The invention then performs a unique pivoted document length normalization on the similarity score, with normalization terms affected by both text documents, and a normalization slope parameter selected to minimize the correlation between document length and a resulting normalized similarity score. The invention may perform pivoting of a joint normalization term, as well as separate pivoting of the normalization component of each of the two text documents. The normalized similarity score is then output to a user. The geometric mean may be replaced by the arithmetic mean or the harmonic mean).
Higgins is considered to be analogous to the claimed invention because it is in the same field of using natural language processing for text analytics (Higgins [0033]). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Larlus in view of Ho in view of Frieder further in view of Higgins to allow for using document similarity scores. Doing so would allow the pivoted version of the Random Indexing similarity scores to provide a conceptually purer representation of the degree to which two documents contain terms similar in meaning, which is more useful in the text categorization tasks, and presumably in other NLP tasks as well (Higgins [0048]).
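As a rough illustration of the vector-based document comparison quoted from Higgins [0023], the sketch below computes a cosine similarity over simple word-count vectors together with the geometric mean of the two documents' word-type counts. It assumes a plain content-vector model with whitespace tokenization; the Random Indexing model and the pivoted length normalization are not implemented, because the quoted passage does not give their formulas. All function names are illustrative.

```python
import math
from collections import Counter


def content_vector(text):
    """Word-count vector for a document (lowercased, whitespace tokenized)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def geometric_mean_word_types(a, b):
    """Geometric mean of the number of distinct word types in each document."""
    return math.sqrt(len(a) * len(b))


doc_a = content_vector("the patient reports chest pain")
doc_b = content_vector("the patient denies chest pain")
score = cosine_similarity(doc_a, doc_b)
length_term = geometric_mean_word_types(doc_a, doc_b)
```

In Higgins, a length term of this kind feeds the normalization applied to the raw score; here it is only computed, not applied.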

Regarding claim 12, Larlus in view of Ho in view of Frieder teaches the computer system of claim 10.
Larlus teaches
wherein the training produces a trained model (Larlus [0039] In this pre-training task, (i) a sequence of words are tokenized, (ii) a random subset of the tokens are selected to be either masked, replaced by other tokens, or kept intact, (iii) all of the tokens are given as input to a language model (a bidirectional transformer encoder model), and (iv) the language model is trained to correctly predict the ground-truth labels of the selected tokens (before the tokens are masked or replaced, in case they are changed). FIG. 4 illustrated the architecture to carry out this pre-training task).
Larlus in view of Ho in view of Frieder does not teach
and wherein the method further comprises: 
inputting a second text corpus and a third text corpus into the trained model, 
wherein the second text corpus and the third text corpus each comprises semi-structured content, respectively; 
and receiving as output from the trained model a similarity score indicating a similarity between the second text corpus and the third text corpus.
Higgins teaches
obtaining a similarity score between two documents (Higgins [0023] Disclosed herein is a computer-implemented method, system, and computer program product for generating vector-based similarity scores in text document comparisons considering the confounding effect of document length. Embodiments of the present invention input two text documents to be compared, and compute the geometric mean of the number of word types of the two text documents. A similarity score is then computed, preferably with a Random Indexing model though a Content-Vector Analysis model or Latent Semantic Analysis model or other vector-based similarity model may also be used. The invention then performs a unique pivoted document length normalization on the similarity score, with normalization terms affected by both text documents, and a normalization slope parameter selected to minimize the correlation between document length and a resulting normalized similarity score. The invention may perform pivoting of a joint normalization term, as well as separate pivoting of the normalization component of each of the two text documents. The normalized similarity score is then output to a user. The geometric mean may be replaced by the arithmetic mean or the harmonic mean).
Higgins is considered to be analogous to the claimed invention because it is in the same field of using natural language processing for text analytics (Higgins [0033]). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Larlus in view of Ho in view of Frieder further in view of Higgins to allow for using document similarity scores. Doing so would allow the pivoted version of the Random Indexing similarity scores to provide a conceptually purer representation of the degree to which two documents contain terms similar in meaning, which is more useful in the text categorization tasks, and presumably in other NLP tasks as well (Higgins [0048]).

Regarding claim 18, Larlus in view of Ho in view of Frieder teaches the computer program product of claim 16.
Larlus teaches
wherein the training produces a trained model (Larlus [0039] In this pre-training task, (i) a sequence of words are tokenized, (ii) a random subset of the tokens are selected to be either masked, replaced by other tokens, or kept intact, (iii) all of the tokens are given as input to a language model (a bidirectional transformer encoder model), and (iv) the language model is trained to correctly predict the ground-truth labels of the selected tokens (before the tokens are masked or replaced, in case they are changed). FIG. 4 illustrated the architecture to carry out this pre-training task).
Larlus in view of Ho in view of Frieder does not teach
and wherein the method further comprises: 
inputting a second text corpus and a third text corpus into the trained model, 
wherein the second text corpus and the third text corpus each comprises semi-structured content, respectively; 
and receiving as output from the trained model a similarity score indicating a similarity between the second text corpus and the third text corpus.
Higgins teaches
obtaining a similarity score between two documents (Higgins [0023] Disclosed herein is a computer-implemented method, system, and computer program product for generating vector-based similarity scores in text document comparisons considering the confounding effect of document length. Embodiments of the present invention input two text documents to be compared, and compute the geometric mean of the number of word types of the two text documents. A similarity score is then computed, preferably with a Random Indexing model though a Content-Vector Analysis model or Latent Semantic Analysis model or other vector-based similarity model may also be used. The invention then performs a unique pivoted document length normalization on the similarity score, with normalization terms affected by both text documents, and a normalization slope parameter selected to minimize the correlation between document length and a resulting normalized similarity score. The invention may perform pivoting of a joint normalization term, as well as separate pivoting of the normalization component of each of the two text documents. The normalized similarity score is then output to a user. The geometric mean may be replaced by the arithmetic mean or the harmonic mean).
Higgins is considered to be analogous to the claimed invention because it is in the same field of using natural language processing for text analytics (Higgins [0033]). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Larlus in view of Ho in view of Frieder further in view of Higgins to allow for using document similarity scores. Doing so would allow the pivoted version of the Random Indexing similarity scores to provide a conceptually purer representation of the degree to which two documents contain terms similar in meaning, which is more useful in the text categorization tasks, and presumably in other NLP tasks as well (Higgins [0048]).

Claims 5, 14 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Larlus in view of Ho in view of Frieder in view of Deng et al. (Deng, Xingchen, Lei Zhang, Yixing Fan, Long Bai, Jiafeng Guo, and Pengfei Wang. "Bidirectional Dependency-Guided Attention for Relation Extraction." In Asian Conference on Machine Learning, pp. 129-144. PMLR, 2020.), hereinafter Deng.

Regarding claim 5, Larlus in view of Ho in view of Frieder teaches the method of claim 1.
Larlus teaches masking; however, Larlus does not teach
wherein the masking comprises masking a first node and a header node associated with the first node.
Deng teaches
masking a first node (Deng [p 136 ln 6-7] For each node on the dependency tree, we want it to interact with all its descendant nodes. We implement it by using a bottom-up mask matrix Mask^{bottom}; [p 136 ln 17-19] The output of top-down attention h^{top}_l can be obtained in the same way with top-down mask matrix and different transformation matrices and bias terms for calculating Q, K, V and h^{top}_l. And the h_l ∈ R^{n×d_l} is the sum of h^{bottom}_l and h^{top}_l)
masking a header node associated with the first node (Deng [p 138 ln 16-18] Follow previous work, a “entity mask” strategy is used to replace subject (or object) entity with “<NER Type>-SUBJ” (or “<NER Type>-OBJ”) and report micro-averaged F1 score on this dataset).
Deng is considered to be analogous to the claimed invention because it is in the same field of using natural language processing for text analytics (Deng [p 132 ln 33-43]). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Larlus in view of Ho in view of Frieder further in view of Deng to allow for masking hierarchical nodes. Doing so would achieve a state-of-the-art result on the TACRED dataset and a significant result on the SemEval2010-Task8 dataset, showing superiority to previous dependency-based models (Deng [p 141 ln 42-43]).
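The bottom-up mask matrix quoted from Deng [p 136 ln 6-7] can be pictured with the following sketch, which marks, for each node of a tree, the node itself and all of its descendants. The parent-index input format, the inclusion of the node itself, and the treatment of the top-down mask as the transpose (each node seeing its ancestors) are assumptions made for illustration; Deng's attention computation is not reproduced.

```python
def bottom_up_mask(heads):
    """Build an n x n 0/1 matrix where mask[i][j] == 1 if j is i or a descendant of i.

    heads[j] is the index of node j's parent, or -1 for the root.
    The input is assumed to describe a valid tree (no cycles).
    """
    n = len(heads)
    mask = [[0] * n for _ in range(n)]
    for j in range(n):
        mask[j][j] = 1               # assumed: a node may attend to itself
        ancestor = heads[j]
        while ancestor != -1:        # walk up to the root, marking each ancestor
            mask[ancestor][j] = 1
            ancestor = heads[ancestor]
    return mask


def top_down_mask(heads):
    """Assumed counterpart: each node sees itself and its ancestors."""
    return [list(row) for row in zip(*bottom_up_mask(heads))]


# Node 0 is the root, nodes 1 and 2 are its children, node 3 is a child of node 1.
heads = [-1, 0, 0, 1]
bottom = bottom_up_mask(heads)
top = top_down_mask(heads)
```

For this tree, the root's row in the bottom-up mask covers every node, while a leaf's row covers only the leaf itself.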


Regarding claim 14, Larlus in view of Ho in view of Frieder teaches the computer system of claim 10.
Larlus teaches masking; however, Larlus does not teach
wherein the masking comprises masking a first node and a header node associated with the first node.
Deng teaches
masking a first node (Deng [p 136 ln 6-7] For each node on the dependency tree, we want it to interact with all its descendant nodes. We implement it by using a bottom-up mask matrix Mask^{bottom}; [p 136 ln 17-19] The output of top-down attention h^{top}_l can be obtained in the same way with top-down mask matrix and different transformation matrices and bias terms for calculating Q, K, V and h^{top}_l. And the h_l ∈ R^{n×d_l} is the sum of h^{bottom}_l and h^{top}_l)
masking a header node associated with the first node (Deng [p 138 ln 16-18] Follow previous work, a “entity mask” strategy is used to replace subject (or object) entity with “<NER Type>-SUBJ” (or “<NER Type>-OBJ”) and report micro-averaged F1 score on this dataset).
Deng is considered to be analogous to the claimed invention because it is in the same field of using natural language processing for text analytics (Deng [p 132 ln 33-43]). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Larlus in view of Ho in view of Frieder further in view of Deng to allow for masking hierarchical nodes. Doing so would achieve a state-of-the-art result on the TACRED dataset and a significant result on the SemEval2010-Task8 dataset, showing superiority to previous dependency-based models (Deng [p 141 ln 42-43]).

Regarding claim 20, Larlus in view of Ho in view of Frieder teaches the computer program product of claim 16.
Larlus teaches masking; however, Larlus does not teach
wherein the masking comprises masking a first node and a header node associated with the first node.
Deng teaches
masking a first node (Deng [p 136 ln 6-7] For each node on the dependency tree, we want it to interact with all its descendant nodes. We implement it by using a bottom-up mask matrix Maskbottom; [p 136 ln 17-19] The output of top-down attention htop l can be obtained in the same way with top-down mask matrix and different transformation matrices and bias terms for calculating Q,K,V and htop l . And the h l ∈ R n×dl is the sum of hbottom l and htop l)
masking a header node associated with the first node (Deng [p 138 ln 16-18] Follow previous work, a “entity mask” strategy is used to replace subject (or object) entity with “<NER Type>-SUBJ” (or “<NER Type>-OBJ”) and report micro-averaged F1 score on this dataset).
Deng is considered to be analogous to the claimed invention because it is in the same field of using natural language processing for text analytics (Deng [p 132 ln 33-43]). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Larlus in view of Ho in view of Frieder further in view of Deng to allow for masking hierarchical nodes. Doing so would achieve a state-of-the-art result on the TACRED dataset and a significant result on the SemEval-2010 Task 8 dataset, showing superiority over previous dependency-based models (Deng [p 141 ln 42-43]).

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Larlus in view of Ho in view of Frieder in view of Zhu et al. (Zhu, Henghui, Feng Nan, Zhiguo Wang, Ramesh Nallapati, and Bing Xiang. "Who Did They Respond To? Conversation Structure Modeling Using Masked Hierarchical Transformer." In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, pp. 9741-9748. 2020.), hereinafter Zhu.

Regarding claim 9, Larlus in view of Ho in view of Frieder teaches the method of claim 1.
Larlus teaches
further comprising masking text (Larlus [0039] In this pre-training task, (i) a sequence of words are tokenized, (ii) a random subset of the tokens are selected to be either masked, replaced by other tokens, or kept intact, (iii) all of the tokens are given as input to a language model (a bidirectional transformer encoder model), and (iv) the language model is trained to correctly predict the ground-truth labels of the selected tokens (before the tokens are masked or replaced, in case they are changed). FIG. 4 illustrated the architecture to carry out this pre-training task)
and wherein the bi-directional transformer model is trained (Larlus [0039] In this pre-training task, (i) a sequence of words are tokenized, (ii) a random subset of the tokens are selected to be either masked, replaced by other tokens, or kept intact, (iii) all of the tokens are given as input to a language model (a bidirectional transformer encoder model), and (iv) the language model is trained to correctly predict the ground-truth labels of the selected tokens (before the tokens are masked or replaced, in case they are changed). FIG. 4 illustrated the architecture to carry out this pre-training task)
on the first text corpus (Larlus [0038] Masked language modeling is a self-supervised proxy task to pre-train a language model over large-scale text corpora. This type of pre-training scheme enables the language model to learn efficient language priors so that simply fine-tuning the language model achieves significant improvements over the state-of-the-art on a wide range of natural language processing target tasks)
by reducing loss from the bi-directional transformer model (Larlus [0094] To evaluate the representations learned by both models, the convolutional layers of an AlexNet are taken and a generalized-mean pooling, L2 normalization, and fully-connected layers are appended. The parameters of the fully-connected layer are trained for 300 epochs by minimizing the AP Loss over the clean version of the Landmarks dataset. The complete model is tested on the revisited Oxford Buildings and Paris datasets by computing mean-average-precision scores. The image representations that are produced by solving the image conditioned masked language modeling task outperforms the counterparts obtained by the RotNet model on this task)
predicting the masked text (Larlus [0035] More specifically, given an image-caption pair, a word in the caption is masked, and the image conditioned masked language modeling tries to predict the label of the masked word by using the representation of the image).
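For illustration only, the token-selection step recited in Larlus [0039] (select a random subset of tokens; mask, replace, or keep each selected token; retain the ground-truth label for prediction) can be sketched as follows. This is a sketch under stated assumptions and not Larlus's implementation: the function name corrupt_tokens, the "[MASK]" placeholder string, the 15% selection rate, and the 80/10/10 split among masking, replacing, and keeping are assumptions introduced here and are not stated in the cited paragraph.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token


def corrupt_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (inputs, labels) for a masked language modeling step.

    labels[i] holds the original token at each selected position and
    None elsewhere, so a loss would be computed only on selected tokens.
    """
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < select_prob:
            labels.append(tok)  # ground-truth label, recorded before any change
            r = rng.random()
            if r < 0.8:      # assumption: mask most selected tokens
                inputs.append(MASK)
            elif r < 0.9:    # assumption: replace some with another token
                inputs.append(rng.choice(vocab))
            else:            # assumption: keep the rest intact
                inputs.append(tok)
        else:
            labels.append(None)
            inputs.append(tok)
    return inputs, labels


tokens = "the cat sat on the mat".split()
inputs, labels = corrupt_tokens(tokens, vocab=tokens, rng=random.Random(1))
print(inputs)
print(labels)
```

A model trained on such pairs would receive inputs and be scored only at positions where labels is not None, which corresponds to step (iv) of the quoted paragraph.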
Larlus does not teach
further comprising masking text within the hierarchical nodes; 
wherein the node embeddings and the level embeddings are also generated from the masked text.
Ho teaches
hierarchical nodes (Ho [0100] By now, it will be appreciated that there is disclosed herein a system, method, apparatus, and computer program product for generating concept hierarchies with an information handling system having a processor and a memory. As disclosed, the system, method, apparatus, and computer program product generate at least a first concept set comprising one or more concepts extracted from one or more content sources. At the system, a user request is received to produce a hierarchy of concepts from the first concept set using one or more specified hierarchy parameters, which may be default parameters or parameters specified in the user request. A vector representation of each of the concepts in the first concept set is generated, retrieved, constructed, or otherwise obtained. The vectors are processed by performing a natural language processing (NLP) analysis comparison of the vector representation of each of the concepts in the first concept set to determine a similarity measure for each pair of distinct concepts Ci and Cj in the first concept set. The similarity measure may be defined on a selected subset of dimensions of the concept vectors with uniform or non-uniform weights, where the selected dimensions and their weights can be modified in each iterative step of hierarchy construction. In selected embodiments, the NLP analysis includes analyzing a vector similarity function sim(Vi, Vj) between vectors Vi, Vj representing each pair of distinct concepts Ci and Cj in the first concept set. In selected embodiments, analysis of the vector similarity function sim(Vi, Vj) includes computing, for each concept Ci for i=1 . . . N, the similarity measure corresponding to said concept Ci as a cosine distance measure between each vector pair Vi, Vj for j=1 . . . N, i≠j, and then selecting a distinct, unconnected concept Cj having a maximum cosine distance measure with the concept Ci. 
A concept hierarchy is constructed based on the one or more specified hierarchy parameters and the similarity measure for each pair of distinct concepts Ci and Cj in the first concept set. In selected embodiments, the concept hierarchy is constructed using a bottom-up method to iteratively build a concept graph by selecting distinct, unconnected concepts Ci, Cj from the first concept set based on a maximal similarity measure and identifying a first concept as a hierarchy root which has a maximal number of occurrences in the first concept set. In other embodiments, the concept hierarchy is constructed using a top-down/frequency method to sort the one or more concepts in the first concept set into a sorted concept sequence based on frequency of occurrence, select a root node C1 that has maximum frequency of occurrence, and sequentially add each concept from the sorted concept sequence to the root node C1 in the concept hierarchy based on a maximal similarity measure between a selected concept from the sorted concept sequence and the root node C1 in the concept hierarchy, or to another existing node Ci in the concept hierarchy based on a maximal similarity measure between a selected concept in the sorted concept sequence and that other existing node Ci in the concept hierarchy. In other embodiments, the concept hierarchy is constructed by generating a first sequence over a set of abstract concepts C1, . . . , Ck by simulating a random walk on a first hierarchical structure defined by a first branching factor and specified depth; generating a second sequence over a set of regular concepts D1, . . . , Dk, where the sequence extracted from a corpus; generating or retrieving a vector representation for each of the concepts in the first sequence of abstract concepts and the second sequence of regular concepts; and identifying one or more pairs of regular concepts to approximate corresponding pairs of abstract concepts based on analogies of relationships between the abstract concepts and the regular concepts. In addition, the system may display the concept hierarchy to visually present inter-relations between concepts from the first concept set, such as by visually presenting a hierarchical structure conveying concept grouping of concepts from the first concept set to enable user navigation over the first concept set. In other embodiments, the system may iteratively select a concept from the first concept set; identify an associated neighborhood for each selected concept in the first concept set using iterative clustering and probability flow-based traversals to identify, for each concept in the first concept set, an associated neighborhood and corresponding strength measure; and create a hierarchy of associated neighborhoods, each of which comprises a representative concept to enable a human user to easily identify the neighborhood)
Ho is considered to be analogous to the claimed invention because it is in the same field of using natural language processing for text analytics (Ho [0100]). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Larlus further in view of Ho to allow for using semi-structured content having hierarchical nodes. Doing so would allow better presentation and visualization of concepts and their inter-relations (e.g., hierarchical presentation) (Ho [0051]).
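For illustration only, the pairwise similarity step recited in Ho [0100] (compute a cosine measure between each pair of concept vectors Vi, Vj and select, for each concept Ci, the distinct concept Cj with the maximum measure) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the dictionary of concept vectors, and the example values are hypothetical, and the sketch omits Ho's "unconnected" condition and the remaining hierarchy-construction steps.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def most_similar(vectors):
    """Map each concept Ci to the distinct concept Cj with maximal sim(Vi, Vj).

    Requires at least two concepts; `vectors` maps a concept name to its vector.
    """
    best = {}
    for ci, vi in vectors.items():
        candidates = (cj for cj in vectors if cj != ci)  # enforce i != j
        best[ci] = max(candidates, key=lambda cj: cosine_similarity(vi, vectors[cj]))
    return best


# Hypothetical two-dimensional concept vectors.
vectors = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "car": [0.0, 1.0]}
print(most_similar(vectors))  # {'cat': 'dog', 'dog': 'cat', 'car': 'dog'}
```

A bottom-up construction of the kind Ho describes would repeat a selection like this while tracking which concepts are already connected.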
Larlus in view of Ho does not teach
further comprising masking text within the hierarchical nodes; 
wherein the node embeddings and the level embeddings are also generated from the masked text.
Frieder teaches
node embeddings (Frieder [0082] In embodiments, a Cross-Global Attention Graph Kernel Network 1100 may learn an end-to-end deep graph kernel on a batch of graphs, as shown in FIG. 11. At 1102, a plurality of patient graphs may be received/input. The node level embedding 1104 and node clusters 1106 may be determined first by, at 1108 determining shared weight Graph Convolutional Networks (GCNs) to form node level embedding 1104, and at 1110, determining learning node clusters with reconstruction loss to form node clusters 1106. The graph level embedding may be derived 1112 from node matching 1114 based attention pooling 1115. The loss 1120 may be calculated 1116 by the resulting distance and kernel matrix 1118 and backpropagation 1122 may be performed to update all model parameters)
and level embeddings (Frieder [0082] In embodiments, a Cross-Global Attention Graph Kernel Network 1100 may learn an end-to-end deep graph kernel on a batch of graphs, as shown in FIG. 11. At 1102, a plurality of patient graphs may be received/input. The node level embedding 1104 and node clusters 1106 may be determined first by, at 1108 determining shared weight Graph Convolutional Networks (GCNs) to form node level embedding 1104, and at 1110, determining learning node clusters with reconstruction loss to form node clusters 1106. The graph level embedding may be derived 1112 from node matching 1114 based attention pooling 1115. The loss 1120 may be calculated 1116 by the resulting distance and kernel matrix 1118 and backpropagation 1122 may be performed to update all model parameters).
Frieder is considered to be analogous to the claimed invention because it is in the same field of using a deep neural network which uses embeddings (Frieder [0082]). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Larlus in view of Ho further in view of Frieder to allow for using node embedding and level embeddings. Doing so would allow for an opportunity to model the EHRs in a compact structure with high interpretability (Frieder [0010]).
Larlus in view of Ho in view of Frieder does not teach
further comprising masking text within the hierarchical nodes; 
wherein the node embeddings and the level embeddings are also generated from the masked text.
Zhu teaches
masking text within the hierarchical nodes (Zhu [0123] Figure 2: Diagram of the masked hierarchical transformer for conversation structure modeling. The colored blocks on the right indicates one element in the mask matrix, which means the corresponding utterance is attendable. The white block, on the other hand, indicates a zero element)
node embeddings generated from the masked text (Zhu [p 2 col 2 ln 32-34] In this paper, we use Glove embedding (Pennington, Socher, and Manning 2014) in the decomposable attention model and ESIM).
level embeddings generated from the masked text (Zhu [p 5 col 2 ln 5-8] Adding ELMo embedding improve the model accuracy. We run our masked hierarchical transformer model with 5 initial random seeds and report the average and the standard deviation of the score).
Because Zhu and Larlus are analogous art from the same field of endeavor, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to use the known technique of masking text combined with creating various embeddings in order to improve recovery of the parent utterance by taking into account the history and structure of the conversation. One of ordinary skill in the art would have recognized that the results of the combination were predictable since the use of that known technique provides the rationale to arrive at a conclusion of obviousness. See KSR International Co. v. Teleflex Inc., 82 USPQ2d 1385 (U.S. 2007).

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to PAUL J. MUELLER whose telephone number is (571)272-1875. The examiner can normally be reached M-F 8:30am-5:30pm (Eastern).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel C. Washburn, can be reached at 571-272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/PAUL J. MUELLER/Examiner, Art Unit 2657
/Paras D Shah/Primary Examiner, Art Unit 2659

12/12/2022