
Introduction to knowledge graphs (section 5.2): Inductive knowledge – Knowledge graph embeddings

This article is section 5.2 of part 5 of the Introduction to knowledge graphs series of articles. Recent research has identified the development of knowledge graphs as an important aspect of artificial intelligence (AI) in knowledge management (KM).

Machine learning can be used for directly refining a knowledge graph; or for downstream tasks using the knowledge graph, such as recommendation, information extraction, question answering, query relaxation, query approximation, and so on. However, machine learning techniques typically assume numeric representations (e.g., vectors), distinct from how graphs are usually expressed. In this section of their comprehensive tutorial article1, Hogan and colleagues explore how graphs can be encoded numerically for machine learning.

A first attempt to represent a graph using vectors would be to use a one-hot encoding, generating a vector of length |L| · |V| for each node – with |V| the number of nodes in the input graph and |L| the number of edge labels – placing a one at the corresponding index to indicate the existence of the respective edge in the graph, or zero otherwise. Such a representation will, however, typically result in large and sparse vectors, which will be detrimental for most machine learning models.
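
To make the sparsity issue concrete, here is a minimal sketch (Python/NumPy, with made-up node and edge-label names) of a one-hot encoding for a toy graph; even in this tiny example the vectors are mostly zeros.

```python
import numpy as np

# Toy graph: (subject, edge label, object) triples with assumed names.
edges = [("A", "bus", "B"), ("B", "train", "C"), ("A", "train", "C")]

nodes = sorted({s for s, _, _ in edges} | {o for _, _, o in edges})
labels = sorted({p for _, p, _ in edges})

# One-hot encoding: one vector of length |L|*|V| per node, with a 1 at the
# position of each (label, object) pair the node has an outgoing edge to.
def one_hot(node):
    vec = np.zeros(len(labels) * len(nodes))
    for s, p, o in edges:
        if s == node:
            vec[labels.index(p) * len(nodes) + nodes.index(o)] = 1.0
    return vec

print(one_hot("A"))  # mostly zeros: large and sparse, as noted above
```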

The main goal of knowledge graph embedding techniques is to create a dense representation of the graph (i.e., embed the graph) in a continuous, low-dimensional vector space that can then be used for machine learning tasks. The dimensionality d of the embedding is fixed and typically low (often, e.g., 50 ≤ d ≤ 1000). Typically the graph embedding is composed of an entity embedding for each node: a vector with d dimensions that we denote by e; and a relation embedding for each edge label: (typically) a vector with O(d) dimensions that we denote by r. The overall goal of these vectors is to abstract and preserve latent structures in the graph. There are many ways in which this notion of an embedding can be instantiated. Most commonly, given an edge with subject node s, edge label p, and object node o, a specific embedding approach defines a scoring function that accepts es (the entity embedding of node s), rp (the relation embedding of edge label p) and eo (the entity embedding of node o) and computes the plausibility of the edge: how likely it is to be true. Given a data graph, the goal is then to compute the embeddings of dimension d that maximise the plausibility of positive edges (typically edges in the graph) and minimise the plausibility of negative examples (typically edges in the graph with a node or edge label changed such that they are no longer in the graph) according to the given scoring function. The resulting embeddings can then be seen as models learned through self-supervision that encode (latent) features of the graph, mapping input edges to output plausibility scores.
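
As a rough illustration of this setup, the sketch below (Python/NumPy, with hypothetical node and edge-label names and an arbitrary stand-in scoring function) shows the main ingredients: entity and relation embeddings of dimension d, a corrupted negative example, and a margin-based loss that rewards scoring positive edges above negatives. It is not any particular model from the survey, just the general scaffolding.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50  # the fixed embedding dimensionality

# Hypothetical vocabularies; in practice these come from the data graph.
nodes = ["A", "B", "C"]
labels = ["bus", "train"]
E = {n: rng.normal(size=d) for n in nodes}    # entity embeddings e
R = {p: rng.normal(size=d) for p in labels}   # relation embeddings r

# A stand-in scoring function; concrete models (TransE, DistMult, ...) below
# each define their own version of this.
def score(s, p, o):
    return float(E[s] @ (R[p] * E[o]))

def corrupt(s, p, o):
    """Negative example: replace the object with a random other node."""
    return s, p, rng.choice([n for n in nodes if n != o])

# Margin-based objective for one positive edge and its corruption: the
# positive edge should score higher than the negative by at least `margin`.
pos = ("A", "bus", "B")
neg = corrupt(*pos)
margin = 1.0
loss = max(0.0, margin - score(*pos) + score(*neg))
print(loss)
```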

Embeddings can then be used for a number of low-level tasks. The plausibility scoring function can be used to assign confidence to edges (possibly extracted from an external source) or to complete edges with missing nodes/edge labels (a.k.a. link prediction). Additionally, embeddings will typically assign similar vectors to similar terms and can thus be used for similarity measures.
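
The following sketch (again Python/NumPy, with made-up names, pretend "trained" vectors, and a TransE-style stand-in score) illustrates these two low-level uses: ranking candidate objects for an incomplete edge (link prediction), and comparing entity embeddings by cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 50
nodes = ["A", "B", "C", "D"]                     # hypothetical node names
E = {n: rng.normal(size=d) for n in nodes}       # pretend these were trained
r_bus = rng.normal(size=d)                       # embedding for label "bus"

# Link prediction for the incomplete edge (A, bus, ?):
# rank every candidate object by the plausibility score.
def score(es, rp, eo):
    return -np.linalg.norm(es + rp - eo)          # TransE-style, for illustration

ranking = sorted(nodes, key=lambda o: score(E["A"], r_bus, E[o]), reverse=True)
print("candidates for (A, bus, ?):", ranking)

# Similarity: similar terms tend to receive similar vectors, so cosine
# similarity between entity embeddings can serve as a similarity measure.
def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print("sim(A, B):", cosine(E["A"], E["B"]))
```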

A wide range of knowledge graph embedding techniques have been proposed. The most prominent are:

  • translational models where relations are seen as translating subject entities to object entities
  • tensor decomposition models that extract latent factors approximating the graph’s structure
  • neural models based on neural networks
  • language models based on word embedding techniques.

Translational Models

Translational models interpret edge labels as transformations from subject nodes (a.k.a. the source or head) to object nodes (a.k.a. the target or tail); for example, in an edge labelled bus, the edge label bus is seen as transforming the subject node to the object node, and likewise for other bus edges. A seminal approach is TransE2. Over all positive edges with subject s, edge label p, and object o, TransE learns vectors es, rp, and eo, aiming to make es + rp as close as possible to eo. Conversely, if the edge is a negative example, then TransE attempts to learn a representation that keeps es + rp away from eo.
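
Below is a minimal NumPy sketch of the TransE idea: a distance-based score expressing es + rp ≈ eo, and a single margin-based gradient step on one positive edge and one corrupted edge. The node and label names are invented, and real implementations add details (normalisation, batching, L1 vs. L2 distance) omitted here.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 50

# Toy vocabularies with assumed names; embeddings start random.
E = {n: rng.normal(scale=0.1, size=d) for n in ["A", "B", "C"]}
R = {"bus": rng.normal(scale=0.1, size=d)}

def transe_score(s, p, o):
    # TransE: an edge (s, p, o) is plausible when e_s + r_p is close to e_o.
    return -np.linalg.norm(E[s] + R[p] - E[o])

# One stochastic gradient step on a positive edge (A, bus, B) and a
# corrupted negative (A, bus, C), using a margin-based loss.
margin, lr = 1.0, 0.01
pos, neg = ("A", "bus", "B"), ("A", "bus", "C")
if margin - transe_score(*pos) + transe_score(*neg) > 0:
    # Gradient of ||e_A + r_bus - e_B|| pulls e_A + r_bus towards e_B ...
    g_pos = E["A"] + R["bus"] - E["B"]
    g_pos /= np.linalg.norm(g_pos) + 1e-9
    # ... and pushes it away from the corrupted object e_C.
    g_neg = E["A"] + R["bus"] - E["C"]
    g_neg /= np.linalg.norm(g_neg) + 1e-9
    E["A"] -= lr * (g_pos - g_neg)
    R["bus"] -= lr * (g_pos - g_neg)
    E["B"] += lr * g_pos
    E["C"] -= lr * g_neg
```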

TransE can, however, be too simplistic; for example, it will try to give similar vectors to all of the target nodes reachable from the same subject via the same relation, which may not be feasible given other edges. To resolve such issues, many variants of TransE have been investigated, typically using a distinct hyperplane (e.g., TransH) or vector space (e.g., TransR, TransD) for each type of relation. More recently, RotatE3 proposes translational embeddings in complex space, which can capture more characteristics of relations, such as direction, symmetry, inversion, antisymmetry, and composition. Embeddings have also been proposed in non-Euclidean space; for example, MuRP4 uses relation embeddings that transform entity embeddings in the hyperbolic space of the Poincaré ball model, whose curvature provides more “space” to separate entities relative to the dimensionality.
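
The rotation idea behind RotatE can be sketched as follows (NumPy, with random vectors purely for illustration): each relation is an element-wise complex number of modulus 1, so applying it rotates the subject embedding, and plausible edges are those where the rotated subject lands near the object.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 50

# RotatE-style scoring sketch: entities live in complex space and each
# relation is an element-wise rotation (a complex number of modulus 1).
e_s = rng.normal(size=d) + 1j * rng.normal(size=d)
e_o = rng.normal(size=d) + 1j * rng.normal(size=d)
phase = rng.uniform(0, 2 * np.pi, size=d)
r_p = np.exp(1j * phase)          # |r_p[i]| = 1, so it only rotates

def rotate_score(e_s, r_p, e_o):
    # Plausible edges have e_s, rotated by r_p, landing close to e_o.
    return -np.linalg.norm(e_s * r_p - e_o)

print(rotate_score(e_s, r_p, e_o))
# Because rotations compose and invert cleanly, relation patterns such as
# symmetry, inversion, and composition can be represented (as noted above).
```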

Tensor Decomposition Models

A second approach to derive graph embeddings is to apply methods based on tensor decomposition. A tensor is a multidimensional numeric field that generalises scalars (0-order tensors), vectors (1-order tensors), and matrices (2-order tensors) towards arbitrary dimension/order. Tensor decomposition involves decomposing a tensor into more “elemental” tensors (e.g., of lower order) from which the original tensor can be recomposed (or approximated) by a fixed sequence of basic operations. These elemental tensors can be seen as capturing latent factors in the original tensor. There are many approaches to tensor decomposition, including rank decompositions.

Leaving aside graphs, consider an (a × b)-matrix (i.e., a 2-order tensor) C, where each element (C)ij denotes the average temperature of the ith city of Chile in the jth month of the year. Since Chile is a long, thin country – ranging from subpolar to desert climates – we may decompose C into two vectors representing latent factors – x (with a elements), giving lower values for cities with lower latitude, and y (with b elements), giving lower values for months with lower temperatures – such that computing the outer product of the two vectors approximates C reasonably well: x ⊗ y ≈ C. If there exist x and y such that x ⊗ y = C, then we call C a rank-1 matrix. Otherwise, the rank r of C is the minimum number of rank-1 matrices we need to sum to get precisely C, i.e., x1 ⊗ y1 + … + xr ⊗ yr = C. In the temperature example, x2 ⊗ y2 might correspond to a correction for altitude, x3 ⊗ y3 for higher temperature variance further south, and so on. A (low) rank decomposition of a matrix then sets a limit d on the rank and computes the vectors (x1, y1, …, xd, yd) such that x1 ⊗ y1 + … + xd ⊗ yd gives the best d-rank approximation of C. Noting that to generate n-order tensors we need to compute the outer product of n vectors, we can generalise this idea towards low-rank decomposition of tensors; this method is called canonical polyadic decomposition5.
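
For matrices, the best d-rank approximation can be computed via a truncated singular value decomposition, which makes the “sum of outer products” picture easy to see in code. The sketch below uses a random stand-in for the temperature matrix C; canonical polyadic decomposition generalises the same outer-product idea to higher-order tensors.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy stand-in for the temperature matrix C (cities x months); values are
# random here purely to illustrate the mechanics of a low-rank decomposition.
a, b = 8, 12
C = rng.normal(size=(a, b))

# Best d-rank approximation of a matrix via truncated SVD, written as a sum
# of d outer products: x1 (x) y1 + ... + xd (x) yd.
d = 2
U, S, Vt = np.linalg.svd(C, full_matrices=False)
approx = sum(S[i] * np.outer(U[:, i], Vt[i, :]) for i in range(d))

print("approximation error:", np.linalg.norm(C - approx))
```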

DistMult6 is a seminal method for computing knowledge graph embeddings based on rank decompositions, but it does not capture edge direction. Rather than use a vector as a relation embedding, RESCAL7 uses a matrix, which can capture edge direction; however, RESCAL incurs a higher cost in terms of space and time than DistMult. More recently, ComplEx and HolE both use vectors for relation and entity embeddings: ComplEx uses complex vectors, while HolE uses a circular correlation operator (on reals) to capture edge direction.
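
The multiplicative scoring functions of DistMult and ComplEx are simple enough to show directly. The sketch below (NumPy, random vectors with made-up dimensions) highlights why DistMult is symmetric in subject and object, and how the complex conjugate in ComplEx breaks that symmetry.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 50

# DistMult: a rank-decomposition-style score, sum_i e_s[i] * r_p[i] * e_o[i].
# The score is symmetric in e_s and e_o, so edge direction is not captured.
e_s, r_p, e_o = (rng.normal(size=d) for _ in range(3))
distmult = float(np.sum(e_s * r_p * e_o))

# ComplEx: the same multiplicative form but with complex vectors, taking the
# real part of <e_s, r_p, conj(e_o)>; the conjugate breaks the symmetry and
# lets the model distinguish (s, p, o) from (o, p, s).
ce_s, cr_p, ce_o = (rng.normal(size=d) + 1j * rng.normal(size=d) for _ in range(3))
complex_score = float(np.real(np.sum(ce_s * cr_p * np.conj(ce_o))))

print(distmult, complex_score)
```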

Neural Models

A number of approaches instead use neural networks to learn knowledge graph embeddings with non-linear scoring functions for plausibility.

An early neural model was semantic matching energy8, which learns parameters (a.k.a. weights: w, w′) for two functions – fw(es, rp) and gw′(eo, rp) – such that the dot product of the results of both functions gives the plausibility score. Both linear and bilinear variants of fw and gw′ are proposed. Another early proposal was neural tensor networks9, which maintains a tensor W of weights and computes plausibility scores by combining the outer product es ⊗ W ⊗ eo with rp and a standard neural layer over es and eo. The tensor W yields a high number of parameters, limiting scalability. Multi-layer perceptron10 is a simpler model, where es, rp, and eo are concatenated and fed into a hidden layer to compute the plausibility score.
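
The multi-layer perceptron scorer is the easiest of these to sketch: concatenate the three embeddings and pass them through a hidden layer. The layer sizes and weights below are arbitrary placeholders, not those of any published model.

```python
import numpy as np

rng = np.random.default_rng(6)
d = 50

# Multi-layer perceptron scoring sketch: concatenate e_s, r_p, e_o and pass
# the result through one non-linear hidden layer to get a plausibility score.
e_s, r_p, e_o = (rng.normal(size=d) for _ in range(3))
W1 = rng.normal(scale=0.1, size=(64, 3 * d))   # hidden-layer weights (assumed size)
b1 = np.zeros(64)
w2 = rng.normal(scale=0.1, size=64)            # output weights

x = np.concatenate([e_s, r_p, e_o])
hidden = np.tanh(W1 @ x + b1)                  # non-linear hidden layer
plausibility = float(w2 @ hidden)              # scalar plausibility score
print(plausibility)
```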

More recent models use convolutional kernels. ConvE11 generates a matrix from es and rp by “wrapping” each vector over several rows and concatenating both matrices, over which (2D) convolutional layers generate the embeddings. A disadvantage is that wrapping vectors imposes an arbitrary two-dimensional structure on the embeddings. HypER12 also uses convolutions, but avoids such wrapping by applying a fully connected layer (called the “hypernetwork”) to rp to generate relation-specific convolutional filters through which the embeddings are generated.
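
The “wrapping” step that ConvE performs can be illustrated as follows: the subject and relation vectors are reshaped into matrices (the 5 × 10 shape here is an arbitrary assumption), stacked, and convolved with a small 2D filter. This sketch covers only the input construction and a single filter, not the full ConvE architecture.

```python
import numpy as np

rng = np.random.default_rng(7)
d = 50  # here wrapped into 5 x 10 matrices (an assumed, arbitrary 2D shape)

# ConvE-style input construction: "wrap" e_s and r_p into 2D matrices and
# stack them, so that 2D convolutional filters can be applied.
e_s = rng.normal(size=d).reshape(5, 10)
r_p = rng.normal(size=d).reshape(5, 10)
stacked = np.vstack([e_s, r_p])               # 10 x 10 input "image"

kernel = rng.normal(size=(3, 3))              # one small convolutional filter

def conv2d_valid(img, k):
    kh, kw = k.shape
    out = np.empty((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

feature_map = np.maximum(conv2d_valid(stacked, kernel), 0)  # ReLU feature map
print(feature_map.shape)  # the 2D structure imposed by wrapping is arbitrary
```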

The presented approaches strike different balances between expressivity and the number of parameters that need to be trained. More expressive models, such as neural tensor networks, may better fit complex plausibility functions over lower-dimensional embeddings, but at the cost of many hidden parameters. In contrast, simpler models, and convolutional networks that enable parameter sharing by applying the same (typically small) kernels over different regions of a matrix, require fewer parameters overall and are more scalable.

Language Models

Embedding techniques were first explored as a way to represent natural language within machine learning frameworks, with word2vec13 and GloVe14 being two seminal approaches. Both approaches compute embeddings for words based on large corpora of text such that words used in similar contexts (e.g., “frog,” “toad”) have similar vectors.

Approaches for language embeddings can be applied to graphs. However, while graphs consist of an unordered set of sequences of three terms (i.e., a set of edges), text in natural language consists of arbitrary-length sequences of terms (i.e., sentences of words). Along these lines, RDF2Vec15 performs biased random walks on the graph and records the paths traversed as “sentences,” which are then fed as input into the word2vec model.
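
A rough sketch of the walk-to-sentence idea follows, using a toy graph with invented names and plain uniform walks rather than the biased walks RDF2Vec actually uses; the resulting token sequences are what would then be passed to a word2vec implementation.

```python
import random

random.seed(8)

# Toy directed, labelled graph with assumed names: node -> list of (label, node).
graph = {
    "A": [("bus", "B"), ("flight", "C")],
    "B": [("train", "C")],
    "C": [("flight", "A")],
}

def random_walk(start, length):
    """Record a walk as a 'sentence' of alternating nodes and edge labels."""
    sentence, node = [start], start
    for _ in range(length):
        if not graph.get(node):
            break
        label, node = random.choice(graph[node])
        sentence += [label, node]
    return sentence

walks = [random_walk(n, 4) for n in graph for _ in range(10)]
print(walks[0])
# These "sentences" can then be fed to a word2vec implementation
# (e.g., gensim's Word2Vec) to produce node and edge-label embeddings.
```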

RDF2Vec also proposes a second mode where sequences are generated for nodes from canonically-labelled sub-trees of which they are the root. KGloVe, in contrast, is based on the GloVe model. Much as the original GloVe model considers words that co-occur frequently in windows of text to be more related, KGloVe uses personalised PageRank to determine the nodes most related to a given node, and the results are then fed into the GloVe model.
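
Personalised PageRank itself is straightforward to sketch by power iteration, as below (toy adjacency matrix with assumed node names); in KGloVe, the resulting relatedness scores play the role that word co-occurrence counts from text windows play in the original GloVe model.

```python
import numpy as np

# Personalised PageRank by power iteration on a tiny adjacency structure,
# used here to find the nodes most related to a given seed node.
nodes = ["A", "B", "C"]                       # assumed node names
A = np.array([[0, 1, 1],                      # row i -> columns it links to
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)          # row-stochastic transition matrix

def personalised_pagerank(seed_index, alpha=0.85, iters=50):
    restart = np.zeros(len(nodes))
    restart[seed_index] = 1.0                 # always restart at the seed node
    rank = restart.copy()
    for _ in range(iters):
        rank = alpha * (P.T @ rank) + (1 - alpha) * restart
    return rank

print(dict(zip(nodes, personalised_pagerank(0).round(3))))
# The highest-ranked nodes (and their scores) are treated as the seed node's
# most related "context", analogous to GloVe's co-occurrence counts.
```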

Entailment-aware Models

The embeddings considered thus far use the data graph alone. But what if an ontology or set of rules is also provided? One may first consider using constraint rules to refine the predictions made by embeddings. More recent approaches instead propose joint embeddings that consider both the data graph and the rules. KALE16 computes entity and relation embeddings using a translational model (specifically TransE) that is adapted to further consider rules using t-norm fuzzy logics; however, this requires grounding the rules (instantiating their variables with terms from the graph), which can be costly.
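
To give a flavour of how rules can enter the objective, the sketch below scores one ground rule by mapping edge plausibilities into [0, 1] and combining them with a Łukasiewicz-style fuzzy implication. This is a generic illustration of the t-norm idea under assumed names and a stand-in edge score, not KALE's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(10)
d = 50

# Hedged sketch: scoring a single *ground* rule "if (s, bus, o) then
# (s, connects to, o)". The Łukasiewicz implication used here is one common
# t-norm-based choice, shown purely for illustration.
e_s, e_o = rng.normal(size=d), rng.normal(size=d)
r_bus, r_connects = rng.normal(size=d), rng.normal(size=d)

def edge_truth(es, rp, eo):
    """Map a plausibility score into [0, 1] so it can act as a fuzzy truth value."""
    return 1.0 / (1.0 + np.exp(-float(np.sum(es * rp * eo))))  # sigmoid of a stand-in score

body = edge_truth(e_s, r_bus, e_o)
head = edge_truth(e_s, r_connects, e_o)
rule_truth = min(1.0, 1.0 - body + head)   # fuzzy truth of "body implies head"
print(rule_truth)                          # maximised jointly with edge plausibilities
```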

An alternative approach, adopted by FSL17, observes that for a simple rule – such as one stating that every edge labelled bus implies a corresponding edge labelled connects to – the relation embedding bus should always return a lower plausibility than connects to. Thus, for all such rules, FSL proposes to train relation embeddings while avoiding violations of the corresponding inequalities. While relatively straightforward, FSL only supports simple rules, whereas KALE also supports more complex rules.
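
One way to picture the inequality idea is shown below: a penalty term that fires whenever a bus edge is scored as more plausible than the corresponding connects to edge for the same entity pair. This is a simplified, sampled illustration with a stand-in score; FSL itself imposes the constraint on the relation embeddings directly ("lifted"), without sampling entity pairs.

```python
import numpy as np

rng = np.random.default_rng(9)
d = 50

# For any entity pair, an edge labelled "bus" should never be scored as more
# plausible than the same edge labelled "connects to".
e_s, e_o = rng.normal(size=d), rng.normal(size=d)
r_bus, r_connects = rng.normal(size=d), rng.normal(size=d)

def score(es, rp, eo):
    return float(np.sum(es * rp * eo))        # DistMult-style stand-in score

violation = max(0.0, score(e_s, r_bus, e_o) - score(e_s, r_connects, e_o))
print(violation)   # added to the training loss to discourage rule violations
```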

Next part (section 5.3): Inductive knowledge – Graph neural networks.

Header image source: Crow Intelligence, CC BY-NC-SA 4.0.

References:

  1. Hogan, A., Blomqvist, E., Cochez, M., d’Amato, C., Melo, G. D., Gutierrez, C., … & Zimmermann, A. (2021). Knowledge graphs. ACM Computing Surveys (CSUR), 54(4), 1-37.
  2. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., & Yakhnenko, O. (2013). Translating embeddings for modeling multi-relational data. Advances in neural information processing systems, 26.
  3. Sun, Z., Deng, Z. H., Nie, J. Y., & Tang, J. (2019). RotatE: Knowledge graph embedding by relational rotation in complex space. arXiv preprint arXiv:1902.10197.
  4. Balazevic, I., Allen, C., & Hospedales, T. (2019). Multi-relational Poincaré graph embeddings. Advances in Neural Information Processing Systems, 32.
  5. Vervliet, N., Debals, O., Sorber, L., Van Barel, M., & De Lathauwer, L. (2016). Canonical polyadic decomposition. Tensorlab Manual.
  6. Yang, B., Yih, W. T., He, X., Gao, J., & Deng, L. (2014). Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575.
  7. Nickel, M., & Tresp, V. (2013). Tensor factorization for multi-relational learning. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part III 13 (pp. 617-621). Springer Berlin Heidelberg.
  8. Bordes, A., Glorot, X., Weston, J., & Bengio, Y. (2014). A semantic matching energy function for learning with multi-relational data: Application to word-sense disambiguation. Machine Learning, 94, 233-259.
  9. Socher, R., Chen, D., Manning, C. D., & Ng, A. (2013). Reasoning with neural tensor networks for knowledge base completion. Advances in neural information processing systems, 26.
  10. Dong, X., Gabrilovich, E., Heitz, G., Horn, W., Lao, N., Murphy, K., … & Zhang, W. (2014, August). Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 601-610).
  11. Dettmers, T., Minervini, P., Stenetorp, P., & Riedel, S. (2018, April). Convolutional 2d knowledge graph embeddings. In Proceedings of the AAAI conference on artificial intelligence (Vol. 32, No. 1).
  12. Balažević, I., Allen, C., & Hospedales, T. M. (2019). Hypernetwork knowledge graph embeddings. In Artificial Neural Networks and Machine Learning–ICANN 2019: Workshop and Special Sessions: 28th International Conference on Artificial Neural Networks, Munich, Germany, September 17–19, 2019, Proceedings 28 (pp. 553-565). Springer International Publishing.
  13. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  14. Pennington, J., Socher, R., & Manning, C. D. (2014, October). GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543).
  15. Ristoski, P., & Paulheim, H. (2016). RDF2Vec: RDF graph embeddings for data mining. In The Semantic Web–ISWC 2016: 15th International Semantic Web Conference, Kobe, Japan, October 17–21, 2016, Proceedings, Part I 15 (pp. 498-514). Springer International Publishing.
  16. Guo, S., Wang, Q., Wang, L., Wang, B., & Guo, L. (2016, November). Jointly embedding knowledge graphs and logical rules. In Proceedings of the 2016 conference on empirical methods in natural language processing (pp. 192-202).
  17. Demeester, T., Rocktäschel, T., & Riedel, S. (2016). Lifted rule injection for relation embeddings. arXiv preprint arXiv:1606.08359.

Bruce Boyes

Bruce Boyes (www.bruceboyes.info) is a knowledge management (KM), environmental management, and education professional with over 30 years of experience in Australia and China. His work has received high-level acclaim and been recognised through a number of significant awards. He is currently a PhD candidate in the Knowledge, Technology and Innovation Group at Wageningen University and Research, and holds a Master of Environmental Management with Distinction. He is also the editor, lead writer, and a director of the award-winning RealKM Magazine (www.realkm.com), and teaches in the Beijing Foreign Studies University (BFSU) Certified High-school Program (CHP).
