# Introduction to knowledge graphs (section 5.2): Inductive knowledge – Knowledge graph embeddings

*This article is section 5.2 of part 5 of the Introduction to knowledge graphs series of articles. Recent research has identified the development of knowledge graphs as an important aspect of artificial intelligence (AI) in knowledge management (KM).*

Machine learning can be used directly for *refining* a knowledge graph, or for *downstream tasks* that use the knowledge graph, such as recommendation, information extraction, question answering, query relaxation, query approximation, and so on. However, machine learning techniques typically assume numeric representations (e.g., vectors), distinct from how graphs are usually expressed. In this section of their comprehensive tutorial article^{1}, Hogan and colleagues explore how graphs can be encoded numerically for machine learning.

A first attempt to represent a graph using vectors would be to use a *one-hot encoding*, generating a vector of length |*L*| · |*V*| for each node – with |*V*| the number of nodes in the input graph and |*L*| the number of edge labels – placing a one at the corresponding index to indicate the existence of the respective edge in the graph, or zero otherwise. Such a representation will, however, typically result in large and sparse vectors, which will be detrimental for most machine learning models.
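
To make the sizes concrete, here is a minimal sketch of such a one-hot edge encoding, using an invented three-node, two-label graph (all names are illustrative):

```python
# One-hot edge encoding sketch: each node gets a vector of length |L|*|V|,
# one slot per (edge label, target node) pair, set to 1 if that edge exists
# from the node and 0 otherwise.

nodes = ["Arica", "Santiago", "Punta Arenas"]   # V, so |V| = 3
labels = ["bus", "flight"]                      # L, so |L| = 2
edges = {("Arica", "bus", "Santiago"),
         ("Santiago", "flight", "Punta Arenas")}

def one_hot(node):
    vec = [0] * (len(labels) * len(nodes))
    for li, label in enumerate(labels):
        for ni, target in enumerate(nodes):
            if (node, label, target) in edges:
                vec[li * len(nodes) + ni] = 1
    return vec

print(one_hot("Arica"))  # [0, 1, 0, 0, 0, 0]: one slot set out of |L|*|V| = 6
```

Even in this tiny graph, each vector has six entries with at most a couple of ones; for realistic graphs the vectors become enormous and almost entirely zero, which is the sparsity problem the paragraph above describes.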

The main goal of knowledge graph embedding techniques is to create a dense representation of the graph (i.e., *embed* the graph) in a continuous, low-dimensional vector space that can then be used for machine learning tasks. The dimensionality *d* of the embedding is fixed and typically low (often, e.g., 50 ≤ *d* ≤ 1000). Typically the graph embedding is composed of an *entity embedding* for each node: a vector with *d* dimensions that we denote by **e**; and a *relation embedding* for each edge label: (typically) a vector with *O*(*d*) dimensions that we denote by **r**. The overall goal of these vectors is to abstract and preserve latent structures in the graph. There are many ways in which this notion of an embedding can be instantiated. Most commonly, given an edge with subject node s, edge label p, and object node o, a specific embedding approach defines a *scoring function* that accepts **e**_{s} (the entity embedding of node s), **r**_{p} (the relation embedding of edge label p) and **e**_{o} (the entity embedding of node o) and computes the *plausibility* of the edge: how likely it is to be true. Given a data graph, the goal is then to compute the embeddings of dimension *d* that maximise the plausibility of positive edges (typically edges in the graph) and minimise the plausibility of negative examples (typically edges in the graph with a node or edge label changed such that they are no longer in the graph) according to the given scoring function. The resulting embeddings can then be seen as models learned through self-supervision that encode (latent) features of the graph, mapping input edges to output plausibility scores.
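
The negative examples mentioned above are typically generated by corrupting positive edges; a minimal sketch, using an invented toy graph (node and label names are illustrative):

```python
import random

# Negative sampling sketch: corrupt a positive edge by replacing its subject
# or object with a random node, keeping only triples that are absent from
# the graph, so they can serve as (presumed) negative examples.
nodes = ["a", "b", "c", "d"]
positive_edges = {("a", "p", "b"), ("b", "p", "c")}

def corrupt(edge, rng=random):
    s, p, o = edge
    while True:
        if rng.random() < 0.5:
            cand = (rng.choice(nodes), p, o)   # replace the subject
        else:
            cand = (s, p, rng.choice(nodes))   # replace the object
        if cand not in positive_edges:         # discard accidental positives
            return cand

neg = corrupt(("a", "p", "b"))
print(neg)  # a corrupted triple not present in the graph
```

Training then pushes the scoring function to rank each positive edge above its corrupted counterparts.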

Embeddings can then be used for a number of low-level tasks. The plausibility scoring function can be used to assign confidence to edges (possibly extracted from an external source) or to complete edges with missing nodes/edge labels (a.k.a. *link prediction*). Additionally, embeddings will typically assign similar vectors to similar terms and can thus be used for similarity measures.

A wide range of knowledge graph embedding techniques have been proposed. The most prominent are:

- *translational models*, where relations are seen as translating subject entities to object entities
- *tensor decomposition models*, which extract latent factors approximating the graph’s structure
- *neural models*, based on neural networks
- *language models*, based on word embedding techniques.

#### Translational Models

*Translational models* interpret edge labels as transformations from subject nodes (a.k.a. the *source* or *head*) to object nodes (a.k.a. the *target* or *tail*); for example, in an edge labelled bus, the edge label is seen as transforming the embedding of the source node into that of the target node, and likewise for other bus edges. A seminal approach is TransE^{2}. Over all positive edges, TransE learns vectors e_{s}, r_{p}, and e_{o} aiming to make e_{s} + r_{p} as close as possible to e_{o}. Conversely, if the edge is negative, then TransE attempts to learn a representation that keeps e_{s} + r_{p} away from e_{o}.
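
As a minimal illustration (with toy two-dimensional vectors rather than learned embeddings), the TransE plausibility score can be sketched as the negative distance between the translated subject and the object:

```python
import numpy as np

# TransE-style scoring sketch: plausibility is higher when e_s + r_p is
# close to e_o, so a common choice is the negative norm of the residual.

def transe_score(e_s, r_p, e_o):
    return -np.linalg.norm(e_s + r_p - e_o)

e_s = np.array([0.0, 1.0])
r_p = np.array([1.0, 0.0])
e_o = np.array([1.0, 1.0])          # exactly e_s + r_p: maximally plausible
print(transe_score(e_s, r_p, e_o))  # 0.0, the best possible score
```

During training, positive edges are pushed towards a score of zero while corrupted edges are pushed to lower (more negative) scores.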

TransE can however be too simplistic, as it will try to give similar vectors to all target locations, which may not be feasible given other edges. To resolve such issues, many variants of TransE have been investigated, typically using a distinct hyperplane (e.g., TransH) or vector space (e.g., TransR, TransD) for each type of relation. Recently, RotatE^{3} proposed translational embeddings in complex space, which can capture more characteristics of relations, such as direction, symmetry, inversion, antisymmetry, and composition. Embeddings have also been proposed in non-Euclidean space; e.g., MuRP^{4} uses relation embeddings that transform entity embeddings in the hyperbolic space of the Poincaré ball model, whose curvature provides more “space” to separate entities with respect to the dimensionality.
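
RotatE’s core idea can be sketched as follows (toy values; the real model learns the rotation angles): each relation rotates the subject embedding element-wise in the complex plane, and plausibility is again a negative distance.

```python
import numpy as np

# RotatE-style scoring sketch: each relation is an element-wise rotation in
# the complex plane (each entry of r_p has modulus 1), and plausibility is
# the negative distance between the rotated subject and the object.

def rotate_score(e_s, theta, e_o):
    r_p = np.exp(1j * theta)          # unit-modulus complex rotations
    return -np.linalg.norm(e_s * r_p - e_o)

e_s = np.array([1 + 0j, 0 + 1j])
theta = np.array([np.pi / 2, np.pi])  # rotate the two dims by 90° and 180°
e_o = e_s * np.exp(1j * theta)        # object is exactly the rotated subject
print(rotate_score(e_s, theta, e_o))  # 0.0
```

Because rotations compose, invert, and can be symmetric (a 180° rotation is its own inverse), this representation can model the relation patterns listed above.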

#### Tensor Decomposition Models

A second approach to derive graph embeddings is to apply methods based on *tensor decomposition*. A *tensor* is a multidimensional numeric field that generalises scalars (0-order tensors), vectors (1-order tensors), and matrices (2-order tensors) towards arbitrary dimension/order. Tensor decomposition involves decomposing a tensor into more “elemental” tensors (e.g., of lower order) from which the original tensor can be recomposed (or approximated) by a fixed sequence of basic operations. These elemental tensors can be seen as capturing *latent factors* in the original tensor. There are many approaches to tensor decomposition, including *rank decompositions*.

Leaving aside graphs, consider an (*a* × *b*)-matrix (i.e., a 2-order tensor) **C**, where each element (**C**)_{ij} denotes the average temperature of the *i*th city of Chile in the *j*th month of the year. Since Chile is a long, thin country – ranging from subpolar to desert climates – we may decompose **C** into two vectors representing latent factors – **x** (with *a* elements), giving lower values for cities with lower latitude, and **y** (with *b* elements), giving lower values for months with lower temperatures – such that computing the outer product of the two vectors approximates **C** reasonably well: **x** ⊗ **y** ≈ **C**. If there exist **x** and **y** such that **x** ⊗ **y** = **C**, then we call **C** a rank-1 matrix. Otherwise, the rank *r* of **C** is the minimum number of rank-1 matrices we need to sum to get precisely **C**, i.e., **x**_{1} ⊗ **y**_{1} + . . . + **x**_{r} ⊗ **y**_{r} = **C**. In the temperature example, **x**_{2} ⊗ **y**_{2} might correspond to a correction for altitude, **x**_{3} ⊗ **y**_{3} for higher temperature variance further south, and so on. A (low) rank decomposition of a matrix then sets a limit *d* on the rank and computes the vectors (**x**_{1}, **y**_{1}, . . . , **x**_{d}, **y**_{d}) such that **x**_{1} ⊗ **y**_{1} + · · · + **x**_{d} ⊗ **y**_{d} gives the best *d*-rank approximation of **C**. Noting that to generate *n*-order tensors we need to compute the outer product of *n* vectors, we can generalise this idea towards low-rank decomposition of tensors; this method is called canonical polyadic decomposition^{5}.
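
The best *d*-rank approximation of a matrix can be computed via the singular value decomposition; a sketch using random stand-in data (not real temperatures):

```python
import numpy as np

# Low-rank decomposition sketch: the SVD yields the best d-rank
# approximation of a matrix as a sum of d outer products of vectors
# (Eckart–Young theorem). C here is a random stand-in matrix.

rng = np.random.default_rng(0)
C = rng.standard_normal((5, 4))       # an (a x b)-matrix with a=5, b=4

U, s, Vt = np.linalg.svd(C)
d = 2
# Sum of d rank-1 matrices: s_k * (u_k outer v_k) for the top-d factors.
approx = sum(s[k] * np.outer(U[:, k], Vt[k, :]) for k in range(d))

# Using all min(a, b) factors recovers C exactly.
full = sum(s[k] * np.outer(U[:, k], Vt[k, :]) for k in range(4))
print(np.allclose(full, C))  # True
```

Each term `s[k] * np.outer(U[:, k], Vt[k, :])` plays the role of one **x**_{k} ⊗ **y**_{k} latent factor in the temperature example.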

DistMult^{6} is a seminal method for computing knowledge graph embeddings based on rank decompositions, but it does not capture edge direction. Rather than use a vector as a relation embedding, RESCAL^{7} uses a matrix, which can capture edge direction. However, RESCAL incurs a higher cost in terms of space and time than DistMult. More recently, ComplEx and HolE both use vectors for relation and entity embeddings: ComplEx uses complex-valued vectors, while HolE uses a circular correlation operator (on reals) to capture edge direction.
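
DistMult’s scoring function is a simple trilinear product; a sketch with toy values that also illustrates its symmetry in subject and object (the reason it cannot capture edge direction):

```python
import numpy as np

# DistMult-style scoring sketch: the score is the element-wise product of
# the three embeddings, summed. Swapping e_s and e_o leaves the score
# unchanged, so directed edges cannot be distinguished from their reverse.

def distmult_score(e_s, r_p, e_o):
    return float(np.sum(e_s * r_p * e_o))

e_s = np.array([1.0, 2.0])
r_p = np.array([0.5, 0.5])
e_o = np.array([2.0, 1.0])
print(distmult_score(e_s, r_p, e_o))  # 2.0
print(distmult_score(e_o, r_p, e_s))  # 2.0, identical with s and o swapped
```

ComplEx addresses exactly this limitation by making the embeddings complex-valued, which breaks the symmetry of the product.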

#### Neural Models

A number of approaches instead use neural networks to learn knowledge graph embeddings with non-linear scoring functions for plausibility.

An early neural model was semantic matching energy^{8}, which learns parameters (a.k.a. weights: **w**, **w′**) for two functions – *f*_{w}(e_{s}, r_{p}) and *g*_{w′}(e_{o}, r_{p}) – such that the dot product of the result of both functions gives the plausibility score. Both linear and bilinear variants of *f*_{w} and *g*_{w′} are proposed. Another early proposal was neural tensor networks^{9}, which maintains a tensor **W** of weights and computes plausibility scores by combining the outer product **e**_{s} ⊗ **W** ⊗ **e**_{o} with **r**_{p} and a standard neural layer over **e**_{s} and **e**_{o}. The tensor **W** yields a high number of parameters, limiting scalability. *Multi-layer perceptron*^{10} is a simpler model, where **e**_{s}, **r**_{p}, and **e**_{o} are concatenated and fed into a hidden layer to compute the plausibility score.

More recent models use convolutional kernels. ConvE^{11} generates a matrix from **e**_{s} and **r**_{p} by “wrapping” each vector over several rows and concatenating both matrices, over which (2D) convolutional layers generate the embeddings. A disadvantage is that wrapping vectors imposes an arbitrary two-dimensional structure on the embeddings. HypER^{12} also uses convolutions, but avoids such wrapping by applying a fully connected layer (called the “hypernetwork”) to **r**_{p} to generate relation-specific convolutional filters through which the embeddings are generated.

The presented approaches strike different balances between expressivity and the number of parameters that need to be trained. More expressive models, such as neural tensor networks, may better fit complex plausibility functions over lower-dimensional embeddings, but at the cost of many hidden parameters. Simpler models, and convolutional networks that enable parameter sharing by applying the same (typically small) kernels over different regions of a matrix, require fewer parameters overall and are more scalable.

#### Language Models

Embedding techniques were first explored as a way to represent natural language within machine learning frameworks, with word2vec^{13} and GloVe^{14} being two seminal approaches. Both approaches compute embeddings for words based on large corpora of text such that words used in similar contexts (e.g., “frog,” “toad”) have similar vectors.

Approaches for language embeddings can be applied for graphs. However, while graphs consist of an unordered set of sequences of three terms (i.e., a set of edges), text in natural language consists of arbitrary-length sequences of terms (i.e., sentences of words). Along these lines, RDF2Vec^{15} performs biased random walks on the graph and records the paths traversed as “sentences,” which are then fed as input into the word2vec model.
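
The walk-to-sentence idea can be sketched as follows (toy graph, uniform rather than biased walks, and omitting the subsequent word2vec step):

```python
import random

# Random-walk "sentence" generation sketch: starting from a node, repeatedly
# follow a random outgoing edge, recording both edge labels and nodes, so
# the walk reads like a sentence for a word-embedding model.

graph = {"Arica": [("bus", "Santiago")],
         "Santiago": [("bus", "Arica"), ("flight", "Punta Arenas")],
         "Punta Arenas": []}

def random_walk(start, length, rng=random):
    sentence = [start]
    node = start
    for _ in range(length):
        if not graph[node]:
            break                      # dead end: stop the walk early
        label, node = rng.choice(graph[node])
        sentence += [label, node]
    return sentence

print(random_walk("Arica", 2))
# e.g. ['Arica', 'bus', 'Santiago', 'flight', 'Punta Arenas']
```

A corpus of many such walks is then passed to word2vec exactly as if each walk were a natural-language sentence.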

RDF2Vec also proposes a second mode where sequences are generated for nodes from canonically-labelled sub-trees of which they are a root node. Conversely, KGloVe is based on the GloVe model. Much like how the original GloVe model considers words that co-occur frequently in windows of text to be more related, KGloVe uses personalised PageRank to determine the most related nodes to a given node, whose results are then fed into the GloVe model.

#### Entailment-aware Models

The embeddings thus far consider the data graph alone. But what if an ontology or set of rules is provided? One may first consider using constraint rules to refine the predictions made by embeddings. More recent approaches rather propose joint embeddings that consider both the data graph and rules. KALE^{16} computes entity and relation embeddings using a translational model (specifically TransE) that is adapted to further consider rules using *t-norm fuzzy logics*. But generating ground rules can be costly.

An alternative approach, adopted by FSL^{17}, observes that a simple rule – for example, one stating that every bus edge implies a connects to edge between the same nodes – implies that the relation embedding of bus should always return a lower plausibility than that of connects to. Thus, for all such rules, FSL proposes to train relation embeddings while avoiding violations of the corresponding inequalities. While relatively straightforward, FSL only supports simple rules, while KALE also supports more complex rules.
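
A hypothetical sketch of the inequality idea, phrased as a hinge-style penalty on the two relation embeddings (the actual FSL formulation differs in detail; the relation names and values here are illustrative):

```python
import numpy as np

# Rule-inequality penalty sketch: if bus implies connects-to, the bus
# relation embedding should not exceed the connects-to embedding. Penalise
# any component where it does, hinge-style.

def rule_violation_loss(r_sub, r_super):
    # zero when r_sub <= r_super component-wise, positive otherwise
    return float(np.sum(np.maximum(0.0, r_sub - r_super)))

r_bus = np.array([0.25, 1.0])
r_connects_to = np.array([0.5, 0.75])
print(rule_violation_loss(r_bus, r_connects_to))  # 0.25 (only dim 2 violates)
```

Adding such a penalty to the training loss steers the embeddings towards satisfying the rule without ever materialising its ground instances.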

**Next part: (section 5.3):** Inductive knowledge – Graph neural networks.

**Header image source:** Crow Intelligence, CC BY-NC-SA 4.0.

*References:*

- Hogan, A., Blomqvist, E., Cochez, M., d’Amato, C., Melo, G. D., Gutierrez, C., … & Zimmermann, A. (2021). Knowledge graphs. *ACM Computing Surveys (CSUR), 54*(4), 1-37.
- Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., & Yakhnenko, O. (2013). Translating embeddings for modeling multi-relational data. *Advances in Neural Information Processing Systems*, *26*.
- Sun, Z., Deng, Z. H., Nie, J. Y., & Tang, J. (2019). RotatE: Knowledge graph embedding by relational rotation in complex space. *arXiv preprint arXiv:1902.10197*.
- Balazevic, I., Allen, C., & Hospedales, T. (2019). Multi-relational Poincaré graph embeddings. *Advances in Neural Information Processing Systems*, *32*.
- Vervliet, N., Debals, O., Sorber, L., Van Barel, M., & De Lathauwer, L. (2016). Canonical polyadic decomposition. *Tensorlab Manual*.
- Yang, B., Yih, W. T., He, X., Gao, J., & Deng, L. (2014). Embedding entities and relations for learning and inference in knowledge bases. *arXiv preprint arXiv:1412.6575*.
- Nickel, M., & Tresp, V. (2013). Tensor factorization for multi-relational learning. In *Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part III* (pp. 617-621). Springer Berlin Heidelberg.
- Bordes, A., Glorot, X., Weston, J., & Bengio, Y. (2014). A semantic matching energy function for learning with multi-relational data: Application to word-sense disambiguation. *Machine Learning*, *94*, 233-259.
- Socher, R., Chen, D., Manning, C. D., & Ng, A. (2013). Reasoning with neural tensor networks for knowledge base completion. *Advances in Neural Information Processing Systems*, *26*.
- Dong, X., Gabrilovich, E., Heitz, G., Horn, W., Lao, N., Murphy, K., … & Zhang, W. (2014, August). Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In *Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining* (pp. 601-610).
- Dettmers, T., Minervini, P., Stenetorp, P., & Riedel, S. (2018, April). Convolutional 2D knowledge graph embeddings. In *Proceedings of the AAAI Conference on Artificial Intelligence* (Vol. 32, No. 1).
- Balažević, I., Allen, C., & Hospedales, T. M. (2019). Hypernetwork knowledge graph embeddings. In *Artificial Neural Networks and Machine Learning – ICANN 2019: Workshop and Special Sessions, 28th International Conference on Artificial Neural Networks, Munich, Germany, September 17–19, 2019, Proceedings* (pp. 553-565). Springer International Publishing.
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. *arXiv preprint arXiv:1301.3781*.
- Pennington, J., Socher, R., & Manning, C. D. (2014, October). GloVe: Global vectors for word representation. In *Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)* (pp. 1532-1543).
- Ristoski, P., & Paulheim, H. (2016). RDF2Vec: RDF graph embeddings for data mining. In *The Semantic Web – ISWC 2016: 15th International Semantic Web Conference, Kobe, Japan, October 17–21, 2016, Proceedings, Part I* (pp. 498-514). Springer International Publishing.
- Guo, S., Wang, Q., Wang, L., Wang, B., & Guo, L. (2016, November). Jointly embedding knowledge graphs and logical rules. In *Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing* (pp. 192-202).
- Demeester, T., Rocktäschel, T., & Riedel, S. (2016). Lifted rule injection for relation embeddings. *arXiv preprint arXiv:1606.08359*.