ABCs of KM Introduction to knowledge graphs

Introduction to knowledge graphs (section 3.1): Data graphs – Models

Bruce Boyes27 Mar 2023

1,201 4 minutes read

This article is section 3.1 of part 3 of the Introduction to knowledge graphs series of articles. Recent research has identified the development of knowledge graphs as an important aspect of artificial intelligence (AI) in knowledge management (KM).

Graphs offer a flexible way to conceptualise, represent, and integrate diverse and incomplete data. Drawing on Hogan and colleagues’ comprehensive tutorial article¹, this first section of the data graphs part of the series summarises the graph data models most commonly used in practice.

Directed Edge-labelled Graphs

Figure 1. Directed-edge labelled graph describing events and their venues (source: Hogan et al. 2021).

A directed edge-labelled graph, or del graph for short (also known as a multi-relational graph) is defined as a set of nodes – such as:and a set of directed labelled edges between those nodes, such as:

In knowledge graphs, nodes represent entities (the city Santiago; the hill Santa Lucía; noon on March 22nd, 2018; etc.) and edges represent binary relations between those entities (e.g., Santa Lucía is in the city Santiago). Figure 1 exemplifies how the tourism board could model event data as a del graph. Adding data to such a graph typically involves adding new nodes and edges. Representing incomplete information requires simply omitting a particular edge (e.g., the graph does not yet define a start/end date-time for the Food Truck festival). A standard data model based on del graphs is the Resource Description Framework (RDF).

Heterogeneous Graphs

A heterogeneous graph (or heterogeneous information network) is a graph where each node and edge is assigned one type. Heterogeneous graphs are thus akin to del graphs – with edge labels corresponding to edge types – but where the type of node forms part of the graph model itself, rather than being expressed as a special relation, as seen in Figure 2. An edge is called homogeneous if it is between two nodes of the same type (e.g., borders); otherwise it is called heterogeneous (e.g., capital).

Heterogeneous graphs allow for partitioning nodes according to their type, for example, for the purposes of machine learning tasks. However, unlike del graphs, they typically assume a one-to-one relation between nodes and types (notice the node “Santiago” with zero types and “EID15” with multiple types in the del graph of Figure 1).

Figure 2. Data about capitals and countries in a del graph and a heterogeneous graph (source: Hogan et al. 2021).

Figure 3. Flight data in a del graph and a property graph (source: Hogan et al. 2021).

Property Graphs

A property graph allows a set of property–value pairs and a label to be associated with nodes and edges, offering additional flexibility when modelling data. Consider, for example, modelling the airline companies that offer flights. In a del graph, we cannot directly annotate an edge like

with the company, but we could add a new node denoting a flight and connect it with the source, destination, companies, and mode, as shown in Figure 3(a). Applying this pattern to a large graph may require significant changes. Conversely, Figure 3(b) exemplifies a property graph with analogous data, where property–value pairs on edges model companies, property–value pairs on nodes indicate latitudes and longitudes, and node/edge labels indicate the type of node/edge. Though not yet standardised, property graphs are used in popular graph databases, such as Neo4j.

While the more intricate models offers greater flexibility in terms of how to encode data as a property graph (e.g., using property graphs, we can continue modelling flights as edges in Figure 3(b)) potentially leading to a more intuitive representation, these additional details likewise require more intricate query languages, formal semantics, and inductive techniques versus simpler graph models such as del graphs or heterogeneous graphs.

Graph Dataset

A graph dataset allows for managing several graphs and consists of a set of named graphs and a default graph. Each named graph is a pair of a graph ID and a graph. The default graph is a graph without an ID and is referenced “by default” if a graph ID is not specified. Figure 4 provides an example where events and routes are stored in two named graphs, and the default graph manages meta-data about the named graphs.

Though the example uses del graphs, graph datasets can be generalised to other types of graphs. Graph datasets are useful for managing and querying data from multiple sources, where each source can be managed as a separate graph, allowing individual graphs to be queried, updated, removed, and so on, as needed.

Figure 4. Graph dataset with two named graphs and a default graph describing events and routes (source: Hogan et al. 2021).

Other Graph Data Models

The graph models presented thus far are the most popular in practice. Other graph data models exist with nodes that may contain individual edges or even nested graphs (a.k.a. hypernodes). Likewise, hypergraphs allow edges that connect sets rather than pairs of nodes. Nonetheless, data can typically be converted from one model to another; in the view of Hogan and colleagues, a knowledge graph can thus adopt any such graph data model. They discuss del graphs given their relative succinctness, but most discussion extends naturally to other models.

Graph Stores

A variety of techniques have been proposed for storing and indexing graphs, facilitating the efficient evaluation of queries (as will be discussed in the next section). Directed-edge labelled graphs can be stored in relational databases either as a single relation of arity three (triple table), as a binary relation for each property (vertical partitioning), or as n-ary relations for entities of a given type (property tables).

Custom storage techniques have also been developed for a variety of graph models, providing efficient access for finding nodes, edges, and their adjacent elements. A number of systems further allow for distributing graphs over multiple machines based on popular NoSQL stores or custom partitioning schemes.

Creation

We have seen how knowledge graphs can be modelled and stored, but how are they created? Creation often involves integrating data from diverse sources, including direct human input; extraction from existing text, markup, legacy file formats, relational databases, other knowledge graphs; and so on.

Next part: (section 3.2): Data graphs – Querying.

Header image source: Crow Intelligence, CC BY-NC-SA 4.0.

References:

Hogan, A., Blomqvist, E., Cochez, M., d’Amato, C., Melo, G. D., Gutierrez, C., … & Zimmermann, A. (2021). Knowledge graphs. ACM Computing Surveys (CSUR), 54(4), 1-37. ↩

Rate this post

Cookie	Duration	Description
cookielawinfo-checkbox-advertisement	1 year	Set by the GDPR Cookie Consent plugin, this cookie is used to record the user consent for the cookies in the "Advertisement" category .
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Cookie	Duration	Description
_ga	2 years	The _ga cookie, installed by Google Analytics, calculates visitor, session and campaign data and also keeps track of site usage for the site's analytics report. The cookie stores information anonymously and assigns a randomly generated number to recognize unique visitors.
_gid	1 day	Installed by Google Analytics, _gid cookie stores information on how visitors use a website, while also creating an analytics report of the website's performance. Some of the data that are collected include the number of visitors, their source, and the pages they visit anonymously.

Directed Edge-labelled Graphs

Heterogeneous Graphs

Property Graphs

Graph Dataset

Other Graph Data Models

Graph Stores

Creation

Bruce Boyes

Related Articles

Introduction to knowledge graphs (part 4): Deductive knowledge

Because You Need to Know – Sarah Berndt Hampton

Because You Need to Know – Kevin L Moffatt

A new tool helps researchers show their impact