Introduction to knowledge graphs (part 2): History of knowledge graphs
This article is part 2 of the Introduction to knowledge graphs series of articles.
As discussed in part 1, this series of articles provides an introduction to knowledge graphs to assist in advancing artificial intelligence (AI) in knowledge management (KM). This second article in the series summarises the history of knowledge graphs, drawing on a recent paper[1] published in the journal Communications of the ACM.
Paper authors Claudio Gutiérrez and Juan F. Sequeda advise that an awareness of the history of knowledge graphs is very important:
The notion of Knowledge Graph stems from scientific advancements in diverse research areas such as Semantic Web, databases, knowledge representation and reasoning, NLP, and machine learning, among others. The integration of ideas and techniques from such disparate disciplines presents a challenge to practitioners and researchers to know how current advances develop from, and are rooted in, early techniques.
Understanding the historical context and background … is of utmost importance in order to understand the possible avenues of the future.
Gutiérrez and Sequeda provide an overview of the historical roots of the notion of knowledge graphs, focusing on developments after the advent of computing in its modern sense (1950s).
They periodize the relevant ideas, techniques, and systems into five themes, and present them using two core ideas – data and knowledge – together with a discussion on data+knowledge showing their interplay.
Links to further information in Wikipedia[2] and other reference sources have been added throughout the text.
Advent of the digital age
The beginnings of computer science are marked by the advent and spread of digital computers and the first programming languages.
The first program to process complex information was the “Logic Theorist” by Newell, Shaw, and Simon in 1956. In 1958, they developed the “General Problem Solver”, which well illustrated the paradigm researchers were pursuing: to construct computer programs that can solve problems requiring intelligence and adaptation.
The growth in data processing needs brought a division of labor, and Edgar Codd introduced the relational data model, which provided representational independence. Peter Chen introduced the entity-relationship model, which incorporated semantic information of the real world in the form of graphs.
At the system level, relational database management systems (RDBMS) were developed and implemented to manage data based on the relational model, including relational query languages such as SEQUEL and QUEL.
An example of developments in reasoning was ELIZA, a program that could carry a dialogue in English.
Researchers recognized that the process of searching in large spaces represented a form of “intelligence” or “reasoning”, and that an understanding of such spaces would ease searching. The idea of searching in diverse and complex spaces was new, and Dijkstra’s famous algorithm for finding shortest paths is from 1956.
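The idea of systematically searching a large space can be illustrated with a minimal sketch of Dijkstra's shortest-path algorithm in Python; the graph and its node names here are invented for illustration:

```python
import heapq

def dijkstra(graph, source):
    """Compute shortest-path distances from source using a priority queue."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry; a shorter path was already found
        for neighbour, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour] = nd
                heapq.heappush(heap, (nd, neighbour))
    return dist

# Illustrative weighted graph: node -> list of (neighbour, edge weight)
graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2), ("d", 5)],
    "c": [("d", 1)],
}
print(dijkstra(graph, "a"))  # {'a': 0, 'b': 1, 'c': 3, 'd': 4}
```

The priority queue ensures nodes are settled in order of increasing distance, which is what makes an "understanding of the space" pay off over blind enumeration.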
An early system to manage data was the Integrated Data Store (IDS), designed by Charles Bachman in 1963. It was the basis for the CODASYL standard, and systems built on this approach became known as database management systems (DBMS).
Semantic networks were introduced by Richard H. Richens in 1956 and developed further by Ross Quillian. They were aimed at allowing information to be stored and processed in a computer following the model of human memory.
This period was characterized by the first realizations of automated reasoning, searching in large spaces, the need to understand natural language, and the relevance of systems and high-level languages for managing data.
Data and knowledge foundations
The 1970s witnessed wider adoption of computing in industry, and the creation of the Semantic Network Processing System (SNePS).
In the mid-1970s, several critiques of semantic network structures emerged. Researchers focused on extending semantic networks with formal semantics, and in 1976 John Sowa introduced conceptual graphs, an intermediate language to map natural language queries and assertions to a relational database.
In the 1970s, data and knowledge began to be integrated using logic as a common foundation, an approach that became known as logic programming.
Early systems that could reason based on knowledge, known as expert systems, were developed to solve complex problems. These systems encoded domain knowledge as if-then rules.
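The if-then style of encoding domain knowledge can be sketched as a tiny forward-chaining rule engine in Python; the rules and facts below are invented for illustration and are far simpler than those of a real expert system:

```python
# Each rule: if all premises hold, assert the conclusion.
rules = [
    ({"has_fever", "has_cough"}, "possible_flu"),
    ({"possible_flu", "high_risk_patient"}, "recommend_test"),
]

def forward_chain(facts, rules):
    """Repeatedly fire rules until no new facts can be derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

derived = forward_chain({"has_fever", "has_cough", "high_risk_patient"}, rules)
print("recommend_test" in derived)  # True
```

Note how the second rule only fires once the first has derived `possible_flu`: chaining rules in this way is what let expert systems reach conclusions several inference steps away from the input facts.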
The 1977 workshop on “Logic and Data Bases” in Toulouse, France, is considered a landmark event because it formalized the link between logic and databases.
This period saw the realization of representational independence, the practical implementation of the relational model, and a growing awareness of the potential of combining logic and data by means of networks.
Coming-of-age of data and knowledge
The 1980s saw the evolution of computing as it transitioned from industry to homes through the boom of personal computers. Knowledge representation systems with object-oriented abstractions were being developed, including KL-ONE, LOOM, and CLASSIC.
The Japanese Fifth Generation Project, which adopted logic programming as a basis, sparked worldwide activity leading to competing projects such as the Cyc project, which created the world’s largest knowledge base of common sense.
Expert systems proliferated in the 1980s and were deployed on parallel computers, while the Internet changed the way people communicated and exchanged information.
Increasing computational power pushed the development of new computing fields and artifacts, which in turn generated complex data that needed to be managed. This led to the development of object-oriented databases (OODB), which were used by academia and industry.
Graphs were investigated as a representation for object-oriented data, graphical and visual interfaces, hypertext, etc.
The 1980s saw the rise of description logics, formal knowledge representation languages that grew out of semantic networks and frames.
On the academic side, an initial approach to combining logic and data was to layer logic programming on top of relational databases. This gave rise to deductive database systems, which natively extended relational databases with recursive rules.
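The distinctive feature of deductive databases, recursive rules evaluated over base relations, can be sketched as a naive bottom-up fixpoint computation in Python. The parent facts are invented for illustration, and a real deductive database (e.g. a Datalog engine) evaluates such rules far more efficiently:

```python
# Base relation (the "extensional database"): parent(child, parent) facts.
parent = {("alice", "bob"), ("bob", "carol"), ("carol", "dana")}

def ancestors(parent_facts):
    """Evaluate the recursive rules
         ancestor(X, Y) :- parent(X, Y).
         ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z).
       bottom-up, until no new facts appear (a fixpoint)."""
    ancestor = set(parent_facts)
    while True:
        new = {(x, z)
               for (x, y) in parent_facts
               for (y2, z) in ancestor
               if y == y2}
        if new <= ancestor:
            return ancestor
        ancestor |= new

print(("alice", "dana") in ancestors(parent))  # True
```

The recursive second rule is exactly what plain relational query languages of the time could not express, which is why deductive databases extended them rather than sitting on top unchanged.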
Expert systems proved difficult to update and maintain, and were limited to specific domains. The IT world moved on and rolled their experience into mainstream IT tools.
By the end of the decade, the first systematic study[3] with the term “knowledge graph” appeared. It focused on the integration between logic and data and the relevance of the trade-off between expressive power of logical languages and the computational complexity of reasoning tasks.
Data, knowledge, and the web
The 1990s witnessed the emergence of the World Wide Web and the digitization of almost all aspects of our society. These phenomena paved the way for big data.
The database industry focused on developing RDBMS to address the demands of e-commerce, and on research into data integration, data warehouses, and data mining.
The data community moved toward the web, and developed semi-structured data models such as Object Exchange Model (OEM), Extensible Markup Language (XML), and Resource Description Framework (RDF).
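RDF's core idea, representing data as subject-predicate-object triples that together form a graph, can be sketched with plain tuples. The resources and predicate names below (the `ex:` namespace) are invented for illustration:

```python
# A tiny RDF-style graph: each triple is (subject, predicate, object).
triples = {
    ("ex:einstein", "rdf:type", "ex:Person"),
    ("ex:einstein", "ex:name", "Albert Einstein"),
    ("ex:einstein", "ex:bornIn", "ex:ulm"),
    ("ex:ulm", "ex:name", "Ulm"),
}

def objects(triples, subject, predicate):
    """Simple pattern match: all objects for a given subject and predicate."""
    return {o for (s, p, o) in triples if s == subject and p == predicate}

print(objects(triples, "ex:einstein", "ex:name"))  # {'Albert Einstein'}
```

Because any node can appear as the subject of further triples (here `ex:ulm`), the data is self-describing and naturally graph-shaped rather than table-shaped, which is what made the model attractive for semi-structured web data.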
During this time, organizations required integration of multiple, distributed, and heterogeneous data sources to make business decisions. This led to the development of data warehouse systems that could support analytics on multi-dimensional data cubes.
With hardware not yet up to the task, researchers also grappled with the knowledge acquisition bottleneck[4], the difficulty of eliciting and encoding expert knowledge, in implementing knowledge-based systems.
The Web embodied the realization that knowledge should be shared and reused, and led to the spread of new languages, video and image content, and social networks.
The description logic research community continued to study trade-offs and define new profiles of logic for knowledge representation. They joined forces and generated the DAML+OIL infrastructure, which influenced the standardization of the Web Ontology Language (OWL).
Big data drove statistical approaches to knowledge via machine learning and neural networks, and the 2012 work[5] on image classification with deep convolutional neural networks on GPUs initiated a new phase of AI: deep learning.
Gruber defined ontologies as formal specifications of conceptualizations, and the first ontology engineering tools were created to help users encode knowledge.
Data+Knowledge was manifested through specialized workshops on Deductive Databases (1990–1999) and Knowledge Representation Meets Databases (1994–2003).
The Semantic Web project is an endeavor to combine knowledge and data on the Web. It is influenced by several developments, such as SHOE, Ontobroker, Ontology Inference Layer (OIL) and DARPA Agent Markup Language (DAML), KQML, and the EU-funded Thematic Network OntoWeb.
The Web was rapidly changing the way data, information, and knowledge were traditionally conceived, and computational power was not yet sufficient to handle the new levels of data the Web produced.
Data and knowledge at large scale
The 2000s saw the explosion of e-commerce and online social networks, and the introduction of deep learning into AI. The emergence of non-relational, distributed data stores gave rise to “NoSQL” databases, which manage column, document, key-value, and graph data models.
Tim Berners-Lee coined the term “linked data” in 2006, and the Linked Open Data project was born. This project led to the creation in 2011 of Wikidata, a collaboratively edited multilingual knowledge graph hosted by the Wikimedia Foundation (see part 1 of this series).
At the beginning of the 21st century, statistical techniques for large-scale data processing, such as speech recognition, natural language processing (NLP), and image processing, advanced rapidly. This motivated the search for new ways of storing, managing, and integrating data and knowledge in the world of big data.
Where are we now?
In the history of computer science, there has been a never-ending growth of data and knowledge, and many different ideas, theories, and techniques have been developed to deal with it.
Data was traditionally considered a commodity, and a material one at that. Since the second half of the 20th century, computer scientists have developed ideas, techniques, and systems to elevate data to the conceptual place it deserves.
In 2012, Google announced a product called the Google Knowledge Graph, and a number of other large companies soon followed with their own knowledge graphs. Academia began to adopt the term “knowledge graph” to refer loosely to systems that integrate data using some form of graph structure, in effect a reincarnation of the Semantic Web and linked data.
Knowledge graphs represent a convergence of data and knowledge techniques around the old notion of graphs or networks. Numerous graph query languages are being developed, spanning industrial and research languages, alongside the upcoming ISO standard Graph Query Language (GQL).
It is not easy to predict the future, particularly the outcome of the interplay between data and knowledge, between statistics and logic. We should look at the past to inspire the future.
Next part (part 3): Data graphs, deductive knowledge, inductive knowledge.
Acknowledgements: This summary was drafted by Wordtune Read with corrections, further edits, and link additions by Bruce Boyes.
Header image source: Crow Intelligence, CC BY-NC-SA 4.0.
1. Gutiérrez, C., & Sequeda, J. F. (2021). Knowledge graphs. Communications of the ACM, 64(3), 96-104.
2. CC BY-SA 3.0.
3. Bakker, R. R. (1987). Knowledge graphs: Representation and structuring of scientific knowledge. PhD thesis, University of Twente.
4. Feigenbaum, E. A. (1984). Knowledge engineering. Annals of the New York Academy of Sciences, 426(1), 91-107.
5. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84-90.