MOSIG Master 2ND YEAR Research
YEAR 2020–2021

Combining link keys and similarity-based approaches to data interlinking

Master research topic

Link specifications aim at identifying the descriptions of the same objects in different RDF data sets. This allows the joint exploitation of these data sets. Because a single specification may only cover part of these data sets, it is useful to extract combinations of these specifications.

Society at large requests access to data available from various bodies: governments, universities, cultural actors, etc. This has led to the release of a vast quantity of linked data [Heath 2011], i.e., data expressed in semantic web formalisms (RDF). Part of the added value of linked data lies in the links identifying the same entity in different datasets. For instance, they may identify the same books and articles in different bibliographical data sources. Links make it possible to jointly exploit the content of data sources and to draw inferences across datasets. Thus, finding the manifestations of the same entity across several datasets is a crucial task for linked data.

Data interlinking refers to the process of finding pairs of IRIs representing the same resource in different RDF data sets [Ferrara 2011; Nentwig 2017]. The result of this process is a set of links, which may be added to the data sets by relating the corresponding IRIs with the owl:sameAs property. The task can be defined as: given two sets of individual identifiers ID and ID' from two data sets D and D', find the set L of pairs of identifiers which denote the same resource.

This task is performed by exploiting a link specification defining how to extract such links. There are several types of link specifications. The most prominent uses a similarity measure and a threshold: it returns all pairs of identifiers for which the similarity of the identified objects is above the threshold. We are developing another approach based on link keys [Atencia 2014]: it returns all pairs of identifiers whose properties satisfy specific constraints, e.g. having the same values for "firstname" in one data set and "prénom" in the other. Each approach has benefits and drawbacks, and none solves all problems.
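The two kinds of link specification can be illustrated with a minimal Python sketch. It assumes that each data set is a dictionary mapping an identifier to a description made of properties and sets of values. The function names (`similarity_spec`, `link_key_spec`), the choice of Jaccard similarity, and the toy data are illustrative assumptions, not taken from the cited work. In particular, the link key condition used here (the compared properties share at least one value) is only one possible reading of "having the same values".

```python
from itertools import product

# Hypothetical toy data sets: identifier -> {property: set of values}.
D1 = {
    "a1": {"firstname": {"Ada"}, "lastname": {"Lovelace"}},
    "a2": {"firstname": {"Alan"}, "lastname": {"Turing"}},
}
D2 = {
    "b1": {"prénom": {"Ada"}, "nom": {"Lovelace"}},
    "b2": {"prénom": {"Alan"}, "nom": {"Turing"}},
    "b3": {"prénom": {"Alan"}, "nom": {"Kay"}},
}


def jaccard(x, y):
    """Jaccard similarity of two sets (1.0 if both are empty)."""
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def similarity_spec(threshold):
    """Similarity-based specification (assumed measure: Jaccard on all values)."""
    def spec(d1, d2):
        links = set()
        for (i1, o1), (i2, o2) in product(d1.items(), d2.items()):
            # Pool all values of each object, ignoring property names.
            v1 = set().union(*o1.values())
            v2 = set().union(*o2.values())
            # A pair is kept when its similarity reaches the threshold.
            if jaccard(v1, v2) >= threshold:
                links.add((i1, i2))
        return links
    return spec


def link_key_spec(property_pairs):
    """Link-key-like specification: every property pair must share a value."""
    def spec(d1, d2):
        links = set()
        for (i1, o1), (i2, o2) in product(d1.items(), d2.items()):
            if all(o1.get(p, set()) & o2.get(q, set()) for p, q in property_pairs):
                links.add((i1, i2))
        return links
    return spec


sim_links = similarity_spec(0.5)(D1, D2)
key_links = link_key_spec([("firstname", "prénom")])(D1, D2)
print(sorted(sim_links))
print(sorted(key_links))
```

On this toy data, the single-property link key also links "a2" to "b3" because both share the first name "Alan", while the similarity-based specification does not. This is the kind of difference in behaviour that motivates comparing and combining the approaches.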

Hence, this master topic aims at considering ways to combine such approaches elegantly. The starting point is recent work on combining several link keys to achieve better results than single ones [Atencia 2019]. This work may be extended by generalising the combination approach to other link specifications. Indeed, because link specifications have the same interface, they may be combined and the semantics of the combination may be uniformly defined. The work may also be extended to other ways of combining these link specifications.
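As a hedged illustration of why a shared interface makes combination uniform, the sketch below treats every link specification as a function from two data sets to a set of identifier pairs. The operators `disjunction` and `conjunction`, the component specifications `shares_value` and `token_similarity`, and the bibliographic toy data are all hypothetical names and choices made for this example; they are not the operators or the semantics defined in [Atencia 2019].

```python
from typing import Callable, Dict, Set, Tuple

Dataset = Dict[str, Dict[str, Set[str]]]
Links = Set[Tuple[str, str]]
LinkSpec = Callable[[Dataset, Dataset], Links]


def disjunction(*specs: LinkSpec) -> LinkSpec:
    """A pair is linked if at least one component specification links it."""
    def spec(d1: Dataset, d2: Dataset) -> Links:
        links: Links = set()
        for s in specs:
            links |= s(d1, d2)
        return links
    return spec


def conjunction(*specs: LinkSpec) -> LinkSpec:
    """A pair is linked only if every component specification links it."""
    def spec(d1: Dataset, d2: Dataset) -> Links:
        results = [s(d1, d2) for s in specs]
        return set.intersection(*results) if results else set()
    return spec


def shares_value(p: str, q: str) -> LinkSpec:
    """Link-key-like component: properties p and q share at least one value."""
    def spec(d1: Dataset, d2: Dataset) -> Links:
        return {(i1, i2)
                for i1, o1 in d1.items() for i2, o2 in d2.items()
                if o1.get(p, set()) & o2.get(q, set())}
    return spec


def token_similarity(p: str, q: str, threshold: float) -> LinkSpec:
    """Similarity-based component: Jaccard overlap of lower-cased tokens."""
    def spec(d1: Dataset, d2: Dataset) -> Links:
        links: Links = set()
        for i1, o1 in d1.items():
            for i2, o2 in d2.items():
                t1 = {t.lower() for v in o1.get(p, set()) for t in v.split()}
                t2 = {t.lower() for v in o2.get(q, set()) for t in v.split()}
                union = t1 | t2
                if union and len(t1 & t2) / len(union) >= threshold:
                    links.add((i1, i2))
        return links
    return spec


# Hypothetical bibliographic data: only some books carry an ISBN.
D1: Dataset = {
    "x1": {"title": {"Linked data"}, "isbn": {"111"}},
    "x2": {"title": {"Ontology matching"}},
}
D2: Dataset = {
    "y1": {"titre": {"Linked data: evolving the web"}, "isbn": {"111"}},
    "y2": {"titre": {"Ontology Matching"}},
}

by_isbn = shares_value("isbn", "isbn")
by_title = token_similarity("title", "titre", 0.8)

either = disjunction(by_isbn, by_title)(D1, D2)
both = conjunction(by_isbn, by_title)(D1, D2)
print(sorted(either))
print(sorted(both))
```

In this example each component covers only part of the data (the ISBN-based one finds the first book, the title-based one the second), so the disjunction recovers both links while the conjunction returns none. Other combination operators could be plugged in the same way, since they only rely on the common function signature.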

Last, but not least, automatically extracting such compound specifications may be computationally demanding. We developed heuristics for doing so, but they will have to be adapted to new combination operators and specifications. These heuristics rely on measures for evaluating the quality of extracted link keys, which could be reconsidered.


[Atencia 2014] Manuel Atencia, Jérôme David, Jérôme Euzenat, Data interlinking through robust linkkey extraction, in: Torsten Schaub, Gerhard Friedrich, Barry O'Sullivan (eds), Proc. 21st european conference on artificial intelligence (ECAI), Praha (CZ), pp. 15-20, 2014
[Atencia 2019] Manuel Atencia, Jérôme David, Jérôme Euzenat, Several link keys are better than one, or extracting disjunctions of link key candidates, in: Proc. 10th ACM international conference on knowledge capture (K-Cap), Marina del Rey (CA US), 2019
[Ferrara 2011] Alfio Ferrara, Andriy Nikolov, François Scharffe, Data linking for the semantic web, International Journal of Semantic Web and Information Systems 7(3):46-76, 2011
[Heath 2011] Tom Heath, Chris Bizer, Linked data: evolving the web into a global data space, Morgan & Claypool, 2011
[Nentwig 2017] M. Nentwig, M. Hartung, A.-C. Ngonga Ngomo, E. Rahm, A survey of current link discovery frameworks, Semantic web journal 8(3):419-436, 2017


Reference number: Proposal n°2748

Master profile: M2R MOSIG, Artificial intelligence and the web profile, M2R MSIAM or M2R Informatics

Advisors: Jérôme Euzenat (Jerome:Euzenat#inria:fr) and Jérôme David (Jerome:David#inria:fr).

Team: The work will be carried out in the mOeX team, common to INRIA & Université Grenoble Alpes. mOeX is dedicated to studying knowledge evolution through adaptation. It gathers permanent researchers from the Exmo team, which has taken an active part over the past 15 years in the development of the semantic web and, more specifically, ontology matching.

Laboratory: LIG.

Procedure: Contact us and provide a curriculum vitæ and, possibly, a motivation letter and references.