1st International Workshop on
Challenges and Experiences
from Data Integration to Knowledge Graphs

August 5, 2019

Anchorage, Alaska

Held in conjunction with KDD 2019

About the Workshop

General info about DI2KG Workshop

Data integration and knowledge graph construction are complex processes that have been studied by different communities, including data management, machine learning, statistics, data science, natural language processing and information retrieval, typically in isolation. As holistic solutions are emerging, we claim the need for a more cross-disciplinary community that pushes research toward the creation of the next generation of data integration and knowledge graph construction methods. We aim for DI2KG to be a long-term venue that fosters new research based on the availability of an end-to-end benchmark, designed up-front for dealing with the complexity of every integration task, while building a community that can contribute to the evolution of the benchmark.

In order to stimulate advances in this direction we will also host the DI2KG challenge: a set of fundamental integration tasks leading to the construction of a knowledge graph from a collection of product specifications extracted from the Web with their own manually-checked ground truth.

Papers

Submit your paper

We welcome submissions that can stimulate discussion, including papers by the challenge participants. Topics of interest include but are not limited to the following:

  • Source selection and discovery.
  • Data and information extraction.
  • Data cleaning and fusion.
  • Schema extraction and alignment.
  • Algorithmic and statistical techniques for entity resolution.
  • Machine learning methods for data integration.
  • Benchmarking and performance measurement.
  • Knowledge graph augmentation.
  • Knowledge graph embedding techniques.

Submissions may be up to 4 pages in length (plus bibliography) in KDD 2019 format.

Submission Paper Categories


Challenge Papers

  • Experience papers, which provide new insights into the strengths and weaknesses of existing integration systems, inspired by experimental activities on the benchmark.



Research Papers

  • Position papers, which discuss requirements for a benchmark platform and the role of benchmarks in driving integration research.
  • Vision papers, which anticipate new challenges in integration and future research direction.
  • Application papers, which describe challenging use cases of modern data integration and knowledge graphs, with a strong economical and social impact component.
  • Technical papers, which present advances in topics related to integration.



Important Dates

(All deadlines are Alofi Time)

Challenge track


Benchmark
publication

April 21, 2019

Preliminary paper submissions

May 20, 2019

Paper
notifications

June 1, 2019

Benchmark results submission

July 16, 2019



Research track


Paper submissions

May 20, 2019 (extended from May 5, 2019)

Paper notifications

June 1, 2019



Challenge

Join our challenge

Overview


We would like to bring together people from different communities because we believe that a more synergistic approach can lead to the definition of more effective integration methods. This year we will release the first version of our benchmark and host a challenge on different integration tasks. Attendees are invited to participate in our benchmark-supported DI2KG challenge and submit a paper describing their experience with the benchmark activities, along with new insights into the strengths and weaknesses of existing integration systems.

Tasks Definition


Our end-to-end benchmark will evaluate participants' solutions to a selection of integration tasks leading to the construction of a knowledge graph.

The challenge comprises three main tasks:

  • Entity Resolution
  • Schema Alignment
  • Knowledge Graph Augmentation

Each task requires participants to build a knowledge graph consisting of a set of predefined entities and properties. The main properties are the specification names that our benchmark considers during score calculation.

Dataset


Participants will be provided with a set of selected HTML pages regarding products from a variety of sources, each page paired with a JSON file containing the result of an automated specification-extraction process.

The JSON files consist of key/value pairs extracted from the associated HTML page, e.g.:

                  
{
"<page title>": "Samsung Smart WB50F Digital Camera White Price in India with Offers & Full Specifications | PriceDekho.com",
"additional features": "Color\nWhite",
"brand": "Samsung",
"connectivity system req": "USB\nUSB 2.0",
"dimension": "Dimensions\n101 x 68 x 27.1 mm\nWeight\n157 gms",
"display": "Display Type\nLCD\nScreen Size\n3 Inches",
"general features": "Brand\nSamsung\nAnnounced\n2014, February\nStatus\nAvailable",
"lens": "Auto Focus\nCenter AF, Face Detection, Multi AF\nFocal Length\n4.3 - 51.6 mm (35 mm Equivalent to 24 - 288 mm)",
"media software": "Memory Card Type\nSD, SDHC, SDXC",
"optical sensor resolution in megapixel": "16.2 MP",
"other features": "ISO Rating\nAuto / 80 / 100 / 200 / 400 / 800 / 1600 / 3200\nSelf Timer\n2 sec, 10 sec\nFace Detection\nYes\nImage Stabilizer\nOptical\nMetering\nCenter, Multi, Spot\nExposure Compensation\n1/3 EV Steps, +/-2.0 EV\nMacro Mode (Exposure Mode)\n5 - 80 cm (W)\nRed Eye Reduction\nYes\nWhite Balancing\nAuto\nMicrophone\nBuilt-In Monaural Microphone",
"pixels": "Optical Sensor Resolution (in MegaPixel)\n16.2 MP",
"sensor": "Sensor Type\nCCD Sensor\nSensor Size\n1/2.3 Inches",
"sensor type": "CCD Sensor",
"shutter speed": "Maximum Shutter Speed\n1/2000 sec\nMinimum Shutter Speed\n2 sec",
"zoom": "Optical Zoom\n12x\nDigital Zoom\n2x"
}
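Note that many values pack several sub-attributes separated by newline characters (e.g., "Dimensions\n101 x 68 x 27.1 mm\nWeight\n157 gms"). As a minimal illustration of how such a specification could be pre-processed, the sketch below splits evenly alternating lines into sub-key/sub-value pairs; the function name and the alternation heuristic are our own assumptions, not part of the benchmark:

```python
import json

def parse_specification(raw_json):
    """Parse one extracted specification (a JSON string of key/value
    pairs) and split multi-line values into sub-key/sub-value pairs
    when the lines alternate evenly (a heuristic, not guaranteed)."""
    spec = json.loads(raw_json)
    parsed = {}
    for key, value in spec.items():
        parts = value.split("\n")
        if len(parts) >= 2 and len(parts) % 2 == 0:
            # Heuristic: interpret alternating lines as sub-key / sub-value.
            parsed[key] = dict(zip(parts[0::2], parts[1::2]))
        else:
            # Plain value: keep it as-is.
            parsed[key] = value
    return parsed
```

The heuristic can misfire on values that merely happen to have an even number of lines, so a real solution would likely combine it with source-specific rules.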

Participants will also be provided with a set of records from our ground truth, so that they can train models; more information about the training data will be released at the launch of the challenge.

Download dataset


N.B.: Ground Truth Data download available at the end of the next section



Ground Truth Data and Challenge Instructions


We manually built ground truth data, providing the solution to popular integration tasks in a unified way. Available tasks are:

  • Entity Resolution
  • Schema Alignment
  • Knowledge Graph Augmentation
Participants can focus on a single task or attempt all of them.

We partition our ground truth data into two parts. One part is included in the download, for training or testing by the participants. The second part is used by us for evaluating submitted solutions (see section Scoring) and is not disclosed to participants.

Submitted solutions need to use the JSON format described below, although depending on the task of choice some attributes might be unnecessary, as detailed in the following instructions. Instructions for submitting your JSON file will be available soon.

We consider different classes of resources, specified as the value of the key “resource_class”; most of them are assigned a globally unique ID (“resource_id”). Available classes are:

  • source, that is, a website (e.g., www.camerashop.com)
  • json_file, that is, a json file in our dataset (e.g., 100.json), corresponding to an HTML page displaying the specification of a product
  • source_attribute, that is, a property name used in a website (e.g., battery in www.camerashop.com)
  • target_attribute, that is, a property of interest (e.g., battery)
  • provenance, that is, a key/value specification in a certain json file (e.g., battery: AAA in 100.json)
  • entity, that is, a certain product (e.g. Canon EOS 400d) that can appear in different HTML pages of our dataset.

Each class is described below, highlighting which attributes are available in the download and which attributes are left for challenge participants to complete, depending on the task of choice.

Class source


Each resource of this class represents a single source -- i.e., website -- from our dataset.
                  
{
"resource_class": "source",
"resource_id": "SOURCE#1",
"source_name": "www.camerashop.com"
}

Class json_file


Each resource of this class corresponds to a json file of our dataset, that is, a set of key/value pairs extracted from a single HTML page. Note that we omit the values, which can be retrieved from the original json file in our dataset.
                  
{
"resource_class": "json_file",
"resource_id": "JSON#1",
"source_id": "SOURCE#1",
"source_name": "www.camerashop.com",
"json_number": 100
}

Class target_attribute


Each resource of this class represents a different property of interest. Such properties can be thought of attributes in the integrated schema or predicates in the knowledge graph. All our target attributes are included in the download.
                  
{
"resource_class": "target_attribute",
"resource_id": "TARGETATTRIBUTE#1",
"target_attribute_name": "battery_type"
}

Class source_attribute


Each resource of this class represents an attribute name at the source level, which can correspond to a set of target attributes. All the source attributes are available in the download, but only some of them come with their own target_attribute_ids; completing target_attribute_ids for every source attribute is left to participants in the Schema Alignment task. Note that a source attribute can correspond to multiple target attributes, as in the example below, where the attribute battery of www.camerashop.com provides information about both the battery type (e.g., AA) and the chemistry (e.g., Li-Ion). Other source attributes, instead, can correspond to none of our target attributes (e.g., whether a product is used or new).
                  
{
"resource_class": "source_attribute",
"resource_id": "SOURCEATTRIBUTE#1",
"source_attribute_name": "battery",
"source_id": "SOURCE#1",
"source_name": "www.camerashop.com",
"target_attribute_ids": [ "TARGETATTRIBUTE#1", "TARGETATTRIBUTE#2" ]
}

Class provenance


Each resource of this class represents an attribute name from a specific json file of a specific source. All the provenance resources are included in the download.
                  
{
"resource_class": "provenance",
"resource_id": "PROVENANCE#1",
"json_id": "JSON#1",
"json_number": 100,
"source_id": "SOURCE#1",
"source_name": "www.camerashop.com",
"source_attribute_id": "SOURCEATTRIBUTE#1",
"source_attribute_name": "battery"
}

Class entity


Each resource of this class represents a real-world product that can appear in a set of json files and that can be associated with a set of target attributes (claims). Only some entities are available in the download; completing all the entities, together with their corresponding json files, is left to participants in the Entity Resolution task.
It is worth noticing that:

  • multiple json files (even in the same source) can correspond to the same entity, as in the classic “dirty Entity Resolution” setting;
  • a json file corresponds to one and only one entity.

Depending on which sources each entity appears in, target attributes can correspond to different json attributes. The entity in the example below consists of two json files; the fact that the first json file contains the battery type (that is, TARGETATTRIBUTE#1) in the “battery” attribute is represented by the resource PROVENANCE#1 in the claims. Some entities already come with their own provenance claims; completing provenance claims for every entity is left to participants in the Knowledge Graph Augmentation task.
                  
{
"resource_class": "entity",
"claims": [ { "target_attribute_id": "TARGETATTRIBUTE#1", "target_attribute_name": "battery_type", "provenances": [ "PROVENANCE#1", "PROVENANCE#2" ] } ],
"instances": [ "JSON#1", "JSON#2" ]
}
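The constraint that a json file belongs to exactly one entity can be checked mechanically on a candidate solution. A minimal sketch, assuming entity resources have been parsed into Python dicts (the function name and input shape are our own illustrative choices):

```python
def check_instance_partition(entities):
    """Return the json_file ids that appear in the 'instances' list of
    more than one entity resource, violating the rule that a json file
    corresponds to one and only one entity."""
    owner = {}      # json_id -> index of the entity that claimed it first
    conflicts = []
    for i, entity in enumerate(entities):
        for json_id in entity.get("instances", []):
            if json_id in owner and owner[json_id] != i:
                conflicts.append(json_id)
            owner[json_id] = i
    return conflicts
```

An empty result means the submitted instances form a valid partition of the json files.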

It is important to note that Schema Alignment and Knowledge Graph Augmentation are related tasks: an attribute A of a source S -- which we refer to as S.A -- corresponds to a target attribute T iff there exists an entity whose T value is included in the attribute A of a json J in S -- which we refer to as S.J.A. However, the opposite does not always hold: for instance, if S.A corresponds to T1 and T2, a certain S.J.A can be relevant to T1 only. In other words, the solution to the Knowledge Graph Augmentation task yields the solution to the Schema Alignment task (and the Entity Resolution task as well), but not vice versa.
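This projection from Knowledge Graph Augmentation output to a Schema Alignment solution can be sketched as follows; the function name and input shapes (resources parsed into Python dicts) are illustrative assumptions, not part of the official tooling:

```python
def alignment_from_claims(entities, provenances):
    """Derive a Schema Alignment solution from entity claims: a source
    attribute corresponds to a target attribute iff some entity claim
    for that target attribute cites a provenance using that source
    attribute."""
    prov_to_attr = {p["resource_id"]: p["source_attribute_id"]
                    for p in provenances}
    alignment = {}  # source_attribute_id -> set of target_attribute_ids
    for entity in entities:
        for claim in entity.get("claims", []):
            for prov_id in claim.get("provenances", []):
                source_attr = prov_to_attr[prov_id]
                alignment.setdefault(source_attr, set()).add(
                    claim["target_attribute_id"])
    return alignment
```

As noted above, the reverse direction is not possible: the alignment alone does not tell which individual S.J.A values are relevant to which target attribute.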

For any question, you can contact us at the e-mail address: di2kg@inf.uniroma3.it

N.B.: Please, download the latest available version of the ground truth data!



Submission & Scoring


We use classic precision and recall for evaluating submitted solutions. In addition, in the spirit of the workshop, we invite participants to propose their own evaluation practices in their challenge papers.
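For illustration, classic precision and recall over a predicted set of items (e.g., matching json-file pairs for Entity Resolution) against a held-out ground truth can be computed as below; this is a generic sketch, not the official evaluation script:

```python
def precision_recall(predicted, gold):
    """Compute classic precision and recall of a predicted set of
    items against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```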



Committees

Program Committee Co-Chairs and Organizers

Organizers


  • Valter Crescenzi, Roma Tre University
  • Xin Luna Dong, Amazon
  • Donatella Firmani, Roma Tre University
  • Paolo Merialdo, Roma Tre University
  • Divesh Srivastava, AT&T Labs-Research
  • Andrea De Angelis, Roma Tre University
  • Maurizio Mazzei, Roma Tre University

Program Chairs


  • Donatella Firmani, Roma Tre University
  • Divesh Srivastava, AT&T Labs-Research

Program Committee


  • Denilson Barbosa, University of Alberta
  • Valter Crescenzi, Roma Tre University
  • Xin Luna Dong, Amazon
  • Laura Haas, University of Massachusetts
  • Colin Lockard, University of Washington
  • Paolo Merialdo, Roma Tre University
  • Renée Miller, Northeastern University
  • Mourad Ouzzani, Qatar Computing Research Institute
  • Themis Palpanas, Paris Descartes University

Challenge Leaders


  • Andrea De Angelis, Roma Tre University
  • Maurizio Mazzei, Roma Tre University

Speakers

Keynote speakers

AnHai Doan - University of Wisconsin

Andrew McCallum - University of Massachusetts Amherst

Program

Half day workshop

TBD

Contacts

Contact us