2nd International Workshop on Challenges and Experiences from Data Integration to Knowledge Graphs


August 31, 2020

Held in conjunction with VLDB 2020

alaska.benchmark@gmail.com

News


Date Event
27 August 2020 The DI2KG'20 workshop program is online.
14 July 2020 Keynote speakers for the DI2KG 2020 Workshop have been announced.
13 July 2020 Congratulations to the winners of the DI2KG 2020 Challenge! The final Leaderboard is available in the dedicated section.
08 July 2020 Submissions for the DI2KG 2020 Challenge are closed. Please check the final leaderboard in the dedicated Leaderboard section. Congratulations to the finalist teams: JNU_Cyber and SimSkipReloaded.
01 July 2020 We have updated the deadlines for the Challenge. Please check the new dates in the Challenge overview section.
24 June 2020 Round 2 of the Challenge starts today! Please check the dedicated "Downloads" section to find the new dataset and labelled data.
21 June 2020 The link for paper submission is online. Please check the Call for papers section for details.
10 June 2020 We have updated the deadlines. Please check the new dates in the Call for papers section and in the Challenge overview section.
19 May 2020 The challenge leaderboard is online.
3 May 2020 The Challenge begins!

About DI2KG


The DI2KG workshop aims to drive innovative solutions for data integration and knowledge graph construction. These are complex processes involving many issues that have been studied by different communities (data management, IR, NLP, machine learning), typically in isolation. As more holistic solutions emerge, we see the need for a more cross-disciplinary community that pushes research toward the next generation of data integration and knowledge graph construction methods.

This is the second edition of the DI2KG workshop. The first edition was held in conjunction with KDD 2019 (http://di2kg.inf.uniroma3.it/2019/).

We aim for DI2KG to be a long-term venue that fosters new research based on the availability of the DI2KG benchmark, an end-to-end benchmark designed up-front to deal with the complexity of every integration task, while building a community that can contribute to the evolution of the benchmark.

To stimulate advances in this direction, we also organize the DI2KG Challenge: a set of fundamental integration tasks leading to the construction of a knowledge graph from a collection of product specifications extracted from the Web, each with its own manually curated ground truth.

We invite researchers and practitioners to participate in our benchmark-supported DI2KG Challenge and to submit a paper describing their experience with the benchmark and new insights into the strengths and weaknesses of existing integration systems.

The DI2KG Challenge comprises three main tasks:

  • Entity resolution
  • Schema matching
  • Instance-level attribute matching

Call for papers


We strongly encourage thought-provoking papers that fall under the following categories:

Topics of interest include but are not limited to the following:

Workshop proceedings will be submitted for publication to CEUR (indexed by DBLP and Scopus). A selection of best papers will be recommended for inclusion in a special issue of a high-quality international journal.

Authors can submit papers of up to 4 pages of content, plus unlimited pages for the bibliography, written in English and in PDF according to the ACM Proceedings Format. Submissions will go through a single-blind review process and will be evaluated on the basis of relevance and potential to raise discussion.


Date Event
13 July 2020 Paper submission deadline.
05 August 2020 Notification of acceptance.
18 August 2020 Camera ready.
31 August 2020 Workshop.

Challenge


  1. Challenge overview
  2. Dataset X_v
  3. Tasks description
    1. Entity resolution
    2. Schema matching
    3. Instance-level attribute matching
  4. Registration and Submission
  5. Evaluation
  6. Downloads

1. Challenge overview


The challenge concentrates on fundamental integration tasks that lead to the construction of a knowledge graph from a collection of product specifications extracted from the Web: entity resolution, schema matching and instance-level attribute matching.

Participants in the challenge can join one or more tasks. All participants are invited to submit a paper describing their solution, as well as their experience with the challenge, to the DI2KG workshop.

During the challenge, a public leaderboard will be available online, showing the Precision, Recall and F-Measure of the submitted solutions, computed on a secret evaluation dataset.

Participants are organized in tracks, depending on the technical choices of their solution. The winners of each track will be invited to present their solution at the workshop, and a paper describing their experience with the benchmark will be published in the workshop proceedings. Please check the dedicated section "Challenge - Registration and Submission" for more details about tracks, and the dedicated section "Challenge - Evaluation" for details about the procedure for selecting winners.

The core component of the challenge is our end-to-end ALASKA benchmark, which consists of:

The challenge is organized in two rounds, each round based on a vertical product domain:

Round one.

At the beginning of the challenge (May 3, 2020), participants in each task will be provided with:

  • X_MONITOR, i.e., ~16k specifications from the MONITOR vertical;
  • Y^t_MONITOR ⊂ E^t_MONITOR, i.e., a subset of the ground truth for task t.

Given a task t, participants will be asked to combine the information in X_MONITOR according to t.



Round two.

At the start of Round Two (see the Important dates below), participants will be provided with:

  • a new dataset X_v (where v ≠ MONITOR) containing specifications and attributes from a different product category;
  • the new datasets Y^t_v ⊂ E^t_v.

In this phase, participants will be asked to repeat the same operations performed during the first round, but on the new X_v dataset. Optionally, they can also continue to work on X_MONITOR at the same time.



Important dates
Date Event
3 May 2020 Round One starts: MONITOR dataset and labelled data released.
23 June 2020 Round Two starts: NEW dataset (of a different product category) and labelled data released.
07 July 2020 Result submission deadline for participants in the challenge.
09 July 2020 Notification of the reproducibility test.
13 July 2020 Paper submission deadline.
05 August 2020 Notification of acceptance.
18 August 2020 Camera ready.
31 August 2020 Workshop.

Challenge - Dataset


Participants will be provided with a set of product specifications (in short, specs) in JSON format, automatically extracted from multiple e-commerce websites.

Each specification has been extracted from a web page and refers to a real-world product. A specification consists of a list of <attribute_name, attribute_value> pairs and is stored in a file; files are organized into directories, each corresponding to a web source (e.g., www.ebay.com).

Example of specification

{
  "<page title>": "ASUS VT229H & Full Specifications at ebay.com",
  "screen size": "21.5''",
  "brand": "Asus",
  "display type": "LED",
  "dimension": "Dimensions: 19.40 x 8.00 x 11.80 inches",
  "refresh rate": "75hz",
  "general features": "Black"
}
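
As an illustration (not part of the official challenge materials), the following minimal Python sketch loads such a dataset into memory. The function name and the directory path are hypothetical; the only assumptions are the layout described above and the <source>//<file> spec id convention used in the labelled data.

import json
from pathlib import Path

def load_specs(dataset_dir):
    # Load every specification into a dict keyed by spec id.
    # Spec ids follow the <source>//<file> convention of the labelled
    # data, e.g. "www.ebay.com//1" for the file www.ebay.com/1.json.
    specs = {}
    for source_dir in sorted(Path(dataset_dir).iterdir()):
        if not source_dir.is_dir():
            continue
        for spec_file in sorted(source_dir.glob("*.json")):
            spec_id = f"{source_dir.name}//{spec_file.stem}"
            specs[spec_id] = json.loads(spec_file.read_text(encoding="utf-8"))
    return specs

# Usage (hypothetical path):
# specs = load_specs("monitor_specs")
# specs["www.ebay.com//1"]["brand"]  -> "Asus"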

Note that the dataset exhibits a high degree of heterogeneity, both across and within sources. Attribute names are sparse (only the page title is always present); there are several homonyms, i.e., attributes with the same name but different semantics (e.g., "device type" sometimes refers to "screen type", like "LCD", and other times to "screen size diagonal", like "23''"), and several synonyms, i.e., attributes with the same semantics but different names (e.g., "display diagonal" and "screen size").

Challenge - Entity resolution task


The Entity resolution task consists of identifying which specs in X_v represent the same real-world product (e.g., ASUS VT229H).

Participants in the Entity resolution task are provided with a labelled dataset in CSV format (i.e., Y^ER_v), containing three columns: "left_spec_id", "right_spec_id" and "label", where "label" is 1 if the two specs refer to the same real-world product and 0 otherwise:

Example of Y^ER_v
left_spec_id, right_spec_id, label
www.ebay.com//1, www.ebay.com//2, 1
www.ebay.com//3, catalog.com//1, 1
catalog.com//1, ca.pcpartpicker.com//1, 0

Note that there might be matching pairs even within the same web source, and that the labelled dataset Y^ER_v is transitively closed (i.e., if A matches B and B matches C, then A matches C).

Your goal is to find all pairs of product specs in the dataset X_v that match, that is, that refer to the same real-world product. Your output must be stored in a CSV file containing only the matching spec pairs found by your system. The CSV file must have two columns, "left_spec_id" and "right_spec_id"; each row consists of just two ids, separated by a comma.

Example of output CSV file
left_spec_id, right_spec_id
www.ebay.com//10, www.ebay.com//20
www.ebay.com//30, buy.net//10
..., ...
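
Since the labelled data is transitively closed, a submission should normally be transitively closed as well. The following Python sketch (hypothetical helper names, not the official tooling) expands a set of predicted matching pairs into their transitive closure with a union-find structure and writes the submission file in the two-column format above.

import csv
from collections import defaultdict
from itertools import combinations

def closure_pairs(matching_pairs):
    # Union-find over spec ids: specs connected by predicted matches
    # end up in the same cluster.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in matching_pairs:
        parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for spec_id in list(parent):
        clusters[find(spec_id)].append(spec_id)

    # Emit every pair inside each cluster, so the output is closed.
    for members in clusters.values():
        yield from combinations(sorted(members), 2)

def write_er_submission(matching_pairs, out_path="er_output.csv"):
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["left_spec_id", "right_spec_id"])
        writer.writerows(closure_pairs(matching_pairs))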

Challenge - Schema matching task


The Schema matching task consists of identifying mappings between source attributes (e.g., the attribute "brand" from the source "www.ebay.com") and a set of target attributes (e.g., "brand", "dimensions", "screen_size", etc.) defined in a given mediated schema.

Participants in the Schema matching task are provided with the mediated schema (in TXT format, one target attribute per row) and a labelled dataset in CSV format (i.e., Y^SM_v), containing two columns: "source_attribute_id" and "target_attribute_name":

Example of Y^SM_v
source_attribute_id, target_attribute_name
www.ebay.com//producer name, brand
www.ebay.com//brand, brand
www.odsi.co.uk//device type, screen_type
www.odsi.co.uk//device type, screen_size_diagonal

Note that the values of a source attribute may refer to multiple target attributes. Therefore, there might be source attributes with mappings to more than one target attribute. For instance, suppose the set of values related to the source attribute "www.odsi.co.uk//device type" is the following:

  • value1 = "LED-backlit LCD monitor - 23''"
  • value2 = "23''"
  • value3 = "LCD"

Then this source attribute is mapped to the target attributes "screen_type" (because of value1 and value3) and "screen_size_diagonal" (because of value1 and value2).

Your goal is to find mappings between source attributes in the dataset X_v and target attributes of the mediated schema. Your output must be stored in a CSV file containing all the mappings found by your system. The CSV file must have two columns, "source_attribute_id" and "target_attribute_name", separated by a comma.

A valid output file for a submission has the same format as the labelled data Y^SM_MONITOR and must contain mappings from the source attributes to the target attributes in the mediated schema given as input:


Example of output CSV file
source_attribute_id, target_attribute_name
www.catalog.com//brand, brand
www.vology.com//screen size, screen_size_diagonal
..., ...
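
As an illustration of the expected input and output (not a competitive solution), the naive Python baseline below maps each source attribute to the most similar target attribute name. The helper names and the similarity threshold are hypothetical; it only assumes the <source>//<attribute name> id convention and the mediated schema loaded as a list of target attribute names.

import csv
import difflib

def name_similarity(a, b):
    # Plain string similarity between two attribute names.
    return difflib.SequenceMatcher(None, a, b).ratio()

def match_schema(source_attribute_ids, target_attributes, threshold=0.8):
    mappings = []
    for attr_id in source_attribute_ids:
        # "www.ebay.com//producer name" -> "producer name"
        name = attr_id.split("//", 1)[1].lower()
        scored = [(name_similarity(name, t.replace("_", " ")), t)
                  for t in target_attributes]
        score, best = max(scored)
        if score >= threshold:  # arbitrary cut-off, tune on Y^SM_v
            mappings.append((attr_id, best))
    return mappings

def write_sm_submission(mappings, out_path="sm_output.csv"):
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["source_attribute_id", "target_attribute_name"])
        writer.writerows(mappings)

Note that this name-only baseline maps each source attribute to at most one target attribute; capturing one-to-many mappings like the "device type" example above requires looking at the attribute values as well.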

Challenge - Instance-level attribute matching task


The Instance-level attribute matching task consists of identifying mappings between instance attributes (e.g., the attribute "brand" from the specification "1.json" of the source "www.ebay.com") and a set of target attributes (e.g., "brand", "dimensions", "screen_size", etc.) defined in the given mediated schema (the same as in the schema matching task). It is thus a finer-grained task than schema matching.

Participants in the Instance-level attribute matching task are provided with the mediated schema (in TXT format, one target attribute per row) and a labelled dataset in CSV format (i.e., Y^ILAM_v), containing two columns: "instance_attribute_id" and "target_attribute_name":

Example of Y^ILAM_v
instance_attribute_id, target_attribute_name
www.ebay.com//1//producer name, brand
www.odsi.co.uk//1//device type, screen_type
www.odsi.co.uk//1//device type, screen_size_diagonal
www.odsi.co.uk//2//device type, screen_size_diagonal

For instance, if the value of the instance attribute "www.odsi.co.uk//1//device type" is "LED-backlit LCD monitor - 23''", then this attribute is mapped to both the screen_type and screen_size_diagonal target attributes. Instead, if the value of the instance attribute "www.odsi.co.uk//2//device type" is "23''", then it is mapped only to screen_size_diagonal.

Your goal is to find mappings from instance attributes in the dataset X_v to target attributes of the mediated schema. Your output must be stored in a CSV file containing all the mappings found by your system. The CSV file must have two columns, "instance_attribute_id" and "target_attribute_name", separated by a comma.

A valid output file for a submission has the same format as the labelled data Y^ILAM_MONITOR and must contain mappings from the instance attributes to the target attributes in the mediated schema given as input:


Example of output CSV file
instance_attribute_id, target_attribute_name
www.ebay.com//10//producer name, brand
www.odsi.co.uk//10//device type, screen_type
www.odsi.co.uk//10//device type, screen_size_diagonal
www.odsi.co.uk//20//device type, screen_size_diagonal
..., ...
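
To make the value-based nature of the task concrete, here is a hypothetical rule in Python in the spirit of the "device type" example above. The regular expressions are illustrative assumptions, not part of the benchmark.

import re

# A diagonal such as 23'' suggests screen_size_diagonal; a panel
# keyword such as LCD or LED suggests screen_type.
SIZE_RE = re.compile(r"\b\d{1,3}(\.\d+)?\s*(''|\"|inch)", re.IGNORECASE)
TYPE_RE = re.compile(r"\b(lcd|led|oled|tft)\b", re.IGNORECASE)

def map_instance_attribute(instance_attribute_id, value):
    # Return the (id, target) mappings for a single instance attribute.
    targets = []
    if TYPE_RE.search(value):
        targets.append("screen_type")
    if SIZE_RE.search(value):
        targets.append("screen_size_diagonal")
    return [(instance_attribute_id, t) for t in targets]

# map_instance_attribute("www.odsi.co.uk//1//device type",
#                        "LED-backlit LCD monitor - 23''")
# yields both screen_type and screen_size_diagonal, as in the example.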

Challenge - Registration and Submission


Every participant needs to register here. After registration, you will receive by e-mail (within 1 working day) an alphanumeric Team ID that will be used for submissions.

Please note that during the challenge you will be able to submit solutions even for tasks you are not registered in, simply by filling out the submission form.

To submit a solution for the MONITOR vertical, participants must use this submission form.

To submit a solution for the NOTEBOOK vertical, participants must use this submission form.

In every submission, participants must fill in the submission form with their Team ID.

Submissions must include only the output CSV file. Please remember that the correct format for the output CSV file depends on the task you are participating in; the formats are described in the dedicated task description sections.

Multiple submissions are allowed. The last submitted CSV file overrides the previously submitted files.

When participants want to submit a new solution, they have to specify which task and which track they are participating in. Tracks are defined according to technical details of the solution. We consider 8 tracks, each one defined by answering Yes/No to the following questions:

Note that generic external knowledge (such as pretrained embeddings and language models) is not considered domain-specific knowledge.

For example, a solution that uses BERT and a classifier falls in the YNN track. A solution that uses BERT and computes matches based on a simple cosine similarity threshold falls in the NNN track. A machine learning solution leveraging a catalog of brands falls in the YYN track. Note that NNN is still a valid track.

If you are unsure how to classify your solution, you can contact us by email (alaska.benchmark@gmail.com).

If you are participating in more than one task, or if you are implementing solutions for different tracks, please fill out a new form for each task/track you have a solution for.

Challenge - Evaluation


Submitted solutions are ranked on the basis of F-measure (the harmonic mean of precision and recall), rounded to three decimal places. Precision and recall are computed w.r.t. a hidden evaluation dataset, i.e., E^t_v - Y^t_v.

For clarity, the figures below illustrate how evaluation works for each available task.

Entity Resolution.

In the graphs, nodes represent specs and edges represent matching relationships.



Precision and recall of the submitted solution will be evaluated only on the edges in which both nodes are included in the hidden evaluation dataset, as illustrated in the figure below.
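
In other words, predicted pairs involving specs outside the hidden evaluation dataset are simply ignored. The Python sketch below reflects our understanding of this scoring; the names are hypothetical and this is not the official evaluation script.

def evaluate_er(predicted_pairs, hidden_pairs, hidden_specs):
    # Keep only predicted edges whose endpoints are both hidden-eval specs.
    norm = lambda pairs: {tuple(sorted(p)) for p in pairs}
    pred = {p for p in norm(predicted_pairs)
            if p[0] in hidden_specs and p[1] in hidden_specs}
    truth = norm(hidden_pairs)
    tp = len(pred & truth)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    # Leaderboard figures are rounded to three decimal places.
    return round(precision, 3), round(recall, 3), round(f_measure, 3)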



Schema matching.

In the bipartite graphs, nodes on the left side represent source attributes, nodes on the right side represent target attributes, and edges represent matching relationships.



Precision and recall of the submitted solution will be evaluated only on the edges from source attributes which are included in the hidden evaluation dataset to the target attributes of the mediated schema.



Instance-level attribute matching.

In the bipartite graphs, nodes on the left side represent instance attributes, nodes on the right side represent target attributes, and edges represent matching relationships.



Precision and recall of the submitted solution will be evaluated only on the edges from instance attributes which are included in the hidden evaluation dataset to the target attributes of the mediated schema.



The results of the evaluation, in terms of Precision, Recall and F-Measure, will be shown in our public leaderboard, updated twice a week.

After the challenge deadline, we will publish the final leaderboard. The top solutions of each task/track will undergo a reproducibility test. Their authors will be asked to provide a package with:

We will evaluate whether the provided information is likely to be sufficient to understand and reproduce the experiments, and verify that it is reasonable in scope and content. Note that we might also run actual reproducibility experiments, to check whether the CSV file produced by the code in the submitted package is consistent with the submitted output.

Challenge - Downloads


Round 1
Dataset X_MONITOR Specs dataset from the MONITOR vertical (3.59 MB)
Dataset Y^ER_MONITOR Labelled dataset for the Entity resolution task (5.33 MB*; fixed version 5.22 MB, 05-26-2020)
Dataset Y^SM_MONITOR Labelled dataset for the Schema matching task (7.70 kB)
Dataset Y^ILAM_MONITOR Labelled dataset for the Instance-level attribute matching task (41.2 kB)
Mediated schema Mediated schema for the Schema matching and Instance-level attribute matching tasks (1.67 kB)
Round 2
Dataset X_NOTEBOOK Specs dataset from the NOTEBOOK vertical (7.51 MB)
Dataset Y^ER_NOTEBOOK Labelled dataset for the Entity resolution task (2.98 MB)
Dataset Y^SM_NOTEBOOK Labelled dataset for the Schema matching task (8.80 kB)
Dataset Y^ILAM_NOTEBOOK Labelled dataset for the Instance-level attribute matching task (19.3 kB)
Notebook mediated schema Mediated schema for the Schema matching and Instance-level attribute matching tasks (667 bytes)

* This version of the labelled data for the Entity Resolution task contained 10 wrong specifications. Please download the fixed version.

The NOTEBOOK datasets will be available after the end of the SIGMOD 2021 Programming Contest.

Submission form for the MONITOR vertical.

Submission form for the NOTEBOOK vertical.

Committees


Program Chairs
Challenge Chairs
Workshop Organizers
Program Committee
Proceedings Chair

Speakers


Keynote speakers

Workshop Program


The DI2KG'20 workshop program is available below, and in the VLDB 2020 program page. Participants can use Zoom to join the workshop and the DI2KG Slack channel (see here to sign in to the VLDB 2020 official Slack channels) to join the workshop discussion.

The workshop program will run twice: once in time block 1, from 8AM UTC (see the VLDB 2020 Time Zone Conversion Chart), and once in time block 2, from 3PM UTC. DI2KG authors will take live QA during block 1, while invited speakers will take live QA during block 2. See the table below for summary information.

time block time (UTC) type title author list
W1_6 08:00-08:15 recorded + live QA Intermediate Training of BERT for Product Matching Ralph Peeters (University of Mannheim), Christian Bizer (University of Mannheim) and Goran Glavaš (University of Mannheim)
W1_6 08:15-08:30 recorded + live QA Fast Entity Resolution With Mock Labels and Sorted Integer Sets Mark Blacher (Friedrich Schiller University Jena), Joachim Giesen (Friedrich Schiller University Jena), Sören Laue (Friedrich Schiller University Jena), Julien Klaus (Friedrich Schiller University Jena) and Matthias Mitterreiter (Friedrich Schiller University Jena)
W1_6 08:30-09:00 recorded + without live QA Bringing Data Exchange to Knowledge Bases Renée Miller (Northeastern University)
W1_6 09:00-09:15 recorded + live QA Entity Resolution on Camera Records without Machine Learning Luca Zecchini (Università degli Studi di Modena e Reggio Emilia), Giovanni Simonini (Università degli Studi di Modena e Reggio Emilia) and Sonia Bergamaschi (Università degli Studi di Modena e Reggio Emilia)
W1_6 09:15-09:30 recorded + live QA CheetahER: A Fast Entity Resolution System for Heterogeneous Camera Data Nan Deng (Southern University of Science and Technology), Wendi Luan (Southern University of Science and Technology), Haotian Liu (Southern University of Science and Technology) and Bo Tang (Southern University of Science and Technology)
W1_6 09:30-10:00 recorded + without live QA Knowledge-graph aware language models William W. Cohen (Google)
W1_6 10:00-10:15 recorded + live QA An Extensible Block Scheme-Based Method for Entity Matching Jiawei Wang (Jinan University), Haizhou Ye (Jinan University) and Jianhui Huang (Jinan University)
W1_6 10:15-10:30 recorded + live QA Spread the good around! Information Propagation in Schema Matching and Entity Resolution for Heterogeneous Data Gabriel Campero Durand (University of Magdeburg), Anshu Daur (University of Magdeburg), Vinayak Kumar (University of Magdeburg), Shivalika Suman (University of Magdeburg), Altaf Mohammed Aftab (University of Magdeburg), Sajad Karim (University of Magdeburg), Prafulla Diwesh (University of Magdeburg), Chinmaya Hegde (University of Magdeburg), Disha Setlur (University of Magdeburg), Syed Md Ismail (University of Magdeburg), David Broneske (University of Magdeburg) and Gunter Saake (University of Magdeburg)
W1_6 10:30-11:00 recorded + without live QA The Diffbot Knowledge Graph Mike Tung (Diffbot)
W1_6 11:00-12:00 live streaming + live QA Student Panel: Towards the next generation of benchmarks for Data Integration and Knowledge Graph construction
Moderator: Donatella Firmani
Bahar Ghadiri Bashardoost (University of Toronto), Riccardo Cappuzzo (EURECOM), Daniel Obraczka (University of Leipzig)
BREAK
W2_6 15:00-15:40 live streaming + live QA Bringing Data Exchange to Knowledge Bases Renée Miller (Northeastern University)
W2_6 15:40-15:50 recorded + without live QA Intermediate Training of BERT for Product Matching Ralph Peeters (University of Mannheim), Christian Bizer (University of Mannheim) and Goran Glavaš (University of Mannheim)
W2_6 15:50-16:00 recorded + without live QA Fast Entity Resolution With Mock Labels and Sorted Integer Sets Mark Blacher (Friedrich Schiller University Jena), Joachim Giesen (Friedrich Schiller University Jena), Sören Laue (Friedrich Schiller University Jena), Julien Klaus (Friedrich Schiller University Jena) and Matthias Mitterreiter (Friedrich Schiller University Jena)
W2_6 16:00-16:40 recorded + live QA Knowledge-graph aware language models William W. Cohen (Google)
W2_6 16:40-16:50 recorded + without live QA Entity Resolution on Camera Records without Machine Learning Luca Zecchini (Università degli Studi di Modena e Reggio Emilia), Giovanni Simonini (Università degli Studi di Modena e Reggio Emilia) and Sonia Bergamaschi (Università degli Studi di Modena e Reggio Emilia)
W2_6 16:50-17:00 recorded + without live QA CheetahER: A Fast Entity Resolution System for Heterogeneous Camera Data Nan Deng (Southern University of Science and Technology), Wendi Luan (Southern University of Science and Technology), Haotian Liu (Southern University of Science and Technology) and Bo Tang (Southern University of Science and Technology)
W2_6 17:00-17:40 recorded + live QA The Diffbot Knowledge Graph Mike Tung (Diffbot)
W2_6 17:40-17:50 recorded + without live QA An Extensible Block Scheme-Based Method for Entity Matching Jiawei Wang (Jinan University), Haizhou Ye (Jinan University) and Jianhui Huang (Jinan University)
W2_6 17:50-18:00 recorded + without live QA Spread the good around! Information Propagation in Schema Matching and Entity Resolution for Heterogeneous Data Gabriel Campero Durand (University of Magdeburg), Anshu Daur (University of Magdeburg), Vinayak Kumar (University of Magdeburg), Shivalika Suman (University of Magdeburg), Altaf Mohammed Aftab (University of Magdeburg), Sajad Karim (University of Magdeburg), Prafulla Diwesh (University of Magdeburg), Chinmaya Hegde (University of Magdeburg), Disha Setlur (University of Magdeburg), Syed Md Ismail (University of Magdeburg), David Broneske (University of Magdeburg) and Gunter Saake (University of Magdeburg)
W2_6 18:00-19:00 recorded + without live QA Student Panel: Towards the next generation of benchmarks for Data Integration and Knowledge Graph construction
Moderator: Donatella Firmani
Bahar Ghadiri Bashardoost (University of Toronto), Riccardo Cappuzzo (EURECOM), Daniel Obraczka (University of Leipzig)

Proceedings