2nd International Workshop on Challenges and Experiences from Data Integration to Knowledge Graphs


August 31, 2020

Held in conjunction with VLDB 2020

alaska.benchmark@gmail.com

News


Date Event
27 August 2020 The DI2KG'20 workshop program is online.
14 July 2020 Keynote speakers for the DI2KG 2020 Workshop have been announced.
13 July 2020 Congratulations to the winners of the DI2KG 2020 Challenge! The final Leaderboard is available in the dedicated section.
08 July 2020 Submissions for the DI2KG 2020 Challenge are closed. Please check the final leaderboard in the dedicated Leaderboard section. Congratulations to the finalist teams: JNU_Cyber and SimSkipReloaded.
01 July 2020 We have updated the deadlines for the Challenge. Please check the new dates in the Challenge overview section.
24 June 2020 Round 2 of the Challenge starts today! Please check the dedicated "Downloads" section to find the new dataset and labelled data.
21 June 2020 The link for paper submission is online. Please check the Call for papers section for details.
10 June 2020 We have updated the deadlines. Please check the new dates in the Call for papers section and in the Challenge overview section.
19 May 2020 The challenge leaderboard is online.
3 May 2020 The Challenge begins!

About DI2KG


The DI2KG workshop aims to drive innovative solutions for data integration and knowledge graph construction. These are complex processes involving many issues that have been studied by different communities (data management, IR, NLP, machine learning), typically in isolation. As more holistic solutions emerge, we see the need for a more cross-disciplinary community that pushes research toward the next generation of data integration and knowledge graph construction methods.

This is the second edition of the DI2KG workshop. The first edition was held in conjunction with KDD 2019 (http://di2kg.inf.uniroma3.it/2019/).

We aim for DI2KG to be a long-term venue that fosters new research based on the availability of the DI2KG benchmark, an end-to-end benchmark designed up-front to deal with the complexity of every integration task, while building a community that can contribute to the evolution of the benchmark.

To stimulate advances in this direction, we also organize the DI2KG Challenge: a set of fundamental integration tasks leading to the construction of a knowledge graph from a collection of product specifications extracted from the Web, each with its own manually curated ground truth.

We invite researchers and practitioners to participate in our benchmark-supported DI2KG Challenge and to submit a paper describing their experience with the benchmark and new insights into the strengths and weaknesses of existing integration systems.

The DI2KG Challenge comprises three main tasks:

  • Entity resolution
  • Schema matching
  • Instance-level attribute matching

Call for papers


We strongly encourage thought-provoking papers that fall under the following categories:

Topics of interest include but are not limited to the following:

Workshop proceedings will be submitted for publication to CEUR (indexed by DBLP and Scopus). A selection of best papers will be recommended for inclusion in a special issue of a high-quality international journal.

Authors can submit papers of up to 4 pages of content, plus unlimited pages for the bibliography, written in English and in PDF according to the ACM Proceedings Format. Submissions will go through a single-blind review process and will be evaluated on the basis of relevance and potential to raise discussion.


Date Event
13 July 2020 Paper submission deadline.
05 August 2020 Notification of acceptance.
18 August 2020 Camera ready.
31 August 2020 Workshop.

Challenge


  1. Challenge overview
  2. Dataset X_v
  3. Tasks description
    1. Entity resolution
    2. Schema matching
    3. Instance-level attribute matching
  4. Registration and Submission
  5. Evaluation
  6. Downloads

1. Challenge overview


The challenge concentrates on fundamental integration tasks that lead to the construction of a knowledge graph from a collection of product specifications extracted from the Web: entity resolution, schema matching and instance-level attribute matching.

Participants in the challenge can join one or more tasks. All participants are invited to submit a paper describing their solution, as well as their experience with the challenge, to the DI2KG workshop.

During the challenge, a public leaderboard will be available online, showing the Precision, Recall and F-Measure of the submitted solutions, computed on a secret evaluation dataset.

Participants are organized in tracks, depending on the technical choices of their solution. The winners of each track will be invited to present their solution at the workshop, and a paper describing their experience with the benchmark will be published in the workshop proceedings. Please check the dedicated section "Challenge - Registration and Submission" for more details about tracks, and the dedicated section "Challenge - Evaluation" for details about the procedure for selecting winners.

The core component of the challenge is our end-to-end ALASKA benchmark, which consists of:

The challenge is organized in two rounds, each round based on a vertical product domain:

Round one.

At the beginning of the challenge (May 3, 2020), participants in each task will be provided with:

  • X_MONITOR, i.e., ~16k specifications from the MONITOR vertical;
  • Y^t_MONITOR ⊂ E^t_MONITOR, i.e., a subset of the ground truth for task t.

Given a task t, participants will be asked to combine the information in X_MONITOR according to t.



Round two.

At the start of Round Two (see the Important dates below), participants will be provided with:

  • a new dataset X_v (where v ≠ MONITOR) containing specifications and attributes from a different product category;
  • the new datasets Y^t_v ⊂ E^t_v.

In this phase, participants will be asked to repeat the same operations performed during the first round, but on the new X_v dataset. Optionally, they can also continue to work on X_MONITOR at the same time.



Important dates
Date Event
3 May 2020 Round One starts: MONITOR dataset and labelled data released.
23 June 2020 Round Two starts: NEW dataset (of a different product category) and labelled data released.
07 July 2020 Result submission deadline for participants in the challenge.
09 July 2020 Notification of the reproducibility test.
13 July 2020 Paper submission deadline.
05 August 2020 Notification of acceptance.
18 August 2020 Camera ready.
31 August 2020 Workshop.

Challenge - Dataset


Participants will be provided with a set of product specifications (in short, specs) in JSON format, automatically extracted from multiple e-commerce websites.

Each specification has been extracted from a web page and refers to a real-world product. A specification consists of a list of <attribute_name, attribute_value> pairs and is stored in a file; files are organized into directories, each corresponding to a web source (e.g., www.ebay.com).

Example of specification

{
  "<page title>": "ASUS VT229H & Full Specifications at ebay.com",
  "screen size": "21.5''",
  "brand": "Asus",
  "display type": "LED",
  "dimension": "Dimensions: 19.40 x 8.00 x 11.80 inches",
  "refresh rate": "75hz",
  "general features": "Black"
}
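
As an illustration (not part of the official challenge materials), the following minimal Python sketch loads such a dataset into memory. The function name and the directory path are hypothetical; the only assumptions are the layout described above and the <source>//<file> spec id convention used in the labelled data.

import json
from pathlib import Path

def load_specs(dataset_dir):
    # Load every specification into a dict keyed by spec id.
    # Spec ids follow the <source>//<file> convention of the labelled
    # data, e.g. "www.ebay.com//1" for the file www.ebay.com/1.json.
    specs = {}
    for source_dir in sorted(Path(dataset_dir).iterdir()):
        if not source_dir.is_dir():
            continue
        for spec_file in sorted(source_dir.glob("*.json")):
            spec_id = f"{source_dir.name}//{spec_file.stem}"
            specs[spec_id] = json.loads(spec_file.read_text(encoding="utf-8"))
    return specs

# Usage (hypothetical path):
# specs = load_specs("monitor_specs")
# specs["www.ebay.com//1"]["brand"]  -> "Asus"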

Note that the dataset exhibits a high degree of heterogeneity, both across and within sources. Attribute names are sparse (only the page title is always present); there are several homonyms, i.e., attributes with the same name but different semantics (e.g., "device type" sometimes refers to "screen type", like "LCD", and other times to "screen size diagonal", like "23''"), and several synonyms, i.e., attributes with the same semantics but different names (e.g., "display diagonal" and "screen size").

Challenge - Entity resolution task


The Entity resolution task consists of identifying which specs in X_v represent the same real-world product (e.g., ASUS VT229H).

Participants in the Entity resolution task are provided with a labelled dataset in CSV format (i.e., Y^ER_v), containing three columns: "left_spec_id", "right_spec_id" and "label", where "label" is 1 if the two specs refer to the same real-world product and 0 otherwise:

Example of Y^ER_v
left_spec_id, right_spec_id, label
www.ebay.com//1, www.ebay.com//2, 1
www.ebay.com//3, catalog.com//1, 1
catalog.com//1, ca.pcpartpicker.com//1, 0

Note that there might be matching pairs even within the same web source, and that the labelled dataset Y^ER_v is transitively closed (i.e., if A matches B and B matches C, then A matches C).

Your goal is to find all pairs of product specs in the dataset X_v that match, that is, that refer to the same real-world product. Your output must be stored in a CSV file containing only the matching spec pairs found by your system. The CSV file must have two columns, "left_spec_id" and "right_spec_id"; each row consists of just two ids, separated by a comma.

Example of output CSV file
left_spec_id, right_spec_id
www.ebay.com//10, www.ebay.com//20
www.ebay.com//30, buy.net//10
..., ...
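
Since the labelled data is transitively closed, a submission should normally be transitively closed as well. The following Python sketch (hypothetical helper names, not the official tooling) expands a set of predicted matching pairs into their transitive closure with a union-find structure and writes the submission file in the two-column format above.

import csv
from collections import defaultdict
from itertools import combinations

def closure_pairs(matching_pairs):
    # Union-find over spec ids: specs connected by predicted matches
    # end up in the same cluster.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in matching_pairs:
        parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for spec_id in list(parent):
        clusters[find(spec_id)].append(spec_id)

    # Emit every pair inside each cluster, so the output is closed.
    for members in clusters.values():
        yield from combinations(sorted(members), 2)

def write_er_submission(matching_pairs, out_path="er_output.csv"):
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["left_spec_id", "right_spec_id"])
        writer.writerows(closure_pairs(matching_pairs))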

Challenge - Schema matching task


The Schema matching task consists of identifying mappings between source attributes (e.g., the attribute "brand" from the source "www.ebay.com") and a set of target attributes (e.g., "brand", "dimensions", "screen_size", etc.) defined in a given mediated schema.

Participants in the Schema matching task are provided with the mediated schema (in TXT format, one target attribute per row) and a labelled dataset in CSV format (i.e., Y^SM_v), containing two columns: "source_attribute_id" and "target_attribute_name":

Example of Y^SM_v
source_attribute_id, target_attribute_name
www.ebay.com//producer name, brand
www.ebay.com//brand, brand
www.odsi.co.uk//device type, screen_type
www.odsi.co.uk//device type, screen_size_diagonal

Note that the values of a source attribute may refer to multiple target attributes. Therefore, there might be source attributes with mappings to more than one target attribute. For instance, suppose the set of values related to the source attribute "www.odsi.co.uk//device type" is the following:

  • value1 = "LED-backlit LCD monitor - 23''"
  • value2 = "23''"
  • value3 = "LCD"

Then this source attribute is mapped to the target attributes "screen_type" (because of value1 and value3) and "screen_size_diagonal" (because of value1 and value2).

Your goal is to find mappings between source attributes in the dataset X_v and target attributes of the mediated schema. Your output must be stored in a CSV file containing all the mappings found by your system. The CSV file must have two columns, "source_attribute_id" and "target_attribute_name", separated by a comma.

A valid output file for a submission has the same format as the labelled data Y^SM_MONITOR and must contain mappings from the source attributes to the target attributes in the mediated schema given as input:


Example of output CSV file
source_attribute_id, target_attribute_name
www.catalog.com//brand, brand
www.vology.com//screen size, screen_size_diagonal
..., ...
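
As an illustration of the expected input and output (not a competitive solution), the naive Python baseline below maps each source attribute to the most similar target attribute name. The helper names and the similarity threshold are hypothetical; it only assumes the <source>//<attribute name> id convention and the mediated schema loaded as a list of target attribute names.

import csv
import difflib

def name_similarity(a, b):
    # Plain string similarity between two attribute names.
    return difflib.SequenceMatcher(None, a, b).ratio()

def match_schema(source_attribute_ids, target_attributes, threshold=0.8):
    mappings = []
    for attr_id in source_attribute_ids:
        # "www.ebay.com//producer name" -> "producer name"
        name = attr_id.split("//", 1)[1].lower()
        scored = [(name_similarity(name, t.replace("_", " ")), t)
                  for t in target_attributes]
        score, best = max(scored)
        if score >= threshold:  # arbitrary cut-off, tune on Y^SM_v
            mappings.append((attr_id, best))
    return mappings

def write_sm_submission(mappings, out_path="sm_output.csv"):
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["source_attribute_id", "target_attribute_name"])
        writer.writerows(mappings)

Note that this name-only baseline maps each source attribute to at most one target attribute; capturing one-to-many mappings like the "device type" example above requires looking at the attribute values as well.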

Challenge - Instance-level attribute matching task


The Instance-level attribute matching task consists of identifying mappings between instance attributes (e.g., the attribute "brand" from the specification "1.json" of the source "www.ebay.com") and a set of target attributes (e.g., "brand", "dimensions", "screen_size", etc.) defined in the given mediated schema (the same as in the schema matching task). It is thus a finer-grained task than schema matching.

Participants in the Instance-level attribute matching task are provided with the mediated schema (in TXT format, one target attribute per row) and a labelled dataset in CSV format (i.e., Y^ILAM_v), containing two columns: "instance_attribute_id" and "target_attribute_name":

Example of Y^ILAM_v
instance_attribute_id, target_attribute_name
www.ebay.com//1//producer name, brand
www.odsi.co.uk//1//device type, screen_type
www.odsi.co.uk//1//device type, screen_size_diagonal
www.odsi.co.uk//2//device type, screen_size_diagonal

For instance, if the value of the instance attribute "www.odsi.co.uk//1//device type" is "LED-backlit LCD monitor - 23''", then this attribute is mapped to both the screen_type and screen_size_diagonal target attributes. Instead, if the value of the instance attribute "www.odsi.co.uk//2//device type" is "23''", then it is mapped only to screen_size_diagonal.

Your goal is to find mappings from instance attributes in the dataset X_v to target attributes of the mediated schema. Your output must be stored in a CSV file containing all the mappings found by your system. The CSV file must have two columns, "instance_attribute_id" and "target_attribute_name", separated by a comma.

A valid output file for a submission has the same format as the labelled data Y^ILAM_MONITOR and must contain mappings from the instance attributes to the target attributes in the mediated schema given as input:


Example of output CSV file
instance_attribute_id, target_attribute_name
www.ebay.com//10//producer name, brand
www.odsi.co.uk//10//device type, screen_type
www.odsi.co.uk//10//device type, screen_size_diagonal
www.odsi.co.uk//20//device type, screen_size_diagonal
..., ...
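
To make the value-based nature of the task concrete, here is a hypothetical rule in Python in the spirit of the "device type" example above. The regular expressions are illustrative assumptions, not part of the benchmark.

import re

# A diagonal such as 23'' suggests screen_size_diagonal; a panel
# keyword such as LCD or LED suggests screen_type.
SIZE_RE = re.compile(r"\b\d{1,3}(\.\d+)?\s*(''|\"|inch)", re.IGNORECASE)
TYPE_RE = re.compile(r"\b(lcd|led|oled|tft)\b", re.IGNORECASE)

def map_instance_attribute(instance_attribute_id, value):
    # Return the (id, target) mappings for a single instance attribute.
    targets = []
    if TYPE_RE.search(value):
        targets.append("screen_type")
    if SIZE_RE.search(value):
        targets.append("screen_size_diagonal")
    return [(instance_attribute_id, t) for t in targets]

# map_instance_attribute("www.odsi.co.uk//1//device type",
#                        "LED-backlit LCD monitor - 23''")
# yields both screen_type and screen_size_diagonal, as in the example.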

Challenge - Registration and Submission


Every participant needs to register here. After registration, you will receive by e-mail (within 1 working day) an alphanumeric Team ID that will be used for submissions.

Please note that during the challenge you will be able to submit solutions even for tasks you are not registered in, simply by filling out the submission form.

To submit a solution for the MONITOR vertical, participants must use this submission form.

To submit a solution for the NOTEBOOK vertical, participants must use this submission form.

In every submission, participants must fill in the submission form with their Team ID.

Submissions must include only the output CSV file. Please remember that the correct format for the output CSV file depends on the task you are participating in; the formats are described in the dedicated task description sections.

Multiple submissions are allowed. The last submitted CSV file overrides the previously submitted files.

When participants want to submit a new solution, they have to specify which task and which track they are participating in. Tracks are defined according to technical details of the solution. We consider 8 tracks, each one defined by answering Yes/No to the following questions:

Note that generic external knowledge (such as pretrained embeddings and language models) is not considered domain-specific knowledge.

For example, a solution that uses BERT and a classifier falls in the YNN track. A solution that uses BERT and computes matches based on a simple cosine similarity threshold falls in the NNN track. A machine learning solution leveraging a catalog of brands falls in the YYN track. Note that NNN is still a valid track.

If you are unsure how to classify your solution, you can contact us by email (alaska.benchmark@gmail.com).

If you are participating in more than one task, or if you are implementing solutions for different tracks, please fill out a new form for each task/track you have a solution for.

Challenge - Evaluation


Submitted solutions are ranked on the basis of F-measure (the harmonic mean of precision and recall), rounded to three decimal places. Precision and recall are computed w.r.t. a hidden evaluation dataset, i.e., E^t_v - Y^t_v.

For clarity, the figures below illustrate how evaluation works for each available task.

Entity Resolution.

In the graphs, nodes represent specs and edges represent matching relationships.



Precision and recall of the submitted solution will be evaluated only on the edges in which both nodes are included in the hidden evaluation dataset, as illustrated in the figure below.
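
In other words, predicted pairs involving specs outside the hidden evaluation dataset are simply ignored. The Python sketch below reflects our understanding of this scoring; the names are hypothetical and this is not the official evaluation script.

def evaluate_er(predicted_pairs, hidden_pairs, hidden_specs):
    # Keep only predicted edges whose endpoints are both hidden-eval specs.
    norm = lambda pairs: {tuple(sorted(p)) for p in pairs}
    pred = {p for p in norm(predicted_pairs)
            if p[0] in hidden_specs and p[1] in hidden_specs}
    truth = norm(hidden_pairs)
    tp = len(pred & truth)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    # Leaderboard figures are rounded to three decimal places.
    return round(precision, 3), round(recall, 3), round(f_measure, 3)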



Schema matching.

In the bipartite graphs, nodes on the left side represent source attributes, nodes on the right side represent target attributes, and edges represent matching relationships.



Precision and recall of the submitted solution will be evaluated only on the edges from source attributes which are included in the hidden evaluation dataset to the target attributes of the mediated schema.



Instance-level attribute matching.

In the bipartite graphs, nodes on the left side represent instance attributes, nodes on the right side represent target attributes, and edges represent matching relationships.



Precision and recall of the submitted solution will be evaluated only on the edges from instance attributes which are included in the hidden evaluation dataset to the target attributes of the mediated schema.



The results of the evaluation, in terms of Precision, Recall and F-Measure, will be shown in our public leaderboard, updated twice a week.

After the challenge deadline, we will publish the final leaderboard. The top solutions of each task/track will undergo a reproducibility test. Their authors will be asked to provide a package with:

We will evaluate whether the provided information is likely to be sufficient to understand and reproduce the experiments, and verify that it is reasonable in scope and content. Note that we might also run actual reproducibility experiments, to check whether the CSV file produced by the code in the submitted package is consistent with the submitted output.

Challenge - Downloads


Round 1
Dataset X_MONITOR Specs dataset from the MONITOR vertical (3.59 MB)
Dataset Y^ER_MONITOR Labelled dataset for the Entity resolution task (5.33 MB*; fixed version 5.22 MB, 05-26-2020)
Dataset Y^SM_MONITOR Labelled dataset for the Schema matching task (7.70 kB)
Dataset Y^ILAM_MONITOR Labelled dataset for the Instance-level attribute matching task (41.2 kB)
Mediated schema Mediated schema for the Schema matching and Instance-level attribute matching tasks (1.67 kB)
Round 2
Dataset X_NOTEBOOK Specs dataset from the NOTEBOOK vertical (7.51 MB)
Dataset Y^ER_NOTEBOOK Labelled dataset for the Entity resolution task (2.98 MB)
Dataset Y^SM_NOTEBOOK Labelled dataset for the Schema matching task (8.80 kB)
Dataset Y^ILAM_NOTEBOOK Labelled dataset for the Instance-level attribute matching task (19.3 kB)
Notebook mediated schema Mediated schema for the Schema matching and Instance-level attribute matching tasks (667 bytes)

* This version of the labelled data for the Entity Resolution task contained 10 wrong specifications. Please download the fixed version.

The NOTEBOOK datasets will be available after the end of the SIGMOD 2021 Programming Contest.

Submission form for the MONITOR vertical.

Submission form for the NOTEBOOK vertical.

Committees


Program Chairs
Challenge Chairs
Workshop Organizers
Program Committee
Proceedings Chair

Speakers


Keynote speakers

Workshop Program


The DI2KG'20 workshop program is available below, and in the VLDB 2020 program page. Participants can use Zoom to join the workshop and the DI2KG Slack channel (see here to sign in to the VLDB 2020 official Slack channels) to join the workshop discussion.

The workshop program will run twice: once in time block 1, from 8AM UTC (see the VLDB 2020 Time Zone Conversion Chart), and once in time block 2, from 3PM UTC. DI2KG authors will take live QA during block 1, while invited speakers will take live QA during block 2. See the table below for summary information.

time block time (UTC) type title author list
W1_6 08:00-08:15 recorded + live QA Intermediate Training of BERT for Product Matching Ralph Peeters (University of Mannheim), Christian Bizer (University of Mannheim) and Goran Glavaš (University of Mannheim)
W1_6 08:15-08:30 recorded + live QA Fast Entity Resolution With Mock Labels and Sorted Integer Sets Mark Blacher (Friedrich Schiller University Jena), Joachim Giesen (Friedrich Schiller University Jena), Sören Laue (Friedrich Schiller University Jena), Julien Klaus (Friedrich Schiller University Jena) and Matthias Mitterreiter (Friedrich Schiller University Jena)
W1_6 08:30-09:00 recorded + without live QA Bringing Data Exchange to Knowledge Bases Renée Miller (Northeastern University)
W1_6 09:00-09:15 recorded + live QA Entity Resolution on Camera Records without Machine Learning Luca Zecchini (Università degli Studi di Modena e Reggio Emilia), Giovanni Simonini (Università degli Studi di Modena e Reggio Emilia) and Sonia Bergamaschi (Università degli Studi di Modena e Reggio Emilia)
W1_6 09:15-09:30 recorded + live QA CheetahER: A Fast Entity Resolution System for Heterogeneous Camera Data Nan Deng (Southern University of Science and Technology), Wendi Luan (Southern University of Science and Technology), Haotian Liu (Southern University of Science and Technology) and Bo Tang (Southern University of Science and Technology)
W1_6 09:30-10:00 recorded + without live QA Knowledge-graph aware language models William W. Cohen (Google)
W1_6 10:00-10:15 recorded + live QA An Extensible Block Scheme-Based Method for Entity Matching Jiawei Wang (Jinan University), Haizhou Ye (Jinan University) and Jianhui Huang (Jinan University)
W1_6 10:15-10:30 recorded + live QA Spread the good around! Information Propagation in Schema Matching and Entity Resolution for Heterogeneous Data Gabriel Campero Durand (University of Magdeburg), Anshu Daur (University of Magdeburg), Vinayak Kumar (University of Magdeburg), Shivalika Suman (University of Magdeburg), Altaf Mohammed Aftab (University of Magdeburg), Sajad Karim (University of Magdeburg), Prafulla Diwesh (University of Magdeburg), Chinmaya Hegde (University of Magdeburg), Disha Setlur (University of Magdeburg), Syed Md Ismail (University of Magdeburg), David Broneske (University of Magdeburg) and Gunter Saake (University of Magdeburg)
W1_6 10:30-11:00 recorded + without live QA The Diffbot Knowledge Graph Mike Tung (Diffbot)
W1_6 11:00-12:00 live streaming + live QA Student Panel: Towards the next generation of benchmarks for Data Integration and Knowledge Graph construction
Moderator: Donatella Firmani
Bahar Ghadiri Bashardoost (University of Toronto), Riccardo Cappuzzo (EURECOM), Daniel Obraczka (University of Leipzig)
BREAK
W2_6 15:00-15:40 live streaming + live QA Bringing Data Exchange to Knowledge Bases Renée Miller (Northeastern University)
W2_6 15:40-15:50 recorded + without live QA Intermediate Training of BERT for Product Matching Ralph Peeters (University of Mannheim), Christian Bizer (University of Mannheim) and Goran Glavaš (University of Mannheim)
W2_6 15:50-16:00 recorded + without live QA Fast Entity Resolution With Mock Labels and Sorted Integer Sets Mark Blacher (Friedrich Schiller University Jena), Joachim Giesen (Friedrich Schiller University Jena), Sören Laue (Friedrich Schiller University Jena), Julien Klaus (Friedrich Schiller University Jena) and Matthias Mitterreiter (Friedrich Schiller University Jena)
W2_6 16:00-16:40 recorded + live QA Knowledge-graph aware language models William W. Cohen (Google)
W2_6 16:40-16:50 recorded + without live QA Entity Resolution on Camera Records without Machine Learning Luca Zecchini (Università degli Studi di Modena e Reggio Emilia), Giovanni Simonini (Università degli Studi di Modena e Reggio Emilia) and Sonia Bergamaschi (Università degli Studi di Modena e Reggio Emilia)
W2_6 16:50-17:00 recorded + without live QA CheetahER: A Fast Entity Resolution System for Heterogeneous Camera Data Nan Deng (Southern University of Science and Technology), Wendi Luan (Southern University of Science and Technology), Haotian Liu (Southern University of Science and Technology) and Bo Tang (Southern University of Science and Technology)
W2_6 17:00-17:40 recorded + live QA The Diffbot Knowledge Graph Mike Tung (Diffbot)
W2_6 17:40-17:50 recorded + without live QA An Extensible Block Scheme-Based Method for Entity Matching Jiawei Wang (Jinan University), Haizhou Ye (Jinan University) and Jianhui Huang (Jinan University)
W2_6 17:50-18:00 recorded + without live QA Spread the good around! Information Propagation in Schema Matching and Entity Resolution for Heterogeneous Data Gabriel Campero Durand (University of Magdeburg), Anshu Daur (University of Magdeburg), Vinayak Kumar (University of Magdeburg), Shivalika Suman (University of Magdeburg), Altaf Mohammed Aftab (University of Magdeburg), Sajad Karim (University of Magdeburg), Prafulla Diwesh (University of Magdeburg), Chinmaya Hegde (University of Magdeburg), Disha Setlur (University of Magdeburg), Syed Md Ismail (University of Magdeburg), David Broneske (University of Magdeburg) and Gunter Saake (University of Magdeburg)
W2_6 18:00-19:00 recorded + without live QA Student Panel: Towards the next generation of benchmarks for Data Integration and Knowledge Graph construction
Moderator: Donatella Firmani
Bahar Ghadiri Bashardoost (University of Toronto), Riccardo Cappuzzo (EURECOM), Daniel Obraczka (University of Leipzig)

Proceedings