|14 July 2020 New||Keynote speakers for the DI2KG 2020 Workshop have been announced.|
|13 July 2020 New||Congratulations to the winners of the DI2KG 2020 Challenge! The final Leaderboard is available in the dedicated section.|
|08 July 2020||Submissions for the DI2KG 2020 Challenge are closed. Please check the final leaderboard in the dedicated Leaderboard section. Congratulations to the finalist teams: JNU_Cyber and SimSkipReloaded.|
|01 July 2020||We have updated the deadlines for the Challenge. Please check the new dates in the Challenge overview section.|
|24 June 2020||Round 2 of the Challenge starts today! Please check the dedicated "Downloads" section to find the new dataset and labelled data.|
|21 June 2020||Link for the paper submission is online. Please check the Call for paper section for details.|
|10 June 2020||We have updated the deadlines. Please check the new dates in the Call for paper section and in the Challenge overview section.|
|19 May 2020||The challenge leaderboard is online.|
|3 May 2020||Challenge begins!|
The DI2KG workshop has the goal of driving innovative solutions for data integration and knowledge graph construction: these are complex processes involving many issues that have been studied by different communities (data management, IR, NLP, machine learning), typically in isolation. As more holistic solutions emerge, we argue for a more cross-disciplinary community that pushes research toward the next generation of data integration and knowledge graph construction methods.
This is the second edition of the DI2KG workshop. The first edition was held in conjunction with KDD 2019 (http://di2kg.inf.uniroma3.it/2019/).
We aim for DI2KG to be a long-term venue that fosters new research based on the availability of the DI2KG benchmark, an end-to-end benchmark designed up-front for dealing with the complexity of every integration task, while building a community that can contribute to the evolution of the benchmark.
To stimulate advances in this direction, we also organize the DI2KG Challenge: a set of fundamental integration tasks leading to the construction of a knowledge graph from a collection of product specifications extracted from the Web, each with its own manually curated ground truth.
We invite researchers and practitioners to participate in our benchmark-supported DI2KG Challenge and submit a paper describing their experience with the benchmark and new insights into the strengths and weaknesses of existing integration systems.
The DI2KG Challenge comprises three main tasks:
We strongly encourage thought-provoking papers that fall under the following categories:
Workshop proceedings will be submitted for publication to CEUR (indexed by DBLP and Scopus). A selection of the best papers will be recommended for inclusion in a special issue of a high-quality international journal.
Authors can submit papers up to 4 pages of content plus unlimited pages for bibliography, written in English, and in PDF according to the ACM Proceedings Format. Submissions will go through a single-blind review process, and will be evaluated on the basis of relevance and potential to raise discussion.
|13 July 2020||Paper submission deadline.|
|05 August 2020||Notification of acceptance.|
18 August 2020
|31 August 2020||Workshop.|
The challenge concentrates on some fundamental integration tasks that lead to the construction of a knowledge graph from a collection of product specifications extracted from the Web:
Participants in the challenge can join one or more tasks. All participants are invited to submit a paper describing their solution, as well as their experience with the challenge, to the DI2KG workshop.
During the challenge a public leaderboard will be available online, showing Precision, Recall and F-Measure of the submitted solutions, calculated on a secret evaluation dataset.
Participants are organized in tracks, depending on the technical choices of their solution. The winners of each track will be invited to present their solution at the workshop, and a paper describing their experience with the benchmark will be published in the workshop proceedings. Please check the dedicated section "Challenge - Registration and Submission" for more details about tracks, and the dedicated section "Challenge - Evaluation" for details about the procedure for selecting winners.
The core component of the challenge is our end-to-end ALASKA benchmark, which consists of:
The challenge is organized in two rounds, each round based on a vertical product domain:
At the beginning of the challenge (May 3, 2020), participants in each task will be provided with:
Given a task t, participants will be asked to combine the information in XMONITOR according to t.
On June 16, participants will be provided with:
In this phase, participants will be asked to repeat the same operations performed during the first round, but on the new Xv dataset. Optionally, they can also continue to work on XMONITOR at the same time.
|3 May 2020||Round One starts: MONITOR dataset and labelled data released.|
|23 June 2020||Round Two starts: NEW dataset (of a different product category) and labelled data released.|
|07 July 2020||Result submission deadline for participants in the challenge.|
|09 July 2020||Notification of the reproducibility test.|
|13 July 2020||Paper submission deadline.|
|05 August 2020||Notification of acceptance.|
18 August 2020
|31 August 2020||Workshop.|
Participants will be provided with a set of product specifications (in short, specs) in JSON format, automatically extracted from multiple e-commerce websites.
Each specification has been extracted from a web page and refers to a real-world product. A specification consists of a list of <attribute_name, attribute_value> pairs and is stored in a file; files are organized into directories, and each directory corresponds to a web source (e.g., www.ebay.com).

Example of specification:

{
  "<page title>": "ASUS VT229H & Full Specifications at ebay.com",
  "screen size": "21.5''",
  "display type": "LED",
  "dimension": "Dimensions: 19.40 x 8.00 x 11.80 inches",
  "refresh rate": "75hz",
  "general features": "Black"
}
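As a sketch of how such a dataset could be read (the function name and directory layout conventions here are our own, not part of the challenge kit), one could load every spec into a dict keyed by its spec id, following the `<source>//<file>` id convention used in the labelled datasets:

```python
import json
from pathlib import Path

def load_specs(dataset_dir):
    """Load all specs into a dict: spec_id -> attribute dict.

    Assumes one directory per web source and one JSON file per spec,
    so spec ids look like "www.ebay.com//1".
    """
    specs = {}
    for source_dir in Path(dataset_dir).iterdir():
        if not source_dir.is_dir():
            continue
        for spec_file in source_dir.glob("*.json"):
            spec_id = f"{source_dir.name}//{spec_file.stem}"
            with open(spec_file, encoding="utf-8") as f:
                specs[spec_id] = json.load(f)
    return specs
```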
Note that the dataset exhibits a high degree of heterogeneity both across and within sources. Attribute names are sparse (only the page title is always present), there are several homonyms — i.e., attributes with the same name but different semantics (e.g., "device type" sometimes refers to "screen type", like "LCD", and at other times to "screen size diagonal", like "23''") — and several synonyms — i.e., attributes with the same semantics but different names (e.g., "display diagonal" and "screen size").
The Entity resolution task consists in identifying which specs of Xv represent the same real-world product (e.g. ASUS VT229H).
Participants in the Entity resolution task are provided with a labelled dataset in CSV format (i.e., YERv), containing three columns: "left_spec_id", "right_spec_id" and "label":

left_spec_id, right_spec_id, label
www.ebay.com//1, www.ebay.com//2, 1
www.ebay.com//3, catalog.com//1, 1
catalog.com//1, ca.pcpartpicker.com//1, 0
Note that there might be matching pairs even within the same web source, and that the labelled dataset YERv is transitively closed (i.e., if A matches with B and B matches with C, then A matches with C).
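Because the positive labels are transitively closed, they can be grouped into clusters of matching specs with a union-find pass; a minimal sketch (function names are ours, not part of the challenge kit):

```python
def cluster_matches(labelled_rows):
    """Group spec ids into clusters from (left_id, right_id, label) rows.

    Positive pairs (label == 1) are merged with union-find, which also
    recovers any matches implied by transitivity.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for left, right, label in labelled_rows:
        if int(label) == 1:
            parent[find(left)] = find(right)

    clusters = {}
    for spec_id in list(parent):
        clusters.setdefault(find(spec_id), set()).add(spec_id)
    return list(clusters.values())
```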
Your goal is to find all pairs of product specs in the dataset Xv that match, that is, refer to the same real-world product. Your output must be stored in a CSV file containing only the matching spec pairs found by your system. The CSV file must have two columns, "left_spec_id" and "right_spec_id"; each row consists of just two ids, separated by a comma.

Example of output CSV file:

left_spec_id, right_spec_id
www.ebay.com//10, www.ebay.com//20
www.ebay.com//30, buy.net//10
..., ...
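If a system produces clusters of matching specs, the expected output is every unordered pair inside each cluster; a sketch of writing that in the required two-column format (function name is ours):

```python
import csv
from itertools import combinations

def write_matching_pairs(clusters, out_path):
    """Write all intra-cluster spec-id pairs in the required CSV format."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["left_spec_id", "right_spec_id"])
        for cluster in clusters:
            # every unordered pair within a cluster is a match
            for left, right in combinations(sorted(cluster), 2):
                writer.writerow([left, right])
```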
The Schema matching task consists in identifying mappings between source attributes (e.g. the attribute "brand" from source "www.ebay.com") and a set of target attributes (e.g. "brand", "dimensions", "screen_size", etc.) defined in a given mediated schema.
Participants in the Schema matching task are provided with the mediated schema (in TXT format, one target attribute per row) and a labelled dataset in CSV format (i.e., YSMv), containing two columns: "source_attribute_id" and "target_attribute_name":

source_attribute_id, target_attribute_name
www.ebay.com//producer name, brand
www.ebay.com//brand, brand
www.odsi.co.uk//device type, screen_type
www.odsi.co.uk//device type, screen_size_diagonal
Note that some source attributes have values that refer to multiple target attributes; therefore, a source attribute may be mapped to more than one target attribute. For instance, if the set of values related to the source attribute "www.odsi.co.uk//device type" is the following:
Your goal is to find mappings between source attributes in the dataset Xv and target attributes of the mediated schema. Your output must be stored in a CSV file containing all the mappings found by your system. The CSV file must have two columns: "source_attribute_id" and "target_attribute_name", separated by comma.
The output file valid for a submission must have the same format as the labelled data YSMMONITOR and must contain mappings from the source attributes to the target attributes in the mediated schema given as input:

source_attribute_id, target_attribute_name
www.catalog.com//brand, brand
www.vology.com//screen size, screen_size_diagonal
..., ...
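As a trivial illustrative baseline (not an official one, and the function name is ours), one could map each source attribute whose normalized name coincides with a target attribute name in the mediated schema:

```python
def exact_name_mappings(source_attribute_ids, target_attributes):
    """Map source attributes to target attributes by normalized name equality.

    source_attribute_ids: ids like "www.catalog.com//brand"
    target_attributes: names from the mediated schema, e.g. {"brand", ...}
    """
    def normalize(name):
        return name.strip().lower().replace(" ", "_")

    targets = {normalize(t): t for t in target_attributes}
    mappings = []
    for attr_id in source_attribute_ids:
        # spec ids use "//" to separate the source from the attribute name
        _, _, attr_name = attr_id.partition("//")
        target = targets.get(normalize(attr_name))
        if target is not None:
            mappings.append((attr_id, target))
    return mappings
```

A real solution would of course need to handle synonyms and homonyms, which exact name matching cannot.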
The Instance-level attribute matching task consists in identifying mappings between instance attributes (e.g., the attribute "brand" from the specification "1.json" of the source "www.ebay.com") and a set of target attributes (e.g., "brand", "dimensions", "screen_size", etc.) defined in the given mediated schema (the same one used in the schema matching task). Thus, it is a finer-grained task than schema matching.
Participants in the Instance-level attribute matching task are provided with the mediated schema (in TXT format, one target attribute per row) and a labelled dataset in CSV format (i.e., YILAMv), containing two columns: "instance_attribute_id" and "target_attribute_name":

instance_attribute_id, target_attribute_name
www.ebay.com//1//producer name, brand
www.odsi.co.uk//1//device type, screen_type
www.odsi.co.uk//1//device type, screen_size_diagonal
www.odsi.co.uk//2//device type, screen_size_diagonal
For instance, if the value related to the instance attribute "www.odsi.co.uk//1//device type" is "LED-backlit LCD monitor - 23''", then this attribute is mapped with both screen_type and screen_size_diagonal target attributes. Instead, if the value related to the instance attribute "www.odsi.co.uk//2//device type" is "23''", then this attribute is mapped only with screen_size_diagonal.
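The rule in this example could be sketched as a simple value-based classifier; this is purely illustrative (the function name and the regular expressions are ours, not part of the challenge):

```python
import re

def map_device_type_value(value):
    """Map an instance attribute's value to target attributes by content.

    Illustrates the example above: a diagonal like 23'' implies
    screen_size_diagonal, a display technology implies screen_type.
    """
    targets = set()
    if re.search(r"\d+(\.\d+)?\s*''", value):        # e.g. 23''
        targets.add("screen_size_diagonal")
    if re.search(r"LCD|LED|monitor", value, re.I):   # e.g. LED-backlit LCD
        targets.add("screen_type")
    return targets
```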
Your goal is to find mappings from instance attributes in the dataset Xv to the target attributes of the mediated schema. Your output must be stored in a CSV file containing all the mappings found by your system. The CSV file must have two columns: "instance_attribute_id" and "target_attribute_name", separated by comma.
The output file valid for a submission must have the same format as the labelled data YILAMMONITOR and must contain mappings from the instance attributes to the target attributes in the mediated schema given as input:

instance_attribute_id, target_attribute_name
www.ebay.com//10//producer name, brand
www.odsi.co.uk//10//device type, screen_type
www.odsi.co.uk//10//device type, screen_size_diagonal
www.odsi.co.uk//20//device type, screen_size_diagonal
..., ...
Every participant needs to register here. After registering, you will receive by e-mail (within 1 working day) an alphanumeric Team ID that will be used for submissions.
Please note that during the challenge you will be able to submit solutions even for tasks you are not registered in, simply by filling out the submission form.
To submit a solution for the MONITOR vertical, participants must use this submission form.
To submit a solution for the NOTEBOOK vertical, participants must use this submission form.
In every submission, participants must include their Team ID in the submission form.
Submissions must include only the output CSV file. Please remember that the correct format of the output CSV file depends on the task you are participating in; the formats are described in the dedicated task description sections.
Multiple submissions are allowed. The last submitted CSV file will override the previous submitted files.
When participants want to submit a new solution, they have to specify which task and which track they are participating in. Tracks are defined according to the technical details of the solution. We consider 8 tracks, each defined by a Yes/No answer to the following questions:
Note that generic external knowledge (such as pretrained embeddings and language models) is not considered domain-specific knowledge.
For example, a solution that uses BERT and a classifier falls in the YNN track. A solution that uses BERT and computes matches based on a simple cosine similarity threshold falls in the NNN track. A machine learning solution leveraging a catalog of brands falls in the YYN track. Note that NNN is still a valid track.
If you are unsure on how to classify your solution, you can contact us by email (firstname.lastname@example.org).
If you are participating in more than one task, or implementing solutions for different tracks, please fill out a new form for each task/track you have a solution for.
Submitted solutions are ranked on the basis of F-measure (the harmonic mean of precision and recall), rounded to three decimal places. Precision and recall are computed w.r.t. a hidden evaluation dataset, i.e., Etv - Ytv.
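Treating both the submitted solution and the ground truth as sets of pairs, these metrics can be sketched as follows (the official evaluation additionally restricts the pairs to the hidden dataset, as described in the task-specific notes):

```python
def evaluate(predicted_pairs, gold_pairs):
    """Precision, recall, and F-measure over sets of unordered id pairs."""
    predicted = {frozenset(p) for p in predicted_pairs}
    gold = {frozenset(p) for p in gold_pairs}
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return round(precision, 3), round(recall, 3), round(f_measure, 3)
```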
For clarity, the figures below illustrate how evaluation works for each available task.
In the graphs, nodes represent specs and edges represent matching relationships.
Precision and recall of the submitted solution will be evaluated only on the edges in which both nodes are included in the hidden evaluation dataset, as illustrated in the figure below.
In the bipartite graphs, nodes on the left side represent source attributes, nodes on the right side represent target attributes, and edges represent matching relationships.
Precision and recall of the submitted solution will be evaluated only on the edges from source attributes which are included in the hidden evaluation dataset to the target attributes of the mediated schema.
In the bipartite graphs, nodes on the left side represent instance attributes, nodes on the right side represent target attributes, and edges represent matching relationships.
Precision and recall of the submitted solution will be evaluated only on the edges from instance attributes which are included in the hidden evaluation dataset to the target attributes of the mediated schema.
The results of the evaluation, in terms of Precision, Recall and F-Measure, will be shown in our public leaderboard, updated twice a week.
After the challenge deadline, we will publish the final leaderboard. The top solutions of each task/track will be submitted to a reproducibility test. Their authors will be asked to provide a package with:
|Dataset XMONITOR||Specs Dataset from the monitor vertical||3.59 MB|
|Dataset YERMONITOR||Labelled Dataset for the Entity Resolution task||5.22 MB (05-26-2020 fix) New|
|Dataset YSMMONITOR||Labelled Dataset for the Schema Matching task||7.70 KB|
|Dataset YILAMMONITOR||Labelled Dataset for the Instance-level attribute matching task||41.2 KB|
|Mediated schema||Mediated schema for the Schema matching and Instance-level attribute matching tasks||1.67 KB|
|Dataset XNOTEBOOK||Specs Dataset from the notebook vertical||7.51 MB|
|Dataset YERNOTEBOOK||Labelled Dataset for the Entity Resolution task||2.98 MB|
|Dataset YSMNOTEBOOK||Labelled Dataset for the Schema Matching task||8.80 KB|
|Dataset YILAMNOTEBOOK||Labelled Dataset for the Instance-level attribute matching task||19.3 KB|
|Notebook Mediated schema||Mediated schema for the Schema matching and Instance-level attribute matching tasks||667 bytes|
* This version of the labelled data for the Entity Resolution task contained 10 wrong specifications. Please download the fixed version.
Submission form for the MONITOR vertical.
Submission form for the NOTEBOOK vertical.