DI2KG Benchmark data sets downloads


The DI2KG benchmark is an end-to-end benchmark designed up-front for dealing with the complexity of every integration task.

DI2KG provides datasets of product specifications extracted from multiple eCommerce websites.

Dataset (specifications) | Entity Resolution (Record Linkage) GT | Schema Matching GT | Instance-level attribute matching GT
CAMERA | Download | Download | Download
MONITOR | Download | Download | Download

DI2KG Benchmark data format


Each dataset consists of a set of product specifications (in short, specs) in JSON format, automatically extracted from multiple e-commerce websites.

Each specification has been extracted from a web page and refers to a real-world product. A specification consists of a list of (attribute name, attribute value) pairs and is stored in a file. Files are organized into directories, and each directory corresponds to a web source (e.g., www.ebay.com).

Example of specification
                  
{
"<page title>": "ASUS VT229H & Full Specifications at ebay.com",
"screen size": "21.5''",
"brand": "Asus",
"display type": "LED",
"dimension": "Dimensions: 19.40 x 8.00 x 11.80 inches",
"refresh rate": "75hz",
"general features": "Black"
}

Note that the dataset exhibits a high degree of heterogeneity both across and within sources. Attribute names are sparse (only the page title is always present). There are several homonyms, i.e., attributes with the same name but different semantics (e.g., "device type" sometimes refers to "screen type", like "LCD", and other times to "screen size diagonal", like "23''"), and several synonyms, i.e., attributes with the same semantics but different names (e.g., "display diagonal" and "screen size").
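The directory layout and the sparsity of attribute names can be explored with a short script. The following is a minimal sketch, not part of the benchmark itself: the function names `load_specs` and `attribute_frequencies` are hypothetical, and the spec identifier format (`<source directory>//<file name without extension>`) is an assumption inferred from the ground truth examples on this page.

```python
import json
from collections import Counter
from pathlib import Path


def load_specs(root):
    """Return {spec_id: spec_dict} for every JSON file under root/<source>/.

    Assumption: a spec id is "<source directory>//<file name without .json>",
    e.g. "www.ebay.com//2" for the file www.ebay.com/2.json.
    """
    specs = {}
    for path in sorted(Path(root).glob("*/*.json")):
        source = path.parent.name
        spec_id = f"{source}//{path.stem}"
        with path.open(encoding="utf-8") as f:
            specs[spec_id] = json.load(f)
    return specs


def attribute_frequencies(specs):
    """Count in how many specs each attribute name appears."""
    counts = Counter()
    for spec in specs.values():
        counts.update(spec.keys())
    return counts
```

Calling `attribute_frequencies(load_specs("path/to/dataset")).most_common(10)` lists the most frequent attribute names, which gives a quick view of how sparse the remaining ones are.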

Entity Resolution (Record linkage) Ground Truth


The Entity Resolution task consists of identifying which specs of the dataset represent the same real-world product (e.g. ASUS VT229H).

The ground truth for this task is provided in CSV format containing two columns: "entity_id" and "spec_id".

Example of ground truth rows
entity_id, spec_id
ENTITY#001, www.ebay.com//2
ENTITY#001, catalog.com//1
ENTITY#002, ca.pcpartpicker.com//1

Note that there might be matching pairs even within the same web source.
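The ground truth rows group specs by entity, so the set of matching pairs can be derived from them. The sketch below assumes the column names shown in the example above and tolerates the space after each comma; `load_entity_clusters` and `matching_pairs` are hypothetical helper names.

```python
import csv
from collections import defaultdict
from itertools import combinations


def load_entity_clusters(csv_path):
    """Return {entity_id: set of spec_ids} from the ground truth CSV."""
    clusters = defaultdict(set)
    with open(csv_path, newline="", encoding="utf-8") as f:
        # skipinitialspace handles "entity_id, spec_id" style rows.
        for row in csv.DictReader(f, skipinitialspace=True):
            clusters[row["entity_id"].strip()].add(row["spec_id"].strip())
    return dict(clusters)


def matching_pairs(clusters):
    """Return every unordered pair of specs that share an entity."""
    pairs = set()
    for spec_ids in clusters.values():
        # Sorting makes each pair canonical: (smaller id, larger id).
        for a, b in combinations(sorted(spec_ids), 2):
            pairs.add((a, b))
    return pairs
```

Pairs are generated regardless of source, so two specs from the same website that refer to the same product are included as well.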

Schema matching Ground Truth


The Schema matching task has two possible definitions. Under a closed-world assumption (i.e. with a mediated schema available), it consists of identifying mappings between source attributes (e.g. the attribute "brand" of the source "www.ebay.com") and a set of target attributes (e.g. "brand", "dimensions", "screen_size", etc.) defined in the mediated schema. Under an open-world assumption (i.e. without a mediated schema available), it consists of identifying (potentially overlapping) clusters of source attributes with equivalent semantics.

The ground truth for this task consists of a mediated schema (in TXT format, one target attribute per row) and a CSV file containing two columns: "source_attribute_id" and "target_attribute_name":

Example of CSV file for schema matching
source_attribute_id, target_attribute_name
www.ebay.com//producer name, brand
www.ebay.com//brand, brand
www.odsi.co.uk//device type, screen_type
www.odsi.co.uk//device type, screen_size_diagonal

Note that some source attributes have values that refer to multiple target attributes. Therefore, a source attribute may be mapped to more than one target attribute. For instance, suppose the source attribute "www.odsi.co.uk//device type" takes three values: value1, which describes both the screen type and the screen size diagonal (e.g., "LED-backlit LCD monitor - 23''"); value2, which describes only the screen size diagonal (e.g., "23''"); and value3, which describes only the screen type (e.g., "LCD").

Then this source attribute is mapped to target attributes "screen_type" (because of value1 and value3) and "screen_size_diagonal" (because of value1 and value2).
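Because a source attribute can map to several target attributes, the ground truth is naturally read into a one-to-many mapping. The following sketch assumes the CSV column names and the one-attribute-per-row TXT format described above; the function names are hypothetical.

```python
import csv
from collections import defaultdict


def load_mediated_schema(txt_path):
    """Read the mediated schema: one target attribute per row."""
    with open(txt_path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def load_schema_matching_gt(csv_path):
    """Return {source_attribute_id: set of target attribute names}."""
    mapping = defaultdict(set)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, skipinitialspace=True):
            source_attr = row["source_attribute_id"].strip()
            mapping[source_attr].add(row["target_attribute_name"].strip())
    return dict(mapping)


def multi_target_attributes(mapping):
    """Keep only source attributes mapped to more than one target."""
    return {attr: targets for attr, targets in mapping.items() if len(targets) > 1}
```

With the example rows above, `multi_target_attributes` would return only "www.odsi.co.uk//device type", since it is the one source attribute with two targets.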

Instance-level attribute matching Ground Truth


The Instance-level attribute matching task has two possible definitions. Under a closed-world assumption (i.e. with a mediated schema available), it consists of identifying mappings between attribute instances (e.g. the attribute "brand" from the specification "1.json" of the source "www.ebay.com") and a set of target attributes (e.g. "brand", "dimensions", "screen_size", etc.) defined in the mediated schema. Under an open-world assumption (i.e. without a mediated schema available), it consists of identifying (potentially overlapping) clusters of attribute instances with equivalent semantics. It is a finer-grained task than schema matching.

The ground truth consists of a mediated schema (in TXT format, one target attribute per row) and a CSV file containing two columns: "instance_attribute_id" and "target_attribute_name":

Example of CSV file
instance_attribute_id, target_attribute_name
www.ebay.com//1//producer name, brand
www.odsi.co.uk//1//device type, screen_type
www.odsi.co.uk//1//device type, screen_size_diagonal
www.odsi.co.uk//2//device type, screen_size_diagonal

For instance, if the value of the instance attribute "www.odsi.co.uk//1//device type" is "LED-backlit LCD monitor - 23''", then this attribute is mapped to both the screen_type and screen_size_diagonal target attributes. If instead the value of the instance attribute "www.odsi.co.uk//2//device type" is "23''", then this attribute is mapped only to screen_size_diagonal.
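An instance attribute identifier appears to combine the source, the specification, and the attribute name, separated by "//". The sketch below splits such an identifier and, as an illustration of how the two ground truths relate, collapses instance-level mappings into schema-level ones by taking the union of targets per source attribute. Both the identifier layout and the union rule are assumptions inferred from the examples on this page, and the function names are hypothetical.

```python
from collections import defaultdict


def parse_instance_attribute_id(instance_attribute_id):
    """Split "<source>//<spec>//<attribute name>" into its three parts.

    maxsplit=2 keeps any further "//" inside the attribute name intact.
    """
    source, spec_name, attribute = instance_attribute_id.split("//", 2)
    return source, spec_name, attribute


def aggregate_to_schema_level(instance_rows):
    """Collapse (instance_attribute_id, target) pairs to source attributes.

    Assumption: a source attribute maps to the union of the targets of
    all its instances.
    """
    mapping = defaultdict(set)
    for instance_attribute_id, target in instance_rows:
        source, _, attribute = parse_instance_attribute_id(instance_attribute_id)
        mapping[f"{source}//{attribute}"].add(target)
    return dict(mapping)
```

Applied to the four example rows above, this yields "www.odsi.co.uk//device type" mapped to both screen_type and screen_size_diagonal, which agrees with the schema matching example.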