SEMANTiCS 2018 Datasets & Evaluations
To show the performance and limitations of our geo-labelling approach, we randomly selected ten datasets per data portal (using Elasticsearch's built-in random scoring function). We manually categorized the datasets' labels by assigning the following tags:
Initially, a total of 40 datasets - ten per indexed portal - were randomly selected. Out of these 40, we identified 16 datasets which did not contain any geo-spatial data that could be mapped; all 16 are published on the two non-governmental portals opendataportal.at and offenedaten.de. Among the remaining 24 datasets, we identified correct labels for 17, while 4 datasets were assigned incorrect OSM labels and 4 incorrect GeoNames labels.
For 7 datasets we identified content carrying additional geo-information that was not labelled by our approach. For instance, one dataset contains sub-district labels of a city which are not present in the knowledge graph, or city/region names are embedded in free text and abbreviated (e.g. "Str" for the German word Straße, i.e. street).
For 33 out of the 40 datasets we were able to derive a correct metadata label based on the title or publisher of the dataset. However, for 32 datasets we also derived some incorrect metadata labels. For instance, given the publisher ``Stadt Wien'' we linked two geo-entities: ``Vienna'', the city of Vienna, Austria, and ``Stadt'', a German town in Saxony. An easy fix for these issues is to restrict the metadata labels to the origin country of the portal; however, we wanted to keep our approach as general as possible and not restrict the labels to national datasets only. Also, given that we use the labels in a search engine, these false positives can be considered a minor issue, i.e. they can be down-ranked by an adequate ranking of the results, with the benefit of a more comprehensive result set.
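The country-restriction fix mentioned above can be sketched as a simple filter over the linker's candidate entities. The candidate structure (label, country-code pairs) and the country codes are illustrative assumptions, not our system's actual data model:

```python
def restrict_to_country(candidates, portal_country):
    """Keep only candidate geo-entities whose country code matches the
    portal's origin country. `candidates` is a hypothetical list of
    (label, country_code) pairs produced by the entity linker."""
    return [(label, cc) for label, cc in candidates if cc == portal_country]

# Linking the publisher "Stadt Wien" yields two candidate entities;
# restricting to the portal's country (AT) discards the Saxon town "Stadt".
candidates = [("Vienna", "AT"), ("Stadt", "DE")]
restrict_to_country(candidates, "AT")
```

Such a filter would remove the false positive, but at the cost of discarding all correct links to entities outside the portal's country, which is why we did not apply it.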
[Table: per-portal evaluation results for data.gv.at, govdata.de, offenedaten.de, and opendataportal.at]
We inspect and report potential false negative errors of our system, i.e. datasets where we did not assign any labels, by selecting three random sets of datasets: first, 20 random datasets where no column labels were assigned but a metadata label is available; second, 20 datasets where no metadata labels are available but column labels exist; and third, another 20 random datasets without any column or metadata labels.
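The random sampling used throughout this evaluation relies on Elasticsearch's built-in random scoring. A minimal sketch of such a query body follows; the index field name `portal_id`, the seed, and the `_seq_no` seed field are illustrative assumptions, not our exact configuration:

```python
def random_sample_query(portal_id, size=20, seed=42):
    """Build an Elasticsearch query body that returns `size` random
    datasets of one portal via a function_score/random_score query.
    Field names and seed are illustrative assumptions."""
    return {
        "size": size,
        "query": {
            "function_score": {
                # restrict the sample to datasets of a single portal
                "query": {"term": {"portal_id": portal_id}},
                # replace the relevance score by a seeded random score,
                # which makes the sample reproducible
                "random_score": {"seed": seed, "field": "_seq_no"},
            }
        },
    }

body = random_sample_query("data.gv.at")
```

The seeded variant makes a drawn sample reproducible across runs, which is useful when the manual inspection is repeated.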
We assigned the following tags to the datasets:
None of the 60 sampled datasets misses a geo-label that could have been derived from the dataset's title or publisher. In particular, the 40 datasets without any assigned metadata labels indeed provide no geo-information cues in their metadata. For 9 of the 60 datasets we identified columns containing potential geo-data for which the algorithm did not assign any labels; notably, 7 of these candidates with missing labels are in the set of 20 datasets without any assigned column or metadata labels.
These missed labels can be grouped into three basic error classes: (i) The corresponding entities are missing in the base knowledge graph, so our algorithm is not able to link the labels in the column context. (ii) The city/region names are embedded in text, or combined with other content in a single cell, e.g. the region type. (iii) The column contains very few labels, below the algorithm's threshold, or, similarly, the table consists of several sub-tables, where each sub-table has a regional geo-label as "title".
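Error class (iii) can be illustrated with a minimal coverage-threshold sketch. The threshold value, the `link` function signature, and the majority-vote labelling are illustrative assumptions, not the exact parameters of our algorithm:

```python
from collections import Counter

def column_label(cells, link, min_coverage=0.5):
    """Assign a geo-label to a column only if enough cells can be linked.

    `link` maps a cell value to a knowledge-graph entity or None
    (hypothetical signature); `min_coverage` is an assumed threshold.
    """
    linked = [e for e in (link(c) for c in cells) if e is not None]
    if not cells or len(linked) / len(cells) < min_coverage:
        return None  # too few linkable cells: the column stays unlabelled
    # label the column with the most frequently linked entity
    return Counter(linked).most_common(1)[0][0]
```

A column that consists mostly of free text with only a few recognizable place names falls below `min_coverage` and receives no label, which is exactly the behaviour behind error class (iii).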