SEMANTiCS 2018 Datasets & Evaluations
To show the performance and limitations of our geo-labelling approach, we randomly selected ten datasets per data portal (using Elasticsearch's built-in random scoring function). We manually categorized the datasets' labels by assigning the following tags:
Initially, a total of 40 datasets - ten per indexed portal - were randomly selected. Out of these 40, we identified 16 datasets which did not contain any geo-spatial data that could be mapped; all 16 are published on the two non-governmental portals opendataportal.at and offenedaten.de. Among the remaining 24 datasets, we identified correct labels for 17, while 4 datasets were assigned incorrect OSM labels and 4 incorrect GeoNames labels.
For 7 datasets we identified content carrying additional geo-information that was not labelled by our approach. For instance, one dataset contains sub-district labels of a city which are not present in the knowledge graph, or city/region names are embedded in free text and abbreviated (e.g. "Str" for the German word Straße, i.e. street).
For 33 out of the 40 datasets we were able to derive a correct metadata label based on the title or publisher of the dataset. However, for 32 datasets we also derived some incorrect metadata labels. For instance, given the publisher ``Stadt Wien'' we linked two geo-entities: ``Vienna'', the city of Vienna, Austria, and ``Stadt'', a German town in Saxony. An easy fix for these issues is to restrict the metadata labels to the origin country of the portal; however, we wanted to keep our approach as general as possible and not restrict the labels to national datasets only. Also, given that we use the labels in a search engine, these false positives can be considered a minor issue, i.e. they can be down-ranked by an adequate ranking of the results, with the benefit of a more comprehensive result set.
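The country-restriction fix mentioned above can be sketched as a simple filter over the linker's candidate entities. The candidate structure (label, country-code pairs) and the country codes are illustrative assumptions, not our system's actual data model:

```python
def restrict_to_country(candidates, portal_country):
    """Keep only candidate geo-entities whose country code matches the
    portal's origin country. `candidates` is a hypothetical list of
    (label, country_code) pairs produced by the entity linker."""
    return [(label, cc) for label, cc in candidates if cc == portal_country]

# Linking the publisher "Stadt Wien" yields two candidate entities;
# restricting to the portal's country (AT) discards the Saxon town "Stadt".
candidates = [("Vienna", "AT"), ("Stadt", "DE")]
restrict_to_country(candidates, "AT")
```

Such a filter would remove the false positive, but at the cost of discarding all correct links to entities outside the portal's country, which is why we did not apply it.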
[Table: per-portal evaluation results for data.gv.at, govdata.de, offenedaten.de, and opendataportal.at]
We inspect and report potential false negative errors of our system, i.e. datasets where we did not assign any labels, by selecting three random sets of datasets: first, 20 random datasets where no column labels were assigned but a metadata label is available; second, 20 datasets where no metadata labels are available but column labels exist; and third, another 20 random datasets without any column or metadata labels.
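The random sampling used throughout this evaluation relies on Elasticsearch's built-in random scoring. A minimal sketch of such a query body follows; the index field name `portal_id`, the seed, and the `_seq_no` seed field are illustrative assumptions, not our exact configuration:

```python
def random_sample_query(portal_id, size=20, seed=42):
    """Build an Elasticsearch query body that returns `size` random
    datasets of one portal via a function_score/random_score query.
    Field names and seed are illustrative assumptions."""
    return {
        "size": size,
        "query": {
            "function_score": {
                # restrict the sample to datasets of a single portal
                "query": {"term": {"portal_id": portal_id}},
                # replace the relevance score by a seeded random score,
                # which makes the sample reproducible
                "random_score": {"seed": seed, "field": "_seq_no"},
            }
        },
    }

body = random_sample_query("data.gv.at")
```

The seeded variant makes a drawn sample reproducible across runs, which is useful when the manual inspection is repeated.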
We assigned the following tags to the datasets:
None of the 60 sampled datasets misses a geo-label that could have been derived from the dataset's title or publisher. In particular, the 40 datasets without any assigned metadata labels indeed provide no geo-information cues in their metadata. For 9 of the 60 datasets we identified columns containing potential geo-data for which the algorithm did not assign any labels; notably, 7 of these candidates with missing labels are in the set of 20 datasets without any assigned column or metadata labels.
These missed labels can be grouped into three basic error classes: (i) The corresponding entities are missing in the base knowledge graph, so our algorithm is not able to link the labels in the column context. (ii) The city/region names are embedded in text, or combined with other content in a single cell, e.g. the region type. (iii) The column contains very few labels, below the algorithm's threshold, or, similarly, the table consists of several sub-tables, where each sub-table has a regional geo-label as "title".
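Error class (iii) can be illustrated with a minimal coverage-threshold sketch. The threshold value, the `link` function signature, and the majority-vote labelling are illustrative assumptions, not the exact parameters of our algorithm:

```python
from collections import Counter

def column_label(cells, link, min_coverage=0.5):
    """Assign a geo-label to a column only if enough cells can be linked.

    `link` maps a cell value to a knowledge-graph entity or None
    (hypothetical signature); `min_coverage` is an assumed threshold.
    """
    linked = [e for e in (link(c) for c in cells) if e is not None]
    if not cells or len(linked) / len(cells) < min_coverage:
        return None  # too few linkable cells: the column stays unlabelled
    # label the column with the most frequently linked entity
    return Counter(linked).most_common(1)[0][0]
```

A column that consists mostly of free text with only a few recognizable place names falls below `min_coverage` and receives no label, which is exactly the behaviour behind error class (iii).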