Special Track 1
Title: Showcasing innovative data services from EU Member States – Paving the way towards transparent documentation
Seth van Hooland, European Commission (corresponding author)
Emanuele Baldacci, European Commission
Blanca Martinez De Aragon, European Commission
Joao Rodrigues Frade, European Commission
Public sector data
Data and algorithm design principles and accountability
Documenting data processing
Directorate D “Digital Services” of DIGIT, European Commission
Motivation and topics to be covered:
In the summer of 2020, the European Commission published a study on public sector data strategies, policies and governance, highlighting lessons learned from concrete case studies on innovative data services from across EU Member States [1]. As these case studies illustrate, capturing the life cycle of data-driven innovation projects is challenging. Successful data analytics and visualisations for policy making require a complex set of preparatory steps when moving from an experimental to an operational setting. For example, cleaning and integrating data stemming from heterogeneous and sometimes conflicting sources is often needed before applying Machine Learning (ML) and Natural Language Processing (NLP) methods and tools. Both the streamlining of the data and the subsequent tasks, such as auto-classification or concept extraction, are performed iteratively due to the high number of variables that come into play when using unsupervised ML techniques. Minor configuration changes, such as when applying clustering techniques to spot duplicate records during cleaning, or when fine-tuning the list of stop words for concept extraction, can have drastic consequences on the results. These challenges have been widely recognised from both an academic (Cath, Zimmer, Lomborg, & Zevenbergen, 2018; Morley, Floridi, Kinsey, & Elhalal, 2020) and a policy-development perspective (Jian & Martin, 2020; Rotenberg, 2020). As a multitude of data governance frameworks emerges worldwide under the umbrella of the UN, OECD, G20 or the EU, it becomes clear that progress is needed on documenting the entire life cycle of data used for policymaking.
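The sensitivity of such cleaning steps to minor configuration changes can be illustrated with a minimal, self-contained sketch. The record names and threshold values below are invented for illustration; a production deduplication pipeline would use dedicated record-linkage tooling, but the effect of a single parameter is the same.

```python
from difflib import SequenceMatcher

# Toy register of organisation names containing near-duplicates,
# standing in for records to be cleaned before ML processing.
records = [
    "Directorate-General for Informatics",
    "Directorate General for Informatics",
    "DG Informatics",
    "Directorate-General for Budget",
]

def duplicate_pairs(records, threshold):
    """Flag record pairs whose string similarity reaches a threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            ratio = SequenceMatcher(
                None, records[i].lower(), records[j].lower()
            ).ratio()
            if ratio >= threshold:
                pairs.append((records[i], records[j], round(ratio, 2)))
    return pairs

# A minor configuration change (0.9 vs 0.5) yields very different
# sets of "duplicates" -- the loose threshold even merges two
# distinct Directorates-General.
strict = duplicate_pairs(records, threshold=0.9)
loose = duplicate_pairs(records, threshold=0.5)
```

Because a one-line change flips which records are collapsed, documenting the exact parameters of each run alongside the results is a precondition for reproducible data preparation.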
What can governments and public administrations learn from the developer community? Platforms such as OpenML [2] and Kaggle [3] help to structure the documentation of both the data sources and the various methods and tools deployed in data science projects. These platforms have largely remained within the realm of experimental research, but the public sector is starting to explore them. For example, the Open Data platform of Ireland offers executable documentation through Jupyter Notebooks, explaining the core concepts for connecting to the Data.gov.ie API as well as more advanced usage with regard to data cleaning, enrichment and visualisation [4]. In parallel to the potential offered by these executable notebooks, the natural and social sciences, with their long-standing tradition of exchanging and analysing complex data sets, have already invested heavily in documenting the various layers of data transformations. Recent initiatives such as the Research Data Alliance (RDA) FAIR Data Maturity Model, with its set of indicators and evaluation levels for assessing the FAIRness of research data, raise awareness of the topic in other fields [5].
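As a sketch of what such notebook-based documentation typically starts from: Data.gov.ie is, to our understanding, built on the CKAN platform, whose Action API exposes a standard dataset-search endpoint. The snippet below only builds the request URL (the endpoint path and parameters are assumptions based on the CKAN convention, not taken from the official Data.gov.ie documentation).

```python
from urllib.parse import urlencode

# Assumed CKAN-style Action API base for the Irish open data portal.
BASE = "https://data.gov.ie/api/3/action"

def package_search_url(query, rows=5):
    """Build a CKAN package_search request URL for a free-text query."""
    return f"{BASE}/package_search?" + urlencode({"q": query, "rows": rows})

url = package_search_url("air quality")
# The JSON response could then be fetched, e.g. with
# urllib.request.urlopen(url), and its "result" field inspected.
```

In an executable notebook, this call and every subsequent cleaning step remain visible and re-runnable, which is precisely the documentation benefit discussed here.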
This special track aims to bring together the above-mentioned points within the specific context of the innovative data services the EC is coordinating across Member States. For example, the Big Data Test Infrastructure of the EC is embedding Jupyter Notebooks within its service [6]. From a policy perspective, the FAIR principles are discussed as a point of reference to be included in the upcoming regulation on data governance. It is therefore a timely moment to bring together a selection of concrete case studies that experiment with how the public sector can document its use of ML and NLP processing to prepare data for policymaking. Respecting the regulatory framework with regard to the GDPR across the life cycle of data integration, harmonisation and management will, of course, also be taken into account from the start.
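To make the idea of indicator-based FAIRness assessment concrete, the following sketch is loosely modelled on the RDA FAIR Data Maturity Model, which groups indicators by FAIR principle and scores them on maturity levels. The indicator names and scores below are simplified examples invented for illustration, not the official RDA indicator list.

```python
# Hypothetical self-assessment: indicators grouped by FAIR principle,
# each scored on a 0-4 maturity scale (0 = not addressed, 4 = fully met).
ASSESSMENT = {
    "Findable": {"persistent identifier assigned": 4, "rich metadata published": 3},
    "Accessible": {"standard retrieval protocol": 4},
    "Interoperable": {"uses shared vocabularies": 2},
    "Reusable": {"clear usage licence": 1, "provenance documented": 0},
}

def maturity_report(assessment):
    """Average the 0-4 indicator scores for each FAIR principle."""
    return {
        principle: round(sum(scores.values()) / len(scores), 1)
        for principle, scores in assessment.items()
    }

report = maturity_report(ASSESSMENT)
```

Even in this toy form, the per-principle breakdown shows where a data service's documentation is weakest (here, reusability), which is the kind of actionable signal the maturity model is designed to produce.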
- Cath, C., Zimmer, M., Lomborg, S., & Zevenbergen, B. (2018). Association of internet researchers (AoIR) roundtable summary: Artificial Intelligence and the good society workshop proceedings. Philosophy & Technology, 155–162.
- Morley, J., Floridi, L., Kinsey, L., & Elhalal, A. (2020). From what to how: an initial review of publicly available AI ethics tools, methods and research to translate principles into practices. Science and Engineering Ethics, 26, 2141–2168.
- Jian, C. & Martin, S. (2020). The Geopolitics of Data Governance: Data Governance Regimes (Oxford Insights).
- Rotenberg, M. (2020). The AI Policy Sourcebook 2020 (Electronic Privacy Information Center).