In many aspects of our society there is growing awareness of, and consensus on, the need for data-driven approaches that are resilient, transparent and fully accountable. But to achieve a data-driven society, the data needed for public goods must be readily available. Thus, it is not surprising that in recent years both public and private organizations have been faced with the issue of publishing and exchanging open data. Although there are several works on platforms and architectures for publishing open data, there is still no formal and comprehensive methodology supporting an organization in deciding which data to publish and in carrying out precise procedures for publishing high-quality data, suitably annotated with semantic information. The recent paradigm of Ontology-based Data Management (OBDM) is an attempt to provide principles and techniques for a new way of managing data, based on knowledge representation and reasoning techniques. An OBDM system is constituted by an ontology, the data sources forming the information system, and the mapping between the ontology and the sources.
The basic assumption for this project is that the Ontology-based Data Management paradigm can provide a formal basis for a principled approach to publishing high-quality, semantically annotated open data. There are three pillars of the project: foundations, system and experimentation. As for the first pillar, we will study basic problems in applying OBDM to open data publishing, and we will define algorithms for carrying out the corresponding tasks. For the second pillar, we will extend the MASTRO system with suitable features implementing the designed algorithms. The resulting system will be the first tool providing semantic open data publishing capabilities based on ontologies. For the third pillar, we will experiment with our techniques and tools in two real-world scenarios, one related to an Italian Public Administration, and one related to Cultural Heritage archives.
In recent years, both public and private organizations have been faced with the issue of publishing open data, in particular with the goal of providing data consumers with suitable information to capture the semantics of the data they publish. As noted above, although there are several works on platforms and architectures for publishing open data, there is still no formal and comprehensive methodology supporting an organization in deciding which data to publish and in carrying out precise procedures for publishing and documenting high-quality data.
Current practices for publishing open data focus essentially on providing extensional information (often in very simple forms, such as CSV files), and they carry out the task of documenting data mostly by using metadata expressed in natural language, or in terms of record structures. As a consequence, the semantics of datasets is not formally expressed in a machine-readable form. Conversely, the approach pursued in this project, based on OBDM, opens up the possibility of a new way of publishing data, with the idea of annotating data items with the ontology elements that describe them in terms of the concepts in the domain of the organization. When an OBDM system is available in an organization, an obvious way to proceed to open data publication is as follows: (i) express the dataset to be published in terms of a SPARQL query over the ontology, (ii) compute the certain answers to the query, and (iii) publish the result of the certain answer computation, using the query expression and the ontology as a basis for annotating the dataset with suitable metadata expressing its semantics. We call this method top-down. Using this method, the ontology is the heart of the task: it is used for expressing the content of the dataset to be published (in terms of a query), and it is used, together with the query, for annotating the published data.
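The three top-down steps can be illustrated with a minimal sketch. It is not the MASTRO implementation: the vehicle domain, the table names, the `SUBCLASS_OF` and `MAPPING` structures and the functions below are all hypothetical, and the ontology is restricted to plain subclass axioms between named classes. Under that restriction, the certain answers to a query asking for the instances of a class can be obtained by taking the union of the source data mapped to that class and to all of its subclasses.

```python
import sqlite3

# Hypothetical toy ontology: subclass axioms only (child -> parent).
SUBCLASS_OF = {"Car": "Vehicle", "Motorcycle": "Vehicle"}

# Hypothetical mapping: ontology class -> SQL query over the sources.
MAPPING = {
    "Car": "SELECT plate FROM cars",
    "Motorcycle": "SELECT plate FROM motorcycles",
}


def subclasses_of(cls):
    """Return cls together with all of its (transitive) subclasses."""
    result = {cls}
    changed = True
    while changed:
        changed = False
        for child, parent in SUBCLASS_OF.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result


def certain_answers(conn, cls):
    """Step (ii): answer the class query through the ontology and the mapping."""
    answers = set()
    for c in subclasses_of(cls):
        sql = MAPPING.get(c)
        if sql is not None:
            answers.update(row[0] for row in conn.execute(sql))
    return sorted(answers)


def publish(conn, cls):
    """Step (iii): bundle the data with metadata derived from query and ontology."""
    return {
        "metadata": {
            # Step (i): the dataset specification as a query over the ontology.
            "ontology_query": f"SELECT ?x WHERE {{ ?x a :{cls} }}",
            "ontology_classes": sorted(subclasses_of(cls)),
        },
        "data": certain_answers(conn, cls),
    }


def demo():
    """Build a small in-memory source database and publish all vehicles."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cars (plate TEXT)")
    conn.execute("CREATE TABLE motorcycles (plate TEXT)")
    conn.executemany("INSERT INTO cars VALUES (?)", [("AB123",), ("CD456",)])
    conn.execute("INSERT INTO motorcycles VALUES ('MC001')")
    return publish(conn, "Vehicle")
```

Calling `demo()` returns a dictionary whose `data` entry lists the plates drawn from both source tables, even though no table is mapped directly to `Vehicle`; this is the contribution of the ontology to the answer. A realistic ontology language would require a proper query rewriting procedure rather than this subclass closure.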
The top-down approach requires skill in many aspects (ontology language, SPARQL, etc.) and full awareness of the ontology. Unfortunately, in many organizations (for example, in Public Administrations) it may be the case that people are not ready to use the ontology and to base their tasks on it. Rather, the IT staff might be more comfortable expressing the specification of the dataset to be published directly in terms of the source structures (i.e., the relational tables in their databases), or, more generally, in terms of a view over the sources. To address this issue, the project studies an innovative, bottom-up approach: the organization expresses its publishing requirement as a query over the sources, and, by using the ontology and the mapping, a suitable algorithm computes the corresponding query over the ontology. With such a query at hand, the problem is reduced to one that the top-down method described above can handle, and the required data can be published accordingly.
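To make the bottom-up direction concrete, the following sketch shows one very simplified way of going from a source-level specification to an ontology-level query. It is an illustration under strong assumptions, not the algorithm to be developed in the project: the source query is reduced to a set of table names (read as the union of their contents), the ontology contains only subclass axioms, and the names `SUBCLASS_OF`, `MAPPING`, `unfolding` and `abstract` are hypothetical.

```python
# Hypothetical toy ontology: subclass axioms only (child -> parent).
SUBCLASS_OF = {"Car": "Vehicle", "Motorcycle": "Vehicle"}

# Hypothetical mapping: source table -> ontology class it populates.
MAPPING = {"cars": "Car", "motorcycles": "Motorcycle"}


def all_classes():
    """Every class mentioned in the ontology or in the mapping."""
    return set(SUBCLASS_OF) | set(SUBCLASS_OF.values()) | set(MAPPING.values())


def unfolding(cls):
    """Source tables that contribute instances of cls under the axioms."""
    classes = {cls}
    changed = True
    while changed:
        changed = False
        for child, parent in SUBCLASS_OF.items():
            if parent in classes and child not in classes:
                classes.add(child)
                changed = True
    return frozenset(t for t, c in MAPPING.items() if c in classes)


def abstract(source_tables):
    """Return the classes whose unfolding is exactly the given set of tables.

    Each returned class C stands for the ontology query asking for the
    instances of C. An empty list means that no single class captures
    the source specification exactly in this toy setting.
    """
    wanted = frozenset(source_tables)
    return sorted(c for c in all_classes() if unfolding(c) == wanted)
```

For instance, `abstract({"cars", "motorcycles"})` yields `["Vehicle"]`, and `abstract({"cars"})` yields `["Car"]`; a specification involving a table that the mapping does not cover yields an empty list. In a realistic setting an exact ontology-level counterpart may not exist, so sound or complete approximations would have to be considered.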
In both the top-down and the bottom-up approaches, there is a need to take into account metamodeling capabilities in the ontology. This is another important innovation of the project. This feature is crucial in open data publishing, where the ontology is part of the open data themselves. It is even more crucial because it allows the data to be enriched with information concerning, e.g., their provenance, quality, relevance or privacy.
The principles, techniques and algorithms developed in the project will be implemented in a new automated reasoning system extending MASTRO, and will be tested in the context of two scenarios.
In the first scenario, we will collaborate with an external partner, namely ACI Informatica, in order to realize the first semantic open data portal in the Italian Public Administration.
In the second scenario, the OBDM application will be tailored to opening archival descriptions of Cultural Heritage (CH) archives. Exporting and effectively using open data is particularly crucial to bring CH archives out of a certain congenital isolation. Indeed, the concealment of the information contained in archival documents is a pure loss not only for archives and their preservation, but also for the knowledge of CH overall, e.g., objects, events and protagonists, territories. We will benefit from several years of collaboration between Prof.ssa Giuva and the Italian Central State Archive (Archivio Centrale dello Stato), as well as both the State Archive of Rome (Archivio di Stato di Roma) and the State Archive of Latina (Archivio di Stato di Latina), so as to apply the OBDM specification for CH archives to as many databases as possible storing archival descriptions produced with the Archimista system, and to export open archival descriptions from each such application. Moreover, we will consider archival descriptions that are collected, managed and made available through the National Archival System portal by the Central Institute for Archives (Istituto Centrale per gli Archivi - ICAR).