Risks of training artificial intelligence systems with open-source data: a Spanish perspective
Published on 27th Jan 2023
Artificial Intelligence (AI) systems often rely on large datasets to train the model that will later perform a specific task. These training datasets are commonly built by collecting data on a massive scale from open sources on the internet, such as repositories of specialized publications, public sector data, and websites. In these cases it is essential to verify the legal regime applicable to the data in order to identify the potential risks of using them.
Among the main risks associated with training AI systems, two are particularly relevant: (1) those arising from the use of personal data and (2) those arising from data ownership.
Risks associated with the use of personal data
Insofar as the training dataset contains personal data, all data protection regulations must be complied with. Among other things, this includes verifying the roles of the parties involved (controllers, joint controllers, processors, etc.), the categories of data collected, the legal basis for the processing, and the information to be provided to the data subjects.
Data protection obligations apply regardless of whether the system, once in operation, provides personal data to users, and regardless of whether the data have been anonymized after collection. It is therefore necessary to check whether personal data have been processed from the moment they are collected.
Also noteworthy is the obligation to inform data subjects, within a maximum of one month from collection, that their data are going to be processed. When the personal data have not been collected directly from the data subjects, they must also be informed of the source from which their data have been obtained. Failure to provide this information constitutes a breach of data protection regulations.
Exceptionally, the obligation to inform data subjects does not apply if providing the information proves impossible, would involve a disproportionate effort, or would render impossible or seriously hinder the training of the model. In such a case, this must be demonstrated and measures must be taken to best protect the rights of the data subjects, including making the information publicly available.
Risks associated with data ownership
Preparing a large training dataset often requires automating the collection or "mining" of the data. One of the main risks of this automated collection is the infringement of intellectual property rights in data that are protected as works or databases under intellectual property law.
Currently, text and data mining is permitted by Spanish law, provided that the holder of the intellectual property rights has not reserved this right. The reservation must be made through means that allow both manual and automatic detection, such as a notice in the metadata or in the terms and conditions of licensing or use of the data.
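As an illustration of what "automatic detection" of a reservation could look like in a collection pipeline, the following minimal Python sketch checks a fetched page for machine-readable signals. Everything in it is an assumption made for illustration: the `tdm-reservation` HTTP header and HTML meta tag follow the naming used in the W3C community group's TDM Reservation Protocol proposal, treating a `robots.txt` disallow rule as a reservation is a conservative choice rather than something established by the law discussed here, and the function name and the `example-tdm-bot` user agent are hypothetical.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class _MetaCollector(HTMLParser):
    """Collects <meta name=... content=...> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            self.meta[name.lower()] = (attributes.get("content") or "").strip()


def find_reservation_signals(headers, html="", robots_txt="", url="",
                             user_agent="example-tdm-bot"):
    """Return the machine-readable reservation signals found for one page.

    The signal names checked here are assumptions (see the text above);
    an empty result does NOT mean that mining is permitted.
    """
    signals = []

    # Assumed convention: an HTTP response header "tdm-reservation: 1".
    normalized = {key.lower(): value.strip() for key, value in headers.items()}
    if normalized.get("tdm-reservation") == "1":
        signals.append("http-header")

    # Assumed convention: <meta name="tdm-reservation" content="1">.
    collector = _MetaCollector()
    collector.feed(html)
    if collector.meta.get("tdm-reservation") == "1":
        signals.append("html-meta")

    # Conservative assumption: a robots.txt rule that blocks this crawler
    # for this URL is treated as a reservation signal as well.
    if robots_txt and url:
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        if not parser.can_fetch(user_agent, url):
            signals.append("robots.txt")

    return signals
```

A pipeline could call `find_reservation_signals(response_headers, html=page_html, robots_txt=robots_text, url=page_url)` and exclude any page for which the returned list is not empty. A reservation stated only in a site's terms and conditions of use would not be caught by a check of this kind and would still need to be reviewed separately.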
A particular issue is data mining for scientific research purposes, which Spanish law seems in principle to allow holders of intellectual property rights to restrict. However, the Directive from which this provision derives would not allow such a restriction in cases of scientific research. This possible divergence has been pointed out by academics, among others, and for the time being it is not known whether the Spanish provision will be amended to align with the Directive.
Finally, we should not forget that data owners may establish conditions for data reuse, such as requiring authorization, charging fees, or prohibiting use for certain purposes. This is particularly relevant for public sector data repositories, given the recent adoption of the European Data Governance Regulation, which adds to the existing national regulations on data reuse.
With a market that is increasingly prepared to integrate AI systems into all types of commercial applications, and with the recent creation of the Spanish AI agency, it is foreseeable that regulatory demands on the entire lifecycle of AI systems will increase.
It will be essential to verify that systems requiring training datasets have the appropriate permissions for collecting and processing the data, and that they comply with data protection regulations.