Generative AI: can intellectual property infringements in training data be avoided?

Published on 24th Apr 2023

Some exceptions to IP infringement might be available but they are limited in scope

If artificial intelligence (AI) training data is not appropriately sourced then there is a risk of copyright or database right infringement. Data used to train AI is frequently sourced from the internet, for example, through web scraping tools. However, some of this data will be protected by copyright or database rights or both. Without an appropriate licence, using it to train an AI system may amount to infringement.

Various infringement exceptions permit use of the copyright work or database in certain scenarios. But how far do these go and how likely are they to be available?

Database right exception

There is a "fair dealing" exception with respect to databases that have been made available to the public (in any manner). However, the exception is narrow and unlikely to apply in a commercial context. Database right in a publicly available database will not be infringed by fair dealing with a substantial part of its contents provided that:

  • the extraction is carried out by a person who is a lawful user of the database,
  • that part is extracted for the purpose of illustration for teaching or research and not for any commercial purpose, and
  • the source is indicated.

This is a reasonably narrow exception, which requires lawful access and the use to be for non-commercial purposes. Therefore the extraction of a substantial part of a publicly available database to use for AI training purposes will not be covered by the exception where the purpose is commercial. This means that the onus is on the potential extractor of contents from a database to ensure that it is legal to do so.

UK copyright exceptions

There are also a number of exceptions to copyright infringement, for example for non-commercial research or private study, for criticism, review and news reporting, and for caricature, parody or pastiche, all of which are subject to a fair dealing restriction.

However, the UK's limited text and data mining (TDM) and temporary copies exceptions are the most relevant in this context.

Text and data mining exception

The TDM exception, introduced in the UK in 2014, provides that copyright in a work is not infringed where a copy is made by a person who:

  • has lawful access to the work, and
  • carries out "computational analysis" for the "sole purpose of research for a non-commercial purpose".

The present TDM exception, therefore, can only be relied on if the copyright works have been accessed lawfully (for example, by paying to access works behind a paywall) and the TDM is for non-commercial research purposes.

Accordingly, the present TDM exception is fairly narrow.

Some rightsholders elect to license their works to allow them to be used for commercial purposes for a fee; others do not. The UK TDM exception, as it stands, would not be wide enough to legitimise, for example, web-scraping (including copying copyrighted content from the internet) to use for AI training, if the purpose was commercial. Again, the onus is on the person carrying out or facilitating the copying to ensure that it is legal.

Temporary copy exception

Beyond the TDM exception, there is also an exception covering the making of a temporary copy of a literary work (other than a computer program or a database), or of a dramatic, musical or artistic work, a sound recording or a film.

The exception states that copyright will not be infringed by the making of a temporary copy that:

  • is "transient or incidental",
  • is "an integral and essential part of a technological process",
  • has the sole purpose of enabling "lawful use of the work", and
  • has no independent economic significance.

This exception was introduced to enable acts such as browsing and caching, which allow users to view webpages. It is possible that AI developers might try to rely on this exception by arguing that any copies made for the purposes of AI training are temporary and that the tool's use of the works is akin to webpage browsing. On the other hand, generative AI systems are often trained using databases of web-scraped data that are made publicly available and not deleted after the training process for a single system. Accordingly, such copies are difficult to characterise as temporary and the exception may be of limited use.

Moreover, the Supreme Court has held that these requirements are overlapping and have to be read together. As such, the temporary copying would have to have no independent economic significance, and the exception is therefore unlikely to be available where data is used to train a commercial AI system.

What about in the EU?

The position with respect to the fair dealing exception for database rights and the temporary copies exception for copyright is the same in the EU as in the UK. However, the position with respect to the TDM exception for copyright is different.

Articles 3 and 4 of the Directive on Copyright in the Digital Single Market (Directive (EU) 2019/790) provide TDM exceptions. Article 3 provides an exception to the reproduction and extraction rights under copyright, related rights (for example, phonogram producers' rights) and database right. It applies to TDM carried out for scientific research by research organisations and cultural heritage institutions on works to which they have lawful access. This is similar to the UK copyright exception with respect to TDM.

Article 4, however, provides an additional general exception for TDM for any purpose, provided that the rightsholder has not "expressly reserved" its work (that is, opted it out of the exception) in an appropriate manner. For content available online, this could be done by making the express reservation in a machine-readable way. In practice, such a reservation is commonly included in the terms and conditions of website use. In the absence of any such reservation, it may be easier to show under EU law that web-scraped content was lawfully copied and used for commercial purposes.
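To illustrate what checking for a machine-readable signal might look like in practice, the following is a minimal Python sketch that tests whether a site's robots.txt file disallows a given crawler. It rests on several assumptions that are not drawn from this article: that a rightsholder has chosen robots.txt as its signal, that the crawler identifies itself as "ExampleAIBot" (a hypothetical name), and that the sample file contents shown are representative. Whether a robots.txt entry amounts to an express reservation "in an appropriate manner" is a legal question that this sketch does not answer, and it would not detect a reservation made only in a website's terms and conditions.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: blocks one named AI crawler, allows everyone else.
EXAMPLE_ROBOTS = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    page = "https://example.com/articles/1"
    print("ExampleAIBot allowed:", may_fetch(EXAMPLE_ROBOTS, "ExampleAIBot", page))
    print("OtherBot allowed:", may_fetch(EXAMPLE_ROBOTS, "OtherBot", page))
```

With the sample file above, the hypothetical ExampleAIBot is refused while other crawlers are permitted. A check of this kind could at most form one input into an assessment of whether content has been reserved; it is not a substitute for reviewing the site's terms of use.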

Osborne Clarke comment

The TDM exception has been a particular area of focus in the UK, with toing and froing on policy direction from the UK government. Our next Insight in this series looks at the potential future developments in this area.

Where data is web-scraped for use as training data with a commercial motivation, there is unlikely to be an exception under English law that would protect the AI developer from a claim of copyright or database right infringement.

The position may be more positive under EU law, although it is relatively straightforward for the rightsholder to prevent availability of the exception. Users of AI systems that might have been trained using data subject to intellectual property (IP) rights should make enquiries of the supplier to understand the legal risks in relation to the system's training data.

The concluding article in this series will look at what the future could hold for IP rights and AI training data in the UK. The IP risks of AI will be covered in a webinar during Osborne Clarke's IP Month.

* This article is current as of the date of its publication and does not necessarily reflect the present state of the law or relevant regulation.

