If data is critical to the training and development of AI systems, how are the owners of intellectual property rights protected? Can intellectual property rights be reconciled with the 'data voracity' of AI systems?
There are two critical issues related to the protection of intellectual property in the development and use of artificial intelligence systems.
The first issue relates to the indiscriminate use of information in the training of AI systems, including information covered by intellectual property rights, and essentially concerns providers.
Art. 3 AI Act: ‘provider’ means a natural or legal person, public authority, agency or other body that develops an AI system or a general-purpose AI model or that has an AI system or a general-purpose AI model developed and places it on the market or puts the AI system into service under its own name or trademark, whether for payment or free of charge;
The second issue, on the other hand, concerns whether works generated by AI systems are recognised as protected works, and mainly affects deployers.
Art. 3 AI Act: ‘deployer’ means a natural or legal person, public authority, agency or other body using an AI system under its authority except where the AI system is used in the course of a personal non-professional activity.
In this article, we will focus on the first aspect, leaving the second issue for a future article.
Protecting the intellectual property of training data
To train generative artificial intelligence, especially for general-purpose systems, developers very often use web scraping.
Essentially, ‘web robots’ collect information and data systematically and automatically by simulating human navigation, provided that the resources they visit are accessible to the general public and not subject to access restrictions. According to the Imperva Bad Bot Report, bots generated 49.6 % of all Internet traffic in 2023. This is an increase of 2.1 % compared to the previous year, which was partially attributed to the spread of AI systems and, in particular, of the Large Language Models (hereinafter also ‘LLM’) underlying generative artificial intelligence (Italian Data Protection Authority, Garante per la protezione dei dati personali, provision of 20 May 2024, information note on web scraping).
Because of this indiscriminate data collection, not only personal data but also a lot of information originally protected by intellectual property rights ends up in the 'net' of web robots.
So, on the one hand, we have the providers of AI systems, who are interested in unlimited and free access to this enormous amount of data on the web, and on the other hand, we have the owners of intellectual property rights, who want to see their rights protected.
Legislators must find a way to balance these two needs, bearing in mind that at this historic moment, legislation that hinders the development of artificial intelligence or makes it too costly could become an obstacle to competitive development.
The European Union has been moving on several fronts, with its own EU Digital Strategy - EU4Digital (eufordigital.eu) in place for several years, and there are numerous regulatory interventions concerning personal and non-personal data. The sources of quality data that can be used are indeed numerous.
The recent Regulation (EU) 2024/1689 (AI Act) provides in Article 53(1) that:
“Providers of general-purpose AI models shall:
[...]
(c) put in place a policy to comply with Union law on copyright and related rights, and in particular to identify and comply with, including through state-of-the-art technologies, a reservation of rights expressed pursuant to Article 4(3) of Directive (EU) 2019/790;
(d) draw up and make publicly available a sufficiently detailed summary about the content used for training of the general-purpose AI model, according to a template provided by the AI Office.”
The AI Act does not primarily aim to regulate copyright protection, which it only touches on in passing. In the European legal framework, however, Article 4 of Directive (EU) 2019/790 (Copyright in the Digital Single Market) allows text and data mining of works or other materials to which one has lawful access, provided that the rightholder has not expressly reserved such uses (so-called opt-out).
The Digital Single Market Directive is explicitly referenced in Art. 53 of the AI Act.
Article 53 of the AI Act thus imposes two specific obligations on providers in connection with the protection of IP rights.
- The adoption of company policies or codes of conduct that provide for the protection of intellectual property and, in particular, for identifying and respecting any reservation of rights expressed by right holders in an appropriate manner, for example by machine-readable means in the case of content made publicly available online (Art. 4(3) Dir. (EU) 2019/790).
- The drafting and publication of a summary document of the content used to train the model. Article 53 provides that the AI Office will supply a template and states that the summary must be ‘sufficiently detailed’. At this stage, it is not possible to say what level of detail will be required, but completely generic documents are unlikely to be accepted, since they would fundamentally breach the transparency requirements of the Regulation. It is likely that the template to be developed by the AI Office will provide useful guidance not only for the preparation of the summary document but also for the preparation of internal policies.
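To make the idea of a machine-readable reservation more concrete, the following is a minimal sketch of how a crawler might check a site's robots.txt file before collecting a page. It assumes, purely for illustration, that the right holder uses robots.txt to express the opt-out; whether a given signal qualifies as a valid reservation under Art. 4(3) is a legal question that this example does not answer. The bot name ExampleAIBot, the example.com address and the helper may_collect are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def may_collect(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url.

    Assumption: robots.txt is treated here as the machine-readable
    carrier of the right holder's reservation (illustrative only).
    """
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network
    # request is made in this sketch.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: the site excludes one AI crawler entirely
# and leaves the content open to every other robot.
EXAMPLE_ROBOTS = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    page = "https://example.com/articles/1"
    for bot in ("ExampleAIBot", "OtherBot"):
        allowed = may_collect(EXAMPLE_ROBOTS, bot, page)
        print(f"{bot}: {'collect' if allowed else 'skip (rights reserved)'}")
```

In a real compliance policy, a check of this kind would be only one of several signals a provider looks for, and the result would normally be logged so that it can support the training-content summary.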
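For their part, intellectual property rights holders who want to opt out bear the burden of saying so: under Article 4(3) of Directive (EU) 2019/790, to which Article 53 of the AI Act refers, the reservation of rights must be expressed in an appropriate manner, such as machine-readable means for content made publicly available online.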
The European Data Strategy thus makes a variety of data sources available to providers of AI systems, and the use of data found on the web is not prohibited but is subject to the (few) requirements of the AI Act.