Increasing affordability and ease of access to AI and machine learning technologies have amplified concerns about the use of copyrighted data for AI training. It is imperative for developing countries like India, which have a substantial and rapidly expanding user base for AI tools, to take proactive measures and enact regulations to address the complexities arising from the integration of AI in the creative field, while safeguarding the interests of human artists.
—
What is generative AI?
GENERATIVE Artificial Intelligence (AI) encompasses a broad spectrum of AI technologies capable of producing original content such as text, images, videos, audio, code or synthetic data. The technology's popularity has skyrocketed: ChatGPT, an AI chatbot by the American AI research lab OpenAI, is a prime example, having attracted a million users within five days of its unveiling.
The increasing affordability and ease of access to AI and machine-learning technologies have amplified concerns about the use of copyrighted data for AI training. These advancements have prompted growing ethical considerations and debates among policymakers regarding the potential implications of generative AI, especially in the realm of intellectual property, where the effects are palpable.
The technology is still relatively novel, and the models behind it must be trained on vast amounts of data, frequently including copyrighted material and personal information sourced from social media and open data repositories. This has led to legal and ethical quandaries.
What are the copyright implications of generative AI?
The use of copyrighted material in AI training has prompted worries among artists and programmers who perceive generative AI technologies such as DALL-E, Stable Diffusion and Codex as a threat. Their apprehensions are not unfounded.
Codex, for instance, has been trained on code found in 54 million repositories on GitHub, an internet hosting service for software development and version control using Git. An internal study by GitHub found that about 1 out of every 1,000 code suggestions generated by Codex contained direct copies from the training data, reinforcing the concerns of programmers. The Free Software Foundation, a non-profit organisation that advocates for free software, has raised concerns that the use of code from public repositories may not fall under the fair use doctrine, and that developers may not be able to identify potentially infringing code.
Similarly, DALL-E has been trained on approximately 650 million images, which are claimed to come from a combination of publicly accessible and licensed sources. However, OpenAI has not disclosed the dataset, which has raised concerns about the possibility of copyrighted material being included. There have been instances where AI models reproduced images they were trained on rather than creating novel images, a clear violation of the owners’ copyright.
How is ‘fair use’ for AI companies ‘unfair’ for copyright owners?
The fair use doctrine allows for the use of a copyrighted work without the owner’s permission for purposes such as criticism, comment, news reporting, education, scholarship or research. In India, standard exceptions or defences to copyright infringement are listed in Section 52 (certain acts not to be infringement of copyright) of the Copyright Act, 1957.
The determination of fair use is a complex matter that involves both legal and factual considerations, and is dependent on the specific circumstances of each case.
The issue around fair use protection becomes confusing when it comes to AI, as currently there is no established legal precedent that recognises the use of copyrighted data for AI training as fair use. If courts rule that copyrighted material cannot be used to train large-scale AI models, it would significantly impact the training process of these models. On the other hand, if the courts declare that these models can be trained on any data, whether copyrighted or not, it could have significant implications for individuals who own images that are used in the training process.
Despite the lack of a legal precedent, the four-factor test laid down by the Kerala High Court in Civic Chandran versus C. Ammini Amma (1996) can be useful in determining whether a use qualifies as fair use. It is similar to the four-factor test of the US fair use doctrine. These factors are:
- The purpose of the use, including whether it is for commercial or non-profit educational purposes
- The nature of the copyrighted work
- The amount and substantiality of the portion used in comparison to the entire copyrighted work
- The impact of the use on the potential market or value of the copyrighted work
The use of copyrighted data in AI training for non-profit purposes is generally allowed. However, various companies are rolling out paid versions of their products, which are clearly for profit. For instance, OpenAI has launched a paid version called ChatGPT Plus, which offers faster response times and priority access during peak hours. OpenAI has stated on its website that users’ inputs and outputs may be utilised to enhance its products and services. OpenAI has also received, or been promised, investments running into billions of dollars.
AI companies argue that the output generated by generative AI falls under the fair use category as it transforms the original work by adding new expression, meaning or message. The concern, however, is that the AI may accidentally create art or code that closely resembles the original, thereby not meeting the criteria of “transformative” and, therefore, not qualifying as fair use.
Therefore, considering the commercial aspect of ChatGPT, it is possible that OpenAI could unintentionally infringe on someone’s copyright, and both OpenAI’s and the user’s actions may be deemed copyright infringement by a court. If a user puts ChatGPT’s output to commercial use, the likelihood of a court viewing that output as infringing is higher.
While AI vendors are taking precautions to avoid infringing on copyright, the very fact that the AI makes patterns out of the copyrighted data fed to it leads to the possibility of it producing content that is not distinct enough to be considered “transformative”.
In The Andy Warhol Foundation for the Visual Arts, Inc. versus Goldsmith (2021), the United States Second Circuit Court of Appeals determined that works of art created by American visual artist, film director and producer Andy Warhol based on a photograph of American singer, songwriter, musician, and record producer Prince were not transformative. Simply adding a new aesthetic or expression to source material is not enough to be deemed transformative, the court held; the work must demonstrate a distinct artistic purpose and significantly alter or obscure the original to be considered transformative.
Why is copyright compliance a major hurdle for generative AI companies?
Complying with copyright laws poses a real challenge for AI companies, particularly with regard to generative AI. These systems learn from the data they are fed, which can lead to complexities if the original author requests that their creation be deleted from the AI’s database.
Unlike a conventional computer resource, from which a file can simply be deleted, a generative AI does not retain its training data as retrievable records; it processes that data into patterns that guide its future output. It is therefore extremely difficult, if not impossible, to alter or delete information that the AI has already learned, and retraining a model from scratch to exclude particular works would demand substantial resources and effort.
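To see why deletion is not straightforward, consider a deliberately simplified sketch (a hypothetical illustration in Python using numpy, not a depiction of how any commercial AI model actually works): once a model is trained, only its learned parameters remain, so honouring a request to remove one training example effectively means retraining on whatever data is left.

```python
# Minimal illustration: a trained model keeps only the parameters distilled
# from its training data, not the data itself, so "removing" one training
# example means refitting on the remaining data.
import numpy as np

# Toy training set: inputs and the outputs the model should predict.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])

# "Training" distils the five data points into just two numbers
# (slope and intercept); the original points are not stored anywhere
# inside the model.
slope, intercept = np.polyfit(x, y, deg=1)
print("learned parameters:", slope, intercept)

# If the owner of the third data point asks for it to be removed, there is
# nothing to delete inside the parameters themselves; the only option is
# to retrain without that point.
mask = np.arange(len(x)) != 2
slope2, intercept2 = np.polyfit(x[mask], y[mask], deg=1)
print("parameters after retraining without point 3:", slope2, intercept2)
```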
Moreover, it is argued that the strength of generative AI lies in its capability to learn from diverse sources, including the styles of artists. AI vendors contend that excessively stringent regulations would have negative impacts on this emerging technology and significantly hinder innovative research in AI.
However, providing indiscriminate immunity from copyright laws in the interest of supporting a developing technology would not be appropriate. It is important to carefully consider and preserve the rights of copyright holders as well.
What is the way forward?
As AI technology presents significant economic potential, many countries will look towards easing restrictions on text and data mining to support domestic AI development. However, current intellectual property laws must be updated to keep pace with rapidly advancing AI.
This can be achieved through implementing data usage and governance policies for AI projects that include oversight and compliance mechanisms. For instance, the Information Technology (Intermediary Guidelines and Digital Media Ethics Code) Rules, 2021 mandate that intermediaries designate compliance officers and submit monthly compliance reports. A comparable approach could be taken in the AI domain, whereby AI firms are required to appoint compliance officers responsible for overseeing and enforcing copyright protection, along with conducting frequent audits and assessments. Companies could also be required to obtain permission from copyright owners before using their works in AI training programmes. This approach would balance the protection of the rights of copyright owners with the continued growth and development of AI research.
To tackle the issue of identifying copyrighted material within massive data collections, AI tools could themselves be employed to detect and investigate copyright infringements. A 2022 report by the European Union Intellectual Property Office on how AI affects copyright and design infringement and enforcement suggests that different forms of AI can be employed to enforce intellectual property rights. The study notes that machine learning can be used to enforce copyright and design rights by processing vast amounts of data to identify risks, detect social engineering bots, and uncover infringement patterns, among other opportunities.
Generative AI has immense potential in the current digital economy and is expected to bring significant changes to some industries while offering growth opportunities in others. The technology is not a passing phase: it has practical applications and is already being put to commercial use.
It is imperative for developing countries like India, which have a substantial and rapidly expanding user base for AI tools, to take proactive measures and enact regulations to address the complexities arising from the integration of AI in the creative field while safeguarding the interests of human artists. Ultimately, maintaining a balance between preserving human creativity and promoting the growth of AI in the creative sector is crucial for the successful integration of this potentially revolutionary technology.