Training, Not Copying: Why AI Training Should Qualify as Fair Dealing in India

The rapid ascent of Generative Artificial Intelligence (GenAI) has triggered a global legal reckoning. From Hollywood writers striking against digital replicas to the rise of anti-AI campaigns, the tension between human creativity and machine learning is palpable. In India, this clash between copyright law and AI model training has reached the Delhi High Court in the landmark case of ANI Media Pvt. Ltd. v. OpenAI Inc.[1]
At the heart of this dispute lies a fundamental question: Does “training” an AI model on publicly accessible data constitute copyright infringement, or is it a legitimate, non-infringing act of learning?
While content creators argue that ingesting their work is theft, a closer look at the technical mechanics of AI and the legal principles of fair dealing under the Copyright Act, 1957, suggests a different reality. Treating the training of AI models as mere “copying” may overlook the unique functional nature of this technology. This piece will explore the possible legal grounds for viewing AI training as a transformative activity and consider how such an interpretation could support India’s broader technological ambitions.
1. The Mechanics: Learning Patterns, Not Plagiarising Plots
To understand the legal argument, one must first grasp the technological reality. When an AI model like ChatGPT or a vision model like Midjourney is “trained,” it is not creating a collage of stored JPEGs or PDFs.
As explained in tech-law literature, GenAI models use Machine Learning (ML) and Deep Learning (DL) to analyse vast datasets. The goal is not to memorise the specific expression of a work (the element copyright protects) but to learn the underlying statistical relationships, patterns, and latent features.
Think of it this way: If a human student reads every mystery novel by Agatha Christie, they learn the “rules” of the genre, the red herrings, the pacing and the vocabulary of suspense. If that student writes a new mystery novel using those learned principles, we would call it inspiration, not infringement.
AI training does this on a massive scale. It converts text and images into mathematical "tokens" and adjusts billions of parameters to predict the next word in a sentence or the next pixel in an image. The original work is not stored in the final model; only the abstract mathematical weights remain. Under the principle established in R.G. Anand v. Delux Films, copyright protects the specific expression of an idea rather than the idea or technique itself. Applied in the present context, this principle suggests that extracting non-expressive patterns for computational analysis should be viewed as a legitimate use rather than an infringement.
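To make the "patterns, not plots" point concrete, consider a toy bigram language model — a deliberately simplified Python sketch of our own, orders of magnitude simpler than a production LLM. After training, only word-pair frequency counts survive; the running text itself is discarded:

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus: str) -> dict:
    """Learn which word tends to follow which.

    Only statistical counts of word pairs are retained --
    a toy analogue of how training distils a corpus into
    abstract numerical parameters rather than stored text.
    """
    tokens = corpus.lower().split()
    model = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        model[prev][nxt] += 1
    return model


def predict_next(model: dict, word: str) -> str:
    """Return the statistically most likely next word, or '' if unseen."""
    followers = model.get(word.lower())
    return followers.most_common(1)[0][0] if followers else ""


model = train_bigram_model(
    "the detective examined the clue and the detective solved the case"
)
print(predict_next(model, "the"))  # prints "detective"
```

The model cannot recite the sentence it was trained on; it can only say that "detective" is the likeliest word after "the". Real LLMs compress their patterns into billions of continuous weights rather than explicit counts, but the legal intuition — statistics in, no stored expression out — is the same.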
2. The “Fair Dealing” Defence: A Flexible Shield
India’s copyright regime is governed by the Copyright Act, 1957. While the Act grants exclusive rights to creators, it additionally carves out exceptions under Section 52, also known as fair dealing.
2.1 The Research Exception
Section 52(1)(a) allows fair dealing with any work for the purposes of "private or personal use, including research." The process of training an AI model may be characterised as research, since it involves systematically analysing data to build a functional system capable of generating new content.
Critics often argue that because companies like OpenAI or Google are commercial entities, they cannot claim "research." However, Indian courts have not strictly barred commercial entities from this defence where the use is transformative and does not substitute for the original work in its market. The training phase, where the machine is "learning", is distinct from the commercial deployment of the tool.
2.2 The Doctrine of “Non-Expressive Use”
While "Fair Use" is a US doctrine enshrined in 17 U.S.C. § 107 and differs from fair dealing under Indian copyright law, Indian courts often look to international jurisprudence for guidance on novel issues. A critical concept here is non-expressive use. In the US cases of Authors Guild v. Google and Authors Guild, Inc. v. HathiTrust, courts ruled that digitising millions of books to create a searchable database, which did not display the full text to users, was fair use.
This was because the purpose was functional, not expressive. The computer was not "reading" the books for entertainment; it was analysing them to create a search index. Similarly, AI training treats copyrighted works purely as data inputs. The model "reads" a news article not to enjoy the journalism, but to understand the syntax of a news report. This "transformative" shift from expressive content to functional data is a strong argument for applying fair dealing principles.
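The search-index idea at the heart of Authors Guild v. Google can be illustrated with a minimal inverted index — a simplified Python sketch of our own, not Google's actual system. The index records where words occur but reproduces none of the expressive text:

```python
from collections import defaultdict


def build_index(documents: dict) -> dict:
    """Map each word to the set of documents that contain it.

    The output is functional metadata about the works --
    the locations of words -- not the protected expression itself.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


library = {
    "article_a": "the court ruled on fair dealing",
    "article_b": "training data and fair use in AI",
}
index = build_index(library)
print(sorted(index["fair"]))  # prints ['article_a', 'article_b']
```

A user of such an index learns which works contain a term, not what those works say — the non-expressive use the Authors Guild courts found fair.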
3. The Input vs. Output Distinction
A recurring error in the current debate is the conflation of the Input (training) with the Output (generation).
The Input (Training) stage involves making intermediate, often transient, digital copies in order to analyse patterns. The Output (Generation) stage, by contrast, is where the AI produces new content. If an AI generates a specific copyrighted article verbatim, that may amount to infringement. However, that is a failure of the specific output, not a conclusion regarding the legality of the training process itself.
In Kadrey v. Meta, the US District Court ruled that training AI on books was fair use because the system learned patterns rather than copying expression, and no real market harm was shown. Similarly, in the summary judgment in Bartz v. Anthropic, the court held that using copyrighted works to train generative AI models is a highly transformative use, as the AI learns statistical patterns rather than reproducing the original expression.
4. Transaction Costs: A Challenge
From a policy perspective, requiring licenses for AI training data can be economically disastrous.
We are entering an era in which the internet is projected to encompass 175 zettabytes of information by 2025. Research into AI scaling laws shows a direct relationship between the volume of training data and a model's performance. To ensure broad knowledge and minimise errors, modern systems require massive, diverse datasets comprising trillions of tokens.
Imagine if an AI developer in India had to track down, negotiate, and pay a license fee to every single blogger, tweeter, photographer, and journalist whose work appears in a training dataset. The transaction costs would be insurmountable. It would effectively kill the Indian AI startup ecosystem, leaving the field open only to massive tech giants who can afford blanket licenses or who operate in jurisdictions with looser laws.
As noted in the DPIIT Working Paper on Generative AI and Copyright (“DPIIT Working Paper”), a model of voluntary licensing for AI training is impractical due to the sheer volume of data required. A “permission-free access” model for lawful content is essential to prevent data monopolies and ensure a level playing field for Indian startups.
5. Global Momentum: The “Fair Learning” Standard
India is not operating in a vacuum. Other innovation-forward nations are already modifying their laws to clarify that AI training is permitted; this provides a useful comparative context for India.
• Japan: Japan has adopted perhaps the most AI-friendly copyright laws globally. Article 30-4 of its Copyright Act explicitly permits the use of copyrighted works for machine learning, on the reasoning that data analysis, as opposed to "enjoyment" of the work, does not harm the interests of the copyright owner.
• Singapore: Section 244 of Singapore's new Copyright Act introduces a text-and-data mining (TDM) exception that permits making copies of a copyrighted work for computational data analysis, and for preparing the work for such analysis, including for commercial use, recognising that data analysis is key to the digital economy.
• United States: While litigation is ongoing, recent rulings such as Bartz v. Anthropic have indicated that using copyrighted works to train LLMs may be considered fair use because the technology is highly transformative.
If India adopts a restrictive approach, forcing AI companies to delete data or pay exorbitant fees, it risks a "brain drain": AI developers moving to countries with friendlier rules, leaving India a mere buyer of foreign AI instead of a leader in building it.
6. Addressing the “Theft” Narrative
Critics argue that AI companies are “unjustly enriching” themselves off the labour of human creators. They fear that AI will act as a market substitute, potentially destroying the livelihood of journalists and artists.
While this is a valid concern, copyright law is not designed to address competitive harm. Copyright prevents reproduction, not competition. If a human reads a newspaper every day and starts a rival news blog, they are competing, but they are not infringing copyright (provided they write their own stories). AI does the same, albeit at superhuman scale and speed.
The answer to the potential economic displacement of creators lies in market solutions or sui generis legislation (such as a levy system), not in distorting copyright law to ban the act of "learning." As discussed in the DPIIT Working Paper, a Hybrid Model involving statutory remuneration might be a middle ground, but a blanket ban on training, or labelling it as infringement, is legally unsound and economically regressive.
Conclusion
Ultimately, the right to train an AI is the digital equivalent of the right to read. If we establish a legal precedent that analysing a work to learn its patterns is infringement, we break the fundamental contract of copyright: creators get protection for their expression, but the ideas and knowledge contained within become part of the public commons for others (and now, machines) to learn from.
For India to achieve its IndiaAI Mission and secure its place as a leader in the global digital economy, it must interpret Section 52 to include AI training as fair dealing. We must protect creators from plagiarism (output infringement), but we must not lock away human knowledge from the tools that have the potential to solve our greatest challenges.
Training is not copying. It is learning. And in a progressive society, learning, by man or machine, must remain free.
[1] ANI Media Pvt. Ltd. v. OpenAI Inc. & Anr., CS(COMM) 1028/2024 (Del. HC).
Authors: Pulkit Verma and Aman Mishra, second-year law students at ILC, Faculty of Law, University of Delhi.
Editors: Prisha Mehta and Sameer Kashyap



