By Adithi Iyer
The legal world is atwitter with the developing artificial intelligence (“AI”) copyright cage match between The New York Times and OpenAI. The Times filed its complaint in Manhattan Federal District Court on December 27, accusing OpenAI of unlawfully using its (copyrighted and paywalled) articles to train ChatGPT. OpenAI, in turn, published a sharply worded response on January 8, claiming that its incorporation of the material for training purposes squarely constitutes fair use. This follows ongoing suits by authors against OpenAI on similar grounds, but the titanic scale of the Times-OpenAI dispute and its application of these issues to media in federal litigation make it one to watch. While much of the buzz around the case has centered on its intellectual property and First Amendment implications, there may also be implications for the health and biotech industries. Here’s a rundown of the major legal questions at play and the health-related stakes of a future decision.
The Heart of the Case
Reportedly, the Times-OpenAI case comes after the breakdown of several attempts to negotiate a licensing deal between the two parties. The Times characterizes OpenAI’s use of its material for both ChatGPT and Copilot (an AI-powered digital assistant that is poised to transform business suite operations) as “unlawful,” “wide-scale copying,” but OpenAI contends that this is simply “training.” This dispute centers around fair use, essentially a multifactor test to determine whether one’s use of a copyrighted work is sufficiently “transformative” so as not to infringe on the original author’s copyright. If a final decision in this case comes out in favor of the Times, that precedent would likely erode the concept of open-environment training altogether. The resulting closed environments would limit AI developers to training their models only on sources that copyright owners have licensed to them. While this alternative would compensate authors for the use, either through licensing deals or legal damages, it may also functionally limit the types of models available, their levels of sophistication, and the kinds of entities that can produce them in the first place. Ultimately, OpenAI argues that these effects would limit progress in the space, both for OpenAI and for the tools that follow.
The Times adds color to its infringement argument by emphasizing ChatGPT’s “regurgitation” of copyrighted Times content in its responses to certain prompts. The Complaint provides an example of GPT-4 spitting out a near recreation of a 2019 Times article from what it calls “minimal prompting” and calls the output on both the Copilot and ChatGPT interfaces an “unauthorized public display” of the work. OpenAI’s response characterizes this regurgitative function as a “rare bug” that it is actively seeking to eliminate. The case itself has generated considerable discussion across the AI/ML communities, raising new and different questions in tandem. ML pioneer Andrew Ng drew eyes with his own response to the case, aligning with OpenAI and presupposing an almost humanlike quality for AI (which itself is a yet-unsettled legal question). But the center of this story is the usability of protected material for training purposes. And because training is a central component of AI, especially generative AI, the stakes of this case are high.
Data Training Debates for Health and Biotech AI
This debate on training data will almost certainly trickle into the health and biotech spaces, which may feel its effects quite profoundly. Indeed, the application of AI to the medical, pharmaceutical, and technology spaces is at least in large part, if not entirely, reliant on the use of copyrighted or otherwise protected materials to power tools that perform myriad functions from bench to bedside. In a field built on discovery, recency, and thoroughness, and for a private sector built on the protectability of these discoveries, IP regimes like copyright are at the center of our collective medical consciousness. Consider our knowledge base: medical academia and industry research power both the professional and practical aspects of health/care, and the research articles that inspire the next big discoveries are copyrighted by the journals that publish them. These medical journals operate under multiple models, but nearly all of the most well-known, highest impact factor journals are for-profit, paywalled entities much like the Times. Any resolution of the question of access to this material for training purposes would undoubtedly carry over to these entities. There are, of course, well-known open-access journals, like PLOS One, but for functions like literature reviews and other medical research, comprehensiveness is critical, and open-access sources alone cannot supply it.
For AI applications in biotech and pharmaceuticals—namely high-powered drug discovery models that are already making waves—the implications of training data limitations may still appear. While pharmaceutical data might differ from copyrighted medical research articles, the incentive structures that underlie a competitive licensing landscape for fully compensated training data may carry over to this space. Availability of small molecule, protein, and other databases for training drug discovery models may operate on a de facto tiered system, where the most comprehensive models could be accompanied by exorbitant pricing and exclusive licensing. And, of course, database creators may seek copyright protection for their compiled sets to capitalize on a forced-compensation world. Copyright in certain health-related realms can be a bit thorny. For example, if natural DNA is not copyrightable, should DNA databases be copyrighted as compilations? This potential reality might compromise the growing field of Deep Genomics, and several other applications of nucleotide-powered AI.
With that said, these copyright-driven effects on training data should be separated from the notion of “closing” the training environment—that is, limiting the pool of sources on which models are trained. Closing training environments for medical-grade AI tools is necessary to a certain point in a world where internet health misinformation is at an all-time high. This applies to tools that provide informational resources, like high-powered medical research and literature review assistants, as well as decision-making diagnostic algorithms, and even robotic surgery and patient care tools. The question, of course, is where to draw the line. In the case of healthcare and medical AI, access to copyright-eligible, but vital, training data might count as an exception to general paywalls or licensing requirements. Especially as many high-powered drug discovery training datasets are publicly available, the Times-OpenAI copyright case could well turn into an open science debate.
The Voices we Amplify—and Protect
The academic angle of copyright issues in information sharing feeds into the conversation around open science and information sharing for advancement more generally. The open science movement envisions a world in which knowledge and information move freely with access for all. It arises in response to underlying waves of systematic exclusion in academia that have resulted in the underrepresentation of voices from underserved communities and of health and medical research/ers from the Global South. Certainly, limitations on fair use for training purposes could perpetuate not just this underrepresentation but other kinds as well, such as that of smaller-scale startups lacking the capital to access certain training content for new discoveries. At the same time, copyright and other protections offer some kind of recourse against the type of regurgitative output issues and bugs that could just as easily perpetuate unknowing plagiarism, a news cycle buzzword as of late. In doing so, they protect vulnerable authors and creators, incentivizing continued innovation.
The line-drawing exercise in the Times-OpenAI case is far from straightforward, as are its implications for health and biotech. Hopefully, the forthcoming litigation will sharpen these issues and considerations into salient arguments that establish the need to protect both open access to knowledge and the authors who create it, with perhaps a delicate balance emerging from the thicket.