A few weeks ago, I started seeing a bunch of Instagram posts from authors who’d just learned, essentially, that their books had been used to teach AI how to imitate their writing. But before I get to that, you’ll need to wrap your head around a bit of technological arcana, so bear with me.
There’s an online dataset called Books3. It’s part of a much larger dataset known as the Pile, which was created by the nonprofit AI research group EleutherAI to train large generative language models. (OpenAI’s ChatGPT, Google’s PaLM, Meta’s LLaMa, etc.) The precursor to Books3 was a huge pirated collection of eBooks—possibly close to 200,000 in all—published mostly within the past two decades, including titles by world-famous best-selling authors. In 2020, an independent developer named Shawn Presser converted all of these eBooks to plain text, and, voila: Books3. “Suppose you wanted to train a world-class GPT model, just like OpenAI. How? You have no data,” Presser tweeted at the time. “Now you do. Now everyone does.… You now have OpenAI-grade training data at your fingertips…a direct download to 200k plaintext books.”
Books3 is one of 22 datasets that together make up the Pile, which began circulating among AI developers and programmer types shortly after Books3 emerged. This past summer, a guy named Alex Reisner decided to give the Pile a spin. A longtime independent developer with a couple of CTO titles on his résumé, Reisner was now making a go of it as a freelance writer meets programmer meets tech consultant. Curious about the secretive nature of generative AI development, and in particular about the kinds of books being used to train these large language models, Reisner examined Books3 to see what he could find. One of the first things to catch his eye was a book by Ta-Nehisi Coates, he later explained on LitHub’s Fiction/Non/Fiction podcast. “Ok,” Reisner thought, “there’s real, recent, important books in here. This isn’t just, kind of, self-published or out-of-copyright stuff.”
Reisner decided to try and identify everything stored in the Books3 trove. He wrote a program to detect ISBN numbers and used an ISBN database to look up titles, authors, publishers, and languages—whatever metadata he could scrape. Then he called The Atlantic and pitched a story. “The Atlantic is a somewhat political publication,” he told Fiction/Non/Fiction, “and this felt like a somewhat political finding.”
On August 19, The Atlantic published an article, billed as the first in a series about Books3, in which Reisner confirmed what the author community already suspected: “Pirated books are being used as inputs for computer programs that are changing how we read, learn, and communicate. The future promised by AI is written with stolen words.” Books3, Reisner revealed, included works by the likes of Michael Pollan, Rebecca Solnit, and Jon Krakauer; George Saunders, Zadie Smith, and Junot Díaz; James Patterson and Stephen King. “To add insult to injury,” wrote Margaret Atwood in a separate piece for The Atlantic, “the bot is being trained on pirated copies of my books. Now, really! How cheap is that? Would it kill these companies to shell out the measly price of 33 books? They intend to make a lot of money off the entities they have reared and fattened on my words, so they could at least buy me a coffee.” (King’s take was a bit more meh: “Does it make me nervous? Do I feel my territory encroached upon? Not yet.”)
Bringing his article to a close, Reisner argued, “Control is more essential than ever, now that intellectual property is digital and flows from person to person as bytes through airwaves. A culture of piracy has existed since the early days of the internet, and in a sense, AI developers are doing something that’s come to seem natural. It is uncomfortably apt that today’s flagship technology is powered by mass theft.”
A few weeks later, on September 25, a second article from Reisner went live: “These 183,000 Books Are Fueling the Biggest Fight in Publishing and Tech.” This time The Atlantic included a tool for anyone to “look up authors…and see which of their titles are included.” Which brings me back to the anecdote I began with: As The Atlantic’s searchable database made the rounds online, authors who’d discovered they were included in Books3 started posting on social media with varying degrees of anger and incredulity.
“I would never have consented for Meta to train AI on any of my books, let alone five of them,” tweeted the three-time National Book Award finalist Lauren Groff. “Hyperventilating.”
“I put so much thought, attention, and time into each word, each semicolon, of my fiction,” the novelist R.O. Kwon told The Washington Post. “I don’t quite believe in souls, but if I did, a novel is as close to being a sizable piece of my soul as anything could be.”
In my Instagram feed, I scrolled past a particularly animated missive from Mary H.K. Choi, who’d just been through the wringer with the Hollywood writers strike. “As someone for whom ‘voice’ is paramount,” Choi lamented to her 27,200-something followers, “I’m completely gutted and whipsawed. I am outraged and at the same time feel utterly helpless. I feel sheepish as though I’m fist shaking at the internets or machinery and I’ve often made the joke that after Armageddon they’ll bring writers back after soul cycle instructors but I’m scared. I’m furious and want to fight but I’m also so tired.”
Back in June, I published an article about how media organizations were collectively grappling with the threat posed by AI to their already sufficiently threatened business models. In the early days of the web, many had started giving their content away for free online, a fatal decision that led to their lunch being gobbled up by tech giants like Google and Facebook. Now, as AI bots become frighteningly efficient at mining content produced by humans in order to mimic humans, news publishers wanted to “make sure they don’t get screwed again.” CEOs like Robert Thomson, Barry Diller, and others had begun speaking out about the need to “be more collectively assertive,” as Thomson put it, “in haggling for the values and virtues of journalism”; or to stand up to AI-makers and declare, in Diller’s words, “You cannot scrape our content…and use it in real time to actually cannibalize everything.”
In the four months since I wrote that story, AI has faced mounting pressure from the content community (for lack of a better umbrella term). Screenwriters have secured robust guardrails against the use of AI in Hollywood productions, a key rallying cry in the recent Writers Guild of America strike. Executives from dozens of print and digital media outlets—some of which have already started blocking AI crawlers from their websites—recently swarmed Capitol Hill to lobby lawmakers for AI protections, and representatives from various creative fields brought their concerns to the Federal Trade Commission during a roundtable on the “Creative Economy and Generative AI.” Companies like Axel Springer, Diller’s IAC, and The New York Times have reportedly explored legal action, and AI platforms have been in conversations with content providers who want to extract licensing payments.
“What you’ll see over time is a lot of litigation. Some media companies have already begun those discussions,” News Corp’s Thomson said last month at the Goldman Sachs Communacopia + Technology Conference. “Personally, we’re not interested in that at this stage. We’re much more interested in negotiation. We have various negotiations going on.”
The most aggressive action so far has come from the book world. Just days before The Atlantic’s searchable Books3 database started making noise, the Authors Guild and a group of individual authors filed a class action copyright lawsuit in the Southern District of New York against OpenAI, which is seen as the leader in generative AI development. It was the latest in a series of class action suits brought by authors, such as Michael Chabon and Sarah Silverman, claiming that companies like OpenAI and Meta had used unauthorized and pirated copies of their books to train AI models. The Authors Guild lawsuit was the most high-profile and attention-grabbing of the bunch, landing with a New York Times story and the participation of 17 name-brand novelists, including a murderers’ row of mass-market A-listers like David Baldacci, Michael Connelly, John Grisham, and George R.R. Martin.
“I’m very happy to be part of this effort to nudge the tech world to make good on its frequent declarations that it is on the side of creativity,” George Saunders said in an accompanying statement. (He didn’t have anything to add when we exchanged emails last week.) “Writers should be fairly compensated for their work. Fair compensation means that a person’s work is valued, plain and simple. This, in turn, tells the culture what to think of that work and the people who do it. And the work of the writer—the human imagination, struggling with reality, trying to discern virtue and responsibility within it—is essential to a functioning democracy.”
Another one of the plaintiffs, Douglas Preston, expressed similar sentiments during the FTC roundtable on October 4. “Many authors were discovering that ChatGPT-3 knew everything about their books,” he said, “and some realized it was even being used to create works that imitated their own. My friend George R.R. Martin…was very disturbed when AI was used to write the last book in his Game of Thrones series using his characters, his plot lines, his settings—even his voice.… AI developers are swallowing everything they can get their hands on without regard to copyright ownership, intellectual property rights, or moral rights. And they’re doing this without the slightest consideration given to supporting the livelihood of America’s creative class.”
Initially, the Authors Guild didn’t intend to bring a lawsuit. Like others, they’d been in conversations with OpenAI about coming to terms on some sort of licensing agreement. “Very open, good-faith conversations,” says guild CEO Mary Rasenberger. (Those conversations are now suspended while the litigation plays out.)
Additionally, guild representatives had gone down to Washington to meet with congressional staffers and drum up support among lawmakers. But by late summer, according to Rasenberger, “it became kind of clear that we weren’t gonna get the legislation we needed anytime soon.” At the same time, “we started seeing more evidence”—such as Reisner’s Books3 reportage—“of the use of AI to mimic authors’ works,” she says. “Things like sequels to the George R.R. Martin books that he hasn’t yet written. Jane Friedman found five books purported to be written by her that were, presumably, AI-generated. We were starting to see a lot of AI-generated books on Amazon, and authors were getting upset. Creators are feeling an existential threat to their profession, so there’s a feeling of urgency.”
Rasenberger tells me there were more than 100 authors who wanted to join the suit; the 17 named plaintiffs were selected on a first-come, first-served basis. They sued OpenAI in particular, she says, because OpenAI has “been out there the longest and they’re the most used, for now. We’re seeing the most use cases.” I asked if the guild would like to see news publishers join the legal fight. “Definitely. I mean, we all want to get to the same place, which is licensed use of content.”
The copyright issues pertaining to AI are still murky. OpenAI has moved to dismiss one of the other suits, filed by authors Mona Awad and Paul Tremblay, on the grounds that ChatGPT is protected under the fair use doctrine, and that “the fair use of a copyrighted work…is not an infringement of copyright.”
Some copyright experts agree. “Copyright law does not, and should not, recognize computer systems as authors,” Emory University law school professor Matthew Sag testified in July before a Senate judiciary subcommittee. “Training generative AI on copyrighted works is usually fair use because it falls into the category of nonexpressive. Courts addressing technologies such as reverse engineering, search engines, and plagiarism detection software, have held that these ‘nonexpressive uses’ are fair use.” Others, however, believe the authors have a strong case against the AI platforms. “They’ve scraped all this content and put it into their databases without asking permission—that seems like a huge grab of content,” Edward Klaris, an intellectual property lawyer, told the Times. “I think courts are going to say that copying into the database is an infringement in itself.”
On August 30, the US Copyright Office issued a “notice of inquiry” that it is “conducting a study regarding the copyright issues raised by generative artificial intelligence (AI). This study will collect factual information and policy views relevant to copyright law and policy. The Office will use this information to analyze the current state of the law, identify unresolved issues, and evaluate potential areas for congressional action.” Input from organizations like the Authors Guild is due by October 30.
OpenAI, for its part, gave me the following statement through a spokeswoman: “Creative professionals around the world use ChatGPT as a part of their creative process. We respect the rights of writers and authors, and believe they should benefit from AI technology. We’re having productive conversations with many creators around the world, including the Authors Guild, and have been working cooperatively to understand and discuss their concerns about AI. We’re optimistic we will continue to find mutually beneficial ways to work together to help people utilize new technology in a rich content ecosystem.”
Mutually beneficial or not, the bottom line is that the guild believes authors should have control over whether and how their works are used by generative AI, and that authors should be able to receive compensation. How would that work in practice?
“We want to build a collective licensing system, and we’re in the middle of discussions with a couple different entities about that,” says Rasenberger, who is herself an attorney with copyright expertise. “Like, you would go into a system and say, yes, I want to license my work. I’ll accept whatever the suggested amount is—or I won’t accept that—and I will allow these uses.” Surely, publishers would want a say? “We’ve spoken to the Big Five”—Penguin Random House, Hachette, HarperCollins, Macmillan, Simon & Schuster—“just to let them know we’re doing this.… We actually have a model clause [for book contracts] that we’ve been shopping around to publishers, and agents have been requesting it. This is something to be worked out: Who owns the rights? But what I have suggested to the Big Five publishers—and nobody so far has taken issue with it—is that the rights could be split 50-50.”
In the meantime, Rasenberger hopes that a critical mass continues to build around this issue. She said the guild is in regular communication with like-minded groups such as the News/Media Alliance, which organized the recent lobbying blitz on the Hill. (Vanity Fair is a member of the alliance through Condé Nast.) She’s also closely following the other existing and potential legal actions. “We’ve got to attack this problem from all directions,” she tells me. “We have to establish the precedent that creative works—commercially, professionally created works—need to be licensed.”