Anthropic trashed millions of books to train its AI

Anthropic physically scanned millions of print books to train its AI assistant, Claude, and then discarded the originals, according to court documents reported by Ars Technica. The operation, detailed in a legal decision, involved buying the books and digitizing them destructively. The company’s approach to data acquisition reflects a broader industry demand for high-quality text.

Anthropic hired Tom Turvey, formerly the head of partnerships for Google Books, in February 2024. His mandate was to procure “all the books in the world” for the company. The hire was meant to replicate Google’s legally validated book digitization strategy, which had survived copyright challenges and established fair use precedents. Destructive scanning is common in smaller-scale operations, but Anthropic carried it out on a massive scale. For the company, the speed and lower cost of the destructive process outweighed the need to preserve the physical books.

Judge William Alsup ruled this destructive scanning operation constituted fair use. This determination was contingent on several factors: Anthropic legally purchased the books, destroyed each print copy post-scanning, and maintained the digital files internally without distribution. The judge analogized the process to “conserv[ing] space” through format conversion, deeming it transformative. Had this method been consistently applied from the outset, it might have established the first legally sanctioned instance of AI fair use. However, Anthropic’s earlier use of pirated material undermined its initial legal standing.

The AI industry’s demand for high-quality text is a fundamental driver behind these data acquisition strategies. Large language models (LLMs), such as those powering Claude and ChatGPT, are trained by feeding billions of words into neural networks. During training, the AI system processes the text repeatedly, establishing statistical relationships between words and concepts. The quality of the training data directly influences the capabilities of the resulting AI model. Models trained on well-edited books and articles generally produce more coherent and accurate responses than those trained on lower-quality text sources.

Publishers retain legal control over content that AI companies seek for training purposes, and negotiating licenses for this content can be complex and time-consuming. The first-sale doctrine provided a legal workaround for Anthropic: once a physical book is purchased, the buyer can do what they like with that specific copy, including destroying it. This principle allowed Anthropic to acquire physical books legally without direct licensing negotiations. Although legal, the procurement of physical books represented a substantial financial outlay.

Initially, Anthropic opted to use digitized versions of pirated books to acquire high-quality training data, a strategy chosen to avoid what CEO Dario Amodei termed the “legal/practice/business slog” of complex licensing negotiations. By 2024, however, Anthropic had become “not so gung ho about” utilizing pirated ebooks due to “legal reasons,” necessitating a more secure source of data. Purchasing used physical books offered a method to bypass licensing issues entirely while providing the professionally edited text essential for AI model training. Destructive scanning facilitated the rapid digitization of millions of volumes.

Anthropic invested “many millions of dollars” in this book buying and scanning operation. The company often acquired used books in bulk. The process involved stripping books from their bindings, cutting pages to workable dimensions, and scanning them as stacks of pages into PDFs. These PDFs included machine-readable text and covers. All paper originals were subsequently discarded. Court documents do not indicate that any rare books were destroyed, as Anthropic procured its books in bulk from major retailers. Other methods exist for extracting information from paper while preserving the physical documents; for example, the Internet Archive developed non-destructive book scanning techniques that maintain the integrity of physical volumes while creating digital copies.

In a related development, OpenAI and Microsoft announced a collaboration with Harvard’s libraries to train AI models using nearly 1 million public domain books, some dating back to the 15th century. These books have been fully digitized, and the physical originals are preserved.

