6 min read | Saved February 14, 2026
Do you care about this?
Common Crawl has been scraping the internet for over a decade, creating a vast archive of webpages that AI companies use to train language models. Despite claiming to collect only freely available content, the organization has allegedly included paywalled articles in its archives and misled publishers about the status of removal requests. This practice raises significant concerns about copyright and the ethics of using journalistic work without compensation.
If you do, here's more
Common Crawl, a nonprofit organization that archives vast amounts of web content, has come under fire for scraping paywalled articles from major news sites and providing this data to AI developers. Founded over a decade ago, Common Crawl claims to collect only freely available content. However, its archives include millions of articles from prominent publications like The New York Times and The Wall Street Journal, enabling AI companies such as OpenAI and Google to train their language models on high-quality journalism without compensating the original creators.
Rich Skrenta, Common Crawl's executive director, defends the practice, suggesting that publishers are wrong to try to restrict their content's availability. He argues that if publishers post their work online, they shouldn't be surprised when others make use of it. And despite the organization's claims that it complies with removal requests from publishers, evidence indicates that Common Crawl has not effectively deleted the requested articles. For instance, The New York Times requested the removal of its content in July 2023, yet many of its articles remain in Common Crawl's archives, which have not seen significant modifications since 2016.
The organization's scraping technology bypasses many paywall systems, allowing it to capture full articles before access restrictions take effect; many news paywalls are enforced client-side, with JavaScript hiding the full article text after the page loads, so a crawler that simply fetches the raw HTML often receives the complete article. This has led to increasing tension between Common Crawl and news publishers, and some publishers have responded by blocking its scrapers. However, this only prevents future scraping and does not affect already-archived content. Reports from various publishers reveal frustration over the lack of timely compliance with removal requests, with Common Crawl often providing vague updates on the status of deletions. The ongoing situation highlights the complex interplay between data access, copyright, and the rapidly evolving landscape of AI development.
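The blocking described above is typically done through a site's robots.txt file, which Common Crawl says its crawler (user agent "CCBot") obeys. As a minimal sketch of how such a rule works, the snippet below uses Python's standard-library robots.txt parser against a hypothetical robots.txt of the kind publishers have deployed; the rules and URL are illustrative, not taken from any specific site. As the article notes, this only stops future crawls, not what is already archived.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of the kind publishers use to block
# Common Crawl's crawler going forward: refuse CCBot everywhere,
# allow everyone else.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

article_url = "https://example.com/2023/07/some-article.html"

# CCBot is refused, while an ordinary browser-like agent is not.
print(parser.can_fetch("CCBot", article_url))        # False
print(parser.can_fetch("Mozilla/5.0", article_url))  # True
```

Note that robots.txt is purely advisory: it relies on the crawler choosing to honor it, which is why publishers who want paywalled content kept out of training data have also turned to removal requests and legal pressure.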