When content scraping forces bloggers to become their own digital immune system

Small blogging platforms are being knocked offline — not by traffic surges from readers, but by AI bots. Independent sites like Bear, RationalWiki, and solo-run WordPress blogs have all reported outages or severe performance degradation caused by automated crawlers scraping their content to feed large language models. The bots arrive in waves, consume server resources at a rate no human audience would, and leave without sending a single visitor back.

These aren’t edge cases anymore. AI bot scraping activity has surged 225% over the past year alone, and the targets increasingly aren’t just major news outlets. They’re the independent operators — the niche bloggers, the recipe creators, the solo publishers — who make up the backbone of the open web.

If you run a blog or any kind of independent publication, this is no longer a theoretical concern. It’s a practical one. And while you can’t make your site completely immune to scraping, there are concrete steps you can take to significantly reduce your exposure.

The scale of the problem is growing faster than most bloggers realise

AI bot scraping activity has increased by 225% during 2025, according to data from Akamai. That’s not a gentle upward trend. It’s an acceleration.

According to The Register, roughly 5.6 million websites have now added OpenAI’s GPTBot to the disallow list in their robots.txt file — up almost 70% since early July 2025. Anthropic’s ClaudeBot is blocked on approximately 5.8 million sites. Tollbit, a company that tracks AI crawler behaviour, reported a 336% increase in sites blocking AI crawlers over the past year.

But here’s the uncomfortable part: blocking doesn’t always work. Tollbit’s Q2 2025 report found that 13.26% of AI bot requests ignored robots.txt directives entirely, up from 3.3% in Q4 2024. By Q4 2025, 30% of all AI bot scrapes bypassed explicit robots.txt disallow rules. OpenAI’s ChatGPT-User agent was the worst offender, with 42% of its scrapes accessing content from sites that had explicitly blocked it.

For major publishers with dedicated legal and technical teams, this is a solvable problem — or at least a manageable one. For independent bloggers and small WordPress site owners, it can feel like trying to lock a door that keeps getting kicked in.

What’s actually being taken, and why it matters

When an AI crawler scrapes your blog, it’s typically doing one of three things: collecting content to train a large language model, retrieving information in real time to answer a user query (known as retrieval-augmented generation, or RAG), or indexing your content for AI-powered search.

The distinction matters. Training crawls absorb your content permanently into a model’s knowledge base. RAG bots pull your content on demand to generate responses — often without sending the user back to your site. AI search indexers catalogue your content so it can appear in summaries that replace the click-through to your actual page.

Tollbit’s data shows RAG bot activity rose 33% and AI search indexer traffic rose 59% between Q2 and Q4 2025, even as training crawls fell. The nature of scraping is shifting — from building models to powering interfaces that serve your content to users without them ever visiting your site. Click-through rates from AI tools to publisher sites fell to roughly a third of their previous level over 2025, and AI surfaces now account for an average of just 0.12% of publishers’ overall referral traffic.

If your revenue depends on page views, ad impressions, or affiliate clicks, every visitor who gets your answer from an AI summary instead of your site is a visitor you’ve lost — on content you created.

A layered approach to protection

No single measure will fully protect your content. But a multi-layered strategy can significantly reduce your exposure — and importantly, it establishes a clear legal and ethical position that you do not consent to AI scraping.

The first and most basic step is your robots.txt file. This is a simple text file in your site’s root directory that tells crawlers which parts of your site they can and cannot access. Adding disallow rules for known AI user agents — GPTBot, ChatGPT-User, CCBot, Google-Extended, Bytespider, ClaudeBot, and others — is straightforward and free.
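For illustration, a minimal robots.txt covering the agents mentioned above might look like the sketch below. Note that each rule applies only to the user agent it names, so search crawlers such as Googlebot are untouched:

    User-agent: GPTBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: Bytespider
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /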

For WordPress users, plugins like Block AI Crawlers or Bot Traffic Shield handle this automatically, maintaining updated lists of known AI crawler user agents and adding the appropriate directives to your virtual robots.txt. These require no technical expertise — install, activate, and the plugin manages the rest.

The limitation is that robots.txt is a request, not an enforcement mechanism. Well-behaved bots from established companies generally comply. Smaller or less scrupulous operations may not. That’s why the second layer matters.
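Enforcement means refusing the request outright rather than asking nicely. If you manage your own server, one sketch, assuming nginx with the rule placed inside your server block, is to return a 403 to any request whose user-agent string matches a known AI crawler (Apache users can do the equivalent with mod_rewrite rules in .htaccess):

    # Return 403 Forbidden to requests from known AI crawler user agents.
    if ($http_user_agent ~* (GPTBot|ChatGPT-User|CCBot|ClaudeBot|Google-Extended|Bytespider)) {
        return 403;
    }

This only catches bots that identify themselves honestly; a scraper spoofing a browser user agent will sail straight through, which is where network-level detection earns its keep.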

Cloudflare, which sits in front of a significant portion of the web’s traffic, now blocks AI crawlers by default on all plans — including free accounts. Their AI Audit tool shows which bots are accessing your site, lets you set granular policies, and tracks compliance. For bloggers already using Cloudflare (and many are, given it’s free and improves site performance), enabling this protection takes minutes.
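If you’re not behind Cloudflare, you can approximate that visibility yourself by checking who is actually hitting your server. The script below is an illustrative sketch, not any vendor’s tool: it assumes the standard combined log format (user agent as the final quoted field) and a hypothetical log path you’d adjust for your own host.

    #!/usr/bin/env python3
    """Tally requests from known AI crawler user agents in an access log."""
    import re
    from collections import Counter

    LOG_PATH = "/var/log/nginx/access.log"  # hypothetical path; adjust for your server

    # Non-exhaustive list of AI crawler user-agent substrings.
    AI_AGENTS = ("GPTBot", "ChatGPT-User", "CCBot", "ClaudeBot",
                 "Google-Extended", "Bytespider", "PerplexityBot")

    counts = Counter()
    with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
        for line in log:
            # In the combined log format, the user agent is the last quoted field.
            quoted = re.findall(r'"([^"]*)"', line)
            if not quoted:
                continue
            agent_string = quoted[-1]
            for agent in AI_AGENTS:
                if agent in agent_string:
                    counts[agent] += 1
                    break

    for agent, hits in counts.most_common():
        print(f"{agent}: {hits} requests")

Running it against a recent log gives you a rough count per bot, which is often the moment the problem stops feeling abstract.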

Raptive, the ad management company that works with thousands of independent creators, has standardised Creator Terms of Content Use across its network and offers a WordPress plugin that blocks AI bot traffic. Food creator Sarah Leung of The Woks of Life and recipe writer Gina Homolka of Skinnytaste are among the creators who’ve adopted these tools. Raptive ran a study between May 2024 and June 2025 and found no negative impact on traffic or search rankings from blocking AI bots.

That last point is worth emphasising. A common fear among bloggers is that blocking AI crawlers will somehow hurt their Google rankings. The evidence so far suggests it doesn’t. AI crawlers and search engine crawlers are separate entities. Blocking GPTBot does not affect Googlebot.


The loopholes you should know about

Even with protections in place, AI companies can access your content indirectly. They’ve reportedly grabbed content from Common Crawl and the Internet Archive rather than scraping sites directly — your robots.txt doesn’t prevent someone from accessing a cached version of your site stored elsewhere.

If you participate in content distribution programmes like SmartNews or NewsBreak, the terms may grant them the right to sublicense your content to third parties, including AI companies. Review any distribution agreements you’ve signed.

And the emergence of AI-powered browsers complicates things further. Perplexity’s Comet browser and tools like Firecrawl are, as Tollbit puts it, essentially indistinguishable from human visitors in server logs.

None of these loopholes should discourage you from implementing protections. They should calibrate your expectations. Protection here means raising the cost and difficulty of accessing your content without permission — not making it impossible.

Deciding what makes sense for your site

Before implementing any scraping protections, it’s worth asking a strategic question: is visibility in AI answers more valuable to you than the traffic you’d retain by blocking?

For some publishers — particularly those in highly competitive national news or commodity content spaces — appearing in AI summaries might be the only visibility they get as traditional search declines. For niche bloggers, B2B publishers, and creators with strong email lists and direct audiences, the calculus is different. Their content is specialised, their readers come through channels other than search, and the risk of AI regurgitating their expertise without attribution far outweighs the benefit of appearing in a ChatGPT response.

Most independent bloggers fall into the second category. The practical advice is clear: update your robots.txt, install a blocking plugin if you’re on WordPress, enable Cloudflare’s AI protections if you use it, review your distribution agreements, and add explicit terms of use to your site stating that AI scraping is not permitted.

None of this requires a legal team or a technical background. It requires about an hour of focused attention and the willingness to draw a line around your work. In a landscape where AI companies have shown they’ll take what they can until someone tells them to stop, that line is worth drawing — not because it guarantees protection, but because it establishes that your content is yours, and access to it requires your consent.

Lachlan Brown

Lachlan is the founder of HackSpirit and a longtime explorer of the digital world’s deeper currents. With a background in psychology and over a decade of experience in SEO and content strategy, Lachlan brings a calm, introspective voice to conversations about creator burnout, digital purpose, and the “why” behind online work. His writing invites readers to slow down, think long-term, and rediscover meaning in an often metrics-obsessed world. Lachlan is the author of the best-selling book Hidden Secrets of Buddhism: How to Live with Maximum Impact and Minimum Ego.
