January 14, 2026

Half the Web Is AI-Written... Not Really

A new Graphite study sampled 65,000 English-language URLs from Common Crawl (2020–May 2025) and used Surfer’s AI detector to label pages. In case you didn’t know, Common Crawl is a non-profit project that regularly crawls the public web and publishes the raw data for anyone to use.
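For a feel of what “sampling Common Crawl” means in practice, here is a minimal Python sketch (not Graphite’s actual pipeline) that queries the public Common Crawl CDX index for captures of a URL pattern. The crawl ID and the example.com pattern are assumptions; any crawl listed at index.commoncrawl.org works.

    import requests

    # Query the Common Crawl CDX index for captures of a URL pattern.
    # The crawl ID is an assumption -- substitute any crawl listed at
    # https://index.commoncrawl.org/.
    INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-51-index"

    resp = requests.get(
        INDEX,
        params={"url": "example.com/*", "output": "json", "limit": "5"},
        timeout=30,
    )
    resp.raise_for_status()

    # The index answers with one JSON object per line, one per captured page.
    for line in resp.text.strip().splitlines():
        print(line)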

It reports that AI-generated articles briefly exceeded human-written ones in November 2024, and that AI now accounts for roughly half of newly published articles online. That is not too surprising: we reported earlier on a similar phenomenon with AI-created scientific papers (with odd results).

Immediately, comments and videos appeared online claiming that half the web is now AI-written. Wrong, of course: the figure covers half of newly sampled articles, not half of every page on the web.

The data also showed that search still favors humans. In the study, 86% of top-ranking Google results were human-written and only 14% AI-generated. Ahrefs/eMarketer similarly report that 86.5% of top pages contain some AI, but only 4.6% are fully AI-generated. Together, these figures imply that blended workflows, not pure AI, dominate ranking content.

If half of the web were indeed already AI-generated, we would quickly end up in model collapse, as models would by then be trained predominantly on their own outputs. But we are not there yet, and we may never get there.

In short, the phrase is catchy, but quite imprecise.

The real question: what does this mean for trust?

AI now writes a huge share of what you read online. Your feed looks the same on the surface, with headlines, quotes, and charts, but the production line behind it has changed, sometimes drastically. Trust doesn’t break with one bad article; that can happen to any editor. It erodes quickly when speed replaces verification, when sources blur (or simply don’t exist), and when thousands of near-identical posts drown out the few with original reporting.

The practical question is this: which page do you believe, which number do you cite, which product do you buy, which policy do you support?

Three shifts are already visible.

  1. Discovery systems are adapting: Search and assistant systems appear to demote pure AI slop and reward original reporting, citations, expert authorship, and links. That’s why human-led or hybrid content dominates top results despite the flood.
  2. Hybrid becomes default: Top-ranking pages often combine human ideation and editing with AI drafting or summarization. It goes fast(er), but it comes with a firm editorial spine. That’s the story behind the “86.5% use some AI / 4.6% fully AI” split.
  3. Data provenance matters: Publishers and labs need clean human-generated corpora and content provenance (watermarks, cryptographic signatures, C2PA) to avoid the model-collapse trap and to keep the open web useful.

Below are concrete, real-world scenarios that show how trust gets tested and how to handle it.

1) Health advice that reads right but sources wrong

A “new study” claims a supplement reduces anxiety by 42%. The article lists no trial ID, no journal, and links only to another summary. The copy is fluent and confident: classic AI polish on weak inputs.

✅ You should demand a DOI or clinical trial registry number: no identifier, no trust. Also keep a short whitelist of journals and institutional sites for medical claims.

❌ If you ignore this, you share the piece, friends buy the product, and a month later a correction reveals the number came from an N=18 pilot with no control group.

2) Finance coverage that amplifies a pump

A coin spikes after ten near-duplicate posts “report” a partnership. Each post cites the others, not the company’s 8-K or press room. AI accelerates the echo chamber.

✅ You should jump straight to the issuer’s filings or the company newsroom archive. Verify the announcement date, counterparty quote, and contractual scope.

❌ If you trade on the rumor, the coin retraces and the “news” vanishes, because it never existed.

3) Local news, national reach, zero context

A viral story says a city “bans” a common app. In reality, the council opened a consultation. AI rewrites strip the caveats and keep the heat.

✅ You should read the municipal agenda or minutes. Check whether the item is a motion, draft ordinance, or adopted statute.

❌ If you post a take about censorship, locals will push back, and you burn credibility with the audience that knows.

4) Shopping pages with five-star déjà vu

Product reviews look human, but look closely and they repeat unusual phrasing and carry timestamps spaced minutes apart across dozens of items. Many clearly came from the same template.

✅ You should sort reviews by “Newest” and scan rater histories. Cross-check with third-party testers that publish repeatable methods (load cycles, failure rates, lab photos).

❌ If you still buy the gadget, it fails in week two, and the seller hides behind “mixed reviews.”

5) Academic shortcuts that poison citations

A student cites three sources that don’t exist. The abstracts look persuasive, but they are AI-hallucinated: both the sources’ ‘info’ and the links were generated by AI. This happens often with free tiers of AI tools, where subpar (or older) models are offered at no cost, while the paid versions do a much better job of finding real sources and working links.

✅ What you need to do, without exception, is click every citation. Confirm the journal site, volume, issue, and page range. You can also use Crossref/DOI lookups for this; a minimal sketch follows after this scenario.

❌ If you skip these checks, the error makes it into a policy memo, then into a budget decision. And we all know that retractions don’t travel as far as the first claim.
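Here is a minimal sketch of that Crossref lookup, assuming the requests package; both DOIs are placeholders standing in for strings copied out of a draft’s reference list.

    import requests

    def doi_exists(doi: str) -> bool:
        """Check a DOI against the public Crossref REST API.

        Crossref answers 200 for known records and 404 for unknown ones;
        a 404 is the classic signature of a hallucinated citation.
        """
        resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=15)
        return resp.status_code == 200

    # Placeholder DOIs copied from a draft's reference list.
    for doi in ["10.1000/real-looking-doi", "10.9999/fake.2025.42"]:
        print(doi, "->", "found" if doi_exists(doi) else "NOT FOUND: check by hand")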

6) “Data” visuals without data

A chart circulates showing a sharp drop or spike around a news event. No axis labels, no series description, no methodology is visible.

✅ In that case, ask for the dataset and the transform, and recreate the chart from the raw file before you share it, as in the sketch below.

❌ If you don’t, you will amplify a narrative built on a smoothing trick or a cherry-picked window.
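Here is a minimal sketch of that recreation step, assuming pandas and matplotlib and a hypothetical raw_series.csv with date and value columns:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical raw file behind the viral chart; file and column names
    # are assumptions.
    df = pd.read_csv("raw_series.csv", parse_dates=["date"])

    fig, ax = plt.subplots()
    ax.plot(df["date"], df["value"], label="raw, unsmoothed")
    # Plot a typical smoothing pass next to the raw series so any
    # manufactured drop or spike becomes visible.
    ax.plot(df["date"], df["value"].rolling(30).mean(), label="30-day average")
    ax.set_xlabel("date")
    ax.set_ylabel("value")
    ax.legend()
    plt.show()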

7) Corporate blogs that never leave the desk

A B2B article states “customers saved 37%” after adoption. But there’s no sample size, no baseline, and no time period.

✅ Before using this, you need a reference case with the named company, before/after metrics, and a measurement window. Then add a standing policy: claims without units, dates, and denominators simply cannot be published.

❌ If you ignore this, the outcome is simple: prospects churn mid-funnel when the numbers don’t match the demos.

8) Breaking news with no byline history

A site you don’t recognize breaks a big story under an unfamiliar author whose other pieces were all published last week, literally dozens of them.

✅ At a minimum, check the author page, prior output, and whatever correction logs you can find. Compare the copy to wire text to spot rewritten material.

❌ If you ignore this you inherit a correction after your newsletter goes out, and your next scoop gets fewer opens.

9) AI-summarized research that misses the caveats

A summary nails the headline result but drops the sensitivity analysis and limitations section.

✅ Skim the paper’s Methods and Limitations before quoting. Extract one constraint into your own copy to prove you read it.

❌ If you ignore this, readers will assume you are overclaiming, and you will train your audience to doubt you.

10) Government stats with silent revisions

A viral post cites last month’s unemployment rate without noting routine revisions. AI recirculates the first print; the corrected figure is lower.

✅ Check the statistical office’s revision policy and pull the latest series release. Quote both the initial and the revised figures; a minimal comparison sketch follows below.

❌ If you ignore this, your forecast model will drift, and you will misstate the trend on air.
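Here is a minimal comparison sketch with pandas; the file and column names are assumptions standing in for two exports of the same series:

    import pandas as pd

    # Hypothetical exports of the same series: the first print and the
    # later revised release. File and column names are assumptions.
    initial = pd.read_csv("unemployment_initial.csv", parse_dates=["month"])
    revised = pd.read_csv("unemployment_revised.csv", parse_dates=["month"])

    both = initial.merge(revised, on="month", suffixes=("_initial", "_revised"))
    both["revision"] = both["rate_revised"] - both["rate_initial"]

    # Quote both figures, and flag months where the revision changes the story.
    print(both[["month", "rate_initial", "rate_revised", "revision"]].tail(6))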

Real examples where AI-written (or AI-assisted) news/content went wrong

The scenarios above were hypothetical; here is a list of real cases where AI-written (or AI-assisted) news content went wrong. The pieces got published and then had to be fixed, pulled, or publicly explained.

You’ll notice that speed over verification is a recurring problem: AI drafted the material very fast, while fact-checking lagged (far) behind. You’ll also see opaque sourcing: readers weren’t told AI was used, or third-party vendors slipped content into trusted brands. The cases below also illustrate hallucinations and stale training: the models generated confident nonsense, and the editors simply missed it.

  1. CNET’s finance explainers (Jan 2023): AI wrote dozens of personal-finance articles. An internal review found errors in 41 of 77 pieces, including basic math like misstating compound-interest gains. Corrections followed.
  2. Gizmodo / io9’s “Star Wars” timeline (July 2023): An AI-generated listicle about Star Wars movies/TV went live with multiple factual mistakes and omissions. Staff said they hadn’t approved it; management caught heat and updated the material after a backlash.
  3. Gannett high-school sports recaps (Aug 2023): Local papers in the USA Today Network ran AI-generated game stories with bizarre phrasing and even visible placeholders like [[WINNING_TEAM_MASCOT]]. Gannett eventually paused the program.
  4. Microsoft/MSN’s Ottawa travel guide (Aug 2023): An AI travel piece recommended the Ottawa Food Bank as a “can’t miss” tourist attraction. Microsoft pulled the article.
  5. Sports Illustrated’s “authors” that didn’t exist (Nov 2023): Product-review content on SI ran under AI-generated headshots and invented bios; the publisher blamed a third-party vendor and removed pages amid scrutiny.
  6. Google’s AI Overviews (May 2024 → 2025): Google’s AI summaries surfaced wrong or unsafe advice (e.g., glue on pizza, eating rocks), prompting fixes after viral examples and media coverage. While not a “news story,” these AI answers appeared atop newsy queries and misled readers.
  7. Fake/erroneous AI obituaries (2024–2025): A wave of AI-generated obits – often syndicated by low-quality sites – published fabricated details and even non-existent deaths, prompting investigations and warnings from reporters and security firms.
  8. Chicago Sun-Times & partners’ summer guide (May 2025): A print/digital summer reading package included AI-invented book titles attributed to real authors and quotes from nonexistent experts. The paper removed the content and updated policies.

How to operationalize trust

Trust now depends less on how good the prose looks and more on whether the claims can be traced, checked, and replicated. You should treat every polished page as a lead and try to verify it. Build tiny habits like clicking the source, verifying the number, and naming the method. This way you’ll keep your signal sharp while the rest of the web gets noisier.

There are some rules to follow:

  • Insist on provenance by default. Bylines, timestamps with time zones, working links to primary docs, and a visible corrections policy.
  • Use a three-link rule. Before sharing, click at least three layers deep: article → source article → primary file (PDF, filing, dataset).
  • Adopt a red-flag checklist. If a science claim has no DOI, a finance claim no SEC/ESMA filing, a policy claim no ordinance text, or a review no test protocol, do not publish.
  • Reward information gain. Prefer pages that add interviews, new data, code, or methods. They outlast floods of paraphrase.
  • Track your own accuracy. Keep a simple log of claims you amplified, their sources, and any later corrections; a minimal sketch follows after this list. Trust compounds; so do mistakes.
  • Label AI’s role. When AI helps draft or summarize, say so. Then show your human verification step in one sentence.
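The accuracy log needs no tooling; a few lines of Python and a CSV are enough. The file name and columns here are a sketch, not a standard.

    import csv
    from datetime import date
    from pathlib import Path

    LOG = Path("claims_log.csv")  # hypothetical file name

    def log_claim(claim: str, source_url: str, status: str = "unverified") -> None:
        """Append one amplified claim to a running accuracy log."""
        new_file = not LOG.exists()
        with LOG.open("a", newline="") as f:
            writer = csv.writer(f)
            if new_file:
                writer.writerow(["date", "claim", "source_url", "status"])
            writer.writerow([date.today().isoformat(), claim, source_url, status])

    log_claim("Supplement X cut anxiety 42%", "https://example.com/post")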

How reliable are AI detectors?

Detectors (like Surfer) are probabilistic. They can make errors, especially on edited AI text or highly formulaic human text, and they can show bias, especially against non-native writers. There’s no definitive way to measure the exact share of AI content today, so you can only interpret percentages as estimates. The back-of-envelope calculation below shows why a single flag is weak evidence.
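The sketch assumes illustrative accuracy numbers, not any vendor’s published specs, and applies Bayes’ rule to a detector flag:

    # How often is a "flagged as AI" verdict wrong? All three rates are
    # assumptions for illustration, not measured detector performance.
    sensitivity = 0.90      # P(flagged | text is AI)
    false_positive = 0.05   # P(flagged | text is human)
    base_rate = 0.10        # share of AI text in the pool being scanned

    # Bayes' rule: P(AI | flagged) = P(flagged | AI) * P(AI) / P(flagged)
    p_flag = sensitivity * base_rate + false_positive * (1 - base_rate)
    p_ai_given_flag = sensitivity * base_rate / p_flag

    print(f"P(actually AI | flagged) = {p_ai_given_flag:.2f}")  # ~0.67
    # At a 10% base rate, roughly one in three flags lands on human text.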

Below are a few tools and approaches I have tested myself. Feel free to contact us if you have found tools that also deserve a mention.

OpenAI, for its part, shut down its own text classifier because of “low accuracy” and now points users toward provenance methods instead. Several institutions have disabled AI flags altogether because they created more confusion than clarity.

The best approach is to adopt the following two ground rules:

  1. Don’t use AI-detector scores as evidence on their own. Corroborate with drafts, version history, citations, or platform logs.
  2. Prefer provenance where possible. C2PA, for instance, provides an open technical standard for publishers, creators, and consumers to establish the origin and edit history of digital content; its user-facing label is called Content Credentials. Other solutions include SynthID-enabled tools, which detect watermarks in output from Google’s own models. Key in any case is to maintain an audit trail of prompts, edits, and sources. A toy sketch of the underlying signature idea follows below.
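To make the signature idea concrete, here is a toy sketch using the third-party cryptography package. It is not C2PA (real manifests carry edit history, issuer certificates, and more); it only shows the core mechanic of a verifiable signature over published bytes.

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # Toy provenance: sign the exact bytes you publish. Real C2PA manifests
    # carry far more (edit history, issuer certificates, timestamps).
    article = b"Half the Web Is AI-Written... Not Really -- final edit"

    private_key = Ed25519PrivateKey.generate()
    signature = private_key.sign(article)
    public_key = private_key.public_key()

    # Anyone holding the public key can check the bytes were not altered.
    try:
        public_key.verify(signature, article)
        print("signature valid: content is exactly what was signed")
    except InvalidSignature:
        print("signature invalid: content changed after signing")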

An interesting take comes from Grammarly’s “Authorship,” which surfaces document-history signals (edits, time on task) inside Docs/Word. Several universities have already partnered on this approach.

Prove it, or it simply didn’t happen

AI now drafts an immense volume of new pages, yet top results still reward original (human) reporting, named sources, and expert authorship. Publishers will win trust by linking the DOI, the filing, or the dataset, and by showing their edit trail. The baseline is to label where AI helped and to explain your method in one simple line. It’s also good practice to keep a public correction log.

So what will the future bring? Search and assistants will boost content with verifiable origins: C2PA content credentials, cryptographic signatures, reproducible code, and clean human corpora. Hybrid workflows will become the default (and that is not a crime), with humans setting the angle while AI drafts and summarizes; it will be humans who verify and add the new facts. Expect watermarked outputs and platform-level version history to follow content across the web.

The key will be that models train on higher-purity data, not the AI slop that is being generated en masse.

