
AI Search Is Contaminating Itself: The Retrieval Poisoning Crisis and What Google Click Signals Actually Do

56% of Google AI Overview citations are ungrounded. Synthetic SEO content is poisoning RAG systems in real time. Plus: DOJ documents reveal how Navboost and RankEmbedBERT actually process click data.

Updated April 24, 2026 · Francisco Leon de Vivero

AI search systems are contaminating their own outputs through a real-time retrieval loop that requires no retraining cycle to spread misinformation. An Oumi analysis of 4,326 AI Overview responses found that while 85–91% appear accurate on the surface, 56% of correct answers are ungrounded — the cited sources don't actually support the claims. Separately, DOJ antitrust documents finally clarify how Google actually uses click data through Navboost and RankEmbedBERT.

Together, these findings expose two fundamental misunderstandings in the SEO industry: that AI citations equal trustworthiness, and that clicks directly influence rankings. Neither is true — and the gap between perception and reality is widening.

56% — Correct AI Overview answers that are ungrounded
4,326 — AI Overview responses tested (Oumi)
44% — ChatGPT citations that are "best X" listicles
1/100th — Data used by RankEmbedBERT vs. predecessors

1. The Retrieval Poisoning Crisis: AI Search Is Eating Itself

Unlike traditional model contamination (which requires retraining over months), RAG-based systems like Google AI Overviews, Perplexity, and ChatGPT fetch live web content and present it as authoritative answers. When that live content is itself AI-generated, hallucinated, or fabricated, the contamination is instantaneous. The retrieval layer is not a filter — it is the infection vector.
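To make the "infection vector" point concrete, here is a minimal sketch of why similarity-only retrieval has no defense against fabricated pages. Everything here is invented for illustration (the toy corpus, the URLs, the bag-of-words "embedding"); no real system works this crudely, but the structural gap is the same: ranking rewards query match, not provenance.

```python
# A naive retrieval step: rank documents purely by query similarity,
# with no check on whether a source is human-written, verified, or real.
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' standing in for a real encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: dict[str, str], k: int = 1) -> list[str]:
    """Top-k by similarity only: nothing here filters for trustworthiness."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda url: cosine(q, embed(corpus[url])), reverse=True)
    return ranked[:k]

corpus = {
    "https://example.com/legit-guide": "general advice about competitive eating history",
    "https://example.com/fabricated-post": "hot dog eating rankings 2026 world record holders list",
}
# A fabricated page that mirrors the query wording wins the citation slot:
print(retrieve("hot dog eating rankings 2026", corpus))
```

A freshly indexed fabrication that echoes the query's exact wording outscores an older, accurate page, which is precisely how a 24-hour-old hoax ends up cited as fact.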

The Speed of Contamination: A BBC journalist published a fabricated post about hot dog eating rankings. Within 24 hours, it ranked first in Google and was cited by both Google AI Overviews and OpenAI as factual. No retraining required — the retrieval layer treated an indexable URL as a trustworthy source immediately.

This is fundamentally different from the "model collapse" researchers have warned about. Model collapse is a slow degradation over training cycles. Retrieval poisoning is real-time. A speculative blog post published at 9 AM can be cited as authoritative fact by 10 AM. This dynamic connects to the ghost citation problem — AI systems are citing content without verifying it, and now without even verifying that the citations support the claims.

[Figure: Isometric visualization of the AI retrieval poisoning loop, showing how synthetic content cycles through RAG systems]

2. The Numbers: How Bad Is the Contamination?

Metric | Finding | Source
AI Overview surface accuracy | 85–91% across 4,326 tests | Oumi analysis
Ungrounded correct answers | 56% cite unsupportive sources | Oumi analysis
ChatGPT "best X" listicle citations | 44% of all citations | Ahrefs study
GPT-5.4 vs. GPT-5.3 false claims | Paid tier produces 33% fewer | SEJ analysis
Free-tier OpenAI users | 94% use less reliable versions | SEJ analysis

The Oumi analysis reveals a critical distinction between surface accuracy and grounded accuracy. A response can sound correct while citing sources that don't actually support the claim. Over half of all "correct" answers fall into this category — they give the illusion of citation-backed authority without the substance. Across 5,380 sources analyzed, Facebook and Reddit ranked as the second and fourth most-cited platforms — neither of which has mechanisms to verify human authorship or factual accuracy.
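The surface-vs-grounded gap can be illustrated with a crude groundedness check: does the cited source actually contain the substance of the claim? Production systems would use an entailment model rather than word overlap, and the claims, sources, and threshold below are all invented for this sketch; the point is only that correctness and grounding are separately testable properties.

```python
# Crude groundedness heuristic: what fraction of the claim's content
# words appear in the cited source? Real graders use NLI/entailment
# models; this toy version just makes the distinction visible.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "are", "to", "and"}

def content_words(text: str) -> set[str]:
    return {w.strip(".,").lower() for w in text.split()} - STOPWORDS

def grounded(claim: str, source: str, threshold: float = 0.6) -> bool:
    """True if most of the claim's content words appear in the source."""
    claim_words = content_words(claim)
    if not claim_words:
        return False
    overlap = claim_words & content_words(source)
    return len(overlap) / len(claim_words) >= threshold

claim = "RankEmbedBERT uses 1/100th the data of earlier models"
supporting = "DOJ documents show RankEmbedBERT uses 1/100th the data of earlier models"
unrelated = "Google announced a new data center in Iowa this quarter"

print(grounded(claim, supporting))  # claim backed by its citation
print(grounded(claim, unrelated))   # a "correct" answer, ungrounded citation
```

An answer can pass an accuracy check (the claim happens to be true) while failing this grounding check (the cited page never supports it), which is exactly the 56% failure mode the Oumi analysis describes.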

The Quality Stratification Problem: GPT-5.4 (paid tier) produces 33% fewer false claims than the free GPT-5.3 — yet 94% of OpenAI's users access the less reliable free version. The most vulnerable users receive the least accurate answers.
[Figure: 56% of correct AI Overview answers cite sources that don't support the claims — ungrounded citations analysis]

3. The Mechanism: Why RAG Systems Are the Infection Vector

Two academic papers demonstrate the structural vulnerability. PoisonedRAG (Zou et al., 2024) showed that a small number of crafted passages can control RAG system outputs without compromising the model itself — injecting content into the retrieval corpus is sufficient. BadRAG (Xue et al., 2024) demonstrated semantic backdoors enabling similar manipulation through content designed to trigger specific retrieval patterns.
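The PoisonedRAG result can be sketched in a few lines: an attacker who can only add passages to the corpus (never touching the model) crafts them to echo a target query's wording, so they crowd honest documents out of the retrieved context. The scoring function, corpus, and passages below are all invented stand-ins for embedding similarity and a real index.

```python
# Toy illustration of corpus-injection poisoning: a handful of crafted
# passages dominate top-k retrieval for a target query.
def score(query: str, passage: str) -> int:
    """Keyword overlap standing in for embedding similarity."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def top_k(query: str, corpus: list[str], k: int = 3) -> list[str]:
    return sorted(corpus, key=lambda p: score(query, p), reverse=True)[:k]

honest_corpus = [
    "algorithm updates are announced on the official search blog",
    "ranking systems are documented in public guidance",
]
# Crafted passages repeat the target query's wording so they outrank
# honest documents, then smuggle in the fabricated claim:
poison = [
    "perspective core algorithm update september 2025 confirmed rollout details",
    "perspective core algorithm update september 2025 impact analysis",
    "perspective core algorithm update september 2025 recovery checklist",
]

context = top_k("perspective core algorithm update september 2025",
                honest_corpus + poison)
# Every passage the generator will see is attacker-controlled:
print(all(p in poison for p in context))
```

Once the retrieved context is fully attacker-controlled, the generator's answer is too, which is why injection into the corpus is sufficient and no model compromise is needed.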

The practical attack chain works like this: an AI content pipeline generates a speculative article → the article gets indexed within hours → a RAG system fetches it during a user query and cites it → other AI pipelines observe the citation and reference the same content → the fabricated claim becomes "consensus" across multiple AI systems without any human verification.

Documented case: Perplexity confidently cited a nonexistent "September 2025 Perspective Core Algorithm Update," pulling from AI-generated SEO blog posts. The update never happened. Multiple SEO blogs had speculated about it using AI content tools, and that speculation was laundered through citations into apparent fact.

xAI's Grokipedia exemplifies the endpoint of this trend: an AI-rewritten encyclopedia that builds articles on contaminated web content, including Instagram reels as sources, with no human accountability mechanism for correcting errors.

4. The SEO Industry's Role in the Contamination Loop

The irony is acute: the SEO industry is simultaneously the victim and the accelerant of this crisis. When AI Overviews and AI search tools began capturing traffic that previously went to publishers, agencies responded by deploying AI content pipelines at scale. But the content these pipelines generate — speculative algorithm analyses, "best X" roundups, generic how-to articles — became the raw material that other AI systems now cite.

The Self-Reinforcing Cycle: AI search reduces publisher traffic → Publishers deploy AI content pipelines to maintain volume → AI-generated content floods the index → RAG systems cite AI-generated content as fact → Citation laundering legitimizes fabricated claims → Information quality degrades → Users trust AI search less but use it more (convenience wins) → Cycle repeats.

This connects to the ChatGPT citation mechanics research showing that 44% of ChatGPT citations are "best X" listicles — the exact content formats that AI pipelines produce at highest volume, typically structured around self-interested product rankings rather than independent evaluation.

Meanwhile, human creators are abandoning the open web as the traffic bargain collapses. The content that would provide genuine first-hand expertise is increasingly published behind paywalls, in newsletters, or not at all — leaving the open web to synthetic content that AI systems will continue to ingest and cite. The zero-click survival strategies we covered earlier become even more critical in this context.


[Figure: How Google processes click data through RankEmbedBERT, the Click Fraction formula, and Navboost, per the DOJ documents]

5. Google Click Signals: What the DOJ Documents Actually Reveal

DOJ antitrust documents from September 2025 cut through persistent myths about how Google uses click data. The key finding: clicks are the lowest-level data point, not a ranking factor. They are processed, aggregated, and transformed before influencing anything.

3 — Primary ways Google processes click data
1/100th — Data used by RankEmbedBERT vs. earlier models

How Click Data Actually Flows Through Google's Systems

Processing Path | System | What Happens
AI Model Training | RankEmbedBERT | Click data combined with human rater scores trains ranking models. Uses 1/100th the data of earlier models while producing higher-quality results.
Aggregate Measurement | Click Fraction formula | Individual clicks are summed and normalized into statistical measures, then smoothed to prevent spam manipulation.
Popularity Signals | Navboost | Measures popularity through aggregate user feedback, not individual click tracking.

The Click Fraction Formula

A 2006 Google patent describes how individual clicks become aggregate signals:

// Google's Click Fraction formula (2006 patent)

LCC_BASE = #WC(Q,D) / (#C(Q,D) + S0)

// #WC(Q,D) = weighted click count for query Q and document D
// #C(Q,D)  = total click count for that query-document pair
// S0       = smoothing constant to prevent gaming

The smoothing constant S0 is critical: it prevents low-volume queries from being gamed by artificial clicks. Individual click manipulation is diluted by the normalization process. This is not a "more clicks = higher ranking" system — it's a statistical aggregation designed to resist exactly that kind of manipulation.
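The smoothing effect is easy to see numerically. The implementation below is a sketch of the patent's idea only: the example click counts and the S0 value are invented (the patent does not publish Google's actual weights or constants).

```python
# Sketch of the patent's click-fraction aggregation with smoothing.
def click_fraction(weighted_clicks: float, total_clicks: int, s0: float = 100.0) -> float:
    """LCC_BASE = weighted_clicks / (total_clicks + s0).

    s0 dampens low-volume queries: a few artificial clicks on a rare
    query barely move the fraction, while a consistent satisfaction
    pattern over thousands of clicks does.
    """
    return weighted_clicks / (total_clicks + s0)

# 5 bot clicks on a rare query vs. steady satisfaction at scale:
gamed = click_fraction(weighted_clicks=5.0, total_clicks=5)        # 5 / 105
honest = click_fraction(weighted_clicks=8000.0, total_clicks=10000)  # 8000 / 10100

print(round(gamed, 3), round(honest, 3))
```

With an assumed S0 of 100, five fraudulent clicks yield a fraction near 0.05 even though every click "succeeded," while a genuine pattern across ten thousand clicks approaches its true rate. That asymmetry is the anti-gaming design.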

The Practical Takeaway: Click-through rate manipulation (clickbait titles, misleading snippets) does not directly boost rankings. Google processes clicks through aggregation, normalization, and smoothing before they influence any ranking system. Focus on satisfying user intent rather than maximizing raw clicks.

RankEmbedBERT: Less Data, Better Results

The DOJ documents reveal that RankEmbedBERT is trained on 1/100th the data of its predecessors while producing higher quality search results. This suggests Google has shifted from quantity-dependent approaches to architectures that extract more signal from less data — making the quality of training signals (including click-derived ones) more important than their volume.

6. Google's GEO Job Posting: A Mixed Signal

Google's ads organization posted a "GEO Partner Manager, Performance Solutions" role within its Large Customer Sales team. The listing mentions "Generative Engine Optimization" seven times and references analyzing "Share of Model" — a brand's visibility in AI-generated answers.

The Contradiction: Google's Gary Illyes stated that standard SEO practices suffice for AI Overviews. Now Google's ads team is hiring for GEO. The search and ads divisions appear to be operating from different playbooks.

This is worth monitoring but not overstating. It represents one hiring signal from Google's advertising sales organization. The practical implication: Google's ads team sees commercial opportunity in the GEO space, even if the search quality team doesn't endorse the framework. The "Share of Model" metric is the most interesting element — if Google develops tooling to measure brand visibility within AI-generated answers, that's a signal that AI answer optimization will eventually become a paid advertising surface, not just an organic discovery channel.

[Figure: Infographic showing the AI retrieval poisoning cycle, the 56% ungrounded citation rate, Google click-signal processing through Navboost and RankEmbedBERT, and Google's mixed GEO signals]


Frequently Asked Questions

What is retrieval-layer poisoning in AI search?

Retrieval-layer poisoning occurs when RAG-based AI search systems fetch live web content that contains AI-generated misinformation, then cite it as factual. Unlike training-data contamination which requires retraining cycles, retrieval poisoning happens in real time — a fabricated article can be indexed and cited within 24 hours.

What percentage of Google AI Overview citations are ungrounded?

According to an Oumi analysis of 4,326 AI Overview tests, while 85–91% showed surface accuracy, 56% of correct answers were ungrounded — the cited sources did not actually support the claims being made.

Does Google use clicks as a direct ranking factor?

No. According to DOJ antitrust documents from September 2025, clicks are the lowest-level data point that gets processed into higher-level signals. Google aggregates click data into statistical measures and uses it to train AI models like RankEmbedBERT. Individual clicks do not directly rank websites.

What is Navboost and how does it affect rankings?

Navboost is a Google ranking system that measures popularity through aggregate user feedback. It processes aggregated click data — not individual clicks — to create signals about user satisfaction and content relevance.

How does synthetic SEO content create a contamination loop?

SEO agencies deploy AI content pipelines that generate speculative articles. Other AI pipelines cite those articles as sources. RAG systems fetch this content in real time and present it as factual. A documented example: Perplexity cited a nonexistent "September 2025 Perspective Core Algorithm Update" sourced entirely from AI-generated SEO blogs.

What is Google's position on Generative Engine Optimization (GEO)?

Google sends mixed signals. Gary Illyes stated that standard SEO suffices for AI Overviews. However, Google's ads organization posted a "GEO Partner Manager" role mentioning GEO seven times and referencing "Share of Model" analysis. The search and ads teams appear misaligned.

What is "Share of Model" and why does it matter?

Share of Model measures a brand's visibility in AI-generated answers — how often a brand appears when AI systems respond to relevant queries. It represents a shift from traditional Share of Voice metrics toward measuring influence within AI answer engines, and may signal future paid advertising surfaces.

Francisco Leon de Vivero
About the Author

Francisco Leon de Vivero is VP of Growth at Growing Search and a global SEO expert with 15+ years of experience across enterprise, ecommerce, and international search. He previously led Global SEO Framework at Shopify and has spoken at UnGagged, SEonthebeach, and other international conferences.

