AI Search Is Contaminating Itself: The Retrieval Poisoning Crisis and What Google Click Signals Actually Do
56% of correct Google AI Overview answers are ungrounded. Synthetic SEO content is poisoning RAG systems in real time. Plus: DOJ documents reveal how Navboost and RankEmbedBERT actually process click data.
AI search systems are contaminating their own outputs through a real-time retrieval loop that requires no retraining cycle to spread misinformation. An Oumi analysis of 4,326 AI Overview responses found that while 85–91% appear accurate on the surface, 56% of correct answers are ungrounded — the cited sources don't actually support the claims. Separately, DOJ antitrust documents finally clarify how Google actually uses click data through Navboost and RankEmbedBERT.
Together, these findings expose two fundamental misunderstandings in the SEO industry: that AI citations equal trustworthiness, and that clicks directly influence rankings. Neither is true — and the gap between perception and reality is widening.
1. The Retrieval Poisoning Crisis: AI Search Is Eating Itself
Unlike traditional model contamination (which requires retraining over months), RAG-based systems like Google AI Overviews, Perplexity, and ChatGPT fetch live web content and present it as authoritative answers. When that live content is itself AI-generated, hallucinated, or fabricated, the contamination is instantaneous. The retrieval layer is not a filter; it is the infection vector.
2. The Numbers: How Bad Is the Contamination?
| Metric | Finding | Source |
|---|---|---|
| AI Overview surface accuracy | 85–91% across 4,326 tests | Oumi analysis |
| Ungrounded correct answers | 56% cite unsupportive sources | Oumi analysis |
| ChatGPT "best X" listicle citations | 44% of all citations | Ahrefs study |
| GPT-5.4 vs GPT-5.3 false claims | Paid tier produces 33% fewer | SEJ analysis |
| Free-tier OpenAI users | 94% use less reliable versions | SEJ analysis |
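To put the two Oumi figures in absolute terms, the arithmetic below works out what they imply. It is a rough sketch that assumes the 56% ungrounded share applies to the subset of responses judged accurate, as the summary above states; the function name and the rounding are illustrative and not part of the Oumi analysis.

```python
def implied_ungrounded(total, accuracy, ungrounded_share):
    """Return (accurate, ungrounded) response counts implied by the two rates."""
    accurate = total * accuracy
    return round(accurate), round(accurate * ungrounded_share)


TOTAL_RESPONSES = 4326    # responses analysed, from the table above
UNGROUNDED_SHARE = 0.56   # share of accurate answers whose sources lack support

# Low and high ends of the reported surface-accuracy range.
for accuracy in (0.85, 0.91):
    accurate, ungrounded = implied_ungrounded(
        TOTAL_RESPONSES, accuracy, UNGROUNDED_SHARE
    )
    print(f"accuracy {accuracy:.0%}: ~{accurate} accurate, ~{ungrounded} ungrounded")
```

Under that assumption, roughly half of all 4,326 responses were right while citing sources that do not back them up.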
3. The Mechanism: Why RAG Systems Are the Infection Vector
Two academic papers demonstrate the structural vulnerability. PoisonedRAG (Zou et al., 2024) showed that a small number of crafted passages can control RAG system outputs without compromising the model itself; injecting content into the retrieval corpus is sufficient. BadRAG (Xue et al., 2024) demonstrated semantic backdoors enabling similar manipulation through content designed to trigger specific retrieval patterns.
The practical attack chain works like this: an AI content pipeline generates a speculative article → the article gets indexed within hours → a RAG system fetches it during a user query and cites it → other AI pipelines observe the citation and reference the same content → the fabricated claim becomes "consensus" across multiple AI systems without any human verification.
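The retrieval step of that chain can be illustrated with a toy retriever. This is a minimal sketch, not the PoisonedRAG or BadRAG method: it assumes a simple bag-of-words cosine-similarity ranker, and the corpus, query, and injected passages are all invented for illustration. It shows only that a few passages written to echo a query can take over the top-k results without touching the model.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k=2):
    """Return the k passages most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        corpus, key=lambda doc: cosine(q, Counter(tokenize(doc))), reverse=True
    )
    return ranked[:k]


# Hypothetical clean corpus.
clean_corpus = [
    "Search engines aggregate user feedback before it influences ranking.",
    "Core updates are announced on the official search status dashboard.",
    "A how-to guide for writing structured data markup.",
]

# Hypothetical synthetic passages written to echo the query wording.
poisoned = [
    "Latest core algorithm update: the latest core algorithm update rewards listicles.",
    "What changed in the latest core algorithm update? "
    "The latest core algorithm update boosts AI content.",
]

query = "what changed in the latest core algorithm update"

before = retrieve(query, clean_corpus)
after = retrieve(query, clean_corpus + poisoned)

print("before:", before)
print("after: ", after)
```

In this toy setup both top-ranked results come from the injected passages once they are added, so whatever a generator writes from them inherits their claims. Production retrievers use dense embeddings and further ranking stages, so this demonstrates the principle rather than a working attack.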
xAI's Grokipedia exemplifies the endpoint of this trend: an AI-rewritten encyclopedia that bases articles on contaminated web content, including Instagram reels as sources. There is no human responsibility mechanism for correcting errors.
4. The SEO Industry's Role in the Contamination Loop
The irony is acute: the SEO industry is simultaneously the victim and the accelerant of this crisis. When AI Overviews and AI search tools began capturing traffic that previously went to publishers, agencies responded by deploying AI content pipelines at scale. But the content these pipelines generate (speculative algorithm analyses, "best X" roundups, generic how-to articles) became the raw material that other AI systems now cite.
5. Google Click Signals: What the DOJ Documents Actually Reveal
DOJ antitrust documents from September 2025 cut through persistent myths about how Google uses click data. The key finding: clicks are the lowest-level data point, not a ranking factor. They are processed, aggregated, and transformed before influencing anything.
How Click Data Actually Flows Through Google's Systems
| Processing Path | System | What Happens |
|---|---|---|
| AI Model Training | RankEmbedBERT | Click data combined with human rater scores trains ranking models. Uses 1/100th the data of earlier models while producing higher quality results. |
| Aggregate Measurement | Click Fraction formula | Individual clicks are summed and normalized into statistical measures, then smoothed to prevent spam manipulation. |
| Popularity Signals | Navboost | Measures popularity through aggregate user feedback, not individual click tracking. |
The Click Fraction Formula
A 2006 Google patent describes how individual clicks become aggregate signals:
LCC_BASE = [#WC(Q,D)] / [#C(Q,D) + S0]
// #WC(Q,D) = weighted click count for query Q and document D
// #C(Q,D) = total click count for that query-document pair
// S0 = smoothing constant to prevent gaming
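The effect of the smoothing constant is easier to see with numbers. The sketch below implements the formula exactly as written above; the value of S0 and the click counts are hypothetical, since the article does not give the values used in practice.

```python
def lcc_base(weighted_clicks, total_clicks, smoothing):
    """LCC_BASE = #WC(Q,D) / (#C(Q,D) + S0) for one query-document pair."""
    if smoothing <= 0:
        raise ValueError("smoothing constant S0 must be positive")
    return weighted_clicks / (total_clicks + smoothing)


S0 = 10.0  # hypothetical smoothing constant; the real value is not given

# Hypothetical counts: both pairs have the same raw ratio of
# weighted clicks to total clicks (0.8), but very different volumes.
sparse = lcc_base(weighted_clicks=4.0, total_clicks=5.0, smoothing=S0)
dense = lcc_base(weighted_clicks=4000.0, total_clicks=5000.0, smoothing=S0)

print(f"sparse pair: {sparse:.3f}")
print(f"dense pair:  {dense:.3f}")
```

Because S0 sits in the denominator, a pair with only a handful of clicks is pulled toward zero while a heavily clicked pair stays close to its raw ratio. That is the sense in which smoothing blunts small-scale click manipulation.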
RankEmbedBERT: Less Data, Better Results
The DOJ documents reveal that RankEmbedBERT is trained on 1/100th the data of its predecessors while producing higher quality search results. This suggests Google has shifted from quantity-dependent approaches to architectures that extract more signal from less data, making the quality of training signals (including click-derived ones) more important than their volume.
6. Google's GEO Job Posting: A Mixed Signal
Google's ads organization posted a "GEO Partner Manager, Performance Solutions" role within its Large Customer Sales team. The listing mentions "Generative Engine Optimization" seven times and references analyzing "Share of Model", a brand's visibility in AI-generated answers.
Frequently Asked Questions
What is retrieval-layer poisoning in AI search?
Retrieval-layer poisoning occurs when RAG-based AI search systems fetch live web content that contains AI-generated misinformation, then cite it as factual. Unlike training-data contamination, which requires retraining cycles, retrieval poisoning happens in real time: a fabricated article can be indexed and cited within 24 hours.
What percentage of Google AI Overview citations are ungrounded?
According to an Oumi analysis of 4,326 AI Overview tests, while 85–91% showed surface accuracy, 56% of correct answers were ungrounded: the cited sources did not actually support the claims being made.
Does Google use clicks as a direct ranking factor?
No. According to DOJ antitrust documents from September 2025, clicks are the lowest-level data point that gets processed into higher-level signals. Google aggregates click data into statistical measures and uses it to train AI models like RankEmbedBERT. Individual clicks do not directly rank websites.
What is Navboost and how does it affect rankings?
Navboost is a Google ranking system that measures popularity through aggregate user feedback. It processes aggregated click data, not individual clicks, to create signals about user satisfaction and content relevance.
How does synthetic SEO content create a contamination loop?
SEO agencies deploy AI content pipelines that generate speculative articles. Other AI pipelines cite those articles as sources. RAG systems fetch this content in real time and present it as factual. A documented example: Perplexity cited a nonexistent "September 2025 Perspective Core Algorithm Update" sourced entirely from AI-generated SEO blogs.
What is Google's position on Generative Engine Optimization (GEO)?
Google sends mixed signals. Gary Illyes stated that standard SEO suffices for AI Overviews. However, Google's ads organization posted a "GEO Partner Manager" role mentioning GEO seven times and referencing "Share of Model" analysis. The search and ads teams appear misaligned.
What is "Share of Model" and why does it matter?
Share of Model measures a brand's visibility in AI-generated answers: how often a brand appears when AI systems respond to relevant queries. It represents a shift from traditional Share of Voice metrics toward measuring influence within AI answer engines, and may signal future paid advertising surfaces.
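One simple way to put a number on that definition is to count the fraction of collected AI answers that mention a brand. The sketch below is an assumption about how such a metric could be computed, not Google's method; the function name, the brand names, and the sample answers are all hypothetical.

```python
import re


def share_of_model(answers, brands):
    """Fraction of answers mentioning each brand at least once.

    Matching is case-insensitive and on whole words only.
    """
    if not answers:
        return {brand: 0.0 for brand in brands}
    shares = {}
    for brand in brands:
        pattern = re.compile(rf"\b{re.escape(brand)}\b", re.IGNORECASE)
        hits = sum(1 for answer in answers if pattern.search(answer))
        shares[brand] = hits / len(answers)
    return shares


# Hypothetical AI answers collected for a set of relevant queries.
answers = [
    "For trail running, Acme and Borealis are often recommended.",
    "Acme makes a popular entry-level model.",
    "Many reviewers prefer Borealis for wet conditions.",
    "There is no single best option; it depends on fit.",
]

print(share_of_model(answers, ["Acme", "Borealis", "Cirrus"]))
```

A fuller measurement would also need a fixed query set, repeated sampling (AI answers vary between runs), and handling for brand aliases, none of which this sketch attempts.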
