Pravda in the pipeline: Early evidence of state-adjacent propaganda in AI training data
An audit of Common Crawl reveals presence of Pravda and Glassbridge content in popular archive service, with further signs of LLM ingestion.
Pravda in the pipeline: Early evidence of state-adjacent propaganda in AI training data
Share this story
The Pravda network, a collection of pro-Kremlin websites that has used AI tools to flood the internet with millions of pieces of Russian propaganda, is almost certainly intended to shape the responses of certain large language models (LLMs). This is broadly known as AI poisoning; alternately, what the American Sunlight Project has described as “LLM grooming.” Past investigations by the DFRLab and CheckFirst, the American Sunlight Project, and NewsGuard have shown how Russian propaganda efforts can co-opt the retrieval layer of LLMs, appearing in web search results and skewing users’ understanding of current events. But while retrieval layer poisoning is concerning, it can also be mitigated by blacklisting the offending domain at time of inference.
There is no such remedy for poisoning the training data itself. An October 2025 study by Anthropic, the UK AI Security Institute, and the Alan Turing Institute found that it takes as few as 250 malicious documents to compromise the answers of even very large (13B parameters) and highly capable models. Once this harmful content has been baked into a model’s weights, the only fix is a full (and extremely expensive) retrain.
According to new research by the DFRLab, LLM training data has already been compromised. An audit of Common Crawl, the public web archive that supplies much of the world’s opoen AI training pipeline, finds that content from the Pravda network, the Chinese-government-adjacent Glassbridge operation, and Russia’s RT. In some cases, a major open-weights model could be prompted to reproduce this content nearly verbatim. These findings carry especially concerning implications for the next generation of open-weights models, which are most likely to draw on such public crawls without rigorous filtering.
Common Crawl’s periodic releases feed many of the world’s largest AI training pipelines and and are publicly queryable via AWS Athena. These “crawls,” released monthly and spanning roughly 2 billion pieces of content each time, form the training data backbone of many modern LLMs, particularly those built outside of the frontier labs.
To probe for verbatim memorization, the DFRLab used Llama 3.1 405B Base, a large Meta model hosted on Hyperbolic, and a test-completion method in which the model was seeded with the opening sentence of a known article in order to see whether it reproduced the rest. This text-completion method is able to reveal latent training patterns in a way that is not possible with instruction-tuned conversational models like ChatGPT, whose alignment filters obscure raw model behavior.
Several caveats apply. Inclusion in Common Crawl does not guarantee inclusion in any given model’s training data. Meta, for example, runs its own deduplication, filtering, and quality scoring of data before any training. Llama 3.1 405B Base was chosen because high parameter base models are scarce in public release, but with a December 2023 knowledge cutoff, it is roughly two generations behind current frontier systems. Finally, although the text completion method is more reliable than conversational prompting for identifying memorized content, it is still impossible to precisely reconstruct an LLM’s training data in this way.
Significant presence of Pravda material in Common Crawl
Because of the Pravda Dashboard built and maintained by CheckFirst with contributions from the DFRLab, it is possible to sample a significant volume of the content created by Pravda and to evaluate their its in Common Crawl. The DFRLab found that Pravda articles have been extensively archived by Common Crawl. Because most Pravda content was created after Llama 3.1 405B Base’s knowledge cut-off in December 2023, it is not possible—at least for now—to critically evaluate the extent of Pravda’s LLM memorization via a sufficiently large and open model.

Pravda’s infiltration of Common Crawl is due in part to how the Pravda network configures its robots.txt and sitemap. The robots.txt of a website is a protocol that indicates which parts of the website crawlers are permitted to visit (crawlers can choose to ignore this) and sometimes links to the website’s sitemap. The sitemap acts as a directory for web crawlers to source links for parsing or archiving, making it much easier to rapidly scrape content across the entire domain.
Drawing from the URLs in the sitemap, the English-language Pravda site has been building out since 2023. Common Crawl began to substantially increase its rate of archiving news-pravda[.]com in November 2024, likely due to the domain becoming higher value to the Common Crawl bot.
The number of archived webpages of the English-language Pravda domain has been gaining since mid-2024. The total number of pages, inferred from the sitemap, has been growing at a much faster rate. (Source: news-pravda[.]com, DFRLab via Common Crawl)
By November 2025, roughly 40,000 pieces of English-language Pravda content had been archived in Common Crawl. Although this figure is paltry compared to the billions of webpages present in the full Common Crawl database, it has still grown by orders of magnitude since November 2024, when Pravda boasted only 37 articles across the entire Common Crawl archive. Such a concentration of overt Russian propaganda will almost certainly skew future LLM responses on issues of interest to Russian foreign policy. This imbalance will likely be even clearer in non-English language performance, in which training data is typically less robust and prone to bias.
Presence of RT “biolabs” claims in Common Crawl and subsequent LLM ingestion
By way of comparison, the DFRLab identified a January 2023 article by RT, which advanced a popular Kremlin falsehood regarding U.S.-Ukrainian biolabs, in order to evaluate both Common Crawl archiving and LLM ingestion. Because of the popularity and persistence of Russia’s anti-Ukrainian biolabs messaging, as well as the paucity of non-Russian reporting on the subject, such content can disproportionately affect LLM worldview once internalized by the model.

Querying Common Crawl, the DFRLab found that the RT biolab article had been archived at least 17 times in 2023, before Llama 3.1 405B Base’s knowledge cut-off. Likely because of the heavy indexing of this article in Common Crawl, the DFRLab was able to reproduce it almost verbatim via the text completion method.
Archived pages in Common Crawl that features the RT biolab article. (Source: DFRLab via Common Crawl)
Doppelgänger has made little progress in compromising LLMs
For further point of reference, the DFRLab also evaluated the potential reach of Russia’s long-running Doppelgänger information manipulation effort. Although the operation’s “Reliable Russian News” (RRN) was archived by Common Crawl throughout 2023 (accumulating nearly 10,000 archived webpages in all), the DFRLab was unable to reproduce RRN content via Llama 3.1 405B Base and the text retrieval method. This shows that, at least in Meta’s case, archiving via Common Crawl is insufficient to guarantee that the training data is ingested by an LLM.
Pro-China Glassbridge’s impact blunted by web design choices
Beyond Russia-adjacent propaganda, the DFRLab also audited Common Crawl for the presence of content associated with known Chinese influence operations. Glassbridge, attributed by Google Threat Intelligence Group in 2024, describes a network of interconnected Chinese PR firms that have used a mix of public and clandestine methods to distribute Chinese state media under new guises and to new audiences.
Although many of Glassbridge’s websites have been indexed by Common Crawl, the contamination is unlikely to spread further. The reason is architectural. Glassbridge’s pages are built as single-page applications, with content injected by JavaScript only after a browser executes the page’s script. However, Common Crawl’s bot fetches raw HTML and does not run JavaScript, which means that it can only archive empty shells. The DFRLab spot-checked several Glassbridge pages in Common Crawl and found no article text in any of them. Text-retrieval tests against Llama 3.1 405B Base also failed to surface Glassbridge content. The same template choices that make Glassbridge look like a modern website network are, incidentally, what have limited its effects on LLMs.
Additional evidence of Glassbridge ingestion
Notably, Glassbridge domains are also used to host and circulate press releases paid for by advertisers and other parties. Because these press releases are not loaded from Javascript, they are more readily machine-readable and render appropriately in Common Crawl. Consequently, this means that they can also make the jump to LLMs.

In one case, the English-language channel of China Global Television Network (CGTN) distributed a press release drawing attention to a documentary that it had produced on the 20th anniversary of the Iraq War, timed to release as then-President Joe Biden was attending the 2023 Summit for Democracy. This extremely specific press release was nevertheless able to be reproduced almost verbatim from Llama 3.1 405B Base using the text retrieval method.

Of course, it is not just political propaganda. Via Llama 3.1 405B Base and the text retrieval method, the DFRLab was also able to reproduce specific phone numbers and addresses associated with announcements by Glassbridge-linked PR firms and subsequently archived by Common Crawl and evidently ingested by Llama without being caught by data quality controls.
Implications
Although AI poisoning by Russia and other state propagandists is a topic of growing concern among the influence operations community, it has proven extremely challenging to study or evaluate systematically. One finding from this analysis is that, in practice, the efficacy of AI poisoning efforts can vary greatly and often by chance. An AI poisoning effort must cater to (or even anticipate) the architecture of web crawlers like those of Common Crawl. They must build their pages a certain way and survive the data quality controls of AI developers. For those few who succeed, their reward is the permanent compromise of an AI tool with little chance of detection and limited recourse.
Fortunately, the way to defend against the poisoning of AI training data is also clear. By auditing the content present in strategically important archives like Common Crawl, it is possible to filter out certain low-quality or harmful material before it ever makes it into a next-generation model. This is especially an issue for the open-weights community and for smaller AI developers, who are more likely to use public data sources as their models’ backbones and less likely to implement rigorous data quality controls.
This threat will likely intensify in the months ahead. The long tail of AI development means that capable open-weights models may only now be encountering Pravda material in the wild. Meanwhile, Pravda and other such spam networks have only increased their volume of outputs. The longer this challenge goes unaddressed, the more pressing it will get.
Contributor: Emerson T. Brooking, Director of Strategy & Senior Fellow at the Atlantic Council’s Digital Forensic Research Lab (DFRLab).
Cite this case study:
DFRLab, “Pravda in the pipeline: Early evidence of state-adjacent propaganda in AI training data,” Digital Forensic Research Lab (DFRLab), April 7, 2026, https://dfrlab.org/2026/04/07/pravda-in-the-pipeline/