AI doesn't need to censor you. It just needs you to disappear.
......................................... -#........................................................
.........................................:@* .......................................................
........................................ +*#: ....... .............................................
........................................ =#:#=. ....... ............................................
................................... .... #-.+#: .....+- ...........................................
..................................... ...=# .-%-.... =@:...........................................
....................................+*... #+..:=# . :#@- ..........................................
................................... -@* .**.. :-% :+*=#......... .................................
.................................. #+#-#*:... *+ -#= +* . .....=. ................................
................................. .#+.%%=:... -%..%=:.=% *: ....*%.................................
.......................... .... -#= -@=:.....#+ .%:.. +#=@#... *@+ ...............................
...........................: .. =%-..:@-. .:#@-=@- .. -%:+*. .**=# ... ..........................
...........................*+...%-:.. -:..:..--#@@*... . -% =#= %- .. -: .........................
.......................... =@-..%-:. ..-#..:::. ..=: ..:**=#- +# . .+% ..........................
...........................##* %-:.=.... =@= ..... *# .:-@=%-:.#+ .- #%: .........................
..........................#=.%**#.. *+ . =##=.. . .=@+ .:=@@@...-%:-@-:*#:.........................
........................ -%...==. :@= .**:%..... =+#+...:-*#=.. -**=% .@-.........................
.........................:@-. ..-##..#+..%: .. =+=+ .. .. .-+....=%.:+. .......................
....................... +@+:....#:#..%.- :+:.. +:*= +- ... -@-..:%: . :........................
.......................*####*=. ==-+=%+:. .....-:*=-=##....-++: -@++..=% .......................
..................... :%----==++==-==-::--. .... ::::. .:----===#@+ .%@- ......................
......................++.. ..:--==+++===--:.. ...::-==++++++==-:%-.: #=#- .....................
..................... =*.. .. ..::-==+**++++***++==-::.. .:% .%+%::%:.....................
......................:%+-:... ... ===#*==--:::...........:%:.##-.:*+.....................
.....................#@#*+**+++=--:. :+==%--::::.. ....::--==*%*-:...%= ....................
......................=#-==+=***####**+=-::..===%-:::::--==++++++++++++*%@=.*#... ..................
..................... %: +* .::-=+**#%%%#*##*##+++++++======+++++==-:%-.#+ =#...................
.................. +* :%. :%#: ...:-==+**###*+--=++++++=-:..- :% -@::=#@: .................
.................. ##*%*--:=*:#- ... .++%*+=--::... .@= .#-.=**=:%...................
.................. *+.--#@%@= :@==-:.. .....-+=#::::::.... :**#=+*@+.. .%= ..................
.................. #+ -#.*+ .*#######*+==-::... .-=+*::::::--==*#:.%*++*@%.=%- .. ................
..................+= %-*=:* .*= ..::-=++*#####**+==-*=%+++***++++%* +#=-=#.-%. =*..................
................. =@*%-#==+ :#... .::-=++*##*#*#*++++++++=-#:.#:..*+ #%.=@#..................
................. ++-- -#=#..*: ...... ..*-:-**#+++=--:..:#:.=...=# :#%#:+* ................
..................%: .=-=+*+=**=-::. ..... =* ==%=-:::.....-. .:-=##- . +# ................
..................%- :@+=+++++++**+++=--:.. ++..++#::::::....:-==++++++*%%-.+%:.................
................. =%- -* ..::--===++*******+=-:.-*. +=#::::--=+++++++++++++=%--#+. .................
.................. -#++# . ....::-==+*#####%*+*+%+++++++===++++=-::..=# @#:=..................
................... :*@=.. .. ..::-=+*##%%%+===+++++=-:.......-%.-*+-..................
.................. +#*#%###*=-:.. .... ..:=+*%++=-::..........:-##=:: ..................
................. :%. :+=*=*#%%%#*+=--:.. ..... +:#=::::::....:-==++**+*#@#...................
..................+* .+:+. .:-=+**####*+=--::... .+:%:-::::-==++********+=#+ ...................
................. =# +:* ... ..:-=++**#***++==-:*-#+==++*******++==-:. %:....................
..................:%+-+-+ ..... ..:-=+*****+*+*#******++=-::. ...:%: .................
.................::=*###%*=-:. ...... .+-*-==+*++**++=-::............+#+:..................
................ ....:-=*####*+=-:. .....-=+: ++%+=--:::.........:-=+**##-:.................
.................... ...:-=+*####*+-:.. =-+...:+#+::::::.....:-=+*###*+-:... ................
........................ ....:-=+*####*+=-.=-+. -+#=:::::::-=+*###*+=-:. ..................
.............................. ....::-=+*######=:.:=*+:--=+**##*+=-:... .......................
................................... ....::-=+*###*##@**###+=-:.. ...........................
......................................... . ..:--=+++=-:.. ................................
cat /alexandria: No such file or directory
Censorship used to need a censor. Someone burning a book, someone rewriting a textbook, someone calling the printer.
It doesn’t anymore. In the AI-mediated web, disappearing from the index is enough.
the model is the new front page
For twenty years, reading the web meant opening a page. You typed a query, you got ten blue links, you clicked one.
Now you ask a model. The model reads for you and tells you what’s there. The page itself is something you almost never see. People talk to ChatGPT, Claude, Perplexity and Gemini more than they talk to the source of the information. The author who wrote it, the site that hosts it, the journalist who reported it — all sit one layer below a chat window most readers never leave.
This sounds like a UX shift. It isn’t. It’s a substrate shift. Once a page isn’t in the training set or the retrieval index, it stops existing for the median reader. Not “harder to find.” Gone. The reader doesn’t know what they never asked about and were never told.
The same thing happened to card catalogs, to phone books, to RSS. Each layer wasn’t deleted — it just stopped being the layer anyone consulted. After a while, “no longer consulted” becomes “no longer real.”
AI passed human writing in November 2024
This wouldn’t matter much if the underlying corpus had stayed representative of human writing. It hasn’t.
Graphite scored 65,000 English-language articles from Common Crawl, sampled monthly from January 2020 to May 2025, with Surfer’s AI detector. AI-generated articles crossed human-written articles in November 2024 and reached 51.7% by May 2025.[^1]
[Figure: share of AI-generated vs. human-written articles, 2020 to 2025, per Graphite's analysis of 65,000 Common Crawl articles. Annotations: ChatGPT launch, Nov 2022; crossover, Nov 2024; AI 52%, human 48% by May 2025.]
One caveat, to keep this honest: only about 14% of articles ranking in Google Search are AI-generated, per the same study.[^1] Publication volume is not the same as visibility. But training corpora are built from publication volume, not from search rankings. The model sees the slop. The reader sees the curated front.
The web is not getting smaller. It’s getting thinner. More tokens, less signal.
the model is training on itself
When a model is trained mostly on output from earlier models, the rare tails of the original distribution disappear first.
Shumailov et al. published this in Nature in 2024 under a title that doesn’t bury the lede: AI models collapse when trained on recursively generated data.[^2] What goes is not the popular middle but the edges. Niche dialects. Minority opinions. The one historian who disagreed with the consensus. The post that nobody linked to but happened to be right.
Each generation eats its own tail. The long tail goes first.
Call this recursive forgetting. It’s not malicious. It’s not even a bug. It’s the geometry of training on a distribution that no longer matches reality. And the share of synthetic training input is forecast to dominate by the end of the decade — Gartner has projected synthetic data will overshadow real data in AI training by 2030.[^3]
[Figure: data used for AI training, 2020 to 2030, real vs. synthetic. Real data grows roughly linearly; synthetic data grows roughly exponentially and is forecast to overshadow real data by 2030.]
After enough cycles, the model’s worldview converges to its own prior. What was outside that prior is no longer reachable, no matter how true it was.
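A toy simulation makes the mechanism concrete. This is a minimal sketch, not the experimental setup from Shumailov et al.: it assumes a "model" that is nothing more than topic frequencies counted from a finite sample of the previous generation's output. The function names and every parameter (200 topics, 500 samples, 20 generations) are illustrative choices, not figures from any source.

```python
import random
from collections import Counter


def zipf_distribution(n_topics):
    """Long-tailed starting point: topic k gets weight 1/(k+1), normalized."""
    weights = [1.0 / (k + 1) for k in range(n_topics)]
    total = sum(weights)
    return [w / total for w in weights]


def next_generation(probs, sample_size, rng):
    """Draw a finite corpus from the current model, then refit by counting."""
    topics = range(len(probs))
    corpus = rng.choices(topics, weights=probs, k=sample_size)
    counts = Counter(corpus)
    # A topic that drew zero samples gets probability zero from here on.
    return [counts[t] / sample_size for t in topics]


def simulate(n_topics=200, sample_size=500, generations=20, seed=0):
    """Return (surviving-topic count per generation, final distribution)."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    probs = zipf_distribution(n_topics)
    support_sizes = [sum(p > 0 for p in probs)]
    for _ in range(generations):
        probs = next_generation(probs, sample_size, rng)
        support_sizes.append(sum(p > 0 for p in probs))
    return support_sizes, probs


if __name__ == "__main__":
    sizes, _ = simulate()
    for generation, size in enumerate(sizes):
        print(f"generation {generation:2d}: {size} topics still reachable")
```

In this toy setup a topic that draws zero samples has no way back, so the count of reachable topics can only shrink, and the low-weight topics are the likeliest to go in any given round. Real training pipelines are far more complicated; the sketch only shows why finite sampling from your own output erodes the tail.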
one percent of classical literature survived the last index
We’ve seen this before, on a slower timescale.
Roughly 1% of classical Greek and Latin literature survived to today.[^4] Of about three hundred Athenian tragedies known by name, thirty-two complete texts came down to us. The cause, contrary to the school-textbook narrative, was almost never a single fire. The Library of Alexandria didn’t fall in one night. It was indexing decisions, copying priorities, taste shifts, storage failures, compounding over centuries. The texts that didn’t get re-copied didn’t survive. The texts that didn’t get translated didn’t get re-copied.
Survivorship was determined upstream of any reader. It was a function of which scribes thought a text was worth their week.
The same machinery is running again. Different scribes.
most books ever written are not in any model
Before we even get to the indexing question, there’s a more boring one: a lot of human writing simply isn’t digitized.
Google estimated in 2010 that there were about 130 million distinct book titles in the world.[^5] Fifteen years later, the latest reported figures are roughly 40 million books scanned by Google Books (as of 2019) and about 18 million volumes held by HathiTrust, the largest open scholarly mirror (as of 2024).[^6] Allowing for the overlap between the two, under a third of everything ever published sits in any digital corpus an LLM can train on.
The rest sit in physical libraries, university basements, private collections. For a model, they aren’t just hard to access — they don’t exist. The author’s argument was made, the book was published, and the model has never heard of it.
Most of human knowledge, from the model’s perspective, is already gone.
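The share of books that are digitized depends on how much the two collections overlap, which the cited figures don't settle. A rough sketch of the bounds, using only the numbers in the footnotes and assuming, as a simplification, that HathiTrust volumes can be counted like distinct titles (the constant and function names are illustrative):

```python
# Figures as cited in the footnotes. Treating "volumes" and "titles" as the
# same unit is a simplifying assumption, not something the sources claim.
TOTAL_TITLES = 129_864_880        # Google's 2010 estimate of distinct titles
GOOGLE_SCANNED = 40_000_000       # Google Books, reported 2019
HATHITRUST_VOLUMES = 18_000_000   # HathiTrust, reported 2024


def digitized_share_bounds(total, a, b):
    """Return (low, high) share of `total` covered by two collections.

    low:  one collection sits entirely inside the other (full overlap).
    high: the collections share nothing (no overlap).
    """
    low = max(a, b) / total
    high = min(a + b, total) / total
    return low, high


low, high = digitized_share_bounds(TOTAL_TITLES, GOOGLE_SCANNED, HATHITRUST_VOLUMES)
print(f"full overlap: {low:.1%}")   # 30.8%
print(f"no overlap:   {high:.1%}")  # 44.7%
```

The low end corresponds to HathiTrust's holdings sitting almost entirely inside Google's scans. Even at the high end, more than half of the estimated titles are in neither collection.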
the archive is one lawsuit thick
The thin part of the corpus is also fragile.
In June 2024, the Internet Archive removed about 500,000 books from its Open Library after losing Hachette v. Internet Archive at the district court level.[^7] In September 2024, the Second Circuit affirmed.[^8] One of the largest non-commercial digital archives in the world lost half a million titles to a single ruling.
The Internet Archive is one organization. The Wayback Machine — the only meaningful record of what the open web used to look like — depends on the same nonprofit budget, the same legal exposure, the same single point of failure. There is no second archive of the same scale.
If the next ruling goes the same way, or the funding cliff hits, or the building floods, that record goes with it. The model doesn’t need to be told the books are gone. They just stop being in the next training run.
censorship without a censor
Pull these threads together and a quieter pattern shows up.
You don’t have to ban a book. You only have to keep it out of a few key sources before the next training cycle. You don’t have to silence a writer. You only have to keep their site out of the index. You don’t have to rewrite history. You only have to wait for the long tail to thin, and let recursion do the rest.
This is censorship in the mathematical sense — a reduction in the support of the distribution — without anything that looks like a censor. No fire, no banned-books list, no public villain. Just an indexing decision, propagated.
And because all of this happens upstream of the reader, the reader can’t tell. You don’t notice the books that aren’t on the shelf when you’ve never seen the shelf.
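One way to pin down "the mathematical sense" used above, as a sketch: write p for the distribution over texts a reader can actually reach, and call its support the set of texts with nonzero probability. The notation (p, q, X, t) is introduced here for illustration and does not come from the cited sources.

$$
\operatorname{supp}(p) = \{\, x \in X : p(x) > 0 \,\}, \qquad \text{censorship: } p \mapsto q \ \text{ with } \ \operatorname{supp}(q) \subsetneq \operatorname{supp}(p)
$$

Under the simplifying assumption that each model generation $p_{t+1}$ puts weight only on texts that appeared in a sample drawn from $p_t$, it follows that $\operatorname{supp}(p_{t+1}) \subseteq \operatorname{supp}(p_t)$. A text that drops out once has no route back in, and nobody had to decide to remove it.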
four models read it for everyone
You don’t need most of the world’s content to be missing. You only need it to be missing from three or four models.
People use what’s good and convenient. The LMArena leaderboard is dominated by a handful — GPT, Claude, Gemini, Grok — and the next thirty are footnotes for hobbyists. If you ask a real person a real question this year, the answer almost certainly came through one of four models.
Concentration is the multiplier. If your work isn’t in the corpus those four were trained on, or it sits below their relevance threshold, the median reader of 2030 will never encounter it — even if the page still exists, even if Common Crawl still has a snapshot somewhere. The funnel is narrow on purpose. Better models cost more to train and serve. The economics push toward a few winners. The few winners pick what survives.
what this means for your work
If you write on the internet today, you’re competing for a slot in a small number of models’ worldviews. Most of what gets published doesn’t get one.
The Substack post you wrote last March. The Notion site about your startup. The LinkedIn essay that got 200 likes. Your work probably won’t vanish; most of it gets scraped eventually. What happens is harder to fight: it gets diluted into a sea of AI-drafted content written around the same prompts. The model knows you wrote something. It just doesn’t weight you above the noise.
And if you don’t make the next training cycle at all, or you make it but get down-weighted to background, the practical effect is the same as if you’d never published. The reader asking the model never finds out you wrote it.
You can’t control the index. You can control what you put in it, and how often. You’re either in the index, or you’re already invisible.
a record, in case
The probability that this very post survives the next decade in any LLM’s worldview isn’t zero. But it isn’t one either.
Neither is yours.
This blog is a small bet against erasure — a place to leave what I think and what people I admire think, in stable URLs, mirrored, fed to the index, so that some version of it has a chance to land in whatever the readers of 2035 are using.
I don’t think the internet has to go dead. The library can be rebuilt — better this time, because we know what’s at stake. This is a challenge to solve, not a fate to accept.
The default outcome is erasure. The exception is what you choose to put in the index, and how often. Write the thing. Publish it somewhere stable. Mirror it.
[^1]: Graphite, More Articles Are Now Created by AI Than Humans, Oct 2025. n = 65,000 English-language Common Crawl articles, Jan 2020 – May 2025, scored with Surfer’s AI detector. The 14% Google-Search visibility figure is from the same study.
[^2]: Shumailov, I., Shumaylov, Z., Zhao, Y., Papernot, N., Anderson, R. & Gal, Y. AI models collapse when trained on recursively generated data. Nature 631, 755–759 (2024).
[^3]: Gartner has projected for several years that synthetic data will overshadow real data in AI training by 2030. Earlier waypoint: Gartner (2021) projected 60% of AI/analytics data would be synthetically generated by 2024.
[^4]: Standard scholarly estimate (see Reynolds & Wilson, Scribes and Scholars), though contested as to method. Of ~300 Athenian tragedies known by name, 32 complete texts survive.
[^5]: Leonid Taycher, “Books of the world, stand up and be counted! All 129,864,880 of you,” Google Books blog, August 2010.
[^6]: Google Books reported >40M scanned at the 15-year anniversary in October 2019. HathiTrust reported >18M volumes in September 2024.
[^7]: Internet Archive blog, “Let Readers Read,” June 17, 2024. Cites the ~500,000 figure for books removed from CDL after the publishers’ victory at the SDNY.
[^8]: Hachette Book Group, Inc. v. Internet Archive, No. 23-1260, 2d Cir. Sept. 4, 2024 (affirming SDNY).
