Via the Algorithmic Sabotage Research Group (@asrg) ...
... here's a list of code designed to poison the well for AI web-scrapers
tldr.nettime.org/@asrg/1138674…
(thanks to @peterfr for pointing this one out!)
Sabot in the Age of AI
Here is a curated list of strategies, offensive methods, and tactics for (algorithmic) sabotage, disruption, and deliberate poisoning.
🔻 iocaine
The deadliest AI poison—iocaine generates garbage rather than slowing crawlers.
🔗 git.madhouse-project.org/alger…
🔻 Nepenthes
A tarpit designed to catch web crawlers, especially those scraping for LLMs. It devours anything that gets too close. @aaron
🔗 zadzmo.org/code/nepenthes/
🔻 Quixotic
Feeds fake content to bots and robots.txt-ignoring #LLM scrapers. @marcusb
🔗 marcusb.org/hacks/quixotic.htm…
🔻 Poison the WeLLMs
A reverse-proxy that serves dissociated-press style reimaginings of your upstream pages, poisoning any LLMs that scrape your content (the basic idea is sketched in code after this list). @mike
🔗 codeberg.org/MikeCoats/poison-…
🔻 Django-llm-poison
A django app that poisons content when served to #AI bots. @Fingel
🔗 github.com/Fingel/django-llm-p…
🔻 KonterfAI
A model poisoner that generates nonsense content to degenerate LLMs.
🔗 codeberg.org/konterfai/konterf…
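Several of the tools above (Quixotic, Poison the WeLLMs, KonterfAI) lean on the same basic trick: regenerate real pages into text that mimics their style but says nothing. Here is a minimal, hypothetical Python sketch of that "dissociated press" idea, not the code of any tool listed here; the file name real_page.txt and all parameters are made up for illustration.

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each `order`-word prefix to the words that follow it in the source text."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def dissociate(chain, length=200):
    """Sample the chain to produce text that mimics the source's style but says nothing."""
    prefix = random.choice(list(chain))
    output = list(prefix)
    for _ in range(length):
        followers = chain.get(tuple(output[-len(prefix):]))
        if not followers:                      # dead end: jump to a fresh random prefix
            prefix = random.choice(list(chain))
            output.extend(prefix)
            continue
        output.append(random.choice(followers))
    return " ".join(output)

if __name__ == "__main__":
    # "real_page.txt" stands in for whatever page the proxy would normally serve
    with open("real_page.txt") as f:
        print(dissociate(build_chain(f.read())))
```

In practice a tool wraps something like this behind a reverse proxy or tarpit so that only detected crawlers ever see the mangled version; Clive explains that detection step downthread.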
peterfr (in reply to Clive Thompson)

Clive Thompson (in reply to peterfr)

Marisa (in reply to Clive Thompson)

Clive Thompson (in reply to Marisa):
@marisa
basically, one key way that companies like OpenAI train their language AI is by using "web crawler" software that roams around online, copying the text off web sites ("web scraping", as it's called) so they can have a consistently refreshed pile o' text for training their AI
you need *lots* of freshly written human words to train an AI -- and people are constantly writing stuff on their sites!
So what these tools do is ...
Clive Thompson (in reply to Clive Thompson):
@marisa
... attempt to detect if an OpenAI web-crawler is trying to copy the text off a web site ...
... and if so, they generate fake pages with crap text -- which the OpenAI web crawler *assumes are real*, and thus dutifully copies
So OpenAI winds up feeding junk/fake/mangled text as training material into its next version of ChatGPT
The attitude is: "So, you wanna copy our site, so you can train your AI -- without us getting a penny from you? Okay, here's some *junk data*"
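To make that concrete, here is a deliberately simplified Python sketch of the detect-and-poison step, not the actual code of any tool in the list above: a tiny web server that checks a request's User-Agent header and serves generated junk to suspected AI scrapers while everyone else gets the real page. The bot signature list and page contents are illustrative assumptions; real deployments maintain longer, regularly updated bot lists and far more convincing junk.

```python
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative substrings only; real blocklists are longer and updated often.
AI_BOT_SIGNATURES = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

FILLER = ("the", "weather", "above", "quietly", "seventeen", "lobsters",
          "negotiate", "fluorescent", "history", "beneath", "tuesday")

def junk_page(n_words=400):
    """Plausible-looking nonsense for crawlers to dutifully copy."""
    text = " ".join(random.choice(FILLER) for _ in range(n_words))
    return f"<html><body><p>{text}</p></body></html>"

REAL_PAGE = "<html><body><p>The page humans actually came here to read.</p></body></html>"

class PoisonHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        user_agent = self.headers.get("User-Agent", "")
        is_bot = any(sig in user_agent for sig in AI_BOT_SIGNATURES)
        body = (junk_page() if is_bot else REAL_PAGE).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), PoisonHandler).serve_forever()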
Marisa (in reply to Clive Thompson)

Clive Thompson (in reply to Marisa):
@marisa
🤘 🤖
Marisa (in reply to Clive Thompson)

Clive Thompson (in reply to Marisa):
@marisa
that's what the internet is for!
Jonathan Lamothe (in reply to Clive Thompson)

Clive Thompson (in reply to Jonathan Lamothe):
@me
🤘
Clive Thompson (in reply to an unknown post):
very cool!
Wulfy (in reply to Clive Thompson)

elle mundy (in reply to Clive Thompson)

Chris (in reply to Clive Thompson)

Clive Thompson (in reply to Chris)

Clive Thompson (in reply to an unknown post):
@raph_v
good question!
I don't really know
Clive Thompson (in reply to an unknown post):
Oh I did, thank you for pointing it out!
peterfr (in reply to Clive Thompson):
@raph_v maybe you missed @asrg's reply?
tldr.nettime.org/@asrg/1139072…