

This post has been edited. (6 months ago)
In reply to Marisa

@marisa
basically, one key way that companies like OpenAI train their language AI is by using "web crawler" software that roams around online, copying the text off web sites ("web scraping", as it's called) so they can have a consistently refreshed pile o' text for training their AI

you need *lots* of freshly written human words to train an AI -- and people are constantly writing stuff on their sites!

So what these tools do is ...

In reply to Clive Thompson

@marisa
... attempt to detect if an OpenAI web-crawler is trying to copy the text off a web site ...

... and if so, it generates fake pages with crap text -- which the OpenAI web crawler *assumes are real*, and thus dutifully copies

So OpenAI winds up feeding junk/fake/mangled text as training material into its next version of ChatGPT

The attitude is: "So, you wanna copy our site, so you can train your AI -- without us getting a penny from you? Okay, here's some *junk data*"
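
A minimal sketch of the idea described above, not how any particular tool actually does it. It assumes the crawler identifies itself in its User-Agent header (OpenAI documents the token "GPTBot" for its crawler); the function names, the word list, and the `respond` helper are all made up for illustration.

```python
import random

# Assumption: substrings that mark an AI crawler in the User-Agent header.
# "GPTBot" is the token OpenAI documents; a real tool would track many more.
AI_CRAWLER_TOKENS = ("gptbot",)

# Hypothetical vocabulary used to build the junk pages.
WORDS = ["lorem", "turnip", "quantum", "sideways",
         "gravel", "whisper", "pancake", "orbit"]


def is_ai_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent string contains a known crawler token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in AI_CRAWLER_TOKENS)


def junk_text(n_words: int = 50, seed=None) -> str:
    """Build a string of randomly chosen words to serve instead of real text."""
    rng = random.Random(seed)
    return " ".join(rng.choice(WORDS) for _ in range(n_words))


def respond(user_agent: str, real_page: str) -> str:
    """Serve junk to detected crawlers and the real page to everyone else."""
    if is_ai_crawler(user_agent):
        return junk_text()
    return real_page
```

A crawler that sends a different User-Agent would slip past this check, so real tools probably combine several signals; this only shows the basic detect-then-substitute step.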

In reply to Clive Thompson

thank you for taking the time to explain. ✨ am learning so much here
Unknown parent post

hometown - Link to original post
Clive Thompson
@carraway
very cool!
In reply to Clive Thompson

“The Net interprets censorship as damage and routes around it.”
In reply to Clive Thompson

adding a watermark over text, making and flattening a pdf is a solid way to really mess with anything involving OCR for machine learning
Unknown parent post

hometown - Link to original post
Clive Thompson

@raph_v
good question!

I don't really know

Unknown parent post

hometown - Link to original post
Clive Thompson
@raph_v
Oh I did, thank you for pointing it out!
In reply to Clive Thompson

@raph_v maybe you missed @asrg's reply?

tldr.nettime.org/@asrg/1139072…