OpenAI strikes Reddit deal to train its AI on your posts

return2ozma@lemmy.world · 2 months ago

OpenAI strikes Reddit deal to train its AI on your posts

boatsnhos931@lemmy.world · 2 months ago

No wonder AI is crazy AF.

macrocephalic@lemmy.world · 2 months ago

All future AI will have autocorrect errors and will look like no one read it before hitting enter. You’re welcome.

FaizalR@kbin.social · 2 months ago

But Reddit is full of NSFW content.

Possibly linux@lemmy.zip · 2 months ago

And the problem is?

just another dev@lemmy.my-box.dev · 2 months ago

Not through the API.

jordanlund@lemmy.world · 2 months ago

BRB - changing my entire 15 year reddit comment history to “Fuck Spez”. LOL.

micka190@lemmy.world · 2 months ago

Realistically, when you’re operating at Reddit’s scale, you’re probably keeping a history of each comment for analytics purposes.

RecluseRamble@lemmy.dbzer0.com · 2 months ago

That was really my thought - future iterations of Chat GPT won’t like spez very much.

return2ozma@lemmy.world · 2 months ago

Know any bots or ways to perma delete all Reddit comments?

catloaf@lemm.ee · 2 months ago

https://github.com/j0be/PowerDeleteSuite

bobs_monkey@lemm.ee · edit-2 2 months ago

I used redact.dev to mass edit all my comments, worked pretty well. Problem is that if you mass delete, they’ll restore them pretty quick, but so far they haven’t reverted my edits.

Rolando@lemmy.world · 2 months ago

Back when I deleted all my comments, I was told I could claim to be in Europe and make a request citing the European law that Reddit has to follow. I think Reddit had a page where you could make the request, but of course it was hard to find.

thejml@lemm.ee · 2 months ago

Reddit has backups, permanently isn’t an option.

metaStatic@kbin.social · 2 months ago

yep they fuckin got us

but it’s not like our posts are safe here either. This is the world we live in now.

the_doktor@lemmy.zip · 1 month ago

We have to either make AI illegal or make it accountable by giving references to where it gets its data so it can properly cite its sources.

andrew@lemmy.stuart.fun · 2 months ago

But here, the API is open and I can run my own copy and train my own LLM same as anyone else. It’s not one asshole who decides to whom and for how much he’ll sell the content we all gave him for free, so he can justify his $193 million paycheck.

PseudorandomNoise@lemmy.world · 2 months ago

Does that really matter? The owner of a given instance can still choose to sell everything on their server, no?

db2@lemmy.world · 2 months ago

They’re not multiple though, edit it and then delete it and it’s gone. They disabled all the tools to do it though so it’s manually or nothing now.

Coasting0942@reddthat.com · 2 months ago

Damn. You outsmarted them well paid data jockeys. And assuming your edits change the actual comment and don’t simply hide the original.

I could be an idiot too though. Reddit might have been running this whole shit show on the original version of the database system and be upselling to buyers.

SchmidtGenetics@lemmy.world · 2 months ago

They just reload a previous cached comment, doesn’t matter how many times you edit or delete, it’s all logged and backed up.

Imgonnatrythis@sh.itjust.works · 2 months ago

Will be interesting to see if they stoop so low as to allow this. Probably wouldn’t be a super wise move as most deleted posts are likely material that would not be great to train on anyway. My first thought when I read this was, “well, not on MY posts” I’m clean off of reddit.

FaceDeer@fedia.io · 2 months ago

There are torrents of complete Reddit comment archives available for any random person who wants them, I’m sure Reddit themselves has a comprehensive edit history of everything.

mox@lemmy.sdf.org · 2 months ago

There have already been reports of people being banned and finding their posts restored in response to their attempts to delete them.

noorbeast@lemmy.zip · 2 months ago

Finally found a use for MS Edge, loaded up Nuke Reddit History and removed all comments and posts: https://microsoftedge.microsoft.com/addons/detail/nuke-reddit-history/bklbcgohenjegdibgmppligaapohkgip

gravitas_deficiency@sh.itjust.works · 2 months ago

Hate to break it to you, but the time to do that was over a year ago, and even then it wasn’t ever really a sure thing - we don’t really know what their backup policies are around that stuff.

This is what the former power user community that made an exodus from Reddit roughly a year ago has been trying to communicate, but a ton of people here seem to enjoy keeping their toes in the water over there, with rather predictable consequences (literally, the post we’re commenting on).

All that said: I am very much looking forward to the absolutely titanic lawsuit around GDPR I’m sure is in the works over this.

AlexWIWA@lemmy.ml · 2 months ago

Not even a year ago. Reddit has been used for training data for well over a decade. We used it in 2012 in an AI class.

gravitas_deficiency@sh.itjust.works · 2 months ago

My point is that there was not a revenue-generating b2b contract allowing another company to exploit it at scale, while compensating Reddit directly.

AlexWIWA@lemmy.ml · 2 months ago

My apologies. I missed it

snownyte@kbin.social · 2 months ago

Wish I had known this beforehand in like several accounts I’ve had with that shit-ass place.

Then again, it’s likely that Reddit has shit archived because Spez is one of them data-farmers like Mark is. Nothing is truly deleted from their sites. It’s just archived.

There’s been lots of evidence that proves this, because people have dug up old comments, even down to who posted it originally. Then, even if your account is deleted, your comment body is still there, I know because I’ve deleted an account and checked back where I was before.

humorlessrepost@lemmy.world · 2 months ago

Worth doing, but I suspect they’re sending OpenAI snapshots of the database from before you did that.

Everythingispenguins@lemmy.world · 2 months ago

Some day historians will be able to look back at this moment and be able to determine it was what caused ChatGPT to become horny and weird.

frickineh@lemmy.world · 2 months ago

My comment history was like 50% shitposting about the beauty industry and 50% hating on Christian fundamentalists. There’s honestly no way it won’t make AI at least a little bit worse, and I’m not mad about it.

Flying Squid@lemmy.world · 2 months ago

That AI is going to be super anti-Christian fundementalist (or possibly just anti-Christian), so maybe there is an upside.

assassin_aragorn@lemmy.world · 2 months ago

Only an idiot would decide to mindlessly trawl Reddit to train an LLM. They’ll be confused when their model suddenly is confidently wrong about everything and have no clue.

Everythingispenguins@lemmy.world · 2 months ago

You are a hundred percent right, but how many idiots are there out there?

assassin_aragorn@lemmy.world · 2 months ago

Uncountably many

Nougat@fedia.io · 2 months ago

Then they will learn that Spez doesn’t get to profit from me anymore.

FaceDeer@fedia.io · 2 months ago

You think they don’t have the originals archived?

db2@lemmy.world · 2 months ago

Not my posts. Go ahead, look at what remains. The rest was edited and then deleted.

Fuck you, Steve. Right in the ass.

yeehaw@lemmy.ca · 2 months ago

If only snapshots and backups were a thing…

Todgerdickinson@lemmy.world · 2 months ago

Yea that’s the problem isn’t it. I had a great idea involving bullshit-efying my comments by editing them slowly with a LLM via long running script and repeatedly over months.

I realised that they probably don’t delete the original text on edit anyway which, as you say is probably buried in a backup someplace.

Ace! _SL/S@ani.social · 2 months ago

I don’t think it is in backups only. My guess is they store your full edit history for each comment/post/whatever. Newest one will be shown on the frontend, rest is for data vampires

yeehaw@lemmy.ca · 2 months ago

This is it exactly. Edits to use are “changed”. To the back end it’s just an iteration while the rest still exist.

CeeBee@lemmy.world · 2 months ago

It’s theoretically possible, but the issue that anyone trying to do that would run into is consistency.

How do you restore the snapshots of a database to recover deleted comments but also preserve other comments newer than the snapshot date?

The answer is that it’s nearly impossible. Not impossible, but not worth the massive monumental effort when you can just focus on existing comments which greatly outweigh any deleted ones.

yeehaw@lemmy.ca · 2 months ago

It’s a piece of cake. Some code along the lines of:

If ($user.modifyCommentRecentlyCount > 50){

Print “user is nuking comments” $comment = $previousComment }

Or some shit. It can be done quite easily, trust me.

CeeBee@lemmy.world · 2 months ago

It can be done quite easily, trust me.

The words of every junior dev right before I have to spend a weekend undoing their crap.

I’ve been there too many times.

There are always edge cases you need to account for, and you can’t account for them until you run tests and then verify the results.

And you’d be parsing billions upon billions of records. Not a trivial thing to do when running multiple tests to verify. And ultimately for what is a trivial payoff.

You don’t screw around with infinitely invaluable prod data of your business without exhausting every single possibility of data modification.

It’s a piece of cake.

It hurts how often I’ve heard this and how often it’s followed by a massive screw up.

yeehaw@lemmy.ca · 2 months ago

The words of every junior dev right before I have to spend a weekend undoing their crap.

There are so many ways this can be done that I think you are not thinking of. Say a user goes to “shreddit” (or some other similar app) their comments. They likely have thousands. On every comment edit, it’s quite easy to check the last time the users edited one of their comments. All they need is some check like checking if the last 10 consecutive comments were edited in hours or milliseconds/seconds. After that, reddit could easily just tell the user it’s editing their comments but it’s not. Like a shadowban kind of method. Another way would be at the data structure level. We don’t know what their databases and hardware are like, but I can speculate. What if each user edited comment is not an update query on a database, but an add/insert. Then all you need to do is update the live comments where the date is before the malicious date where the username=$username. Not to mention when you start talking Nimble storage and stuff like that, the storage is extremely quick to respond. Hell I would wager it didn’t even hit storage yet, probably still on some all flash cache or in memory. Another way could be at the filesystem level. Ever heard of zfs? What if each user had their own dataset or something, it’s extremely easy and quick to roll back a snapshot, or to clone the previous snapshot. There are so many ways.

At the end of the day a user is triggering this action, so we don’t necessarily need to parse “billions” of records. Just the records for a single user.

CeeBee@lemmy.world · edit-2 2 months ago

There are so many ways this can be done that I think you are not thinking of.

No, I can think of countless ways to do this. I do this kind of thing every single day.

What I’m saying is that you need to account for every possibility. You need to isolate all the deleted comments that fit the criteria of the “Reddit Exodus”.

How do you do that? Do you narrow it down to a timeframe?

The easiest way to do this is identify all deleted accounts, find the backup with the most recent version of their profile with non-deleted comments, and insert that user back into the main database (not the prod db).

Now you need to parse billions upon billions upon billions of records. And yes, it’s billions because you need the system to search through all the records to know which record fits the parameters. And you need to do that across multiple backups for each deleted profile/comment.

It’s a lot of work. And what’s the payoff? A few good comments and a ton of “yes this ^” comments.

I sincerely doubt it’s worth the effort.

Edit: formatting

yeehaw@lemmy.ca · 2 months ago

How do you do that? Do you narrow it down to a timeframe?

When a user edits a comment, they submit a response. When they submit a response, they trigger an action. An action can do validation steps and call methods, just like I said above, for example. When the edit action is triggered, check the timestamp against the previously edited comment’s timestamp. If the previous - or previous 5 are less than a given timeframe, flag it. “Shadowban” the user. Make it look like they’ve updated their comments to them, but in reality they’re the same.

We’ve had detection methods for this sort of thing for a long time. Thing about how spam filtering works. If you’re using some tool to scramble your data, they likely have patterns. To think reddit doesn’t have some means to protect itself against this is naive. It’s their whole business. All these user submitted comments are worth money.

Now you need to parse billions upon billions upon billions of records. And yes, it’s billions because you need the system to search through all the records to know which record fits the parameters. And you need to do that across multiple backups for each deleted profile/comment.

This makes me thing you don’t understand my meaning. I think you’re talking about one day reddit decides to search for an restore obfuscated and deleted comments. Yes, that would be a large undertaking. This is not what I’m suggesting at all. Stop it while it’s happening, not later. Patterns and trends can easily identify when a user is doing something like shreddit or the like, then the code can act on it.

It’s a lot of work. And what’s the payoff? A few good comments and a ton of “yes this ^” comments.

this

Patrick@lemmy.today · 2 months ago

Everyone wants a piece of the AI pie…

jabathekek@sopuli.xyz · 2 months ago

Little did they know, the pie was just a hallucination.

LilDestructiveSheep@lemmy.world · 2 months ago

Gonk

AutoTL;DR@lemmings.world · 2 months ago

This is the best summary I could come up with:

OpenAI has signed a deal for access to real-time content from Reddit’s data API, which means it can surface discussions from the site within ChatGPT and other new products.

It’s an agreement similar to the one Reddit signed with Google earlier this year that was reportedly worth $60 million.

The deal will also “enable Reddit to bring new AI-powered features to Redditors and mods” and use OpenAI’s large language models to build applications.

Recently, following news of a partnership between OpenAI and the programming messaging board Stack Overflow, people were suspended after trying to delete their posts.

No financial terms were revealed in the blog post announcing the arrangement, and neither company mentioned training data, either.

That last detail is different from the deal with Google, where Reddit explicitly stated it would give Google “more efficient ways to train models.” There is, however, a disclosure mentioning that OpenAI CEO Sam Altman is also a shareholder in Reddit but that “This partnership was led by OpenAI’s COO and approved by its independent Board of Directors.”

The original article contains 334 words, the summary contains 174 words. Saved 48%. I’m a bot and I’m open source!

villainy@lemmy.world · 2 months ago

“Strikes” made me think they were cancelling the deal. Like strike-through, crossed it out, etc. Too bad.

jeanofthedead@sh.itjust.works · 2 months ago

Does this mean I can stop prefacing my AI requests with “According to Reddit…”?

AlexWIWA@lemmy.ml · 2 months ago

LLMs have been training on Reddit posts since at least 2012. Nothing really new here.

YIj54yALOJxEsY20eU@lemm.ee · 2 months ago

Now they get to train on all the “deleted” comments/posts as well.

SparrowRanjitScaur@lemmy.world · edit-2 2 months ago

Probably not, I’m sure they’re training on Reddit’s internal data set which likely includes all deleted posts.

YIj54yALOJxEsY20eU@lemm.ee · 2 months ago

Did you just say probably not then agree with me?

SparrowRanjitScaur@lemmy.world · edit-2 2 months ago

Ya, lol. Sorry, I’m not sure if I replied to the wrong comment or just misread your comment earlier. I agree with you.

YIj54yALOJxEsY20eU@lemm.ee · 2 months ago

Lol no worries

UnderpantsWeevil@lemmy.world · 2 months ago

It’s ground zero for Bots training on other Bots

Possibly linux@lemmy.zip · edit-2 2 months ago

They now are paying Reddit? I thought they could just scrape for free.

Also, you can not delete anything on the internet. Once something is public there will always be a copy somewhere.

nondescripthandle@lemmy.dbzer0.com · 2 months ago

My guess is reddit was cheap enough that it made sense to pay them as sort of insurance they dont get sued in the future.

Fetus@lemmy.world · 2 months ago

Scraping through a website at the scale they are talking about isn’t really viable. You need access to the API so that you can have very targeted requests.

This is why reddit changed their API pricing and screwed over everyone using third party apps. They can make more money selling access to LLM trainers than they could from having millions of people using apps that rely on the API.

Dr. Moose@lemmy.world · edit-2 2 months ago

Scraping at scale is actually cheaper than buying API access. It’s a massive rising market, try googling “web scraping service” and there are hundreds of services that provide API to scrape any public web page and bypass the blocks for you and render all of the javascript.

BatrickPateman@lemmy.world · 2 months ago

Scraping ia nice for static conten, no doubt. But I wonder at what point it is easier to request changes to a developing thread via API than to request the whole page with all nested content over and over to find the new answes in there.

micka190@lemmy.world · 2 months ago

There’s actually legal precedent against scrapping a website through unofficial channels, even if the information is private. But basically, if you scrape a website and hinder their ability to operate, it falls under “virtual trespassing”.

I’m assuming it would be even worse now that everyone is using the cloud and that scrapping their site would cause a noticeable increase in resource cost (and thus, directly cost them more money because of cloud usage fees).

It’s why APIs are such a big deal. They provide you with an official, controlled, entry point to a platform’s data.

Dr. Moose@lemmy.world · edit-2 2 months ago

It’s the opposite! There’s legal precedence that scraping public data is 100% legal in the US.

There are few countries where scraping is illegal though like Japan and China. European countries often also have things called “database protection” laws that forbid replicating public databases through scraping or any other means but that has to be a big chunk of overal database. Also there are personally identifiable info (PII) protection laws that protect storing of people data without their consent (like GDPR).

Source: I work with anti bot tech and we have to explain this to almost every customer who wants to “sue the web scrapers” that lol if Linkedin couldn’t do it, you’re not sueing anyone.

General_Effort@lemmy.world · 2 months ago

Refreshing to see a post on this topic that has its facts straight.

EU copyright allows a machine-readable opt-out from AI training (unless it’s for scientific purposes). I guess that’s behind these deals. It means they will have to pay off Reddit and the other platforms for access to the EU market. Or more accurately, EU customers will have to pay Reddit and the other platforms for access to AIs.

RizzRustbolt@lemmy.world · 2 months ago

Those poor silicon atoms…

OpenAI strikes Reddit deal to train its AI on your posts

OpenAI strikes Reddit deal to train its AI on your posts

Reddit’s deal with OpenAI will plug its posts into “ChatGPT and new products”