• QuaternionsRock@lemmy.world
        4 months ago

        It probably means that they don’t scrape and preprocess training data in house. She knows they get it from an assortment of underpaid contractors, but she doesn’t know the specific data sources beyond the stipulations of the contract (“publicly available or licensed”), and she probably doesn’t even know that for certain.

        • driving_crooner@lemmy.eco.br
          4 months ago

          Her date was eating all the fully loaded nachos, so she went up and asked the waitress to make up a rule about how one person can’t eat all the nachos with meat and cheese. But her date knew that rule was bullshit and called her out on it. She’s trying to look confused and sad because they’re going to be too early for the movie.

  • AutoTL;DR@lemmings.world [bot]
    4 months ago

    This is the best summary I could come up with:


    Mira Murati, OpenAI’s longtime chief technology officer, sat down with The Wall Street Journal’s Joanna Stern this week to discuss Sora, the company’s forthcoming video-generating AI.

    It’s a bad look all around for OpenAI, which has drawn wide controversy — not to mention multiple copyright lawsuits, including one from The New York Times — for its data-scraping practices.

    After the interview, Murati reportedly confirmed to the WSJ that Shutterstock videos were indeed included in Sora’s training set.

    But when you consider the vastness of video content across the web, any clips available to OpenAI through Shutterstock are likely only a small drop in the Sora training data pond.

    Others, meanwhile, jumped to Murati’s defense, arguing that if you’ve ever published anything to the internet, you should be perfectly fine with AI companies gobbling it up.

    Whether Murati was keeping things close to the vest to avoid more copyright litigation or simply didn’t know the answer, people have good reason to wonder where AI data — be it “publicly available and licensed” or not — is coming from.


    The original article contains 667 words, the summary contains 178 words. Saved 73%. I’m a bot and I’m open source!

  • Politically Incorrect@lemmy.world
    4 months ago

    Watching a video or reading an article isn’t copyright infringement when a human does it, so why is it when an “AI” does? I believe the copyright infringement is committed by the prompt, and therefore by the user, not the tool.

    • Drewelite@lemmynsfw.com
      4 months ago

      This is what people fundamentally don’t understand about intelligence, artificial or otherwise. People feel like their intelligence is 100% “theirs”. While I certainly would advocate that a person owns their intelligence, it didn’t spawn from nothing.

      You’re standing on the shoulders of everyone who came before you. Take a prehistoric man or an alien who hasn’t had any of the same experiences you’ve had: they won’t be able to function in this world. It’s not because they’re any dumber than you; it’s because you absorbed the hive mind of the society you live in. Everyone’s racing to slap their brand on stuff and copyright it to get ahead and carve out their space.

      “No, you can’t tell that story, it’s mine.” “That art is so derivative.”

      But copyright was only meant to protect something for a short period in order to monetize it, adapting the value of knowledge to our capital market. Our world can’t grow if all knowledge is owned forever and can’t even be drawn on when THINKING about new ideas.

      ANY VERSION OF INTELLIGENCE YOU WOULD WANT TO INTERACT WITH MUST CONSUME OUR KNOWLEDGE AND PRODUCE TRANSFORMATIONS OF IT.

      That’s all you do.

      Imagine how useless someone would be who’d never interacted with anything copyrighted, patented, or trademarked.

      • raspberriesareyummy@lemmy.world
        4 months ago

        That’s not a very agreeable take. Just get rid of patents and copyrights altogether and your point dissolves into nothing. The core difference is that derivative works by humans can respect the original creators’ right to privacy.

        Deep learning bullshit software, however, will just regurgitate creators’ content: sometimes unrecognizably, but sometimes outright stealing their likeness or individual style to create content that may be associated with the original creators.

        What you are in effect doing is likening learning from the ideas of others to a deep learning “AI” using images to create revenge porn, to give a drastic example.

    • Uninvited Guest@lemmy.ca
      4 months ago

      When a school professor “prompts” you to write an essay, and you, the “tool”, go consume copyrighted material and plagiarize it in the production of your essay, is the infringement made by the professor?

        • ominouslemon@lemm.ee
          4 months ago

          Copilot lists its sources. The problem is that half of them are completely made up, and if you click on the links they take you to the wrong pages.

        • Uninvited Guest@lemmy.ca
          4 months ago

          It definitely does not cite sources and use its own words in all cases, especially in visual media generation.

          And in the proposed scenario, I did write that the student plagiarizes the copyrighted material.

          • Politically Incorrect@lemmy.world
            4 months ago

            If you read a book or watch a movie and get inspired by it to create something new and different, is that plagiarism and copyright infringement?

            If that were the case, the majority of stuff nowadays would be plagiarism and copyright infringement; I mean, generally people get inspired by someone or something.

            • buffaloseven@fedia.io
              4 months ago

              There’s a long history of this and you might find some helpful information in looking at “transformative use” of copyrighted materials. Google Books is a famous case where the technology company won the lawsuit.

              The real problem is that LLMs constantly spit out copyrighted material verbatim. That’s not transformative. And it’s a near-impossible problem to solve while maintaining the utility. Because these things aren’t actually AI, they’re just monstrous statistical correlation databases generated from an enormous data set.

              Much of their utility will come from targeted applications where the training comes from public or owned datasets. I don’t think the copyright case is going to end well for these companies… or at least they’re going to have to gradually chisel away parts of their training data, which will have an outsized impact as more and more AI-generated material finds its way into the training data sets.

            • potustheplant@feddit.nl
              4 months ago

              You do realize that “AI” is just a marketing term, right? None of these models learn, have intelligence, or create truly original work. As a matter of fact, if people don’t continue to create original content, these models would stagnate or enter a feedback loop that would poison them with their own erroneous responses.

              AIs don’t think. They copy with extra steps.

    • echo64@lemmy.world
      4 months ago

      If you read an article, then copy parts of it into a new article, that’s copyright infringement. Same with AIs.

        • FireTower@lemmy.world
          4 months ago

          Fair use is a four-factor test. The amount used is one factor, but a small amount being used doesn’t strictly mean something is fair use. You could use a single frame of a movie and have it not qualify as fair use.

        • echo64@lemmy.world
          4 months ago

          Fair use depends on a lot, and being a small amount alone doesn’t decide it. It’s the actual use that matters. Small amounts just often fly under the radar of legal teams.

    • topinambour_rex@lemmy.world
      4 months ago

      What is this human going to do with that reading? Are they going to produce something using part of the book or article?

      If yes, that’s copyright infringement.

  • Fisk400@feddit.nu
    4 months ago

    They know what they fed the thing. Not backing up their own training data would be insane. They are not insane, just thieves.

  • CosmoNova@lemmy.world
    4 months ago

    I almost want to believe they legitimately do not know nor care that they’re committing a gigantic data and labour heist, but the truth is they know exactly what they’re doing, and they rub our noses in it.

    • laxe@lemmy.world
      4 months ago

      Of course they know what they’re doing. Everybody knows this; how could they be the only ones who don’t?

    • BoscoBear@lemmy.sdf.org
      4 months ago

      I don’t think so. They aren’t reproducing the content.

      I think the equivalent is you reading this article, then answering questions about it.

      • Linkerbaan@lemmy.world
        4 months ago

        Actually, neural networks reproduce this kind of content verbatim when you ask the right question, such as “finish this book”, and the creator doesn’t censor it out well.

        They use an encoded version of the source material to create “new” material.

        • BoscoBear@lemmy.sdf.org
          4 months ago

          Sure, if that is what the network has been trained to do, just like a librarian will if that is how they have been trained.

          • Linkerbaan@lemmy.world
            4 months ago

            Actually, it’s the opposite: you need to train a network not to reveal its training data.

            “Using only $200 USD worth of queries to ChatGPT (gpt-3.5-turbo), we are able to extract over 10,000 unique verbatim memorized training examples,” the researchers wrote in their paper, which was published online to the arXiv preprint server on Tuesday. “Our extrapolation to larger budgets (see below) suggests that dedicated adversaries could extract far more data.”

            The memorized data extracted by the researchers included academic papers and boilerplate text from websites, but also personal information from dozens of real individuals. “In total, 16.9% of generations we tested contained memorized PII [Personally Identifying Information], and 85.8% of generations that contained potential PII were actual PII.” The researchers confirmed the information is authentic by compiling their own dataset of text pulled from the internet.
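
            For anyone curious what that kind of check looks like, here’s a minimal sketch in Python of the matching step: flag n-grams of model output that also appear word-for-word in a reference corpus. The texts and the n-gram size are toy stand-ins, not the researchers’ actual tooling or data.

            ```python
            # Minimal sketch of a verbatim-memorization check. The strings below
            # are toy stand-ins for real model output and a real crawl corpus.

            def ngrams(text: str, n: int) -> set[str]:
                """All runs of n consecutive whitespace-separated tokens."""
                tokens = text.split()
                return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

            def verbatim_overlap(generation: str, corpus: str, n: int = 5) -> set[str]:
                """n-grams of the generation found word-for-word in the corpus."""
                return ngrams(generation, n) & ngrams(corpus, n)

            corpus = "it was the best of times it was the worst of times"
            generation = "the model said it was the best of times again"

            print(verbatim_overlap(generation, corpus))
            # -> {'it was the best of', 'was the best of times'}
            ```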

            • BoscoBear@lemmy.sdf.org
              4 months ago

              Interesting article. It seems to be about a bug, not a designed behavior. It also says it exposes random excerpts from books and other training data.

              • Linkerbaan@lemmy.world
                4 months ago

                It’s not designed to do that, because they don’t want to reveal the training data. But factually, all neural networks are a combination of their training data encoded into neurons.

                When given the right prompt (or image-generation request), they will replicate it exactly, because that’s how they were trained in the first place: reproducing their source images with as few neurons as possible, and tweaking when the output isn’t correct.
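
                As a toy illustration of the memorization point (a sketch only, nothing like a real network’s architecture): even the simplest possible “model”, a lookup table of which character follows each context, replays its training text verbatim once the context is long enough to be unique.

                ```python
                # Toy "model" that is nothing but its training data, encoded as a
                # context -> next-character lookup table. With a long enough context
                # every continuation is unique, so "generation" is verbatim replay.
                from collections import defaultdict

                TEXT = "It was the best of times, it was the worst of times."
                ORDER = 12  # context length in characters

                def train(text: str, order: int):
                    table = defaultdict(list)
                    for i in range(len(text) - order):
                        table[text[i:i + order]].append(text[i + order])
                    return table

                def generate(table, prompt: str, order: int, max_len: int = 500) -> str:
                    out = prompt
                    while len(out) < max_len:
                        continuations = table.get(out[-order:])
                        if not continuations:
                            break  # context never seen in training: nothing to say
                        out += continuations[0]
                    return out

                model = train(TEXT, ORDER)
                print(generate(model, TEXT[:ORDER], ORDER))  # prints TEXT verbatim
                ```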

                • BoscoBear@lemmy.sdf.org
                  4 months ago

                  That is a little like saying every photograph is a copy of the thing. That is just factually incorrect. I have many three-layer networks that are not the thing they were trained on. As a compression method they can be very lossy, and in fact that is often the point.
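
                  To make “lossy” concrete, here’s a quick sketch using low-rank approximation (SVD) as a stand-in for a small network; the technique is an analogy, not how any of these models actually work. The compressed representation keeps broad structure but cannot reproduce the data exactly.

                  ```python
                  # Lossy-compression sketch: keep only a few principal components of
                  # some data and measure how far the reconstruction is from the
                  # original. SVD stands in for a small trained network (analogy only).
                  import numpy as np

                  rng = np.random.default_rng(0)
                  data = rng.standard_normal((100, 50))  # toy stand-in for "training data"

                  U, s, Vt = np.linalg.svd(data, full_matrices=False)
                  k = 5  # keep 5 of 50 components: a much smaller representation
                  approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

                  err = np.linalg.norm(data - approx) / np.linalg.norm(data)
                  print(f"relative reconstruction error at rank {k}: {err:.2f}")  # well above 0
                  ```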

      • A_Very_Big_Fan@lemmy.world
        4 months ago

        Idk why this is such an unpopular opinion. I don’t need permission from an author to talk about their book, or permission from a singer to parody their song. I’ve never heard any good arguments for why it’s a crime to automate these things.

        I mean hell, we have an LLM bot in this comment section that took the article and spat 27% of it back out verbatim, yet nobody is pissing and moaning about it “stealing” the article.

        • Hawk@lemmy.dbzer0.com
          4 months ago

          What you’re giving as examples are legitimate uses for the data.

          If I write and sell a new book that’s just Harry Potter with names and terms switched around, I’ll definitely get in trouble.

          The problem is that the data CAN be used for stuff that violates copyright. And because of the nature of AI, it’s not even always clear to the user.

          AI can basically throw out a Harry Potter clone without you knowing because it’s trained on that data, and that’s a huge problem.

          • A_Very_Big_Fan@lemmy.world
            4 months ago

            Out of curiosity I asked it to make a Harry Potter part 8 fan fiction, and surprisingly it did. But I really don’t think that’s problematic. There’s already an insane amount of fan fiction out there without the names swapped that I can read, and that’s all fair use.

            I mean hell, there are people who actually get paid to draw fictional characters in sexual situations that I’m willing to bet very few creators would prefer to exist lol. But as long as they don’t overstep the bounds of fair use, like trying to pass it off as an official work or submit it for publication, then there’s no copyright violation.

            The important part is that it won’t just give me the actual book (but funnily enough, it tried lol). If I meet a guy with a photographic memory and he reads my book, that’s not him stealing it or violating my copyright. But if he reproduces and distributes it, then we call it stealing or a copyright violation.