OpenAI's CTO Dodges Questions on Sora AI's Training Data Origins

Sunday, March 17, 2024 4:41 AM UTC

In a revealing Wall Street Journal interview, OpenAI CTO Mira Murati was tight-lipped about the specific data sources used to train Sora, the organization's advanced AI video generator. Amidst growing scrutiny over AI training practices, Murati's reluctance to detail the origins of the data highlights the ongoing debate surrounding copyright and ethical AI development.

OpenAI CTO Evades Detailed Queries on Sora's Training Data Amid Copyright Concerns

That scrutiny extends to OpenAI's Sora, the company's upcoming text-to-video generative AI, which has demonstrated the ability to create lifelike videos.

In a video interview with the Wall Street Journal, OpenAI CTO Mira Murati, who also served as CEO for two days when Sam Altman was temporarily removed, discussed the company's new technology. The interview was intended to showcase Sora's benefits and build hype for the upcoming technology. It did that, but Joanna Stern of the WSJ did more than throw softballs; she also asked some difficult questions.

In a three-minute segment, Stern questions Sora's training set. Before the interview, Stern had provided OpenAI with some new text descriptions to be used to create videos for the conversation.

"Every time I watch a Sora clip, I wonder what videos this AI model learned from," Stern says. "Did the model see any clips of Ferdinand to know what a bull in a china shop should look like? Was it a fan of Spongebob?"

While she asks these questions, clips from the animated film Ferdinand and the children's television show Spongebob appear side by side with Sora's work, making it difficult not to notice the similarities. The next question is, naturally, what data was used to train Sora.

"We used publicly available data and licensed data," Murati responds.

"So videos on YouTube?" Stern asked. "Videos from Facebook, Instagram? What about Shutterstock? I know you guys have a deal with them."

"I'm actually not sure about that. If they were publicly available, publicly available to use, there might be that data, but I'm not sure. I'm not confident about it," Murati said.

"I'm just not going to go into the details of the data that was used, but it was publicly available or licensed data."

Murati confirmed to Stern after the interview that the licensed data includes Shutterstock content, but her refusal to discuss the topic on camera is telling.

Ethical Quandaries: AI's Content Creation Sparks Copyright Controversy and Artist Concerns

PetaPixel reports that as impressive as generative AI is, the debate over how these companies create visual content, and the likelihood that it violates artists' copyrights, remains constant. There have been reports that the people behind AI image generators deliberately target specific artists in their training data under the guise of the work being "publicly available." Even when this is not the case, the ease with which photographers can recreate their own photos, and with which iconic images can be recreated, tells the story.

These AI systems have likely seen and been trained on those copyrighted images, which explains why they can so easily produce their own versions. However, speculation isn't necessary. Midjourney's founder admitted that its AI used a "hundred million" images as a training set without permission. OpenAI admitted that it is "impossible" to train AI without relying on copyrighted content.

That said, Murati is likely aware that using stolen content to train its AI is not something OpenAI wants to admit publicly, so she refuses to answer Stern's question. Her refusal, however, makes it easy to argue that these companies care little about human artists' rights, and it demonstrates how far they will go to further their interests, regardless of the cost.

Photo: Levart_Photographer/Unsplash
