In a revealing Wall Street Journal interview, OpenAI CTO Mira Murati was tight-lipped about the specific data sources used to train Sora, the organization's advanced AI video generator. Amidst growing scrutiny over AI training practices, Murati's reluctance to detail the origins of the data highlights the ongoing debate surrounding copyright and ethical AI development.
OpenAI CTO Evades Detailed Queries on Sora's Training Data Amid Copyright Concerns
OpenAI has been just as guarded about Sora, the company's upcoming text-to-video generative AI, which has demonstrated the ability to create lifelike videos.
In a video interview with the Wall Street Journal, Mira Murati, OpenAI's current CTO and briefly its CEO (she held the role for two days when Sam Altman was temporarily removed), discussed the company's new technology. The interview was intended to showcase Sora's benefits and build hype for the upcoming technology. It does that, but Joanna Stern of the WSJ did more than throw softballs; she also asked some difficult questions.
In a three-minute segment, Stern questions Sora's training set. Before the interview, Stern provided OpenAI with some new text descriptions that would be used to create videos for their interview.
"Every time I watch a Sora clip, I wonder what videos this AI model learned from," Stern says. "Did the model see any clips of Ferdinand to know what a bull in a china shop should look like? Was it a fan of Spongebob?"
While she asks these questions, clips from the animated film Ferdinand and the children's television show Spongebob appear side by side with Sora's work, making it difficult not to notice the similarities. The next question was, naturally, what data was used to train Sora?
"We used publicly available data and licensed data," Murati responds.
"So videos on YouTube?" Stern asked. "Videos from Facebook, Instagram? What about Shutterstock? I know you guys have a deal with them."
"I'm actually not sure about that. If they were publicly available, publicly available to use, there might be that data, but I'm not sure. I'm not confident about it," Murati said.
"I'm just not going to go into the details of the data that was used, but it was publicly available or licensed data."
Murati confirmed to Stern after the interview that the licensed data includes Shutterstock content, but her refusal to discuss the topic on camera is telling.
Ethical Quandaries: AI's Content Creation Sparks Copyright Controversy and Artist Concerns
PetaPixel reports that as impressive as generative AI is, the debate over how these companies create visual content, and the likelihood that doing so violates artists' copyrights, remains constant. There have been reports that the people behind AI image generators deliberately target specific artists in their training data under the guise of the work being "publicly available." Even when this is not the case, the ease with which photographers can recreate their own photos, and the fact that iconic images are just as simple to recreate, tells the story.
These AI systems have likely seen and been trained on those copyrighted images, which explains why they can so easily produce their own versions. However, speculation isn't necessary. Midjourney's founder admitted that its AI used a "hundred million" images as a training set without permission. OpenAI admitted that it is "impossible" to train AI without relying on copyrighted content.
That said, Murati is likely aware that using stolen content to train its AI is not something OpenAI wants to admit openly, so she declines to answer Stern's question. Her evasion, however, makes it easy to argue that these companies care little about human artists' rights, and it demonstrates how far they will go to further their interests, regardless of the cost.
Photo: Levart_Photographer/Unsplash





