Announced by siliconANGLE’s Duncan Riley. Companies Used YouTube to Train AI.

A new report released today reveals that companies such as Anthropic PBC, Nvidia Corp., Apple Inc., and Salesforce Inc. have used subtitles from YouTube videos to train their AI services without obtaining permission. This raises significant ethical questions about the use of publicly available materials and facts without consent.

According to Proof News, these companies allegedly utilized subtitles from 173,536 YouTube videos sourced from over 48,000 channels to enhance their AI models. Rather than scraping the content themselves, Anthropic, Nvidia, Apple, and Salesforce reportedly used a dataset provided by EleutherAI, a nonprofit AI organization.

EleutherAI, founded in 2020, focuses on the interpretability and alignment of large AI models. The organization aims to democratize access to advanced AI technologies by developing and releasing open-source AI models like GPT-Neo and GPT-J. EleutherAI also advocates for open science norms in natural language processing, promoting transparency and ethical AI development.

The dataset in question, known as “YouTube Subtitles,” includes transcripts from educational and online learning channels, as well as several media outlets and YouTube personalities. Notable YouTubers whose transcripts are included in the dataset are Mr. Beast, Marques Brownlee, PewDiePie, and left-wing political commentator David Pakman.

Some creators whose content was used are outraged. Pakman, for example, argues that using his transcripts jeopardizes his livelihood and that of his staff. David Wiskus, CEO of streaming service Nebula, has even called the use of the data “theft.”

Despite the data being publicly accessible, the controversy revolves around the fact that large language models are utilizing it. This situation echoes recent legal actions regarding the use of publicly available data to train AI models. For instance, Microsoft Corp. and OpenAI were sued in November over their use of nonfiction authors’ works for AI training. The class-action lawsuit, led by a New York Times reporter, claimed that OpenAI scraped the content of hundreds of thousands of nonfiction books to develop their AI models.

Additionally, The New York Times accused OpenAI, Google LLC, and Meta Holdings Inc. in April of skirting legal boundaries in their use of AI training data.

While the legality of using AI training data remains a gray area, it has yet to be extensively tested in court. Should a case arise, the key issue will likely be whether publicly stated facts, including utterances, can be copyrighted.

Relevant U.S. case law includes Feist Publications Inc. v. Rural Telephone Service Co., 499 U.S. 340 (1991) and International News Service v. Associated Press (1918). In both cases, the U.S. Supreme Court ruled that facts cannot be copyrighted.

Related Posts
Who is Salesforce?
Salesforce

Who is Salesforce? Here is their story in their own words. From our inception, we've proudly embraced the identity of Read more

Salesforce Marketing Cloud Transactional Emails
Salesforce Marketing Cloud

Salesforce Marketing Cloud Transactional Emails are immediate, automated, non-promotional messages crucial to business operations and customer satisfaction, such as order Read more

Salesforce Unites Einstein Analytics with Financial CRM
Financial Services Sector

Salesforce has unveiled a comprehensive analytics solution tailored for wealth managers, home office professionals, and retail bankers, merging its Financial Read more

AI-Driven Propensity Scores
AI-driven propensity scores

AI plays a crucial role in propensity score estimation as it can discern underlying patterns between treatments and confounding variables Read more