Companies Used YouTube to Train AI

July 16, 2024 in Salesforce

Reported by siliconANGLE’s Duncan Riley.

A new report released today reveals that companies such as Anthropic PBC, Nvidia Corp., Apple Inc., and Salesforce Inc. have used subtitles from YouTube videos to train their AI services without obtaining permission. This raises significant ethical questions about the use of publicly available materials and facts without consent.

According to Proof News, these companies allegedly utilized subtitles from 173,536 YouTube videos sourced from over 48,000 channels to enhance their AI models. Rather than scraping the content themselves, Anthropic, Nvidia, Apple, and Salesforce reportedly used a dataset provided by EleutherAI, a nonprofit AI organization.

EleutherAI, founded in 2020, focuses on the interpretability and alignment of large AI models. The organization aims to democratize access to advanced AI technologies by developing and releasing open-source AI models like GPT-Neo and GPT-J. EleutherAI also advocates for open science norms in natural language processing, promoting transparency and ethical AI development.

The dataset in question, known as “YouTube Subtitles,” includes transcripts from educational and online learning channels, as well as several media outlets and YouTube personalities. Notable YouTubers whose transcripts appear in the dataset include MrBeast, Marques Brownlee, PewDiePie, and left-wing political commentator David Pakman.

Some creators whose content was used are outraged. Pakman, for example, argues that using his transcripts jeopardizes his livelihood and that of his staff. David Wiskus, CEO of streaming service Nebula, has even called the use of the data “theft.”

The data is publicly accessible; the controversy is over its use to train large language models. This situation echoes recent legal actions regarding the use of publicly available data to train AI models. For instance, Microsoft Corp. and OpenAI were sued in November 2023 over their use of nonfiction authors’ works for AI training. The class-action lawsuit, led by a New York Times reporter, claimed that OpenAI scraped the content of hundreds of thousands of nonfiction books to develop its AI models.

Additionally, in April The New York Times accused OpenAI, Google LLC, and Meta Platforms Inc. of skirting legal boundaries in their use of AI training data.

The legality of using such data for AI training remains a gray area that has yet to be extensively tested in court. Should a case arise, the key issue will likely be whether publicly stated facts, including utterances, can be copyrighted.

Relevant U.S. case law includes Feist Publications Inc. v. Rural Telephone Service Co., 499 U.S. 340 (1991) and International News Service v. Associated Press, 248 U.S. 215 (1918). In both cases, the U.S. Supreme Court ruled that facts cannot be copyrighted.