Meta Secretly Trained Its AI on a Notorious Piracy Database, Newly Unredacted Court Docs Reveal

“Meta treats the so-called 'public availability' of shadow datasets as a get-out-of-jail-free card, despite internal Meta records showing every relevant decision-maker at Meta, up to and including its CEO, Mark Zuckerberg, knew that LibGen was 'a dataset that we knew was pirated,'” the plaintiffs allege in this motion. (Originally filed in late 2024, the motion was a request to file a third amended complaint.)

In addition to the plaintiffs' briefs, another filing was unredacted in response to Chhabria's order: Meta's opposition to the motion to file an amended complaint. It argues that the authors' attempts to add claims to the case are an “eleventh-hour gamble based on a false and inflammatory premise” and denies that Meta waited to disclose important discovery information. Instead, Meta contends that it first disclosed to the plaintiffs that it used a LibGen dataset in July 2024. (Because most discovery materials remain confidential, it's difficult for WIRED to confirm that claim.)

Meta's argument hinges on its claim that the plaintiffs already knew about its use of LibGen and should not be given additional time to file a third amended complaint when they had ample time to do so before discovery ended in December 2024. “Plaintiffs were aware of Meta's download and use of LibGen and other alleged 'shadow libraries' since mid-July 2024,” the tech giant's lawyers argue.

In November 2023, Chhabria granted Meta's motion to dismiss some of the lawsuit's claims, including its claim that Meta's alleged use of the authors' work to train AI violated the Digital Millennium Copyright Act (DMCA), a US law introduced in 1998 to prevent people from selling or duplicating copyrighted works on the internet. At that time, the judge agreed with Meta's assertion that the plaintiffs did not provide sufficient evidence to prove that the company removed so-called “copyright management information,” such as the author's name and the title of the work.

The unredacted documents argue that the plaintiffs should be allowed to amend their complaint, saying the information Meta disclosed is evidence that the DMCA claim is warranted. They also say the discovery process unearthed reasons to add new allegations. “Meta, through a corporate representative who testified on November 20, 2024, has now admitted under oath to uploading (aka 'seeding') pirated files containing Plaintiffs' works to 'torrent' sites,” the motion says. (Seeding is when torrented files are shared with other peers after they've been downloaded.)

“This streaming activity makes Meta itself a distributor of the same pirated copyrighted material that it also downloads for use in its commercially available AI models,” one of the newly unredacted documents says, alleging, in other words, that Meta not only used copyrighted material without permission but also disseminated it.

LibGen, an archive of books uploaded to the internet that originated in Russia in 2008, is one of the largest and most controversial “shadow libraries” in the world. In 2015, a judge in New York ordered a preliminary injunction against the site, a measure theoretically designed to temporarily shut down the archive, but its anonymous administrators simply changed its domain. In September 2024, a different judge in New York ordered LibGen to pay $30 million to rights holders for infringing their copyrights, despite not knowing who actually runs the piracy hub.

Meta's redaction problems in this case are not over yet. In his order, Chhabria warned the tech giant against any sweeping redaction requests in the future: “If Meta resubmits an unreasonably broad sealing request, all materials will be unsealed,” he wrote.
