A new tool for copyright holders can show if their work is in AI training data

25.07.2024 20:09

Technology Review

Since the beginning of the generative AI boom, content creators have argued that their work has been scraped into AI models without their consent. But until now, it has been difficult to know whether specific text has actually been used in a training data set.

Now they have a new way to prove it: “copyright traps” developed by a team at Imperial College London, pieces of hidden text that allow writers and publishers to subtly mark their work in order to later detect whether it has been used in AI models or not. The idea is similar to traps that have been used by copyright holders throughout history—strategies like including fake locations on a map or fake words in a dictionary.

These AI copyright traps tap into one of the biggest fights in AI. A number of publishers and writers are in the middle of litigation against tech companies, claiming their intellectual property has been scraped into AI training data sets without their permission. The New York Times’ ongoing case against OpenAI is probably the most high-profile of these.

The code to generate and detect traps is currently available on GitHub, but the team also intends to build a tool that allows people to generate and insert copyright traps themselves.

“There is a complete lack of transparency in terms of which content is used to train models, and we think this is preventing finding the right balance [between AI companies and content creators],” says Yves-Alexandre de Montjoye, an associate professor of applied mathematics and computer science at Imperial College London, who led the research. It was presented at the International Conference on Machine Learning, a top AI conference being held in Vienna this week.

To create the traps, the team used a word generator to produce thousands of synthetic sentences. These sentences are long and full of gibberish, and could look something like this: “When in comes times of turmoil … whats on sale and more important when, is best, this list tells your who is opening on Thrs. at night with their regular sale times and other opening time from your neighbors. You still.”

The team generated 100 trap sentences and then randomly chose one to inject into a text many times, de Montjoye explains. The trap could be injected into text in multiple ways, for example as white text on a white background or embedded in the article’s source code. This sentence had to be repeated in the text 100 to 1,000 times.
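As a rough illustration of the injection step described above, here is a minimal Python sketch. It is not the team's published code: the function name `inject_trap`, the inline CSS used to hide the text, and the choice to append the copies at the end of the page are all assumptions made for this example.

```python
import html
import random


def inject_trap(article_html, trap_sentences, repetitions=100, seed=None):
    """Pick one trap sentence at random and append hidden copies of it.

    Hypothetical helper: hiding via white-on-white inline CSS and appending
    at the end of the page are illustrative assumptions, not the method
    used in the research.
    """
    rng = random.Random(seed)
    trap = rng.choice(trap_sentences)
    # White text on a white background, one of the options the article mentions.
    hidden = '<span style="color:#fff;background:#fff">{}</span>'.format(
        html.escape(trap)
    )
    marked = article_html + "\n" + "\n".join([hidden] * repetitions)
    # Return the chosen sentence too, since it is needed later for detection.
    return marked, trap
```

A publisher would keep the returned trap sentence private, along with the unused candidates, so that both can later be shown to a model for comparison.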

To detect the traps, they fed a large language model the 100 synthetic sentences they had generated, and looked at whether it flagged them as new or not. If the model had seen a trap sentence in its training data, it would indicate a lower “surprise” (also known as “perplexity”) score. But if the model was “surprised” about sentences, it meant that it was encountering them for the first time, and therefore they weren’t traps.
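The “surprise” comparison can be sketched in a few lines of Python. This assumes the per-token natural-log probabilities for each sentence have already been obtained from the model under test; how to get them depends on the model and is not covered here. The function names and the ratio threshold of 0.5 are hypothetical, and the statistical test actually used in the research is not described in this article.

```python
import math
from statistics import median


def perplexity(token_logprobs):
    """Perplexity: exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def looks_seen(trap_logprobs, control_logprobs, ratio_threshold=0.5):
    """Flag the injected trap if the model is much less surprised by it
    than by control sentences that were generated but never published.

    ratio_threshold is an illustrative assumption, not a value from the paper.
    """
    trap_ppl = perplexity(trap_logprobs)
    control_ppl = median([perplexity(lp) for lp in control_logprobs])
    return trap_ppl / control_ppl < ratio_threshold
```

Comparing against unpublished control sentences from the same generator matters because gibberish is surprising to any model; only the gap between the injected sentence and its unused siblings is informative.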

In the past, researchers have suggested exploiting the fact that language models memorize their training data to determine whether something has appeared in that data. The technique, called a “membership inference attack,” works effectively in large state-of-the-art models, which tend to memorize a lot of their data during training.

In contrast, smaller models, which are gaining popularity and can be run on mobile devices, memorize less and are thus less susceptible to membership inference attacks. That makes it harder to determine whether they were trained on a particular copyrighted document, says Gautam Kamath, an assistant computer science professor at the University of Waterloo, who was not part of the research.

Copyright traps are a way to do membership inference attacks even on smaller models. The team injected their traps into the training data set of CroissantLLM, a new bilingual French-English language model that was trained from scratch by a team of industry and academic researchers that the Imperial College London team partnered with. CroissantLLM has 1.3 billion parameters, a small fraction of the number in state-of-the-art models (GPT-4 reportedly has 1.76 trillion, for example).

The research shows it is indeed possible to introduce such traps into text data so as to significantly increase the efficacy of membership inference attacks, even for smaller models, says Kamath. But there’s still a lot to be done, he adds.

Repeating a 75-word phrase 1,000 times in a document is a big change to the original text, which could allow people training AI models to detect the trap and skip content containing it, or just delete it and train on the rest of the text, Kamath says. It also makes the original text hard to read.

This makes copyright traps impractical right now, says Sameer Singh, a professor of computer science at the University of California, Irvine, and a cofounder of the startup Spiffy AI. He was not part of the research. “A lot of companies do deduplication, [meaning] they clean up the data, and a bunch of this kind of stuff will probably get thrown out,” Singh says.
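To see why deduplication undercuts a heavily repeated trap, consider the simplest possible version of it. The sketch below is an assumption for illustration only: it does exact-match deduplication of lines after normalizing whitespace and case, whereas real data-cleaning pipelines typically use more elaborate near-duplicate detection. The function name `dedupe_lines` is hypothetical.

```python
def dedupe_lines(text):
    """Keep only the first occurrence of each non-empty line.

    Lines are compared after collapsing whitespace and lowercasing.
    This is a toy stand-in for a training-data cleaning step.
    """
    seen = set()
    kept = []
    for line in text.splitlines():
        key = " ".join(line.split()).lower()
        if key and key in seen:
            continue  # an identical line was already kept, so drop this copy
        seen.add(key)
        kept.append(line)
    return "\n".join(kept)
```

Run on a document containing the same trap sentence 1,000 times on separate lines, this keeps a single copy, which removes exactly the repetition the trap relies on to be memorized.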

One way to improve copyright traps, says Kamath, would be to find other ways to mark copyrighted content so that membership inference attacks work better on them, or to improve membership inference attacks themselves.

De Montjoye acknowledges that the traps are not foolproof. A motivated attacker who knows about a trap can remove it, he says.

“Whether they can remove all of them or not is an open question, and that’s likely to be a bit of a cat-and-mouse game,” he says. But even then, the more traps are applied, the harder it becomes to remove all of them without significant engineering resources.

“It’s important to keep in mind that copyright traps may only be a stopgap solution, or merely an inconvenience to model trainers,” says Kamath. “One can not release a piece of content containing a trap and have any assurance that it will be an effective trap forever.”
