Reddit, Google, and the Real Cost of the AI Data Rush

28.07.2024 12:00

Thecut.com

As of this past week, the only major search engine that still includes Reddit is Google. You won’t find recent Reddit links in Microsoft’s Bing, or get useful results from the platform on privacy-centric search engine DuckDuckGo. For most people, this change won’t matter much in the short term — most people, after all, use Google, which hovers at somewhere around 90 percent of global search market share, and they can still visit Reddit directly — but it’s a weird one. Reddit is a platform with deep connections to the web, beginning its life as a link aggregator and growing into an online community with more than 80 million daily active users (the more the better). Why, according to a report in 404 Media, is it suddenly battening its hatches?

As is often the case when tech companies are behaving strangely or unpredictably these days, the answer has something to do with AI. Earlier this year, Google entered into a deal with Reddit, reportedly valued at $60 million per year, to license the site’s data for training AI models. In recent months, Google watchers have also noticed an increase in Reddit posts showing up in search results, with user comments ranking highly in a wide range of queries. While these stories are related, they’re also somewhat independent: Google, like others in AI, wants to license data so it can build better models and avoid getting sued; Google users were already adding “Reddit” to search queries as a sort of search quality hack, so the company was in some sense following their lead.

It’s where the two stories overlap that things are starting to break down. Search engines gather relevant and up-to-date information by crawling the web with bots, indexing what they find, and ranking it for relevance to users’ searches. Websites have some control over whether and how this crawling takes place, and there are various reasons they might refuse crawling of part or all of their sites (a private individual might want to keep an old blog online but unsearchable, while a company like Facebook might want Google users to be able to find profiles but not search their contents). For decades, though, crawling was part of a straightforward and mutually beneficial arrangement. Search engines gathered lots of users by offering a useful service; websites allowed and even catered to search engine crawling in order to connect with those audiences.
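For readers curious what that control looks like in practice, one common mechanism is a robots.txt file, a plain-text list of rules that well-behaved crawlers check before fetching pages. The sketch below uses Python's standard urllib.robotparser module to evaluate a made-up policy. The crawler names (ExampleSearchBot, ExampleAIBot), the rules, and the example.com URLs are all assumptions for illustration and do not describe any real site's configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, invented for illustration only.
# It admits one search crawler, refuses one AI crawler, and
# keeps a /private/ area off limits to everyone else.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Allow: /

User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this crawler to fetch this URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

if __name__ == "__main__":
    for agent in ("ExampleSearchBot", "ExampleAIBot", "SomeOtherBot"):
        for url in ("https://example.com/r/thread", "https://example.com/private/notes"):
            print(agent, url, allowed(parser, agent, url))
```

Rules like these are requests rather than enforcement: a crawler that chooses to ignore them can still fetch the pages, which is part of why the arrangement has depended on trust.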

In the last few years, though, crawling has assumed an additional purpose. Those robots that are indexing your site and reading all your data aren’t just building a search index. They might be building an AI model, too. This, for many websites, is very much not part of the deal. As David Pierce writes at The Verge, the sudden pivot from search index to AI training means that “the basic social contract of the web is falling apart.” A mutually beneficial arrangement is being replaced by an extractive one, driven by frantic and unilateral actions of startups and tech giants alike.

At first, the consequences of this breakdown were relatively contained, with large websites and platforms — owned by big companies like Facebook and Amazon — explicitly blocking crawlers from firms like OpenAI. This clarity didn’t last long. Google is all-in on AI. Bing is owned by Microsoft, which is OpenAI’s largest investor and partner. Suddenly, all the search companies were also AI companies, and there were new crawlers in the mix. All crawling was AI crawling — to assume otherwise would be naive. The sudden harvest was apparent to anyone paying attention to their traffic stats. The bots were scraping everything they could.

In response to 404 Media’s reporting, critics have made the case that Google — a company that otherwise seemed spooked by past and potential antitrust enforcement — is nonetheless buying an unfair advantage for a product that’s already nearly a monopoly, and they have a point: Without Reddit, one of the largest repositories of authentic human text on the internet, smaller search engines can’t compete.

But the story isn’t complete without Reddit, which is the company actually doing the blocking. (Microsoft, for its part, has confirmed its crawlers have been prohibited.) As a Reddit spokesperson told The Verge:

This is not at all related to our recent partnership with Google. We have been in discussions with multiple search engines. We have been unable to reach agreements with all of them, since some are unable or unwilling to make enforceable promises regarding their use of Reddit content, including their use for AI.

This is practical behavior by the leadership of Reddit, a public company with a duty to its shareholders, but it is also plainly bad for the general public. In addition to reducing access for users who don’t want to use Google, it further subordinates Reddit to Google’s specific search incentives, which are changing fast anyway, in part because of AI. Already, spammers are polluting popular threads on Reddit, which is suddenly getting enormous amounts of traffic from Google, in hopes of getting more visibility in search results.

Google, like Reddit, owes its existence and success to the principles and practices of the open web, but exclusive arrangements like these mark the end of that long and incredibly fruitful era. They’re also a sign of things to come. The web was already in rough shape, reduced over the last 15 years by the rise of walled-off platforms, battered by advertising consolidation, and polluted by a glut of content from the AI products that used it for training. The rise of AI scraping threatens to finish the job, collapsing a flawed but enormously successful, decades-long experiment in open networking and human communication to a set of antagonistic contracts between warring tech firms.
