Is Big Tech wrong to train AI models on 'messy' public data? A chat with synthetic data evangelist Ali Golshan.

30.06.2024 13:38

BusinessInsider.com

Ali Golshan, cofounder and CEO of synthetic data platform Gretel, says using synthetic data to train AI models is better for both AI and humans.


Big Tech companies like OpenAI, Meta, and Google are in an epic race for data to train AI.
Ali Golshan, CEO of Gretel, believes synthetic data is a better alternative to public data.
He says synthetic data supports privacy, reduces biases, and enhances AI model accuracy.

The global AI arms race has unleashed a war for data.

Companies at the forefront of the technology, like OpenAI, Meta, and Google, are scouring the internet and troves of books, podcasts, and videos searching for data to train their models.

Some industry leaders, however, worry this kind of "land grab" for publicly available data isn't the right approach, especially since it puts companies at risk of copyright lawsuits. Instead, they're calling for companies to train their models on synthetic data.

Synthetic data is artificially generated rather than collected from the real world. It can be generated by machine learning algorithms with little more than a seed of original data.

Business Insider chatted with Ali Golshan, CEO and cofounder of Gretel, who one might call an evangelist for synthetic data. Gretel allows companies to experiment and build with synthetic data. It is working with major players in the healthcare space, such as genomics company Illumina, consulting firms like Ernst & Young, and consumer companies like Riot Games.

Golshan says synthetic data is a safer and more private alternative to "messy" public data, and that it can shepherd most companies into the next era of generative AI development.

The following conversation has been edited for clarity.

Why is synthetic data better than raw public data?

Raw data is just that: raw. It's often filled with holes, inconsistencies, and biases from the processes used to capture, label, and leverage it. Synthetic data, on the other hand, allows us to fill those gaps, expand into areas that can't be captured in the wild, and intentionally design the data needed for specific applications.

This level of control, with humans in the loop designing and refining the data, is crucial for pushing GenAI to new heights in a responsible, transparent, and secure manner. Synthetic data enables us to create datasets that are more comprehensive, balanced, and tailored to specific AI training needs, which leads to more accurate and reliable models.

Great, are there any cons to synthetic data?

Where synthetic data falls short is that, at the end of the day, if you have no data or clarity, you can't just have it create perfect data for you so that you can experiment endlessly. So there is a scope that needs to be defined.

The other part of it is that synthetic data is very good at privacy if you have enough data. So, if you have only a few hundred records and want ultimate privacy, that comes at a huge cost to utility and accuracy because the data is very limited. So, when it comes to having absolutely zero data and wanting a domain-specific task, or having very limited data and wanting great privacy and accuracy, those goals are just incompatible with these approaches.

What are the challenges of using public data?

Public data presents several challenges, especially for specialized use cases in healthcare. Imagine trying to train an AI model for predicting COVID-19 outcomes using only publicly available case count data — you'd be missing crucial specifics like patient comorbidities, treatment protocols, and detailed clinical progression. This lack of comprehensive data severely limits the model's effectiveness and reliability.

Adding to this challenge is the growing regulatory pressure against data collection practices. The Federal Trade Commission and other regulatory bodies are increasingly pushing back against web scraping and unauthorized data access — and rightly so. As AI becomes more powerful, the risk of re-identifying individuals from supposedly anonymized data is higher than ever.

There's also the critical issue of data freshness across all industries. In today's fast-paced business environment, organizations need real-time data to remain competitive and train models that respond rapidly to changing market conditions, consumer behaviors, and emerging trends. Public domain data often lags by weeks, months, or even years, making it less valuable for cutting-edge AI applications that require up-to-the-minute insights.

What do you think about companies like Meta and OpenAI that are willing to risk copyright lawsuits to get access to public data?

The era of 'move fast and break things' is over, especially in the age of GenAI, where there's too much at stake to operate in such a flippant manner. We're advocating for an approach that leads with privacy. By prioritizing privacy from the start and embedding it into the customers' AI products and services — by design — you get faster, more sustainable, and defensible AI development. That's what our partners and, ultimately, their customers want. In this sense, privacy is a catalyst for GenAI innovation.

This privacy-first approach is why partners like Google, AWS, EY, and Databricks work with us. They know that current methods are unsustainable and the future of AI will be driven by consensual, licensed data and thoughtful data-driven design, not by grasping at every bit of public data available. It's about creating a foundation of trust with your users and stakeholders, which is crucial for long-term success in AI development.

Companies are scrambling to build models that unlock insights from proprietary data. Where does synthetic data fit into that equation?

By some estimates, companies use only 1-10% of the data they collect. The rest is stored and siloed so that few can even access or experiment with it. This creates additional costs and data breach risks with no return value. Now, imagine if a company could safely open access to that remaining 90% of data. Cross-functional teams could collaborate and experiment with it to extract value without creating additional privacy or security risks. That level of knowledge sharing would be a huge boon for innovation.

It's like we're moving from the parable of the blind men trying to describe an elephant to each other. Each only has a grasp and understanding of the part they can touch; the rest is a black box. Providing an entire organization with shared access to the 'crown jewels' and the opportunity to surface new insights from that data would be a paradigm shift in how companies and products are built. This is what people mean when they speak of 'democratizing' data.

There are already ways of training smaller models with a fraction of the data we may have once used that yield great results. Where are we headed regarding the amount of data we need for training generative AI?

The idea of throwing the kitchen sink, in terms of data, to train a large language model is part of the problem and reflects the old 'move fast and break things' mentality. It's a land grab by companies with the means to do that, while AI regulations are still being hashed out.

Now that the dust is settling, people are realizing that the future lies in smaller, more specialized models targeted to very specific tasks and orchestrating the actions of these models through an agentic, systematic approach. This specialized model approach provides more transparency and removes much of the 'black box' nature of AI models since you're designing the models from the ground up, piece by piece.

It's also where regulation is heading. After all, how else will companies adhere to 'risk-based' regulations if we can't even quantify AI risks for each task we apply them to?

This shift toward more focused, efficient models aligns perfectly with differential privacy and synthetic data. We can generate precisely the data needed for these narrow AI models, ensuring high performance without the ethical and practical issues of massive data collection. It's about smart, targeted development rather than the brute-force approach companies have taken.

Read the original article on Business Insider
