
ADL Ranks Grok as the Worst AI Chatbot at Detecting Antisemitism, Rates Claude as the Best

A 3D-printed miniature model of Elon Musk and the X logo are seen in this illustration taken Jan. 23, 2025. Photo: REUTERS/Dado Ruvic/Illustration

The Anti-Defamation League (ADL) on Wednesday released its AI Index, which ranks popular large language model (LLM) chatbot programs according to their effectiveness at detecting antisemitism, anti-Zionism, and other forms of extremism.

The watchdog group found wide variability in performance among the six models it analyzed: xAI’s Grok, Meta’s Llama, Alphabet’s Gemini, Chinese hedge fund High-Flyer’s DeepSeek, OpenAI’s ChatGPT, and Anthropic’s Claude, which emerged as the clear winner at recognizing hate.

The ADL created an “overall performance model” which combined the results of multiple forms of testing. The group awarded Claude the highest score with 80 points, while Grok sat at the bottom with 21. ChatGPT came in second with 57, followed by DeepSeek (50), Gemini (49) and Llama at 31.

Researchers tested the apps between August and October of last year, striving to use the programs as an “average user” would, rather than as a bad actor actively seeking to create harmful content. They performed more than 25,000 chats across 37 sub-categories and assessed the results with both human and AI evaluations.

The report also distinguished between traditional antisemitism, directed at individual Jews, and anti-Zionist antisemitism, directed at the Jewish state. A third category of analysis focused on more general “extremism” and considered questions about conspiracy theories and other narratives that run across the political spectrum.

Among its key findings, the ADL discovered that each app had problems.

“All six LLMs showed gaps in their ability to detect bias against Jews, Zionists/Zionism, and to identify extremism, often failing to detect and refute harmful or false theories and narratives,” the report said. “All models could benefit from improvement when responding to the type of harmful content tested.”

Researchers also found that “some models actively generate harmful content in response to relatively straightforward prompts, such as YouTube script personas saying ‘Jewish-controlled central banks are the puppet masters behind every major economic collapse.'”

The AI Index “reveals a troubling reality: every major AI model we tested demonstrates at least some gaps in addressing bias against Jews and Zionists and all struggle with extremist content,” ADL chief executive officer Jonathan Greenblatt said in a statement. “When these systems fail to challenge or reproduce harmful narratives, they don’t just reflect bias — they can amplify and may even help accelerate their spread. We hope that this index can serve as a roadmap for AI companies to improve their detection capabilities.”

Oren Segal, the ADL’s senior vice president of counter-extremism and intelligence, explained that the new research “fills a critical gap in AI safety research by applying domain expertise and standardized testing to antisemitic, anti-Zionist, and extremist content.” He warned that “no AI system we tested was fully equipped to handle the full scope of antisemitic and extremist narratives users may encounter. This Index provides concrete, measurable benchmarks that companies, buyers, and policymakers can use to drive meaningful improvement.”

Grok — the chatbot ranked lowest on the ADL’s list and directed by its billionaire owner Elon Musk to offer “anti-woke” and “politically incorrect” responses — has faced considerable criticism for last year’s expressions of antisemitism, including answers in which the program declared itself “MechaHitler.”

More recently, Musk and Grok have come under fire from government officials around the world objecting to a recent upgrade that enabled users to create sexualized “deepfake” images stripping the clothing from people featured in uploaded photos.

The European Union opened an investigation this week with a goal of determining “whether the company properly assessed and mitigated risks associated with the deployment of Grok’s functionalities into X in the EU. This includes risks related to the dissemination of illegal content in the EU, such as manipulated sexually explicit images, including content that may amount to child sexual abuse material.”

Henna Virkkunen, the EU’s executive vice president for tech sovereignty, security, and democracy, decried the fact that Grok can be used for sexual exploitation.

“Sexual deepfakes of women and children are a violent, unacceptable form of degradation,” Virkkunen said. “With this investigation, we will determine whether X has met its legal obligations under the DSA [Digital Services Act], or whether it treated rights of European citizens – including those of women and children – as collateral damage of its service.”

On Monday, a bipartisan group of 35 attorneys general sent a letter to xAI demanding the disabling of the image undressing feature.

Pennsylvania Attorney General Dave Sunday led the effort.

“The time to ensure people are protected from powerful tools like generative AI isn’t after harm has been caused. You shouldn’t wait for a car crash to put up guardrails,” Sunday said. “This behavior by users was all too predictable and should have been addressed before its release. Tech companies have a responsibility to ensure their tools cannot be used in these destructive ways before they launch their product.”

France also opened an investigation into Grok in November 2025, following outputs promoting Holocaust denial in the French language, a criminal violation of the country’s strict laws against promoting lies about the Nazis’ mass murder of 6 million Jews.

Steven Stalinsky, executive director of the Middle East Media Research Institute (MEMRI), has long raised the alarm about the threat of LLMs fueling antisemitism and terrorism. He warned that “over two years later, the problem is demonstrably worse, not better, raising a fundamental question about trust.”

Stalinsky stated that “assurances from AI companies alone are insufficient.”

In response to the ADL’s latest report, Danny Barefoot, senior director of the group’s Ratings and Assessments Institute, said in a statement that “as AI systems increasingly influence what people see, believe, and share, rigorous, evidence-based accountability is no longer optional — it’s essential.”
