Two AI models pass benchmark Turing Test, blurring the line between human and machine


OpenAI’s GPT-4.5 and Meta’s Llama-3.1 models have passed the Turing Test, a benchmark proposed by Alan Turing in the 1950s to assess whether a machine can exhibit behaviour indistinguishable from a human’s. The test has long been held up as a tipping point for the maturity and sophistication of Artificial Intelligence (AI).

An artificial intelligence booth by Rittal Ltd. at the Hannover Messe 2025 trade fair in Hannover, Germany, on March 31, 2025. (Bloomberg)

Researchers Cameron R. Jones and Benjamin K. Bergen of the University of California San Diego found that GPT-4.5 performed so convincingly that judges identified it as human 73% of the time, significantly more often than they correctly identified actual human participants. Meta’s Llama-3.1-405B achieved a 56% success rate, essentially matching human performance (around 50%), while the baseline models ELIZA and GPT-4o were judged to be human in only 23% and 21% of games respectively.

“When prompted to adopt a human-like persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant,” Jones and Bergen stated in their study, which awaits peer review.

The experiment employed a three-party design in which participants engaged in simultaneous five-minute conversations with both a human and an AI system before determining which was which. This methodology, tested across two independent populations (undergraduate students and workers recruited via the Prolific platform), provides the first robust evidence that any system passes the original three-party Turing test.

But crucially, the AI systems only passed the test when given specific prompts to adopt a particular persona—that of a young, introverted person knowledgeable about internet culture. The same models without the persona prompt performed significantly worse, highlighting how critical prompt engineering is to creating convincing human-like interactions.

The researchers note that passing the Turing test doesn’t necessarily indicate true intelligence. “It is arguably the ease with which LLMs can be prompted to adapt their behaviour to different scenarios that makes them so flexible: and apparently so capable of passing as human,” they write in their paper, suggesting that new benchmarks testing reasoning and ethical alignment may better gauge AI progress.

“This study was different from earlier Turing test experiments because it used a more rigorous three-party setup. Is it entirely surprising that — despite how rigorously the test was designed — AI would eventually beat us at ‘sounding human’ when it has been trained on more human data than any one person could ever read or watch?” said Sinead Bovell, founder of the tech education company Waye.

Previous claims of passing the test include the Eugene Goostman chatbot in 2014, which convinced 33% of judges it was a 13-year-old Ukrainian boy—falling short of the 50% threshold now considered standard for passing.

The researchers included ELIZA, a rudimentary 1960s chatbot developed by Joseph Weizenbaum at MIT, as a control to verify that participants could identify obviously non-human responses. Its poor performance confirmed the test design was sufficiently sensitive.

OpenAI described GPT-4.5 as “the largest and best model for chat yet” during its February research preview. “It is the first model that feels like talking to a thoughtful person to me. I have had several moments where I’ve sat back in my chair and been astonished at getting actually good advice from an AI,” said OpenAI CEO Sam Altman at the time. Altman has not yet directly addressed the Turing Test results.

Analysis of participants’ strategies revealed that most interrogators engaged in small talk about daily activities or personal details (61% of games) and probed for social and emotional qualities (50%). However, the most effective strategies for identifying AI involved saying something strange to see how the witness would react or attempting to “jailbreak” the model.

“We’re not losing to artificial intelligence. We’re losing to artificial empathy,” said John Nosta, founder of the innovation think-tank NostaLab, in a social media post.

The implications extend beyond technical achievement. Critics highlight potential economic disruption through job displacement and social concerns about undermining human relationships. These developments align with recent advances in Agentic AI (AI that can display autonomous behaviour), where companies including Microsoft, Adobe, Zoom and Slack are developing AI agents for specific professional roles.

Susan Schneider, founding director of the Center for the Future Mind at Florida Atlantic University (FAU), warned in a social media post: “Too bad these AI chatbots aren’t properly aligned. Yet, I predict: they will keep increasing in capacities and it will be a nightmare—emergent properties, ‘deeper fakes’, chatbot cyberwars. Hardly the Kurzweilian dream.”

Her reference is to the futurist Ray Kurzweil, who has often spoken and written about the transformative power of AI.
