Scarlett Johansson v. GPT-4o: A New Benchmark in the Development of AI

Image: OpenAI


May 23, 2024 - By David Reid


Last week brought the launch of GPT-4o, OpenAI’s “newest flagship model that provides GPT-4-level intelligence.” The launch made headlines, in part, because despite having previously declined OpenAI’s request to sample her voice, actress Scarlett Johansson said she was “shocked” and “angered” when she heard the new GPT-4o speak. One of the five voices used by GPT-4o, called Sky, sounded uncannily like the actress in her role as the AI Samantha in the 2013 film Her – about a man who falls in love with a virtual assistant.

Adding to the discussion, OpenAI founder and CEO Sam Altman appeared to play up the comparison between Sky and Samantha/Johansson, tweeting “her” on the launch day of GPT-4o. OpenAI later posted on X that it was “working on pausing the use of Sky” and on May 19 created a web page explaining that a different actress had been used. The company also expanded on how the voices were chosen.

“When I heard the released demo, I was shocked, angered and in disbelief that Mr. Altman would pursue a voice that sounded so eerily similar to mine that my closest friends and news outlets could not tell the difference … In a time when we are all grappling with deepfakes and the protection of our own likeness, our own work, our own identities, I believe these are questions that deserve absolute clarity,” Johansson wrote in a statement released this week. The actress is said to be considering legal action against OpenAI, presumably on right of publicity grounds.

The fact that the film Her was almost immediately referenced when GPT-4o was launched has helped raise awareness of the technology among the general public and, perhaps, made its capabilities seem less scary. This is fortunate for OpenAI because rumors of a partnership with Apple have ignited privacy fears, with iOS 18 coming out next month. Similarly, OpenAI has partnered with Microsoft on its new generation of AI-powered Windows machines, branded Copilot+ PC.

Unlike other large language models (“LLMs”), GPT-4o – in which the “o” is short for “omni” – has been built from the ground up to understand not only text but also vision and sound in a unified way. This is true multi-modality, going far beyond the capabilities of “traditional” LLMs. It can recognize nuances in speech, such as emotion, breathing, ambient noise, and birdsong, and it can integrate this with what it sees.

GPT-4o is a unified multi-modal model (meaning it can handle photos and text), is quick – responding at the speed of normal human speech (an average of 320 milliseconds) – and can be interrupted. The result is unnervingly natural, altering tone and emotional intensity appropriately. It can even sing. Some have even complained about how “flirty” GPT-4o is. No wonder some actors are worried.

It genuinely is a new way to interact with AI. It represents a subtle shift in our relationship with technology, providing a fundamentally new type of “natural” interface sometimes referred to as EAI, or empathetic AI. The speed of this advance has unnerved many government organizations and police forces. It’s still unclear how best to deal with this technology if it is weaponized by rogue states or criminals. With audio deepfakes on the rise, it is becoming increasingly difficult to detect what is, and is not, real. Even friends of Johansson thought it was her.

In a year when elections are due to be held involving more than 4 billion potential voters, and when fraud based around targeted deepfake audio is on the rise, the dangers of weaponized AI should not be underestimated. As Aristotle discovered, persuasive capability often lies not in what you say but in how you say it. We all suffer from unconscious bias, and an interesting report from researchers in the United Kingdom about accent bias highlights this. Some accents are perceived as more believable, authoritative, or even trustworthy than others. For this precise reason, people working in call centers are now using AI to “westernize” their voices. In GPT-4o’s case, how it says things may be just as important as what it says.


David Reid is a Professor of AI and Spatial Computing at Liverpool Hope University. (This article was initially published by The Conversation.)
