Feature

OpenAI’s ChatGPT Can Now Talk and See

By Chris Ehrlich
OpenAI is rolling out new voice and image capabilities to its ChatGPT. What does that mean for you and your workforce? Experts weigh in.

OpenAI unveiled a multimodal interface for its conversational artificial intelligence (AI) chatbot ChatGPT on Sept. 25. The release, announced on the company’s blog, will allow users to input prompts with images and their voice — to essentially “talk” with the chatbot. 

The company started rolling out the multimodal interface to its Plus and Enterprise users over the two weeks following the announcement and will extend it to other users “soon after.” Image inputs will be available on the web and mobile, and voice inputs will be mobile-only.

Reworked spoke with AI and business leaders who believe this new interface offers use cases that can improve not only the effectiveness of prompts and outputs but also worker efficiency. Here’s what they had to say.

Enhanced Accuracy and Efficiency

The new interface for ChatGPT is expected to have considerable positive impacts on workers and organizations, providing greater accuracy in prompts on one hand and enhanced efficiency in using the tool on the other.

For instance, a user now has the ability to input an image of a graph and ask the chatbot to analyze it, thus yielding valuable insights from the technology without the need for an expert in-house. 

Similarly, on mobile, users will have the ability to upload an image and focus on an area of the image with the app’s drawing tool to query the bot on that specific area, adding to the precision they seek and doing away with what Iliya Rybchin, a partner at Elixirr Consulting, calls “translation.”

Rybchin said that by integrating these types of inputs to a generative AI tool, such as ChatGPT, the need for translation goes away because instead of writing a long explanation of an image to be able to query the bot on it, a user can simply upload the image, which should improve efficiency, accuracy and quality.
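The article doesn’t include code, but the image-prompt workflow Rybchin describes can be sketched against OpenAI’s Chat Completions API, where a user message may contain both text and `image_url` content parts. This is a minimal illustration only: the helper name `build_image_prompt` and the model name shown in the comment are assumptions, not anything specified in the article.

```python
import base64


def build_image_prompt(question: str, image_bytes: bytes, mime: str = "image/png") -> list:
    """Pair a text question with an inline base64-encoded image in the
    `messages` format OpenAI's Chat Completions API accepts for vision input."""
    data_url = f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }
    ]


# Hypothetical usage: the payload would then be sent with the official SDK, e.g.
#   client.chat.completions.create(model="gpt-4-vision-preview",
#                                  messages=build_image_prompt(question, img))
messages = build_image_prompt("What trend does this revenue graph show?", b"fake-image-bytes")
```

The point of the sketch is the one Rybchin makes: the image itself travels with the question, so no prose “translation” of the graph is needed.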

Employees can also more seamlessly integrate a multimodal chatbot interface into their regular workflow, Rybchin said. For instance, a repair technician in the field can use the voice input for a more hands-free experience to have a discussion on the specifications or diagnosis, talking with the multimodal chatbot as an AI master technician.

Ranjitha Kumar, chief scientist at UserTesting, agreed that a multimodal ChatGPT can make the human-LLM interaction more efficient for employees. 

“A picture is worth a thousand words, so it’s often easier for a user to provide an image in a prompt to describe a problem,” Kumar said. “Similarly, talking out loud when performing a task can be more efficient than typing out text.”

Related Article: The Impact of AI on the Future of Work: Embracing the Power of Collaboration

Boosting Creativity and Collaboration

The voice capability is likely to draw strong interest from creatives by allowing them to have a “back-and-forth” conversation with the chatbot. 

To understand a user's voice prompt, the chatbot transcribes the speech into text with Whisper, OpenAI’s open-source speech recognition system. Mobile users can opt into voice conversations in the app’s settings and choose one of five voices for the chatbot: Breeze, Cove, Ember, Juniper and Sky. Each voice was created with professional voice actors and OpenAI’s text-to-speech model, which can generate human-like audio from text and a few seconds of sample speech.

The OpenAI blog said the company believes the ability of its text-to-speech model to create “realistic synthetic voices” can enable many creative and accessibility-focused applications. 

Raghu Ravinutala, co-founder and CEO of the conversational AI platform Yellow AI, said ChatGPT’s ability to understand voice gives it a more natural and intuitive user experience, “bridging the gap between humans and machines in the digital workplace.”

He agreed that by integrating visuals and voice into daily workflows, a multimodal conversational AI chatbot can encourage deeper collaboration among employees and help them seamlessly share and refine ideas.

Employees can be empowered, he said, to communicate ideas effectively through various media, leading to swifter problem solving, enhanced decision-making and streamlined information exchange. During onboarding and training, for instance, organizations can employ a multimodal chatbot to expedite the learning curve by accommodating several learning styles through text, images and speech, he said.

Olga Beregovaya, VP of AI and machine translation at the AI translation platform Smartling, said that with a multimodal conversational AI chatbot, a company can automate the generation of e-learning materials based on prompts, including in multiple languages, and make them accessible across its global workforce, contributing to productivity and a sense of belonging.

However, Beregovaya noted that companies need to remember that text data “still prevails in the training data sets, so the text output can be more accurate than a visual output.”

Related Article: How to Train AI on Your Company’s Data

Some Limitations and Privacy Concerns

Still, there are risks.

To test the vision-based large language model (LLM) and align it for responsible usage, OpenAI worked with red team testers to assess risk in different areas — such as extremism and science — as well as alpha testers.

But despite these efforts, the company has chosen to limit ChatGPT’s ability to “analyze and make direct statements about people” based on images since the chatbot is “not always accurate, and these systems should respect individuals’ privacy.”

OpenAI notes that the vision-based model presents other challenges as well, such as hallucinations about people and users relying on its interpretation of images in high-stakes fields. “We are transparent about the model's limitations and discourage higher risk use cases without proper verification,” the OpenAI blog read.

Furthermore, the realistic nature of the voice capabilities raises new risks. OpenAI warned in its blog post that malicious actors could use the technology to impersonate public figures or commit fraud. As such, the company is limiting the technology to the specific use case of voice chat.

Related Article: Are GenAI Copyright Protections Enough to Quell IP Concerns?

The Future of a Multimodal AI Chatbot

Rybchin said that when employees use a multimodal chatbot interface, companies can have greater confidence that they’ll be more efficient, access more data and make better decisions.

Ravinutala said ChatGPT's multimodal capabilities enhance its utility and point toward a “promising future” where it understands the world around it, extending beyond the “confines of the online data it has been trained on.”

A multimodal chatbot, he said, also enables more accessibility and inclusivity for employees with certain disabilities. In fact, to develop this interface, OpenAI worked with Be My Eyes — a free mobile app for people who are blind and people with low vision — to understand accessibility limitations and helpful uses.

Mark McNasby, co-founder and CEO of the conversational AI platform Ivy.ai, said OpenAI’s image and voice technologies in ChatGPT are “just one more step” in the company’s goal of creating artificial general intelligence. 

He agreed that the multimodal interface makes ChatGPT more accessible to certain disabled communities, and voice makes the mobile app a “much more compelling experience — akin to Siri and Albert Einstein having a baby,” he said.

But most significantly, he said, ChatGPT’s product evolution is a strategic move by OpenAI to collect multimedia data from certain user plans for the eventual retraining of its AI models.

“With massive amounts of new data flowing into ChatGPT, they will be able to deliver more incremental improvements on the AI models and the associated APIs," McNasby said.

About the Author
Chris Ehrlich

Chris Ehrlich is the former editor in chief and a co-founder of VKTR. He's an award-winning journalist with over 20 years in content, covering AI, business and B2B technologies. His versatile reporting has appeared in over 20 media outlets. He's an author and holds a B.A. in English and political science from Denison University.

Main image: Caleb Mays | unsplash