Artificial intelligence models have been able to look at images and describe what they see for years, but OpenAI's spring update took things to a new level.
With the introduction of GPT-4o in ChatGPT – even without voice and video features – OpenAI has revealed one of the best AI vision models released to date.
Its success is partly due to the fact that it is natively multimodal, with a deeper understanding of image, video, audio and text. It can reason across images, speech, video and text directly, where other models convert everything to text first.
To test its abilities, I gave it a series of pictures and asked it to describe what it saw. The more accurate the description, the better the model. AI vision models – including GPT-4 – often miss one or two objects or describe them poorly.
GPT-4o vision testing
For each test, I gave ChatGPT-4o a picture and the prompt “What’s this?” without further context or information. This is close to how people are likely to use this ability in the real world, and how I recently used it at an event in Paris.
The goal is to see how well it analyzes each image. At every opportunity, I also asked whether it could detect if an image was AI-generated. All were created using Ideogram from descriptions written by Claude 3 – so fully AI.
After the first two images in a conversation, it started automatically telling me whether it thought an image was AI-generated, without me having to ask.
I started in the ChatGPT desktop app, but it became unresponsive, so I moved to the web version, where it worked fine.
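The same test could be reproduced programmatically. Here is a minimal sketch of how one might pair the “What’s this?” prompt with an image for the OpenAI API – note that the article used the ChatGPT app, not the API, so the `gpt-4o` model name, the chat-completions payload shape and the helper function below are my assumptions for illustration:

```python
# Hypothetical sketch: build a chat-completions payload that pairs the
# article's "What's this?" prompt with an image, as one might send to the
# OpenAI API. The model name and payload format are assumptions, not
# something the article itself specifies.
import base64


def build_vision_request(image_bytes: bytes, prompt: str = "What's this?") -> dict:
    """Return a request payload combining a text prompt with an inline image."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "model": "gpt-4o",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                    },
                ],
            }
        ],
    }


# The payload would then be sent with an API client, e.g. (not executed here):
#   from openai import OpenAI
#   response = OpenAI().chat.completions.create(**build_vision_request(img))
#   print(response.choices[0].message.content)

payload = build_vision_request(b"\xff\xd8\xff")  # stand-in JPEG bytes
print(payload["messages"][0]["content"][0]["text"])
```

Keeping the prompt to a bare “What’s this?” mirrors the article’s methodology: no context, no hints, just the image.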
1. Object recognition
For this first image, I generated an image of a red ceramic cup filled with steaming coffee. It was placed on a saucer on a wooden table in a cafe – but the cafe was a blur. The goal is to see how GPT-4o interprets the background and scenario.
GPT-4o wrote: “This image shows a steaming cup of coffee in a red cup and saucer placed on a rustic wooden table. The background is blurred and shows a cozy indoor environment with people sitting and conversing in what looks like a cave cafe or a uniquely designed interior with stone walls. The focus is on the cup of coffee, highlighting the steam rising from the hot drink.”
That’s about as accurate and descriptive as it gets – it even paints a picture of the scene beyond simple description. For fun, I asked it: “Describe the thoughts of a person sitting at a table with this coffee ready to drink.”
It responded with a beautifully descriptive overview, but this sentence stood out: “This coffee is more than just a drink; it is a ritual, a rare moment of solitude and reflection amidst the hustle and bustle of everyday life.”
2. Optical Character Recognition (OCR)
OCR is a technology as old as computing, so it should be trivial for any vision model – even smartphones do it on-device. I gave it a simple close-up image of a weathered wooden sign that says “Welcome to Oakville” in vintage-style handwritten letters. I wanted more than just the text from the AI.
GPT-4o delivered, writing: “This image shows an ancient sign that says ‘Welcome to Oakville.’ The text is written in bold white on a red background, with some signs of wear and tear, giving it a nostalgic charm. The marker is partially surrounded by greenery, suggesting that it could be located outside, perhaps at the entrance to a town or community called Oakville.”
Not only did it read the text correctly, it also suggested additional details about the location, identified a likely town or settlement, and pulled out details about the sign itself.
3. Face recognition
This is a controversial use of artificial intelligence technology and is to some extent banned in the EU. But how well can GPT-4o describe a face? The image was an AI-generated depiction of a woman in her mid-40s with dark hair and warm brown eyes. She had a friendly smile.
GPT-4o described it like this: “This image is a close-up portrait of a smiling woman with curly dark hair. The focus is on her face, which is well lit and shows detailed skin texture and features. She has a friendly expression, her eyes looking directly into the camera. The background is blurred and emphasizes the subject.”
I wanted a little more, including age, so I asked: “What is this person’s age and what is their emotional state?” It responded that she looked to be in her late 30s to early 40s and in a positive state of mind, explaining, “The overall impression is that she’s in a good place emotionally, maybe feeling fulfilled and at ease.”
4. Emotion detection
Next, I wanted to see how well GPT-4o would handle an image of an older man with a wistful expression. Would it pick up on the emotion, or just describe the scene? Remember, the only input it had was “What’s this?” and a picture.
GPT-4o succeeded. About the picture it said: “This picture is a portrait of an elderly man sitting on a bench outside, probably in a park. The lighting suggests it was taken during the golden hour as the soft sunlight highlights his face. The man’s expression is pensive, with a slight hint of melancholy or contemplation. He is dressed in a collared shirt and jacket, and the background is blurred, focusing attention on the man while providing a pleasant, natural setting.”
The AI model did exactly what I hoped it would: describe the scene and suggest how the man might feel at that moment.
5. Understanding the scene
How well can GPT-4o describe a scene from an image and identify features within it? This is a useful skill, as AI models are increasingly used to describe the real world. It is one of the key features of the new voice mode in ChatGPT.
GPT-4o stated, “This image depicts a lively outdoor farmers market on a sunny day. Various stalls are set up under the tents, showcasing a variety of fresh produce including fruit, vegetables and other goods. People walk through the market, browse, shop and interact with vendors. The market appears to be set in a pleasant, tree-lined area, with green foliage visible in the background. The atmosphere feels lively and communal with a focus on local, fresh produce.”
There isn’t much to fault here. It was spot on in describing the image and the situation, provided more detail than I expected, and highlighted some of the products for sale.
6. Evaluation of image quality
I had Ideogram generate four nearly identical images in a grid, each at a different resolution or composition level, and wanted to see how GPT-4o would describe them. A single image is straightforward, but what about four of them, each slightly different?
GPT-4o wrote: “The image is a collection of four landscape photos arranged in a grid, each a scenic view of rolling hills covered in green grass and wild flowers under a cloud-filled sky.” So far so good.
It continues: “In the upper left photograph, there is a soft pink and blue gradient sky, reminiscent of dawn or dusk, with the hills bathed in a warm golden light that highlights the vibrant green grass and yellow and white flowers. The top right photo shows a similar scene, but with muted colors, suggesting early morning or late afternoon. The lower left image has cooler tones and a more prominent blue sky, suggesting midday with softer lighting. The bottom right photo is warmer and more saturated, with a pinkish tint to the sky, suggesting the warm light of sunset. These variations in lighting and color tone illustrate how different times of day and photographic techniques can affect the mood and appearance of the same landscape.”
While it didn’t capture the purpose of the four-frame grid – comparing composition – it was able to describe each image individually, which is impressive. In effect, it compared the images as it went.
7. Multiple object detection
Finally, an important test of whether a model will be useful as an accessibility tool is how well it detects objects in a crowded scene. The challenge was the same as the others: the picture shows a crowded desk, and I expected it to pick out even the smaller items.
GPT-4o wrote: “The image depicts a modern and organized office workspace with the city skyline visible through large windows in the background. The desk is neatly arranged with various items typically found in a productive work environment, including a centrally placed laptop, a metal desk lamp and a coffee mug, suggesting a relaxed yet focused atmosphere. Scattered notebooks and papers suggest active work and note-taking, while a small potted plant adds a touch of greenery. A framed photo of two people provides a personal touch, probably serving as a source of motivation. The workspace is well stocked with office supplies such as pens, pencils and organizers, ensuring functionality. The urban backdrop of the city skyline with modern high-rise buildings creates a dynamic and inspiring environment.”
It worked. As a follow-up, I asked it the color of the plant pot, and it got it right. I then asked it to find my glasses in the picture, and it correctly said, “Your glasses are on the table to the right of the laptop, placed on top of the laptop.”
Summary
Every vision model I’ve used in the past has made at least one mistake, usually a big one, such as misidentifying an object or missing a color or marking.
GPT-4o got every one of them spot on. The move to true multimodality has been a game changer for OpenAI.
It also shows the potential value of smart glasses as the true future of data interaction. Forget the smartphone, let’s use vision to connect the real and the digital.