By Jonathan Mosen
Editor’s note: Among his many roles, Jonathan is a former assistive technology product manager and accessibility consultant. He regularly covers technology developments in his Living Blindfully podcast.
Chances are you’ve heard some of the buzz around new technology called ChatGPT. It has taken the concept of artificial intelligence, or AI for short, mainstream.
ChatGPT is a freely available virtual chat interface into which you can enter prompts to generate information or hold an AI-powered conversation. As I sat down to write this article, I gave thought to how to describe it more fully to those who may not have worked in information technology as I have. The obvious answer was to ask ChatGPT to describe the pros and cons of itself in non-technical language. Here’s what it came back with.
“ChatGPT is an AI language model that has transformed the world of conversational AI with its remarkable capabilities. It has strengths and weaknesses like any other technology. One of the major strengths of ChatGPT is its ability to understand and respond to human language in a natural and fluent manner. It can be used in various applications such as customer support, content creation, and social media management. However, despite its many strengths, ChatGPT does have its limitations. One of the main weaknesses of the model is its lack of understanding of common-sense knowledge and world facts. It may generate nonsensical or incorrect information if it has not seen similar examples of information during its training.”
That is a very self-aware, if I dare use that expression, description. ChatGPT is already saving me time, but I have also seen it write utter nonsense on subjects about which I have knowledge. Like any technology, it will improve over time. In the six months I have been using it, its “hallucinations” have decreased, but not disappeared.
There are many articles available already that discuss some very real philosophical and ethical issues we must confront urgently as a species due to the rapid advancement of AI. Many of the pros and cons of this technology have the same impact on us as everyone else. But there’s one area where blind and low vision users may benefit immensely. ChatGPT can now describe images, with competitors like Google also introducing similar features.
The key players
OpenAI, the company behind ChatGPT, are working with two companies in the blindness space to test this technology and its implications. Envision produces software running on smart glasses that provides access to visual information. You can hear a full review and demonstration of the Envision glasses in episode 202 of my podcast, Living Blindfully, formerly called Mosen at Large. The new “Ask Envision” feature is available at this stage in its glasses, but not the free Envision app for your Apple or Android smartphone.
The second blindness-related company working with OpenAI is Be My Eyes. This popular app connects its users with sighted volunteers and companies willing to provide support via the platform. Be My Eyes is testing what it’s calling “Virtual Volunteer”. You provide it with a picture, either by taking a photo in the app or by supplying an existing image. It could, for example, be a picture on social media or a photo from your camera roll. Within seconds, Virtual Volunteer will describe the picture to you. Not only is the initial description far more detailed than that of any app currently on the market, but you can also ask follow-up questions as if you were asking a sighted person to explain particular details to you.
How well it can answer those questions depends on how good the picture is, and how much ChatGPT knows about the subject in question.
Because the description is powered by a tool that holds an astounding amount of knowledge, you can go beyond the picture, as CEO of Be My Eyes, Mike Buckley, explained to Living Blindfully.
“Think about taking a picture of the contents of your refrigerator and not only getting a list of the contents, but the tool can tell you what you can make for dinner based on what’s in there,” he says.
“We took a picture, Jonathan, of the Indian railway map system and then we asked something like, ‘How do I get from Bangalore to Delhi?’ It gave the directions. Literally which lines you had to take and how to get there. Then I said, ‘Can you tell me in Hindi?’ It did. It’s remarkable. It’s remarkable power.”
If you have doubts about what the AI is telling you, or it’s just not being helpful, you can get a sighted volunteer on the line with a double-tap of a button.
Setting realistic expectations
Be My Eyes are taking a cautious approach, testing it with a handful of users before opening the test more widely. Even then, they are seeking to set realistic expectations.
“Let’s be really firm about this,” Buckley says, “We do not want anyone using this technology to replace a white cane. We don’t want to use it to replace a guide dog in terms of where it is now. We want to be slow, we want to be thoughtful, we want to be cautious. We’ve even put that in the agreement upfront with the beta testers that it should not be used for these purposes. Down the road as video comes in and as the AI gets more sophisticated and intelligent, I think there are absolutely going to be navigational and other use cases, but we’re just not there yet.”
What’s it really like?
I’ve been given early access to this technology. When it works well, it can take your breath away. When it fails, it can make things up with such confidence that you may be convinced it’s telling the truth when it isn’t. Those of us with no vision may not be able to determine the accuracy of certain descriptions without verification from a sighted person we trust.
Last year, my wife, youngest daughter and I took a holiday in Europe. I took numerous photos, and I’ve been asking Virtual Volunteer to describe them to me. My wife was fascinated by the ravens at the Tower of London, so my daughter took a picture of one with us in the background. Here’s how Virtual Volunteer described that picture.
“The picture is of a cobblestone street with a black bird in the foreground. In the background, there are two people walking on the sidewalk and a brick building with stairs and a blue lamp post. The bird appears to be a crow or a raven and is standing on the cobblestones. The street is lined with a stone curve and there is moss growing between the cobblestones.”
This is a particularly good example because the system expressed some uncertainty about the bird. In a case like this, you can educate it to put it on the right track. I replied to Virtual Volunteer that the bird is indeed a raven, and to describe the raven to me in detail. Here’s how it replied.
“The bird is a raven, which is a large black bird with a curved beak and glossy feathers. It is standing on the cobblestones with its head tilted to the side as if it is looking at something. Its wings are folded at its sides and its feet are visible. The bird appears to be healthy and well-fed.”
I was also able to ask about the two people in the photo, Bonnie and me, and what we were wearing on that day.
Going through our holiday photos with Virtual Volunteer has brought back memories in vivid detail like no other software has been able to do. I will definitely be taking more photos now that I have this technology.
In daily life, I have used it to help me work out how much charge is left on my portable battery pack, and to describe unfamiliar air conditioning remotes in hotel rooms. You can often ask it to read the controls to you left to right, top to bottom, and it will come back with a detailed explanation.
I can take a picture of a document I suspect is a bill, and ask it to summarise who it’s from and what the total amount owing is. It has given me the closest experience I’ve ever had of glancing at a printed page for relevant information.
I’ve even done things I wouldn’t have bothered doing before, such as taking a picture from a hotel window and gaining a real understanding of what is outside.
Using it to get descriptions of social media photos during the coronation was quite something, as I could ask specific questions about objects and fashion.
All that is great, until it gets something completely wrong. We have a Samsung television at home. When I asked Virtual Volunteer to describe the buttons on the remote control to me, it knew it was looking at a Samsung remote control, but it described a different remote control which did not have the buttons in the same order as the remote we have.
While it is frustrating to press buttons and get the wrong results, it’s not the end of the world if this technology describes the wrong remote control to you.
But what if it were to misidentify medication, or something else that might be a risk to health and even life? It’s for this reason that Be My Eyes and OpenAI are cautious about making this technology widely available. They are also engaging closely with the National Federation of the Blind, a consumer organisation in the United States, to help shape the future of this product.
It would be ideal if a system could be developed that produces a confidence rating for each picture it is describing, and if the technology could be trained to take a conservative approach to description rather than risking a guess.
When computing technology advances to the point that it can provide real-time description of video as we travel, we will face an even greater challenge: the more useful the technology becomes, the harder it will be to know what is real.
In the meantime, there’s no doubt that the new generation of artificial intelligence applied to describing images will give us unparalleled access to visual information.
More info on Virtual Volunteer
For my demonstration of Virtual Volunteer, you can listen to Living Blindfully episode 222, available everywhere you get podcasts. You can also download the free Be My Eyes app for your Android or Apple device, and register your interest in testing the service.