OpenAI is at it again with another groundbreaking update! After tools like Sora, Voice Engine, and the DALL-E 3 inpainting feature, a vision model update is what the community has been waiting for. Let's discuss the new Vision features in GPT-4 Turbo!
Highlights:
- OpenAI releases an enhanced GPT-4 Turbo with Vision through the API, which will soon be rolled out in ChatGPT.
- Several users and brands are already using the API to tap Vision's capabilities, such as extracting unstructured text and data from images and the web.
- The enhancement is mainly built around two new features for vision requests: JSON mode and function calling support.
How good is the new GPT-4 Turbo model now that it can handle vision requests? In this article, we are going to dive into these topics and analyze them in detail. So, let's get right into it!
The GPT-4 Turbo Vision Model
OpenAI announced the enhanced and updated version of the GPT-4 Turbo model, which now comes with Vision capabilities. It enables the model to recognize images and provide information about them.
With image processing, the new multimodal GPT-4 Turbo model is now widely accessible through the API. One API request is all that is needed to analyze text and images and derive conclusions using OpenAI’s “more intelligent and multimodal” model.
Majorly improved GPT-4 Turbo model available now in the API and rolling out in ChatGPT. https://t.co/HMihypFusV
— OpenAI (@OpenAI) April 9, 2024
In the past, language model systems were restricted to processing text as their only input modality, which limited the range of use cases to which models such as GPT-4 could be applied.
Developers previously had to use separate models for image tasks. The vision-capable model has sometimes been referred to as GPT-4V, or as gpt-4-vision-preview in the API.
To facilitate the integration of the model into developer processes and apps, Vision requests now accept standard API capabilities like function calls and JSON mode.
With these new features, you can now use gpt-4-turbo with vision to extract structured data from an image!
Previously, the vision model could answer general questions about what is present in an image, but it wasn't optimized to answer more detailed questions, such as where objects are located in the image.
With JSON mode and function calling support, developers can say goodbye to these previous hassles.
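To make this concrete, here is a minimal sketch of a vision request in JSON mode using OpenAI's Python SDK. The model name gpt-4-turbo comes from OpenAI's announcement; the prompt, the JSON keys, and the image URL are placeholder assumptions for illustration only.

```python
# A minimal sketch of a vision request with JSON mode.
# Assumes the OpenAI Python SDK (v1.x) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-turbo",  # the new vision-capable model
    response_format={"type": "json_object"},  # JSON mode: the reply is a single JSON object
    messages=[
        {
            "role": "system",
            "content": "Describe the image as JSON with keys 'objects' and 'scene_description'.",
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                # Placeholder image URL for illustration only
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        },
    ],
)

print(response.choices[0].message.content)  # e.g. {"objects": [...], "scene_description": "..."}
```

The same request shape also accepts base64-encoded images via a data: URL, which is handy when the image lives on disk rather than on the web.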
The GPT-4 Turbo Vision model is now available in OpenAI's API and is being gradually rolled out to ChatGPT users worldwide. To access the OpenAI API, visit this link for the steps to set up API access to the latest GPT-4 Turbo Vision model and start experiencing its groundbreaking features.
What Can GPT-4 Turbo with Vision Do?
The enhanced GPT-4 Turbo Vision model comes with several groundbreaking capabilities. Chief among them, vision requests can now use JSON mode and function calling through the API.
Some users, including OpenAI's developers, have shared insights into the model's capabilities after testing it. Let's take a look at them:
1) Extracting unstructured text and images into database tables
A user named Simon Willison tried the enhanced GPT-4 Turbo Vision model through the API. He gave the model a test input image and used function calling to extract all of the text from it as structured data.
This is the input image that the user gave:
This is the extracted data, returned by GPT-4 Turbo with Vision via function calling:
[
  {
    "event_title": "Coastside Comedy Luau",
    "event_description": "Comedy event featuring Laurie Kilmartin, Ryan Goodcase, and Phil Griffiths, hosted by Marcus D. Includes Hawaiian buffet and welcome cocktail. Proceeds benefit Wilkinson School and Coastside Hope.",
    "event_date": "2022-05-06",
    "start_time": "18:00",
    "end_time": "22:00"
  }
]
This is a great feature of GPT-4 Turbo, made possible by its vision capabilities. Processing complex text from images used to be a hectic task, but it is now coming within reach of LLMs thanks to JSON mode and function calling.
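For a rough idea of how such an extraction might be wired up, the sketch below defines a tool whose parameters mirror the fields in the output above and forces the model to call it. The function name extract_event, its schema, and the poster URL are our own illustrative assumptions, not Simon Willison's actual code.

```python
# Sketch: forcing GPT-4 Turbo with Vision to return event fields via function calling.
# The extract_event tool and its schema are illustrative, not the original setup.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "extract_event",
        "description": "Record the event described in a poster or flyer image.",
        "parameters": {
            "type": "object",
            "properties": {
                "event_title": {"type": "string"},
                "event_description": {"type": "string"},
                "event_date": {"type": "string", "description": "YYYY-MM-DD"},
                "start_time": {"type": "string", "description": "HH:MM, 24-hour clock"},
                "end_time": {"type": "string", "description": "HH:MM, 24-hour clock"},
            },
            "required": ["event_title", "event_date"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the event details from this poster."},
            {"type": "image_url", "image_url": {"url": "https://example.com/poster.jpg"}},
        ],
    }],
    tools=tools,
    # Force the model to respond with this function call rather than free text
    tool_choice={"type": "function", "function": {"name": "extract_event"}},
)

# The extracted fields arrive as the function call's JSON arguments.
print(response.choices[0].message.tool_calls[0].function.arguments)
```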
You can witness the full process in action in the video below:
In it, Simon can be seen extracting unstructured text and images into database tables that he defines and modifies, with the help of GPT-4 Turbo and Datasette Extract.
2) Writing code based on an Interface Drawing
Make Real is an application built by tldraw that lets users draw a UI on a whiteboard. OpenAI's developers shared a video demonstration of using it to make a working website powered by real code.
Make Real uses GPT-4 Turbo's vision capabilities to convert the drawn UI into a corresponding website interface. An interesting moment in the video shows a blue submit button in the web interface turning green after the instruction "Make this green".
Make Real, built by @tldraw, lets users draw UI on a whiteboard and uses GPT-4 Turbo with Vision to generate a working website powered by real code. pic.twitter.com/RYlbmfeNRZ
— OpenAI Developers (@OpenAIDevs) April 9, 2024
At the end of the video, you will see a well-designed web interface developed as per the UI instructions. Thanks to Vision's capabilities, writing code based on interface drawings is a much easier task.
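A stripped-down version of the same idea, written as a sketch under our own assumptions rather than tldraw's implementation, is to send a screenshot of the whiteboard drawing and ask the model for a self-contained HTML page. The file names and prompt below are placeholders.

```python
# Sketch: turning a drawn UI into HTML with a single vision request.
# File names and the prompt are placeholders; this is not tldraw's Make Real code.
import base64

from openai import OpenAI

client = OpenAI()

# Encode a local screenshot of the whiteboard sketch as a data URL
with open("whiteboard_sketch.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Turn this UI sketch into one self-contained HTML file "
                        "with inline CSS and JavaScript. Return only the HTML.",
            },
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)

# The reply may still need light cleanup (e.g. stripping Markdown fences) before saving.
with open("generated_page.html", "w") as f:
    f.write(response.choices[0].message.content)
```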
3) A Variety of Coding Tasks
Devin, the world's first autonomous AI coding agent, is powered by GPT-4 Turbo with Vision. This allows it to achieve success with a variety of coding tasks and functions. In the video below, shared by OpenAI Developers, Devin can be seen writing a fix for an issue in a GitHub repository.
Devin, built by @cognition_labs, is an AI software engineering assistant powered by GPT-4 Turbo that uses vision for a variety of coding tasks. pic.twitter.com/E1Svxe5fBu
— OpenAI Developers (@OpenAIDevs) April 9, 2024
Devin did an amazing job fixing the code issue in the GitHub repository, thanks in part to GPT-4 Turbo's vision capabilities such as JSON mode and function calling. You can also find Devin's other capabilities here, in which GPT-4 Turbo's vision plays a huge role.
4) Giving Nutrition Insights from Images of Foods
Healthify, the world's largest health and fitness app, used GPT-4 Turbo's vision capabilities to build a feature called Snap. This feature gives users nutrition insights from images of foods from around the world.
Users can submit an image of any food along with a proper title, and the app fetches details such as the quantities of protein, fat, carbs, and so on. It also provides a detailed insight paragraph explaining how the food can impact the user's health and whether they should try a better alternative.
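To illustrate roughly how a feature like this could sit on top of the API, here is a guess at the shape of the request, not Healthify's actual implementation: JSON mode with a system prompt that pins down the nutrition fields. The field names, meal title, and image URL are assumptions.

```python
# Sketch: nutrition estimates from a meal photo via JSON mode.
# Field names and prompts are our own assumptions, not Healthify's Snap feature.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-turbo",
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": (
                "Estimate nutrition for the pictured meal. Respond with JSON containing "
                "'dish_name', 'calories_kcal', 'protein_g', 'fat_g', 'carbs_g', and 'notes'."
            ),
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Meal title: chicken burrito bowl"},
                {"type": "image_url", "image_url": {"url": "https://example.com/meal.jpg"}},
            ],
        },
    ],
)

print(response.choices[0].message.content)  # e.g. {"dish_name": "...", "calories_kcal": ...}
```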
The @healthifyme team built Snap using GPT-4 Turbo with Vision to give users nutrition insights through photo recognition of foods from around the world. pic.twitter.com/jWFLuBgEoA
— OpenAI Developers (@OpenAIDevs) April 9, 2024
fatGPT is another application utilizing GPT-4 Turbo's vision capabilities for something similar.
fatGPT is using GPT-4 Vision to analyze meal logs and make weight loss easier. It can even tell that your burger is lettuce wrapped! pic.twitter.com/FbzkdXzEkW
— Erik Dungan (@callmeed) April 9, 2024
This is all thanks to GPT-4 Turbo's vision capabilities, which allow it to extract and interpret meaningful information from images and put it to work across a variety of use cases. These features were limited before, but with the recent upgrades they seem better than ever.
5) Extracting Web Data
Adrian Krebs of Kadoa, whose platform offers unstructured data ETL on autopilot, stated that Kadoa uses GPT-4 Turbo's vision capabilities to automate specific web scraping and RPA tasks that don't work with a text representation alone.
He also shared a video on X, where you can see the entire web scraping process in action. In the video, the application asks for the source websites, the extraction actions to perform, and the URL from which to extract the data.
Later in the video, the application extracts the web data according to the user's preferences.
At Kadoa, we use GPT-4 vision to automate specific web scraping and RPA tasks that don't work with text representation alone. pic.twitter.com/xYr3Je95Rq
— Adrian Krebs (@krebs_adrian) April 9, 2024
Extracting all sorts of unstructured data, whether from the web or from images, is made possible with the help of GPT-4 Turbo's vision capabilities. Web scraping has become a lot easier with this GPT-4 enhancement.
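In the same spirit, and again only as a sketch under our own assumptions rather than Kadoa's pipeline, screenshot-based extraction could render the page, pass the image to the model, and request the visible records as JSON:

```python
# Sketch: extracting listings from a page screenshot when text scraping alone falls short.
# The screenshot file and the target fields are illustrative assumptions.
import base64

from openai import OpenAI

client = OpenAI()

with open("rendered_page.png", "rb") as f:
    page_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-turbo",
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": (
                "Extract every product listing visible in the screenshot. Respond with JSON "
                'shaped like {"listings": [{"name": "", "price": "", "rating": ""}]}.'
            ),
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Here is the rendered page."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{page_b64}"}},
            ],
        },
    ],
)

print(response.choices[0].message.content)
```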
6) Recreating Hacker News with Vision
Magic Patterns is an AI assistant that generates UI from a text prompt, image, or Figma mockup. This tool also leverages GPT-4 Turbo's vision capabilities.
Y Combinator, an American technology startup accelerator and venture capital firm, shared a video on X in which they demonstrated using Magic Patterns to recreate Hacker News, their own news site. You can see the full demonstration in the video below.
Built by two frontend engineers, @magicpatterns (YC W23) is an AI assistant that generates UI from a text prompt, image, or Figma mockup.https://t.co/aqvIcZi7T6
— Y Combinator (@ycombinator) February 22, 2024
Congrats on the launch @alexdanilo99 and @teddarific! pic.twitter.com/XCfLPcnVZr
This is all thanks to Magic Patterns leveraging GPT-4 Vision's JSON mode and function calling capabilities, which allow it to recreate the Hacker News interface.
7) Transforming Dashboard Sketches into Functional Interactive Dashboards
Haroen Vermylen, a data visualization and big data expert at Luzmo, announced that they are using GPT-4 Turbo's Vision API to power Instach.art, a tool that transforms a sketch of a dashboard or a Figma mockup into a fully functional interactive dashboard, demo data included.
Great news!
We're using Vision API to power https://t.co/0vvKnvNY6z — a tool to transform a sketch of a dashboard (or a Figma mockup, …) into a fully-functional interactive dashboard, demo data included.
— Haroen Vermylen (@kagaherk) April 9, 2024
This is made possible yet again by GPT-4's vision capabilities, which can extract meaning from images and designs and convert them into genuinely useful interactions. Who would have imagined that a day would come when simple dashboard sketches could be turned into captivating, interactive dashboards?
Conclusion
GPT-4 Turbo's enhanced vision capabilities are making it possible to succeed at several use cases and functions that were not possible before. What can we expect once it is rolled out to ChatGPT users worldwide? Only time will tell.