Apple enters the realm of LLMs with ReALM, a model its researchers say performs comparably to, and in some cases better than, GPT-4 at reference resolution! The company is expected to make a big push in the AI race later this year, and this might be part of that reveal!
Highlights:
- Apple AI researchers published a paper on a small AI model called ReALM.
- This new system can interpret context from on-screen content.
- The paper claims that ReALM’s performance is comparable to GPT-4 for reference resolution.
What is ReALM?
ReALM, which stands for Reference Resolution As Language Modeling, can interpret the text and entities shown on the screen, along with conversational context, to enhance interactions with the AI assistant.
The concept of reference resolution involves a computer program performing a task based on vague language inputs, such as a user saying “this” or “that.” It’s a complex problem, since computers can’t interpret images the way humans can. However, Apple seems to have found a streamlined solution using LLMs.
The research paper proposes a novel approach to encode on-screen entities and their spatial relationships into a textual representation which can be processed by an LLM. This is done by parsing the screen, sorting the elements based on their position and constructing a representation that preserves the spatial positions of the elements.
The paper mentions four model sizes: 80M, 250M, 1B, and 3B, where “M” and “B” denote millions and billions of parameters, respectively.
The concept presented here is a game changer for Siri interactions.
While interacting with smart assistants, you often refer to context-dependent things, like the restaurant you visited last week or the recipe you last searched for. These are specific entities drawn from the past and current state of the device.
Resolving such references traditionally demands extensive computational resources, given the large number of references that need to be processed day to day.
How does ReALM work?
This is where ReALM’s novel approach has a huge impact: it converts all relevant contextual information to text, which simplifies the task for the language model.
Given a set of relevant entities and a task the user wants to perform, the method should extract the entities pertinent to the current user query. The relevant entities are of three different types:
- On-screen Entities: These are entities that are currently displayed on a user’s screen.
- Conversational Entities: These are entities relevant to the conversation. These entities might come from a previous turn for the user (for example, when the user says “Call Mom”, the contact for Mom would be the relevant entity in question), or from the virtual assistant (for example, when the agent provides a user with a list of places or alarms to choose from).
- Background Entities: These are relevant entities that come from background processes that might not necessarily be a direct part of what the user sees on their screen or their interaction with the virtual agent; for example, an alarm that starts ringing or music that is playing in the background.
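To make this concrete, here is a minimal Python sketch of how such candidate entities might be represented in code. The `Entity` and `EntityType` names and the bounding-box layout are illustrative assumptions, not Apple's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class EntityType(Enum):
    """The three entity categories described in the ReALM paper."""
    ON_SCREEN = "on_screen"            # parsed from what is currently displayed
    CONVERSATIONAL = "conversational"  # surfaced in an earlier turn of the dialogue
    BACKGROUND = "background"          # e.g. a ringing alarm or music playing


@dataclass
class Entity:
    """A candidate entity the model may resolve a reference against.

    `bbox` only applies to on-screen entities and is given here as
    (x, y, width, height) in screen coordinates -- a hypothetical layout,
    not the exact output format of the upstream detectors.
    """
    text: str                          # surface form, e.g. "(415) 555-0123"
    kind: EntityType
    bbox: Optional[Tuple[float, float, float, float]] = None
```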
The key steps involved in converting these entities to textual form are (see the code sketch after this list):
- Parsing the screen: First, ReALM assumes the presence of upstream data detectors which parse the screen and extract entities like phone numbers, contact names, and addresses with their bounding boxes.
- Sorting elements based on spatial positions: These extracted entities are sorted based on their positions on the screen, vertically from top to bottom based on the y-coordinates of their bounding box. Then a stable sort is performed horizontally from left to right based on the x-coordinates.
- Determining vertical levels: A margin is defined to group elements that are within a certain distance from each other vertically. Elements within this margin are considered to be on the same horizontal level or line.
- Constructing the textual representation: The sorted elements are then represented in a text format, with elements on the same horizontal level separated by a tab character, and elements on different levels separated by newline characters. This preserves the relative spatial positioning of the elements on the screen.
- Injecting turn objects: The entities that need to be resolved (referred to as “turn objects”) are injected into this textual representation by enclosing them in double curly braces {{ }}.
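Putting these steps together, here is a minimal, hypothetical Python sketch of the screen-to-text encoding described above. The item format, the `margin` value, and the sample screen are illustrative assumptions, not Apple's implementation.

```python
from typing import List, Tuple

# Each parsed on-screen item is (text, x, y): the surface text and the
# top-left corner of its bounding box -- a simplified, hypothetical stand-in
# for the output of the upstream data detectors.
ScreenItem = Tuple[str, float, float]


def encode_screen(items: List[ScreenItem], turn_objects: set, margin: float = 10.0) -> str:
    """Encode parsed on-screen items as a single block of text.

    Items whose vertical positions fall within `margin` of each other are
    grouped onto one line, each line is ordered left to right, same-line
    items are joined with tabs, lines are joined with newlines, and the
    entities to be resolved (the "turn objects") are wrapped in {{ }}.
    """
    # Sweep top to bottom so consecutive items can be grouped into lines.
    ordered = sorted(items, key=lambda it: it[2])

    lines: List[List[ScreenItem]] = []
    for item in ordered:
        if lines and abs(item[2] - lines[-1][0][2]) <= margin:
            lines[-1].append(item)   # same vertical level as the current line
        else:
            lines.append([item])     # start a new vertical level

    rendered = []
    for line in lines:
        line.sort(key=lambda it: it[1])  # left to right within the level
        cells = [
            "{{" + text + "}}" if text in turn_objects else text
            for text, _, _ in line
        ]
        rendered.append("\t".join(cells))
    return "\n".join(rendered)


# Hypothetical screen: a business name, its rating, and a phone number.
screen = [
    ("Joe's Pizza", 10, 20),
    ("4.5 stars", 200, 22),
    ("(415) 555-0123", 10, 60),
]
print(encode_screen(screen, turn_objects={"(415) 555-0123"}))
```

For this toy screen, the output is “Joe's Pizza” and “4.5 stars” on one tab-separated line, followed by “{{(415) 555-0123}}” on the next, preserving the relative left-to-right, top-to-bottom layout.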
By converting the on-screen information into this textual format, ReALM can leverage the power of LLMs to understand the spatial relationships between entities and resolve references accordingly.
The authors fine-tuned a FLAN-T5 model on various datasets consisting of conversational, synthetic, and on-screen references, and demonstrated that their approach (ReALM) outperforms existing systems and performs comparably to or better than GPT-4, despite using significantly fewer parameters.
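Apple has not released the fine-tuned models or the datasets, but the general shape of fine-tuning a FLAN-T5 checkpoint on text-to-text reference-resolution examples can be sketched with the Hugging Face libraries. Everything below, including the toy example, the prompt format, and the training settings, is a hypothetical illustration rather than the paper's actual setup.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "google/flan-t5-small"  # smallest public checkpoint, for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Each toy example pairs a user query plus an encoded screen with the
# entity the query refers to.
examples = [
    {
        "input": "Query: call the number on screen\n"
                 "Screen:\nJoe's Pizza\t4.5 stars\n{{(415) 555-0123}}",
        "target": "(415) 555-0123",
    },
]

def tokenize(example):
    # text_target tokenizes the label with the same tokenizer.
    return tokenizer(example["input"], text_target=example["target"], truncation=True)

dataset = Dataset.from_list(examples).map(tokenize, remove_columns=["input", "target"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="realm-style-t5", num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```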
This innovative encoding method allows ReALM to handle references to on-screen elements without relying on complex visual understanding models or multi-modal architectures.
Instead, it leverages the strong language understanding capabilities of LLMs while providing the necessary spatial context through textual representation.
The paper includes example figures showing how a user’s screen is seen by the on-screen extractors, and how those inputs are encoded into the textual representation fed to the model.
Here is what the Apple Researchers think about its performance:
“We show that ReaLM outperforms previous approaches, and performs roughly as well as the state-of-the-art LLM today, GPT-4, despite consisting of far fewer parameters, even for on-screen references despite being purely in the textual domain.”
By encoding spatial information into textual representations, ReALM outperforms existing systems and rivals state-of-the-art models while using far fewer parameters. This fine-tuning approach paves the way for more natural and efficient conversations with smart assistants.
Conclusion
This new paper by Apple researchers, and the implementation of this technique, could fundamentally change the way smart assistants process contextual data. Apple is also moving fast with its MM1 models. We’ll have to wait a few more months to see whether it makes it into our hands!