
Apple Dethrones GPT-4 with ReALM, Perfect for On-Device AI?

Ruchi Abhyankar by Ruchi Abhyankar
April 4, 2024
Reading Time: 6 mins read
Apple ReALM outperforms GPT-4 for on-screen AI context

Apple has entered the realm of LLMs with ReALM, a model that may outperform GPT-4 at reference resolution. The company is expected to make a big push in the AI race later this year, and this research could be the centerpiece of that reveal!

Highlights:

  • Apple AI researchers published a paper on a small AI model called ReALM.
  • This new system can interpret context from on-screen content.
  • The paper claims that ReALM’s performance is comparable to GPT-4 for reference resolution.

What is ReALM?

ReALM, which stands for Reference Resolution As Language Modeling, can understand images and text on the screen to enhance a user’s interactions with an AI assistant.

The concept of reference resolution involves a computer program performing a task based on vague language inputs, such as a user saying “this” or “that.” It is a hard problem, since computers cannot interpret images the way humans can. However, Apple seems to have found a streamlined solution using LLMs.

The research paper proposes a novel approach: encode on-screen entities and their spatial relationships into a textual representation that an LLM can process. This is done by parsing the screen, sorting the elements based on their positions, and constructing a representation that preserves the spatial layout of the elements.

The paper mentions four model sizes: 80M, 250M, 1B, and 3B, where “M” and “B” denote the number of parameters in millions and billions, respectively.

The concept presented here is a game changer for Siri interaction.

While interacting with smart assistants, you often provide context-dependent information like the restaurant you visited last week or the recipe you last searched for. These are specific entities based on the past and current state of the device.

However, this requires extensive computational resources due to the large number of references that need to be processed on a day-to-day basis.

How does ReALM work?

This is where the novel approach of ReALM has a huge impact. ReALM converts all relevant contextual information to text which simplifies the task for the language model.

Given a set of relevant entities and a task the user wants to perform, the method should extract the entities pertinent to the current user query. These relevant entities fall into three types:

  • On-screen Entities: These are entities that are currently displayed on a user’s screen.
  • Conversational Entities: These are entities relevant to the conversation. These entities might come from a previous turn for the user (for example, when the user says “Call Mom”, the contact for Mom would be the relevant entity in question), or from the virtual assistant (for example, when the agent provides a user with a list of places or alarms to choose from).
  • Background Entities: These are relevant entities that come from background processes that might not necessarily be a direct part of what the user sees on their screen or their interaction with the virtual agent; for example, an alarm that starts ringing or music that is playing in the background.
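
These three entity types can be sketched as a small data model. The class names, fields, and the phone-number example below are illustrative assumptions for exposition, not definitions from the paper:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional, Tuple

class EntityType(Enum):
    ON_SCREEN = auto()       # currently visible on the user's screen
    CONVERSATIONAL = auto()  # surfaced earlier in the dialogue
    BACKGROUND = auto()      # e.g. a ringing alarm or music playing

@dataclass
class Entity:
    text: str                # surface form, e.g. "415-555-0101"
    kind: EntityType
    # Bounding box (x, y, w, h); only on-screen entities have one.
    bbox: Optional[Tuple[int, int, int, int]] = None

# Candidate entities for a query like "call that number"
candidates = [
    Entity("415-555-0101", EntityType.ON_SCREEN, (120, 340, 160, 24)),
    Entity("Mom", EntityType.CONVERSATIONAL),
    Entity("Morning alarm", EntityType.BACKGROUND),
]
```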

The key steps involved in converting these entities to textual forms are:

  1. Parsing the screen: First, ReALM assumes the presence of upstream data detectors which parse the screen and extract entities like phone numbers, contact names, and addresses with their bounding boxes.
  2. Sorting elements based on spatial positions: These extracted entities are sorted based on their positions on the screen, vertically from top to bottom based on the y-coordinates of their bounding box. Then a stable sort is performed horizontally from left to right based on the x-coordinates.
  3. Determining vertical levels: A margin is defined to group elements that are within a certain distance from each other vertically. Elements within this margin are considered to be on the same horizontal level or line.
  4. Constructing the textual representation: The sorted elements are then represented in a text format, with elements on the same horizontal level separated by a tab character, and elements on different levels separated by newline characters. This preserves the relative spatial positioning of the elements on the screen.
  5. Injecting turn objects: The entities that need to be resolved (referred to as “turn objects”) are injected into this textual representation by enclosing them in double curly braces {{ }}.
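
The five steps above can be sketched roughly as follows. The function name, the `(text, x, y)` input format, and the 10-pixel margin are illustrative assumptions, not details taken from the paper:

```python
def encode_screen(entities, turn_objects, margin=10):
    """Render parsed screen entities as layout-preserving text.

    entities: (text, x, y) tuples from an upstream screen parser (step 1).
    turn_objects: entity texts to be resolved, wrapped in {{ }} (step 5).
    margin: max vertical gap for two elements to share a line (step 3).
    """
    by_y = sorted(entities, key=lambda e: e[2])        # step 2: top to bottom
    lines, current, last_y = [], [], None
    for ent in by_y:
        if current and ent[2] - last_y > margin:       # step 3: new level
            lines.append(current)
            current = []
        current.append(ent)
        last_y = ent[2]
    if current:
        lines.append(current)
    rows = []
    for line in lines:
        line.sort(key=lambda e: e[1])                  # step 2: left to right
        cells = ["{{%s}}" % t if t in turn_objects else t
                 for t, _, _ in line]
        rows.append("\t".join(cells))                  # step 4: tabs in a line,
    return "\n".join(rows)                             # newlines between lines

# A toy contacts screen: name and number on one line, a button below.
screen = [("Mom", 10, 100), ("415-555-0101", 200, 102), ("Call", 10, 160)]
encode_screen(screen, {"415-555-0101"})
# -> "Mom\t{{415-555-0101}}\nCall"
```

Note how the tab and newline structure alone tells the model that “Mom” and the phone number sit side by side, while “Call” is a separate row.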

By converting the on-screen information into this textual format, ReALM can leverage the power of LLMs to understand the spatial relationships between entities and resolve references accordingly. 
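
A final input to the model might then combine conversational entities, the encoded screen text, and the user’s query into one prompt. The format below is a hypothetical sketch, not the paper’s actual template:

```python
def build_prompt(conversational, screen_text, query):
    """Assemble a single textual LM input (illustrative format)."""
    parts = []
    if conversational:
        parts.append("Conversation entities: " + ", ".join(conversational))
    if screen_text:
        parts.append("Screen:\n" + screen_text)
    parts.append("Query: " + query)
    parts.append("Which entities does the query refer to?")
    return "\n".join(parts)

prompt = build_prompt(
    ["Mom"],                            # from an earlier turn
    "Mom\t{{415-555-0101}}\nCall",      # encoded screen with a turn object
    "call that number",
)
```

Because everything is plain text, the same fine-tuned language model can resolve references across all three entity types without any vision component.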

The authors fine-tuned a FLAN-T5 model on various datasets consisting of conversational, synthetic, and on-screen references, and demonstrated that their approach (ReALM) outperforms existing systems and performs comparably to or better than GPT-4, despite using significantly fewer parameters.

ReALM comparison with other models

This innovative encoding method allows ReALM to handle references to on-screen elements without relying on complex visual understanding models or multi-modal architectures.

Instead, it leverages the strong language understanding capabilities of LLMs while providing the necessary spatial context through textual representation.

Here is an example of how the user screen is seen by on-screen extractors:

Technical diagrams representing user screens, detectable by screen parser-extractors.

Here is an example of how inputs into the model have been encoded, in the form of a visual representation:

Qualitative examples of the fine-tuned LLM-based model adapting to complex use cases

Here is what the Apple Researchers think about its performance:

“We show that ReALM outperforms previous approaches, and performs roughly as well as the state-of-the-art LLM today, GPT-4, despite consisting of far fewer parameters, even for on-screen references despite being purely in the textual domain.”

By encoding spatial information into textual representations, ReALM outperforms existing systems and rivals state-of-the-art models using fewer parameters. This fine-tuning approach paves the way for more natural and efficient conversations.

Conclusion

This new paper by Apple researchers, and the technique it implements, could fundamentally change the way smart assistants process contextual data. Apple is also moving fast with its MM1 models. We’ll have to wait a few more months to find out whether ReALM reaches our devices!

Ruchi Abhyankar

Hi, I'm Ruchi Abhyankar, a final year BTech student graduating with honors in AI and ML. My academic interests revolve around generative AI, deep learning, and data science. I am very passionate about open-source learning and am constantly exploring new technologies.

