
Microsoft’s LLMLingua-2 Compresses Prompts By 80% in Length

by Dhruv Kudalkar
March 26, 2024

Microsoft recently released a research paper on LLMLingua 2, a novel model for prompt compression. Let’s look at how it works!

Highlights:

  • Microsoft Research released LLMLingua 2, a novel approach for task-agnostic prompt compression.
  • It can shrink prompts to as little as 20 percent of their original length while running 3-6x faster than its predecessor, LLMLingua.
  • It is openly available on GitHub and Hugging Face.

Why Do We Need to Compress Prompts?

Optimizing the length of a prompt is important. Longer prompts lead to higher costs and increased latency, which hurt both the performance and the efficiency of an LLM.

There are various challenges associated with long prompts:

  • Higher Costs: Operating Large Language Models (LLMs) on lengthy prompts incurs significant computational expense, which translates directly into higher operational costs.
  • Increased Latency: Processing lengthy prompts takes more time, which slows down the response time of LLMs. Such delays reduce the efficiency of AI-generated outputs.

To overcome these issues, prompts have to be compressed so that the performance of LLMs can be optimized. The advantages of prompt compression are:

  • Improved Efficiency: Compressed prompts take less time for LLMs to process, leading to faster response times.
  • Optimized Resource Utilization: Smaller prompts let AI systems function without unnecessary overhead, so computational resources are used optimally.
  • Cost Reduction: Shortening prompts reduces the computational resources required to operate LLMs, resulting in cost savings (see the back-of-the-envelope sketch after this list).
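
As a rough illustration of the savings (every price, volume, and length below is a hypothetical assumption for illustration, not a figure from the paper):

```python
# Back-of-the-envelope estimate of API cost savings from prompt compression.
# Every number here is a hypothetical assumption for illustration only.

PRICE_PER_1K_INPUT_TOKENS = 0.01   # assumed API price in dollars
REQUESTS_PER_DAY = 10_000          # assumed traffic
ORIGINAL_PROMPT_TOKENS = 2_000     # assumed average prompt length
COMPRESSION_RATE = 0.2             # keep 20% of tokens, per the 80% reduction claim

def daily_cost(tokens_per_prompt: int) -> float:
    """Dollar cost per day for the prompt (input) tokens alone."""
    return REQUESTS_PER_DAY * tokens_per_prompt / 1000 * PRICE_PER_1K_INPUT_TOKENS

before = daily_cost(ORIGINAL_PROMPT_TOKENS)
after = daily_cost(int(ORIGINAL_PROMPT_TOKENS * COMPRESSION_RATE))
print(f"before: ${before:.2f}/day, after: ${after:.2f}/day")
# before: $200.00/day, after: $40.00/day
```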

Compressing a prompt is not just about shortening it by cutting words. Rather, it’s about understanding the exact meaning of the prompt and then suitably reducing its length. That’s where LLMLingua 2 comes in.

What is LLMLingua 2?

LLMLingua 2 is a compression model developed by Microsoft Research for task-agnostic prompt compression: it works across a wide variety of tasks without requiring task-specific adjustments each time.

LLMLingua 2 employs intelligent compression techniques to shorten lengthy prompts by eliminating redundant words or tokens while preserving important information. Microsoft Research claims that LLMLingua 2 is 3-6 times faster than its predecessor LLMLingua and similar methodologies.
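
Because the model and code are public, it is easy to try. The sketch below uses the `llmlingua` Python package from the project’s GitHub repository; the model name and argument names follow the repository’s examples at the time of writing and may have changed since.

```python
# pip install llmlingua
from llmlingua import PromptCompressor

# LLMLingua-2 checkpoint published on Hugging Face (name per the repo's examples).
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,  # select the LLMLingua-2 token-classification compressor
)

long_prompt = "..."  # your original, lengthy prompt goes here

result = compressor.compress_prompt(
    long_prompt,
    rate=0.2,                  # keep roughly 20% of the original tokens
    force_tokens=["\n", "?"],  # tokens that must survive compression
)
print(result["compressed_prompt"])
```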

How LLMLingua 2 Works

The steps involved in this technique are:

Data Distillation

To extract knowledge from the LLM for effective prompt compression, LLMLingua 2 prompts GPT-4 to generate compressed texts from original texts that satisfy the following criteria:

  1. Token reduction
  2. Informativeness
  3. Faithfulness

However, the team developing LLMLingua 2 found that distilling such data from GPT-4 is challenging, as it does not consistently follow instructions.

Experiments showed that GPT-4 struggles to retain essential information when compressing: it tended to modify expressions from the original content and sometimes produced hallucinated content. To overcome this, they refined the distillation procedure.

To ensure the text remains faithful, they explicitly instructed GPT-4 to compress the text by discarding unimportant words from the original text only, without adding any new words during generation.

To ensure token reduction and informativeness, previous studies had specified either a compression ratio or a target number of compressed tokens in the instructions.

However, GPT-4 often fails to adhere to such limits. Information density also varies with genre and style, and even within a single domain it can vary from writer to writer.

These factors suggested that a fixed compression ratio might not be optimal. So, they removed this restriction from the instructions and instead prompted GPT-4 to compress the original text to be as short as possible while retaining as much essential information as possible.
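
To make this concrete, a distillation instruction in that spirit might look like the template below. This is a paraphrased sketch of the constraints just described, not the paper’s verbatim prompt (the actual instructions appear in the figure below).

```python
# A paraphrased distillation instruction reflecting the constraints above;
# NOT the exact wording used in the LLMLingua-2 paper.
DISTILL_INSTRUCTION = (
    "Compress the given text to be as short as possible while retaining "
    "as much essential information as possible. You may ONLY delete words "
    "from the original text; do not reorder, rephrase, or add new words."
)

def build_distillation_prompt(original_text: str) -> str:
    """Assemble the full prompt sent to GPT-4 during data distillation."""
    return f"{DISTILL_INSTRUCTION}\n\nText:\n{original_text}\n\nCompressed text:"
```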

Given below are the instructions used for compression:

[Figure: instructions used for compression, from the research paper]

They also evaluated instructions that had been proposed in LLMLingua, but these proved suboptimal for LLMLingua 2. The instructions are:

[Figure: compression instructions proposed in LLMLingua]

Data Annotation

The compressed versions from the previous step are compared to the original versions to create a training dataset for the compression model. In this dataset, every word in the original prompt is labeled to indicate whether it should be preserved after compression.
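
Since the compressed text should only delete words, each original word can be greedily matched against the compressed sequence to produce a binary keep/drop label. The function below is an illustrative simplification; the paper’s annotation algorithm handles ambiguous matches more carefully.

```python
def annotate(original_words: list[str], compressed_words: list[str]) -> list[int]:
    """Label each original word 1 (kept in the compressed text) or 0 (dropped).

    Assumes the compressed text only deletes words, so it is a subsequence
    of the original. This greedy matcher is a simplification of the paper's
    annotation procedure.
    """
    labels = []
    j = 0  # position in the compressed word sequence
    for word in original_words:
        if j < len(compressed_words) and word.lower() == compressed_words[j].lower():
            labels.append(1)
            j += 1
        else:
            labels.append(0)
    return labels

original = "the meeting was held on Monday to discuss the quarterly budget".split()
compressed = "meeting Monday discuss quarterly budget".split()
print(annotate(original, compressed))  # [0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1]
```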

Quality Control

Two metrics are used to assess the quality of the compressed texts and the automatically annotated labels:

  • Variation Rate: measures the proportion of words in the compressed text that are absent from the original text (a sketch of this metric follows the list).
  • Alignment Gap: measures the quality of the automatically annotated labels.
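
Following the definition above, the variation rate is straightforward to compute. The sketch below is our reading of the metric, not the paper’s exact formulation:

```python
def variation_rate(original: str, compressed: str) -> float:
    """Fraction of words in the compressed text that never appear in the original.

    A high variation rate signals hallucinated words, so such examples can be
    filtered out of the training data during quality control.
    """
    original_vocab = set(original.lower().split())
    compressed_words = compressed.lower().split()
    if not compressed_words:
        return 0.0
    novel = sum(1 for w in compressed_words if w not in original_vocab)
    return novel / len(compressed_words)

print(variation_rate("the meeting was held on Monday", "meeting Monday"))  # 0.0
print(variation_rate("the meeting was held on Monday", "meeting Tuesday"))  # 0.5
```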

Compressor

They framed prompt compression as a binary token classification problem (preserve or discard), which ensures fidelity to the original content while keeping the compression model’s latency low.

A Transformer encoder is utilized as the feature extractor for the token classification model, leveraging bidirectional context information for each token.
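
In code, inference with such a token classifier looks roughly like the sketch below, using the Hugging Face transformers API with the published LLMLingua-2 checkpoint. The keep/drop rule shown (keep a token when its “preserve” probability clears a threshold) and the assumption that label index 1 means “preserve” are simplifications; the released implementation is more involved.

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

MODEL = "microsoft/llmlingua-2-xlm-roberta-large-meetingbank"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForTokenClassification.from_pretrained(MODEL)

def compress(text: str, threshold: float = 0.5) -> str:
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits  # (1, seq_len, 2): discard vs. preserve
    # Assumption: label index 1 is the "preserve" class.
    keep_prob = logits.softmax(dim=-1)[0, :, 1]
    kept_ids = [
        tid for tid, p in zip(enc["input_ids"][0].tolist(), keep_prob.tolist())
        if p >= threshold and tid not in tokenizer.all_special_ids
    ]
    return tokenizer.decode(kept_ids)
```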

Prompt Compression

When a prompt is provided, the compressor trained in the previous step identifies the key tokens and generates a shortened version that retains the essential information the LLM needs to perform effectively.

Training Data

The compressor was trained on an extractive text compression dataset containing original texts from the MeetingBank dataset paired with their compressed representations.

Prompt Reconstruction

They also ran prompt reconstruction experiments, prompting GPT-4 to reconstruct the original prompt from the compressed prompt generated by LLMLingua 2. GPT-4 could effectively reconstruct the original prompt, indicating that no essential information was lost during compression.
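
That reconstruction check is easy to reproduce in spirit. The sketch below asks GPT-4 via the OpenAI API to expand a compressed prompt; the instruction wording is our own, not the paper’s.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def reconstruct(compressed_prompt: str) -> str:
    # Instruction wording is illustrative, not the paper's verbatim prompt.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                "The following text was compressed by deleting words. "
                "Reconstruct the full original text:\n\n" + compressed_prompt
            ),
        }],
    )
    return response.choices[0].message.content
```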

LLMLingua 2 Prompt Compression Example

The example below shows compression of about 2x. Such a reduction in prompt size helps cut costs and latency and thus improves the efficiency of the LLM.

[Figure: LLMLingua 2 prompt compression example, taken from the research paper]

Another recent development from Microsoft worth reading about is Orca-Math, which can solve complex math problems using a small language model.

Conclusion

LLMLingua 2 represents a transformative approach to prompt compression, cutting the cost and latency of operating an LLM while retaining essential information. It not only enables faster, more streamlined prompt processing but is also task-agnostic, unleashing the full potential of LLMs across diverse use cases.
