Microsoft recently released a research paper on LLMLingua 2, a novel model for task-agnostic prompt compression. Let's look at how it works!
Highlights:
- Microsoft Research released LLMLingua 2, a novel approach for task-agnostic prompt compression.
- It can reduce a prompt to as little as 20 percent of its original length while running 3-6x faster than its predecessor, LLMLingua.
- It is openly available on GitHub and Hugging Face.
Why do we need to Compress Prompts?
Optimizing the length of a prompt is important. Longer prompts lead to higher costs and increased latency, which hurts both the performance and the efficiency of an LLM.
There are various challenges associated with long prompts:
- Higher Costs: Operating Large Language Models (LLMs), especially when dealing with lengthy prompts, can incur significant computational expenses. Longer prompts need high computational resources to process, thus contributing to higher operational costs.
- Increased Latency: Processing lengthy prompts takes more time, which slows down the response time of LLMs. Such delays reduce the efficiency of AI-generated outputs.
To overcome these issues, prompts have to be compressed so that the performance of LLMs can be optimized. The advantages of prompt compression are:
- Improved Efficiency: Compression of prompts reduces the time required by LLMs to process data. This leads to faster response times and improved efficiency.
- Optimized Resource Utilization: Smaller prompts ensure that AI systems function efficiently without unnecessary overhead, so computational resources are used optimally.
- Cost Reduction: Shortening prompts reduces the computational resources required to operate an LLM, resulting in cost savings.
Compressing a prompt is not just about trimming words to shorten its length. Rather, it's about understanding the exact meaning of the prompt and then reducing its length accordingly. That's where LLMLingua 2 comes in.
What is LLMLingua 2?
LLMLingua 2 is a compression model developed by Microsoft Research for task-agnostic prompt compression. Being task-agnostic, it works across a variety of tasks without requiring task-specific adjustments each time.
LLMLingua 2 employs intelligent compression techniques to shorten lengthy prompts by eliminating redundant words or tokens while preserving important information. Microsoft Research claims that LLMLingua 2 is 3-6 times faster than its predecessor LLMLingua and similar methodologies.
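For a sense of how it is used in practice, here is a minimal sketch based on the open-source llmlingua package. The checkpoint name and argument names follow the project's README at the time of writing and may differ across versions, so treat this as an illustration rather than the definitive API.

```python
# Minimal sketch of using the released LLMLingua-2 compressor via the
# open-source `llmlingua` package (pip install llmlingua). Checkpoint and
# argument names follow the project README and may change across versions.
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

long_prompt = "<your lengthy prompt here>"

result = compressor.compress_prompt(
    long_prompt,
    rate=0.33,                 # keep roughly one third of the tokens
    force_tokens=["\n", "?"],  # tokens that should never be dropped
)
print(result["compressed_prompt"])
```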
The steps involved in this technique are:
Data Distillation
To extract knowledge from the LLM for effective prompt compression, LLMLingua 2 prompts GPT-4 to generate compressed texts from original texts that satisfy the following criteria:
- Token reduction
- Informativeness
- Faithfulness
However, the team developing LLMLingua 2 found that distilling such data from GPT-4 is a challenging process as it does not consistently follow instructions.
Experiments showed that GPT-4 struggles to retain essential information: it tends to modify expressions from the original content and sometimes hallucinates new content. To overcome this, the team refined their distillation procedure.
To ensure faithfulness, they explicitly instructed GPT-4 to compress the text by only discarding unimportant words from the original, without adding any new words during generation.
To ensure token reduction and informativeness, previous studies had specified either a compression ratio or a target number of compressed tokens in the instructions.
However, GPT-4 often fails to adhere to such constraints. Information density varies with genre and style, and even within a specific domain it can vary from writer to writer.
These factors suggested that a fixed compression ratio might not be optimal. So, they removed this restriction from the instructions and instead prompted GPT-4 to compress the original text to be as short as possible while retaining as much essential information as feasible.
The exact instructions used for compression can be found in the research paper.
They also evaluated a few other instructions that were proposed in LLMLingua, but these turned out not to be optimal for LLMLingua 2.
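As an illustration of the distillation step, the sketch below calls GPT-4 through the OpenAI Python client with an instruction that paraphrases the criteria described above. This is not the paper's exact prompt, and the function name is purely illustrative.

```python
# Sketch of the data-distillation step: asking GPT-4 to compress a chunk of
# text. The instruction paraphrases the criteria described above; it is NOT
# the exact prompt used in the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def distill(original_text: str) -> str:
    instruction = (
        "Compress the following text to be as short as possible while keeping "
        "as much essential information as possible. Only discard unimportant "
        "words from the original text; do not add, reorder, or reword anything."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": original_text},
        ],
        temperature=0,
    )
    return response.choices[0].message.content
```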
Data Annotation
The compressed versions from the previous step are compared against the original texts to create a training dataset for the compression model. In this dataset, every word in the original text is assigned a binary label indicating whether it should be preserved or discarded after compression.
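To make the labelling idea concrete, here is a highly simplified sketch. The paper uses a more careful fuzzy alignment between original and compressed words; this greedy exact match is only illustrative.

```python
# Simplified sketch of the annotation idea: label each word of the original
# text True if it also appears (in order) in the GPT-4 compressed text.
# The paper's annotation uses a more robust fuzzy/sliding-window alignment.
def annotate(original: str, compressed: str) -> list[tuple[str, bool]]:
    comp_words = compressed.split()
    labels, j = [], 0
    for word in original.split():
        if j < len(comp_words) and word == comp_words[j]:
            labels.append((word, True))   # word was preserved
            j += 1
        else:
            labels.append((word, False))  # word was discarded
    return labels

# Example:
# annotate("the meeting will start at noon today", "meeting start noon")
# -> [("the", False), ("meeting", True), ("will", False), ("start", True), ...]
```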
Quality Control
Two metrics are used to assess the quality of the compressed texts and the automatically annotated labels:
- Variation Rate: the proportion of words in the compressed text that are absent from the original text (a minimal sketch of this check follows the list).
- Alignment Gap: a measure of the quality of the automatically annotated labels.
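Below is a minimal sketch of the Variation Rate check. The tokenization and any filtering threshold here are simplified assumptions; the paper's exact computation may differ.

```python
# Sketch of the Variation Rate check: the share of words in the compressed
# text that never occur in the original. A high value suggests GPT-4 rewrote
# or hallucinated content, so such examples can be filtered out.
def variation_rate(original: str, compressed: str) -> float:
    original_words = set(original.lower().split())
    compressed_words = compressed.lower().split()
    if not compressed_words:
        return 0.0
    novel = sum(w not in original_words for w in compressed_words)
    return novel / len(compressed_words)
```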
Compressor
They framed prompt compression as a binary token classification problem (preserve or discard), which ensures fidelity to the original content while keeping the compression model's latency low.
A Transformer encoder is utilized as the feature extractor for the token classification model, leveraging bidirectional context information for each token.
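A minimal sketch of such an architecture is shown below: a bidirectional Transformer encoder with a two-way token-classification head. The base checkpoint name and label meanings here are illustrative assumptions; the released LLMLingua-2 models are fine-tuned variants published by Microsoft on Hugging Face.

```python
# Sketch of the compressor architecture: a bidirectional Transformer encoder
# with a 2-way token-classification head (preserve vs. discard). The base
# checkpoint and label convention below are illustrative, not the released model.
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-large",
    num_labels=2,  # assumed convention: label 1 = preserve, label 0 = discard
)
```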
Prompt Compression
When a prompt is provided, the compressor trained in the previous step identifies the key tokens and generates a shortened version that retains the essential information the LLM needs to perform effectively.
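Here is a sketch of that inference step, reusing the tokenizer and model from the previous sketch: score every token with the classifier, keep the highest-scoring tokens up to a target ratio, and decode the result. Forced tokens and word-level aggregation of sub-word scores from the paper are omitted, and the function name is illustrative.

```python
# Sketch of inference-time compression. Reuses `tokenizer` and `model` from
# the previous sketch; with an untrained classification head this only shows
# the mechanics, not the released model's behaviour.
import torch

def compress(text: str, keep_ratio: float = 0.5) -> str:
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits[0]           # shape: (seq_len, 2)
    keep_prob = logits.softmax(dim=-1)[:, 1]       # probability of "preserve"
    k = max(1, int(keep_ratio * keep_prob.numel()))
    keep_idx = keep_prob.topk(k).indices.sort().values  # keep original order
    kept_ids = enc["input_ids"][0][keep_idx]
    return tokenizer.decode(kept_ids, skip_special_tokens=True)
```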
Training Data
They used an extractive text compression dataset containing pairs of original texts from the MeetingBank dataset along with their compressed text representations. The compressor was trained on this dataset.
Prompt Reconstruction
They also experimented with prompt reconstruction: prompting GPT-4 to reconstruct the original prompt from the compressed prompt generated by LLMLingua 2. The results showed that GPT-4 could effectively reconstruct the original prompt, indicating that no essential information was lost during compression.
LLMLingua 2 Prompt Compression Example
The example below shows compression of about 2x. Such a reduction in prompt size helps cut costs and latency, and thus increases the efficiency of the LLM.
The example has been taken from the research paper.
Another recent development from Microsoft worth reading about is Orca-Math, which can solve challenging math problems using a small language model.
Conclusion
LLMLingua 2 represents a significant step forward in prompt compression, helping cut the cost and latency of operating an LLM while retaining essential information. It not only enables faster, more streamlined prompt processing but also works in a task-agnostic way, unlocking the potential of LLMs across diverse use cases.