New Inflection 2.5 Model Achieving GPT-4 Level Performance

Inflection AI has released its latest large language model (LLM), Inflection 2.5. Arriving just days after Claude 3, it is the newest model to claim performance close to GPT-4 and Gemini. How good is Inflection 2.5? Let’s find out!

Highlights:

Inflection AI unveiled Inflection 2.5, its new Large Language Model for its Pi Chatbot.
Available to all Pi users on the web, iOS, Android, and the new desktop application.
Comes close to GPT-4 on several evaluation benchmarks.

What is Inflection 2.5?

Inflection 2.5 is the new LLM behind the Pi chatbot, launched on March 7, 2024. With it, Inflection AI set out to add IQ to Pi’s already exceptional EQ, improving on its earlier Inflection 1 model.

The new LLM brings large improvements in areas like coding and mathematics, the result of Inflection’s focus on raising the model’s IQ. This is good news for the developer community, as you can now use Pi for coding help and for problem-solving tasks that require reasoning and comprehension.

Pi also has access to real-time web search thanks to the recent update, so users can quickly get up-to-date information.

Benchmarks Competing with GPT-4 and Gemini

According to Inflection, the 2.5 model comes close to GPT-4 level performance. The X factor is that it needed only 40% of the training compute (FLOPs) used for GPT-4.

“Inflection-1 used approximately 4% the training FLOPs of GPT-4 and, on average, performed at approximately 72% GPT-4 level on a diverse range of IQ-oriented tasks. Inflection-2.5, now powering Pi, achieves more than 94% the average performance of GPT-4 despite using only 40% the training FLOPs. We see a significant improvement in performance across the board, with the largest gains coming in STEM areas.”
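To put the quoted figures side by side, here is a minimal sketch of the arithmetic. The numbers are taken directly from Inflection's statement above (they are approximate, and 94% is a lower bound); the dictionary layout and function names are just illustrative.

```python
# Figures quoted by Inflection, expressed as fractions of GPT-4.
# They are approximate ("approximately 4%", "more than 94%").
models = {
    "Inflection-1": {"flops": 0.04, "performance": 0.72},
    "Inflection-2.5": {"flops": 0.40, "performance": 0.94},
}


def compute_multiple(old, new):
    """How many times more training compute the newer model used."""
    return new["flops"] / old["flops"]


def performance_gain_points(old, new):
    """Gain in GPT-4-relative performance, in percentage points."""
    return (new["performance"] - old["performance"]) * 100


old, new = models["Inflection-1"], models["Inflection-2.5"]
print(f"Compute increase: {compute_multiple(old, new):.0f}x")
print(f"Performance gain: {performance_gain_points(old, new):.0f} points")
```

In other words, by Inflection's own numbers, roughly ten times the training compute of Inflection-1 bought about 22 percentage points of GPT-4-relative performance.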

Below we have attached a model-wise comparison between Inflection 1, Inflection 2.5, and GPT-4 (source):

This is only a first look at Inflection 2.5’s capabilities across different skill sets and use cases. Below we go through each benchmark and explain what it means for developers.

1) Massive Multitask Language Understanding (MMLU)

On the MMLU benchmark, which measures performance across a wide range of tasks from high school to professional-level difficulty, Inflection-2.5 shows significant gains over Inflection-1. It was also assessed on the exceedingly challenging expert-level GPQA Diamond benchmark.
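MMLU is scored as plain accuracy on four-option multiple-choice questions. The sketch below shows that scoring step only; the answer key and predictions are made up for illustration and have nothing to do with Inflection's actual results.

```python
def multiple_choice_accuracy(predictions, answers):
    """Fraction of questions where the predicted letter matches the key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Hypothetical answer key and model predictions (letters A-D).
answers = ["B", "D", "A", "C", "B"]
predictions = ["B", "D", "C", "C", "B"]

# 4 of the 5 predictions match the key.
print(multiple_choice_accuracy(predictions, answers))
```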

Developers who want a chatbot assistant with broad world knowledge and problem-solving skills across the humanities, social sciences, and more can rely on Inflection 2.5.

A good score on this benchmark makes the model promising for subjects ranging from general ones like math and history to more specialized ones like ethics and law. Here its performance is comparable to ChatGPT’s.

2) Mathematical Reasoning

The model was also tested on the Hungarian Mathematics exam and the Physics GRE, two STEM evaluations. On the Physics GRE, Inflection-2.5 achieves nearly the top score in maj@32 and scores in the 85th percentile of human test-takers in maj@8. On the Hungarian math exam, it falls slightly behind ChatGPT.
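The maj@8 and maj@32 figures refer to majority voting: the model is sampled k times per question and the most common answer is the one that gets graded. Here is a minimal sketch of that idea, with made-up samples and hypothetical function names; it is not Inflection's evaluation code.

```python
from collections import Counter


def majority_vote(sampled_answers):
    """Return the most common answer among the samples."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    return Counter(sampled_answers).most_common(1)[0][0]


def maj_at_k(samples_per_question, references, k):
    """Accuracy when each question is graded on the majority of its first k samples."""
    correct = 0
    for samples, reference in zip(samples_per_question, references):
        if majority_vote(samples[:k]) == reference:
            correct += 1
    return correct / len(references)


# Made-up data: two questions, eight sampled answers each.
samples = [
    ["41", "42", "42", "40", "42", "42", "41", "42"],  # majority is "42"
    ["9", "7", "9", "9", "8", "9", "7", "9"],          # majority is "9"
]
references = ["42", "7"]

print(maj_at_k(samples, references, k=1))  # only the first sample counts
print(maj_at_k(samples, references, k=8))  # majority over all eight samples
```

In this toy data the first question is wrong on its first sample but right once the vote is taken over eight, which is why scores usually rise with larger k.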

Inflection also evaluated the model on MT-Bench and reported results on a corrected version of the benchmark, MT-Bench Corrected.

It also improved over Inflection 1 on the GSM8k and MATH benchmarks, both good measures of mathematical capability. It falls slightly behind GPT-4 but is on par with GPT-3.5.

Overall, for mathematical reasoning and problem-solving, GPT-4 remains the winner, but Inflection 2.5 has done well to surpass its predecessor, Inflection 1, and to draw level with GPT-3.5.

3) Coding Capabilities

To compare coding capabilities, the MBPP+ and HumanEval+ benchmarks were used. The MBPP benchmark consists of around 1,000 Python programming problems, and the HumanEval benchmark comes with 164 original programming problems assessing language comprehension, algorithms, and simple mathematics.
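Benchmarks of this kind grade a model by running its generated code against unit tests. The sketch below illustrates the idea with a toy problem; the `passes_tests` helper, the `add` task, and its tests are all hypothetical, and real harnesses run candidates in a sandbox with timeouts rather than calling `exec` directly.

```python
def passes_tests(candidate_source, test_source):
    """Return True if the candidate code runs and every test assertion holds.

    Warning: exec runs arbitrary code. Only use this on code you trust.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)  # define the candidate function
        exec(test_source, namespace)       # run the assertions against it
    except Exception:
        return False
    return True


# Two hypothetical model outputs for the same toy problem.
correct_candidate = '''
def add(a, b):
    return a + b
'''
buggy_candidate = '''
def add(a, b):
    return a - b
'''
tests = '''
assert add(2, 3) == 5
assert add(-1, 1) == 0
'''

results = [passes_tests(src, tests) for src in (correct_candidate, buggy_candidate)]
pass_rate = sum(results) / len(results)
print(results, pass_rate)
```

A benchmark score is then the share of problems for which the generated solution passes all of its tests.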

All of these problems can be solved by entry-level programmers, which makes them a good measure of basic coding ability. The results showed a similar trend: Inflection 2.5 beats Inflection 1 but falls behind GPT-4.

Developers should continue using GPT-4 for coding-related purposes, at least for now. In the coming months, we as a community may see Inflection challenge GPT-4’s outstanding code generation and optimization abilities.

4) Common Sense

Inflection also evaluated Inflection-2.5 on HellaSwag and ARC-C, common sense and science benchmarks reported for a wide range of models. The results were impressive, with Inflection 2.5 coming close to GPT-4 on both benchmarks.

These benchmarks matter because they reflect the general knowledge of an AI chatbot, which comes in handy when users work through everyday tasks. Over the years both ChatGPT and Gemini (previously Bard) have made great progress in general reasoning, and now Inflection is the latest player in the game.

5) Near Human Capabilities

BIG-Bench is an extensive evaluation suite aimed at tasks thought to be beyond the capabilities of existing language models. On BIG-Bench-Hard, a subset of BIG-Bench problems that are challenging for large language models, Inflection-2.5 improves by more than 10% over Inflection-1 and is competitive with the best models.

We know GPT-4’s near-human capabilities, and seeing Inflection’s score come close to it gives an idea of how far the LLM can go toward surpassing average human-rater performance.

You can also check the comparison for GPT-4 and Claude 3 to find out about the other competitors.

How to Access Inflection 2.5?

You can access Inflection 2.5 through Inflection’s Pi chatbot assistant, which was released around May last year. The LLM powers the chatbot and is available to all users worldwide via pi.ai, iOS, Android, and the newly introduced desktop app.

Once you go to pi.ai, you can start talking to Pi right away, or you can first set up an account by email signup or phone number verification. Setting up an account lets you see your conversation history.

So go ahead and try it today. Get the first-hand developer experience with Inflection 2.5!

Conclusion

In addition to maintaining Pi’s recognizable demeanor and security protocols, Inflection-2.5 enhances its standing as a flexible and indispensable personal assistant on a wide range of subjects. Pi powered by Inflection-2.5 promises an enhanced user experience for anything from coding to studying for exams, talking about current events, and even having informal discussions.