According to Meta, Llama 3 outperforms several existing large language models (LLMs) on crucial benchmarks. This has led to various comparisons and tests between Llama 3 and similar LLMs. In this article, we will compare Meta’s Llama 3 with OpenAI’s best model, GPT-4.
Meta’s Llama 3 vs OpenAI’s GPT-4
An AI model should perform well across industries and use cases, including coding, text generation, solving riddles, and general knowledge. Let’s compare the two models in each of these scenarios:
1) Apple Test
In the Apple Test, an LLM is asked to generate 10 sentences that end with the word ‘apple.’ LLMs often struggle with this task and frequently fail to achieve 100% accuracy. We performed the Apple Test on Llama 3 and GPT-4.
Prompt: Generate 10 sentences that end with the word apple.
Llama 3:
GPT-4:
Llama 3 achieved 100% accuracy, successfully generating 10 sentences ending with the word ‘apple.’ GPT-4, on the other hand, generated only 9 such sentences, narrowly losing to Llama 3 in the Apple Test.
Meta’s Llama 3 beats OpenAI’s GPT-4 in Apple Test.
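As a side note, an ending check like this is simple to script. Below is a minimal sketch of such a checker (the sample sentences are our own illustrations, not the models’ actual outputs):

```python
import re

def ends_with_word(sentence, word="apple"):
    """Return True if the sentence's final word (ignoring punctuation) matches."""
    tokens = re.findall(r"[A-Za-z']+", sentence)
    return bool(tokens) and tokens[-1].lower() == word

# Illustrative sentences, not actual model output
sentences = [
    "She took a bite of the crisp red apple.",
    "For dessert, he baked a pie with cinnamon.",
]
score = sum(ends_with_word(s) for s in sentences)
print(f"{score}/{len(sentences)} sentences end with 'apple'")  # -> 1/2 sentences end with 'apple'
```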
2) Mathematical Riddles – Magic Elevator Test
The magic elevator test is a well-known test for evaluating the logical reasoning capabilities of LLMs. In this test, the model is presented with a logical puzzle about an elevator and asked to give the correct answer.
Prompt: There is a tall building with a magic elevator in it. When stopping on an even floor, this elevator connects to floor 1 instead. Starting on floor 1, I take the magic elevator 3 floors up. Exiting the elevator, I then use the stairs to go 3 floors up again.
Which floor do I end up on?
Llama 3:
GPT-4:
For the magic elevator test, Llama 3 gave the correct answer, successfully understanding the puzzle and the logic involved. It also told the user that this is a classic logic puzzle, which may suggest it was trained on this puzzle. GPT-4, however, could not provide the correct answer and took a wrong approach to the problem.
So, for the magic elevator test, Llama 3 again beats GPT-4.
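The puzzle’s logic is easy to verify mechanically: the elevator stops at floor 1 + 3 = 4, an even floor, so it connects to floor 1 instead; the stairs then lead to floor 1 + 3 = 4. A minimal sketch of this reasoning (our own illustration, not either model’s output):

```python
def magic_elevator(start, floors_up):
    """Ride the elevator; if it stops on an even floor, it connects to floor 1 instead."""
    stop = start + floors_up
    return 1 if stop % 2 == 0 else stop

floor = magic_elevator(start=1, floors_up=3)  # stops at floor 4 (even) -> floor 1
floor += 3                                    # stairs are unaffected by the magic
print(floor)  # -> 4
```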
3) Mathematical Riddles 2
Now, let’s look at another mathematical riddle.
Prompt: Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?
Llama 3:
GPT-4:
In this case, Llama 3 does not understand the puzzle and provides the wrong answer. The right answer is ‘1 sister’: each brother’s two sisters are Sally and one other girl. In comparison, GPT-4 provides an accurate answer along with the correct approach to the riddle.
For mathematical riddles, we can say that Llama 3 and GPT-4 are evenly matched and could excel at different kinds of puzzles.
4) Code Generation
Prompt: Code a snake game in Python.
Llama 3:
```python
import pygame
import time
import random

pygame.init()

white = (255, 255, 255)
yellow = (255, 255, 102)
black = (0, 0, 0)
red = (213, 50, 80)
green = (0, 255, 0)
blue = (50, 153, 213)

dis_width = 800
dis_height = 600
dis = pygame.display.set_mode((dis_width, dis_height))
pygame.display.set_caption('Snake Game')

clock = pygame.time.Clock()
snake_block = 10
snake_speed = 15

font_style = pygame.font.SysFont("bahnschrift", 25)
score_font = pygame.font.SysFont("comicsansms", 35)

def Your_score(score):
    value = score_font.render("Your Score: " + str(score), True, yellow)
    dis.blit(value, [0, 0])

def our_snake(snake_block, snake_list):
    for x in snake_list:
        pygame.draw.rect(dis, black, [x[0], x[1], snake_block, snake_block])

def message(msg, color):
    mesg = font_style.render(msg, True, color)
    dis.blit(mesg, [dis_width / 6, dis_height / 3])

def gameLoop():
    game_over = False
    game_close = False

    x1 = dis_width / 2
    y1 = dis_height / 2
    x1_change = 0
    y1_change = 0

    snake_List = []
    Length_of_snake = 1

    foodx = round(random.randrange(0, dis_width - snake_block) / 10.0) * 10.0
    foody = round(random.randrange(0, dis_height - snake_block) / 10.0) * 10.0

    while not game_over:

        while game_close:
            if pygame.mixer.music.get_busy() == False:
                pygame.mixer.music.load('game_over.wav')
                pygame.mixer.music.play()
            for event in pygame.event.get():
                if event.type == pygame.KEYDOWN:
                    if event.key == pygame.K_SPACE:
                        gameLoop()
                    elif event.key == pygame.K_ESCAPE:
                        pygame.quit()
                        quit()
            dis.fill(blue)
            message("You Lost! Press C-Play Again or Q-Quit", red)
            Your_score(Length_of_snake - 1)
            pygame.display.update()

        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                game_over = True
            if event.type == pygame.KEYDOWN:
                if event.key == pygame.K_LEFT:
                    x1_change = -snake_block
                    y1_change = 0
                elif event.key == pygame.K_RIGHT:
                    x1_change = snake_block
                    y1_change = 0
                elif event.key == pygame.K_UP:
                    y1_change = -snake_block
                    x1_change = 0
                elif event.key == pygame.K_DOWN:
                    y1_change = snake_block
                    x1_change = 0

        if x1 >= dis_width or x1 < 0 or y1 >= dis_height or y1 < 0:
            game_close = True
            pygame.mixer.music.load('game_over.wav')
            pygame.mixer.music.play()

        x1 += x1_change
        y1 += y1_change
        dis.fill(blue)
        pygame.draw.rect(dis, green, [foodx, foody, snake_block, snake_block])

        snake_Head = []
        snake_Head.append(x1)
        snake_Head.append(y1)
        snake_List.append(snake_Head)
        if len(snake_List) > Length_of_snake:
            del snake_List[0]

        for x in snake_List[:-1]:
            if x == snake_Head:
                game_close = True

        our_snake(snake_block, snake_List)
        Your_score(Length_of_snake - 1)
        pygame.display.update()

        if x1 == foodx and y1 == foody:
            Length_of_snake += 1
            foodx = round(random.randrange(0, dis_width - snake_block) / 10.0) * 10.0
            foody = round(random.randrange(0, dis_height - snake_block) / 10.0) * 10.0

        clock.tick(snake_speed)

    pygame.quit()
    quit()

gameLoop()
```
GPT-4:
```python
import pygame
import sys
import random

pygame.init()

# Constants for the game
WIDTH, HEIGHT = 640, 480
GRIDSIZE = 20
GRID_WIDTH = WIDTH // GRIDSIZE
GRID_HEIGHT = HEIGHT // GRIDSIZE

# Colors
BLACK = (0, 0, 0)
WHITE = (255, 255, 255)
GREEN = (0, 255, 0)
RED = (255, 0, 0)

# Initialize display
screen = pygame.display.set_mode((WIDTH, HEIGHT))
clock = pygame.time.Clock()

# Utility functions
def draw_grid():
    for x in range(0, WIDTH, GRIDSIZE):  # Draw vertical lines
        pygame.draw.line(screen, WHITE, (x, 0), (x, HEIGHT))
    for y in range(0, HEIGHT, GRIDSIZE):  # Draw horizontal lines
        pygame.draw.line(screen, WHITE, (0, y), (WIDTH, y))

def random_position():
    return random.randint(0, GRID_WIDTH - 1), random.randint(0, GRID_HEIGHT - 1)

class Snake:
    def __init__(self):
        self.length = 1
        self.positions = [random_position()]
        self.direction = random.choice([pygame.K_UP, pygame.K_DOWN, pygame.K_LEFT, pygame.K_RIGHT])
        self.color = GREEN

    def get_head_position(self):
        return self.positions[0]

    def turn(self, point):
        if self.length > 1 and (point[0] * -1, point[1] * -1) == self.direction:
            return
        else:
            self.direction = point

    def move(self):
        current = self.get_head_position()
        x, y = self.direction
        new = (((current[0] + x) % GRID_WIDTH), (current[1] + y) % GRID_HEIGHT)
        if new in self.positions[2:]:
            self.reset()
        else:
            self.positions.insert(0, new)
            if len(self.positions) > self.length:
                self.positions.pop()

    def reset(self):
        self.length = 1
        self.positions = [random_position()]
        self.direction = random.choice([pygame.K_UP, pygame.K_DOWN, pygame.K_LEFT, pygame.K_RIGHT])

    def draw(self):
        for p in self.positions:
            r = pygame.Rect((p[0] * GRIDSIZE, p[1] * GRIDSIZE), (GRIDSIZE, GRIDSIZE))
            pygame.draw.rect(screen, self.color, r)
            pygame.draw.rect(screen, BLACK, r, 1)

    def handle_keys(self):
        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                pygame.quit()
                sys.exit()
            elif event.type == pygame.KEYDOWN:
                if event.key == pygame.K_UP:
                    self.turn((0, -1))
                elif event.key == pygame.K_DOWN:
                    self.turn((0, 1))
                elif event.key == pygame.K_LEFT:
                    self.turn((-1, 0))
                elif event.key == pygame.K_RIGHT:
                    self.turn((1, 0))

class Food:
    def __init__(self):
        self.position = random_position()
        self.color = RED

    def randomize_position(self):
        self.position = random_position()

    def draw(self):
        r = pygame.Rect((self.position[0] * GRIDSIZE, self.position[1] * GRIDSIZE), (GRIDSIZE, GRIDSIZE))
        pygame.draw.rect(screen, self.color, r)
        pygame.draw.rect(screen, BLACK, r, 1)

# Main game loop
def main():
    snake = Snake()
    food = Food()
    while True:
        snake.handle_keys()
        snake.move()
        if snake.get_head_position() == food.position:
            snake.length += 1
            food.randomize_position()
        screen.fill(BLACK)
        draw_grid()
        snake.draw()
        food.draw()
        pygame.display.update()
        clock.tick(10)

if __name__ == '__main__':
    main()
```
Here, we asked Llama 3 and GPT-4 to code the snake game in Python. Comparing the two responses, the game coded by Llama 3 is better in terms of functionality, aesthetics, and additional features. Llama 3’s response also included instructions on how to run the game and how to organize the files.
For code generation tasks, Llama 3 is better than GPT-4, as seen in the example above.
5) Code Debugging
Prompt: Explain the bug in the following code:
```python
from time import sleep
from multiprocessing.pool import ThreadPool

def task():
    sleep(1)
    return 'all done'

if __name__ == '__main__':
    with ThreadPool() as pool:
        result = pool.apply_async(task())
        value = result.get()
        print(value)
```
Llama 3:
GPT-4:
For code debugging, both GPT-4 and Llama 3 can identify the error and rectify it properly. However, GPT-4 explains the error better and makes it easier for the user to understand.
So, for code debugging, we can say that GPT-4 outperforms Llama 3.
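For reference, the bug is that `pool.apply_async(task())` calls `task` immediately in the main thread and hands its return value (the string `'all done'`) to the pool instead of the callable itself. A corrected sketch:

```python
from time import sleep
from multiprocessing.pool import ThreadPool

def task():
    sleep(1)
    return 'all done'

if __name__ == '__main__':
    with ThreadPool() as pool:
        # Pass the callable itself; apply_async will invoke it on a worker thread.
        result = pool.apply_async(task)
        value = result.get()
        print(value)  # -> all done
```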
6) Mathematical Operations
Prompt: Factor 2x^5 + 2x^3 + 3x^2 + 3 =
Llama 3:
GPT-4:
Llama 3 is unable to factorize the given polynomial and thus provides inaccurate factors. In comparison, GPT-4 correctly factorizes the polynomial as required.
In mathematical operation questions, GPT-4 beats Llama 3.
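For reference, the polynomial factors by grouping: 2x^5 + 2x^3 + 3x^2 + 3 = 2x^3(x^2 + 1) + 3(x^2 + 1) = (x^2 + 1)(2x^3 + 3). A quick sketch that confirms this by multiplying coefficient lists:

```python
def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists, lowest degree first."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

# (x^2 + 1) -> [1, 0, 1]; (2x^3 + 3) -> [3, 0, 0, 2]
product = poly_mul([1, 0, 1], [3, 0, 0, 2])
# 2x^5 + 2x^3 + 3x^2 + 3 -> [3, 0, 3, 2, 0, 2]
print(product == [3, 0, 3, 2, 0, 2])  # -> True
```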
7) General Knowledge
Prompt: Is Taiwan an independent country?
Llama 3:
GPT-4:
We asked the models the controversial question of whether Taiwan is an independent country. GPT-4 provides the better answer, as it is clearer on the topic and more willing to address the question directly.
We also tested the models on another question.
Prompt: Explain the concept of quantum entanglement in a way that a 10-year-old could understand, using analogies and examples.
Llama 3:
GPT-4:
In this case, the response generated by GPT-4 is better: it is clear, concise, to the point, and covers everything regarding the concept. It also understands that it should use simple examples, as the concept has to be explained to a 10-year-old. In contrast, Llama 3 uses a walkie-talkie analogy, something a 10-year-old might not be familiar with.
So, in terms of general knowledge, GPT-4 edges out Llama 3.
8) Text Generation
We asked the models to design a product description for a wireless fast charger.
Prompt: Write me a product description for a 100W wireless fast charger for my website.
Llama 3:
GPT-4:
We can see that both models generated strong product descriptions. Llama 3 covers more use cases and also mentions the compatible devices and the contents of the product box, whereas GPT-4 mainly covers features.
Here, we can say that Llama 3 outperforms GPT-4 for text generation tasks.
9) Text Summarization
We asked the models to summarize a short email inquiring about the deposit payment and its deadline for admission to USC.
Prompt: Summarize this in 30 words
“Dear USC Graduate Admissions,
I hope this email finds you well. My name is XYZ, and I am writing to inquire about the deadline to reserve my seat for the Masters in Computer Science program for Fall 2024 at USC Viterbi School of Engineering.
I am delighted to have received my admission letter on February 25th, 2024, and I am eager to confirm my enrollment in the program. However, I would appreciate it if you could provide information regarding the deadline to reserve my seat and the deposit required to secure my spot.
Could you please inform me of the deadline by which I need to confirm my enrollment and pay the deposit? Additionally, I would like to know the amount of the deposit and any relevant details regarding the payment process.
Your assistance in providing this information would be greatly appreciated as it will help me plan accordingly and ensure a smooth transition into the program.
Thank you very much for your attention to this matter. I look forward to your prompt response.
Attached below is my USC ID.
USC ID: 1234567890
Best Regards,
XYZ”
Llama 3:
GPT-4:
Here, we can observe that both summaries are similar and accurate. However, GPT-4 adhered to the exact word limit, summarizing the email in fewer words than Llama 3.
So, for summarization tasks, GPT-4 outperforms Llama 3.
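A 30-word limit is easy to check programmatically; here is a minimal sketch (the summary text below is our own illustrative example, not either model’s actual output):

```python
# Hypothetical word-limit check; the summary text is illustrative, not model output.
summary = (
    "Admitted student XYZ asks USC Graduate Admissions for the enrollment "
    "confirmation deadline, deposit amount, and payment details for the "
    "Fall 2024 Masters in Computer Science program."
)
word_count = len(summary.split())
print(word_count, "words;", "within the 30-word limit" if word_count <= 30 else "over the limit")
```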
10) Story Narration
We asked Llama 3 and GPT-4 to write a short story in the style of the famous author J.K. Rowling.
Prompt: Write a short story in the style of a famous author, such as J.K. Rowling, set in a futuristic world where humans coexist with advanced AI.
Llama 3:
GPT-4:
In this case, neither model performs very effectively; the generated styles hardly match J.K. Rowling’s. However, Llama 3 slightly edged out GPT-4 for this use case.
For style mimicking, Llama 3 beats GPT-4.
Conclusion
Llama 3 demonstrates competitiveness with GPT-4 across various tasks, surpassing GPT-4 in certain areas while trailing behind in others. Its open-source nature has attracted significant community interest, and with continued development and enhancements, Llama 3 holds promise to eventually rival the capabilities of the advanced GPT-4 model.