Articles by FavTutor

Google’s Gemini 1.5 Pro Got Ears, Can Now Listen to Audio

by Dhruv Kudalkar
April 10, 2024

At its Next 2024 event, Google announced major updates for its Gemini 1.5 Pro model, but the new listening capabilities stand out!

Highlights:

  • Google Gemini 1.5 Pro can now listen to uploaded audio files and interpret them without referring to written transcripts.
  • The new feature can be used to process information from recorded interviews, videos and earnings calls.
  • The model is now available in public preview through Google's Vertex AI platform for building applications.

Gemini 1.5 Pro Can Now Process Audio Streams

Google now allows users to include audio files in their prompts to the Gemini 1.5 Pro model. The model then extracts relevant information from the audio inputs, including music and speech, to generate a response based on the prompt.

This feature lets users feed large audio files to the model without providing transcripts for them. Users can also record audio clips instead of typing long prompts.

Gemini interprets large audio files within seconds; in our tests below, clips of up to 5 minutes and 38 seconds were processed within 20 seconds. When an audio file is uploaded, the interface automatically counts its tokens and displays the total. The model then returns its response in a structured format.
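To get a feel for the token counts the interface reports, here is a minimal sketch of the arithmetic. It assumes a rate of 32 tokens per second of audio, the figure cited in Google's Gemini API documentation; this article's tests did not verify that rate, and the function name is purely illustrative.

```python
# Assumption: 32 tokens per second of audio, the rate cited in Google's
# Gemini API documentation. The article itself does not state a rate.
TOKENS_PER_AUDIO_SECOND = 32


def estimate_audio_tokens(minutes: int, seconds: int = 0) -> int:
    """Estimate the token count for an audio clip of the given length."""
    if minutes < 0 or seconds < 0:
        raise ValueError("duration must be non-negative")
    total_seconds = minutes * 60 + seconds
    return total_seconds * TOKENS_PER_AUDIO_SECOND


if __name__ == "__main__":
    # Clip lengths taken from the test examples in this article.
    print(estimate_audio_tokens(2, 36))  # 156 s * 32 = 4992
    print(estimate_audio_tokens(5, 38))  # 338 s * 32 = 10816
```

Under that assumption, even the longer lecture clip would use only a small fraction of a long context window, though the exact count shown in the interface may differ.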

It can summarize the audio, extract necessary information, answer direct questions, provide reasoning, and explain concepts. With these ‘ears,’ Gemini makes it far more convenient for users to work with audio files or to speak lengthy questions aloud.

The announcement was one of many at the Next ’24 event, which also included the public preview of Gemini 1.5 Pro, inpainting in Imagen 2.0, and new prompt management capabilities for Google's large models. Here is how Google described the new feature:

“In addition, we are announcing that Gemini 1.5 Pro on Vertex AI now supports the ability to process audio streams including speech, and even the audio portion of videos. This enables seamless cross-modal analysis that provides insights across text, images, videos, and audio — such as using the model to transcribe, search, analyze, and answer questions across earnings calls or investor meetings.”

Now, users can upload their lectures, recorded interviews, conferences, earnings calls, music samples, conversations with friends, podcasts or even audio from videos, and get them analyzed through Gemini. Developers who tested Gemini 1.5 Pro after its initial launch found many interesting features.

How to Access It?

The new audio processing capabilities of the Gemini 1.5 Pro model are available through Google's Vertex AI platform.
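For developers taking the Vertex AI route, a request might look roughly like the sketch below. It assumes the `google-cloud-aiplatform` Python SDK is installed and that the audio file already sits in a Cloud Storage bucket. The helper names, the extension-to-MIME-type table, and every project, bucket, and model identifier are hypothetical placeholders, not values from Google's announcement.

```python
import posixpath
from dataclasses import dataclass

# Assumed mapping of file extensions to audio MIME types; check the Vertex AI
# documentation for the formats your model version actually accepts.
AUDIO_MIME_TYPES = {".mp3": "audio/mpeg", ".wav": "audio/wav"}


@dataclass
class AudioRequest:
    gcs_uri: str
    mime_type: str
    prompt: str


def build_audio_request(gcs_uri: str, prompt: str) -> AudioRequest:
    """Validate the inputs and pick a MIME type from the file extension."""
    if not gcs_uri.startswith("gs://"):
        raise ValueError("expected a Cloud Storage URI starting with gs://")
    extension = posixpath.splitext(gcs_uri)[1].lower()
    if extension not in AUDIO_MIME_TYPES:
        raise ValueError(f"unsupported audio extension: {extension!r}")
    return AudioRequest(gcs_uri, AUDIO_MIME_TYPES[extension], prompt)


def ask_gemini_about_audio(request: AudioRequest, project: str,
                           location: str, model_name: str) -> str:
    """Send the audio and prompt to Gemini on Vertex AI; return the reply text."""
    # Imported lazily so build_audio_request works without the SDK installed.
    import vertexai
    from vertexai.generative_models import GenerativeModel, Part

    vertexai.init(project=project, location=location)
    model = GenerativeModel(model_name)
    audio_part = Part.from_uri(request.gcs_uri, mime_type=request.mime_type)
    response = model.generate_content([audio_part, request.prompt])
    return response.text


if __name__ == "__main__":
    # Hypothetical bucket and file name, for illustration only.
    request = build_audio_request(
        "gs://example-bucket/earnings_call.mp3",
        "Summarize this call in a structured format.",
    )
    print(request)
```

Calling `ask_gemini_about_audio` additionally requires Google Cloud credentials, a project with Vertex AI enabled, and whichever model identifier Vertex AI lists for Gemini 1.5 Pro in your project.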

Anyone can try the model for free through Google DeepMind’s official website. Click on ‘Try Gemini 1.5’ and sign in; you will then be redirected to Google AI Studio, where the model can be tested at no cost. The interface is shown below:

Google AI Studio platform

To upload audio files, you first have to connect your Google Drive. After this, you can record audio, upload an audio file from Drive, or upload one from your local machine.

What Can You Do with the New Listening Feature?

We tested the audio feature for various use cases. They are as follows:

Example 1: Conversation

We provided Gemini with an audio recording of a conversation between a college student and a librarian and asked it to summarize the audio in a structured format. The recording is 2 minutes and 36 seconds long.

Gemini responded within 20 seconds, which illustrates how quickly it interprets audio. The response was accurate and included all the important information mentioned in the audio.

Gemini 1.5 Pro generating conversation summary using audio processing feature

We then gave Gemini a short excerpt from the same audio and asked it to interpret what the dialogue meant. It provided an accurate explanation of the dialogue in the context of the conversation.

Follow-up question

Thus, it can also accurately answer follow-up questions based on specific parts of the audio.

Example 2: Summarization

We provided Gemini with an audio recording of a geography lecture and asked it for a brief summary.

The recording ran for 5 minutes and 38 seconds.

Gemini 1.5 Pro for summarization through audio processing feature

Gemini provided a short and simple summary within 20 seconds, which is impressive considering the length of the audio clip.

Example 3: Recording Your Own Audio

Here, we asked Gemini a question using the ‘record an audio’ option. In the recording, we asked it to explain the different evaluation metrics for a Machine Learning model. We also asked it to explain how a confusion matrix is built, along with the different formulae that can be derived from it.

Gemini answered every question accurately. It also described additional evaluation metrics beyond the ones derived from the confusion matrix.
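For readers unfamiliar with the topic of that question, the sketch below shows the standard formulae usually derived from a binary confusion matrix. It is not Gemini's output; the class name and the sample labels are made up for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ConfusionMatrix:
    tp: int  # true positives: actual 1, predicted 1
    fp: int  # false positives: actual 0, predicted 1
    fn: int  # false negatives: actual 1, predicted 0
    tn: int  # true negatives: actual 0, predicted 0

    @classmethod
    def from_labels(cls, y_true: Sequence[int], y_pred: Sequence[int]) -> "ConfusionMatrix":
        """Build the matrix by counting each (actual, predicted) pair."""
        if len(y_true) != len(y_pred):
            raise ValueError("label sequences must have the same length")
        pairs = list(zip(y_true, y_pred))
        return cls(
            tp=pairs.count((1, 1)),
            fp=pairs.count((0, 1)),
            fn=pairs.count((1, 0)),
            tn=pairs.count((0, 0)),
        )

    @staticmethod
    def _ratio(numerator: float, denominator: float) -> float:
        # Return 0.0 instead of dividing by zero when a class is absent.
        return numerator / denominator if denominator else 0.0

    def accuracy(self) -> float:
        return self._ratio(self.tp + self.tn, self.tp + self.fp + self.fn + self.tn)

    def precision(self) -> float:
        return self._ratio(self.tp, self.tp + self.fp)

    def recall(self) -> float:
        return self._ratio(self.tp, self.tp + self.fn)

    def f1(self) -> float:
        p, r = self.precision(), self.recall()
        return self._ratio(2 * p * r, p + r)


if __name__ == "__main__":
    # Made-up labels: 1 is the positive class, 0 the negative class.
    actual = [1, 1, 1, 0, 0, 0, 1, 0]
    predicted = [1, 0, 1, 0, 1, 0, 1, 0]
    cm = ConfusionMatrix.from_labels(actual, predicted)
    print(cm)
    print(cm.accuracy(), cm.precision(), cm.recall(), cm.f1())
```

Metrics that are not derived from the confusion matrix, such as those Gemini added in its answer, are outside the scope of this sketch.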

Asking questions to Google Gemini 1.5 Pro

This is quite impressive, as users no longer need to type out long questions. Instead, they can speak their questions in a few seconds, and the model will understand them and provide the correct response.

Similar capabilities are available in OpenAI’s ChatGPT, so Gemini 1.5 and GPT-4 now compete directly on audio input.

Conclusion

Google’s Gemini 1.5 Pro model’s new feature for audio interpretation without transcripts marks a significant advancement in the ever-improving field of LLMs. With its ability to swiftly summarize lengthy audio files and accurately respond to spoken queries, Gemini promises to improve user interaction and enhance the overall user experience.

Dhruv Kudalkar

Hello, I'm Dhruv Kudalkar, a final year undergraduate student pursuing a degree in Information Technology. My research interests revolve around Generative AI and Natural Language Processing (NLP). I constantly explore new technologies and strive to stay up-to-date in these fields, driven by a passion for innovation and a desire to contribute to the ever-evolving landscape of intelligent systems.


About FavTutor

FavTutor is a trusted online tutoring service that connects students with expert tutors to provide guidance on Computer Science subjects like Java, Python, C, C++, SQL, Data Science, Statistics, etc.


Website listed on Ecomswap. © Copyright 2025 All Rights Reserved.
