{"id":3454,"date":"2024-04-10T08:06:00","date_gmt":"2024-04-10T08:06:00","guid":{"rendered":"https:\/\/favtutor.com\/articles\/?p=3454"},"modified":"2024-04-10T10:16:22","modified_gmt":"2024-04-10T10:16:22","slug":"gemini-pro-audio-listen","status":"publish","type":"post","link":"https:\/\/favtutor.com\/articles\/gemini-pro-audio-listen\/","title":{"rendered":"Google&#8217;s Gemini 1.5 Pro Got Ears, Can Now Listen to Audio"},"content":{"rendered":"\n<p>At its Next 2024 event, Google announced major updates for its Gemini 1.5 Pro model, but the new listening capabilities stand out!<\/p>\n\n\n\n<p><strong>Highlights:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Google Gemini 1.5 Pro can now listen to uploaded audio files and interpret them without referring to written transcripts.<\/li>\n\n\n\n<li>The new feature can be used to process information from recorded interviews, videos and earning calls.<\/li>\n\n\n\n<li> The model is now available in public preview through its Vertex AI platform to build applications.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Gemini 1.5 Pro Can Now Process Audio Streams<\/strong><\/h2>\n\n\n\n<p><strong>Google now allows users to prompt audio files to the Gemini 1.5 Pro model. The model then extracts relevant information from the audio inputs, including music and speech, to generate a response based on the prompt. <\/strong><\/p>\n\n\n\n<p>This feature lets users feed large audio files without the need to provide transcripts for them. Users can also record audio clips instead of typing long prompts.<\/p>\n\n\n\n<p>Gemini excels at interpreting very large audio files within a matter of seconds. When an audio file is uploaded, it automatically counts the number of tokens and displays it on the interface. It then gives the required response in a structured format.<\/p>\n\n\n\n<p>It can summarize the audio, extract necessary information, answer direct questions, provide reasoning, and explain concepts. 
With Gemini thus given \u2018ears,\u2019 the user experience becomes far more convenient for anyone who needs to upload audio files or ask lengthy questions aloud.<\/p>\n\n\n\n<p>The announcement came with <a href=\"https:\/\/cloud.google.com\/blog\/products\/ai-machine-learning\/google-cloud-gemini-image-2-and-mlops-updates\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">many new updates at the Next &#8217;24 event<\/a>, including making it available in public preview, inpainting in Imagen 2.0, and new prompt management capabilities for their large models. Here is how they described the new feature:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>&#8220;In addition, we are announcing that Gemini 1.5 Pro on Vertex AI now supports the ability to process audio streams including speech, and even the audio portion of videos. This enables seamless cross-modal analysis that provides insights across text, images, videos, and audio \u2014 such as using the model to transcribe, search, analyze, and answer questions across earnings calls or investor meetings.&#8221;<\/p>\n<\/blockquote>\n\n\n\n<p>Now, users can upload their lectures, recorded interviews, conferences, earnings calls, music samples, conversations with friends, podcasts, or even audio from videos, and get them analyzed through Gemini. After its initial launch, <a href=\"https:\/\/favtutor.com\/articles\/gemini-pro-testing-developers-feeback\/\">developers found many interesting features in Gemini 1.5 Pro<\/a> while testing it.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>How to Access It?<\/strong><\/h2>\n\n\n\n<p><strong>The new audio processing capabilities are available for the Gemini 1.5 Pro model through Google&#8217;s Vertex AI platform. 
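<\/strong><\/p>\n\n\n\n<p>For developers, the same upload-then-prompt flow can also be scripted. Below is a minimal sketch assuming the <code>google-generativeai<\/code> Python SDK, a <code>GOOGLE_API_KEY<\/code> environment variable, a hypothetical <code>lecture.mp3<\/code> file, and the <code>gemini-1.5-pro-latest<\/code> model id; treat the exact model name and SDK surface as assumptions, not official guidance:<\/p>

```python
import os

# Assumption: the model id current during the public preview.
MODEL_NAME = "gemini-1.5-pro-latest"

def build_prompt(task: str) -> str:
    """Append an instruction asking for the structured output Gemini gives."""
    return task + "\nRespond in a structured format with clear headings."

def summarize_audio(audio_path: str, task: str) -> str:
    """Upload an audio file and prompt Gemini 1.5 Pro about its contents.

    Requires `pip install google-generativeai` and a GOOGLE_API_KEY
    environment variable; performs network calls.
    """
    import google.generativeai as genai  # third-party SDK, imported lazily
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    audio = genai.upload_file(path=audio_path)  # File API upload
    model = genai.GenerativeModel(MODEL_NAME)
    # Token count for the clip, analogous to what the AI Studio UI displays.
    print(model.count_tokens([audio]))
    return model.generate_content([audio, build_prompt(task)]).text

# Example (needs network access and a valid key):
# summarize_audio("lecture.mp3", "Summarize this lecture in short.")
```

<p>The audio file is uploaded first and then passed alongside the text prompt, mirroring the Drive\/upload flow described in this section.<\/p>\n\n\n\n<p><strong>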
<\/strong><\/p>\n\n\n\n<p>Anyone can access the model for free through <a href=\"https:\/\/deepmind.google\/technologies\/gemini\/#gemini-1.5\" target=\"_blank\" rel=\"noopener\">Google DeepMind\u2019s official website<\/a>. Click on \u2018Try Gemini 1.5.\u2019 You will have to sign in before being redirected to Google AI Studio, where the model can be tested. The interface for the model is shown below:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img decoding=\"async\" width=\"1024\" height=\"603\" src=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/image12-1024x603.png\" alt=\"Google AI studio platform\" class=\"wp-image-3457\" srcset=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/image12-1024x603.png 1024w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/image12-300x177.png 300w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/image12-768x452.png 768w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/image12-1536x904.png 1536w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/image12-750x441.png 750w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/image12-1140x671.png 1140w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/image12.png 1638w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/div>\n\n\n<p>Now, to upload audio files, users will first have to connect their Google Drive. After this, they can record audio, upload audio from Drive, or upload audio from their local machine.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>What Can You Do with This New Listening Feature?<\/strong><\/h2>\n\n\n\n<p>We tested the audio feature for various use cases. 
They are as follows:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Example 1: Conversation<\/strong><\/h3>\n\n\n\n<p>We provided Gemini with an audio recording of a conversation between a college student and a librarian. We asked it to summarize the audio in a structured format. The audio is 2 minutes and 36 seconds long. <\/p>\n\n\n\n<div align=\"center\"><blockquote class=\"twitter-tweet\" data-conversation=\"none\"><p lang=\"en\" dir=\"ltr\">1) I provided Gemini with an audio of a conversation between a college student and a librarian. I asked Gemini to summarize the audio in a structured format that included all the important information mentioned in the audio. I also asked a follow-up question. <a href=\"https:\/\/t.co\/gNCIOTnPsR\" target=\"_blank\">pic.twitter.com\/gNCIOTnPsR<\/a><\/p>&mdash; Dhruv (@dhruvvvvvvvvv_) <a href=\"https:\/\/twitter.com\/dhruvvvvvvvvv_\/status\/1777802464630128881?ref_src=twsrc%5Etfw\" target=\"_blank\" rel=\"noopener\">April 9, 2024<\/a><\/blockquote> <script async src=\"https:\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script><\/div>\n\n\n\n<p>Gemini was able to respond within 20 seconds, highlighting the speed of audio interpretation mentioned above. 
It generated an accurate response that included all the important information mentioned in the audio.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" width=\"943\" height=\"848\" src=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/01.png\" alt=\"Gemini 1.5 Pro generating conversation summary using audio processing feature\" class=\"wp-image-3458\" srcset=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/01.png 943w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/01-300x270.png 300w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/01-768x691.png 768w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/01-750x674.png 750w\" sizes=\"(max-width: 943px) 100vw, 943px\" \/><\/figure>\n<\/div>\n\n\n<p>We then gave Gemini a small excerpt of the same audio. We asked it to listen to the excerpt and interpret what exactly the dialogue means. It provided an accurate explanation of what the dialogue meant in the context of the conversation. 
<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" width=\"940\" height=\"387\" src=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/04.png\" alt=\"follow up question\" class=\"wp-image-3459\" srcset=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/04.png 940w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/04-300x124.png 300w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/04-768x316.png 768w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/04-750x309.png 750w\" sizes=\"(max-width: 940px) 100vw, 940px\" \/><\/figure>\n<\/div>\n\n\n<p>Thus, it can also accurately answer follow-up questions based on specific parts of the audio.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Example 2: Summarization<\/strong><\/h3>\n\n\n\n<p>We provided Gemini with an audio clip from a geography lecture and asked it to summarize the lecture briefly. <\/p>\n\n\n\n<div align=\"center\"><blockquote class=\"twitter-tweet\" data-conversation=\"none\"><p lang=\"en\" dir=\"ltr\">2) I provided Gemini with an audio from a geography lecture and asked it to summarize the lecture in short. This was a very lengthy video of precisely 5 minutes and 38 seconds. Gemini was able to provide a short and simple summary within 20 seconds which is really impressive. <a href=\"https:\/\/t.co\/ZmHxa6BAtW\" target=\"_blank\">pic.twitter.com\/ZmHxa6BAtW<\/a><\/p>&mdash; Dhruv (@dhruvvvvvvvvv_) <a href=\"https:\/\/twitter.com\/dhruvvvvvvvvv_\/status\/1777803401255883028?ref_src=twsrc%5Etfw\" target=\"_blank\" rel=\"noopener\">April 9, 2024<\/a><\/blockquote> <script async src=\"https:\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script><\/div>\n\n\n\n<p>This was a fairly lengthy audio clip of 5 minutes and 38 seconds. 
<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" width=\"934\" height=\"750\" src=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/05.png\" alt=\"Gemini 1.5 Pro for summarization through audio processing feature\" class=\"wp-image-3460\" srcset=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/05.png 934w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/05-300x241.png 300w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/05-768x617.png 768w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/05-750x602.png 750w\" sizes=\"(max-width: 934px) 100vw, 934px\" \/><\/figure>\n<\/div>\n\n\n<p>It was able to provide a short and simple summary within 20 seconds, which is impressive considering the length of the audio clip.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Example 3: Record Own Audio<\/strong><\/h3>\n\n\n\n<p>Here, we asked Gemini a question using the \u2018record an audio\u2019 option. In the audio, we asked it to explain the different evaluation metrics for a Machine Learning model. We also asked it to explain how a confusion matrix is built, along with the different formulae that can be derived from it. <\/p>\n\n\n\n<div align=\"center\"><blockquote class=\"twitter-tweet\" data-conversation=\"none\"><p lang=\"en\" dir=\"ltr\">3) I asked Gemini a question using the \u2018record an audio\u2019 option. I requested Gemini to explain the different evaluation metrics for a Machine Learning model. I also asked it to explain how a confusion matrix is built along with the different formulae that can be derived from it. 
<a href=\"https:\/\/t.co\/l15NFKB2Wz\" target=\"_blank\">pic.twitter.com\/l15NFKB2Wz<\/a><\/p>&mdash; Dhruv (@dhruvvvvvvvvv_) <a href=\"https:\/\/twitter.com\/dhruvvvvvvvvv_\/status\/1777804180469449054?ref_src=twsrc%5Etfw\" target=\"_blank\" rel=\"noopener\">April 9, 2024<\/a><\/blockquote> <script async src=\"https:\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script><\/div>\n\n\n\n<p>Gemini provided an answer for every question accurately. It also provided additional evaluation metrics other than the ones derived from the confusion matrix.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img decoding=\"async\" width=\"775\" height=\"1024\" src=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/image7-775x1024.png\" alt=\"asking questions to Google Gemini 1.5 Pro\" class=\"wp-image-3461\" srcset=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/image7-775x1024.png 775w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/image7-227x300.png 227w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/image7-768x1015.png 768w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/image7-750x991.png 750w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/image7.png 990w\" sizes=\"(max-width: 775px) 100vw, 775px\" \/><\/figure>\n<\/div>\n\n\n<p>This is quite impressive as users now don\u2019t need to type out long questions. 
Instead, they can simply speak their questions in a few seconds, and the model will understand them and provide the correct response.<\/p>\n\n\n\n<p>Similar capabilities are available in OpenAI&#8217;s ChatGPT, so now there is <a href=\"https:\/\/favtutor.com\/articles\/gemini-vs-gpt-4\/\">good competition between Gemini 1.5 and GPT-4.<\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Conclusion<\/strong><\/h2>\n\n\n\n<p>Gemini 1.5 Pro&#8217;s new ability to interpret audio without transcripts marks a significant advancement in the ever-improving field of LLMs. With its ability to swiftly summarize lengthy audio files and accurately respond to spoken queries, Gemini promises to improve user interaction and enhance the overall user experience.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Google announced new audio processing features for the Gemini 1.5 Pro model. Find out how to access it and its various use cases.<\/p>\n","protected":false},"author":18,"featured_media":3463,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"jnews-multi-image_gallery":[],"jnews_single_post":null,"jnews_primary_category":{"id":"","hide":""},"footnotes":""},"categories":[57],"tags":[56,64,59,58,72],"class_list":["post-3454","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","tag-ai","tag-gemini","tag-generative-ai","tag-google","tag-llm"],"_links":{"self":[{"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/posts\/3454","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/users\/18"}],"replies":[{"embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/comments?post=3454"}],"version-history":[{"count":2,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/posts\/3454\/revisions"}],"predecessor-version":[{"id":3464,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/posts\/3454\/revisions\/3464"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/media\/3463"}],"wp:attachment":[{"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/media?parent=3454"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/categories?post=3454"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/tags?post=3454"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}