{"id":3845,"date":"2024-04-18T11:38:20","date_gmt":"2024-04-18T11:38:20","guid":{"rendered":"https:\/\/favtutor.com\/articles\/?p=3845"},"modified":"2024-04-18T11:38:21","modified_gmt":"2024-04-18T11:38:21","slug":"google-infini-attention","status":"publish","type":"post","link":"https:\/\/favtutor.com\/articles\/google-infini-attention\/","title":{"rendered":"Google&#8217;s Infini-attention Gives LLMs Infinite Context Length"},"content":{"rendered":"\n<p>Google researchers introduced a novel approach for scaling Large Language Models (LLMs) to process infinitely long text inputs. They developed Infini-attention, a technique that configures LLMs to extend their context window while keeping memory and computational requirements constant.<\/p>\n\n\n\n<p><strong>Highlights:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Researchers at Google introduced Infini-attention, a novel approach to give LLMs infinite context length.<\/li>\n\n\n\n<li>Researchers at Google claim that models using Infini-attention can sustain quality across a context window of one million tokens.<\/li>\n\n\n\n<li>Results demonstrate that Infini-Transformers can efficiently process extremely long input sequences with bounded memory.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>What is a Context Window?<\/strong><\/h2>\n\n\n\n<p><strong>The context window is an important term in the field of LLMs, referring to the number of words or tokens that a model considers at any given time when processing text. It determines the extent of the model&#8217;s understanding and influences its ability to generate meaningful responses.<\/strong><\/p>\n\n\n\n<p>If a conversation exceeds the context length, tokens from earlier parts of the conversation may be disregarded, in turn degrading the model&#8217;s performance and effectiveness. 
Every model is designed with a specified context window that represents its optimal operating scope.<\/p>\n\n\n\n<p>Expanding context length has emerged as a significant focus for enhancing model performance and gaining a competitive edge. Researchers at Google claim that models equipped with <a href=\"https:\/\/arxiv.org\/pdf\/2404.07143.pdf\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Infini-attention<\/a> can sustain quality across a context window of one million tokens without requiring extra memory.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Google&#8217;s Infini-attention Methodology<\/strong><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Compressive Memory<\/strong><\/h3>\n\n\n\n<p>Infini-attention incorporates a compressive memory into the standard attention mechanism, combining both masked local attention and long-term linear attention in a single Transformer block. It reuses the key, value, and query states from the dot-product attention computation for long-term memory consolidation and retrieval.<\/p>\n\n\n\n<p>The compressive memory is parameterized with an associative matrix, and the memory update and retrieval process is cast as a linear attention mechanism.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Attention Layer<\/strong><\/h3>\n\n\n\n<p>In Infini-Transformers, the attention layer maintains both global compressive and local fine-grained states. The local attention context is computed within each input segment, while the compressive memory stores and retrieves the entire context history. 
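<\/p>\n\n\n\n<p>To make the mechanism concrete, here is a minimal sketch of such a compressive memory in plain Python. It assumes the linear-attention form described in the paper (update the associative matrix M with sigma(K)-transpose times V plus a normalization term z, then retrieve with sigma(Q), where sigma is ELU + 1); the class and helper names are illustrative, not Google&#8217;s implementation.<\/p>

```python
import math

# A toy compressive memory in the spirit of Infini-attention's
# linear-attention update and retrieval. Pure Python, illustrative only.

def elu_plus_one(x):
    # sigma(x) = ELU(x) + 1, which keeps activations positive
    return x + 1.0 if x > 0 else math.exp(x)

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

class CompressiveMemory:
    """Associative matrix M plus normalization term z, with
    linear-attention update (M += sigma(K)^T V) and
    retrieval (sigma(Q) M / sigma(Q) z)."""

    def __init__(self, d_key, d_value):
        self.M = [[0.0] * d_value for _ in range(d_key)]
        self.z = [0.0] * d_key

    def update(self, K, V):
        # Compress a segment's keys/values into the fixed-size matrix.
        sK = [[elu_plus_one(x) for x in row] for row in K]
        delta = matmul(transpose(sK), V)
        for i in range(len(self.M)):
            for j in range(len(self.M[i])):
                self.M[i][j] += delta[i][j]
            self.z[i] += sum(row[i] for row in sK)

    def retrieve(self, Q):
        # Read back value estimates for the given queries.
        sQ = [[elu_plus_one(x) for x in row] for row in Q]
        numer = matmul(sQ, self.M)
        out = []
        for i, row in enumerate(sQ):
            denom = sum(q * zk for q, zk in zip(row, self.z)) or 1.0
            out.append([v / denom for v in numer[i]])
        return out

# Store one key-value association, then query with the same key.
mem = CompressiveMemory(2, 2)
mem.update([[1.0, 0.0]], [[2.0, 3.0]])
out = mem.retrieve([[1.0, 0.0]])   # -> [[2.0, 3.0]]
```

<p>With a single key-value pair stored, querying with the original key recovers the value exactly; as more segments are compressed in, retrieval becomes approximate, which is what keeps the memory footprint bounded.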
<\/p>\n\n\n\n<p>The final contextual output is an aggregation of the long-term memory-retrieved values and the local attention contexts.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" width=\"717\" height=\"668\" src=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/image1-9.png\" alt=\"Attention Layer\" class=\"wp-image-3846\" srcset=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/image1-9.png 717w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/image1-9-300x279.png 300w\" sizes=\"(max-width: 717px) 100vw, 717px\" \/><\/figure>\n<\/div>\n\n\n<p>Infini-Transformers process extremely long inputs in a streaming fashion, enabling them to scale to infinitely long contexts with bounded memory and compute resources. The approach introduces minimal changes to the standard scaled dot-product attention and supports plug-and-play continual pre-training and long-context adaptation.<\/p>\n\n\n\n<p>The image below shows a comparison between Google\u2019s Infini-Transformer and <a href=\"https:\/\/arxiv.org\/abs\/1901.02860\" target=\"_blank\" rel=\"noreferrer noopener\">Transformer-XL<\/a>. 
Like Transformer-XL, Infini-Transformer operates on a sequence of segments, computing standard causal dot-product attention within each segment.<\/p>\n\n\n\n<p>This attention computation is localized within the segment&#8217;s N tokens (where N represents the segment length).\u00a0<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" width=\"781\" height=\"383\" src=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/image2-5.png\" alt=\" Infini-Transformer\" class=\"wp-image-3847\" srcset=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/image2-5.png 781w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/image2-5-300x147.png 300w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/image2-5-768x377.png 768w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/image2-5-750x368.png 750w\" sizes=\"(max-width: 781px) 100vw, 781px\" \/><\/figure>\n<\/div>\n\n\n<p>Unlike local attention, which discards previous segment attention states, Infini-Transformers reuse these states through a compressive memory to maintain a comprehensive context history.\u00a0<\/p>\n\n\n\n<p>Infini-Transformer retains the entire context history, whereas Transformer-XL discards old contexts because it caches the KV states of the last segment only. Thus, each attention layer in Infini-Transformers integrates both global compressive and local fine-grained states, defining an efficient attention mechanism called Infini-attention.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Experiments Conducted<\/strong><\/h3>\n\n\n\n<p>The effectiveness of Infini-Transformers was demonstrated through experiments on various tasks involving extremely long input sequences. 
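<\/p>\n\n\n\n<p>Before turning to the experiments, the segment-by-segment flow described above can be sketched in plain Python. This is a minimal, heavily simplified sketch: the scalar memory and the running-mean stand-in for attention are illustrative assumptions, with only the streaming structure and the sigmoid-gated mix of memory retrieval and local attention taken from the paper&#8217;s description.<\/p>

```python
import math

# Toy streaming loop in the spirit of Infini-attention: process a long
# input segment by segment, mixing long-term memory retrieval with local
# attention via a gate. The scalar "memory" and the running-mean
# placeholder for attention are simplifications, not the real mechanism.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def local_attention(segment):
    # Placeholder for causal dot-product attention over one segment:
    # each position attends only to itself and earlier positions
    # (here reduced to a causal running mean).
    out, acc = [], 0.0
    for i, tok in enumerate(segment, start=1):
        acc += tok
        out.append(acc / i)
    return out

def infini_attention(segments, beta=0.0):
    memory, count = 0.0, 0          # bounded state carried across segments
    gate = sigmoid(beta)            # learned mixing weight in the paper
    outputs = []
    for segment in segments:        # streaming: one segment at a time
        a_local = local_attention(segment)
        a_mem = memory / count if count else 0.0   # retrieve long-term context
        outputs.append([gate * a_mem + (1 - gate) * a for a in a_local])
        memory += sum(segment)      # compress this segment into memory
        count += len(segment)
    return outputs

out = infini_attention([[1.0, 2.0], [3.0, 4.0]])   # -> [[0.5, 0.75], [2.25, 2.5]]
```

<p>Because the carried state is a fixed-size summary rather than a growing KV cache, memory use stays constant no matter how many segments stream through.<\/p>\n\n\n\n<p>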
The experiments conducted are as follows:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Long-context language modeling<\/strong>: Small Infini-Transformer models were trained and evaluated on the <a href=\"https:\/\/arxiv.org\/abs\/1911.05507\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">PG19<\/a> and <a href=\"https:\/\/arxiv.org\/abs\/2203.08913\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Arxiv-math benchmarks<\/a>. The models outperformed Transformer-XL and <a href=\"https:\/\/arxiv.org\/abs\/2203.08913\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Memorizing Transformers<\/a> while maintaining significantly fewer memory parameters.<\/li>\n\n\n\n<li><strong>1M passkey retrieval benchmark: <\/strong>A 1B LLM with Infini-attention was continually pre-trained on 4K length inputs and fine-tuned on the passkey retrieval task. The model successfully solved the task with up to 1M context length after fine-tuning on only 5K length inputs.<\/li>\n\n\n\n<li><strong>500K length book summarization (BookSum):<\/strong> An 8B LLM with Infini-attention was continually pre-trained with 8K input length and fine-tuned on the BookSum task. The model outperformed the previous best results and achieved a new state of the art on BookSum by processing the entire text of the books.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Results<\/strong><\/h2>\n\n\n\n<p><strong>In the long-context language modeling experiments, Infini-Transformers achieved better perplexity scores than Transformer-XL and Memorizing Transformers while maintaining 114x fewer memory parameters. 
<\/strong>Further increasing the training sequence length to 100K resulted in even lower perplexity scores.<\/p>\n\n\n\n<p>For the 1M passkey retrieval benchmark, Infini-Transformers solved the task with up to 1M context length after fine-tuning on only 5K length inputs, demonstrating their ability to extrapolate to much longer input lengths than seen during training.<\/p>\n\n\n\n<p>In the 500K length book summarization task, Infini-Transformers outperformed previous state-of-the-art models and achieved better ROUGE scores as more text from the books was provided as input.<\/p>\n\n\n\n<p>The results demonstrate that Infini-Transformers can efficiently process extremely long input sequences with bounded memory and computation, making them a promising approach for scaling LLMs to infinitely long context windows. Infini-attention allows for easy adaptation of existing LLMs to long-context tasks through continual pre-training and fine-tuning.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Conclusion<\/strong><\/h2>\n\n\n\n<p>Google&#8217;s introduction of Infini-attention within Infini-Transformers presents a groundbreaking approach for scaling LLMs to process infinitely long text inputs. 
<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Google Researchers introduced Infini-attention, a technique that configures LLMs to extend their context window while keeping memory constant.<\/p>\n","protected":false},"author":18,"featured_media":3850,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"jnews-multi-image_gallery":[],"jnews_single_post":null,"jnews_primary_category":{"id":"","hide":""},"footnotes":""},"categories":[57],"tags":[56,58,133],"class_list":["post-3845","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","tag-ai","tag-google","tag-research"],"_links":{"self":[{"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/posts\/3845","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/users\/18"}],"replies":[{"embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/comments?post=3845"}],"version-history":[{"count":2,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/posts\/3845\/revisions"}],"predecessor-version":[{"id":3851,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/posts\/3845\/revisions\/3851"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/media\/3850"}],"wp:attachment":[{"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/media?parent=3845"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/categories?post=3845"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/tags?post=3845"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}