{"id":2697,"date":"2024-03-20T08:23:09","date_gmt":"2024-03-20T08:23:09","guid":{"rendered":"https:\/\/favtutor.com\/articles\/?p=2697"},"modified":"2024-03-20T08:24:12","modified_gmt":"2024-03-20T08:24:12","slug":"nvidia-nim-llm-deployment","status":"publish","type":"post","link":"https:\/\/favtutor.com\/articles\/nvidia-nim-llm-deployment\/","title":{"rendered":"NVIDIA&#8217;s NIM is The Next Innovative Approach to Deploy LLMs"},"content":{"rendered":"\n<p>During the GTC24 conference, NVIDIA made many announcements, but one of the most interesting to look into is NIM, so let&#8217;s take a closer look at it!<\/p>\n\n\n\n<p><strong>Highlights:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>NVIDIA unveiled NIM to simplify the deployment of AI models in production environments.<\/li>\n\n\n\n<li>They are collaborating with tech giants like Amazon, Google, and Microsoft.<\/li>\n\n\n\n<li>NIM microservices are being integrated into platforms like SageMaker, Kubernetes Engine, and Azure AI.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>NVIDIA&#8217;s NIM Explained<\/strong><\/h2>\n\n\n\n<p><strong>NIM by NVIDIA is a novel software platform engineered to simplify the deployment of custom and pre-trained AI models into production environments. <\/strong><\/p>\n\n\n\n<p>In simple terms, a NIM is a container full of microservices. 
Microservices, or microservice architecture, is an architectural style that structures an application as a collection of services that are loosely coupled and independently deployed.\u00a0<\/p>\n\n\n\n<p>NVIDIA aims to accelerate and <a href=\"https:\/\/developer.nvidia.com\/blog\/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">optimize the deployment of generative AI-based LLMs<\/a> with a new approach to delivering models for rapid inference.<\/p>\n\n\n\n<p>These services are organized around business capabilities, with each service owned by a single, small team. The microservice architecture helps an organization deliver large, complex applications rapidly and reliably.<\/p>\n\n\n\n<p>The container can include any type of model and can run anywhere there is an NVIDIA GPU, whether in the cloud or on your local machine. The models span everything from open to proprietary ones.\u00a0<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>\u201cWe believe that NVIDIA NIM is the best software package, the best runtime for developers to build on top of, so that they can focus on the enterprise applications.\u201d<\/p>\n<cite>Manuvir Das, VP of enterprise computing at NVIDIA<\/cite><\/blockquote>\n\n\n\n<p>This container can be deployed wherever a basic container can be run: a Kubernetes deployment in a cloud architecture, a Linux-based server, or any serverless function-as-a-service model.<\/p>\n\n\n\n<p>NIM doesn\u2019t replace any prior approach to model delivery from NVIDIA. 
Rather, it&#8217;s a container that includes a highly optimized model for NVIDIA GPUs along with the technologies needed to improve inference.<\/p>\n\n\n\n<p>Other interesting NVIDIA launches in 2024 include <a href=\"https:\/\/favtutor.com\/articles\/nvidia-chat-with-rtx-chatbot-pc\/\">Chat with RTX<\/a> and the <a href=\"https:\/\/favtutor.com\/articles\/starcoder2-ai-benchmarks-benefits-nvidia\/\">StarCoder2 AI collaboration<\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Here&#8217;s What NIM Does<\/strong><\/h3>\n\n\n\n<p>Patrick Moorhead, Founder, CEO, and Chief Analyst at Moor Insights &amp; Strategy, said the following about NIM on X:<\/p>\n\n\n\n<div align=\"center\"><blockquote class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">Bigger than Blackwell is <a href=\"https:\/\/twitter.com\/nvidia?ref_src=twsrc%5Etfw\" target=\"_blank\" rel=\"noopener\">@nvidia<\/a>\u2019s &quot;NIM&quot; for enterprises. Nvidia Inference Microservices. <br><br>Enterprise SaaS &amp; SW Platforms (ie Adobe, SAP) &amp; Data Platforms (ie Cloudera, Cohesity, &amp; SnowBricks) write once across the hybrid multi-cloud Infrastructure (ie AWS, Dell) and Model\u2026 <a href=\"https:\/\/t.co\/rPoAsDKvW8\" target=\"_blank\">pic.twitter.com\/rPoAsDKvW8<\/a><\/p>&mdash; Patrick Moorhead (@PatrickMoorhead) <a href=\"https:\/\/twitter.com\/PatrickMoorhead\/status\/1769838908823773309?ref_src=twsrc%5Etfw\" target=\"_blank\" rel=\"noopener\">March 18, 2024<\/a><\/blockquote> <script async src=\"https:\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script><\/div>\n\n\n\n<p>The NIM platform leverages the company&#8217;s expertise in inferencing and model optimization, simplifying the process of deploying AI models into production environments. 
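<\/p>\n\n\n\n<p>To make this concrete, here is a small, hypothetical sketch (not NVIDIA&#8217;s code) of why standard APIs simplify deployment: the application builds its request against a fixed schema, so the same code can target a local container or a cloud host. The endpoint path and payload fields follow the REST example shown later in this article; the helper name and port are assumptions.<\/p>

```python
# Hypothetical helper (illustrative only): build an OpenAI-style
# completions request for a NIM-like endpoint. Only the base URL
# changes between a local container and a hosted deployment.

def build_completion_request(base_url, model, prompt,
                             max_tokens=100, temperature=0.7):
    """Return (url, payload) for a /v1/completions call."""
    url = base_url.rstrip("/") + "/v1/completions"
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return url, payload

# The same application code targets two different deployments:
for base in ("http://localhost:9999", "https://example-cloud-host"):
    url, payload = build_completion_request(
        base, "llama-2-7b", "The capital of France is called")
    print(url, payload["model"])
```

\n\n\n\n<p>The request itself would then be sent with any HTTP client, exactly as in the requests example later in this article. 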
<\/p>\n\n\n\n<p>Combining a model with an optimized inference engine and packaging it into a container offers developers a streamlined solution that would otherwise take weeks or months to build.<\/p>\n\n\n\n<p>This initiative aims to create an ecosystem of AI-ready containers, utilizing NVIDIA&#8217;s hardware as the foundational layer.\u00a0<\/p>\n\n\n\n<p><strong>NIM packages optimized inference engines, industry-standard APIs, and AI model support into containers for easy deployment.<\/strong> While offering prebuilt models, it also lets organizations integrate their proprietary data and accelerates Retrieval Augmented Generation (RAG) deployment.\u00a0<\/p>\n\n\n\n<p>This technology represents a significant milestone for AI deployment, serving as the cornerstone of NVIDIA&#8217;s next-generation strategy for inference. Its impact is expected to extend across model developers and data platforms in the AI space.\u00a0<\/p>\n\n\n\n<p>NIM currently supports models from various providers, including NVIDIA, AI21, Adept, Cohere, Getty Images, Shutterstock, and open models from Google, Hugging Face, Meta, Microsoft, Mistral AI, and Stability AI.\u00a0<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>How Will NIMs Help the RAG Approach?<\/strong><\/h3>\n\n\n\n<p>NVIDIA&#8217;s NIMs are poised to facilitate the deployment of Retrieval Augmented Generation (RAG) models, a key focus area for many organizations. With a growing number of customers already implementing RAGs, the challenge lies in transitioning from prototyping to production.\u00a0<\/p>\n\n\n\n<p>NVIDIA and several leading data vendors are hoping that this is the answer to this challenge. 
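<\/p>\n\n\n\n<p>As a rough illustration of what a RAG pipeline does (a toy sketch, not NVIDIA&#8217;s implementation), the retrieval step finds the document most relevant to a query and folds it into the model prompt. The bag-of-words &#8220;embeddings&#8221; below are a stand-in for real embedding models:<\/p>

```python
# Toy RAG sketch (illustrative only): real deployments use neural
# embedding models and a vector database instead of word counts.
import math
import re
from collections import Counter

def embed(text):
    # Stand-in "embedding": term-frequency vector of lowercased words.
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, documents):
    # Pick the document most similar to the query.
    q = embed(query)
    return max(documents, key=lambda d: cosine(q, embed(d)))

def build_prompt(query, documents):
    # Augment the model prompt with the retrieved context.
    context = retrieve(query, documents)
    return f"Context: {context}\nQuestion: {query}\nAnswer:"

docs = [
    "NIM containers bundle a model with an optimized inference engine.",
    "GTC is NVIDIA's annual developer conference.",
]
print(build_prompt("Which engine ships inside a NIM container?", docs))
```

\n\n\n\n<p>In production, the bag-of-words stand-in above would be replaced by an embedding model and a dedicated vector store. 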
Vector database capabilities are critical to enabling RAG, and several vector database vendors support NIMs, such as Apache Lucene, Datastax, Faiss, Kinetica, Milvus, Redis, and Weaviate.<\/p>\n\n\n\n<p>NIMs address this challenge by streamlining the deployment process, enabling organizations to deliver real business value with their models. <\/p>\n\n\n\n<p>Additionally, the integration of NVIDIA NeMo Retriever microservices enhances the RAG approach by providing optimized data retrieval capabilities. NeMo Retriever was announced by NVIDIA in November 2023 for exactly this purpose.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>How to Use NVIDIA&#8217;s NIM<\/strong><\/h2>\n\n\n\n<p>Using NVIDIA NIM is a simple process. Within the NVIDIA API documentation, developers have access to various AI models that can be used for building and deploying their AI applications.\u00a0<\/p>\n\n\n\n<p>To deploy a microservice on your infrastructure, sign up for the <a href=\"https:\/\/www.nvidia.com\/en-in\/data-center\/products\/ai-enterprise\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">NVIDIA AI Enterprise 90-day evaluation<\/a> license and follow the steps given below:<\/p>\n\n\n\n<p>First, download the model that you want to deploy from NVIDIA NGC (NVIDIA GPU Cloud). 
In this example, a version of the Llama 2 7B model is used:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>ngc registry model download-version \"ohlfw0olaadg\/ea-participants\/llama-2-7b:LLAMA-2-7B-4K-FP16-1-A100.24.01\"<\/code><\/pre>\n\n\n\n<p>Then, unpack the downloaded model into a target repository: <\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>tar -xzf llama-2-7b_vLLAMA-2-7B-4K-FP16-1-A100.24.01\/LLAMA-2-7B-4K-FP16-1-A100.24.01.tar.gz<\/code><\/pre>\n\n\n\n<p>Now, launch the NIM container with the desired model:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>docker run --gpus all --shm-size 1G -v $(pwd)\/model-store:\/model-store --net=host \\\n  nvcr.io\/ohlfw0olaadg\/ea-participants\/nemollm-inference-ms:24.01 \\\n  nemollm_inference_ms --model llama-2-7b --num_gpus=1<\/code><\/pre>\n\n\n\n<p>Once the container is deployed, start making requests using the REST API:<\/p>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting=\"{&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text\/x-python&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;Code&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;modeName&quot;:&quot;python&quot;}\">import requests\n \nendpoint = 'http:\/\/localhost:9999\/v1\/completions'\n \nheaders = {\n    'accept': 'application\/json',\n    'Content-Type': 'application\/json'\n}\n \ndata = {\n    'model': 'llama-2-7b',\n    'prompt': 'The capital of France is called',\n    'max_tokens': 100,\n    'temperature': 0.7,\n    'n': 1,\n    'stream': False,\n    'stop': None,\n    'frequency_penalty': 0.0\n}\n \nresponse = requests.post(endpoint, headers=headers, json=data)\nprint(response.json())<\/pre><\/div>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>NVIDIA NIM\u2019s Partners<\/strong><\/h3>\n\n\n\n<p>LlamaIndex, an innovative data 
framework designed to support LLM-based application development, was announced as a launch partner for NIM.<\/p>\n\n\n\n<div align=\"center\"><blockquote class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">\u2b50\ufe0fJust announced at GTC keynote\u2b50\ufe0f NVIDIA Inference Microservice or NIM and we are a launch partner!<br><br>NIM accelerates deployment of LLM models across NVIDIA GPUs and integrates with LlamaIndex to build first-class RAG pipelines.<br><br>NVIDIA&#39;s blog post: <a href=\"https:\/\/t.co\/8bOpEOSL0N\" target=\"_blank\">https:\/\/t.co\/8bOpEOSL0N<\/a><br><br>Our\u2026<\/p>&mdash; LlamaIndex \ud83e\udd99 (@llama_index) <a href=\"https:\/\/twitter.com\/llama_index\/status\/1769849701183197403?ref_src=twsrc%5Etfw\" target=\"_blank\" rel=\"noopener\">March 18, 2024<\/a><\/blockquote> <script async src=\"https:\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script><\/div>\n\n\n\n<p>NIM accelerates the deployment of LLMs across NVIDIA GPUs, and it can now be integrated with LlamaIndex to build first-class RAG pipelines.<\/p>\n\n\n\n<p>LangChain also announced its integration:<\/p>\n\n\n\n<div align=\"center\"><blockquote class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">\ud83e\udd1d Our Integration With NVIDIA NIM for GPU-optimized LLM Inference in RAG<br><br>As enterprises turn their attention from prototyping LLM applications to productionizing them, they often want to turn from third-party model services to self-hosted solutions. 
We\u2019ve seen many folks\u2026 <a href=\"https:\/\/t.co\/A0vFP1Bv8T\" target=\"_blank\">pic.twitter.com\/A0vFP1Bv8T<\/a><\/p>&mdash; LangChain (@LangChainAI) <a href=\"https:\/\/twitter.com\/LangChainAI\/status\/1769851779003695143?ref_src=twsrc%5Etfw\" target=\"_blank\" rel=\"noopener\">March 18, 2024<\/a><\/blockquote> <script async src=\"https:\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script><\/div>\n\n\n\n<p>Haystack, the open-source LLM framework by Deepset, has also partnered with NIM, which will now give users the flexibility to deploy hosted or self-hosted RAG pipelines.<\/p>\n\n\n\n<div align=\"center\"><blockquote class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">The new NVIDIA NIM integration in Haystack 2.0 gives you the flexibility to deploy hosted or self-hosted RAG pipelines.<a href=\"https:\/\/t.co\/h4ewr1qcMx\" target=\"_blank\">https:\/\/t.co\/h4ewr1qcMx<\/a><\/p>&mdash; Haystack (@Haystack_AI) <a href=\"https:\/\/twitter.com\/Haystack_AI\/status\/1769864081794896003?ref_src=twsrc%5Etfw\" target=\"_blank\" rel=\"noopener\">March 18, 2024<\/a><\/blockquote> <script async src=\"https:\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script><\/div>\n\n\n\n<p><strong>NVIDIA is collaborating with Amazon, Google, and Microsoft to integrate these NIM microservices into platforms like SageMaker, Kubernetes Engine, and Azure AI, as well as the above-mentioned frameworks such as Deepset, LangChain, and LlamaIndex.\u00a0<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-us.googleusercontent.com\/5dsCa01M8I-Y1pvbBtU1k9Fe6pagLU3bNt4mIppwMkYswDbRoNm1ksMrwaT5lzG0enSaPgXFfk6IgAp_JbdB2cOwdzBHXdLsO2HA8m-Sg_o6wgnE2JQlUNgifzGcBphlCDmvrWErw-6WnOn0YD0CfEQ\" alt=\"NVIDIA NIM\"\/><\/figure>\n\n\n\n<p>Here are the benefits it will provide:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Deploy generative AI applications anywhere<\/li>\n\n\n\n<li>Prebuilt container and Helm chart &#8211; a package that 
contains all the necessary resources to deploy an application to a Kubernetes cluster<\/li>\n\n\n\n<li>Develop with de facto standard, industry-defined APIs<\/li>\n\n\n\n<li>Harness domain-specific models<\/li>\n\n\n\n<li>Run on optimized inference engines<\/li>\n\n\n\n<li>Use accelerated models that are ready for deployment<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Conclusion<\/strong><\/h2>\n\n\n\n<p>This new approach to deploying AI models will greatly increase the efficiency of production environments. NIM offers a streamlined solution to both experienced developers and those still new to the world of Generative AI!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Everything we know so far about NIM, the new approach to deploy AI or LLM Models announced by NVIDIA.<\/p>\n","protected":false},"author":18,"featured_media":2703,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"jnews-multi-image_gallery":[],"jnews_single_post":null,"jnews_primary_category":{"id":"","hide":""},"footnotes":""},"categories":[57],"tags":[56,72,121,87],"class_list":["post-2697","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","tag-ai","tag-llm","tag-nim","tag-nvidia"],"_links":{"self":[{"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/posts\/2697","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/users\/18"}],"replies":[{"embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/comments?post=2697"}],"version-history":[{"count":3,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/posts\/2697\/revisions"}],"predecessor-version":[{"id":2705,"href":"https:\/\/favtutor.com\/artic
les\/wp-json\/wp\/v2\/posts\/2697\/revisions\/2705"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/media\/2703"}],"wp:attachment":[{"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/media?parent=2697"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/categories?post=2697"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/tags?post=2697"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}