{"id":3188,"date":"2024-04-03T06:06:27","date_gmt":"2024-04-03T06:06:27","guid":{"rendered":"https:\/\/favtutor.com\/articles\/?p=3188"},"modified":"2024-04-03T06:06:29","modified_gmt":"2024-04-03T06:06:29","slug":"safe-google-deepmind","status":"publish","type":"post","link":"https:\/\/favtutor.com\/articles\/safe-google-deepmind\/","title":{"rendered":"Meet SAFE By Google DeepMind: AI for Fact-Checking LLMs"},"content":{"rendered":"\n<p>In the era of Generative AI chatbots, hallucination has been a deep concern. Almost every chatbot, from OpenAI\u2019s ChatGPT and Google\u2019s Gemini to Anthropic\u2019s Claude, has been known to produce illogical and erroneous content to a worrying extent.<\/p>\n\n\n\n<p><strong>Highlights:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Google DeepMind unveils SAFE, an AI model that can fact-check LLMs.<\/li>\n\n\n\n<li>Built around LongFact, a vast prompt set, and uses the evaluation metric F<sub>1<\/sub>@K to assign scores.<\/li>\n\n\n\n<li>Results suggest LLM-based fact-checkers are more efficient and less expensive than crowdsourced human fact-checkers.<\/li>\n<\/ul>\n\n\n\n<p>As of now, it is hard to determine which model is factually more accurate than another, as there is no specific benchmark for measuring the factuality of LLMs on long-form responses.<\/p>\n\n\n\n<p><strong>However, a team of researchers from Google DeepMind and Stanford University has developed a cutting-edge AI tool named SAFE. This tool can fact-check LLMs and allows the factuality of AI models to be benchmarked.<\/strong><\/p>\n\n\n\n<p>What is SAFE and how was it developed? What are the features that come with it? In this article, we are going to explore these topics in-depth and find out more about this state-of-the-art model. 
So, let\u2019s get right into it!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>What is SAFE by Google DeepMind?<\/strong><\/h2>\n\n\n\n<p><strong>A team of artificial intelligence specialists at Google&#8217;s DeepMind and Stanford University has developed an AI-based system called SAFE that can be used to fact-check the results of LLMs such as ChatGPT.<\/strong><\/p>\n\n\n\n<p><strong>SAFE stands for Search Augmented Factuality Evaluator<\/strong> <strong>and it employs a large language model to deconstruct generated text into discrete facts, then leverages Google Search results to ascertain the veracity of each assertion.<\/strong><\/p>\n\n\n\n<p>An important way human users of LLMs verify results is to check AI responses with a search engine like Google and locate reliable sources. The DeepMind team adopted a similar strategy. They used an LLM to dissect the assertions or details in an answer supplied by the LLM under test. They then used Google Search to locate websites that could serve as confirmation and compared the two to ascertain accuracy.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>\u201cSAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process comprising sending search queries to Google Search and determining whether a fact is supported by the search results,\u201d the authors explained.<\/p>\n<cite>Google DeepMind\u2019s <a href=\"https:\/\/arxiv.org\/pdf\/2403.18802.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">research paper<\/a><\/cite><\/blockquote>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>How was SAFE Built? Looking into the Architecture<\/strong><\/h2>\n\n\n\n<p>Initially, DeepMind employed GPT-4 to generate LongFact, a collection of 2,280 questions about 38 subjects. 
The LLM being examined responds in-depth to these questions.<\/p>\n\n\n\n<p>After that, they developed an AI agent with GPT-3.5-turbo to search Google and confirm the veracity of the replies the LLM produced. The approach was given the name SAFE (Search-Augmented Factuality Evaluator).<\/p>\n\n\n\n<p>Let\u2019s look at the architecture\u2019s components in detail:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>1. LongFact: The Prompt Set<\/strong><\/h3>\n\n\n\n<p><strong>Google DeepMind created the LongFact prompt set to test a model&#8217;s factual accuracy when it generates long-form responses, which can run to many pages. They designed LongFact by instructing GPT-4 to create queries that, within a given field (such as &#8220;biology&#8221;), inquire about a particular idea or object and call for a long-form response containing several in-depth factoids.<\/strong><\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" width=\"628\" height=\"182\" src=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-477.png\" alt=\"Longfact\" class=\"wp-image-3189\" srcset=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-477.png 628w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-477-300x87.png 300w\" sizes=\"(max-width: 628px) 100vw, 628px\" \/><\/figure>\n<\/div>\n\n\n<p>Depending on whether the questions in LongFact are about concepts or objects, the two tasks were titled LongFact-Concepts and LongFact-Objects. For both tasks, the researchers used the same set of 38 manually selected topics spanning the Social Sciences, Humanities, STEM, and more. They generated 30 unique prompts per topic for a total of 1,140 prompts per task.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>2. 
SAFE: Search Augmented Factuality Evaluator<\/strong><\/h3>\n\n\n\n<p><strong>The researchers described SAFE as a method of using an LLM agent to automatically assess long-form factuality in model responses. Using the language model, they first broke down a long-form response into discrete facts.<\/strong> Next, for each fact, they generated fact-checking queries to submit to a Google Search API and analyzed whether the fact was corroborated by the results.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" width=\"673\" height=\"263\" src=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-478.png\" alt=\"Search Augmented Factuality Evaluator\" class=\"wp-image-3190\" srcset=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-478.png 673w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-478-300x117.png 300w\" sizes=\"(max-width: 673px) 100vw, 673px\" \/><\/figure>\n<\/div>\n\n\n<p>Overall, the Search-Augmented Factuality Evaluator (SAFE) uses a language model to divide a long-form response into discrete, self-contained facts and to assess whether each fact is relevant to answering the prompt in the context of the response. For each relevant fact, it iteratively issues Google Search queries in a multi-step process and evaluates whether the search results support or contradict the fact.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>3. 
F<sub>1<\/sub>@K: Measuring the factuality of a model\u2019s response<\/strong><\/h3>\n\n\n\n<p><strong>The researchers introduced F<sub>1<\/sub>@K, which combines factual precision, the ratio of supported facts among all facts in a response, with factual recall, measured against K, a hyperparameter specifying the desired number of supported facts.<\/strong><\/p>\n\n\n\n<p>The researchers first computed the factual precision of a response as follows:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" width=\"578\" height=\"118\" src=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-483.png\" alt=\"factual precision\" class=\"wp-image-3191\" srcset=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-483.png 578w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-483-300x61.png 300w\" sizes=\"(max-width: 578px) 100vw, 578px\" \/><\/figure>\n<\/div>\n\n\n<p>Where S(y) and N(y) are the numbers of supported and not-supported facts in the response y, respectively.<\/p>\n\n\n\n<p>They also computed the factual recall of the response as follows:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" width=\"651\" height=\"131\" src=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-482.png\" alt=\"factual recall\" class=\"wp-image-3192\" srcset=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-482.png 651w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-482-300x60.png 300w\" sizes=\"(max-width: 651px) 100vw, 651px\" \/><\/figure>\n<\/div>\n\n\n<p>Where K is a hyperparameter: the desired number of supported facts in a response, with recall capped at 1 once a response provides K supported facts.<\/p>\n\n\n\n<p>Combining the two equations, they obtained the expression for F<sub>1<\/sub>@K, which measures the long-form factuality of a model response y given the number of supported facts S(y) and the number of not-supported facts N(y) in y.<\/p>\n\n\n<div 
class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img decoding=\"async\" width=\"1024\" height=\"198\" src=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-481-1024x198.png\" alt=\"F1@K\" class=\"wp-image-3193\" srcset=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-481-1024x198.png 1024w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-481-300x58.png 300w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-481-768x148.png 768w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-481-750x145.png 750w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-481-1140x220.png 1140w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-481.png 1208w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/div>\n\n\n<h2 class=\"wp-block-heading\"><strong>Analyzing SAFE\u2019s Workflow<\/strong><\/h2>\n\n\n\n<p>SAFE is designed to rate an LLM response but what is the innovative technical process behind this?<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" width=\"974\" height=\"450\" src=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-484.png\" alt=\"SAFE Workflow\" class=\"wp-image-3194\" srcset=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-484.png 974w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-484-300x139.png 300w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-484-768x355.png 768w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-484-750x347.png 750w\" sizes=\"(max-width: 974px) 100vw, 974px\" \/><\/figure>\n<\/div>\n\n\n<p><strong>Initially,<\/strong> <strong>SAFE separates each sentence in a long-form response into a separate fact<\/strong> to divide the 
response into discrete, self-contained facts.<\/p>\n\n\n\n<p><strong>Next,<\/strong> <strong>SAFE\u2019s model replaces ambiguous references (such as pronouns) with the entities they refer to in the context of the response, revising each fact to be self-contained.<\/strong> SAFE then determines whether a fact is pertinent to answering the prompt within the context of the response, in order to assign a score to each self-contained individual fact.<\/p>\n\n\n\n<p><strong>After that<\/strong>, <strong>each remaining pertinent fact is assigned a multi-step rating of &#8220;supported&#8221; or &#8220;not supported.&#8221; <\/strong>In each step, SAFE creates a search query based on the fact to rate and the previously obtained search results.<\/p>\n\n\n\n<p><strong>Lastly, after a predetermined number of steps, SAFE uses reasoning to ascertain whether the fact is supported by the search results. <\/strong>Once the facts are rated, the number of &#8220;supported,&#8221; &#8220;irrelevant,&#8221; and &#8220;not-supported&#8221; facts for a particular prompt-response pair are the metrics that SAFE outputs.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>What did the Results Show?<\/strong><\/h2>\n\n\n\n<p><strong>The study results were highly successful in demonstrating SAFE as an efficient AI model for rating the factuality of an LLM\u2019s response. SAFE achieves &#8220;superhuman performance,&#8221; according to the researchers, when compared to human annotators who undertake fact-checking.<\/strong><\/p>\n\n\n\n<p>SAFE agreed with the human annotations on 72% of the individual facts, and on a sample of cases where the two disagreed, SAFE\u2019s rating was judged correct 76% of the time. It also cost about 20 times less than crowdsourced human annotators. 
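The multi-step rating workflow described above can be sketched in Python. This is a minimal illustration, not DeepMind\u2019s implementation: the llm and search callables here are hypothetical stand-ins for a language-model API and the Google Search API.

```python
def rate_response(prompt, response, llm, search, max_steps=5):
    """Sketch of SAFE's per-fact rating loop (llm/search are hypothetical callables)."""
    counts = {"supported": 0, "not_supported": 0, "irrelevant": 0}
    # Step 1: split the long-form response into individual facts.
    facts = llm("split_facts", response).splitlines()
    for fact in facts:
        # Step 2: revise the fact to be self-contained (resolve pronouns, etc.).
        fact = llm("make_self_contained", fact)
        # Step 3: keep only facts relevant to answering the prompt.
        if llm("is_relevant", f"{prompt}\n{fact}") != "yes":
            counts["irrelevant"] += 1
            continue
        # Step 4: iteratively issue search queries, feeding prior results back in.
        evidence = []
        for _ in range(max_steps):
            query = llm("next_query", f"{fact}\n{evidence}")
            evidence.append(search(query))
        # Step 5: reason over the accumulated evidence to reach a verdict.
        counts[llm("final_verdict", f"{fact}\n{evidence}")] += 1
    return counts
```

For each prompt-response pair, SAFE\u2019s output is exactly such a triple of supported, irrelevant, and not-supported counts.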
Thus, it shows that LLMs can be more efficient and less expensive fact-checkers than humans.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" width=\"426\" height=\"271\" src=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-485.png\" alt=\"SAFE vs Human Annotations\" class=\"wp-image-3195\" srcset=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-485.png 426w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-485-300x191.png 300w\" sizes=\"(max-width: 426px) 100vw, 426px\" \/><\/figure>\n<\/div>\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" width=\"922\" height=\"274\" src=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-486.png\" alt=\"SAFE vs Human Disagreement\" class=\"wp-image-3196\" srcset=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-486.png 922w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-486-300x89.png 300w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-486-768x228.png 768w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-486-750x223.png 750w\" sizes=\"(max-width: 922px) 100vw, 922px\" \/><\/figure>\n<\/div>\n\n\n<p>However, some have questioned this result. On Twitter, prominent AI researcher Gary Marcus, who frequently challenges exaggerated claims, suggested that in this instance &#8220;superhuman&#8221; might only mean &#8220;better than an underpaid crowd worker, rather a true human fact checker.&#8221;<\/p>\n\n\n\n<div align=center><blockquote class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">On a quick read I can\u2019t figure out much about the human subjects, but it looks like superhuman means better than an underpaid crowd worker, rather a true human fact checker? 
That makes the characterization misleading. (Like saying that 1985 chess software was superhuman).\u2026<\/p>&mdash; Gary Marcus (@GaryMarcus) <a href=\"https:\/\/twitter.com\/GaryMarcus\/status\/1773429783982149633?ref_src=twsrc%5Etfw\" target=\"_blank\" rel=\"noopener\">March 28, 2024<\/a><\/blockquote> <script async src=\"https:\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script><\/div>\n\n\n\n<p>Marcus makes an important argument. To show genuinely superhuman performance, SAFE would need to be tested not only against crowdsourced workers but also against highly skilled human fact-checkers. Contextualizing the results appropriately requires knowing the specifics of the human raters, including their training, pay, and fact-checking procedure.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Which LLM tops the Factual list?<\/strong><\/h2>\n\n\n\n<p>Thirteen LLMs belonging to the Gemini, GPT, Claude, and PaLM-2 families were prompted by the researchers using LongFact. The team then assessed the veracity of their answers using SAFE. 
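The scores behind this ranking come from the F1@K metric defined earlier. As a rough sketch, assuming the precision and recall definitions given above (an illustration, not the authors\u2019 implementation):

```python
def f1_at_k(supported, not_supported, k):
    """Long-form factuality F1@K from SAFE's per-response fact counts.

    supported     -- S(y): facts backed by the search results
    not_supported -- N(y): facts the search results fail to support
    k             -- desired number of supported facts (a hyperparameter)
    """
    if supported == 0:
        return 0.0  # a response with no supported facts scores zero
    precision = supported / (supported + not_supported)
    recall = min(supported / k, 1.0)  # capped at 1 once K supported facts are reached
    return 2 * precision * recall / (precision + recall)  # harmonic mean
```

Raising K rewards longer responses, since recall only saturates once a response provides K supported facts.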
The quantity and accuracy of the factoids in each tested LLM&#8217;s response were used to gauge the response&#8217;s quality.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img decoding=\"async\" width=\"1024\" height=\"459\" src=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-487-1024x459.png\" alt=\"\" class=\"wp-image-3197\" srcset=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-487-1024x459.png 1024w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-487-300x134.png 300w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-487-768x344.png 768w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-487-750x336.png 750w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-487-1140x511.png 1140w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/04\/Screenshot-487.png 1169w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/div>\n\n\n<p><strong>The most factual model for producing long-form responses is GPT-4-Turbo. PaLM-2-L-IT-RLHF and Gemini-Ultra trailed it closely. Surprisingly, Claude 3\u2019s Sonnet and Opus models ranked higher than GPT-4 and Gemini Pro. 
For<\/strong> the past few months, users have reported that Claude 3 hallucinates less than several other models.<\/p>\n\n\n\n<p>Overall, the data gives the impression that larger LLMs are more factual than smaller ones.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>How Can You Access It?<\/strong><\/h3>\n\n\n\n<p>The Google DeepMind team has made the source code for SAFE publicly accessible for AI developers and enthusiasts who wish to utilize its fact-checking capabilities in the context of LLMs.<\/p>\n\n\n\n<p>Visit this <a href=\"https:\/\/github.com\/google-deepmind\/long-form-factuality\" target=\"_blank\" data-type=\"link\" data-id=\"https:\/\/github.com\/google-deepmind\/long-form-factuality\" rel=\"noreferrer noopener\">GitHub repository<\/a>, where you will find the code along with installation instructions.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Conclusion<\/strong><\/h2>\n\n\n\n<p>SAFE is a remarkable advancement in the field of LLMs and generative AI, as it provides a highly beneficial capability: fact-checking LLMs. Developers and AI enthusiasts have long struggled with AI hallucinations and factually incorrect content. With SAFE in hand, that burden may finally ease. 
However, we must remember this is a state-of-the-art model and only time will tell how it performs in the days to come.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Google DeepMind has launched SAFE, an AI model that can help you fact check LLMs such as Gemini, ChatGPT and Claude.<\/p>\n","protected":false},"author":15,"featured_media":3201,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"jnews-multi-image_gallery":[],"jnews_single_post":null,"jnews_primary_category":{"id":"","hide":""},"footnotes":""},"categories":[57],"tags":[56,112,59,58],"class_list":["post-3188","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","tag-ai","tag-deepmind","tag-generative-ai","tag-google"],"_links":{"self":[{"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/posts\/3188","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/users\/15"}],"replies":[{"embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/comments?post=3188"}],"version-history":[{"count":3,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/posts\/3188\/revisions"}],"predecessor-version":[{"id":3202,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/posts\/3188\/revisions\/3202"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/media\/3201"}],"wp:attachment":[{"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/media?parent=3188"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/categories?post=3188"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/tags?post=3188"}]
,"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}