LLMs have advanced rapidly over the past two years and hold enormous potential for healthcare, but evaluating them properly in this domain required a dedicated benchmark. Now we have one: the Open Medical-LLM Leaderboard.
Highlights:
- Researchers at the non-profit Open Life Science AI have released a new leaderboard for evaluating medical LLMs.
- The leaderboard evaluates and ranks each medical LLM based on its medical knowledge and question-answering capabilities.
- It is a collection of the major medical evaluation datasets, such as MedQA and MedMCQA.
New Open Medical-LLM Leaderboard
LLMs such as GPT-3, GPT-4, and Med-PaLM 2, together with the growth of electronic health records, have driven significant progress in the use of AI in the healthcare sector.
One of the most promising applications of LLMs lies in medical question-answering (QA) systems. By leveraging the vast knowledge encoded within these models, healthcare professionals can quickly obtain accurate and relevant information, streamlining decision-making processes and improving diagnostic accuracy.
However, many challenges need to be overcome before the medical community can widely adopt LLMs. A tiny error could put lives in jeopardy, and the accuracy of LLM-generated information has long been questioned because of well-publicised hallucination issues.
A standard benchmark was therefore needed before such LLMs could become practical tools for real-world clinical tasks.
Researchers at the non-profit Open Life Science AI have introduced a suite of benchmarks that evaluate models on criteria specific to the healthcare industry and rank them on the Open Medical-LLM Leaderboard.
The rankings are based on each model’s medical knowledge and question-answering capabilities. Here is how the team explained it in their official announcement:
“The Open Medical LLM Leaderboard aims to track, rank and evaluate the performance of large language models (LLMs) on medical question answering tasks. It evaluates LLMs across a diverse array of medical datasets, including MedQA (USMLE), PubMedQA, MedMCQA, and subsets of MMLU related to medicine and biology. The leaderboard offers a comprehensive assessment of each model’s medical knowledge and question answering capabilities.”
“The datasets cover various aspects of medicine such as general medical knowledge, clinical knowledge, anatomy, genetics, and more. They contain multiple-choice and open-ended questions that require medical reasoning and understanding,” the researchers added in their post on Hugging Face.
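To make that concrete, here is a minimal sketch of how multiple-choice accuracy of this kind is often computed: the model scores each answer option and the highest-likelihood option is compared against the gold answer. The model name, prompt format, and example question below are illustrative assumptions, not the leaderboard’s exact evaluation harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM from the Hub could be scored the same way.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

question = "Which vitamin deficiency causes scurvy?"
options = ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"]
gold = "Vitamin C"

def option_logprob(question: str, option: str) -> float:
    """Sum of log-probabilities the model assigns to the option's tokens."""
    prompt = f"Question: {question}\nAnswer:"
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Row i of the logits predicts token i+1, so drop the last row, align
    # against the shifted targets, and keep only the option tokens.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    start = prompt_len - 1
    return log_probs[start:].gather(1, targets[start:, None]).sum().item()

scores = [option_logprob(question, o) for o in options]
prediction = options[scores.index(max(scores))]
print(prediction, prediction == gold)  # accuracy = fraction of items where this is True
```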
Evaluation Parameters in the Leaderboard
The medical LLM leaderboard is not a brand-new benchmark, but rather a collection of the major existing medical evaluation datasets, such as MedQA and MedMCQA.
Here is the full list of datasets on which the accuracy of LLMs is being evaluated:
MedQA
The MedQA dataset consists of multiple-choice questions from the United States Medical Licensing Examination (USMLE). It covers general medical knowledge and includes 11,450 questions in the development set and 1,273 questions in the test set.
Each question has 4 or 5 answer choices, and the dataset is designed to assess the medical knowledge and reasoning skills required for medical licensure in the United States.
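For readers who want to look at the data themselves, here is a quick sketch using the Hugging Face `datasets` library. The repository id below is one widely used community copy of the USMLE-style questions and is an assumption here, not necessarily the exact source the leaderboard pulls from.

```python
from datasets import load_dataset

# "GBaker/MedQA-USMLE-4-options" is an assumed Hub mirror, not confirmed as
# the leaderboard's source; field names vary between mirrors, so inspect first.
medqa = load_dataset("GBaker/MedQA-USMLE-4-options", split="test")
print(len(medqa))          # number of test questions
print(medqa[0].keys())     # check the fields before building any prompt
print(medqa[0])
```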
MedMCQA
MedMCQA is a large-scale multiple-choice QA dataset derived from Indian medical entrance examinations (AIIMS/NEET). It covers 2.4k healthcare topics and 21 medical subjects, with over 187,000 questions in the development set and 6,100 questions in the test set.
Each question has 4 answer choices and is accompanied by an explanation. MedMCQA evaluates a model’s general medical knowledge and reasoning capabilities.
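The same kind of quick inspection works for MedMCQA. The Hub repository id below is the commonly referenced copy and is an assumption, as are the field names mentioned in the comment.

```python
from datasets import load_dataset

# Assumed Hub id for MedMCQA; check the dataset card for the split the
# leaderboard actually scores against.
medmcqa = load_dataset("openlifescienceai/medmcqa", split="validation")
print(len(medmcqa))
print(medmcqa[0])  # typically a question, four options (opa-opd), the correct option, and an explanation
```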
PubMedQA
PubMedQA is a closed-domain QA dataset in which each question is answered from an associated context (a PubMed abstract). It consists of 1,000 expert-labeled question-answer pairs; for each, the task is to give a yes/no/maybe answer based on the information in the abstract.
The dataset is split into 500 questions for development and 500 for testing. PubMedQA assesses a model’s ability to comprehend and reason over scientific biomedical literature.
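The expert-labeled split can be inspected the same way; the dataset id, configuration name, and `final_decision` field below follow the commonly used Hub copy and are assumptions.

```python
from collections import Counter
from datasets import load_dataset

# "pqa_labeled" is the 1,000-example expert-labeled configuration on the
# commonly used Hub copy (an assumption); labels are yes / no / maybe.
pubmedqa = load_dataset("pubmed_qa", "pqa_labeled", split="train")
print(len(pubmedqa))
print(Counter(pubmedqa["final_decision"]))  # distribution of yes/no/maybe answers
```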
MMLU Subsets (Medicine and Biology)
The MMLU benchmark (Measuring Massive Multitask Language Understanding) includes multiple-choice questions from various domains. The Open Medical-LLM Leaderboard focuses on the subsets most relevant to medical and biological knowledge:
- Clinical Knowledge: 265 questions assessing clinical knowledge and decision-making skills.
- Medical Genetics: 100 questions covering topics related to medical genetics.
- Anatomy: 135 questions evaluating the knowledge of human anatomy.
- Professional Medicine: 272 questions assessing knowledge required for medical professionals.
- College Biology: 144 questions covering college-level biology concepts.
- College Medicine: 173 questions assessing college-level medical knowledge.
Each MMLU subset consists of multiple-choice questions with 4 answer options and is designed to evaluate a model’s understanding of specific medical and biological domains.
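As a quick sanity check on the subset sizes above, the medicine- and biology-related MMLU configurations can be pulled from the Hub; the `cais/mmlu` repository id is an assumption about where the leaderboard sources them.

```python
from datasets import load_dataset

# Medicine- and biology-related MMLU subsets listed above; "cais/mmlu" is
# the commonly used Hub copy (an assumption, not confirmed as the source).
medical_subsets = [
    "clinical_knowledge",
    "medical_genetics",
    "anatomy",
    "professional_medicine",
    "college_biology",
    "college_medicine",
]

for subset in medical_subsets:
    ds = load_dataset("cais/mmlu", subset, split="test")
    print(f"{subset}: {len(ds)} questions")  # each item has a question, four choices, and an answer index
```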
Also read: Study shows that GPT-4V with RAG offers numerous benefits in Clinical Trial Screening.
Conclusion
The Open Medical-LLM Leaderboard is a valiant effort to evaluate medical LLMs, but medical professionals around the world have pointed out the wide gap between textbook or dataset scenarios and actual clinical cases.