The most recent iteration of the artificial intelligence chatbot ChatGPT has passed a radiology board-style examination. Researchers tested it on 150 multiple-choice questions inspired by the Canadian Royal College and American Board of Radiology exams. The result highlights the potential of AI in medicine, but the study also exposes limitations that undermine ChatGPT’s reliability.
ChatGPT, developed by OpenAI, is a chatbot built on a deep-learning model known for generating human-like responses to user input. It recognizes patterns and relationships among words in its vast training data, which lets it interpret prompts and compose fluent answers. Because that training data contains no definitive source of truth, however, ChatGPT sometimes produces factually incorrect responses.
Dr. Rajesh Bhayana, an abdominal radiologist and technology lead at University Medical Imaging Toronto, explains, “The use of large language models like ChatGPT is rapidly expanding and will only continue to grow. Our research offers valuable insight into how ChatGPT performs in a radiology setting, emphasizing its immense potential while shedding light on current reliability issues.”
ChatGPT’s popularity and impact have grown rapidly. It recently became the fastest-growing consumer application in history, and it is being integrated into popular search engines such as Google and Bing, which physicians and patients alike use to search for medical information. On the radiology board-style examination, ChatGPT, then based on GPT-3.5, answered 69% of the questions correctly, just short of the 70% passing grade.
Performance differed markedly by question type, however: the chatbot answered 84% of lower-order thinking questions correctly but only 60% of higher-order thinking questions. It struggled in particular with questions involving the description of imaging findings, calculations and classifications, and the application of concepts. Notably, ChatGPT has received no radiology-specific training, so these difficulties were somewhat expected.
In March, a newer version, GPT-4, was released with what OpenAI describes as more advanced reasoning capabilities. In a follow-up study, GPT-4 answered 81% of the same questions correctly, clearing the passing threshold and outperforming its predecessor, GPT-3.5. Its gains came on higher-order thinking questions, however: GPT-4 showed no improvement on lower-order questions and answered 12 questions incorrectly that GPT-3.5 had gotten right. This inconsistency raises questions about the AI’s reliability for information gathering.
Dr. Bhayana comments, “ChatGPT gave accurate and confident answers to some challenging radiology questions, but then made some very illogical and inaccurate assertions. Given how these models function, the inaccurate responses should not be surprising.” Both studies also documented ChatGPT’s tendency to produce incorrect responses, known as hallucinations. Though less frequent in GPT-4, hallucinations still limit the chatbot’s usability in medical education and practice today.
Despite these limitations, the researchers see a role for ChatGPT in sparking ideas and assisting with medical writing and data summarization, provided its output is verified. Dr. Bhayana adds, “To me, this is its biggest limitation. At present, ChatGPT is best used to spark ideas, help start the medical writing process, and in data summarization. If used for quick information recall, it always needs to be fact-checked.”