September 11, 2023
AI models analyze large amounts of data, learn from this information, and then make informed decisions. In healthcare, AI has the potential to execute complex tasks, including diagnostic problem-solving and decision-making support, and will revolutionize medicine in the years to come.1
In surgery, compelling applications of AI include predictive risk analytics through machine learning and real-time intraoperative decision support through image recognition and video analysis (see Figure 1).
Figure 1. Potential Preoperative, Intraoperative, and Postoperative Uses of Artificial Intelligence in Surgery
The use of AI for predictive analytics in surgery has had significant success. Surgical risk is seldom linear, and the presence or absence of certain risk variables affects the impact of others. AI has the ability to detect these nonlinear relationships.
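To make the idea of nonlinear risk interactions concrete, the brief sketch below (in Python, using scikit-learn) fits a simple linear risk model and a tree-based model to synthetic data in which the effect of one risk factor depends on others. The variables, data, and models are hypothetical illustrations, not the algorithms behind POTTER or any other published risk calculator.

```python
# A minimal sketch (not any published tool's implementation) of how a
# tree-based model can capture a nonlinear interaction between risk factors
# that a purely linear risk score would miss. All variables and the synthetic
# data are hypothetical.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5_000
age = rng.uniform(20, 90, n)           # years
hypotension = rng.integers(0, 2, n)    # preoperative hypotension (0/1)
albumin = rng.normal(3.8, 0.6, n)      # g/dL

# Hypothetical ground truth: hypotension sharply raises risk only in older
# patients with low albumin, i.e., the effect of one variable depends on others.
logit = -4 + 2.5 * hypotension * (age > 65) * (albumin < 3.0) + 0.02 * (age - 50)
p = 1 / (1 + np.exp(-logit))
died = rng.binomial(1, p)

X = np.column_stack([age, hypotension, albumin])
X_tr, X_te, y_tr, y_te = train_test_split(X, died, random_state=0)

linear = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
trees = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

print("Linear model AUC:", roc_auc_score(y_te, linear.predict_proba(X_te)[:, 1]))
print("Tree-based AUC:  ", roc_auc_score(y_te, trees.predict_proba(X_te)[:, 1]))
```

In this toy setting, the tree-based model typically achieves the higher area under the curve because it can represent the interaction directly, which is the same property that lets AI risk tools capture nonlinear surgical risk.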
The Predictive OpTimal Trees in Emergency Surgery Risk (POTTER) tool is one of several AI risk calculators developed over the last few years. Created by Dimitris Bertsimas, PhD, and colleagues, POTTER uses novel machine-learning algorithms to predict postoperative outcomes, including mortality, following emergency surgery.2 For more information about POTTER, read the February 2023 Bulletin article, “Mobile Device Application Helps Predict Postoperative Complications.”
Similarly, the Trauma Outcome Predictor (TOP) is another machine-learning AI tool that predicts in-hospital mortality for trauma patients.3 Both applications have been validated, and user-friendly smartphone interfaces were developed to facilitate bedside use by surgeons.
In a recent study published in The Journal of Trauma and Acute Care Surgery, POTTER outperformed surgeons in predicting postoperative risk, and when provided to surgeons, enhanced their judgment.4 By accounting for nonlinear interactions among variables, these AI risk-assessment tools are proving to be useful for bedside counseling of patients.
The use of AI intraoperatively is heavily reliant on techniques to annotate and assess operative videos. Using annotations and machine learning, researchers assessed disease severity, critical view of safety achievement, and intraoperative events in 1,051 laparoscopic cholecystectomy videos.5 These results were compared to manual review by surgeons, and researchers found that AI-surgeon agreement varied based on case severity.
Despite this variation, AI-surgeon agreement for intraoperative events ranged from 75% to 99%. Another study found that an AI model trained to apply the Parkland grading scale used in laparoscopic cholecystectomy reliably quantified the degree of gallbladder inflammation and predicted the intraoperative course of the operation, suggesting considerable potential for real-time augmentation of surgical performance.6
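As a rough illustration of how such video-based models are often structured, the sketch below sets up frame-level image classification for a task like Parkland grading using a pretrained convolutional network in PyTorch. The backbone choice, the five-class output head, and the dummy input frame are assumptions for illustration; this is not the pipeline used in the cited studies.

```python
# A minimal sketch of frame-level classification for a task such as Parkland
# grading of gallbladder inflammation. Illustrative only: the backbone, the
# 5-class head, and the random input frame are assumptions, not the published
# architecture or training pipeline.
import torch
import torch.nn as nn
from torchvision import models

NUM_GRADES = 5  # Parkland grading scale: grades 1 (mild) through 5 (severe)

# Start from an ImageNet-pretrained backbone and replace the classifier head.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, NUM_GRADES)
model.eval()

# In practice, frames would be extracted from the laparoscopic video and
# normalized; here a random tensor stands in for one 224x224 RGB frame.
frame = torch.rand(1, 3, 224, 224)

with torch.no_grad():
    logits = model(frame)
    probabilities = torch.softmax(logits, dim=1)
    predicted_grade = int(probabilities.argmax(dim=1)) + 1

print(f"Predicted Parkland grade: {predicted_grade}")
```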
One serious concern regarding the use of AI in healthcare generally, and in surgery specifically, is the risk of encoding unnoticed bias, especially with uninterpretable "black box" models. For example, if the data used to train an AI model to identify the risk of malignancy in skin lesion images do not include patients of all colors and skin tones, the model will perform poorly in the underrepresented patient populations.
In one effort aimed at early identification of trauma patients who will need discharge to rehabilitation, training on a national dataset led the AI models to encode race as the second-most important factor in deciding whether patients needed rehabilitation, likely reflecting existing disparities in the healthcare system that the algorithm inadvertently learned.7 This risk of bias recently led President Biden to secure commitments from several leading AI companies to include safety measures that decrease bias in the training and output of AI models.8
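One way such encoded bias can be surfaced is by auditing a trained model's feature importances to see whether a sensitive attribute, such as race, ranks near the top. The sketch below does this with permutation importance on entirely synthetic data; the feature names, model, and data are hypothetical and are not drawn from the cited study.

```python
# A minimal sketch (with synthetic data) of auditing a trained model for
# encoded bias by checking whether a sensitive attribute ranks near the top
# of its permutation feature importances. Features, labels, and model are
# hypothetical illustrations.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 3_000
X = pd.DataFrame({
    "injury_severity_score": rng.integers(1, 40, n),
    "age": rng.integers(18, 90, n),
    "race_minority": rng.integers(0, 2, n),   # sensitive attribute
    "comorbidity_count": rng.integers(0, 6, n),
})
# Synthetic labels in which the outcome is partly driven by the sensitive
# attribute, mimicking a disparity the model could inadvertently learn.
y = ((X["injury_severity_score"] > 20)
     | ((X["race_minority"] == 1) & (rng.random(n) < 0.4))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Rank features by how much shuffling each one degrades model performance.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for name, score in sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1]):
    print(f"{name:25s} {score:.3f}")
```

A sensitive attribute appearing high in such a ranking is a prompt for scrutiny of the training data and the care patterns it reflects, not proof of a biological effect.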
Transparent or interpretable AI, used wisely, has the potential not only to identify bias in our training datasets and disparities in our existing care, but also to mitigate and remedy them in the decision-support algorithms derived from those data, if we prompt it to do so.9
OpenAI’s ChatGPT and GPT-4, Google’s Bard, and Microsoft’s Sydney are examples of recent developments in AI technology and warrant our attention in surgery.
These natural language models (NLMs) operate by training with large amounts of data, identifying patterns in the data that are not discernible to the human eye or mind, and generating statistically probable outputs. NLMs have been described as a revolutionary technology and “the biggest innovation since the user-friendly computer.”10 ChatGPT alone had 1.6 billion visits in June 2023.11
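"Generating statistically probable outputs" can be illustrated with a toy example: given the scores a language model might assign to candidate next words, the model converts them to probabilities and samples the next word accordingly. The words and scores below are invented purely for illustration and do not come from any actual model.

```python
# A toy illustration of next-word sampling: hypothetical model scores (logits)
# are converted to probabilities with a softmax, then a next word is sampled.
import numpy as np

candidate_words = ["cholecystectomy", "appendectomy", "banana", "laparotomy"]
logits = np.array([4.1, 2.3, -3.0, 1.8])   # invented model scores

probabilities = np.exp(logits) / np.exp(logits).sum()   # softmax
rng = np.random.default_rng(0)
next_word = rng.choice(candidate_words, p=probabilities)

for word, p in zip(candidate_words, probabilities):
    print(f"{word:18s} {p:.3f}")
print("Sampled next word:", next_word)
```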
Since their public release, these models have been used for a range of applications, including writing poetry, making mnemonics, and most importantly, sifting through vast and sometimes incomprehensible data from a variety of text sources to answer the user’s questions in an engaging and simple manner.
In healthcare, NLMs already have been used to write discharge summaries,12 simplify radiology reports,13 take medical notes,14 and even write scientific manuscripts and grant applications.15,16 Tests evaluating the medical knowledge of GPT-4 using the US Medical Licensing Examination have shown that it answers questions correctly 90% of the time, and recently, it has been recommended for assisting bedside clinicians in medical consultation, diagnosis, and education.14
The potential use of NLMs for drafting the operative note and reducing the administrative burden on surgeons recently was assessed.15 Researchers evaluated the operative notes created by ChatGPT using a 30-point tool generated using recommendations from the Getting It Right First Time (GIRFT) program.
GIRFT is a program that partners with the National Health Service in the UK to improve surgical documentation guidelines. The authors found that ChatGPT scored an average of 78.8%, surpassing the compliance of surgeons' operative notes with a similar set of guidelines from the Royal College of Surgeons of England.
Similarly, investigators described advanced applications of ChatGPT in cardiothoracic surgery, particularly its use in processing extensive datasets to create predictive models that identify patients requiring intensive treatment plans or potential therapeutic targets for lung and cardiovascular diseases.17
ChatGPT also has been described as a potential tool for delivering clinical information, such as research findings and management protocols, to time-pressed surgeons in a concise, prompt, and contextually appropriate manner to enhance patient care.18
Despite the potential uses of NLMs in surgery, surgeons need to be aware of their limitations, including the tendency to generate information that is inaccurate or not grounded in the training data, a phenomenon known as "hallucination" or "absence of factuality."19-21
These inaccuracies often are stated in a convincing tone that could mislead those seeking information, potentially swaying or even compromising their judgment and decision-making. For example, researchers found that ChatGPT-generated discharge summaries included information about the patient's compliance with therapy and a postoperative recovery course that was not included in the information provided to the model.12
Another study revealed that 9% of the simplified radiology reports contained inaccurate information, and 5% had missing information.13
Such inaccuracies may pose significant patient safety risks by introducing anchoring and confirmation bias that can lead to patient harm. The risk of hallucination has led to suggestions that national organizations, such as the US National Institutes of Health, should promote the preferential use of NLMs for simple, lower-risk healthcare administrative and patient communication tasks that do not require extensive training or expertise and are easily validated.22
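As a deliberately naive illustration of what "easily validated" might look like in software, the sketch below flags sentences in a generated summary that share little vocabulary with the source documentation. The function, threshold, and example text are invented; this is a toy heuristic, not a clinically validated safeguard, and clinician review would still be required.

```python
# A naive sketch of screening a generated discharge summary for sentences
# poorly supported by the source note, by flagging sentences with little word
# overlap with the source. Purely illustrative; not a clinical safeguard.
import re

def unsupported_sentences(source_note: str, generated_summary: str, threshold: float = 0.3):
    source_words = set(re.findall(r"[a-z]+", source_note.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", generated_summary.strip()):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if not words:
            continue
        overlap = len(words & source_words) / len(words)
        if overlap < threshold:
            flagged.append(sentence)
    return flagged

source = "Patient underwent laparoscopic cholecystectomy. No complications recorded."
summary = ("The patient underwent laparoscopic cholecystectomy without complications. "
           "She was fully compliant with physical therapy and walked two miles daily.")
for sentence in unsupported_sentences(source, summary):
    print("Unsupported:", sentence)
```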
Such limitations, however, can be mitigated by training NLMs with contextualized, subject-specific (e.g., surgical) data and information that improve the model's fidelity. In particular, surgeons should learn to design prompts that elicit the most reliable and accurate responses from the model (i.e., "prompt engineering").
The content of the prompt as well as its tone can impact the results provided by the model, and thoughtful, purposeful prompts can significantly enhance the usefulness of the output.23-25 Figure 2 shows select examples of prompts that can help surgeons make the best use of these models.
Figure 2. Prompts to Improve Output from ChatGPT
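As one illustration of prompt engineering in practice, the sketch below composes a structured prompt, with an explicit role, clinical context, output format, and an instruction to acknowledge uncertainty, and sends it through the OpenAI Python client. The model name, role framing, and clinical scenario are assumptions for illustration, and any response would still require verification by the surgeon.

```python
# A sketch of a structured prompt intended to elicit a more reliable, scoped
# answer. Model name, role framing, and the clinical scenario are assumptions
# for illustration; output must still be verified by the surgeon.
from openai import OpenAI

client = OpenAI()  # expects the OPENAI_API_KEY environment variable

system_prompt = (
    "You are assisting a board-certified general surgeon. "
    "Answer concisely, cite the guideline or evidence you rely on, "
    "and explicitly say 'I am not certain' when the evidence is unclear."
)
user_prompt = (
    "Context: 68-year-old patient, post laparoscopic cholecystectomy, "
    "on therapeutic anticoagulation for atrial fibrillation.\n"
    "Question: Summarize, in 5 bullet points, considerations for restarting "
    "anticoagulation postoperatively. List any assumptions you make."
)

response = client.chat.completions.create(
    model="gpt-4",  # assumed model identifier; substitute whichever model is available
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ],
    temperature=0.2,  # lower temperature favors more conservative, less varied output
)
print(response.choices[0].message.content)
```

The role statement, explicit context, constrained output format, and request to surface assumptions are the kinds of elements that tend to make responses easier to check, which is the practical goal of prompt engineering at the bedside.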
Another shortcoming of NLMs is their inability to understand causality. While NLMs can "memorize" data to describe an action or predict its outcome with high accuracy, they most often cannot provide a causal explanation for it. This limitation was described in a recent article as the inability of a model to conjecture that an apple falls because of the force of gravity and to conclude that all objects would fall due to gravity.26
Because NLMs are not constrained in the information from which they learn, these models often are not bound by ethical principles and are incapable of moral reasoning. Moral reasoning is crucial if NLMs are to be used as interfaces for patient communication. To overcome this, restrictions may be placed on the subject matter and language used by the model. However, not only can these restrictions be circumvented with detailed prompts, but they also can cause the model to assume an apathetic indifference to ethical dilemmas.26
Overall, the use of AI for image recognition and predictive/prescriptive analytics promises an era of remarkable precision, workflow optimization, and improved patient well-being in surgery. NLMs such as ChatGPT likewise herald an unprecedented era of machine learning-dependent healthcare and surgical care.
There already has been reproducible success in using AI and ChatGPT to predict outcomes in the postoperative phase of surgical care. In addition, a growing number of advances are moving these technologies toward real-time intraoperative capabilities.
While AI can identify patterns in data that are not discernible to the human eye and has been shown to have greater than 75% accuracy for numerous applications, the risks of encoding bias and of hallucination require human engagement to mitigate and minimize negative effects.
As such, there is a significant need for surgeons and healthcare teams to become familiar with this technology and critically evaluate and make strategic and purposeful use of AI in patient care.
AI is here to stay, evolve, and improve, eventually shifting the tasks of healthcare workers toward the areas that cannot be automated, such as patient connectedness, reassurance, and comfort.27
Dr. Wardah Rafaqat and Dr. May Abiad are postdoctoral research fellows in the Department of Surgery at Massachusetts General Hospital in Boston.
References