Human Engagement, Prompts, and Bias Recognition Are Ingredients for Successful AI, ChatGPT Use

From voice and face recognition on smartphones to self-driving cars, advanced technology, specifically artificial intelligence (AI), is quickly becoming a part of our daily lives.

AI models analyze large amounts of data, learn from this information, and then make informed decisions. In healthcare, AI has the potential to execute complex tasks, including diagnostic problem-solving and decision-making support, and will revolutionize medicine in the years to come.¹

In surgery, there are compelling uses of AI, including risk predictive analytics through machine learning, and the use of AI to assist with real-time intraoperative decision-making and support through image recognition and video analysis (see Figure 1).

Figure 1. Potential Preoperative, Intraoperative, and Postoperative Uses of Artificial Intelligence in Surgery

AI and Predictive Analytics

The use of AI for predictive analytics in surgery has had significant success. Surgical risk is seldom linear, and the presence or absence of certain risk variables affects the impact of others. AI has the ability to detect these nonlinear relationships.
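As a rough illustration of what "nonlinear" means here, the sketch below contrasts an additive score, in which each risk factor contributes a fixed amount, with a small tree-style score, in which the effect of one factor depends on another. The function names, the two risk factors, and every number are invented for illustration; this is not the POTTER model or any validated calculator.

```python
# Toy illustration only: every number below is invented, not clinical data.

def additive_risk(age_over_65: bool, septic: bool) -> float:
    """Linear-style score: each factor adds a fixed amount of risk."""
    risk = 0.02
    if age_over_65:
        risk += 0.05
    if septic:
        risk += 0.10
    return risk


def tree_risk(age_over_65: bool, septic: bool) -> float:
    """Tree-style score: the effect of age depends on the sepsis branch."""
    if septic:
        return 0.40 if age_over_65 else 0.12
    return 0.07 if age_over_65 else 0.02


def age_effect(model, septic: bool) -> float:
    """Change in predicted risk attributable to age, holding sepsis fixed."""
    return model(True, septic) - model(False, septic)


for name, model in [("additive", additive_risk), ("tree", tree_risk)]:
    without_sepsis = round(age_effect(model, septic=False), 2)
    with_sepsis = round(age_effect(model, septic=True), 2)
    print(f"{name}: age adds {without_sepsis} without sepsis, {with_sepsis} with sepsis")
```

In the additive score, age contributes the same increment whether or not sepsis is present. In the tree-style score, the increment is much larger in the septic branch, which is the kind of interaction a linear model cannot express without extra terms.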

The Predictive OpTimal Trees in Emergency Surgery Risk (POTTER) tool is one of several AI risk calculators developed over the last few years. Created by Dimitris Bertsimas, PhD, and colleagues, POTTER uses novel machine-learning algorithms to predict postoperative outcomes, including mortality, following emergency surgery.² For more information about POTTER, read the February 2023 Bulletin article, “Mobile Device Application Helps Predict Postoperative Complications.”

The Trauma Outcome Predictor (TOP) is another machine-learning AI tool; it predicts in-hospital mortality for trauma patients.³ Both applications have been validated, and user-friendly smartphone interfaces were developed to facilitate their use by surgeons at the bedside.

In a recent study published in The Journal of Trauma and Acute Care Surgery, POTTER outperformed surgeons in predicting postoperative risk, and when provided to surgeons, enhanced their judgment.⁴ By accounting for nonlinear interactions among variables, these AI risk-assessment tools are proving to be useful for bedside counseling of patients.

The use of AI intraoperatively is heavily reliant on techniques to annotate and assess operative videos. Using annotations and machine learning, researchers assessed disease severity, critical view of safety achievement, and intraoperative events in 1,051 laparoscopic cholecystectomy videos.⁵ These results were compared to manual review by surgeons, and researchers found that AI-surgeon agreement varied based on case severity.

Despite the variance, AI-surgeon agreement for intraoperative events was consistently greater than 75% and reached as high as 99%. Another study found that an AI model trained to identify the Parkland grading scale used in laparoscopic cholecystectomies was reliable in quantifying the degree of gallbladder inflammation and predicting the intraoperative course of the cholecystectomy. As such, the model had serious potential for real-time surgical performance augmentation.⁶
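One simple way to read an AI-surgeon agreement figure is as percent agreement: the share of videos in which the model's label matches the surgeon's. The sketch below computes that share separately for each severity stratum. It assumes plain percent agreement, which may not be the exact statistic used in the cited study, and the function name, strata, and labels are hypothetical.

```python
from collections import defaultdict


def agreement_by_stratum(records):
    """Return percent agreement between AI and surgeon labels per stratum.

    records: iterable of (stratum, ai_label, surgeon_label) tuples.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for stratum, ai_label, surgeon_label in records:
        totals[stratum] += 1
        matches[stratum] += int(ai_label == surgeon_label)
    return {s: 100.0 * matches[s] / totals[s] for s in totals}


# Invented labels for "critical view of safety achieved" in eight videos.
records = [
    ("mild", True, True),
    ("mild", True, True),
    ("mild", False, False),
    ("mild", True, True),
    ("severe", True, True),
    ("severe", False, True),
    ("severe", True, False),
    ("severe", False, False),
]

agreement = agreement_by_stratum(records)
print(agreement)
```

Reporting agreement per stratum rather than as a single pooled number makes it visible when the model and the surgeons diverge mainly in the harder cases.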

AI and Healthcare Inequities

One of the serious concerns regarding the use of AI in healthcare in general, and in surgery specifically, is the risk of encoding unnoticed bias, especially with the use of uninterpretable “black box” models. For example, if the data used to train an AI model to identify risk of malignancy in skin lesion images do not include patients of all colors and skin tones, the model will fail to perform adequately in the underrepresented patient populations.
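A basic check for this kind of failure is to report model performance separately for each patient subgroup instead of as one pooled figure. The sketch below computes sensitivity (the fraction of truly malignant lesions that the model flags) per group. The group names and cases are invented, and the function is a hypothetical helper, not part of any published tool.

```python
from collections import defaultdict


def sensitivity_by_group(cases):
    """Return the fraction of truly malignant lesions flagged, per group.

    cases: iterable of (group, truly_malignant, flagged_by_model) tuples.
    Groups with no malignant cases are omitted from the result.
    """
    positives = defaultdict(int)
    true_positives = defaultdict(int)
    for group, truly_malignant, flagged_by_model in cases:
        if truly_malignant:
            positives[group] += 1
            true_positives[group] += int(flagged_by_model)
    return {g: true_positives[g] / positives[g] for g in positives}


# Invented cases: group_B stands in for a skin tone that was
# underrepresented in the training images.
cases = [
    ("group_A", True, True),
    ("group_A", True, True),
    ("group_A", True, True),
    ("group_A", True, True),
    ("group_A", False, False),
    ("group_B", True, True),
    ("group_B", True, False),
    ("group_B", True, False),
    ("group_B", True, True),
    ("group_B", False, False),
]

per_group = sensitivity_by_group(cases)
print(per_group)
```

In this made-up example a pooled sensitivity of 75% would hide the fact that half of the malignant lesions in group_B are missed.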

In an effort aimed at early identification of trauma patients who will need discharge to rehabilitation, the use of a national dataset led the AI models to encode race as the second-most important factor upon which to decide whether patients need rehabilitation, likely reflecting existing disparities in the healthcare system that the algorithm inadvertently learned.⁷ This risk of bias recently led President Biden to secure commitment from several leading AI companies to include safety measures to decrease bias in training and output of AI models.⁸

The mathematics in transparent or interpretable AI has the potential, if used wisely, not only to identify bias in our training datasets and disparities in our existing care, but also to mitigate and remedy them in the derived decision-support algorithms if we prompt the models to do so.⁹

ChatGPT and NLMs in Surgery

OpenAI’s ChatGPT and GPT-4, Google’s Bard, and Microsoft’s Sydney are examples of recent developments in AI technology and warrant our attention in surgery.

These natural language models (NLMs) operate by training with large amounts of data, identifying patterns in the data that are not discernible to the human eye or mind, and generating statistically probable outputs. NLMs have been described as a revolutionary technology and “the biggest innovation since the user-friendly computer.”¹⁰ ChatGPT alone had 1.6 billion visits in June 2023.¹¹
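To make "generating statistically probable outputs" concrete, the sketch below builds a toy bigram model: it counts which word follows which in a short training text and turns those counts into next-word probabilities. This is a deliberately simplified stand-in. Models such as ChatGPT use large neural networks rather than word-pair counts, and the training sentence and function names here are invented.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for every word, how often each other word follows it."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def next_word_probabilities(counts, word):
    """Convert follower counts for `word` into probabilities."""
    followers = counts.get(word.lower(), Counter())
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()} if total else {}


# Invented training text, for illustration only.
corpus = (
    "the patient is stable the patient is febrile "
    "the patient was discharged"
)

model = train_bigrams(corpus)
probs = next_word_probabilities(model, "patient")
most_likely = max(probs, key=probs.get)
print(probs, most_likely)
```

The model picks "is" after "patient" only because that pairing was most frequent in its training text, not because it understands the sentence. The same property explains why a fluent output can still be wrong.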

Since their public release, these models have been used for a range of applications, including writing poetry, making mnemonics, and most importantly, sifting through vast and sometimes incomprehensible data from a variety of text sources to answer the user’s questions in an engaging and simple manner.

In healthcare, NLMs already have been used to write discharge summaries,¹² simplify radiology reports,¹³ take medical notes,¹⁴ and even write scientific manuscripts and grant applications.¹⁵,¹⁶ Tests evaluating the medical knowledge of GPT-4 using the US Medical Licensing Examination have shown that it answers questions correctly 90% of the time, and recently, it has been recommended for assisting bedside clinicians in medical consultation, diagnosis, and education.¹⁴

The potential use of NLMs for drafting the operative note and reducing the administrative burden on surgeons recently was assessed.¹⁵ Researchers evaluated the operative notes created by ChatGPT using a 30-point tool generated using recommendations from the Getting It Right First Time (GIRFT) program.

GIRFT is an organization that partnered with the National Health Service in the UK to improve surgical documentation guidelines. The authors found that ChatGPT scored an average of 78.8%, surpassing the compliance of surgeons’ operative notes to a similar set of guidelines from the Royal College of Surgeons of England.

Similarly, investigators described advanced applications of ChatGPT in cardiothoracic surgery, particularly in creating predictive models to identify patients requiring intensive treatment plans or therapeutic targets for lung and cardiovascular diseases by processing extensive datasets.¹⁷

ChatGPT also has been described as a potential tool to provide clinical information such as research findings and management protocols in a concise, prompt, and contextually appropriate manner to surgeons pressed for time in order to enhance patient care.¹⁸

Despite the potential use of NLMs in surgery, surgeons need to be aware of some of their limitations, including the potential for inaccurate information that is not present in the training data, a phenomenon known as “hallucination” or “absence of factuality.”¹⁹⁻²¹

These inaccuracies often are stated in a convincing tone that could mislead those seeking information, potentially swaying or even compromising their judgment and decision-making. For example, researchers found that ChatGPT-generated discharge summaries included information about the patient’s compliance with therapy and a postoperative recovery course that was not found in the training data.¹²

Another study revealed that 9% of the translated radiology reports contained inaccurate information, and 5% had missing information.¹³

Such inaccuracies may pose significant patient safety risks by initiating anchoring and confirmation bias that can lead to patient harm. The risk for hallucination has led to suggestions that national organizations such as the US National Institutes of Health should promote the preferential use of NLMs for simple and less-risky healthcare administrative and patient communication tasks that do not require extensive training or expertise and are easily validated.²²

Such limitations, however, can be mitigated by creating NLMs trained with contextualized and subject-specific (e.g., surgery) data and information that improve the model’s fidelity. In addition, surgeons should become adept at designing prompts that enable the model to give the most reliable and accurate response (i.e., “prompt engineering”).

The content of the prompt as well as its tone can impact the results provided by the model, and thoughtful, purposeful prompts can significantly enhance the usefulness of the output.²³⁻²⁵ Figure 2 shows select examples of prompts that can help surgeons make the best use of these models.

Figure 2. Prompts to Improve Output from ChatGPT
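One way to make prompts more deliberate is to assemble them from explicit parts instead of typing a single free-form question. The sketch below builds a prompt from a role, a context, a task, and a list of constraints. This four-part structure is an assumption made for illustration and does not reproduce the contents of Figure 2. The helper name and the example wording are hypothetical, and the code only builds text without calling any model.

```python
def build_prompt(role, context, task, constraints=()):
    """Assemble a structured prompt from explicitly stated parts."""
    lines = [
        f"You are {role}.",
        f"Context: {context}",
        f"Task: {task}",
    ]
    if constraints:
        lines.append("Constraints:")
        lines.extend(f"- {item}" for item in constraints)
    return "\n".join(lines)


# Hypothetical example; the wording is illustrative, not a validated prompt.
prompt = build_prompt(
    role="an assistant helping a general surgeon prepare patient education material",
    context="The reader is an adult patient with no medical training.",
    task="Explain what a laparoscopic cholecystectomy is in plain language.",
    constraints=[
        "Use short sentences and avoid jargon.",
        "If you are unsure of a fact, say so instead of guessing.",
        "Do not give individualized medical advice.",
    ],
)
print(prompt)
```

Writing the constraints out as a list makes it easier to review what the model was asked to do, and to reuse the same safeguards, such as asking the model to state uncertainty, across different tasks.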

Another shortcoming of NLMs is their inability to understand the causality between actions. While NLMs can “memorize” data to describe an action or predict its outcome with high accuracy, they most often cannot provide a causal explanation for it. This limitation was described in a recent article as the inability of the model to conjecture that an apple falls due to the force of gravity and conclude that all objects would fall due to gravity.²⁶

Since NLMs are not constrained by the information from which they can learn, these models often are not restricted by ethical principles and are incapable of moral thinking. Moral thinking is crucial if NLMs are to be used as interfaces for patient communication. To overcome this, restrictions may be placed on the subject matter and language used by the model. However, not only are these restrictions circumventable with the use of detailed prompts, but they also can cause the model to assume an apathetic indifference to ethical dilemmas.²⁶

Overall, the use of AI in image recognition and predictive/prescriptive analytics promises an era of remarkable precision, workflow optimization, and elevated patient well-being in surgery. NLMs such as ChatGPT also signal an unprecedented era of machine-learning-dependent healthcare and surgical care.

There has already been reproducible success in using AI and ChatGPT in the postoperative phase of surgical care to predict outcomes. In addition, there are a growing number of advancements toward enabling real-time intraoperative capabilities.

While AI can identify patterns in data that are not discernible to the human eye and has been shown to have greater than 75% accuracy for numerous applications, the risk of encoding bias and the risk of hallucination require human engagement to mitigate and minimize negative effects.

As such, there is a significant need for surgeons and healthcare teams to become familiar with this technology and critically evaluate and make strategic and purposeful use of AI in patient care.

AI is here to stay, evolve, and improve, eventually shifting the tasks of healthcare workers toward the areas that cannot be automated such as patient connectedness, reassurance, and comfort.²⁷

Dr. Wardah Rafaqat and Dr. May Abiad are postdoctoral research fellows in the Department of Surgery at Massachusetts General Hospital in Boston.

References

Rajpurkar P, Chen E, Banerjee O, Topol EJ. AI in health and medicine. Nat Med. 2022;28(1):31-38.
Bertsimas D, Dunn J, Velmahos GC, Kaafarani HMA. Surgical risk is not linear: Derivation and validation of a novel, user-friendly, and machine-learning-based Predictive OpTimal Trees in Emergency Surgery Risk (POTTER) calculator. Ann Surg. 2018;268(4):574-583.
Maurer LR, Bertsimas D, Bouardi HT, et al. Trauma outcome predictor: An artificial intelligence interactive smartphone tool to predict outcomes in trauma patients. J Trauma Acute Care Surg. 2021;91(1):93-99.
El Moheb M, Gebran A, Maurer LR, et al. Artificial intelligence versus surgeon gestalt in predicting risk of emergency general surgery. J Trauma Acute Care Surg. 2023 Jun 14.
Korndorffer JR, Hawn MT, Spain DA, et al. Situating artificial intelligence in surgery: A focus on disease severity. Ann Surg. 2020;272(3):523-528.
Ward TM, Hashimoto DA, Ban Y, et al. Artificial intelligence prediction of cholecystectomy operative course from automated identification of gallbladder inflammation. Surg Endosc. 2022;36(9):6832-6840.
Maurer LR, Bertsimas D, Kaafarani HMA. Machine learning reimagined: The promise of interpretability to combat bias. Ann Surg. 2022;275(6):e738-e739.
Seven AI companies agree to safeguards in the US. BBC News. July 22, 2023. Available at: https://www.bbc.com/news/technology-66271429. Accessed August 9, 2023.
Beale N, Battey H, Davison AC, Mackay RS. An unethical optimization principle. R Soc Open Sci. 2020. Available at: https://royalsocietypublishing.org/doi/10.1098/rsos.200462. Accessed August 9, 2023.
Bienasz A. The age of AI has begun: Bill Gates says this ‘revolutionary’ tech is the biggest innovation since the user-friendly computer. Entrepreneur. March 22, 2023. Available at: https://www.entrepreneur.com/business-news/bill-gates-says-chatgpt-is-revolutionary-in-new-blog-post/448138. Accessed August 9, 2023.
Duarte F. Number of ChatGPT Users. Exploding Topics. July 13, 2023. Available at: https://explodingtopics.com/blog/chatgpt-users. Accessed August 9, 2023.
Patel SB, Lam K. ChatGPT: The future of discharge summaries? Lancet Digit Health. 2023; 5(3):e107-e108. Available at: http://www.thelancet.com/article/S2589750023000213/fulltext. Accessed August 9, 2023.
Lyu Q, Tan J, Zapadka ME, et al. Translating radiology reports into plain language using ChatGPT and GPT-4 with prompt learning: Results, limitations, and potential. Vis Comput Ind Biomed Art. December 2023. Available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10192466/. Accessed August 9, 2023.
Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med. 2023; 388(13):1233-1239.
Robinson A, Aggarwal S. When precision meets penmanship: ChatGPT and surgery documentation. Cureus. 2023;15(6):e40546.
Janssen BV, Kazemier G, Besselink MG. The use of ChatGPT and other large language models in surgical science. BJS Open. 2023;7(2):zrad032.
Rad AA, Nia PS, Athanasiou T. ChatGPT: Revolutionizing cardiothoracic surgery research through artificial intelligence. Interdiscip Cardiovasc Thorac Surg. 2023; 36(6):ivad090.
Upperman JS, Meyers P. Viewpoint: For better or worse, ChatGPT is here to play. Bull Am Coll Surg. July 10, 2023. Available at: https://www.facs.org/for-medical-professionals/news-publications/news-and-articles/bulletin/2023/july-2023-volume-108-issue-7/for-better-or-worse-chatgpt-is-here-to-play/. Accessed August 9, 2023.
Ji Z, Lee N, Frieske R, et al. Survey of hallucination in natural language generation. ACM Comput Surv. 2022. Available at: https://arxiv.org/abs/2202.03629. Accessed August 9, 2023.
Else H. Abstracts written by ChatGPT fool scientists. Nature. 2023;613(7944):423.
Rafaqat W, Chu DI, Kaafarani HM. AI and ChatGPT meet surgery: A word of caution for surgeon-scientists. Ann Surg. 2023. Available at: https://pubmed.ncbi.nlm.nih.gov/37470174/. Accessed August 9, 2023.
Aronson S, Lieu TW, Scirica BM. Getting generative AI right. NEJM Catal. 2023. Available at: https://catalyst.nejm.org/doi/full/10.1056/CAT.23.0063. Accessed August 9, 2023.
ChatGPT guide: Use these prompt strategies to maximize your results. The Decoder. June 19, 2023. Available at: https://the-decoder.com/chatgpt-guide-prompt-strategies/. Accessed August 9, 2023.
How to do tables in ChatGPT. AI for Folks. Available at: https://aiforfolks.com/how-to-do-tables-in-chatgpt/. Accessed August 9, 2023.
El Atillah M. Getting the most out of ChatGPT: These are the most useful prompts to make your life easier. Euronews. June 25, 2023. Available at: https://www.euronews.com/next/2023/06/25/getting-the-most-out-of-chatgpt-these-are-the-most-useful-prompts-to-try-now. Accessed August 9, 2023.
Chomsky N. The false promise of ChatGPT. New York Times. March 8, 2023. Available at: https://www.nytimes.com/2023/03/08/opinion/noam-chomsky-chatgpt-ai.html. Accessed August 9, 2023.
Kaafarani HMA. A seventh domain of quality: Patient connectedness. Ann Surg. 2023;278(2):e209-e210.