August 4, 2023
The assessment of learners is a critical part of modern surgical training. Educators and researchers tasked with this responsibility must understand validity theory to appropriately assess learners, interpret outcomes, and make and justify decisions about competence, selection, and advancement. Unfortunately, outdated concepts surrounding validity persist in and muddle the medical education literature.1,2 This article introduces contemporary validity theory to surgical educators and education researchers who design assessments or interpret and act on their results.
Validity theory has evolved over several decades. Classical frameworks from the 1950s viewed validity as a property of the assessment tool and described multiple types, including content, construct, and face validity. These frameworks were replaced by Samuel Messick’s contemporary conceptualization, first described in the 1980s and later incorporated into the Standards for Educational and Psychological Testing, a consensus standard of professional societies in education, psychology, and measurement. Since then, scholars such as Michael Kane have proposed additional contemporary frameworks.3
All these contemporary frameworks view validity as a quality of the evidence that educators and researchers provide to support how they interpreted and used the results of an assessment. When educators and researchers interpret or use the results of an assessment in a particular way, they essentially make an argument that the interpretation or use is appropriate. To support that argument, they can gather and present evidence in a process known as validation.3 Thus, validity is the degree to which that evidence supports their argument.
In this way, validity is no longer a property specific to an assessment tool. Instead, the validity of an interpretation or action is specific to the context in which the assessment was used. For example, the Fundamentals of Laparoscopic Surgery—an assessment tool that measures the construct of laparoscopic skill—can be administered to different subjects (for example, medical students versus interns) and for different purposes (for example, feedback versus certification).4 An assessment tool given in a unique context (i.e., to specific subjects and for specific purposes) thus produces unique results. The arguments (interpretations and uses) based on those results, and the evidence gathered to support those arguments, are therefore specific to that context rather than to the tool itself.
With this background in mind, Table 1 provides some questions to consider along with example answers from a surgical validation study.5 Asking and answering these questions helps uncover the assumptions and intentions that inform the validation process.
Table 1.

| Questions to Consider | Ensuring Competency in Open Aortic Aneurysm Repair—Development and Validation of a New Assessment Tool5 |
| --- | --- |
| What is the construct being measured? | Basic technical skills to perform an open AAA repair |
| What is the assessment tool? | Using OPERATE, an OSATS-based tool, the subjects will perform a simulation-based open AAA repair |
| Who are the subjects being assessed? | Novice and experienced vascular surgeons |
| What is the intended purpose of the assessment tool? | To assess if the subjects have sufficient technical skills for performing an open AAA repair before starting supervised training on patients |
| What will be the interpretation of the results? | If a subject scores at or above the pass/fail standard on the OPERATE assessment tool, then they are qualified for supervised training on patients |
Despite the evolution of validity theory in the broader education sphere, outdated concepts persist in the medical education literature.1,2 Below, we contrast some commonly used but outdated concepts (in bold) with a contemporary understanding.
**“I am using a validated assessment tool from the literature.”**
This example highlights two outdated concepts. First, describing the tool itself as “validated” treats validity as a property of the tool rather than of the interpretations and uses of its results. Second, pointing to prior literature implies that validation is a completed, one-time event, when validity evidence must instead be gathered for each new context (i.e., specific subjects and purposes) in which the tool is used.
**“Is this interpretation valid or not?”**
This question views validity as a binary outcome: present or absent. Contemporary frameworks instead view validity as a continuum, so the better question is: “Is there sufficient validity evidence to support this interpretation?”
**“We demonstrated face, content, and construct validity for this assessment tool.”**
While classical frameworks describe distinct types of validity, contemporary frameworks view validity as a unified quality. Furthermore, the use of face validity has been discouraged by most education scholars.2,3
While contemporary frameworks share key principles, they differ on the evidence required. We will focus on Messick’s framework, as it is endorsed by the Standards for Educational and Psychological Testing.6 Table 2 defines the five sources of validity evidence in Messick’s framework and provides examples collected in the surgical validation study referenced above.
Table 2.

| Sources | Definition | Description of the Evidence from Nayahangan et al. 20205 |
| --- | --- | --- |
| Content | Evidence used to show that the assessment tool measures the construct and that it is both representative (includes everything it’s supposed to) and relevant (excludes anything irrelevant) | The OPERATE assessment tool was developed by subject-matter experts (experienced surgeons and an education scientist) who uniformly agreed on what skills to assess and how to assess them |
| Response process | Evidence used to show that how the subjects respond to the assessment items (i.e., the thoughts and actions behind their responses) and how raters evaluate those responses are consistent and fit with what is intended | For consistency, the same authors administered the simulation to all subjects, and all raters received rater training to understand the assessment |
| Internal structure | Evidence used to show that scores on assessment items that measure the same construct have the intended relationship to each other (a worked sketch follows this table) | The internal consistency of the OPERATE assessment tool was high (Cronbach’s alpha = .92) |
| Relationship to other variables | Evidence used to show that scores on the construct measured by the assessment tool have the intended relationship to other constructs | Scores from the OPERATE assessment tool correlated with the subjects’ experience level (novice vs. experienced) |
| Consequences | Evidence used to show the consequences of the assessment’s results, e.g., (1) does the interpretation of the results maximize benefits and minimize harm, and (2) what are the intended and unintended consequences of that interpretation? | Based on the pass/fail standard created for the OPERATE tool, 6% of the novices and 33% of the experts passed. A low pass rate among novices fits the intended use (a credible pass/fail assessment), while the low pass rate among the experienced group could be an unintended consequence |
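To make two of these evidence sources more concrete, consider how internal-structure and relationship-to-other-variables evidence are commonly computed. Cronbach’s alpha summarizes internal consistency as alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The minimal Python sketch below applies both calculations to invented item ratings; the data, scale, and code are illustrative assumptions, not the analysis from Nayahangan et al.5

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha = k/(k-1) * (1 - sum(item variances) / variance(totals))."""
    k = item_scores.shape[1]                         # number of items
    item_variances = item_scores.var(axis=0, ddof=1) # variance of each item
    total_variance = item_scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Invented ratings: 6 subjects x 4 items on a 1-5 OSATS-style scale.
scores = np.array([
    [2, 1, 2, 2],   # novice
    [1, 2, 1, 2],   # novice
    [2, 2, 3, 2],   # novice
    [4, 5, 4, 4],   # experienced
    [5, 4, 5, 5],   # experienced
    [4, 4, 5, 4],   # experienced
])
experience = np.array([0, 0, 0, 1, 1, 1])  # 0 = novice, 1 = experienced

# Internal structure: do items measuring the same construct hang together?
print(f"Cronbach's alpha = {cronbach_alpha(scores):.2f}")

# Relationship to other variables: do total scores track experience level?
# (Pearson r with a binary grouping variable is the point-biserial correlation.)
totals = scores.sum(axis=1)
r = np.corrcoef(totals, experience)[0, 1]
print(f"Correlation between total score and experience: r = {r:.2f}")
```

In a real validation study, these statistics would be computed on the full rating data and then interpreted against the intended use of the scores, not reported as stand-alone numbers.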
For a deeper exploration of validity evidence, contemporary frameworks, and other information about validity, refer to Cook and Hatala’s 2016 article or chapter 2 in the Assessment in Health Professions Education textbook.3,7 Finally, given the growing diversity of surgical trainees, surgical educators and researchers should consider issues of diversity, equity, and inclusion and ask whether the assessment of learners and the arguments made from its results account for these issues. For more information on diversity and the validity of assessments, we refer readers to additional resources.8,9
Surgical educators and researchers should have a clear understanding of contemporary principles of validity theory. Such an understanding is vital not only to the accurate communication of assessment research but also to the thoughtful evaluation and education of surgical trainees.
Research Fellow, Department of Surgery, Massachusetts General Hospital
Resident, Department of Surgery, Massachusetts General Hospital
Health Professions Education Researcher, Massachusetts General Hospital
Director of Health Professions Education Research, Massachusetts General Hospital
Professor of Surgery, Department of Surgery, Massachusetts General Hospital
Massachusetts General Hospital
Corresponding author: grashid@mgh.harvard.edu