Original Contribution

Validation of Consensus Panel Diagnosis in Dementia

Matthew J. Gabel, PhD; Norman L. Foster, MD; Judith L. Heidebrink, MD; Roger Higdon, PhD; Howard J. Aizenstein, MD, PhD; Steven E. Arnold, MD; Nancy R. Barbas, MD, MSW; Bradley F. Boeve, MD; James R. Burke, MD, PhD; Christopher M. Clark, MD; Steven T. DeKosky, MD; Martin R. Farlow, MD; William J. Jagust, MD; Claudia H. Kawas, MD; Robert A. Koeppe, PhD; James B. Leverenz, MD; Anne M. Lipton, MD, PhD; Elaine R. Peskind, MD; R. Scott Turner, MD, PhD; Kyle B. Womack, MD; Edward Y. Zamrini, MD

Author Affiliations: Department of Political Science, Washington University, St Louis, Missouri (Dr Gabel); Center for Alzheimer's Care, Imaging, and Research and Department of Neurology, University of Utah, Salt Lake City (Dr Foster); Department of Neurology, University of Michigan, Ann Arbor (Dr Heidebrink); Neurology Service, Department of Veterans Affairs Medical Center, Ann Arbor (Dr Heidebrink); and Seattle Children's Research Institute, Seattle, Washington (Dr Higdon); Department of Psychiatry, University of Pittsburgh, Pittsburgh, Pennsylvania (Dr Aizenstein); Departments of Neurology (Drs Barbas and Turner) and Radiology (Dr Koeppe), University of Michigan, Ann Arbor; Departments of Psychiatry (Dr Arnold) and Neurology (Drs Arnold and Clark), University of Pennsylvania, Philadelphia; Department of Neurology, Mayo Clinic, Rochester, Minnesota (Dr Boeve); Department of Neurology, Duke University, Durham, North Carolina (Dr Burke); School of Medicine, University of Virginia, Charlottesville (Dr DeKosky); Department of Neurology, Indiana University, Indianapolis (Dr Farlow); Departments of Neuroscience and Public Health, University of California at Berkeley, Berkeley (Dr Jagust); Departments of Neurology and Neurobiology and Behavior, University of California at Irvine, Irvine (Dr Kawas); Veterans Affairs, Puget Sound Health Care System (Drs Leverenz and Peskind) and Departments of Neurology (Dr Leverenz) and Psychiatry & Behavioral Sciences (Drs Leverenz and Peskind), University of Washington, Seattle; Departments of Neurology and Psychiatry (Dr Womack), University of Texas Southwestern, Dallas; Department of Neurology, Texas Health Presbyterian Hospital, Dallas (Dr Lipton); and Alzheimer's Care, Imaging and Research and Department of Neurology, University of Utah, Salt Lake City (Dr Zamrini). Dr Turner is now with the Memory Disorders Program, Department of Neurology, Georgetown University, Washington, DC.


Arch Neurol. 2010;67(12):1506-1512. doi:10.1001/archneurol.2010.301.

Background  The clinical diagnosis of dementing diseases largely depends on the subjective interpretation of patient symptoms. Consensus panels are frequently used in research to determine diagnoses when definitive pathologic findings are unavailable. Nevertheless, research on group decision making indicates that many factors can adversely affect panel performance.

Objective  To determine conditions that improve consensus panel diagnosis.

Design  Comparison of neuropathologic diagnoses with individual and consensus panel diagnoses based on clinical scenarios only, fludeoxyglucose F 18 positron emission tomography images only, and scenarios plus images.

Setting  Expert and trainee individual and consensus panel deliberations using a modified Delphi method in a pilot research study of the diagnostic utility of fludeoxyglucose F 18 positron emission tomography.

Patients  Forty-five patients with pathologically confirmed Alzheimer disease or frontotemporal dementia.

Main Outcome Measures  Statistical measures of diagnostic accuracy, agreement, and confidence for individual raters and panelists before and after consensus deliberations.

Results  The consensus protocol using trainees and experts surpassed the accuracy of individual expert diagnoses when clinical information elicited diverse judgments. In these situations, consensus was 3.5 times more likely to produce positive rather than negative changes in the accuracy and diagnostic certainty of individual panelists. A rule that forced group consensus was at least as accurate as majority and unanimity rules.

Conclusions  Using a modified Delphi protocol to arrive at a consensus diagnosis is a reasonable substitute for pathologic information. This protocol improves diagnostic accuracy and certainty when panelist judgments differ and is easily adapted to other research and clinical settings while avoiding the potential pitfalls of group decision making.


Many dementing diseases lack distinctive physical findings or validated biomarkers, thus making accurate clinical diagnosis challenging. Clinicians often must reach a diagnosis based solely on their judgment of informant history of variable quality and the relative prominence of deficits in specific cognitive domains. Because these subjective judgments understandably differ among individual clinicians, the accuracy and confidence of diagnoses also vary. Diagnostic criteria have been developed to provide guidance for clinicians, but applying these criteria also requires interpretation and judgment. Consequently, neuropathologic examination findings continue to be the standard criterion for determining the cause of a dementing illness.

The validity of research results depends on accurate diagnosis. Recognizing the limitations of individual clinician diagnoses, research studies often use the consensus of a panel when histopathologic information is unavailable.1,2 It is hoped that a panel will achieve greater diagnostic reliability, accuracy, and certainty than even an individual expert. Despite this hope, there has been little examination of consensus panel performance in determining the cause of dementia. The limited empirical evidence available suggests that consensus panel results may be suspect. For example, similarly composed medical panels often reach varying conclusions about the same sets of questions, raising serious doubts about panel reliability.3,4 In addition, theoretical and empirical studies of group decision making indicate that depending on their composition and procedures, consensus panels may not achieve highly accurate decisions.5 Consequently, the absence of strong evidence regarding the efficacy of consensus panels is a potentially serious problem for dementia research.

Bringing empirical evidence to bear on this question is complicated by the variety of consensus panel goals, memberships, and procedures currently in use. Given this variety, we need to identify effective panels and cannot simply assume that any single panel will be as accurate as others. For example, consensus panels can have different goals. Some are designed to identify only patients for whom a diagnosis is likely to be highly accurate, whereas others seek the best diagnosis for all patients, recognizing that accuracy may be higher in some situations than in others. Consensus panels also vary in their composition and organization. Members may include only a single specialty or may be multidisciplinary. Some panels include individuals who have personally examined the patient with the intent of ensuring the most direct and detailed information. Other panels explicitly exclude individuals with “special knowledge” of the patient out of concern that such individuals would exert disproportionate influence on group judgments and suppress independent analysis, which is the theoretical advantage of panel diagnosis.5 Furthermore, panel rules for arriving at a group diagnosis also are variable. For some, majority agreement is sufficient. For others, unanimity is expected or required. Finally, the panel may follow a rigorous protocol or be quite informal. Some simply determine whether there are objections to the individual physician judgment, whereas others expect each panelist to arrive at a diagnosis independently. Social science research shows that these aspects of panel organization affect the accuracy of consensus judgments.5

The Delphi method of consensus is a formal and rigorous procedure that incorporates organizational features that social science theory indicates promote accurate individual and group judgments.5-7 This method is commonly used to set professional priorities and establish guidelines, but the exact protocol can vary in panel size, the use of face-to-face discussion, and the number of iterations before a final decision is reached.8-10 The essential features of the Delphi method are (1) presentation of a uniform set of information to the panel (thus excluding individuals with unique special knowledge), (2) an initial independent decision of each panelist that is recorded and subsequently shared with others, (3) discussion of the recorded opinions of panelists, and (4) a final group decision. Votes are used to ensure independent judgments, and diversity of opinions is encouraged through panel membership and during discussions.

We took advantage of an opportunity to explore diagnostic performance of consensus panels provided by trials we conducted to examine the diagnostic utility of fludeoxyglucose F 18 positron emission tomography (FDG-PET).11 Consensus panels generally are convened only when there is no standard criterion available. In these trials, however, neuropathologic findings were available, and we undertook these studies to determine the extent to which consensus panel diagnosis might be a justifiable alternative to postmortem examination. In the United States, FDG-PET currently is reimbursed in dementia only when physicians find it difficult to distinguish Alzheimer disease (AD) from frontotemporal dementia (FTD). Thus, it was scientifically appropriate in these trials to restrict diagnostic options to these 2 possibilities. The requirement of a binary decision was fortuitous because it significantly simplified the analysis of panel performance. Diagnostic decisions inherently vary widely in difficulty, and repeated use of exactly the same decision in this study allowed us to evaluate key variables, including the diversity of diagnostic perspectives, the types of patient information reviewed, and the decision criteria for consensus. Although clinical diagnosis is complex and requires the consideration of multiple conditions, binary decisions are relevant to clinical practice. For example, after an extensive dementia evaluation, researchers often must make critical diagnostic judgments, choosing between only 2 of the most likely possibilities such as demented or nondemented, mild cognitive impairment or normal for age, AD or not AD, and AD or vascular dementia.

Two consensus panels, each composed of 6 panelists, and 6 additional individual raters reviewed clinical data to arrive at a diagnosis of AD or FTD. None of the panelists or raters had direct interaction with the patients being considered. Although panelists and raters were aware that patients had only 1 of 2 possible diagnoses, they did not know the proportion with each diagnosis.

PANEL CHARACTERISTICS

A “trainee” panel met twice and consisted of 6 physician trainees in specialties involved in dementia care from a single institution: 2 neurology residents, 2 geriatric medicine fellows, 1 psychiatry resident, and 1 geriatric psychiatry fellow. One of these trainees was present for the review of only 28 of the 45 patients. A second “expert” panel met 3 times at least 6 months apart and was composed of 6 physicians (4 neurologists and 2 geriatric psychiatrists) involved in dementia care and research at 1 of 4 National Institute on Aging–funded Alzheimer centers.

RATER CHARACTERISTICS

Distinct from the members of the panels, this study also involved 6 “raters”: dementia specialist neurologists, each with 10 to 25 years of experience in dementia care, 2 from each of 3 National Institute on Aging–funded Alzheimer centers. Raters arrived at a diagnosis based solely on their private consideration of the same patient information provided to the panels. They did not convene as a panel for discussion or share information with each other about their diagnoses. These raters provided a set of decisions by individual experts to compare with panel diagnoses.

PATIENT DATA

Clinical scenarios and FDG-PET images were evaluated from 45 patients with a postmortem examination documenting a histopathologic diagnosis of AD (n = 31) or FTD (n = 14) uncomplicated by other abnormalities, such as a stroke or a significant number of cortical Lewy bodies. Foster et al11 provide a full description of the pathologic findings in these cases, scenario development, imaging methods, and training of raters and panelists in image interpretation. Neuropsychological data were not included. Three sets of data were prepared for each patient: clinical scenario alone, FDG-PET images alone, and scenarios plus PET images. Patient data were labeled using random number identifiers, with a different series of random numbers used in each data set.

DIAGNOSTIC DELIBERATIONS

Consensus panel deliberations uniformly followed the RAND–University of California at Los Angeles modified Delphi procedure.12 Each set of data was presented on a different day and in a different patient order to keep panelists blinded to their previous diagnostic judgments. A panel leader organized the meeting and encouraged discussion but did not participate in discussion or voting. Panelists began by privately considering the information provided about each patient. They then marked a card indicating their diagnosis of AD or FTD and their level of confidence in that diagnosis (very confident, somewhat confident, or uncertain). The panel leader collected the cards and announced the “vote tally” (eg, 3 AD and 3 FTD) to the panel. At that point, the panelists were encouraged to discuss the case and their reasons for arriving at a specific diagnosis. During individual review and group deliberations of the clinical scenarios, we encouraged reference to published diagnostic criteria for AD and FTD,13-16 but we neither suggested nor imposed any rules regarding the interpretation of the criteria or individual patient information.

After discussion, panelists again marked a card in private indicating their diagnosis and diagnostic confidence. After these cards were collected, the group was asked to arrive at a final diagnosis. The panelists were not provided with a decision rule (eg, simple majority) but were told that they needed to return a decision for the panel. The leader then recorded the consensus decision, and the panel turned to the next patient and repeated the same procedure. There was no time limit for individual deliberation or group discussion. Research staff recorded the time taken for these deliberations and made qualitative observations.
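The deliberation cycle just described (private vote, announced tally, discussion, private revote, forced consensus) can be sketched as a small simulation. This is purely illustrative; the panelist voting functions and the majority tie-break standing in for the group's forced final decision are hypothetical, not part of the study protocol.

```python
from collections import Counter

def deliberate_case(panelists, case):
    """One case under a modified Delphi deliberation (illustrative sketch).

    Each panelist is a function (case, tally_or_None) -> diagnosis. The first
    vote is cast privately (tally is None); the second vote is cast after the
    leader announces the first-round tally and the panel discusses the case.
    """
    first = [vote(case, None) for vote in panelists]    # private first vote
    tally = Counter(first)                              # leader announces the tally
    second = [vote(case, tally) for vote in panelists]  # private second vote
    # Forced consensus: the group must return one diagnosis; here the
    # postdiscussion majority stands in for the group's final decision.
    consensus = Counter(second).most_common(1)[0][0]
    return first, second, consensus
```

For example, a panelist who is uncertain might defer to the announced majority on the second vote, which is the convergence pattern the study observed.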

Individual raters not involved in the panels reviewed the same 3 types of data as panelists and provided a diagnosis of AD or FTD and their level of confidence. In all, there were 810 diagnostic judgments by individual raters, 2126 judgments by individual panelists, and 180 consensus judgments by panels.

STATISTICAL ANALYSIS

Diagnostic judgments of raters, panelists, and the consensus panels were compared with the neuropathologic diagnoses (the reference standard). For each panel, we computed statistics for sensitivity, specificity, predictive value, and likelihood ratio. With only 2 diagnostic options, positive and negative predictive values were complementary, and sensitivity and specificity for FTD were reciprocal to those for AD. We used κ statistics to evaluate the reliability of consensus diagnoses across panels and the level of diagnostic agreement within panels. The degree of agreement was rated as fair (κ = 0.20-0.39), moderate (κ = 0.40-0.59), substantial (κ = 0.60-0.79), or almost perfect (κ = 0.80-1.0), according to convention.17 We analyzed consensus panel performance relative to that of raters and panelists by fitting logistic regression models to a binary variable representing correct diagnosis, with raters, panelists, and the consensus panel as covariates. This provides an estimate of the odds ratio that an expert was more accurate than the panel, which served as the reference category. The change in panelist diagnostic accuracy from before to after discussion in each panel was analyzed using logistic regression models fit to a binary response variable for whether the prediscussion or postdiscussion diagnosis was correct and included the timing of the diagnosis as a covariate (before or after discussion). The change in diagnostic confidence from prediscussion to postdiscussion was evaluated in a similar manner, fitting the model to a binary variable for whether the panelist was “very confident.” To determine the extent to which changes in panelists' diagnoses were beneficial, we estimated logistic regression models for all the panelists who changed their confidence or diagnosis from prediscussion to postdiscussion. We fit the model to a binary variable indicating whether a change was beneficial, defined as a shift to the correct diagnosis, an increase in confidence in a correct diagnosis, or a decrease in confidence in an incorrect diagnosis. The intercept provides an estimate of the log odds ratio that the change was beneficial.
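The two-rater κ used here compares observed agreement with the agreement expected by chance from each rater's marginal label frequencies. A minimal sketch of standard Cohen's κ (an illustration, not the authors' code):

```python
def cohen_kappa(rater1, rater2):
    """Cohen's kappa for two raters judging the same cases.

    kappa = (p_observed - p_chance) / (1 - p_chance); assumes the raters
    are not already in perfect chance agreement (p_chance < 1).
    """
    n = len(rater1)
    labels = set(rater1) | set(rater2)
    p_obs = sum(a == b for a, b in zip(rater1, rater2)) / n
    # chance agreement from each rater's marginal label frequencies
    p_chance = sum((rater1.count(l) / n) * (rater2.count(l) / n) for l in labels)
    return (p_obs - p_chance) / (1 - p_chance)
```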

Because diagnoses of the same case by different panelists or of different cases by the same panelists are potentially correlated, estimates of standard errors were adjusted to account for violations of standard independence assumptions. Where relevant, standard errors were also adjusted for the longitudinal nature of the prediscussion and postdiscussion data. Specifically, the standard errors of the statistical tests were adjusted using a robust covariance estimator that incorporated estimates of correlation between panelists and between patients.18 We then used the adjusted variance estimate to generate corrected P values. Also, where relevant, P values were adjusted for multiple tests using the Hochberg correction.19 McNemar χ2 tests were used to assess whether consensus diagnoses were more accurate than alternative methods of group diagnosis (eg, simple majority rule).
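The Hochberg step-up correction can be expressed as adjusted P values: working from the largest P value down, each adjusted value is the running minimum of (m − k + 1) times the k-th smallest P value. A sketch of the standard method (an assumed implementation, not the study's code):

```python
def hochberg_adjust(pvals):
    """Hochberg step-up adjusted p-values for a family of m tests.

    Processes p-values from largest to smallest, taking a running minimum
    so that the adjusted values stay monotone and never exceed 1.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running = 1.0
    for rank, i in enumerate(order):
        k = m - rank                              # ascending rank (1-based)
        running = min(running, (m - k + 1) * pvals[i])
        adjusted[i] = running
    return adjusted
```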

RELIABILITY, ACCURACY, AND CONFIDENCE OF DIAGNOSIS

The accuracy of the consensus diagnoses of the trainee and expert panels was superior to that of the individual diagnoses of their own members when considering clinical scenarios alone (Figure 1A). In general, the accuracy of the consensus diagnoses was superior to that of expert raters making individual judgments (Figure 1B). The consensus diagnoses were more accurate than the diagnoses of 10 of the 11 individual panelists and 5 of the 6 individual expert raters, and these differences often reached statistical significance. On average, the 12 experts individually performed better than the 5 trainee panelists, although after deliberation, the trainee and expert panels had the same diagnostic accuracy. Indeed, the trainee panel was statistically significantly more accurate than the individual opinions of 3 of the 6 expert panelists (eFigure).

Figure 1.

Comparison of diagnostic accuracy of panelists and panels (A) and raters and panels (B) reviewing clinical scenarios. Consensus panel diagnoses (black bars) based on scenarios were more accurate than were diagnoses arrived at by 10 of 11 individual panelists and 5 of 6 raters. This panel superiority was significant (peach bars) for all members of the trainee panel, 4 members of the expert panel, and 3 raters (P < .05, Hochberg corrected). Note that 1 member of the trainee panel did not review 17 cases and was thus omitted from this analysis.


Individual diagnostic accuracy and confidence were high with review of FDG-PET images with or without scenarios, and there was less individual variation in diagnoses. In these situations, panel accuracy was rarely superior to that of individual raters or expert panelists, and deliberations did not provide the same benefits seen with scenarios alone (Figure 2). Indeed, most individual experts had diagnostic accuracy the same as or higher than that of the panel.

Figure 2.

Comparison of diagnostic accuracy of expert raters and panelists and panels reviewing fludeoxyglucose F 18 positron emission tomography images. Consensus panel diagnoses (black bars) were more accurate than the diagnoses of 0 of 6 panelists and 2 of 6 raters when reviewing images and scenarios (A) and 2 of 6 panelists and 2 of 6 raters when reviewing images alone (B). Two panelists and 2 raters (peach bars) had significantly lower accuracy than each panel (P < .05, Hochberg corrected). In contrast, across the 2 panels, 6 of 12 panelists and 5 of 12 raters (in blue) had significantly greater accuracy than the panel diagnoses (P < .05, Hochberg corrected).


The consensus diagnoses ranged from 84% accurate when based exclusively on clinical scenarios to 89% when the diagnosis included review of FDG-PET images. The AD sensitivity and FTD specificity (89%-94%) were higher than the AD specificity and FTD sensitivity (71%-86%) (eTable 1). As expected from previous experience, the diagnostic accuracy of individuals and panels was lower for FTD than for AD. Contrary to the concerns of other researchers,3 the consensus judgments were highly reproducible across panels (2-way κ = 0.68-0.90) despite differences in panel memberships and diagnostic information reviewed (Figure 3).
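With only two diagnostic options, the complementarity noted in the statistical methods (AD sensitivity equals FTD specificity, and vice versa) follows directly from the 2 × 2 table. A small illustrative check with made-up diagnostic calls:

```python
def sens_spec(calls, truths, positive):
    """Sensitivity and specificity, treating `positive` as the target diagnosis."""
    pairs = list(zip(calls, truths))
    tp = sum(c == positive and t == positive for c, t in pairs)  # true positives
    fn = sum(c != positive and t == positive for c, t in pairs)  # false negatives
    tn = sum(c != positive and t != positive for c, t in pairs)  # true negatives
    fp = sum(c == positive and t != positive for c, t in pairs)  # false positives
    return tp / (tp + fn), tn / (tn + fp)
```

In a binary choice, every non-AD call is an FTD call, so FTD sensitivity is computed over exactly the cases that define AD specificity.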

Figure 3.

Panel diagnostic accuracy by patient. Each horizontal line represents a single patient. Panel diagnoses in agreement with neuropathologic diagnoses are shown in gray. Panel diagnostic errors are shown in peach. Panels often were in error in the same patients. The pairwise κ agreement (SE) between the diagnoses of the trainee and expert panels for scenarios was 0.79 (0.15); for trainee (scenario) and expert (images) panels, 0.68 (0.15); for trainee (scenario) and expert (scenario + images) panels, 0.79 (0.15); for expert (scenario) and expert (images) panels, 0.69 (0.15); and for expert (images) and expert (scenario + images) panels, 0.79 (0.15). AD indicates Alzheimer disease; FDG-PET, fludeoxyglucose F 18 positron emission tomography; and FTD, frontotemporal dementia.


Panelists' judgments tended to converge after discussion in all situations, as indicated by the increase in mean κ agreement scores within panels, and diagnostic confidence also increased (Table). This increase in agreement after deliberation was not uniformly associated with beneficial changes in diagnosis or confidence (eTable 2). Similar to the panel diagnoses, the salutary effect of the consensus process varied by type of diagnostic information. Panelists typically made beneficial changes when reviewing scenarios alone. These changes were predominantly due to panelists who were uncertain or only somewhat confident in their initial diagnoses (eTable 3). Similarly, panelists who were not very confident in their initial diagnosis accounted for all diagnostic changes when reviewing images. However, compared with reviewing scenarios alone, these changes were fewer in number and were typically not beneficial (eTable 3).

Table. Individual Panelist Accuracy, Confidence, and Agreement Before and After Panel Deliberation
EFFECT OF PANEL CONSENSUS RULES ON DIAGNOSTIC ACCURACY

After discussion and the second vote, the panel was asked to determine a single final consensus diagnosis. When 5 of 6 or 6 of 6 panelists agreed on a prediscussion diagnosis, this diagnosis was always adopted as the consensus diagnosis. The final diagnosis also never deviated from the majority diagnosis after discussion. As the threshold for consensus increased from 4 of 6 to unanimity, accuracy generally improved, although the gains were small and came at the expense of many patients going undiagnosed (eTable 4). Voting again after discussion allowed more patients to be diagnosed and by a larger majority of panelists. None of the alternative rules exhibited statistically significantly higher accuracy than the forced consensus rule (eTable 4).
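The tradeoff between accuracy and the share of patients diagnosed under different agreement thresholds can be made concrete with a sketch. The k-of-6 rules mirror those described here, but the votes and truths below are invented illustrations, not the study's data:

```python
from collections import Counter

def threshold_diagnosis(votes, k):
    """Adopt the modal diagnosis if at least k panelists agree; otherwise
    leave the patient undiagnosed (None). Illustrative k-of-6 rule."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= k else None

def rule_performance(panel_votes, truths, k):
    """Accuracy among diagnosed cases, and the fraction of cases diagnosed."""
    calls = [threshold_diagnosis(v, k) for v in panel_votes]
    decided = [(c, t) for c, t in zip(calls, truths) if c is not None]
    coverage = len(decided) / len(truths)
    accuracy = sum(c == t for c, t in decided) / len(decided) if decided else 0.0
    return accuracy, coverage
```

Raising k toward unanimity tends to raise accuracy while lowering coverage, which is the pattern summarized in eTable 4.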

In general, discussion caused panelists to converge around the prediscussion majority diagnosis, regardless of whether that diagnosis was correct or incorrect. The only exceptions were 3 cases in the trainee panel in which discussion led to a change from an incorrect simple-majority diagnosis to a correct majority diagnosis. There were no instances of discussion changing a correct prediscussion majority diagnosis into an incorrect postdiscussion majority diagnosis. As a result, the forced consensus rule and the postdiscussion 4 of 6 majority rule for final diagnosis differed only in that the forced consensus rule yielded a diagnosis for the 6 cases, across all panels, in which panelists remained split 3-3 postdiscussion. In these 6 cases, the panel was correct 3 times.

DURATION OF PANEL DELIBERATIONS

The duration of panel discussions varied considerably from case to case. Trainee panel discussions of scenarios (mean, 5 minutes; range, 0-15 minutes) were similar in length to expert panel discussions of the same information (mean, 4 minutes; range, 1-15 minutes). The time expended on discussions involving images was substantially less (expert panel discussions of images alone: mean, 2 minutes; range, 0-9 minutes; and images with scenario: mean, 2 minutes; range, 0-7 minutes).

The modified Delphi protocol resulted in reliable consensus diagnoses across panels of varying expertise and diagnostic information. The expertise of individuals does not negate the benefit of consensus; consensus improved the accuracy of nonexpert and expert panelists alike. When reviewing only clinical scenarios, trainee and expert consensus panel diagnoses were typically as accurate as or more accurate than individual expert diagnoses. In addition, the consensus process led panelists to improve the accuracy of their individual diagnoses. Thus, when reviewing scenarios, a modified Delphi protocol for consensus panels provided diagnoses sufficiently accurate to serve as a reasonable substitute when histopathologic information is unavailable.

In contrast, consensus diagnoses when reviewing FDG-PET images, with or without scenarios, were rarely better than those of individual experts, and panelists typically made adverse diagnostic changes after deliberation. What accounts for this variation in performance? These results are consistent with social science research on group decision making and the conditions under which consensus should be of value.20 A key determinant of the benefit of consensus is the diversity of individual panelist judgments. When reviewing clinical scenarios exclusively, the trainee and expert panelists were evaluating a type of information familiar to them and to which they could apply their own idiosyncratic diagnostic experience in reaching their judgments. In contrast, FDG-PET images offered relatively little room for variation in interpretation. As a result, the panelists demonstrated higher interrater agreement when reviewing images than when reviewing the clinical scenario alone (Table). This lower diversity reduced panel performance, both in the accuracy of consensus diagnoses relative to individual diagnoses (Figure 2) and in the number and quality of diagnostic changes made by panelists (eTables 2 and 3).

Thus, a critical issue for application of this modified Delphi protocol is to ensure that the panels have sufficient diversity. The selection of an appropriate panel requires identifying panelists who are likely to make different errors in judgment.20 Sources of such diversity include variation in clinical training, medical specialty, and experience with particular socioeconomic, ethnic, and racial groups. These factors are particularly important when relying on the rich variety of information provided by a detailed clinical history.

PRACTICAL IMPLICATIONS

Review of the literature raises concerns about many of the consensus procedures currently in use in dementia research. Other consensus procedures may not provide results as positive as those of the modified Delphi protocol used in this study. The limitations of other consensus methods may not be readily apparent to investigators because there often is a high pretest probability of a single diagnosis; in that situation, diagnostic errors will change autopsy confirmation rates only slightly. In this study, by contrast, the pretest probability of FTD was unknown to the raters but was considerably higher than in many AD research studies and thus provided an informative setting for assessing consensus.

Properly constituted consensus panels are time consuming, expensive, and laborious to organize. In situations where resources are limited, these results suggest some steps that could increase efficiency without major loss of diagnostic accuracy. For panels designed to accurately diagnose all patients in a study, the best protocol involves forced consensus after deliberation. However, when the panelists' initial judgments are unanimous or nearly so, simply adopting that position as the consensus judgment provides similar accuracy. Indeed, if the costs of conducting deliberation are particularly high, one might also consider lower majority thresholds applied to the panelists' initial diagnoses. To the extent that the panel seeks to identify patients with highly accurate diagnoses and has little regard for the share of patients diagnosed, a high-threshold rule without discussion is appropriate.

It is important to note that panels also confer professional legitimacy that typically does not accompany an individual judgment. Thus, to the extent that the legitimacy and accuracy of the judgment are important to a particular question, the cost of the modified Delphi protocol may well be justified, even if the improvement in judgment accuracy is modest.

Diversity of opinion is important for realizing the potential benefits of consensus panels, and panel membership should be multidisciplinary whenever feasible. It might be helpful for individuals with personal information about patients to present data for consideration and respond to questions, but including them on the diagnostic panel is problematic because it could discourage diverse opinions voiced by those without “special knowledge.” This study demonstrates the value of open discussion among equals using identical patient data.

POTENTIAL LIMITATIONS

Given the variety of consensus panels, these findings may not generalize to other settings. The clinical scenarios reviewed in this study were not based on a comprehensive longitudinal prospective study and varied considerably in the number of examinations, the detail and length of the medical record, and the quality of the medical history. Although this reflects many clinical situations, restricting data to an initial visit may provide more limited or ambiguous diagnostic information and would likely cause more panelist error than observed in this study. In contrast, prospectively collected comprehensive longitudinal data would probably produce less error because diagnostic accuracy improves with longitudinal information.21,22 We can only speculate as to whether diagnostic accuracy would be affected by a change in the quantity or quality of patient information. Nevertheless, as long as panelists can independently review and interpret the patient information, we would expect a benefit from consensus panels. Although not a desirable setting, situations that provide limited and ambiguous information likely would cause more individual diagnostic errors and provide a greater opportunity for improvement using consensus methods. Likewise, an expanded set of diagnostic choices is likely to reduce the reliability of consensus diagnosis but could result in even stronger performance of panels relative to individuals than was found in this study.

Eventually, validated biomarkers may make interpretation of clinical data less subjective. Until that elusive goal is achieved for dementing diseases, consensus diagnosis following a carefully considered protocol that allows for diverse opinion and deliberations involving a multidisciplinary panel without special knowledge will be an appropriate approach to maximizing diagnostic accuracy.

Correspondence: Matthew J. Gabel, PhD, Department of Political Science, Washington University, One Brookings Dr, St Louis, MO 63130-4899 (mgabel@artsci.wustl.edu).

Accepted for Publication: June 9, 2010.

Author Contributions: Drs Gabel and Foster had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. Study concept and design: Foster, Gabel, Heidebrink, and Lipton. Acquisition of data: Aizenstein, Arnold, Barbas, Burke, Clark, DeKosky, Farlow, Foster, Gabel, Heidebrink, Jagust, Kawas, Leverenz, Lipton, Peskind, Turner, Womack, and Zamrini. Analysis and interpretation of data: Arnold, Boeve, DeKosky, Farlow, Foster, Gabel, Heidebrink, Higdon, Koeppe, and Lipton. Drafting of the manuscript: Gabel and Foster. Critical revision of the manuscript for important intellectual content: Aizenstein, Arnold, Barbas, Boeve, Burke, Clark, DeKosky, Farlow, Foster, Gabel, Heidebrink, Higdon, Jagust, Kawas, Koeppe, Leverenz, Lipton, Peskind, Turner, Womack, and Zamrini. Statistical analysis: Gabel, Higdon, and Koeppe. Obtaining funding: Foster and Turner. Administrative, technical, and material support: Arnold, Boeve, Burke, DeKosky, Farlow, Foster, and Heidebrink. Study supervision: DeKosky, Gabel, and Foster. Member of consensus panel: Aizenstein.

Financial Disclosure: None reported.

Funding/Support: This work was supported by grant AG22394 from the National Institutes of Health (NIH); by an anonymous private donation to the Center for Alzheimer's Care, Imaging, and Research; by pilot cooperative project grant AG16976 from the National Alzheimer's Coordinating Center; and by the following NIH Alzheimer's Disease Research Centers: University of Michigan (AG08671), University of California at Davis (AG10129), University of Pennsylvania (AG10124), University of California at Irvine (AG16573), Duke University (AG0238377), Indiana University (AG10133), University of Pittsburgh (AG05133), and University of Texas Southwestern (AG12300).

Online-Only Materials: The eFigure and eTables are available at http://www.archneurol.com.

Additional Contributions: David E. Kuhl, MD; Sid Gilman, MD; Henry Buchtel, PhD; David Knesper, MD; R. Scott Turner, MD, PhD; and Kirk Frey, MD, PhD, made images from their research available for this study; Charles DeCarli, MD, contributed as a site investigator for grant AG22394 from the National Institutes of Health; Peijun Chen, Charles Davies, Shelley Hershner, and Joseph O. Nnodim served on the pilot trainee panel; and Jeff Gill, PhD; Ryan Moore, PhD; and Diana O’Brien, MA, provided valuable statistical advice.

REFERENCES

1. Ott A, Stolk RP, van Harskamp F, Pols HA, Hofman A, Breteler MM. Diabetes mellitus and the risk of dementia: the Rotterdam Study. Neurology. 1999;53(9):1937-1942.
2. Lopez OL, Becker JT, Klunk W, et al. Research evaluation and diagnosis of probable Alzheimer's disease over the last two decades: I. Neurology. 2000;55(12):1854-1862.
3. Shekelle PG, Kahan JP, Bernstein SJ, Leape LL, Kamberg CJ, Park RE. The reproducibility of a method to identify the overuse and underuse of medical procedures. N Engl J Med. 1998;338(26):1888-1895.
4. Huttin C. The use of clinical guidelines to improve medical practice: main issues in the United States. Int J Qual Health Care. 1997;9(3):207-214.
5. Gabel MJ, Shipan CR. A social choice approach to expert consensus panels. J Health Econ. 2004;23(3):543-564.
6. Jones J, Hunter D. Consensus methods for medical and health services research. BMJ. 1995;311(7001):376-380.
7. Robert G, Milne R. A Delphi study to establish national cost-effectiveness research priorities for positron emission tomography. Eur J Radiol. 1999;30(1):54-60.
8. Fick DM, Cooper JW, Wade WE, Waller JL, Maclean JR, Beers MH. Updating the Beers criteria for potentially inappropriate medication use in older adults: results of a US consensus panel of experts. Arch Intern Med. 2003;163(22):2716-2724.
9. Drasković I, Vernooij-Dassen M, Verhey F, Scheltens P, Rikkert MO. Development of quality indicators for memory clinics. Int J Geriatr Psychiatry. 2008;23(2):119-128.
10. Olde Rikkert MG, van der Vorm A, Burns A, et al. Consensus statement on genetic research in dementia. Am J Alzheimers Dis Other Demen. 2008;23(3):262-266.
11. Foster NL, Heidebrink JL, Clark CM, et al. FDG-PET improves accuracy in distinguishing frontotemporal dementia and Alzheimer's disease. Brain. 2007;130(pt 10):2616-2635.
12. Brook RH, Chassin MR, Fink A, Solomon DH, Kosecoff J, Park RE. A method for the detailed assessment of the appropriateness of medical technologies. Int J Technol Assess Health Care. 1986;2(1):53-63.
13. McKhann G, Drachman D, Folstein M, Katzman R, Price D, Stadlan EM. Clinical diagnosis of Alzheimer's disease: report of the NINCDS-ADRDA Work Group under the auspices of Department of Health and Human Services Task Force on Alzheimer's Disease. Neurology. 1984;34(7):939-944.
14. Neary D, Snowden JS, Gustafson L, et al. Frontotemporal lobar degeneration: a consensus on clinical diagnostic criteria. Neurology. 1998;51(6):1546-1554.
15. Lund and Manchester Groups. Clinical and neuropathological criteria for frontotemporal dementia. J Neurol Neurosurg Psychiatry. 1994;57(4):416-418.
16. McKhann GM, Albert MS, Grossman M, Miller B, Dickson D, Trojanowski JQ; Work Group on Frontotemporal Dementia and Pick's Disease. Clinical and pathological diagnosis of frontotemporal dementia: report of the Work Group on Frontotemporal Dementia and Pick's Disease. Arch Neurol. 2001;58(11):1803-1809.
17. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159-174.
18. Andrews DWK. Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica. 1991;59(3):817-858.
19. Hochberg Y, Benjamini Y. More powerful procedures for multiple significance testing. Stat Med. 1990;9(7):811-818.
20. Page SE. The Difference. Princeton, NJ: Princeton University Press; 2007.
21. Becker JT, Boller F, Lopez OL, Saxton J, McGonigle KL. The natural history of Alzheimer's disease: description of study cohort and accuracy of diagnosis. Arch Neurol. 1994;51(6):585-594.
22. Litvan I, Agid Y, Sastry N, et al. What are the obstacles for an accurate clinical diagnosis of Pick's disease? a clinicopathologic study [published correction appears in Neurology. 1997;49(6):1755]. Neurology. 1997;49(1):62-69.

Figures

Figure 1. Comparison of diagnostic accuracy of panelists and panels (A) and raters and panels (B) reviewing clinical scenarios. Consensus panel diagnoses (black bars) based on scenarios were more accurate than were diagnoses arrived at by 10 of 11 individual panelists and 5 of 6 raters. This panel superiority was significant (peach bars) for all members of the trainee panel, 4 members of the expert panel, and 3 raters (P < .05, Hochberg corrected). Note that 1 member of the trainee panel did not review 17 cases and was thus omitted from this analysis.

Figure 2. Comparison of diagnostic accuracy of expert raters and panelists and panels reviewing fludeoxyglucose F 18 positron emission tomography images. Consensus panel diagnoses (black bars) were more accurate than the diagnoses of 0 of 6 panelists and 2 of 6 raters when reviewing images and scenarios (A) and 2 of 6 panelists and 2 of 6 raters when reviewing images alone (B). Two panelists and 2 raters (peach bars) had significantly lower accuracy than each panel (P < .05, Hochberg corrected). In contrast, across the 2 panels, 6 of 12 panelists and 5 of 12 raters (in blue) had significantly greater accuracy than the panel diagnoses (P < .05, Hochberg corrected).

Figure 3. Panel diagnostic accuracy by patient. Each horizontal line represents a single patient. Panel diagnoses in agreement with neuropathologic diagnoses are shown in gray. Panel diagnostic errors are shown in peach. Panels often were in error in the same patients. The pairwise κ agreement (SE) between the diagnoses of the trainee and expert panels for scenarios was 0.79 (0.15); for trainee (scenario) and expert (images) panels, 0.68 (0.15); for trainee (scenario) and expert (scenario + images) panels, 0.79 (0.15); for expert (scenario) and expert (images) panels, 0.69 (0.15); and for expert (images) and expert (scenario + images) panels, 0.79 (0.15). AD indicates Alzheimer disease; FDG-PET, fludeoxyglucose F 18 positron emission tomography; and FTD, frontotemporal dementia.
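The pairwise agreement statistics reported for Figure 3 are Cohen's κ values: observed agreement corrected for the agreement expected by chance. As a minimal sketch of how such a pairwise κ is computed (the two panels' diagnoses below are invented for illustration, not study data):

```python
from collections import Counter

def cohens_kappa(a, b):
    # kappa = (observed agreement - chance agreement) / (1 - chance agreement),
    # where chance agreement comes from each rater's marginal label frequencies.
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb.get(k, 0) for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical diagnoses by two panels for 10 patients (labels invented)
trainee = ["AD", "AD", "FTD", "AD", "FTD", "AD", "AD", "FTD", "AD", "FTD"]
expert  = ["AD", "AD", "FTD", "AD", "AD",  "AD", "AD", "FTD", "AD", "FTD"]
print(round(cohens_kappa(trainee, expert), 2))  # → 0.78
```

By the Landis and Koch benchmarks cited in the article, κ values of 0.61 to 0.80, like the interpanel values reported above, indicate substantial agreement.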

Tables

Table. Individual Panelist Accuracy, Confidence, and Agreement Before and After Panel Deliberation



Multimedia

Validation of Consensus Panel Diagnosis in Dementia. Arch Neurol. 2010;67(12):1506-1512.

eFigure. Comparison of diagnostic accuracy of expert panelists and the trainee panel.

eTable 1. Panel Diagnostic Performance

eTable 2. Overall Effect of Consensus Panel Deliberation on Panelists' Judgments

eTable 3. Diagnostic Confidence and Changes in Diagnostic Accuracy

eTable 4. Panel Accuracy Under Different Majority Thresholds
