Thursday, December 26, 2024

No, CXR for Pediatric Pneumonia Does NOT have a 98% Negative Predictive Value


I was reading the current issue of the NEJM today and got to the article called Chest Radiography for Presumed Pneumonia in Children - it caught my attention as a medical decision making article. It's of the NEJM genre that poses a clinical quandary, and then asks two discussants each to defend a different management course. (Another memorable one was on whether to treat subsegmental PE.) A couple of things struck me about the discussants' remarks about CXRs for kids with possible pneumonia. The first discussant says that "a normal chest radiograph reliably rules out a diagnosis of pneumonia." That is certainly not true in adults where the CXR has on the order of 50-70% sensitivity for opacities or pneumonia. So I wondered if kids are different from adults. The second discussant then remarked that CXR has a 98% negative predictive value for pneumonia in kids. This number really should get your attention. Either the test is very very sensitive and specific, or the prior probability in the test sample was very low, something commonly done to inflate the reported number. (Or, worse, the number is wrong.) I teach trainees to always ignore PPV and NPV in reports and seek out the sensitivity and specificity, as they cannot be fudged by selecting a high or low prevalence population. It then struck me that this question of whether or not to get a CXR for PNA in kids is a classic problem in medical decision making that traces its origins to Ledley and Lusted (Science, 1959) and Pauker and Kassirer's Threshold Approach to Medical Decision Making. Surprisingly, neither discussant made mention of or reference to that perfectly applicable framework (but they did self-cite their own work). Here is the Threshold Approach applied to the decision to get a CT scan for PE (Klein, 2004) that is perfectly analogous to the pediatric CXR question. I was going to write a letter to the editor pointing out that 44 years ago the NEJM published a landmark article establishing a rational  framework for analyzing just this kind of question, but I decided to dig deeper and take a look at this 2018 paper in Pediatrics that both discussants referenced as the source for the NPV of 98% statistic.

In order to calculate the 98% NPV, we need to look at the n=683 kids in the study and see which cells they fall into in a classic epidemiological 2x2 table. The article's Figure 2 is the easiest way to get those numbers:



(Note that they exclude n=42 kids who were treated with antibiotics for other conditions despite not being diagnosed with pneumonia; I'm honestly unsure what else to do with those kids, so like the authors, I exclude them in the 2x2 tables below.) Here is a refresher on a 2x2 contingency table:
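
For the code-minded, here is the same refresher as a minimal Python sketch (the cell labels follow the usual convention; the function and its name are mine, and you can plug in whatever cell counts you extract from the paper's Figure 2):

```python
def two_by_two(a, b, c, d):
    """Standard 2x2 diagnostic-test metrics.
    a = test+/disease+ (true positive)   b = test+/disease- (false positive)
    c = test-/disease+ (false negative)  d = test-/disease- (true negative)"""
    sens = a / (a + c)                  # P(test+ | disease+)
    spec = d / (b + d)                  # P(test- | disease-)
    ppv  = a / (a + b)                  # P(disease+ | test+): depends on prevalence
    npv  = d / (c + d)                  # P(disease- | test-): depends on prevalence
    return {"sens": sens, "spec": spec, "PPV": ppv, "NPV": npv,
            "LR+": sens / (1 - spec), "LR-": (1 - sens) / spec,
            "prevalence": (a + c) / (a + b + c + d)}
```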


Here is the 2x2 table we can construct using the numbers from Figure 2 in the paper, before the follow-up of the 5 kids that were diagnosed with pneumonia 2 weeks later:



And here is the 2x2 table that accounts for the 5 kids that were initially called "no pneumonia" but were diagnosed with pneumonia within the next two weeks. Five from cell "d" (bottom right) must be moved to cell "c" (bottom left) because they were CXR-/PNA- kids that were moved into the CXR-/PNA+ column after the belated diagnosis:



The NPV has fallen trivially from 90% to 89%, but why are both so far away from the authors' claim of 98%? Because the authors conveniently ignored the 44 kids with an initially negative CXR who were nonetheless called PNA by the physicians in cell "c". They surely should be counted because, despite a negative CXR, they were still diagnosed with PNA, just 2 weeks earlier than the 5 that the authors concede were false negatives; there is no reason to make a distinction between these two groups of kids, as they all have clinically diagnosed pneumonia with a "falsely negative" CXR (cell "c").

It is peculiar - rather, astonishing - that the NPV in this study, still being touted and referenced as a pivot for decision making, was miscalculated despite peer review. And while you may be tempted to say that 89% is pretty close to 98%, you would be making a mistake. Using the final sensitivity and specificity from this 2x2 table, we can calculate the LR+ and LR- for CXR as a test for PNA: they are 10.8 and 0.26. We can also see from this table that the rate (some may say "prevalence") of PNA in this sample is 32%. What is the posterior probability of PNA based on the "correct" numbers if the pre-test probability (or the rate or prevalence of pneumonia) is 65% instead of 32%? The calculator in the Status Iatrogenicus sidebar can be used to easily calculate it: the NPV in that case is 68%, and of course 1-NPV (the output of the calculator, chosen to emphasize the residual probability of disease in the presence of a negative test) is 32%. A 32% probability of pneumonia is still far above the treatment threshold. By that I mean, if my child had a 32% probability of pneumonia, I would want them treated. (Because antibiotics are pretty benign, bro; resistance happens at the poultry farm.)
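
To make the arithmetic explicit, here is a minimal sketch of the odds-form Bayes calculation in Python, using only the numbers quoted in this paragraph (the function name is mine):

```python
def post_test_prob(pretest, lr):
    """Pre-test probability -> post-test probability via a likelihood ratio (odds form of Bayes)."""
    post_odds = pretest / (1 - pretest) * lr
    return post_odds / (1 + post_odds)

LR_NEG = 0.26   # LR- for CXR from the corrected 2x2 table above

# At the study's ~32% rate of PNA: residual probability after a negative CXR ~11%, i.e., NPV ~89%
print(round(post_test_prob(0.32, LR_NEG), 2))
# At a 65% pre-test probability: residual probability ~33% (the ~32% quoted above), i.e., NPV ~67-68%
print(round(post_test_prob(0.65, LR_NEG), 2))
```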

There are more fundamental problems. As in child abuse studies, there is a circular logic here: the kid has pneumonia because the doc says he has pneumonia, but the doc knows the CXR shows "pneumonia"; the diagnosis of PNA then leads to the CXR finding being classified as a true positive. How many of the pneumonia diagnoses were true/false positives/negatives? We can't know, because we have no gold standard for pneumonia, just as we have no gold standard for child abuse - we are guessing which cells the numbers go in. This points to another violation of a basic assumption of diagnostic test evaluation: the determination of disease status must be made independently of the test result (otherwise incorporation bias results). Here, there is very clearly dependence, because the docs are making the pneumonia determination on the basis of the CXR. The study design is fundamentally flawed, and so are all conclusions that ramify from it.

I'm always a little surprised when I go digging into the studies that people bandy about as "evidence" for this and that, as I frequently find that they're either misunderstood, misrepresented, or just plain wrong. I can readily imagine a pediatrician (resident or attending) telling me with high confidence that the CXR can "rule out" pneumonia in my kid, because her attendings told her that on the basis of the 2018 Lipsett study, and yet none of them ever looked any deeper into the actual study to find its obvious mistakes and shortcomings.

As they say, "Trust, but verify." Or perhaps more apropos here: "Extraordinary claims require extraordinary evidence." An NPV of 98% (for CXR!) is an extraordinary claim indeed. The evidence for it, however, is not extraordinary. As a trusted mentor once told me "Scott, don't believe everything you read."

ETA: You can get a 98% NPV using the sensitivity and specificity from the Lipsett data (despite the erroneous assumptions that inhere in them) by using a prevalence of pneumonia of just 7%. To wit: if you want to get to a posterior probability of PNA of 2% (corresponding to the reported 98% NPV in the Lipsett study), you need to start with a population in which only 7 of 100 kids have pneumonia, and you need to do a CXR on all of them; the CXR will pick up about 5 of those cases, leaving only about 2 undetected among the negatives. 100 CXRs later, the unrecognized pneumonia cases in the cohort have fallen from 7 to 2. Is it worth it to do 100 CXRs to avoid 5 courses of antibiotics? We could perform a formal Threshold analysis to answer this question, but apparently that was not the point of the "Clinical Decisions" section of this week's NEJM; rather, it was to highlight reference 1, which turns out to have conclusions based on a miscalculation.
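
As a check on that 7% figure, you can invert the odds-form calculation and ask what prevalence leaves a 2% residual probability of disease after a negative test with an LR- of about 0.26. A minimal sketch (function name mine):

```python
def prevalence_for_residual(lr_neg, residual):
    """Prevalence at which a negative test leaves `residual` probability of disease (NPV = 1 - residual)."""
    prior_odds = (residual / (1 - residual)) / lr_neg
    return prior_odds / (1 + prior_odds)

print(round(prevalence_for_residual(0.26, 0.02), 3))   # ~0.073: ~7% prevalence yields a 98% NPV
```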

Monday, October 28, 2024

Hickam's Dictum: Let's Talk About These Many Damn Diseases


This post is about our article on Hickam's dictum, just published online (open access!) today.

I don't know if I read the 2004 NEJM CPC that mentioned Hickam's Dictum (HD - "A patient can have as many diseases as he damn well pleases") and popularized it, but knowing me, I probably did. My interest in HD has been piqued over the past 5-10 years because it has been increasingly invoked in complex cases, and whenever this happened, I always thought it was more likely that a unifying diagnosis (in accordance with Ockham's razor) was present, and that the case had not yet been completely solved. So, we set out to investigate HD formally using three lines of evidence that you can read about in the article. We learned much more than we were able to report in the article because of word limitations, so I will report other interesting findings and insights gained along the way here (if you want to skip over the summary info to the "other interesting findings", scroll to the bold "other insights" subheading).

First, a summary of our results. As should be obvious, we confirmed that patients get multiple diagnoses, but case reports purporting to instantiate HD did not document random diagnoses occurring in a patient - there was a pattern to their occurrence. The vast majority of the time, there was a primary diagnosis which explained the patient's chief complaint and acute presentation, as well as one or more of the following: 
  1. An incidentaloma (about 30% of cases)
  2. A pre-existing, already known condition (about 25% of cases)
  3. A component of a unifying diagnosis (about 40% of cases)
  4. A symptomatic, coincident, independent disease, unrelated to the primary diagnosis, necessary to fully explain the acute presentation (about 4% of cases)
As we explain in the discussion, finding an incidentaloma during investigation of the chief complaint represents a spurious coincidence. Finding a new disease superimposed upon a chronic one and being surprised suggests that clinicians anchored to the chronic condition and forgot that new diseases can be superimposed on it; e.g., they failed to recognize that having recurrent CHF does not preclude development of CAP this admission. When authors report a primary disease and its complications, epiphenomena, or underlying cause, they appear to have failed to realize that those are all components of a unifying diagnosis! In the three (3!) cases we categorized as #4, we were actually being generous; honestly, these probably represent one of the other categories, but we didn't have sufficient information to confirm that confidently.

Tuesday, September 26, 2023

The Fallacy of the Fallacy of the Single Diagnosis: Post-publication Peer Review to the Max!


Prepare for the Polemic.

Months ago, I stumbled across this article in Medical Decision Making called "The Fallacy of a Single Diagnosis" by Don Redelmeier and Eldar Shafir (hereafter R&S). In it they purport to show, using vignettes given to mostly lay people, that people have an intuition that there should be a single diagnosis, which, they claim, is wrong, and they attempt to substantiate this claim using a host of references. I make the following observations and counterclaims:

  1. R&S did indeed show that their respondents thought that having one virus, such as influenza (or EBV, or GAS pharyngitis), decreases the probability of having COVID simultaneously
  2. Their respondents are not wrong - having influenza does decrease the probability of a diagnosis of COVID
  3. R&S's own references show that their respondents were normative in judging a reduced probability of COVID if another respiratory pathogen was known to be present
My coauthor and I then submitted this now published letter "The Verity of a Unifying Diagnosis" enumerating and elucidating the many glaring deficiencies in the work, if it is to be taken as evidence that the notion of a "single" diagnosis is a fallacy. (Properly applied to medical decision making, Ockham's razor guides the clinician to search for a "unifying" diagnosis, not a single one.) R&S responded with "Persistent Challenges to a Single Diagnosis". Their response betrays persistent challenges in their understanding of our letter, their data, and the notion of a single diagnosis versus a unifying diagnosis. For the record and in the name of post-publication peer review, I will expound upon these issues here, paragraph by paragraph.

In paragraph 1, they claim that, since we limited our analysis of their supporting references to only coinfections with COVID and influenza (our Table 1), our analysis may be misleading because they also included mononucleosis (EBV) and strep throat in their vignettes. I will point out that a clear majority of their vignettes - 3 out of 5 - used influenza; their references had too few cases of mono and strep throat for us to analyze in aggregate; and we had only 1200 words and limited patience to parse their references any further than our already comprehensive Table 1.

The fundamental - and monumental - message of our Table 1 is that R&S played loose and careless with their references, and failed to recognize that - as our Table 1 clearly demonstrates - their very references undermine the main premise of their paper. 

They do not address this glaring problem directly. Their response amounts to "Yeah, but, we had cases with more than just influenza." If I were taken to task over such a haphazard reference list, I would own it, but I guess that's difficult to do without a major correction or a retraction of the paper.

Monday, June 26, 2023

Anchored on Anchoring: A Concept Cut from Whole Cloth


Welcome back to the blog. An article published today in JAMA Internal Medicine was just the impetus I needed to return after more than a year.

Hardly a student of medicine who has trained in the past 10 years has not heard of "anchoring bias" or anchoring on a diagnosis. What is this anchoring? Customarily in cognitive psychology, to demonstrate a bias empirically, you design an experiment that shows directional bias in some response, often by introducing an irrelevant (independent) variable, e.g., a reference frame, as we did here and here. Alternatively, you could show bias if responses deviate from some known truth value, as we did here. What does not pass muster is to simply say "I think there is a bias whereby..." and write an essay about it.

That is what happened 20 years ago when an expository essay by Croskerry proposed "anchoring" as a bias in medical decision making, which he ostensibly named after the "anchoring and adjustment" heuristic demonstrated by Kahneman and Tversky (K&T) in experiments published in their landmark 1974 Science paper. The contrast between "anchoring to a diagnosis" (A2D) and K&T's anchoring and adjustment (A&A) makes it clear why I bridle so much at the former. 

To wit: First, K&T showed A&A via an experiment with an independent (and irrelevant) variable. They had participants in this experiment spin a dial on a wheel with associated numbers, like on the Wheel of Fortune game show. (They did not know that the dial was rigged to land on either 10 or 65.) They were then asked whether the number of African countries that are members of the United Nations was more or less than that number; and then to give their estimate of the number of member countries. The numerical anchors, 10 and 65, biased responses. For the group of participants whose dials landed on 10, their estimates were lower, and for the other group (65), they were higher. 

Sunday, May 22, 2022

Common Things Are Common, But What is Common? Operationalizing The Axiom

"Prevalence [sic: incidence] is to the diagnostic process as gravity is to the solar system: it has the power of a physical law." - Clifton K Meador, A Little Book of Doctors' Rules


We recently published a paper with the same title as this blog post here. The intent was to operationalize the age-old "common things are common" axiom so that it is practicable to employ it during the differential diagnosis process to incorporate probability information into DDx. This is possible now in a way that it never has been before because there are now troves of epidemiological data that can be used to bring quantitative (e.g., 25 cases/100,000 person-years) rather than mere qualitative (e.g., very common, uncommon, rare, etc) information to bear on the differential diagnosis. I will briefly summarize the main points of the paper and demonstrate how it can be applied to real-world diagnostic decision making.

First is that the proper metric for "commonness" is disease incidence (standardized as cases/100,000 person-years), not disease prevalence. Incidence is the number of new cases per year - those that have not been previously diagnosed - whereas prevalence is the number of already diagnosed cases. If the disease is already present, there is no diagnosis to be made (see article for more discussion of this). Prevalence is approximately equal to the product of incidence and disease duration, so it will be higher (oftentimes a lot higher) than incidence for diseases with a chronic component; this will lead to overestimation of the likelihood of diagnosing a new case. Furthermore, your intuitions about disease commonness are mostly based on how frequently you see patients with the disease (e.g., SLE), but most of these are prevalent, not incident, cases, so you will think SLE is more common than it really is, diagnostically. If any of this seems counterintuitive, see our paper for details (email me for a pdf copy if you can't access it).
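
A quick illustration of the incidence-prevalence relationship described above, with hypothetical numbers:

```python
incidence = 5 / 100_000          # new cases per person-year (hypothetical chronic disease)
mean_duration_years = 20         # average duration of the diagnosed, prevalent state
prevalence = incidence * mean_duration_years
print(prevalence * 100_000)      # ~100 per 100,000: 20x the pool of new, diagnosable cases each year
```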

Second is that commonness exists on a continuum spanning 5 or more orders of magnitude, so it is unwise to dichotomize diseases as common or rare; information is lost in doing so. If you need a rule of thumb though, it is this: if the disease you are considering has single-digit (or lower) incidence per 100,000 p-y, that disease is unlikely to be the diagnosis out of the gate (before ruling out more common diseases). Consider that you have approximately a 15% chance of ever personally diagnosing a pheochromocytoma (incidence <1/100,000 P-Y) during an entire 40-year career, as there are only 2000 cases diagnosed per year in the USA, and nearly one million physicians in a position to initially diagnose them. (Note also that if you're rounding with a team of 10 physicians, and a pheo gets diagnosed, you can't each count this as an incident diagnosis of pheo. If it's a team effort, you each diagnosed 1/10th of a pheochromocytoma. This is why "personally diagnosing" is emphasized above.) A variant of the common things axiom states "uncommon presentations of common diseases are more common than common presentations of uncommon diseases" - for more on that, see this excellent paper about the range of presentations of common diseases.

Third is that you cannot take a raw incidence figure and use it as a pre-test probability of disease. The incidence in the general population does not represent the incidence of diseases presenting to the clinic or the emergency department. What you can do, however, is take what you do know about patients presenting with a clinical scenario, and about general population incidence, and make an inference about relative likelihoods of disease. For example, suppose a 60-year-old man presents with fever, hemoptysis, and a pulmonary opacity that may be a cavity on CXR. (I'm intentionally simplifying the case so that the fastidious among you don't get bogged down in the details.) The most common cause of this presentation, hands down, is pneumonia. But it could also represent GPA (formerly Wegener's, every pulmonologist's favorite diagnosis for hemoptysis) or TB (tuberculosis, every medical student's favorite diagnosis for hemoptysis). How could we use incidence data to compare the relative probabilities of these 3 diagnostic possibilities?


Suppose we were willing to posit that 2/3rds of the time we admit a patient with fever and opacities, it's pneumonia. Using that as a starting point, we could then do some back-of-the-envelope calculations. CAP has an incidence on the order of 650/100k P-Y; GPA and TB have incidences on the order of 2 to 3/100k P-Y - CAP is 200-300x more common than these two zebras. (Refer to our paper for history and references about the "zebra" metaphor.) If CAP occupies 65% of the diagnostic probability space (see image and this paper for an explication), then it stands to reason that, ceteris paribus (and things are not always ceteris paribus), TB and GPA each occupy on the order of 1/200th to 1/300th of 65%, or roughly 0.25% of the probability space. From an alternative perspective, a provider will admit 200 cases of pneumonia for every case of TB or GPA she admits - there's just more CAP out there to diagnose! Ask yourself if this passes muster - when you are admitting to the hospital for a day, how many cases of pneumonia do you admit, and when is the last time you yourself admitted and diagnosed a new case of GPA or TB? Pneumonia is more than two orders of magnitude more common than GPA and TB and, barring a selection or referral bias, there just aren't many of the latter to diagnose! If you live in a referral area of one million people, there will only be 20-30 cases of GPA diagnosed in that locale in a year (spread amongst hospitals/clinics), whereas there will be thousands of cases of pneumonia.
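
Here is that back-of-the-envelope arithmetic spelled out; the incidence figures are the rough ones quoted above, and the 65% share of the probability space is the posit in the text, so treat the output as an order-of-magnitude sketch:

```python
# Rough incidences quoted above, per 100,000 person-years ("2 to 3" taken as 2.5 for GPA and TB)
incidence = {"CAP": 650, "GPA": 2.5, "TB": 2.5}

p_cap = 0.65   # posit: CAP occupies ~65% of the diagnostic probability space for this presentation
for dx in ("GPA", "TB"):
    share = p_cap * incidence[dx] / incidence["CAP"]
    print(f"{dx}: ~{share:.2%} of the probability space")   # ~0.25% each, ceteris paribus
```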

As a parting shot, these are back-of-the-envelope calculations, and their several limitations are described in our paper. Nonetheless, they provide grounding for understanding the inertial pull of disease frequency in diagnosis. Thus, the other day I arrived in the AM to hear that a patient was admitted with supposed TTP (thrombotic thrombocytopenic purpura) overnight. With an incidence of about 0.3 per 100,000 PY, that is an extraordinary claim - a needle in the haystack has been found! - so, without knowing anything else, I wagered that the final diagnosis would not be TTP. (Not knowing anything else about the case, I was understandably squeamish about giving long odds against it, so I wagered at even odds, a $10 stake.) Alas, the final diagnosis was vitamin B12 deficiency (with an incidence on the order of triple digits per 100k PY), with an unusual (but well recognized) presentation that mimics TTP and MAHA.

Incidence does indeed have the power of a physical law; and as Hutchison said in an address in 1928, the second  commandment of diagnosis (after "don't be too clever") is "Do not diagnose rarities." Unless of course the evidence demands it - more on that later.

Saturday, October 30, 2021

CANARD: Coronavirus Associated Nonpathogenic Aspergillus Respiratory Disease

Cavitary MSSA disease in COVID

One of the many challenges of pulmonary medicine is unexplained dyspnea which, after an extensive investigation, has no obvious cause excepting "overweight and out of shape" (OWOOS). It can occur in thin and fit people too, and can also have a psychological origin: psychogenic polydyspnea, if you will. This phenomenon has been observed for quite some time. Now, overlay a pandemic and the possibility of "long COVID". Inevitably, people who would have been considered to have unexplained dyspnea in previous eras will come to attention after being infected with (or thinking they may have been infected with) COVID. Without making any comment on long COVID, it is clear that some proportion of otherwise unexplained dyspnea that would have come to attention had there been no pandemic will now come after COVID and thus be blamed on it. (Post hoc non ergo propter hoc.) When something is widespread (e.g., COVID), we should be careful about drawing connections between it and other highly prevalent (and pre-existing) phenomena (e.g., unexplained dyspnea).

Since early in the pandemic, reports have emerged of a purported association between aspergillus pulmonary infection and COVID. There are several lines of biological reasoning being used to support the association, including an analogy with aspergillosis and influenza (an association itself open to debate - I don't believe I've seen a case in 20 years), and sundry arguments predicated upon the immune and epithelial disturbances wrought by COVID. Many of these reports include patients who have both COVID and traditional risk factors for invasive pulmonary aspergillosis (IPA), viz, immunosuppression. The association in these patients is between immunosuppression and aspergillus (already known), in the setting of a pandemic; whether or not COVID adds additional risk would require an epidemiological investigation in a cohort of immunosuppressed patients with and without COVID. To my knowledge, such a study has not been done. Instead, the reports are like this one, where there were allegedly 42 patients among 279 in the discovery cohort with CAPA (Coronavirus Associated Pulmonary Aspergillosis; see Figure 1), and of those 42, 23 were immunosuppressed (Table 1). In addition, 5 had rheumatological disease and 3 had solid organ malignancy. In the validation cohort, there were 21 suspected CAPA cases among 209 patients; among these 21, there were at least 6 and perhaps 9 who were immunosuppressed (Table 2).

There are other problems (many of which we outlined here: Aspergillosis in the ICU: Hidden Enemy or Bogeyman?). The strikingly high proportion of COVID patients with aspergillosis being reported - a whopping 15% - includes patients with three levels of diagnostic certainty: proven, probable, and possible. I understand the difficulties inherent in the diagnosis of this disease. I also understand that relying on nonspecific tests for diagnosis will result in many false positives, especially in low base rate populations. Therefore it is imperative that we carefully establish the base rate, and that is not what most of these studies are doing. Rather they are - apparently intentionally - commingling traditional risk factors and COVID, leading to what is surely a gross overestimation of the incidence of true IPA in patients with COVID. Thus we warned:

We worry that if the immanent methodological limitations of this and similar studies are not adequately acknowledged—they are not listed among the possible explanations for the results enumerated by the editorialist (2)—an avalanche of testing for aspergillosis in ICUs may ensue, resulting in an epidemic of overdiagnosis and overtreatment. We caution readers of this report that it cannot establish the true prevalence of Aspergillus infection in patients with ventilator-associated pneumonia in the ICU, but it does underscore the fact that when tests with imperfect specificity are applied in low-prevalence cohorts, most positive results are false positives (10).
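
To make the last point of that passage concrete, here is a minimal sketch with hypothetical test characteristics - these are illustrative numbers, not the measured performance of any particular aspergillus assay:

```python
def ppv(sens, spec, prev):
    true_pos = sens * prev
    false_pos = (1 - spec) * (1 - prev)
    return true_pos / (true_pos + false_pos)

# Hypothetical test: 85% sensitive, 85% specific (illustrative only)
for prev in (0.02, 0.05, 0.15):
    print(f"true prevalence {prev:.0%}: PPV {ppv(0.85, 0.85, prev):.0%}")
# ~10%, ~23%, ~50%: at low base rates, most positive results are false positives
```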

In the discovery cohort of the study linked above, Figure 1 shows that only 6 patients of 279 (2%) had "proven" disease, 4 with "tracheobronchitis" (it is not mentioned how this unusual manifestation of IPA was diagnosed; presumably via biopsy; if not, it should have counted as "probable" disease; see these guidelines), and 2 others meeting the proven definition. The remaining 36 patients had probable CAPA (32) and possible CAPA (4). In the validation cohort, there were 21 of 209 patients with alleged CAPA, all of them in the probable category (2 with probable tracheobronchitis, 19 with probable CAPA). Thus the reported prevalence (sic: incidence) of IPA in patients with COVID is a hodgepodge of different diseases across a wide spectrum of diagnostic certainty.

Future studies should - indeed, must - exclude patients on chronic immunosuppression and those with immunodeficiency from these cohorts and also describe the specific details that form the basis of the diagnosis and diagnostic certainty category. Meanwhile, clinicians should recognize that the purported 15% incidence rate of CAPA/IPA in COVID comprises mostly immunosuppressed patients and patients with probable or possible - not proven - disease. Some proportion of alleged CAPA is actually a CANARD.

Wednesday, April 14, 2021

Bias in Assessing Cognitive Bias in Forensic Pathology: The Dror Nevada Death Certificate "Study"

Following the longest hiatus in the history of the Medical Evidence Blog, I return to issues of forensic medicine, by happenstance alone. In today's issue of the NYT is this article about bias in forensic medicine, spurred by interest in the trial concerning the murder of George Floyd. Among other things, the article discusses a recently published paper in the Journal of Forensic Sciences for which there were calls for retraction by some forensic pathologists. According to the NYT article, the paper showed that forensic pathologists have racial bias, a claim predicated upon an analysis of death certificates in Nevada and a survey study of forensic pathologists, using a methodology similar to one I have used in studying physician decisions and bias (viz, randomizing recipients to receive one of two forms of a case vignette that differ in the independent variable of interest). The remainder of this post will focus on that study, which is sorely in need of some post-publication peer review.

The study was led by Itiel Dror, PhD, a Harvard-trained psychologist now at University College London who studies bias, with a frequent focus on forensic medicine, if my cursory search is any guide. The other authors are a forensic pathologist (FP) at University of Alabama Birmingham (UAB), an FP and coroner in San Luis Obispo, California, a lawyer with the Clark County public defender's office in Las Vegas, Nevada, a PhD psychologist from Towson University in Towson, Maryland, an FP who is proprietor of a forensics company and a part-time medical examiner for West Virginia, and an FP who is proprietor of a forensics and legal consulting company in San Francisco, California. The purpose of identifying the authors was to try to understand why the analysis of death certificates was restricted to the state of Nevada. Other than one author's residence there, I cannot understand why Nevada was chosen, and the selection is not justified in the paltry methods section of the paper.

Sunday, February 16, 2020

Misunderstanding and Misuse of Basic Clinical Decision Principles among Child Abuse Pediatricians

The previous post about Dr. Cox, ensnared in a CPT (Child Protection Team) witch hunt in Wisconsin, has led me to evaluate several more research reports on child abuse, including SBS (shaken baby syndrome), AHT (abusive head trauma), and sentinel injuries.  These reports are rife with critical assumptions, severe limitations, and gross errors which greatly limit the resulting conclusions in most studies I have reviewed.  However, one study that was pointed out to me today takes the cake.  I don't know what the prevalence of this degree of misunderstanding is, but CPTs and child abuse pediatricians need to make sure they have a proper understanding of sensitivity, specificity, positive and negative predictive value, base rates, etc.  And they should not be testifying about the probability of child abuse at all if they don't have this stuff down cold. And I think this means that some proportion of them needs to go back to school or stop testifying.

The article and associated correspondence at issue is entitled The Positive Predictive Value of Rib Fractures as an Indicator of Nonaccidental Trauma in Children published in 2004.  The authors looked at a series of rib fractures in children at a single Trauma Center in Colorado during a six year period and identified all patients with a rib fracture.  They then restricted their analysis to children less than 3 years of age.  There were 316 rib fractures among just 62 children in the series; the average number of rib fractures per child is ~5.  The proper unit of analysis for a study looking at positive predictive value is children, sorted into those with and without abuse, and with and without rib fracture(s) as seen in the 2x2 tables below.

Tuesday, January 28, 2020

Bad Science + Zealotry = The Wisconsin Witch Hunts. The Case of John Cox, MD

John Cox, MD
I stumbled upon a very disturbing report on NBC News today of a physician couple in Wisconsin accused of abusing their adopted infant daughter.  It is surreal and horrifying and worth a read - not because these physicians abused their daughter, but because they almost assuredly did not.  One driving force behind the case appears to be a well-meaning and perfervid, if misguided and perfidious, pediatrician at University of Wisconsin who, with her group, coined the term "sentinel injuries" (SI) to describe small injuries such as bruises and oral injuries that they posit portend future larger scale abuse.  It was the finding of SI on the adopted infant in the story that in part led to charges of abuse against the father, Dr. Cox, got his child put in protective services, got him arrested, and threatens his career.  Interested readers can reference the link above for sordid and sundry details of the case.

Before delving into the 2013 study in Pediatrics upon which many contentions about SI rest, we should start with the fundamentals.  First, is it plausible that the thesis is correct, that before serious abuse, minor abuse is detectable by small bruises or oral injuries?  Of course it is, and it sounds like a good narrative.  But being a good plausible narrative does not make it true and it is likewise possible that bruises seen in kids who are and are not abused reflect nothing more than accidental injuries from rolling off a support, something falling or dropping on them, somebody dropping them, a sibling jabbing at them with a toy, and a number of things.  To my knowledge, the authors offer no direct evidence that the SIs they or others report have been directly traced to abuse.  They are doing nothing more than inferring that facial bruising is a precursor to Abusive Head Trauma (AHT), and based on their bibliography they have gone out of their way to promote this notion.

Thursday, December 5, 2019

Noninferiority Trials of Reduced Intensity Therapies: SCORAD trial of Radiotherapy for Spinal Metastases


No mets here just my PTX
A trial in JAMA this week by Hoskin et al (the SCORAD trial) compared two different intensities of radiotherapy for spinal metastases.  This is a special kind of noninferiority trial, which we wrote about last year in BMJ Open.  When you compare the same therapy at two intensities using a noninferiority trial, you are in perilous territory.  This is because if the therapy works on a dose-response curve, it is almost certain, a priori, that the lower dose is actually inferior - if you consider inferior to represent any statistically significant difference disfavoring a therapy.  (We discuss this position, which goes against the CONSORT grain, here.)  You only need a big enough sample size.  This may be OK, so long as you have what we call in the BMJ Open paper "a suitably conservative margin of noninferiority."  Most margins of noninferiority (delta) are far from this standard.
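
The "big enough sample size" point can be sketched with a simple normal approximation and hypothetical success rates; this is an illustration of the statistical logic, not a reanalysis of SCORAD:

```python
from math import sqrt, erf

def one_sided_p(z):
    """P(Z > z) for a standard normal."""
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))

# Hypothetical: the lower-intensity arm is truly, but only slightly, worse (70% vs 73% success)
p_low, p_high = 0.70, 0.73
for n_per_arm in (200, 1000, 5000, 20000):
    se = sqrt(p_low * (1 - p_low) / n_per_arm + p_high * (1 - p_high) / n_per_arm)
    z = (p_high - p_low) / se
    print(n_per_arm, round(one_sided_p(z), 4))
# n=200 -> p~0.25; n=20,000 -> p<0.0001: with enough patients, a small true deficit becomes
# statistically significant "inferiority," which is why the margin (delta) must carry the clinical judgment.
```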

The results of the SCORAD trial were consistent with our analysis of 30+ other noninferiority trials of reduced intensity therapies, and the point estimate favored - you guessed it - the more intensive radiotherapy.  This is fine.  It is also fine that the 1-sided 95% confidence interval crossed the 11% prespecified margin of noninferiority (P=0.06).  That just means you can't declare noninferiority.  What is not fine, in my opinion, is that the authors suggest that we look at how little overlap there was, basically an insinuation that we should consider it noninferior anyway.  I crafted a succinct missive to point this out to the editors, but alas I'm too busy to submit it and don't feel like bothering, so I'll post it here for those who like to think about these issues.

To the editor:  Hoskin et al report results of a noninferiority trial comparing two intensities of radiotherapy (single fraction versus multi-fraction) for spinal cord compression from metastatic cancer (the SCORAD trial)1.  In the most common type of noninferiority trial, investigators endeavor to show that a novel agent is not worse than an established one by more than a prespecified margin.  To maximize the chances of this, they generally choose the highest tolerable dose of the novel agent.  Similarly, guidelines admonish against underdosing the active control comparator as this will increase the chances of a false declaration of noninferiority of the novel agent2,3.  In the SCORAD trial, the goal was to determine if a lower dose of radiotherapy was noninferior to a higher dose. Assuming radiotherapy is efficacious and operates on a dose response curve, the true difference between the two trial arms is likely to favor the higher intensity multi-fraction regimen.  Consequently, there is an increased risk of falsely declaring noninferiority of single fraction radiotherapy4.  Therefore, we agree with the authors’ concluding statement that “the extent to which the lower bound of the CI overlapped with the noninferiority margin should be considered when interpreting the clinical importance of this finding.”  The lower bound of a two-sided 95% confidence interval (the trial used a 1-sided 95% confidence interval) extends to 13.1% in favor of multi-fraction radiotherapy.  Because the outcome of the trial was ambulatory status, and there were no differences in serious adverse events, our interpretation is that single fraction radiotherapy should not be considered noninferior to a multi-fraction regimen, without qualifications.

1. Hoskin PJ, Hopkins K, Misra V, et al. Effect of Single-Fraction vs Multifraction Radiotherapy on Ambulatory Status Among Patients With Spinal Canal Compression From Metastatic Cancer: The SCORAD Randomized Clinical Trial. JAMA. 2019;322(21):2084-2094.
2. Piaggio G, Elbourne DR, Pocock SJ, Evans SW, Altman DG, for the CONSORT Group. Reporting of noninferiority and equivalence randomized trials: extension of the CONSORT 2010 statement. JAMA. 2012;308(24):2594-2604.
3. Jones B, Jarvis P, Lewis JA, Ebbutt AF. Trials to assess equivalence: the importance of rigorous methods. BMJ. 1996;313(7048):36-39.
4. Aberegg SK, Hersh AM, Samore MH. Do non-inferiority trials of reduced intensity therapies show reduced effects? A descriptive analysis. BMJ Open. 2018;8(3):e019494.

Saturday, November 23, 2019

Pathologizing Lipid Laden Macrophages (LLMs) in Vaping Associated Lung Injury (VALI)

It's time to weigh in on an ongoing debate being waged in the correspondence pages of the NEJM.  To wit, what is the significance of lipid laden macrophages (LLMs) in VALI?  As we stated, quite clearly, in our original research letter,

"Although the pathophysiological significance of these lipid-laden macrophages and their relation to the cause of this syndrome are not yet known, we posit that they may be a useful marker of this disease.3-5 Further work is needed to characterize the sensitivity and specificity of lipid-laden macrophages for vaping-related lung injury, and at this stage they cannot be used to confirm or exclude this syndrome. However, when vaping-related lung injury is suspected and infectious causes have been excluded, the presence of lipid-laden macrophages in BAL fluid may suggest vaping-related lung injury as a provisional diagnosis."
There, we outlined the two questions about their significance:  1.) any relation to the pathogenesis of the syndrome; and 2.) whether, after characterizing their sensitivity and specificity, they can be used in diagnosis.  I am not a lung biologist, so I will ignore the first question and focus on the second, where I actually do know a thing or two.

We still do not know the sensitivity or specificity of LLMs for VALI, but we can make some wagers based on what we do know.  First, regarding sensitivity.  In our ongoing registry at the University of Utah, we have over 30 patients with "confirmed" VALI (if you don't have a gold standard, how do you "confirm" anything?), and to date all but one patient had LLMs in excess of 20% on BAL.  For the first several months we bronched everybody.  So, in terms of BAL and LLMs, I'm guessing we have the most extensive and consistent experience.  Our sensitivity therefore is over 95%.  In the Layden et al WI/IL series in NEJM, there were 7 BAL samples and all 7 had "lipid Layden macrophages" (that was a pun).  In another Utah series, Blagev et al reported that 8 of 9 samples tested showed LLMs.  Combining those data (ours are not yet published, but soon will be) we can state the following:  "Given the presence of VALI, the probability of LLM on Oil Red O staining (OROS) is 96%."  You may recognize that as a statement of sensitivity.  It is unusual to not find LLMs on OROS of BAL fluid in cases of VALI, and because of that, their absence makes the case atypical, just as does the absence of THC vaping.  Some may go so far as to say their absence calls into question the diagnosis, and I am among them.  But don't read between the lines.  I did not say that bronchoscopy is indicated to look for them.  I simply said that their absence makes the case atypical and calls it into question.
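
For what it's worth, pooling the three series quoted above (and treating "over 30" as 30 for our registry) gives a rough sensitivity estimate with a Wilson 95% interval; a sketch, nothing more:

```python
from math import sqrt

def wilson_ci(k, n, z=1.96):
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Utah registry (~30 cases, all but one LLM-positive) + Layden et al (7/7) + Blagev et al (8/9)
k, n = 29 + 7 + 8, 30 + 7 + 9
lo, hi = wilson_ci(k, n)
print(f"pooled sensitivity {k}/{n} = {k/n:.0%}, 95% CI {lo:.0%} to {hi:.0%}")   # ~96%, roughly 85% to 99%
```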

Sunday, September 1, 2019

Pediatrics and Scare Tactics: From Rock-n-Play to Car Safety Seats

Is sleeping in a car seat dangerous?
Earlier this year, the Fisher-Price company relented to pressure from the AAP (American Academy of Pediatrics) and recalled 4.7 million Rock 'n Play (RnP) baby rockers, which now presumably occupy landfills.  This recommendation stemmed from an "investigation" by Consumer Reports showing that since 2011, 32 babies died while sleeping in the RnP.  These deaths are tragic, but what do they mean?  In order to make sense of this "statistic" we need to determine a rate based on the exposure period, something like "the rate of infant death in the RnP is 1 per 10 million RnP-occupied hours."  Then we would compare it to the rate of infant death sleeping in bed.  If it were higher, we would have a starting point for considering whether, ceteris paribus, maybe it's the RnP that is causing the infant deaths.  We would want to know the ratio of observed deaths in the RnP to expected deaths sleeping in some other arrangement for the same amount of time.  Of course, even if we found the observed rate was higher than the expected rate, other possibilities exist, i.e., it's an association, a marker for some other factor, rather than a cause of the deaths.  A more sophisticated study would, through a variety of methods, try to control for those other factors, say, socioeconomic status, infant birth weight, and so on.  The striking thing to me and other evidence-minded people was that this recall did not even use the observed versus expected rate, or any rate at all!  Just a numerator!  We could do some back-of-the-envelope calculations with some assumptions about rate ratios, but I won't bother here.  Suffice it to say that we had an infant son at that time and we kept using the RnP until he outgrew it and then we gave it away.

Last week, the AAP was at it again, playing loose with the data but tight with recommendations based upon them.  This time, it's car seats.  In an article in the August, 2019 edition of the journal Pediatrics, Liaw et al present data showing that, in a cohort of 11,779 infant deaths, 3% occurred in "sitting devices", and in 63% of this 3%, the sitting device was a car safety seat (CSS).  In the deaths in CSSs, 51.6% occurred in the child's home rather than in a car.  What was the rate of infant death per hour in the CSS?  We don't know.  What is the expected rate of death for the same amount of time sleeping, you know, in the recommended arrangement?  We don't know!  We're at it again - we have a numerator without a denominator, so no rate and no rate to compare it to.  It could be that 3% of the infant deaths occurred in car seats because infants are sleeping in car seats 3% of the time!
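
To see how much the missing denominator matters, here is a back-of-the-envelope sketch; the sleep-time fractions are made up, which is exactly the problem:

```python
deaths_total = 11_779
deaths_in_css = deaths_total * 0.03 * 0.63    # 3% in "sitting devices," 63% of those in car safety seats

# The missing piece: what fraction of infant sleep time is actually spent in a car seat? (made up below)
for css_sleep_fraction in (0.01, 0.02, 0.03):
    expected = deaths_total * css_sleep_fraction     # expected deaths if car seats carried no excess risk
    print(f"{css_sleep_fraction:.0%} of sleep time in CSS -> observed/expected = {deaths_in_css / expected:.2f}")
# At ~2% of sleep time, observed ~= expected: the published numerator alone cannot distinguish
# "car seats are dangerous" from "babies simply spend some of their sleep time in them."
```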

Sunday, July 21, 2019

Move Over Feckless Extubation, Make Room For Reckless Extubation

Following the theme of some recent posts on Status Iatrogenicus (here and here) about testing and treatment thresholds, one of our stellar fellows, Meghan Cirulis MD, and I wrote a letter to the editor of JAMA about the recent article by Subira et al comparing shorter-duration Pressure Support Ventilation to longer-duration T-piece trials.  Despite adhering to my well-hewn formula for letters to the editor, it was not accepted, so, as is my custom, I will publish it here.

Spoiler alert - when the patients you enroll in your weaning trial have a base rate of extubation success of 93%, you should not be doing an SBT - you should be extubating them all, and figuring out why your enrollment criteria are too stringent and how many extubatable patients your enrollment criteria are missing because of low sensitivity and high specificity.
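
Here is a sketch of why that 93% base rate changes the calculus; the SBT operating characteristics below are hypothetical - the point is the arithmetic, not the particular numbers:

```python
def post_test_prob(pretest, lr):
    odds = pretest / (1 - pretest) * lr
    return odds / (1 + odds)

base_rate = 0.93            # extubation success rate among patients enrolled in the trial
sens, spec = 0.90, 0.50     # hypothetical SBT characteristics for predicting success (illustration only)

lr_failed_sbt = (1 - sens) / spec
print(round(post_test_prob(base_rate, lr_failed_sbt), 2))
# ~0.73: even after a "failed" SBT, roughly 3 in 4 of these patients would still be extubated successfully
```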

Tuesday, May 7, 2019

Etomidate Succs: Preventing Dogma from Becoming Practice in RSI

The editorial about the PreVent trial in the NEJM a few months back is entitled "Preventing Dogma from Driving Practice".  If we are not careful, we will let the newest dogma replace the old dogma and become practice.

The PreVent trial compared bagging versus no bagging after induction of anesthesia for rapid sequence intubation (RSI).  Careful readers of this and another recent trial testing the dogma of videolaryngoscopy will notice several things that may significantly limit the external validity of the results.
  • The median time from induction to intubation was 130 seconds in the no bag ventilation group, and 158 seconds in the bag ventilation group (NS).  That's 2 to 2.5 minutes.  In the Lascarrou 2017 JAMA trial of direct versus video laryngoscopy, it was three minutes.  Speed matters.  The time that a patient is paralyzed and non-intubated is a very dangerous time and it ought to be as short as possible
  • The induction agent was Etomidate (Amidate) in 80% of the patients in the PreVent trial and 90% of patients in the Lascarrou trial (see supplementary appendix of PreVent trial)
  • The intubations were performed by trainees in approximately 80% of intubations in both trials (see supplementary appendix of PreVent trial)
I don't think these trials are directly relevant to my practice.  Like surgeon Robert Liston who operated in the pre-anesthesia era and learned that speed matters (he could amputate a leg in 2.5 minutes), I have learned that the shorter the time from induction to intubation, the better - it is a vulnerable time and badness occurs during it:  atelectasis, hypoxemia, aspiration, hypotension, secretion accumulation, etc.

Thursday, April 25, 2019

The EOLIA ECMO Bayesian Reanalysis in JAMA

A Hantavirus patient on ECMO, circa 2000
Spoiler alert:  I'm a Bayesian decision maker (although maybe not a Bayesian trialist) and I "believe" in ECMO as documented here.

My letter to the editor of JAMA was published today (and yeah I know, I write too many letters, but hey, I read a lot and regular peer review often doesn't cut it) and even when you come at them like a spider monkey, the authors of the original article still get the last word (and they deserve it - they have done far more work than the post-publication peer review hecklers with their quibbles and their niggling letters.)

But to set some things clear, I will need some more words to elucidate some points about the study's interpretation.  The authors' response to my letter has five points.
  1. I (not they) committed confirmation bias, because I postulated harm from ECMO.  First, I do not have a personal prior for harm from ECMO; I actually think it is probably beneficial in properly selected patients, as is well documented in the blog post from 2011 describing my history of experience with it in hantavirus, and as well in a book chapter I wrote in Cardiopulmonary Bypass Principles and Practice circa 2006.  There is irony here - I "believe in" ECMO, I just don't think their Bayesian reanalysis supports my (or anybody's) beliefs in a rational way!  The point is that it was a post hoc unregistered Bayesian analysis after a pre-registered frequentist study which was "negative" (for all that's worth and not worth), and the authors clearly believe in the efficacy of ECMO as do I.  In finding shortcomings in their analysis, I seek to disconfirm or at least challenge not only their beliefs but my own.  And I think that if the EOLIA trial had been positive, we would not be publishing Bayesian reanalyses showing how the frequentist result may be a type I error.  We know from long experience that if EOLIA had been "positive," success would have been declared for ECMO as it has been with prone positioning for ARDS.  (I prone patients too.)  The trend is to confirm rather than to disconfirm, but good science relies more on the latter.
  2. That a RR of 1.0 for ECMO is a "strongly skeptical" prior.  It may seem strong from a true believer standpoint, but not from a true nonbeliever standpoint.  Those are the true skeptics (I know some, but I'll not mention names - I'm not one of them) who think that ECMO is really harmful on the net, like intensive insulin therapy (IIT) probably is.  Regardless of all the preceding trials, if you ask the NICE-SUGAR investigators, they are likely to maintain that IIT is harmful.  Importantly, the authors skirt the issue of the emphasis they place on the only longstanding ARDS trial widely regarded as positive (that of low tidal volume).  There are three decades of trials in ARDS patients, scores of them, enrolling tens of thousands of patients, that show no effect of the various therapies.  Why would we give primacy to the one trial which was positive, and equate ECMO to low tidal volume?  Why not equate it to high PEEP, or corticosteroids for ARDS?  A truly skeptical prior would have been centered on an aggregate point estimate and associated distribution of 30 years of all trials in ARDS of all therapies (the vast majority of them "negative").  The sheer magnitude of their numbers would narrow the width of the prior distribution with RR centered on 1.0 (the "severely skeptical" one), and it would pull the posterior more towards zero benefit, a null result.  Indeed, such a narrow prior distribution may have shown that low tidal volume is an outlier and likely to be a false positive (I won't go any farther down that perilous path).  The point is, even if you think a RR of 1.0 is severely skeptical, the width of the distribution counts for a lot too, and the uninitiated are likely to miss that important point (a small numerical sketch after this list illustrates it).
  3. Priors are not used to "boost" the effect of ECMO.  (My original letter called it a Bayesian boost, borrowing from Mayo, but the adjective was edited out.) Maybe not always, but that was the effect in this case, and the respondents did not cite any examples of a positive frequentist result that was reanalyzed with Bayesian methods to "dampen" the observed effect.  It seems to only go one way, and that's why I alluded to confirmation bias.  The "data-driven priors" they published were tilted towards a positive result, as described above.
  4. Evidence and beliefs.  But as Russell said "The degree to which beliefs are based on evidence is very much less than believers suppose."  I support Russell's quip with the aforementioned.
  5. Judgment is subjective, etc.  I would welcome a poll, in the spirit of crowdsourcing, as we did here, to better understand what the community thinks about ECMO (my guess is it's split rather evenly, with a trend, perhaps strong, toward belief in the efficacy of ECMO).  The authors' analysis is laudable, but it is not based on information not already available to the crowd; rather, it transforms that information in ways that may not be transparent to the crowd and may magnify it in a biased fashion if people unfamiliar with Bayesian methods do not scrutinize the chosen prior distributions.
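
As promised in point 2, here is a small numerical sketch of how the width of a skeptical prior drives the posterior. The likelihood is summarized as a normal distribution on the log relative risk using roughly the headline EOLIA estimate (RR ~0.76, 95% CI ~0.55 to 1.04); the prior SDs and the whole setup are illustrative:

```python
from math import log, exp, sqrt, erf

def normal_cdf(x, mu, sd):
    return 0.5 * (1 + erf((x - mu) / (sd * sqrt(2))))

# Likelihood summarized as a normal on log(RR), approximately the headline EOLIA estimate
est = log(0.76)
se = (log(1.04) - log(0.55)) / (2 * 1.96)

# Two priors, both centered on RR = 1.0 ("skeptical"), differing only in width on the log scale
for prior_sd in (0.40, 0.10):
    post_var = 1 / (1 / prior_sd**2 + 1 / se**2)     # normal-normal conjugate update, prior mean = 0
    post_mean = post_var * est / se**2
    p_benefit = normal_cdf(0, post_mean, sqrt(post_var))   # posterior P(RR < 1)
    print(f"prior SD {prior_sd}: posterior median RR {exp(post_mean):.2f}, P(RR < 1) = {p_benefit:.2f}")
# SD 0.40 -> RR ~0.79, P ~0.94;  SD 0.10 -> RR ~0.93, P ~0.81: the width does a lot of the work.
```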

Sunday, April 21, 2019

A Finding of Noninferiority Does Not Show Efficacy - It Shows Noninferiority (of short course rifampin for MDR-TB)

An image of two separated curves from Mayo's book SIST
Published in the March 28th, 2019 issue of the NEJM is the STREAM trial of a shorter regimen for Rifampin-resistant TB.  I was interested in this trial because it fits the pattern of a "reduced intensity therapy", a cohort of which we analyzed and published last year.  The basic idea is this:  if you want to show efficacy of a therapy, you choose the highest dose of the active drug to compare to placebo, to improve the chances that you will get "separation" of the two populations and statistically significant results.  Sometimes, the choice of the "dose" of something, say tidal volume in ARDS, is so high that you are accused of harming one group rather than helping the other.  The point is if you want positive results, use the highest dose so the response curves will separate further, assuming efficacy.

Conversely, in a noninferiority trial, your null hypothesis is not that there is no difference between the groups, as it is in a superiority trial, but rather that there is a difference bigger than delta (the pre-specified margin of noninferiority).  Rejection of the null hypothesis leads you to conclude that there is no difference bigger than delta, and you then conclude noninferiority.  If you are comparing a new antibiotic to vancomycin, and you want to be able to conclude noninferiority, you may intentionally or subconsciously dose vancomycin at the lower end of the therapeutic range, or shorten the course of therapy.  Doing this increases the chances that you will reject the null hypothesis and conclude that there is no difference greater than delta in favor of vancomycin and that your new drug is noninferior.  However, this increases your type 1 error rate - the rate at which you falsely conclude noninferiority.
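
A minimal sketch of the noninferiority decision rule on a difference in proportions, with made-up numbers (not STREAM's data), showing why an underdosed comparator makes a declaration of noninferiority easier:

```python
from math import sqrt

def noninferior(p_new, p_ref, n_per_arm, delta, z=1.96):
    """Declare noninferiority if the worst plausible deficit of the new arm is smaller than delta.
    Outcomes are success proportions; deficit = p_ref - p_new (positive means the new arm looks worse)."""
    se = sqrt(p_new * (1 - p_new) / n_per_arm + p_ref * (1 - p_ref) / n_per_arm)
    worst_case_deficit = (p_ref - p_new) + z * se
    return worst_case_deficit < delta

# Hypothetical: 78% vs 80% success, 400 per arm, 10-point margin
print(noninferior(0.78, 0.80, 400, 0.10))   # True: worst-case deficit ~7.6 points, inside the margin
# Now "underdose" the reference arm so its success falls to 74%: noninferiority gets easier, not harder
print(noninferior(0.78, 0.74, 400, 0.10))   # True, with room to spare - the asymmetry described above
```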

Sunday, December 23, 2018

Do Doctors and Medical Errors Kill More People than Guns?

Recently released stats showing over 40,000 deaths due to firearms in the US this year have led to the usual hackneyed comparisons between those deaths and deaths due to medical errors, the tired refrain being something like "Doctors kill more people than that!"  These claims were spreading among gun aficionados on social media last week, with references to this 2016 BMJ editorial by Makary and Daniel, from my alma mater Johns Hopkins Bloomberg SPH, claiming that "Medical Error is the Third Leading Cause of Death."  I have been incredulous about this claim when I have encountered it in the past, because it just doesn't jibe with my 20 years of working in these dangerous slaughterhouses we call hospitals.  I have no intention to minimize medical errors - they certainly occur and are highly undesirable - but I think gross overestimates do a disservice too.  Since this keeps coming up, I decided to delve further.

First, just for the record, I'm going to posit that the 40,000 firearms deaths is a reliable figure because they will be listed as homicides and suicides in the "manner of death" section of death certificates, and they're all going to be medical examiner cases.  So I have confidence in this figure.

By contrast, the Makary paper has no new primary data.  It is simply an extrapolation of existing data, and the source is a paper by James in the Journal of Patient Safety in 2013.  (Consider for a moment whether you may have any biases if your career depended upon publishing articles in the Journal of Patient Safety.)  This paper also has no new primary data but relies on data from 4 published studies, two of them not peer-reviewed journal articles but Office of the Inspector General (OIG) reports.  I will go through each of these in turn so we can see where these apocalyptic estimates come from.

OIG pilot study from 2008.  This is a random sample of 278 Medicare beneficiaries hospitalized in 2 unspecified and nonrandom counties.  All extrapolations are made from this small sample which has wide confidence intervals because of its small size (Appendix F, Table F1, page 33).  A harm scale is provided on page 3 of the document where the worst category on the letter scale is "I" which is:
"An error occurred that may have contributed to or resulted in patient death."  [Italics added.]

Thursday, May 24, 2018

You Have No Idea of the Predictive Value of Weaning Parameters for Extubation Success, and You Probably Never Will

As Dr. O'Brien eloquently described in this post, many people misunderstand the Yang-Tobin (f/Vt) index as being a "weaning parameter" that is predictive of extubation success.  Far from that, its sensitivity and specificity and resultant ROC curve relate to the ability of f/Vt after one minute of spontaneous ventilation to predict the success of a prolonged (~ one hour) spontaneous breathing trial.  But why would I want to predict the result of a test (the SBT), and introduce error, when I can just do the test and get the result an hour later?  It makes absolutely no sense.  What we want is a parameter that predicts extubation success.  But we don't have that, and we probably will never have that.

In order to determine the sensitivity and specificity of a test for extubation success, we will need to ascertain the outcome in all patients regardless of their performance on the test of interest.  That means we would have to extubate patients that failed the weaning parameter test.  In the original Yang & Tobin article, their cohort consisted of 100 patients.  60(%) of the 100 were said to have passed the weaning test and were extubated, and 40(%) failed and were not extubated.  (There is some over-simplification here based on how Yang & Tobin classified and reported events - it's not at all transparent in their article - the data to resolve the issues are not reported and the differences are likely to be small.  Suffice it to say that about 60% of their patients were successfully weaned and the remainder were not.)  Let's try to construct a 2x2 table to determine the sensitivity and specificity of a weaning parameter using a population like theirs.  The top row of the 2x2 table would look something like this, assuming an 85% extubation success rate - that is, of the 60 patients with a positive or "passing" SBT score (based on whatever parameter), all were extubated and the positive predictive value of the test is 85% (the actual rate of reintubation in patients with a passing weaning test is not reported, so this is a guess):
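
With the assumptions above, that top row works out to roughly 51 extubation successes and 9 failures among the 60 test-passers. The trouble is the bottom row: the 40 patients who failed the parameter were never extubated, so their outcomes are unobservable, and the observed data are compatible with almost any sensitivity and specificity. A sketch (the splits of the 40 are hypothetical by construction):

```python
# Top row implied by the assumptions above: 60 test-passers extubated, 85% PPV -> ~51 successes, 9 failures
tp, fp = 51, 9

# The 40 test-failers were never extubated, so their true outcomes are unknowable.
# Every split below is equally consistent with the observed data:
for hidden_successes in (0, 10, 20, 30, 40):     # test-negative patients who would have been extubated successfully
    fn, tn = hidden_successes, 40 - hidden_successes
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    print(f"{fn:>2} hidden successes: sensitivity {sens:.0%}, specificity {spec:.0%}")
# Sensitivity anywhere from 100% down to ~56%, specificity from ~82% down to 0% - same observed data.
```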



Thursday, May 17, 2018

Increasing Disparities in Infant Mortality? How a Narrative Can Hinge on the Choice of Absolute and Relative Change

An April 11th, 2018 article in the NYT entitled "Why America's Black Mothers and Babies are in a Life-or-Death Crisis" makes the following alarming summary statement about racial disparities in infant mortality in America:
Black infants in America are now more than twice as likely to die as white infants — 11.3 per 1,000 black babies, compared with 4.9 per 1,000 white babies, according to the most recent government data — a racial disparity that is actually wider than in 1850, 15 years before the end of slavery, when most black women were considered chattel.
Racial disparities in infant mortality have increased since 15 years before the end of the Civil War?  That would be alarming indeed.  But a few paragraphs before, we are given these statistics:

In 1850, when the death of a baby was simply a fact of life, and babies died so often that parents avoided naming their children before their first birthdays, the United States began keeping records of infant mortality by race. That year, the reported black infant-mortality rate was 340 per 1,000; the white rate was 217 per 1,000.
The white infant mortality rate has fallen 217-4.9 = 212.1 infants per 1000.  The black infant mortality rate has fallen 340-11.3 = 328.7 infants per 1000.  So in absolute terms, the terms that concern babies (how many of us are alive?), the black infant mortality rate has fallen much more than the white infant mortality rate.  In fact, in absolute terms, the disparity is almost gone:  in 1850, the absolute difference was 340-217 = 123 more black infants per 1000 births dying and now it is 11.3-4.9 = 6.4 more black infants per 1000 births dying.

Analyzed a slightly different way, the proportion of white infants dying has been reduced by (217-4.9)/217 = 97.7%, and the proportion of black infants dying has been reduced by (340-11.3)/340 = 96.7%. So, within 1%, black and white babies shared almost equally in the improvements in infant mortality that have been seen since 15 years before the end of the Civil War. Or, we could do a simple reference frame change and look at infant survival rather than mortality. If we did that, the current infant survival rate is 98.87% for black babies and 99.51% for white babies. The rate ratio for black:white survival is 0.994 - almost parity, depending on your sensitivity to deviations from unity.

It's easy to see how the author of the article arrived at different conclusions by looking only at the rate ratios in 1850 and contemporaneously.  But doing the math that way makes it seem as if a black baby is worse off today than in 1850!  Nothing could be farther from the truth.

You might say that this is just "fuzzy math," as our erstwhile president did in the debates of 2000.  But there could be important policy implications also.  Suppose that I have an intervention that I could apply across the US population and I estimate that it will save an additional 5 black babies per 1000 and an additional 3 white babies per 1000.  We implement this policy and it works as projected.  The black infant mortality rate is reduced to 6.3/1000 and the white infant mortality rate is 1.9/1000.  We have saved far more black babies than white babies across the population.  But the rate ratio for black:white mortality has increased from 2.3 to 3.3!  Black babies are now 3 (three!) times as likely to die as white babies!  The policy has increased disparities even though black babies are far better off after the policy change than before it.
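
The arithmetic of that hypothetical policy, spelled out:

```python
black_before, white_before = 11.3, 4.9          # current US rates quoted above, deaths per 1,000 live births
black_after, white_after = 11.3 - 5, 4.9 - 3    # the hypothetical policy saves 5 and 3 more per 1,000

print(round(black_before - white_before, 1), round(black_after - white_after, 1))   # absolute gap: 6.4 -> 4.4
print(round(black_before / white_before, 1), round(black_after / white_after, 1))   # rate ratio: 2.3 -> 3.3
# More babies in both groups survive and the absolute disparity shrinks, yet the relative disparity "worsens."
```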

It reminds me of the finding that people would rather have a lower income if it increased their standing relative to their neighbors. Surprisingly, when presented with two choices:
  1. you make $50,000 and your peers make $25,000 per year
  2. You make $100,000 and your peers make $250,000 per year
many people choose 1, as if relative social standing is worth $50,000 per year in income.  (Note that relative social standing is just that, relative, and could change if you arbitrarily change the reference class.)

So, relative social standing has value and perhaps a lot of it.  But as regards the hypothetical policy change above, I'm not sure we should be focusing on relative changes in infant mortality.  We just want as few babies dying as possible. And it is disingenuous to present the statistics in a one-sided, tendentious way.