Sunday, December 18, 2011

Modern Day Bloodletting: The Good Samaritan, the Red Cross, and the Jehovah's Witness

How many studies do you suppose we need before doctors realize that their tendency to want to transfuse blood in every manner of patient admitted to the hospital is nothing more than an exercise in stupidity, or at least futility, based on the normalization heuristic?  It's a compelling logic and an irresistible practice, I know.  The hemoglobin level is low; that can't be good for the heart, circulation, perfusion, oxygen delivery, you name it.  If we just give a transfusion or two, everything will be all better.  I can hear family members on their mobile phones reassuring other loved ones that the doctors are acting with great prudence and diligence in taking care of Mr. Jones, having perspicaciously measured his hemoglobin (as by routine, for a hospital charge of ~$300/day - the leviathan bill and the confused, incredulous faces come months later - "why does it cost so much?"), discovered that perilous anemia, and ordered two units of life-saving blood to be transfused.  It's so simple but so miraculous!  Thank God for the Red Cross!

Not so fast.  The TRICC trial, published in 1999, demonstrated that, at least in critically ill patients, a lower transfusion threshold led to a statistically insignificant trend toward improved outcomes compared with a higher threshold.  That is, if anything, less blood is better.  For every reason you can think of that transfusion should improve physiological parameters or outcomes, there is a counterargument about how transfusions can wreak havoc on homeostasis and the immune system (see Marik et al., Crit Care Med 2008, and others).

Not to mention the cost.  My time-honored estimate of the cost of one unit of PRBCs was about $400.  It may indeed be three times higher.  That's right, $1200 per unit transfused, and for reasons of parity or some other nonsense, in clinical practice they're usually transfused in "twos".  Yep, $2400 a pair.  (Even though Samaritans donate for free, the costs of processing, testing, storage, transportation, and the like drive up the price.)  What value do we get for this expense?

Thursday, November 10, 2011

Post-hOckham analyses - the simplest explanation is that it just plain didn't flipp'n work


You're probably familiar with that Franciscan friar William of Ockham and his sacred saw. Apparently the principle has been as oversimplified as it has been ignored, as a search of Wikipedia will attest. Suffice it to say, nonetheless, that this maxim guides us to select the simplest from among multiple explanations for any phenomenon - and this intuitively makes sense, because there are infinite and infinitely complex possible explanations for any phenomenon.

So I'm always amused and sometimes astonished when medical scientists reappraise their theories after they've been defeated by their very own data and begin to formulate increasingly complex explanations and apologies, so smitten and beholden to them as they are. "True Believers" is what Jon Abrams, MD, one of my former attendings, used to call them. The transition from scientist to theist is an insidious and subversive one.

The question is begged: did we design such and such clinical trial to test the null hypothesis or not? If some post-hoc subgroup is going to do better with therapy XYZ, why didn't we identify that a priori? Why didn't we test just THAT group? Why didn't we say, in advance, "if this trial fails to show efficacy, it will be because we should have limited it to this or that subgroup. And if it fails, we will follow up with a trial of this or that subgroup."
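For the skeptics who like to see the machinery, here is a quick simulation sketch (all numbers invented, nothing to do with any particular trial) of how easily a "winning" post-hoc subgroup emerges by chance from a therapy with no effect whatsoever:

```python
# Illustrative simulation (not any actual trial): a therapy with ZERO true effect
# is tested, then the data are sliced into 10 post-hoc subgroups.
# How often does at least one subgroup come up "significant" at P < 0.05?
import random
from statistics import NormalDist

def two_prop_p(x1, n1, x2, n2):
    """Two-sided P-value for a two-proportion z-test (normal approximation)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = (pooled * (1 - pooled) * (1 / n1 + 1 / n2)) ** 0.5
    if se == 0:
        return 1.0
    z = (p1 - p2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(1)
n_sims, n_per_arm, mortality, n_subgroups = 2000, 500, 0.30, 10
sims_with_false_positive = 0
for _ in range(n_sims):
    # outcome is independent of treatment: the null hypothesis is TRUE
    treated = [(random.randrange(n_subgroups), random.random() < mortality) for _ in range(n_per_arm)]
    control = [(random.randrange(n_subgroups), random.random() < mortality) for _ in range(n_per_arm)]
    for g in range(n_subgroups):
        t = [died for grp, died in treated if grp == g]
        c = [died for grp, died in control if grp == g]
        if len(t) > 10 and len(c) > 10 and two_prop_p(sum(t), len(t), sum(c), len(c)) < 0.05:
            sims_with_false_positive += 1
            break
print(f"Null trials with >=1 'significant' post-hoc subgroup: "
      f"{100 * sims_with_false_positive / n_sims:.0f}%")   # roughly 40%, give or take
```

Roughly four in ten completely null trials will hand the true believers a subgroup to rally around - which is why the subgroup belongs in the protocol, a priori, or nowhere at all.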

Tuesday, November 8, 2011

The Nihilist versus the Trialist: Why Most Published Research Findings Are False

I came across this PLoS Med article today that I wish I had seen years ago: Why Most Published Research Findings Are False . In this delightful essay, John P. A. Ioannidis describes why you must be suspicious of everything you read, because most of it is spun hard enough to give you a wicked case of vertigo. He highlights one of the points made repeatedly on this blog, namely that all hypotheses are not created equal, and some require more evidence to confirm (or refute) than others - basically a Bayesian approach to the evidence. With this approach, the diagnostician's "pre-test probability" becomes the trialist's "pre-study probability," and likelihood ratios stem from the data from the trial as well as alpha and beta. He creates a function for trial bias and shows how bias erodes the probability that the trial's results are true as the pre-study probability and the study power are varied. He infers that in practice alpha is too lenient (and hence Type I error rates too high) and power too low (both alpha and beta influence the likelihood ratio of a given dataset); a back-of-the-envelope version of this calculation is sketched in the code after the list below. He discusses terms (coined by others whom he references) such as "false positive" for study reports, and highlights several corollaries of his analysis (often discussed on this blog), including:
  • beware of studies with small sample sizes
  • beware of studies with small effect sizes (delta)
  • beware of multiple hypothesis testing and soft outcome measures
  • beware of flexibility of designs (think Prowess/Xigris among others), definitions, outcomes (NETT trial), and analytic modes
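For those who want to see the arithmetic, here is a minimal sketch of the pre-study-probability calculation as I read it from the essay, without the bias term; the priors and power values plugged in below are my own illustrative choices, not Ioannidis's:

```python
# Sketch of the pre-study-probability logic from the Ioannidis essay (my reading of it).
# R/(1+R) is the pre-study probability that the alternative hypothesis is true;
# a "positive" finding updates it by roughly the likelihood ratio power/alpha.
def post_study_probability(pre_study_prob, alpha=0.05, power=0.80):
    """Probability a statistically significant finding is true, ignoring bias."""
    r = pre_study_prob / (1 - pre_study_prob)          # pre-study odds
    ppv = (power * r) / (power * r + alpha)            # positive predictive value, no bias term
    return ppv

# Illustrative numbers (my own): a long-shot hypothesis vs. a well-grounded one,
# at conventional power and again at the anemic power typical of small trials.
for prior in (0.05, 0.50):
    for power in (0.80, 0.20):
        print(f"prior {prior:.2f}, power {power:.2f} -> "
              f"P(finding is true) = {post_study_probability(prior, power=power):.2f}")
# Bias (Ioannidis's u term) and multiple testing only drag these numbers lower.
```

Run it and the long-shot hypothesis tested at low power yields a "positive" finding that is true well under a quarter of the time - the essay's title in four lines of arithmetic.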

Perhaps most importantly, he discusses the role that researcher bias may play in analyzing or aggregating data from research reports - the GIGO (garbage in, garbage out) principle. Conflicts of interest extend beyond the financial to tenure, grants, pride, and faith. Gone forever is the notion of the noble scientist in pursuit of the truth, replaced by the egoist climber of ivory and builder of Babel towers, so bent on promoting his or her (think Greet Van den Berghe) hypothesis that they lose sight of the basic purpose of scientific testing, and the virtues of scientific agnosticism.

Thursday, October 6, 2011

ECMO and H1N1 - more fodder for debate

There is perhaps no better way to revive the dormant blog than to highlight an article published in JAMA yesterday about the role and effect of ECMO in the H1N1 epidemic in England: http://jama.ama-assn.org/content/early/2011/09/28/jama.2011.1471.full . Other than to recognize its limitations, which are similar if not identical to those of the CESAR trial, there is little to say about this study except that it further bolsters the arguments of my last post about ECMO and the ongoing debate surrounding it.

In light of the recent failures of albuterol and omega-3 fatty acids in ARDS treatment, I echo the editorialist in calling for funding for a randomized controlled trial of ECMO in severe ARDS (see: http://jama.ama-assn.org/content/early/2011/09/28/jama.2011.1504.full ).

Tuesday, April 19, 2011

ECMO and logic: Absence of Evidence is not Evidence of Absence

I have been interested in ECMO for adults with cardiorespiratory failure since the late 1990s, during the Hantavirus cardiopulmonary syndrome epidemic in New Mexico, when I was a house officer at the University of New Mexico. Nobody knows for sure whether our use of VA ECMO there saved any lives, but we all certainly suspected that it did. There were simply too many patients too close to death who survived. It made an impression.

I have since practiced in other centers where ECMO was occasionally used, and I had the privilege of writing a book chapter on ECMO for adult respiratory failure in the interim.

But alas, I now live in the Salt Lake Valley where, for reasons as cultural as they are scientific, ECMO is taboo. The main reason for this is, I think, an over-reliance on outdated data, along with too much confidence in, and loyalty to, locally generated data.

And this is sad, because this valley was hit with another epidemic two years ago - the H1N1 epidemic, which caused the most severe ARDS I have seen since the Hanta days in New Mexico. To my knowledge, no patients in the Salt Lake Valley received ECMO for refractory hypoxemia in H1N1 disease.


Thus I read with interest the Pro-Con debate in Chest a few months back, revisited in the correspondence of the current issue of Chest, which was led by some of the local thought leaders (and by those who believe that, short of incontrovertible evidence, ECMO should remain taboo and outright disparaged) - see: http://chestjournal.chestpubs.org/content/139/4/965.1.citation and associated content.

It was an entertaining and incisive exchange between a gentleman in Singapore with recent ECMO experience in H1N1 disease, and our local thought leaders, themselves led by Dr. Alan Morris. I leave it to interested readers to read the actual exchange, as it is too short to merit a summary here. My only comment is that I am particularly fond of the Popper quote, taken from The Logic of Scientific Discovery: "If you insist on strict proof (or disproof) in the empirical sciences, you will never benefit from experience and never learn from it how wrong you are." Poignant.

I will add my own perhaps Petty insight into the illogical and dare I say hypocritical local taboo on ECMO. ECMO detractors would be well-advised to peruse the first Chapter in Martin Tobin's Principles and Practice of Mechanical Ventilation called "HISTORICAL PERSPECTIVE ON THE DEVELOPMENT OF MECHANICAL VENTILATION". As it turns out, mechanical ventilation for most diseases, and particularly for ARDS, was developed empirically and iteratively during the better part of the last century, and none of that process was guided, until the last 20 years or so, by the kind of evidence that Morris considers both sacrosanct and compulsory. Indeed, Morris, each time he uses mechanical ventilation for ARDS, is using a therapy which is unproved to the standard that he himself requires. And indeed, the decision to initiate mechanical ventilation for a patient with respiratory failure remains one of the most opaque areas in our specialty. There is no standard. Nobody knows who should be intubated and ventilated, and exactly when - it is totally based on gestalt, is difficult to learn or to teach, and is not even addressed in studies of ARDS. Patients must be intubated and mechanically ventilated for entry to an ARDS trial, but there are no criteria which must be met on how, when, and why they were intubated. It's just as big a quagmire as the one Morris describes for ECMO.

And much as he, and all of us, will not stand by idly and allow a spontaneously breathing patient with ARDS to remain hypoxemic with unacceptable gas exchange, those of us with experience with ECMO, an open mind, equipoise, and freedom from rigid dogma will not stand by idly and watch a ventilated patient remain hypoxemic with unacceptable gas exchange for lack of ECMO.

It is the same thing. Exactly the same thing.

Saturday, April 9, 2011

Apixaban: It's been a while since I've read about a new drug that I actually like

In the March 3rd NEJM, Apixaban makes its debut on the scene of stroke prevention in Atrial Fibrillation with the AVERROES trial (see: http://www.nejm.org/doi/full/10.1056/NEJMoa1007432#t=abstract ), and I was favorably impressed. The prophylaxis of stroke in atrial fibrillation is truly an unmet need because of the problematic nature of chronic anticoagulation with coumarin derivatives. So a new player on the team is welcome.

A trusted and perspicacious friend dislikes the AVERROES study for the same reason that I like it - he says that comparing Apixaban to aspirin (placebo, as he called it) is tantamount to a "chump shot". But I think the comparison is justified. There are numerous patients who defer anticoagulation with coumarins because of their pesky monitoring requirements, and this trial assures any such patient that Apixaban is superior to aspirin, beyond any reasonable doubt (P=0.000002). (Incidentally, I applaud the authors for mentioning, in the second sentence of the discussion, that the early termination of the trial might have inflated the results - something that I'm less concerned with than usual because of the highly statistically significant results which all go basically in the same direction.) Indeed, it was recently suggested that "me-too" agents or those tested in a non-inferiority fashion should be tested in the population in which they are purported to have an advantage (see: http://jama.ama-assn.org/content/305/7/711.short?rss=1 ). The AVERROES trial does just that.

Had Apixaban been compared to warfarin in a non-inferiority trial (that trial, called ARISTOTLE, is ongoing) without the AVERROES trial, some[busy]body would have come along and said that non-inferiority to warfarin does not demonstrate superiority over aspirin +/- clopidogrel, nor does it demonstrate safety compared to aspirin, etc. So I respectfully disagree with my friend who thinks it's a useless chump shot - I think the proof of efficacy from AVERROES is reassuring and welcome, especially for patients who wish to avoid coumarins.

Moreover, these data bolster the biological plausibility of efficacy of oral factor Xa antagonists, and will increase my confidence in the results of any non-inferiority trial, e.g., ARISTOTLE. And that train of thought makes me muse as to whether the prior probability of some outcome from a non-inferiority trial might not depend on the strength of evidence for efficacy of each agent prior to the trial (self-evident, I know, but allow me to finish). That is, if you have an agent that has been used for decades compared to one that is a newcomer, is there really equipoise, and should there be equipoise, about the result? Shouldn't the prior be higher that the old dog will win the fight or at least not be beaten? These musings might have greater gravity in light of the high rate of market recalls of new agents with less empirical evidence of safety buttressing them.

One concerning finding, especially in light of the early termination, is illustrated in Figure 1B. The time-to-event curves for major bleeding were just beginning to separate between 9 and 14 months, and then, inexplicably, there were no more bleeding events documented after about 14 months. It almost looks as if monitoring for severe bleeding stopped at 14 months. I'm not sure I fully understand this, but one can't help but surmise that, if this agent truly is a suitable replacement for warfarin, it will come with a cost - and that cost will be bleeding. I wager that had the trial continued there would have been a statistically significant increase in major bleeding with apixaban.

It is interesting that not a single mention is made by name in the article of a competitor oral factor Xa inhibitor, indeed one that is going to make it to market sooner than Apixaban, albeit for a different indication: Rivaroxaban. Commercial strategizing is also foreshadowed in the last paragraph of the article before the summary: the basic tenets of a cost-benefit analysis are laid bare before us, with one exception: the cost of Apixaban. Surely the sponsor will feel justified in usurping all but a pittance of the cost savings from obviated INR monitoring and hospitalizations when they price this agent. Only time will tell.

Thursday, April 7, 2011

Conjugated Equine Estrogen (CEE) reduces breast cancer AFTER the trial is completed?

I awoke this morning to a press release from the AMA, and a front page NYT article declaring that, in a post-trial follow-up of the WHI study, CEE reduces breast cancer in the entire cohort of post-hysterectomy patients, and lowers CHD (coronary heart disease) risk in the youngest age stratum studied.

Here's a couple of links: http://well.blogs.nytimes.com/2011/04/05/estrogen-lowers-risk-of-heart-attack-and-breast-cancer-in-some/?src=me&ref=general

http://jama.ama-assn.org/content/305/13/1305.short

Now why would that be?

One need look no further than the data in figures 2 and 5 to see that it's a Type I statistical error (a significant result found by chance when the null hypothesis is true and there is in reality no effect) - that's why.

For the love of Jehovah, did this really make the headlines? The P-value for the breast cancer risk is....well, they don't give a P-value, but the upper bound of the 95% CI is 0.95 so the P-value is about 0.04, BARELY significant. Seriously? This is one of FIFTEEN (15) comparisons in Table 2 alone. Corrected for multiple comparisons, this is NOT a statistically significant effect, NOT EVEN CLOSE. I think I'm having PTSD from flashbacks to the NETT trial.

And table 5? There are TEN outcomes with THREE age strata for each outcome, so, what, 30 comparisons? And look at the width of the 95% CI for the youngest age stratum in the CHD outcome - wide. So there weren't a lot of patients in that group.
And nevermind the lack of an a priori hypothesis, or a legitimate reason to think some difference based on age strata might make biological sense.
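If you want to check the multiplicity arithmetic yourself, here is a back-of-the-envelope sketch, taking my rough P of 0.04 at face value (the counts of comparisons are as I tallied them above; everything else is conventional):

```python
# Quick arithmetic on multiplicity, taking the post's rough P of ~0.04 at face value.
nominal_p = 0.04

for k in (15, 30):   # ~15 comparisons in Table 2; ~30 (10 outcomes x 3 strata) in Table 5
    bonferroni = min(1.0, k * nominal_p)
    familywise = 1 - (1 - 0.05) ** k   # chance of >=1 "significant" result if ALL nulls are true
    print(f"{k} comparisons: Bonferroni-adjusted P ~ {bonferroni:.2f}; "
          f"probability of at least one false positive at alpha=0.05 ~ {familywise:.2f}")
# 15 comparisons -> adjusted P ~ 0.60 and a ~54% chance of a spurious "hit";
# 30 comparisons -> a ~79% chance. "Barely significant" does not survive.
```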

Bad old habits die hard. As a former colleague is fond of pointing out, don't assume that because an investigator does not have drug company ties that s/he is not biased. Government funding and entire careers are at stake if an idea that you've been pursuing for years yields to the truth and dies off. Gotta keep stoking the coals of incinerated ideas long enough to get tenure!

Sunday, April 3, 2011

If at first you don't succeed, try, try again: Anacetrapib picks up where Torcetrapib left off

I previously blogged on Torcetrapib because of my interest in causality and, in a similar vein, the cholesterol hypothesis. And I was surprised and delighted when the ILLUMINATE trial showed that Torcetrapib, in spite of doubling HDL, failed miserably. Surprised because, like so many others, I couldn't really believe that doubling HDL wouldn't, on balance, make wonderful things happen; and delighted because of the possible insights this might give into the cholesterol hypothesis and causality. (See this post: http://medicalevidence.blogspot.com/2007/11/torcetrapib-torpedoed-sunk-by-surrogate.html )

I must have been too busy skiing when this article on Anacetrapib came out last year: http://www.nejm.org/doi/full/10.1056/NEJMoa1009744 . You may recall that after Torcetrapib was torpedoed, the race was on for apologists to find reasons it didn't work besides the obvious one, that it doesn't work. It raised blood pressure and does things to aldosterone synthesis, etc. Which I find preposterous. Here is an agent with profound effects on HDL and mild effects on other parameters (at least the parameters we can measure) and I am supposed to believe that the minor effects outweigh the major ones when it comes time to measure final outcomes? Heparin also affects aldosterone synthesis, but to my knowledge, when used appropriately to treat patients with clots, its major effects prevail over its minor effects and thus it doesn't kill people.


This is no matter to the true believers. Anacetrapib doesn't have these pesky minor effects, and it too doubles HDL, so the DEFINE investigators conducted a safety study to see if its lack of effects on aldosterone synthesis and the like might finally allow its robust effects on HDL to shine down favorably on cardiovascular outcomes (or at least not kill people.) The results are favorable, and there is no increase in blood pressure or changes in serum electrolytes, so their discussion focuses on all the reasons that this agent might be that Holy Grail of cholesterol lowering agents after all. All the while they continue to ignore the lack of any positive signal on cardiovascular outcomes at 72 weeks with this HDL-raising miracle agent, and what I think may be a secret player in this saga: CRP levels.

Only time and additional studies will tell, but I'd like to be on the record as saying that given the apocalyptic failure of Torcetrapib, the burden of evidence is great to demonstrate that this class will have positive effects on cardiovascular outcomes. I don't think it will. And the implications for the cholesterol hypothesis will perhaps be the CETP inhibitors' greatest contributions to science and medicine.

Monday, March 28, 2011

Cultural Relativism in Clinical Trials: Composite endpoints are good enough for Cardiology, but not for Pulmonihilism and Critical Care


There are many differences between cardiology and pulmonary and critical care medicine as medical specialties, and some of these differences are seen in how they conduct clinical trials. One thing is for sure: cardiology has advanced by leaps and bounds in terms of new therapies (antiplatelet agents, coated stents, heparinoids, GP IIb/IIIa inhibitors, direct thrombin inhibitors, AICDs, etc.) in the last 15 years, while critical care has....well, we have low tidal volume ventilation, and that's about it. Why might this be?

One possible explanation was visible last week as the NEJM released the results of the PROTECT study - http://www.nejm.org/doi/full/10.1056/NEJMoa1014475 - of Dalteparin ("Dalty") versus unfractionated heparin (UFH) for the prevention of proximal DVT in critically ill patients. Before we delve into the details of this study, imagine that a cardiologist was going to conduct a study of Dalty vs. UFH for use in acute coronary syndromes (ACS) - how would that study be designed?


Well, if it were an industry sponsored study, we might guess that it would be a non-inferiority study with an über-wide delta to bias the study in favor of the branded agent.  (Occasionally we're surprised when a "me-too" drug such as Prasugrel is pitted against Plavix in a superiority contest.....and wins.....but this is the exception rather than the rule.)  We would also be wise to guess that, in addition to being a very large study with thousands of patients, the endpoint in the cardiologists' study will be a composite endpoint - something like "death, recurrent MI, or revascularization," etc.  How many agents currently used in cardiology would be around if it weren't for the use of composite endpoints?

Not so in critical care medicine. We're purists. We want mortality improvements and mortality improvements only. (Nevermind if you're alive at day 28 or 60 and slated to die in 6 months after a long run in nursing homes with multiple readmissions to the hospital, with a tracheostomy in place, on dialysis, being fed through a tube, not walking or getting out of bed....you're alive! Pulmonologists pat themselves on the back for "saves" such as those.) After all, mortality is the only truly objective outcome measure, there is no ascertainment bias, and no dispute about the value of the endpoint - alive is good, dead is bad, period.

Cardiologists aren't so picky. They'll take what they can get. They wish to advance their field, even if it does mean being complicit with the profit motives of Big Pharma.

So the PROTECT study is surprising in one way (that proximal DVT rather than mortality was the primary endpoint) but wholly unsurprising in others - the primary endpoint was not a composite, and the study was negative overall, the conclusion being that "Among critically ill patients, dalteparin was not superior to unfractionated heparin in decreasing the incidence of proximal deep-vein thrombosis." Yet another critical care therapy to be added to the therapeutic scrap heap?

Not so fast. Even though it was not the primary endpoint, the authors DID measure a composite: any venous thromboembolism (VTE) or death. And the 95% confidence interval for that outcome was 0.79-1.01, just barely missing the threshold for statistical significance with a P-value of 0.07. So, it appears that Dalty may be up to 1% worse than UFH or up to 21% better than UFH. Which drug do YOU want for YOURSELF or your family in the ICU? (We could devise an even better composite endpoint that includes any VTE, death, bleeding, or HITS, +/- others. Without the primary data, I cannot tell what the result would have been, but I'm interested.)
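For the curious, here is the rough arithmetic behind that reading of the composite, assuming (my assumption) that the 0.79-1.01 interval is a hazard ratio CI constructed on the log scale:

```python
# Rough reconstruction of the P-value implied by a hazard ratio CI of 0.79-1.01,
# assuming (my assumption) a normal approximation on the log-HR scale.
from math import log, exp
from statistics import NormalDist

lower, upper = 0.79, 1.01
log_se = (log(upper) - log(lower)) / (2 * 1.96)     # SE of log(HR) from the CI width
point = exp((log(lower) + log(upper)) / 2)          # geometric midpoint ~ point estimate
z = log(point) / log_se
p = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"implied HR ~ {point:.2f}, two-sided P ~ {p:.2f}")
# ~HR 0.89, P ~ 0.07: "not significant," yet the interval spans
# "up to ~1% worse" to "up to ~21% better" than UFH on the composite.
```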

I don't have an answer for the question of why these cultural differences exist in the conduct of clinical trials in medicine. But I am pretty certain that they do indeed exist. The cardiologists appear to recognize that there are things that can happen to you short of death that have significant negative utility for you. The pulmonihilists, in their relentless pursuit of untainted truth, ignore inconvenient realities and eschew pragmatism in the name of purity. The right course probably lies somewhere in between these two extremes. I can only hope that one day soon pulmonary and critical care studies will mirror the reality that alive but bedridden with a feeding tube and a tracheostomy is not the same as being alive and walking, talking, breathing on your own, and eating. Perhaps we should, if only in the name of diversity, include some cardiologists the next time we design a critical care trial.

Wednesday, February 23, 2011

Burning Sugar in the Brain: Cell Phones Join the Fight Against Obesity

News channels are ablaze with spin on an already spun report of the effects of cell phone radiofrequency (RF) signal on glucose metabolism in the human brain (see: http://jama.ama-assn.org/content/305/8/808.short).

I'm not going to say that this study is an example of a waste of taxpayer money and research resources, but WOW, what a waste of taxpayer money and research resources.

Firstly, why would anybody go looking for an effect of RF from cell phones on brain glucose metabolism anyway? Answer: Because we have a PET scan, and that's what a PET scan can do - not because we have any good reason to believe that changes in brain glucose metabolism are meaningful in this context. This is an example of the hammer dictating the floorplan of the house. We are looking at glucose metabolism simply because we can, not because we have any remote inkling of what changes in glucose metabolism may mean.

Secondly, this whole topic is deeply permeated by a bias that assumes that cell phones are in some way harmful. To date, with the exception of distracted driving, which incidentally gets nobody excited until a law is proposed to reduce it and thus improve public safety, there is no credible evidence that cell phone radiation is harmful. It may be. But it may also be BENEFICIAL. Who's to say that the increase in brain glucose utilization isn't causing positive effects in the brain? Maybe it's making you smarter. That's just as likely as that it's causing harm, but far less likely than that there is no effect.


Thirdly, the experiment is inadequately controlled. What if you strap a cell phone to somebody's hind end and put them in a PET scanner? Does glucose metabolism of the gluteus maximus increase from the RF signal? If it did, we would have the same problem of interpretation: "What does it mean?" But we might be able to quell some of the hand-waving about radio signals altering the function of your brain. No, it's simply altering the biochemistry of your cells in a subtle way of unknown significance.

Here is what news organizations are saying, as proudly promulgated by the publicity intoxicated AMA this morning in their member communication:
"We need to rule out that there is a not long-lasting effect in healthy people." - Nora Volkow, first author of the study. [Of course she thinks that - it means millions more dollars in grants for her.]

The study, "by providing solid evidence that cellphone use has measurable effects on brain activity...suggests that the nation's passionate attachment to its 300 million cellphones may be altering the way we think and behave in subtle ways." - The LA Times. [Really? Really?]

Fortunately, the only thing more powerful than the inherent biases about RF signals from cellphones and the lay public's ignorance about PET scanners, glucose metabolism, and the like, is the public's penchant for mobile devices, which will surely overwhelm any concerns about altered sugar burning in the brain, just as it has any concerns about distracted driving.

Monday, January 17, 2011

Like Two Peas in a Pod: Cis-atracurium for ARDS and the Existence of Extra Sensory Perception (ESP)

Even the lay public and popular press (see: http://www.nytimes.com/2011/01/11/science/11esp.html?_r=1&scp=1&sq=ESP&st=cse) caught on to the subversive battle between frequentist and Bayesian statistics when it was announced (ahead of print) that a prominent psychologist was to publish a report purporting to establish the existence of Extra Sensory Perception (ESP) in the Journal of Personality and Social Psychology (I don't think it's even published yet, but here's the link to the journal: http://www.apa.org/pubs/journals/psp). So we're back to my Orange Juice (OJ) analogy - if I published the results of a study showing that the enteral administration of OJ reduced severe sepsis mortality by a [marginally] statistically significant 20%, would you believe it? As Carl Sagan was fond of saying, "extraordinary claims require extraordinary evidence" - which to me means, among other things, an unbelievably small P-value produced by a study with scant evidence of bias.

And I remain utterly incredulous that the administration of a paralytic agent for 48 hours in ARDS (see Papazian et al: http://www.nejm.org/doi/full/10.1056/NEJMoa1005372#t=abstract) is capable of reducing mortality. Indeed, FEW THERAPIES IN CRITICAL CARE MEDICINE REDUCE MORTALITY (see Figure 1 in our article on Delta Inflation: http://www.ncbi.nlm.nih.gov/pubmed/20429873). So what was the P-value of the Cox regression (read: ADJUSTED) analysis in the Papazian article? It was 0.04. This is hardly the kind of P-value that Carl Sagan would have accepted as Extraordinary Evidence.
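One way to put a number on "extraordinary" is the minimum Bayes factor bound of Sellke, Bayarri, and Berger - my choice of yardstick, not anything from the Papazian paper itself - sketched here:

```python
# One calibration of how much evidence a P-value can possibly carry: the
# -e*p*ln(p) bound (Sellke/Bayarri/Berger). My choice of yardstick, not
# anything from the Papazian paper.
from math import e, log

def max_evidence_against_null(p):
    """Upper bound on the Bayes factor (H1 vs H0) achievable by a P-value p < 1/e."""
    return 1.0 / (-e * p * log(p))

for p in (0.04, 0.005, 0.0001):
    print(f"P = {p}: data are at most ~{max_evidence_against_null(p):.0f}x "
          f"more likely under H1 than under H0")
# P = 0.04 buys you, at best, roughly 3-to-1 evidence against the null -
# hardly Sagan-grade extraordinary evidence for a mortality effect in ARDS.
```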

The correspondence regarding this article in the December 23rd NEJM (see: http://www.nejm.org/doi/full/10.1056/NEJMc1011677) got me to thinking again about this article. It emphasized the striking sedation practices used in this trial: patients were sedated to a Ramsay score of 6 (no response to glabellar tap) prior to randomization - the highest score on the Ramsay scale. Then they received Cis-at or placebo. Thus the Cis-at group could not, for 48 hours, "fight the vent," while the placebo group could, thereby inducing practitioners to administer more sedation. Could it be that Cis-at simply saves you from oversedation, much as intensive insulin therapy (IIT) a la 2001 Leuven protocol saved you from the deleterious effects of massive dextrose infusion after cardiac surgery?

To explore this possibility further, one needs to refer to Table 9 in the supplementary appendix of the Papazian article (see: http://www.nejm.org/doi/suppl/10.1056/NEJMoa1005372/suppl_file/nejmoa1005372_appendix.pdf ), which tabulates the total sedative doses used in the Cis-at and placebo groups DURING THE FIRST SEVEN (7) DAYS OF THE STUDY. Now, why 7 days was chosen - when the KM curves separate at 14 days (as my former colleagues O'Brien and Prescott pointed out here: http://f1000.com/5240957 ), and when the study reported data on other outcomes at 28 and 90 days - remains a mystery to me. I have e-mailed the corresponding author to see if he can/will provide data on sedative doses further out. I will post any updates as further data become available. Suffice it to say that I'm not going to be satisfied unless sedative doses further out are equivalent.

Scrutiny of Table 9 in the SA leads to some other interesting discoveries, such as the massive doses of ketamine used in this study - a practice that does not exist in the United States, as well as strong trends toward increased midazolam use in the placebo group. And if you believe Wes Ely's and others' data on benzodiazepine use, and its association with delirium and mortality, one of your eyebrows might involuntarily rise. Especially when you consider that the TOTAL sedative dose administered between groups is an elusive sum, because equivalent doses of all the various sedatives are unknown and the total sedative dose calculation is insoluble.

Saturday, September 25, 2010

In the same vein: Intercessory Prayer for Heart Surgery and Neuromuscular Blockers for ARDS

Several years back, the American Heart Journal published a now widely referenced study of intercessory prayer to aid the recovery of patients who had had open heart surgery (see: Am Heart J. 2006 Apr;151(4):934-42). This study was amusing for several reasons, not least of which is that, in spite of being funded by a religious organization, the results were "negative," meaning that there was no apparent positive effect of prayer. Of course, the "true believers" called foul, claiming that the design was flawed, etc. (Another ironic twist of the study: patients who knew they were being prayed for actually fared worse than those who had received no prayers.)

The most remarkable thing about this study for me is that it was scientifically irresponsible to conduct it. Science (and biomedical research) must be guided by testing a defensible hypothesis, based on logic, historical and preliminary data, and, in the case of biomedical research, an understanding of the underlying pathophysiology of the disease process under study. Where there is no scientifically valid reason to believe that a therapy might work, no preliminary data - nothing - a hypothesis based on hope or faith has no defensible justification in biomedical research, and its study is arguably unethical.

Moreover, a clinical trial is in essence a diagnostic test of a hypothesis, and the posterior probability of a hypothesis (null or alternative) depends not only on the frequentist data produced by the trial, but also on a Bayesian analysis incorporating the prior probability that the alternative (or null) hypothesis is true (or false). That is, if I conducted a trial of orange juice (OJ) for the treatment of sepsis (another unethical design) and OJ appeared to reduce sepsis mortality by, say, 10% with P=0.03, you should be suspicious. With no biologically plausible reason to believe that OJ might be efficacious, the prior probability of Ha (that OJ is effective) is very low, and a P-value of 0.03 (or even 0.001) is unconvincing. That is, the less compelling the general idea supporting the hypothesis is, the more robust a P-value you should require to be convinced by the data from the trial.

Thus, a trial wherein the alternative hypothesis tested has a negligible probability of being true is uninformative and therefore unethical to conduct. In a trial such as the intercessory prayer trial, there is NO resultant P-value which is sufficient to convince us that the therapy is effective - in effect, all statistically significant results represent Type I errors, and the trial is useless.
(I should take a moment here to state that, ideally, the probability of Ho and Ha should both be around 50%, or not far off, representing true equipoise about the scenario being studied. Based on our data in the Delta Inflation article (see: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2887200/ ), it appears that, at least in critical care trials evaluating comparative mortality, the prior probability of Ha is on the order of 18%, and even that figure is probably inflated because many of the trials that comprise it represent Type I errors. In any case, it is useful to consider the prior probability of Ha before considering the data from a trial, because that prior is informative. [And in the case of trials of biologics for the treatment of sepsis {be it OJ or drotrecogin, or anti-TNF-alpha}, the prior probability that any of them is efficacious is almost negligibly low.])
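To make that bookkeeping concrete, here is a minimal sketch using the ~18% pre-study probability from our Delta Inflation data and conventional (assumed) alpha and power as the likelihood ratio of a bare "positive" result:

```python
# Minimal Bayesian bookkeeping for a "positive" trial, using the ~18% pre-study
# probability of Ha from the Delta Inflation data and conventional (assumed)
# alpha and power as the likelihood ratio of a bare "P < alpha" result.
def posterior_prob(prior, alpha=0.05, power=0.80):
    prior_odds = prior / (1 - prior)
    lr_positive = power / alpha              # how much a bare "significant" result updates the odds
    post_odds = prior_odds * lr_positive
    return post_odds / (1 + post_odds)

print(f"Critical care mortality trial, prior ~0.18: posterior ~ {posterior_prob(0.18):.2f}")
print(f"OJ-for-sepsis, prior ~0.01:                 posterior ~ {posterior_prob(0.01):.2f}")
# ~0.78 versus ~0.14: the same "statistically significant" result is persuasive in
# one case and close to worthless in the other. The prior does the heavy lifting.
```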

Which segues me to Neuromuscular Blockers (NMBs) for ARDS (see: http://www.nejm.org/doi/full/10.1056/NEJMoa1005372 ) - while I have several problems with this article, my most grievous concern is that we have no (in my estimation) substantive reason to believe that NMBs will improve mortality in ARDS. They may improve oxygenation, but we long ago abandoned the notion that oxygenation is a valid surrogate end-point in the management of ARDS. Indeed, the widespread abandonment of use of NMBs in ARDS reflects consensus agreement among practitioners that NMBs are on balance harmful. (Note in Figure 1 that, in contrast to the contention of the authors in the introduction that NMBs remain widely used, only 4.3% of patients were excluded because of use of NMBs at baseline.)

In short, these data fail to convince me that I should be using NMBs in ARDS. But many readers will want to know "then why was the study positive?" And I think the answer is staring us right in the face. In addition to the possibility of a simple Type I error, and the fact that the analysis was done with a Cox regression, controlling for baseline imbalances (even ones such as PF ratio which were NOT prospectively defined as variables to control for in the analysis), the study was effectively unblinded/unmasked. It is simply not possible to mask the use of NMBs, the clinicians and RNs will quickly figure out who is and is not paralyzed - paralyzed patients will "ride the vent" while unparalyzed ones will "fight the vent". And differences in care may/will arise.

It is the simplest explanation, and I wager it's correct. I will welcome data from other trials if they become available (should it even be studied further?), but in the meantime I don't think we should be giving NMBs to patients with ARDS any more than we should be praying (or avoiding prayer) for the recovery of open-heart patients.

Friday, August 20, 2010

Heads I Win, Tails it's a Draw: Rituximab, Cyclophosphamide, and Revising CONSORT



The recent article by Stone et al in the NEJM (see: http://www.nejm.org/doi/full/10.1056/NEJMoa0909905 ), which appears to [mostly] conform to the CONSORT recommendations for the conduct and reporting of NIFTs (non-inferiority trials, often abbreviated NIFs, but I think NIFTs ["Nifties"] sounds cooler), allowed me to realize that I fundamentally disagree with the CONSORT statement on NIFTs (see JAMA, http://jama.ama-assn.org/cgi/content/abstract/295/10/1152 ) and indeed the entire concept of NIFTs. I have discussed previously in this blog my disapproval of the asymmetry with which NIFTs are designed such that they favor the new (and often proprietary agent), but I will use this current article to illustrate why I think NIFTs should be done away with altogether and supplanted by equivalence trials.

This study rouses my usual and tired gripes about NIFTs: too large a delta, no justification for delta, use of intention-to-treat rather than per-protocol analysis, etc. It also describes a suspicious statistical maneuver which I suspect is intended to infuse the results (in favor of Rituximab/Rituxan) with extra legitimacy in the minds of the uninitiated: instead of simply stating (or showing with a plot) that the 95% CI excludes delta, thus making Rituxan non-inferior, the authors tested the hypothesis that the lower 95.1% CI boundary differs from delta, a test which results in a very small P-value (<0.001). This procedure adds nothing to the confidence interval in terms of interpretation of the results, but seems to imbue them with an unassailable legitimacy - the non-inferiority hypothesis is trotted around as if iron-clad by this minuscule P-value, which is really just superfluous and gratuitous.

But I digress - time to focus on the figure. Under the current standards for conducting a NIFT, in order to be non-inferior, you simply need a 95% CI for the preferred [and usually proprietary] agent with an upper boundary which does not include delta in favor of the comparator (scenario A in the figure). For your preferred agent to be declared inferior, the LOWER 95% CI for the difference between the two agents must exclude the delta in favor of the comparator (scenario B in the figure.) For that to ever happen, the preferred/proprietary agent is going to have to be WAY worse than standard treatment. It is no wonder that such results are very, very rare, especially since deltas are generally much larger than is reasonable. I am not aware of any recent trial in a major medical journal where inferiority was declared. The figure shows you why this is the case.

Inferiority is very difficult to declare (the deck is stacked this way on purpose), but superiority is relatively easy to declare, because for superiority your 95% CI doesn't have to exclude an obese delta, but rather must just exclude zero with a point estimate in favor of the preferred therapy. That is, you don't need a mirror image of the 95% CI that you need for inferiority (scenario C in the figure), you simply need a point estimate in favor of the preferred agent with a 95% CI that does not include zero (scenario D in the figure). Looking at the actual results (bottom left in the figure), we see that they are very close to scenario D and that they would have only had to go a little bit more in favor of rituxan for superiority to have been able to be declared. Under my proposal for symmetry (and fairness, justice, and logic), the results would have had to be similar to scenario C, and Rituxan came nowhere near to meeting criteria for superiority.
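To see the asymmetry in one screenful, here is a sketch of the two sets of rules as I have described them, applied to a hypothetical confidence interval and its mirror image (the sign convention and the numbers are mine, not Stone et al's):

```python
# Sketch of the asymmetry, under my own sign convention: effect > 0 favors the new
# agent, effect < 0 favors the comparator, and delta is the non-inferiority margin.
def consort_style_verdict(lower, upper, delta):
    """Labels as I read the current NIFT conventions described above."""
    if lower > 0:
        return "new agent SUPERIOR (only has to exclude zero)"
    if upper < -delta:
        return "new agent INFERIOR (has to clear the whole margin)"
    if lower > -delta:
        return "non-inferior"
    return "inconclusive"

def symmetric_verdict(lower, upper, delta):
    """The symmetric (equivalence-style) rule argued for in this post."""
    if lower > delta:
        return "new agent superior"
    if upper < -delta:
        return "new agent inferior"
    if -delta < lower and upper < delta:
        return "equivalent"
    return "inconclusive"

delta = 0.20
ci_favoring_new = (0.01, 0.25)       # hypothetical result leaning toward the new agent
ci_mirror_image = (-0.25, -0.01)     # the exact mirror image, leaning toward the comparator
for ci in (ci_favoring_new, ci_mirror_image):
    print(ci, "->", consort_style_verdict(*ci, delta), "|", symmetric_verdict(*ci, delta))
# Under the current rules the first CI is "superior" while its mirror image is merely
# "inconclusive"; under the symmetric rule the two are treated identically.
```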

The reason it makes absolutely no sense to allow this asymmetry can be demonstrated by imagining a counterfactual (or two). Suppose that the results had been exactly the same, but that they had favored Cytoxan (cyclophosphamide) rather than Rituxan - that is, Cytoxan was associated with an 11% improvement in the primary endpoint. This is represented by scenario E in the figure; and since the 95% CI includes delta, the result is "inconclusive" according to CONSORT. So how can it be that the classification of the result changes depending on which agent we arbitrarily (a priori, before knowing the results) declare to be the preferred one? That makes no sense, unless you're more interested in declaring victory for a preferred agent than you are in discovering the truth, and of course, you can guess my inferences about the motives of the investigators and sponsors in many/most of these studies. In another counterfactual example, scenario F in the figure represents the mirror image of scenario D, which represented the minimum result that would have allowed Stone et al to declare that Rituxan was superior. But if the results had favored Cytoxan by that much, we would have had another "inconclusive" result, according to CONSORT. Allowing this is just mind-boggling, maddening, and unjustifiable!

Given this "heads I win, tails it's a draw" asymmetry, it's no wonder that NIFTs are proliferating. It's time we stop accepting them and require that non-inferiority hypotheses be symmetrical - in essence, making equivalence trials the standard operating procedure, and requiring the same standards for superiority as we require for inferiority.

Friday, July 16, 2010

Hyperoxia is worse than Hypoxia after cardiac arrest?

This blog was in part borne of an attempt to reduce the number of letters I sent to the editors of NEJM and JAMA.....but sometimes I still find it irresistible. The editors of JAMA, however, were able to resist this letter, and it was rejected, so I post it here.

To the Editor: Kilgannon et al (http://jama.ama-assn.org/cgi/content/abstract/303/21/2165?maxtoshow=&hits=10&RESULTFORMAT=&fulltext=hyperoxia&searchid=1&FIRSTINDEX=0&resourcetype=HWCIT) report the provocative results of an observational study of the outcome of post-arrest patients as a function of the first oxygen tension measured in the ICU. Unfortunately, the definitions they chose for categorizing oxygen tension introduce confounding which complicates the interpretation of their analysis. By classifying patients with a PF ratio less than 300 but a normal oxygen tension as having hypoxia, lung injury (an organ failure which may itself be an independent predictor of poor outcomes) is confounded with hypoxia. Given the hypothesis guiding the analysis, namely that the oxygen tension to which the brain is exposed influences mortality, we find this choice curious. If patients with a normal oxygen tension but a reduced PF ratio were not classified as hypoxic, they would have been included in the normoxia group, and the results of the overall analysis might change. Such potential misclassification is important to consider given that the reasons patients were managed with hyperoxia cannot be known because of the observational nature of the study - did such patients experience less active management and titration of FiO2? Is hyperoxia a marker of a more laissez-faire approach to ventilatory management? PEEP, an important determinant of oxygen tension, is also not known, and this could markedly influence the classification of patients in the scheme chosen by the authors. It would be helpful to know how the results of the analysis might change if patients with lung injury (PF ratio < 300) but normal oxygen tension were reclassified as normoxic in the analysis.

Sunday, May 16, 2010

What do erythropoietin, strict rate control, torcetrapib, and diagonal earfold creases have in common? The normalization heuristic


I was pleased to see the letters to the editor in the May 6th edition of the NEJM regarding the article on the use of synthetic erythropoietins (see http://content.nejm.org/cgi/content/extract/362/18/1742 ). The letter writers must have been reading our paper on the normalization heuristic (NH)! (Actually, I doubt it. It's in an obscure journal. But maybe they should.)

In our article (available here: http://www.medical-hypotheses.com/article/S0306-9877(09)00033-4/abstract ), we operationalized the definition of the NH and attributed its fallibility as a general clinical hypothesis to four errors in reasoning. Here is the number one reasoning error:

Where the normalization hypothesis is based on the assumption that the abnormal value is causally related to downstream patient outcomes, but in reality the abnormal value is not part of the causal pathway and rather is an epiphenomenon of another underlying process.

The authors of some of the letters to the editor of the NEJM have the same concerns about normalizing hemoglobin values, and the assumptions that this practice involves about our understanding of causal pathways. Which is what I want to focus on. So please turn your attention to, yes, the picture of the billiards.

I wager that the pathophysiological processes that occur in the body are more complex than the 16 balls in the photo, but it serves as a great analogy for understanding the limitations of what we know about what's going on in the body. Suppose that every time (or a high percentage of the time - we can probability adjust and not lose the meaning of the analogy) the cue ball, launched from the same spot at the same speed and angle, hits the 1--2--4--7--11 balls. We know the 11 ball is, say, cholesterol. We have figured this out. And it falls in the corner pocket - it gets "lower". But we don't know what the other balls represent, or even how many of them there are, or where they fall. We needn't know all of this to make some inferences. We see that when the cue ball is launched at a certain speed and angle, the 11 ball, cholesterol, falls. So we think we understand cholesterol. But the playing field is way more complex than the initiating event and the one final thing that we happen to be watching or measuring - the corner pocket. In the whole body, we don't even know how many balls and how many pockets we're dealing with! We only can see what we know to look for!

Suppose also that as a consequence of this cascade, the 7 ball hits the 12 ball, which falls in another corner pocket. We happen to be watching that pocket also. We know what it does. For lack of a better term, let's call it the "reduced cardiovascular death pocket." Every time this sequence of balls is hit, cholesterol (number 11) falls in one corner pocket, and the 12 ball falls in another pocket, and we infer that cholesterol is part of the causal pathway to cardiovascular death. But look carefully at the diagram. We can remove the 11 ball altogether, the 7 ball will still hit the 12 and sink it thus reducing cardiovascular death. So it's not the cholesterol at all! We misunderstood the causal pathway! It's not cholesterol falling per se, but rather some epiphenomenon of the sequence.

By now, you've inferred who is breaking. His name is atorvastatin (which I fondly refer to as the Lipid-Torpedo). When a guy called torcetrapib breaks, all hell breaks loose. We learn that there's another pocket called "increased cardiovascular death pocket" and balls start falling into there.

(A necessary aside here - I am NOT challenging the cholesterol hypothesis here. It may or may not be correct, and I certainly am not the one to figure that out. I merely wish to emphasize how we COULD make incorrect inferences about causal pathways.)

So when I see an article like there was a couple of weeks ago in the NEJM (see http://content.nejm.org/cgi/content/abstract/362/15/1363 ) about "strict rate control" for atrial fibrillation (AF), I am not surprised that it doesn't work. I am not surprised that there are processes going on in a patient with AF that we can't even begin to understand. And the coincidental fact that we can measure heart rate and control it does not mean that we're interrupting the causal pathway that we wish to.

A new colleague of mine told me the other day of a joke he likes to make that causes this to all resonate harmoniously - "We don't go around trying to iron out diagonal earfold creases to reduce cardiovascular mortality." But show us a sexy sequence of molecular cascades that we think we understand, and the sirens begin to sing their irresistible song.

Saturday, May 1, 2010

Everyone likes their own brand - Delta Inflation: A bias in the design of RCTs in Critical Care


At long last, our article describing a bias in the design of RCTs in Critical Care Medicine (CCM) has been published (see: http://ccforum.com/content/14/2/R77 ). Interested readers are directed to the original manuscript. I'm not in the business of criticising my own work on my own blog, but I will provide at least a summary.

When investigators design a trial and do power and sample size (SS) calculations, they must estimate or predict a priori what the [dichotomous] effect size will be, in say, mortality (as is the case with most trials in CCM). This number should ideally be based upon preliminary data, or a minimal clinically important difference (MCID). Unfortunately, it does not usually happen that way, and investigators rather choose a number of patients that they think they can recruit with available funding, time and resources, and they calculate the effect size that they can find with that number of patients at 80-90% power and (usually) an alpha level of 0.05.
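Here is a sketch of that back-door arithmetic, using the usual normal-approximation sample size formula with illustrative numbers of my own choosing:

```python
# The back-door arithmetic described above: fix the n you think you can recruit,
# then solve for the delta you are "powered" to find. Normal-approximation formula,
# illustrative numbers of my own choosing.
from statistics import NormalDist

def n_per_arm(p_control, delta, alpha=0.05, power=0.80):
    """Patients per arm to detect an absolute risk reduction of `delta`."""
    z = NormalDist().inv_cdf
    p_treat = p_control - delta
    za, zb = z(1 - alpha / 2), z(power)
    var = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return (za + zb) ** 2 * var / delta ** 2

def detectable_delta(n, p_control, alpha=0.05, power=0.80):
    """Smallest absolute risk reduction detectable with n patients per arm."""
    delta = 0.001
    while n_per_arm(p_control, delta, alpha, power) > n:
        delta += 0.001
    return delta

for n in (100, 300, 1000):
    d = detectable_delta(n, p_control=0.40)
    print(f"{n} patients/arm, 40% control mortality -> 'predicted' delta ~ {d:.0%}")
# ~18%, ~11%, and ~6% absolute mortality reductions, respectively: small trials are
# forced to "predict" implausibly large effects, which is delta inflation in a nutshell.
```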

If power and SS calculations were being performed ideally, and investigators were using preliminary data or published data on similar therapies to predict delta for their trial, we would expect that, over the course of many trials, they would be just as likely to OVERESTIMATE observed delta as to underestimate it. If this were the case, we would expect random scatter around a line representing unity in a graph of observed versus predicted delta (see Figure 1 in the article). If, on the other hand, predicted delta uniformly exceeds observed delta, there is directional bias in its estimation. Indeed, this is exactly what we found. This is no different from the weatherman consistently overpredicting the probability of precipitation, a horserace handicapper consistently setting too long odds on winning horses, or Tiger Woods consistently putting too far to the right. Bias is bias. And it is unequivocally demonstrated in Figure 1.

Another point, which we unfortunately failed to emphasize in the article, is that if the predicted deltas were being guided by a MCID, well, the MCID for mortality should be the same across the board. It is not (Figure 1 again). It ranges from 3%-20% absolute reduction in mortality. Moreover, in Figure 1, note the clustering around numbers like 10% - how many fingers or toes you have should not determine the effect size you seek to find when you design an RCT.

We surely hope that this article stimulates some debate in the critical care community about the statistical design of RCTs, and indeed about what primary endpoints are chosen for them. It seems that chasing mortality is looking more and more like a fool's errand.

Caption for Figure 1:
Figure 1. Plot of observed versus predicted delta (with associated 95% confidence intervals for observed delta) of 38 trials included in the analysis. Point estimates of treatment effect (deltas) are represented by green circles for non-statistically significant trials and red triangles for statistically significant trials. Numbers within the circles and triangles refer to the trials as referenced in Additional file 1. The blue ‘unity line’ with a slope equal to one indicates perfect concordance between observed and predicted delta; for visual clarity and to reduce distortions, the slope is reduced to zero (and the x-axis is horizontally expanded) where multiple predicted deltas have the same value and where 95% confidence intervals cross the unity line. If predictions of delta were accurate and without bias, values of observed delta would be symmetrically scattered above and below the unity line. If there is directional bias, values will fall predominately on one side of the line as they do in the figure.

If you want a fair shake, you gotta get a trach (and the sooner the better)

In the most recent issue of JAMA, Terragni et al report the results of an Italian multicenter trial of early (6-8 days) versus delayed (13-15 days) tracheostomy for patients requiring mechanical ventilation (http://jama.ama-assn.org/cgi/content/short/303/15/1483 ). This research complements and continues a line of investigation of early tracheostomy in RCTs begun by Rumbak et al in 2004. In that earlier trial (http://journals.lww.com/ccmjournal/Abstract/2004/08000/A_prospective,_randomized,_study_comparing_early.9.aspx ) the authors showed that [very] early tracheostomy (at 48 hours into a patient's illness) compared to delayed tracheostomy (after 14 days of illness) led to reduced mortality, pneumonia, sepsis, and other complications of critical illness. The mediators of this effect are not known with certainty, but may be related to the effects of reduced sedation requirements with tracheostomy, reduced dead space, facilitation of weaning from mechanical ventilation, or psychological effects on the patients or the physicians caring for them. Regardless of the mediators of the effect, I can say confidently from an anecdotal perspective that the effects appear to be robust. Almost as if by magic, something changes after a patient gets a tracheostomy, and recovery appears to accelerate - patients just "look better" and they appear to get better faster. (Removal of the endotracheal tube (ETT) allows spitting and swallowing, activities which will be required to protect the airway when all artificial airways are removed; it allows lip-reading by families and providers and in some cases speech; it allows sedation to be reduced expeditiously; it facilitates weaning; it allows easier positioning out of bed in a chair and working with physical therapy during weaning; the list goes on...)

What are the drawbacks to such an approach? Traditionally, a tracheostomy has been viewed by practitioners as the admission of a failure of medical care - we couldn't get you better fast, with a temporary airway, so we had to "resort" to a semi-permanent or permanent [surgical] airway. Moreover, a tracheostomy was traditionally a surgical procedure requiring transportation to the operating suite, although that has changed with the advent of the percutaneous dilatational approach. Nonetheless, whichever route is used to establish the tracheostomy, certain immediate and delayed risks are inherent in the procedure, and the costs are greater. So, the basic question we would like to answer is "are there benefits of tracheostomy that outweigh these risks?"

There were several criticisms of the Rumbak study which I will not elaborate upon here, but suffice it to say that the study did not lead to a sweeping change in practice with regard to the timing of tracheostomies, and thus additional studies were planned and performed. One such study, referenced by last week's JAMA article, enrolled only 25% of the anticipated sample of patients, with resulting wide confidence intervals. As a result, few conclusions can be drawn from that study, but it did not appear to show a benefit to earlier tracheostomy (http://journals.lww.com/ccmjournal/Abstract/2004/08000/A_prospective,_randomized,_study_comparing_early.9.aspx ). A meta-analysis which included "quasi-randomized" studies (GIGO: Garbage In, Garbage Out) concluded that while early tracheostomy did not reduce mortality or pneumonia, it did reduce the duration of mechanical ventilation and ICU stay. (It seems likely to me that if you stay on the ventilator and in the ICU for a shorter period - given the time/dose-dependent effect of these things on complications such as catheter-related bloodstream infections and ventilator-associated pneumonia (VAP) - then these outcomes, and outcomes further downstream such as mortality, WOULD be affected by early tracheostomy; but the further downstream an outcome is, the more it gets diluted out, and the larger a study you need to demonstrate a significant effect.)
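A toy illustration of that dilution (numbers invented): because required sample size scales roughly with one over delta squared, an effect halved on its way downstream needs about four times as many patients to demonstrate.

```python
# Toy illustration (numbers invented) of "dilution": required n scales roughly with
# 1/delta^2, so an upstream effect that is halved by the time it reaches mortality
# needs about four times as many patients to demonstrate.
upstream_arr = 0.10   # e.g., an absolute reduction in an upstream outcome like VAP
diluted_arr = 0.05    # the same effect after dilution into a downstream outcome like mortality
relative_n = (upstream_arr / diluted_arr) ** 2
print(f"Halving the detectable effect multiplies the required sample size ~{relative_n:.0f}-fold")
```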

Thus, to try to resolve these uncertainties, we have the JAMA study from last week. This study was technically "negative." But in it, every single pre-specified outcome (VAP, ventilator-free days, ICU-free days, mortality) trended (some significantly) in favor of early tracheostomy. The choice of VAP as a primary outcome (P-value for early versus delayed trach 0.07) is both curious and unfortunate. VAP is notoriously difficult to diagnose and differentiate from other causes of infection and pulmonary abnormalities in mechanically ventilated ICU patients (see http://content.nejm.org/cgi/content/abstract/355/25/2619 ) - it is a "soft" outcome for which no gold standard exists. Therefore, the signal-to-noise ratio for this outcome is liable to be low. What's perhaps worse, the authors used the Clinical Pulmonary Infection Score (CPIS, or Pugin Score: see Pugin et al, AJRCCM, 1991, Volume 143, 1121-1129) as the sole means of diagnosing VAP. This score, while conceptually appealing, has never been validated in such a way that its positive and negative predictive values are acceptable for routine use in clinical practice (it is not widely used), or for a randomized controlled trial (see http://ajrccm.atsjournals.org/cgi/content/abstract/168/2/173 ). Given this, and the other strong trends and significant secondary endpoints in this study, I don't think we can dichotomize it as "negative" - reality is just more complicated than that.

I feel about this trial, which failed its primary endpoint, much the same as I felt about the Levo versus Dopa article a few weeks back. Multiple comparisons, secondary endpoints, and marginal P-values notwithstanding, I think that from the perspective of a seriously ill patient or a provider, especially a provider with strong anecdotal experience that appears to favor earl(ier) tracheostomy, the choice appears to be clear: "If you want a fair shake, you gotta get a trach."

Tuesday, March 23, 2010

"Prospective Meta-analysis" makes as much sense as "Retrospective Randomized Controlled Trial"

A recent article in JAMA ( http://jama.ama-assn.org/cgi/content/abstract/303/9/865) reports a meta-analysis of (three) trials comparing a strategy of higher versus lower PEEP (positive end-expiratory pressure) in Acute Lung Injury (ALI – a less severe form of lung injury) and Acute Respiratory Distress Syndrome (ARDS – a more severe form, at least as measured by oxygenation, one facet of its effects on physiology). The results of this impeccably conducted analysis are interesting enough (High PEEP is beneficial in a pre-specified subgroup analysis of ARDS patients, but may be harmful in the subgroup with less severe ALI), but I am more struck by the discussion as it pertains to the future of trials in critical care medicine – a discussion which was echoed by the editorialist (http://jama.ama-assn.org/cgi/content/extract/303/9/883 ).

The trials included in this meta-analysis lacked statistical precision for two principal reasons: 1.) they used the typical cookbook approach to sample size determination, choosing a delta of 10% without any justification whatsoever for this number (thus the studies were guilty of DELTA INFLATION); 2.) according to the authors of the meta-analysis, two of the three trials were stopped early for futility, thus further decreasing the statistical precision of already effectively underpowered trials. The resulting 95% CIs for delta in these trials thus ranged from -10% (in the ARDSnet ALVEOLI trial; i.e., high PEEP may increase mortality by up to 10%) to +10% (in the Mercat and Meade trials; i.e., high(er) PEEP may decrease mortality by upwards of 10%).

Because of the lack of statistical precision of these trials, the authors of the meta-analysis appropriately used individual patient data from the trials as meta-analytical fodder, with a likely useful result – high PEEP is probably best reserved for the sickest patients with ARDS, and avoided for those with ALI. (Why there is an interaction between severity of lung injury and response to PEEP is open for speculation, and is an interesting topic in itself.) What interests me more than this main result is the authors' and editorialist's suggestion that we should be doing “prospective meta-analyses” or at least design our trials so that they easily lend themselves to this application should we later decide to do so. Which begs the question: why not just make a bigger trial from the outset, choosing a realistic delta and disallowing early stopping for “futility”?

(It is useful to note that the term futility is unhappily married to - or, better yet, enslaved by - alpha, the threshold P-value for statistical significance. A trial is deemed futile if there is no hope of crossing the alpha/P-value threshold. But it is certainly not futile to continue enrolling patients if each additional accrual increases the statistical precision of the final result by narrowing the 95% CI of delta. Indeed, I'm beginning to think that the whole concept of "futility" is a specious one - unless you're a funding agency.)
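To make that precision point concrete, here is a minimal Python sketch; the mortality rates and sample sizes are illustrative assumptions of my own, not data from ALVEOLI, Mercat, or Meade. It shows how the 95% CI around an observed 4% absolute mortality difference narrows as enrollment continues:

from math import sqrt

def rd_ci(p1, p2, n_per_arm, z=1.96):
    """Wald 95% CI for the risk difference p1 - p2 with n patients per arm."""
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
    return diff - z * se, diff + z * se

# Illustrative: control mortality 40%, treatment 36% (a 4% absolute difference)
for n in (150, 300, 600, 1200, 2400):
    lo, hi = rd_ci(0.40, 0.36, n)
    print(f"n per arm = {n:4d}: 95% CI for delta = {lo:+.3f} to {hi:+.3f}")

The interval shrinks with the square root of the sample size - and that shrinkage is exactly the precision that stopping "for futility" forfeits.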

Large trials may be cumbersome, but they are not impossible. The SAFE investigators (http://content.nejm.org/cgi/content/abstract/350/22/2247 ) enrolled ~7000 patients seeking a delta of 3% in a trial involving 16 ICUs in two countries. Moreover, a prospective meta-analysis doesn't reduce the number of patients required; it simply splits the population into quanta and epochs, which will hinder homogeneity in the meta-analysis if enrollment and protocols are not standardized or if temporal trends in care and outcomes come into play. If enrollment and protocols ARE standardized, it is useful to ask: why not just do one large study from the outset, with a realistic delta and sample size? Why not coordinate all the data (American, French, Canadian, whatever) through a prospective RCT rather than a prospective meta-analysis?

Here's my biggest gripe with the prospective meta-analysis - in essence, you are taking multiple looks at the data, one look after each trial is completed (I'm not even counting intra-trial interim analyses), but you're not correcting for the multiple comparisons. And most likely, once there is a substantial positive trial, it will not be repeated, for a number of reasons such as all the hand-waving about it not being ethical to repeat it and randomize people to no treatment (repeatability being one of the cardinal features of science notwithstanding). Think ARMA (http://content.nejm.org/cgi/content/extract/343/11/812 ). There were smaller trials leading up to it, but once ARMA was positive, no additional noteworthy trials sought to test low tidal volume ventilation for ARDS. So, if we're going to stop conducting trials for our "prospective meta-analysis", what will our early stopping rule be? When will we stop our sequence of trials? Will we require a P-value of 0.001 or less after the first look at the data (that is, after the first trial is completed)? Doubtful. As soon as a significant result is found in a soundly designed trial, further earnest trials of the therapy will cease and victory will be declared. Only when there is a failure or a "near-miss" will we want a "do-over" to create more fodder for our "prospective meta-analysis". We will keep chasing the result we seek until we find it, nitpicking design and enrollment details of "failed" trials along the way to justify the continued search for the "real" result with a bigger and better trial.

If we’re going to go to the trouble of coordinating a prospective meta-analysis, I don’t understand why we wouldn’t just coordinate an adequately powered RCT based on a realistic delta (itself based on an MCID or preliminary data), and carry it to its pre-specified enrollment endpoint, “futility stopping rules” be damned. With the statistical precision that would result, we could examine the 95% CI of the resulting delta to answer the practical questions that clinicians want answers for, even if our P-value were insufficient to satisfy the staunchest of statisticians. Perhaps the best thing about such a study is that the force of its statistical precision would incapacitate single center trialists, delta inflationists, and meta-analysts alike.

Friday, March 5, 2010

Levo your Dopa at the Door - how study design influences our interpretation of reality

Another excellent critical care article was published this week in NEJM, the SOAP II study: http://content.nejm.org/cgi/content/short/362/9/779 . In this RCT of norepinephrine (norepi, levophed, or "levo" for short) versus dopamine ("dopa" for short) for the treatment of shock, the authors tried to resolve the longstanding uncertainty and debate surrounding the treatment of patients in various shock states. Proponents of any agent in this debate have often hung their hats on extrapolations of physiological and pharmacological principles to intact humans, leading to colloquialisms such as "leave-em-dead" for levophed and "renal-dose dopamine". This blog has previously emphasized the frailty of pathophysiological reasoning, the same reasoning which has irresistibly drawn cardiologists and nephrologists to dopamine because of its presumed beneficial effects on cardiac and urine output, and, by association, outcomes.

Hopefully all docs with a horse in this race will take note of the outcome of this study. In its simplest and most straightforward and technically correct interpretation, levo was not superior to dopa in terms of an effect on mortality, but was indeed superior in terms of side effects, particularly cardiac arrhythmias (a secondary endpoint). The direction of the mortality trend was in favor of levo, consistent with observational data (the SOAP I study by many of the same authors) showing reduced mortality with levo compared with dopa in the treatment of shock. As followers of this blog also know, the interpretation of "negative" studies (that is, MOST studies in critical care medicine - more on that in a future post) can be more challenging than the interpretation of positive studies, because "absence of evidence is not evidence of absence".

We could go to the statistical analysis section, and I could harp on the choice of delta, the decision to base it on a relative risk reduction, the failure to predict a baseline mortality, etc. (I will note that at least the authors defended their delta based on prior data, something that is a rarity - again, a future post will focus on this.) But, let's just be practical and examine the 95% CI of the mortality difference (the primary endpoint) and try to determine whether it contains or excludes any clinically meaningful values that may allow us to compare these two treatments. First, we have to go to the raw data and find the 95% CI of the ARR, because the Odds Ratio can inflate small differences as you know. That is, if the baseline is 1%, then a statistically significant increase in odds of 1.4 is not meaningful because it represents only a 0.4% increase in the outcome - minuscule. With Stata, we find that the ARR is 4.0%, with a 95% CI of -0.76% (favors dopamine) to +8.8% (favors levo). Wowza! Suppose we say that a 3% difference in mortality in either direction is our threshold for CLINICAL significance. This 95% CI includes a whole swath of values between 3% and 8.8% that are of interest to us and they are all in favor of levo. (Recall that perhaps the most lauded trial in critical care medicine, the ARDSnet ARMA study, reduced mortality by just 9%.) On the other side of the spectrum, the range of values in favor of dopa is quite narrow indeed - from 0% to -0.76%, all well below our threshold for clinical significance (that is, the minimal clinically important difference or MCID) of 3%. So indeed, this study surely seems to suggest that if we ever choose between these two widely available and commonly used agents, the cake goes to levo, hands down. I hardly need a statistically significant result with a 95% CI like this one!
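For readers without Stata, the arithmetic is easy to reproduce. Here is a rough Python sketch using my approximations of the published SOAP II proportions and group sizes - treat the numbers as placeholders to be swapped for the exact counts in the paper:

from math import sqrt

def arr_ci(p_ctrl, p_trt, n_ctrl, n_trt, z=1.96):
    """ARR (control minus treatment) with a Wald 95% CI."""
    arr = p_ctrl - p_trt
    se = sqrt(p_ctrl * (1 - p_ctrl) / n_ctrl + p_trt * (1 - p_trt) / n_trt)
    return arr, arr - z * se, arr + z * se

# Approximate SOAP II figures: 28-day mortality ~52.5% with dopamine (n ~858)
# versus ~48.5% with norepinephrine (n ~821)
arr, lo, hi = arr_ci(0.525, 0.485, 858, 821)
print(f"ARR = {arr:.1%}, 95% CI {lo:+.2%} to {hi:+.2%}")

Run with these approximations, it lands within a hair of the interval quoted above.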

So, then, why was the study deemed "negative"? There are a few reasons. Firstly, the trial is probably guilty of "delta inflation" whereby investigators seek a pre-specified delta that is larger than is realistic. While they used, ostensibly, 7%, the value found in the observational SOAP I study, they did not account for regression to the mean, or allow any buffer for the finding of a smaller difference. However, one can hardly blame them. Had they looked instead for 6%, and had the 4% trend continued for additional enrollees, 300 additional patients in each group (or about 1150 in each arm) would have been required and the final P-value would have still fallen short at 0.06. Only if they had sought a 5% delta, which would have DOUBLED the sample size to 1600 per arm, would they have achieved a statistically significant result with 4% ARR, with P=0.024. Such is the magnitude of the necessary increase in sample size as you seek smaller and smaller deltas.
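For the curious, here is a rough sketch of that sample size arithmetic in Python, using the standard normal-approximation formula for two proportions, an assumed control-arm mortality of 52.5%, alpha of 0.05, and 80% power. Because the trial's own calculation rested on somewhat different assumptions, these outputs track, but do not exactly reproduce, the figures above:

from scipy.stats import norm

def n_per_arm(p_control, delta, alpha=0.05, power=0.80):
    """Patients per arm to detect an absolute reduction of `delta` below p_control."""
    p_treat = p_control - delta
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return (z_a + z_b) ** 2 * variance / delta ** 2

for delta in (0.07, 0.06, 0.05, 0.04):
    print(f"delta = {delta:.0%}: ~{round(n_per_arm(0.525, delta))} patients per arm")

Halve the delta and you roughly quadruple the required enrollment - hence the allure of delta inflation.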

Which brings me to the second issue. If delta inflation leads to negative studies, and logistical and financial constraints prohibit the enrollment of massive numbers of patients, what is an investigator to do? Sadly, the poor investigator wishing to publish in the NEJM or indeed any peer reviewed journal is hamstrung by conventions that few these days even really understand anymore: namely, the mandatory use of 0.05 for alpha and "doubly significant" power calculations for hypothesis testing. I will not comment more on the latter other than to say that interested readers can google this and find some interesting, if arcane, material. As regards the former, a few comments.

The choice of 0.05 for the type 1 error rate (the probability that we will reject the null hypothesis based on the data and falsely conclude that one therapy is superior to the other) and of 10-20% for the type 2 error rate (power 80-90%; the probability that the alternative hypothesis is really true but we will reject it based on the data) derives from the traditional assumption, which is itself an omission bias, that it is better in the name of safety to keep new agents out of practice by having a more stringent requirement for accepting efficacy than the requirement for rejecting it. This asymmetry in the design of trials is of dubious rationality from the outset (because it is an omission bias), but it is especially nettlesome when the trial is comparing two agents already in widespread use. In a trial of a new drug compared to placebo, we want to set the hurdle high for declaring efficacy, especially when the drug might have side effects; with levo versus dopa, by contrast, the real risk is that we'll continue to consider them equivalent choices when there is strong reason to favor one over the other based either on previous or current data. This is NOT a trial of treatment versus no treatment of shock; this trial assumes that you're going to treat the shock with SOMETHING. In a trial such as this one, one could make a strong argument that a P-value of 0.10 should be the threshold for statistical significance. In my mind it should have been.

But as long as the perspicacious consumer of the literature and reader of this blog takes P-values with a grain of salt and pays careful attention to the confidence intervals and the MCID (whatever that may be for the individual), s/he will not be misled by the deeply entrenched convention of alpha at 0.05, power at 90%, and delta wildly inflated to keep the editors and funding agencies mollified.

Tuesday, February 9, 2010

Post hoc non ergo propter hoc extended: A is associated with B therefore A causes B and removal of A removes B

From Annane et al, JAMA, January 27th, 2010 (see: http://jama.ama-assn.org/cgi/content/abstract/303/4/341 ):

"...patients whose septic shock is treated with hydrocortisone commonly have blood glucose levels higher than 180. These levels have clearly been associated with marked increase in the risk of dying...Thus, we hypothesized that normalization of blood glucose levels with intensive insulin treatment may improve the outcome of adults with septic shock who are treated with hydrocortisone."

The normalization heuristic is at work again.

Endocrine interventions as adjunctive treatments in critical care medicine have a sordid history. Here are some landmarks. Rewind 25 years, and as Angus has recently described (http://jama.ama-assn.org/cgi/content/extract/301/22/2388 ) we had the heroic administration of high dose corticosteroids (e.g. gram doses of methylprednisolone) for septic shock, which therapy was later abandoned. In the 1990s, we had two concurrent trials of human growth hormone in critical illness, showing the largest statistically significant harms (increased mortality of ~20%) from a therapy in critical illness that I'm aware of (see http://content.nejm.org/cgi/content/abstract/341/11/785 ). Early in the new millennium, based on two studies that should by now be largely discredited by their successors, we had intensive insulin therapy for patients with hyperglycemia and low dose corticosteroid therapy for septic shock. It is fitting then, and at least a little ironic, that this new decade should start with publication of a study combining these latter two therapies of dubious benefit: The aptly named COIITSS study.

I know this sounds overly pessimistic, but some of these therapies, these two in particular, just need to die, but are being kept alive by the hope, optimism, and proselytizing of those few individuals whose careers were made on them or continue to depend upon them. And I lament the fact that, as a result of the promotional efforts of these wayward souls, we have been distracted from the actual data. Allow me to summarize these briefly:

1.) The original Annane study of steroids in septic shock (see: http://jama.ama-assn.org/cgi/content/abstract/288/7/862 ) utilized an adjusted analysis of a subgroup of patients not identifiable at the outset of the trial (responders versus non-responders). The entire ITT (intention to treat) population had an ADJUSTED P-value of 0.09. I calculated an unadjusted P-value of 0.29 for the overall cohort. Since you cannot know at the outset who's a responder and who's not, for a practitioner, the group of interest is the ITT population, and there was NO EFFECT in this population. Somehow, the enthusiasm for this therapy was so great that we lost sight of the reasons that I assume the NEJM rejected this article - an adjusted analysis of a subgroup. Seriously! How did we lose sight of this? Blinded by hope and excitement, and the simplicity of the hypothesis - if it's low, make it high, and everything will be better. Then Sprung and the CORTICUS folks came along (see: http://content.nejm.org/cgi/content/abstract/358/2/111 ), and, as far as I'm concerned, blew the whole thing out of the water.

2.) I remember my attending raving about the Van den Berghe article (see: http://content.nejm.org/cgi/content/abstract/345/19/1359 ) as a first year fellow at Johns Hopkins in late 2001. He said "this is either the greatest therapy ever to come to critical care medicine, or these data are faked!" That got me interested. And I still distinctly remember circling something in the methods section, which was in small print in those days, on the left hand column of the left page back almost 9 years ago - that little detail about the dextrose infusions. This therapy appeared to work in post-cardiac surgery patients on dextrose infusions at a single center. I was always skeptical about it, and then the follow-up study came out, and lo and behold, NO EFFECT! But that study is still touted by the author as a positive one! Because again, like in Annane, if you remove those pesky patients who didn't stay in the MICU for 3 days (again, like Annane, not identifiable at the outset), you have a SUBGROUP analysis in which IIT (intensive insulin therapy - NOT intention to treat, ITT is inimical to IIT) works. Then you had NICE-SUGAR (see: http://content.nejm.org/cgi/content/abstract/360/13/1283 ) AND Brunkhorst et al (see: http://content.nejm.org/cgi/content/abstract/358/2/125 ) showing that IIT doesn't work. How much more data do we need? Why are we still doing this?

Because old habits die hard and so do true believers. Thus it was perhaps inevitable that we would have COIITSS combine these two therapies into a single trial. Note that this trial does nothing to address whether hydrocortisone for septic shock is efficacious (it probably is NOT), but rather assumes that it is. I note also that it was started in 2006, just shortly before the second Van den Berghe study was published and well after the data from that study were known. Annane et al make no comment about whether those data impacted the conduct of their study, or whether participants were informed that a repeat of the trial upon which the Annane trial was predicated had failed.

Annane did not use blinding for fludrocortisone in the current study, but this is minor. It is difficult to blind IIT, but usually what you do when you can't blind things adequately is you protocolize care. That was not obviously done in this trial; instead we are reassured that "everybody should have been following the Surviving Sepsis Campaign guidelines". (I'm paraphrasing.)

As astutely pointed out by Van den Berghe in the accompanying editorial, this trial was underpowered. It was just plain silly to assume (or play dumb) that a 3% ARR, which is a ~25% RRR (since the baseline was under 10%), would translate into a 12.5% ARR with a baseline mortality of near 50%. Indeed, I don't know why we even talk about RRRs anymore; they're a ruse to inflate small numbers and rouse our emotions. (Her other comments, about "separation", which would be facilitated by having a very, very intensive treatment and a very, very lax control, are reminiscent of what folks were saying about ARMA low/high Vt - namely that the trial was ungeneralizable because the "control" 12 cc/kg was unrealistic. Then you get into the Eichacker and Natanson arguments about U-shaped curves [to which there may be some truth] and how too much is bad, not enough is bad, but somewhere in the middle is the "sweet spot". And this is key. Would that I could know the sweet spot for blood sugar - and coax patients to remain there.)

Because retrospective power calculations are uncouth, I elected to calculate the 95% confidence interval (CI) for delta (the difference between the two groups) in this trial. The point estimate for delta is -2.96% (negative delta means the therapy was WORSE than control!) with a 95% confidence interval of -11.6% to +5.65%. It is most likely between 11% worse and 5% better, and any good betting man would wager that it's worse than control! But in either case, this confidence interval is uncomfortably wide and contains values for harm and benefit which should be meaningful to us, so in essence the data do not help us decide what to do with this therapy.

(And look at Table 2, the main results - they are still shamelessly reporting adjusted P-values! Isn't that why we randomize, so we don't have to adjust?)

To bring this saga full circle, I note that, as we saw in NICE-SUGAR, Brunkhorst, and Van den Berghe, severe hypoglycemia (<40!) was far more common in the IIT group. And severe hypoglycemia is associated with death (in most studies, but curiously not in this one). So, consistent with the hypothesis which was the impetus for this study (A is associated with B, thus A causes B and removal of A removes B), one conclusion from all these data is that hypoglycemia causes death, and should be avoided through avoidance of IIT.

Tuesday, December 29, 2009

How much Epi should we give, if we give Epi at all?

Last month JAMA published another article that underscores the need for circumspection when, as by routine, habit, or tradition, we apply the results of laboratory experiments and pathophysiological reasoning to the treatment of intact persons. Olasveengen et al (http://jama.ama-assn.org/cgi/content/abstract/302/20/2222 ) report the results of a Norwegian trial in which people with Out of Hospital (OOH) cardiac arrest were randomized to receive or not receive intravenous medication during resuscitation attempts.

It's not as heretical as it sounds. In 2000, the NEJM reported the results of a Seattle study by Hallstrom et al (http://content.nejm.org/cgi/content/abstract/342/21/1546 ) showing that CPR appears to be as effective (and indeed perhaps more effective) when mouth-to-mouth ventilation is NOT performed along with chest compressions by bystanders. Other evidence with a similar message has since accumulated. With resuscitation, more effort, more intervention does not necessarily lead to better results. The normalization heuristic fails us again.

Several things can be learnt from the recent Norwegian trial. First, recall that RCTs are treasure troves of epidemiological data. The data from this trial reinforce what we practitioners already know, but which is not well-known among uninitiated laypersons: the survival of OOH cardiac arrest is dismal, on the order of 10% or so.

Next, looking at Table 2 of the outcomes data, we note that while survival to hospital discharge, the primary outcome, seems to be no different between the drug and no-drug groups, there are what appear to be important trends in favor of drug - there is more Return of Spontaneous Circulation (ROSC), there are more admissions to the ICU, there are more folks discharged with good neurological function. This is reminiscent of a series of studies in the 1990s (e.g., http://content.nejm.org/cgi/content/abstract/339/22/1595 ) showing that high dose epinephrine, while improving ROSC, did not lead to improved survival. Ultimately, the usefulness of any of these interventions hinges on what your goals are. If your goal is survival with good neurological function, epinephrine in any dose may not be that useful. But if your goal is ROSC, you might prefer to give a bunch of it. I'll leave it to you to determine what your goals are, and whether, on balance, you think they're worthy goals.

There are two other good lessons from this article. In this study, the survival rate in the drug group was 10.5% and that in the no-drug group was 9.2%, for a difference of 1.3%, and this small difference was not statistically significant. Does that mean there's no difference? No, it does not, not necessarily. There might be a difference that this study failed to detect because of a Type II error. (The study was designed with 91% power, so there was a 9% chance of missing a true difference of the hypothesized size, and the chance was even greater because the a priori sample size was not achieved.) If you follow this blog, you know that if the study is negative, we need to look at the 95% confidence interval (CI) around the difference to see if it might include clinically meaningful values. The 95% CI for this difference (not reported by the authors, but calculated by me using Stata) was -5.2% to +2.8%. That is, no drug might be up to about 5% worse or up to about 3% better than drug. Would you stop giving Epi for resuscitation on the basis of this study? Is the CI narrow enough for you? Is a 5% decrease in survival with no drug negligible? I'll leave that for you to decide.

(I should not gloss over the alternative possibility which is that the results are also compatible with no-drug being 2.8% better than drug. But if you're playing the odds, methinks you are best off betting the other way, given table 2.)

Now, as an extension of the last blog post, let's look at the relative numbers. The 95% CI for the relative risk (RR) is 0.59 - 1.33. That means that survival might be reduced by as much as 41% with no drug! That sounds like a LOT doesn't it? This is why I consistently argue that relative numbers be avoided in appraising the evidence. RRs give unfair advantages to therapies targeting diseases with survivals closer to 0%. There is no rational reason for such an advantage. A 1% chance of dying is a 1% chance of dying no matter where it falls along the continuum from zilch to unity.
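Here is a small Python sketch that computes both scales side by side. The group sizes are my rough guesses at the two arms of the Norwegian trial (on the order of 420 and 430 patients), so substitute the exact counts from Table 2 if you want to reproduce the published intervals precisely:

from math import sqrt, log, exp

# Approximate figures: survival 10.5% with IV drugs (n ~418), 9.2% without (n ~433)
p_drug, n_drug = 0.105, 418
p_nodrug, n_nodrug = 0.092, 433

# Absolute scale: risk difference (no drug minus drug) with a Wald 95% CI
diff = p_nodrug - p_drug
se_d = sqrt(p_drug * (1 - p_drug) / n_drug + p_nodrug * (1 - p_nodrug) / n_nodrug)
print(f"difference = {diff:+.1%}, 95% CI {diff - 1.96*se_d:+.1%} to {diff + 1.96*se_d:+.1%}")

# Relative scale: risk ratio with a log-normal 95% CI
rr = p_nodrug / p_drug
se_log = sqrt((1 - p_drug) / (n_drug * p_drug) + (1 - p_nodrug) / (n_nodrug * p_nodrug))
print(f"risk ratio = {rr:.2f}, 95% CI {exp(log(rr) - 1.96*se_log):.2f} to {exp(log(rr) + 1.96*se_log):.2f}")

Same data, two scales: the absolute interval spans a few percentage points, while the relative interval appears to span anything from a 41% reduction to a 33% increase in survival.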

Lessons from this article: beware of pathophysiological reasoning, and translation from the hamster and molecule labs; determine the goals of your therapy and whether they are worthy goals; absence of evidence is not evidence of absence; look at CIs for the difference between therapies in "negative" trials and see if they include clinically meaningful values; and finally, beware of inflation of perceived benefit caused by presentation of relative risks rather than absolute risks.

Wednesday, December 16, 2009

Dabigatran and Dabigscam of non-inferiority trials, pre-specified margins of non-inferiority, and relative risks

Anyone who thought, based on the evidence outlined in the last post on this blog, that dabigatran was going to be a "superior" replacement for warfarin was chagrined last week with the publication in the NEJM of the RE-COVER study of dabigatran versus warfarin in the treatment of venous thromboembolism (VTE): http://content.nejm.org/cgi/content/abstract/361/24/2342 . Dabigatran for this indication is NOT superior to warfarin, but may be non-inferior to warfarin, if we are willing to accept the assumptions of the non-inferiority hypothesis in the RE-COVER trial.

Before we go on, I ask you to engage in a mental exercise of sorts that I'm trying to make a habit. (If you have already read the article and recall the design and the results, you will be biased, but go ahead anyway this time.) First, ask yourself what increase in an absolute risk of recurrent DVT/PE/death is so small that you consider it negligible for the practical purposes of clinical management. That is, what difference between two drugs is so small as to be pragmatically irrelevant? Next, ask yourself what RELATIVE increase in risk is negligible? (I'm purposefully not suggesting percentages and relative risks as examples here in order to avoid the pitfalls of "anchoring and adjustment": http://en.wikipedia.org/wiki/Anchoring .) Finally, assume that the baseline risk of VTE at 6 months is ~2% - with this "baseline" risk, ask yourself what absolute and relative increases above this risk are, for practical purposes, negligible. Do these latter numbers jibe with your answers to the first two questions which were answered when you had no particular baseline in mind?

Note how it is difficult to reconcile your "intuitive" instincts about what is a negligible relative and absolute risk with how these numbers might vary depending upon what the baseline risk is. Personally, I consider a 3% absolute increase in the risk of DVT at 6 months to be on the precipice of what is clinically significant. But if the baseline risk is 2%, a 3% absolute increase (to 5%) represents a 2.5x increase in risk! That's a 150% increase, folks! Imagine telling a patient that the use of drug ABC instead of XYZ "only doubles your risk of another clot or death". You can visualize the bewildered faces and incredulous, furrowed brows. But if you say, "the difference between ABC and XYZ is only 3%, and drug ABC costs pennies but XYZ is quite expensive," that creates quite a different subjective impression of the same numbers. Of course, if the baseline risk were 10%, a 3% increase is only a 30% or 1.3x increase in risk. Conversely, with a baseline risk of 10%, a 2.5x increase in risk (RR=2.5) means a 15% absolute increase in the risk of DVT/PE/Death, and hardly ANYONE would argue that THAT is negligible. We know that doctors and laypeople respond better to, or are more impressed by, results that are described as RRR rather than ARR, ostensibly because the former inflates the apparent effect simply because the number is bigger (e-mail me if you want a reference for this). The bottom line is that what matters is the absolute risk. We're banking health dollars. We want the most dollars at the end of the day, not the largest increase over some [arbitrary] baseline. So I'm not sure why we're still designing studies with power calculations that utilize relative risks.
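The arithmetic behind that bewilderment is worth seeing side by side; a trivial illustration in Python:

# The same 3% absolute increase, viewed on the relative scale at two baselines
for baseline in (0.02, 0.10):
    new_risk = baseline + 0.03
    rr = new_risk / baseline
    print(f"baseline {baseline:.0%} -> {new_risk:.0%}: "
          f"relative risk {rr:.2f} ({rr - 1:.0%} increase)")

A 3% absolute bump is a 150% relative increase on a 2% baseline but only a 30% relative increase on a 10% baseline - the same number of health dollars banked either way.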

With this in mind, let's check the assumptions of the design of this non-inferiority trial (NIT). It was designed with 90% power to exclude a hazard ratio (HR; similar to a relative risk for our purposes) of 2.75. That HR of 2.75 sure SOUNDS like a lot. But with a predicted baseline risk of 2% (which prediction panned out in the trial - the baseline risk with warfarin was 2.1%), that amounts to an absolute risk of only 5.78%, or an increase of 3.78%, which I will admit is close to my own a priori negligibility level of 3%. The authors justify this assignment based on 4 referenced studies, all prior to 1996. I find this curious. Because they are so dated and in a rather obscure journal, I have access only to the 1995 NEJM study (http://content.nejm.org/cgi/reprint/332/25/1661.pdf ). In this 1995 study, the statistical design is basically not even described, and there were 3 primary endpoints (ahhh, the 1990s). This is not exactly the kind of study that I want to model a modern trial after. In the table below, I have abstracted data from the 1995 trial and three more modern ones (all comparing two treatment regimens for DVT/PE) to determine both the absolute and relative risks that were observed in these trials.

Table 1. Risk reductions in several RCTs comparing treatment regimens for DVT/PE. Outcomes are the combination of recurrent DVT/PE/Death unless otherwise specified. *recurrent DVT/PE only; raw numbers used for simplicity in lieu of time to event analysis used by the authors

From this table we can see that in SUCCESSFUL trials of therapies for DVT/PE treatment, absolute risk reductions in the range of 5-10% have been demonstrated, with associated relative risk increases of ~1.75-2.75 (for placebo versus comparator - I purposefully made the ratio in this direction to make it more applicable to the dabigatran trial's null hypothesis [NH] that the 95% CI for dabigatran includes 2.75 HR - note that the NH in an NIT is the enantiomer of the NH in a superiority trial). Now, from here we must make two assumptions, one which I think is justified and the other which I think is not. The first is that the demonstrated risk differences in this table are clinically significant. I am inclined to say "yes, they are" not only because a 5-10% absolute difference just intuitively strikes me as clinically relevant compared to other therapies that I use regularly, but also because, in the cases of the 2003 studies, these trials were generally counted as successes for the comparator therapies. The second assumption we must make, if we are to take the dabigatran authors seriously, is that differences smaller than 5-10% (say 4% or less) are clinically negligible. I would not be so quick to make this latter assumption, particularly in the case of an outcome that includes death. Note also that the study referenced by the authors (reference 13 - the Schulman 1995 trial) was considered a success with a relative risk of 1.73, and that the 95% CI for the main outcome of the RE-COVER study ranged from 0.65-1.84 - it overlaps the Schulman point estimate of RR of 1.73, and the Lee point estimate of 1.83! Based on an analysis using relative numbers, I am not willing to accept the pre-specified margin of non-inferiority upon which this study was based/designed.

But, as I said earlier, relative differences are not nearly as important to us as absolute differences. If we take the upper bound of the HR in the RE-COVER trial (1.84) and multiply it by the baseline risk (2.1%) we get an upper 95% CI for the risk of the outcome of 3.86%, which corresponds to an absolute risk difference of 1.76%. This is quite low, and personally it satisfies my requirement for very small differences between two therapies if I am to call them non-inferior to one another.
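That conversion is simple enough to sketch, treating the hazard ratio as a risk ratio (a reasonable approximation at an event rate this low) and using the observed 2.1% warfarin baseline:

baseline = 0.021   # observed 6-month rate of recurrent VTE/death with warfarin

def absolute_from_hr(hr, baseline):
    """Translate a hazard ratio into an approximate absolute risk and excess risk."""
    risk = hr * baseline
    return risk, risk - baseline

for label, hr in (("non-inferiority margin", 2.75), ("observed upper 95% CI", 1.84)):
    risk, excess = absolute_from_hr(hr, baseline)
    print(f"{label}: HR {hr} -> absolute risk {risk:.2%}, excess {excess:+.2%}")

The pre-specified margin corresponds to an excess approaching 4%, while the trial's observed upper bound corresponds to an excess of about 1.8% - which is why the result satisfies me even though the design does not.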


So, we have yet again a NIT which was designed upon precarious and perhaps untenable assumptions, but which, through luck or fate was nonetheless a success. I am beginning to think that this dabigatran drug has some merit, and I wager that it will be approved. But this does not change the fact that this and previous trials were designed in such a way as to allow a defeat of warfarin to be declared based on much more tenuous numbers.

I think a summary of sorts for good NIT design is in order:

• The pre-specified margin of non-inferiority should be smaller than the MCID (minimal clinically important difference), if there is an accepted MCID for the condition under study


• The pre-specified margin of non-inferiority should be MUCH smaller than statistically significant differences found in "successful" superiority trials, and ideally, the 95% CI in the NIT should NOT overlap with point estimates of significant differences in superiority trials


• NITs should disallow "asymmetry" of conclusions - see the last post on dabigatran. If the pre-specified margin of non-inferiority is a relative risk of 2.0, and the observed 95% CI must not include that value to claim non-inferiority, then superiority should not be declared unless the 95% confidence interval of the point estimate falls entirely beyond the reciprocal margin on the other side (i.e., below a relative risk of 0.5; a sketch of this symmetric rule follows the list). What did you say? That's impossible, it would require a HUGE risk difference and a narrow CI for that to ever happen? Well, that's why you can't make your delta unrealistically large - you'll NEVER claim superiority, if you're being fair about things. If you make delta very large it's easier to claim non-inferiority, but you should also suffer the consequences by basically never being able to claim superiority either.


• We should concern ourselves with Absolute rather than Relative risk reductions
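To make the symmetry rule in the third bullet mechanical, here is a hypothetical sketch in Python (the function and its thresholds are my own illustration, not anyone's published rule). It takes the 95% CI of a hazard or risk ratio for new-versus-standard therapy and a non-inferiority margin, and refuses to call the new drug superior unless the interval clears the reciprocal of that same margin:

def verdict(ci_low, ci_high, margin):
    """Symmetric reading of a non-inferiority comparison.
    ci_low/ci_high: 95% CI for the ratio of new vs. standard; margin > 1."""
    if ci_low > margin:
        return "inferior (the CI sits wholly beyond the margin)"
    if ci_high < 1 / margin:
        return "superior (the CI clears the reciprocal margin)"
    if ci_high < margin:
        return "non-inferior (the CI excludes the margin), but NOT superior"
    return "inconclusive (the CI includes the non-inferiority margin)"

# RE-COVER-like numbers: HR 95% CI 0.65-1.84 against a pre-specified margin of 2.75
print(verdict(0.65, 1.84, 2.75))

With a margin as generous as 2.75, superiority would require the entire CI to sit below about 0.36 - which, as noted above, essentially never happens. That is the price a fair-minded trialist should be willing to pay for an inflated delta.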