This is a discussion forum for physicians, researchers, and other healthcare professionals interested in the epistemology of medical knowledge; the limitations of the evidence; how clinical trial evidence is generated, disseminated, and incorporated into clinical practice; how the evidence should optimally be incorporated into practice; and what the value of the evidence is to science, individual patients, and society.
Saturday, September 25, 2010
In the same vein: Intercessory Prayer for Heart Surgery and Neuromuscular Blockers for ARDS
The most remarkable thing about this study for me is that it was scientifically irresponsible to conduct it. Science (and biomedical research) must be guided by testing a defensible hypothesis, based on logic, historical and preliminary data, and, in the case of biomedical research, an understanding of the underlying pathophysiology of the disease process under study. Where there is no scientifically valid reason to believe that a therapy might work, no preliminary data - nothing - a hypothesis based on hope or faith has no defensible justification in biomedical research, and its study is arguably unethical.
Moreover, a clinical trial is in essence a diagnostic test of a hypothesis, and the posterior probability of a hypothesis (null or alternative) depends not only on the frequentist data produced by the trial, but also on a Bayesian analysis incorporating the prior probability that the alternative (or null) hypothesis is true (or false). That is, if I conducted a trial of orange juice (OJ) for the treatment of sepsis (another unethical design) and OJ appeared to reduce sepsis mortality by, say, 10% with P=0.03, you should be suspicious. With no biologically plausible reason to believe that OJ might be efficacious, the prior probability of Ha (that OJ is effective) is very low, and a P-value of 0.03 (or even 0.001) is unconvincing. That is, the less compelling the general idea supporting the hypothesis is, the more robust a P-value you should require to be convinced by the data from the trial.
Thus, a trial wherein the alternative hypothesis tested has a negligible probability of being true is uninformative and therefore unethical to conduct. In a trial such as the intercessory prayer trial, there is NO resultant P-value which is sufficient to convince us that the therapy is effective - in effect, all statistically significant results represent Type I errors, and the trial is useless.
(I should take a moment here to state that, ideally, the probability of Ho and Ha should both be around 50%, or not far off, representing true equipoise about the scenario being studied. Based on our data in the Delta Inflation article (see: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2887200/ ), it appears that at least in critical care trials evaluating comparative mortality, the prior probability of Ha is on the order of 18%, and even that figure is probably inflated because many of the trials that comprise it represent Type I errors. In any case, it is useful to consider the prior probability of Ha before considering the data from a trial, because that prior is informative. [And in the case of trials for biologics for the treatment of sepsis {be it OJ or drotrecogin, or anti-TNF-alpha}, the prior probability that any of them is efficacious is almost negligibly low.])
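The interplay between the prior and the P-value can be made concrete with a bit of arithmetic. Here is a minimal sketch: the power and alpha values, and the 1% prior for implausible hypotheses, are illustrative assumptions of mine, not data from any trial.

```python
# Posterior probability that Ha is true given a "statistically significant"
# trial, by Bayes' theorem:
#   P(Ha | sig) = power * prior / (power * prior + alpha * (1 - prior))

def posterior_given_significance(prior, power=0.80, alpha=0.05):
    """P(Ha is true | the trial crossed the significance threshold)."""
    true_positive = power * prior          # Ha true and trial significant
    false_positive = alpha * (1 - prior)   # Ha false, yet trial significant
    return true_positive / (true_positive + false_positive)

# Prior ~50% (true equipoise): a significant result is quite convincing.
print(round(posterior_given_significance(0.50), 2))  # 0.94

# Prior ~18% (the Delta Inflation estimate for critical care mortality trials):
print(round(posterior_given_significance(0.18), 2))  # 0.78

# Prior ~1% (OJ for sepsis, intercessory prayer):
print(round(posterior_given_significance(0.01), 2))  # 0.14
```

With a 1% prior, even a "significant" trial leaves the hypothesis more likely false than true - which is exactly the sense in which a positive prayer trial would be uninformative.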
Which brings me to Neuromuscular Blockers (NMBs) for ARDS (see: http://www.nejm.org/doi/full/10.1056/NEJMoa1005372 ) - while I have several problems with this article, my most grievous concern is that we have no (in my estimation) substantive reason to believe that NMBs will improve mortality in ARDS. They may improve oxygenation, but we long ago abandoned the notion that oxygenation is a valid surrogate end-point in the management of ARDS. Indeed, the widespread abandonment of NMBs in ARDS reflects consensus agreement among practitioners that NMBs are on balance harmful. (Note in Figure 1 that, in contrast to the authors' contention in the introduction that NMBs remain widely used, only 4.3% of patients were excluded because of use of NMBs at baseline.)
In short, these data fail to convince me that I should be using NMBs in ARDS. But many readers will want to know "then why was the study positive?" And I think the answer is staring us right in the face. In addition to the possibility of a simple Type I error, and the fact that the analysis was done with a Cox regression controlling for baseline imbalances (even ones, such as the PF ratio, which were NOT prospectively defined as variables to control for in the analysis), the study was effectively unblinded/unmasked. It is simply not possible to mask the use of NMBs; the clinicians and RNs will quickly figure out who is and who is not paralyzed - paralyzed patients will "ride the vent" while unparalyzed ones will "fight the vent". And differences in care may/will arise.
It is the simplest explanation, and I wager it's correct. I will welcome data from other trials if they become available (should it even be studied further?), but in the meantime I don't think we should be giving NMBs to patients with ARDS any more than we should be praying (or avoiding prayer) for the recovery of open-heart patients.
Friday, August 20, 2010
Heads I Win, Tails it's a Draw: Rituximab, Cyclophosphamide, and Revising CONSORT

The recent article by Stone et al in the NEJM (see: http://www.nejm.org/doi/full/10.1056/NEJMoa0909905 ), which appears to [mostly] conform to the CONSORT recommendations for the conduct and reporting of NIFTs (non-inferiority trials, often abbreviated NIFs, but I think NIFTs ["Nifties"] sounds cooler), allowed me to realize that I fundamentally disagree with the CONSORT statement on NIFTs (see JAMA, http://jama.ama-assn.org/cgi/content/abstract/295/10/1152 ) and indeed the entire concept of NIFTs. I have discussed previously in this blog my disapproval of the asymmetry with which NIFTs are designed such that they favor the new (and often proprietary agent), but I will use this current article to illustrate why I think NIFTs should be done away with altogether and supplanted by equivalence trials.
This study rouses my usual and tired gripes about NIFTs: too large a delta, no justification for delta, use of intention-to-treat rather than per-protocol analysis, etc. It also describes a suspicious statistical maneuver which I suspect is intended to infuse the results (in favor of Rituximab/Rituxan) with extra legitimacy in the minds of the uninitiated: instead of simply stating (or showing with a plot) that the 95% CI excludes delta, thus making Rituxan non-inferior, the authors tested the hypothesis that the lower 95.1% CI boundary is different from delta, a test that yields a very small P-value (<0.001). This procedure adds nothing to the confidence interval in terms of interpretation of the results, but seems to imbue them with an unassailable legitimacy - the non-inferiority hypothesis is trotted around as if iron-clad by this minuscule P-value, which is really just superfluous and gratuitous.
But I digress - time to focus on the figure. Under the current standards for conducting a NIFT, in order to be non-inferior, you simply need a 95% CI for the preferred [and usually proprietary] agent with an upper boundary which does not include delta in favor of the comparator (scenario A in the figure). For your preferred agent to be declared inferior, the LOWER 95% CI for the difference between the two agents must exclude the delta in favor of the comparator (scenario B in the figure). For that to ever happen, the preferred/proprietary agent is going to have to be WAY worse than standard treatment. It is no wonder that such results are very, very rare, especially since deltas are generally much larger than is reasonable. I am not aware of any recent trial in a major medical journal where inferiority was declared. The figure shows you why this is the case.
Inferiority is very difficult to declare (the deck is stacked this way on purpose), but superiority is relatively easy to declare, because for superiority your 95% CI doesn't have to exclude an obese delta, but rather must just exclude zero with a point estimate in favor of the preferred therapy. That is, you don't need a mirror image of the 95% CI that you need for inferiority (scenario C in the figure), you simply need a point estimate in favor of the preferred agent with a 95% CI that does not include zero (scenario D in the figure). Looking at the actual results (bottom left in the figure), we see that they are very close to scenario D and that they would only have had to go a little bit more in favor of Rituxan for superiority to have been declared. Under my proposal for symmetry (and fairness, justice, and logic), the results would have had to be similar to scenario C, and Rituxan came nowhere near to meeting criteria for superiority.
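The asymmetry can be made explicit in a few lines of code. This is a sketch of the decision rules as I read the CONSORT scheme; the sign convention (positive differences favor the new agent) and the margin of 20 percentage points are illustrative choices of mine, not figures taken from the Stone trial.

```python
# Decision rules for a NIFT. Differences are new-agent minus comparator
# (positive favors the new agent); (lo, hi) is the 95% CI for that
# difference; delta is the non-inferiority margin.

def classify(lo, hi, delta):
    if lo > 0:
        return "superior"      # CI merely has to exclude zero (scenario D)
    if lo > -delta:
        return "non-inferior"  # lower bound just clears -delta (scenario A)
    if hi < -delta:
        return "inferior"      # entire CI must clear the margin (scenario B)
    return "inconclusive"      # CI straddles the margin (scenarios E and F)

# The asymmetry: the same 1-to-21 point CI, flipped in sign, with delta = 20.
print(classify(1.0, 21.0, 20))    # favors new agent -> "superior"
print(classify(-21.0, -1.0, 20))  # favors comparator -> "inconclusive"
```

The identical evidence yields "superior" in one direction and merely "inconclusive" in the other - heads I win, tails it's a draw.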
The reason it makes absolutely no sense to allow this asymmetry can be demonstrated by imagining a counterfactual (or two): suppose that the results had been exactly the same, but that they had favored Cytoxan (cyclophosphamide) rather than Rituxan - that is, Cytoxan was associated with an 11% improvement in the primary endpoint. This is represented by scenario E in the figure; and since the 95% CI includes delta, the result is "inconclusive" according to CONSORT. So how can it be that the classification of the result changes depending on what we arbitrarily (a priori, before knowing the results) declare to be the preferred agent? That makes no sense, unless you're more interested in declaring victory for a preferred agent than you are in discovering the truth, and of course, you can guess my inferences about the motives of the investigators and sponsors in many/most of these studies. In another counterfactual example, scenario F in the figure represents the mirror image of scenario D, which represented the minimum result that would have allowed Stone et al to declare that Rituxan was superior. But if the results had favored Cytoxan by that much, we would have had another "inconclusive" result, according to CONSORT. Allowing this is just mind-boggling, maddening, and unjustifiable!
Given this "heads I win, tails it's a draw", it's no wonder that NIFTs are proliferating. It's time we stop accepting them, and require that non-inferiority hypotheses be symmetrical - in essence, making equivalence trials the standard operating procedure, and requiring the same standards for superiority as we require for inferiority.
Friday, July 16, 2010
Hyperoxia is worse than Hypoxia after cardiac arrest?
To the Editor: Kilgannon et al (http://jama.ama-assn.org/cgi/content/abstract/303/21/2165?maxtoshow=&hits=10&RESULTFORMAT=&fulltext=hyperoxia&searchid=1&FIRSTINDEX=0&resourcetype=HWCIT) report the provocative results of an observational study of the outcome of post-arrest patients as a function of the first oxygen tension measured in the ICU. Unfortunately, the definitions they chose for categorizing oxygen tension introduce confounding which complicates the interpretation of their analysis. By classifying patients with a PF ratio less than 300 but a normal oxygen tension as having hypoxia, lung injury (an organ failure which may itself be an independent predictor of poor outcomes) is confounded with hypoxia. Given the hypothesis guiding the analysis, namely that the oxygen tension to which the brain is exposed influences mortality, we find this choice curious. If patients with a normal oxygen tension but a reduced PF ratio were not classified as hypoxic, they would have been included in the normoxia group, and the results of the overall analysis may have changed. Such potential misclassification is important to consider given that the reasons patients were managed with hyperoxia cannot be known because of the observational nature of the study - did such patients experience less active management and titration of FiO2? Is hyperoxia a marker of a more laissez-faire approach to ventilatory management? PEEP, an important determinant of oxygen tension, is also not known, and this could markedly influence the classification of patients in the scheme chosen by the authors. It would be helpful to know how the results of the analysis might change if patients with lung injury (PF ratio < 300) but normal oxygen tension were reclassified as normoxic in the analysis.
Sunday, May 16, 2010
What do erythropoietin, strict rate control, torcetrapib, and diagonal earfold creases have in common? The normalization heuristic

I was pleased to see the letters to the editor in the May 6th edition of the NEJM regarding the article on the use of synthetic erythropoietins (see http://content.nejm.org/cgi/content/extract/362/18/1742 ). The letter writers must have been reading our paper on the normalization heuristic (NH)! (Actually, I doubt it. It's in an obscure journal. But maybe they should.)
In our article (available here: http://www.medical-hypotheses.com/article/S0306-9877(09)00033-4/abstract ), we operationalized the definition of the NH and attributed its fallibility as a general clinical hypothesis to 4 errors in reasoning. Here is the number one reasoning error:
Where the normalization hypothesis is based on the assumption that the abnormal value is causally related to downstream patient outcomes, but in reality the abnormal value is not part of the causal pathway and rather is an epiphenomenon of another underlying process.
The authors of some of the letters to the editor of the NEJM have the same concerns about normalizing hemoglobin values, and the assumptions that this practice involves about our understanding of causal pathways. Which is what I want to focus on. So please turn your attention to, yes, the picture of the billiards.
I wager that the pathophysiological processes that occur in the body are more complex than the 16 balls in the photo, but it serves as a great analogy for understanding the limitations of what we know about what's going on in the body. Suppose that every time (or a high percentage of the time - we can adjust the probabilities without losing the meaning of the analogy) the cue ball, launched from the same spot at the same speed and angle, hits the 1--2--4--7--11 balls. We know the 11 ball is, say, cholesterol. We have figured this out. And it falls in the corner pocket - it gets "lower". But we don't know what the other balls represent, or even how many of them there are, or where they fall. We needn't know all of this to make some inferences. We see that when the cue ball is launched at a certain speed and angle, the 11 ball, cholesterol, falls. So we think we understand cholesterol. But the playing field is way more complex than the initiating event and the one final thing that we happen to be watching or measuring - the corner pocket. In the whole body, we don't even know how many balls and how many pockets we're dealing with! We can only see what we know to look for!
Suppose also that as a consequence of this cascade, the 7 ball hits the 12 ball, which falls in another corner pocket. We happen to be watching that pocket also. We know what it does. For lack of a better term, let's call it the "reduced cardiovascular death pocket." Every time this sequence of balls is hit, cholesterol (number 11) falls in one corner pocket, and the 12 ball falls in another pocket, and we infer that cholesterol is part of the causal pathway to cardiovascular death. But look carefully at the diagram. We can remove the 11 ball altogether, the 7 ball will still hit the 12 and sink it thus reducing cardiovascular death. So it's not the cholesterol at all! We misunderstood the causal pathway! It's not cholesterol falling per se, but rather some epiphenomenon of the sequence.
By now, you've inferred who is breaking. His name is atorvastatin (which I fondly refer to as the Lipid-Torpedo). When a guy called torcetrapib breaks, all hell breaks loose. We learn that there's another pocket called "increased cardiovascular death pocket" and balls start falling into there.
(A necessary aside here - I am NOT challenging the cholesterol hypothesis here. It may or may not be correct, and I certainly am not the one to figure that out. I merely wish to emphasize how we COULD make incorrect inferences about causal pathways.)
So when I see an article like there was a couple of weeks ago in the NEJM (see http://content.nejm.org/cgi/content/abstract/362/15/1363 ) about "strict rate control" for atrial fibrillation (AF), I am not surprised that it doesn't work. I am not surprised that there are processes going on in a patient with AF that we can't even begin to understand. And the coincidental fact that we can measure heart rate and control it does not mean that we're interrupting the causal pathway that we wish to.
A new colleague of mine told me the other day of a joke he likes to make that causes this to all resonate harmoniously - "We don't go around trying to iron out diagonal earfold creases to reduce cardiovascular mortality." But show us a sexy sequence of molecular cascades that we think we understand, and the sirens begin to sing their irresistible song.
Saturday, May 1, 2010
Everyone likes their own brand - Delta Inflation: A bias in the design of RCTs in Critical Care

At long last, our article describing a bias in the design of RCTs in Critical Care Medicine (CCM) has been published (see: http://ccforum.com/content/14/2/R77 ). Interested readers are directed to the original manuscript. I'm not in the business of criticising my own work on my own blog, but I will provide at least a summary.
When investigators design a trial and do power and sample size (SS) calculations, they must estimate or predict a priori what the [dichotomous] effect size will be, in say, mortality (as is the case with most trials in CCM). This number should ideally be based upon preliminary data, or a minimal clinically important difference (MCID). Unfortunately, it does not usually happen that way, and investigators rather choose a number of patients that they think they can recruit with available funding, time and resources, and they calculate the effect size that they can find with that number of patients at 80-90% power and (usually) an alpha level of 0.05.
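This backwards workflow - fixing n first, then computing the delta it can detect - is easy to sketch with the usual normal approximation. The 30% baseline mortality, 0.05 alpha, and 80% power below are illustrative assumptions, not figures from any particular trial.

```python
from math import sqrt

# Detectable absolute mortality difference for n patients per arm at
# alpha = 0.05 (two-sided) and 80% power, via the normal approximation
# for comparing two proportions near a baseline rate p0.
Z_ALPHA, Z_BETA = 1.96, 0.84

def detectable_delta(n_per_arm, p0=0.30):
    return (Z_ALPHA + Z_BETA) * sqrt(2 * p0 * (1 - p0) / n_per_arm)

for n in (100, 300, 1000):
    print(n, "per arm ->", round(100 * detectable_delta(n), 1), "% absolute")
```

With only 100 patients per arm available, the "sought" delta is pushed up toward an 18% absolute mortality reduction - which is exactly how implausibly large predicted deltas get baked into trial designs.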
If power and SS calculations were being performed ideally, and investigators were using preliminary data or published data on similar therapies to predict delta for their trial, we would expect that, over the course of many trials, they would be just as likely to OVERESTIMATE observed delta as to underestimate it. If this were the case, we would expect random scatter around a line representing unity in a graph of observed versus predicted delta (see Figure 1 in the article). If, on the other hand, predicted delta uniformly exceeds observed delta, there is directional bias in its estimation. Indeed, this is exactly what we found. This is no different from the weatherman consistently overpredicting the probability of precipitation, a horserace handicapper consistently setting odds that are too long on winning horses, or Tiger Woods consistently putting too far to the right. Bias is bias. And it is unequivocally demonstrated in Figure 1.
Another point, which we unfortunately failed to emphasize in the article, is that if the predicted deltas were being guided by a MCID, well, the MCID for mortality should be the same across the board. It is not (Figure 1 again). It ranges from 3%-20% absolute reduction in mortality. Moreover, in Figure 1, note the clustering around numbers like 10% - how many fingers or toes you have should not determine the effect size you seek to find when you design an RCT.
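For readers who want to see directional bias tested formally, a simple one-sided sign test does the job. The 33-of-38 split below is a hypothetical placeholder for illustration, not the actual count from our paper.

```python
from math import comb

# One-sided binomial sign test: under unbiased prediction, a trial's
# predicted delta should exceed its observed delta about half the time.
def sign_test_p(n_over, n_total):
    """P(at least n_over of n_total trials overestimate delta | p = 0.5)."""
    return sum(comb(n_total, k) for k in range(n_over, n_total + 1)) / 2 ** n_total

# Hypothetical split: 33 of 38 trials overestimated delta.
print(sign_test_p(33, 38))  # on the order of 1e-6: bias, not chance
```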
We surely hope that this article stimulates some debate in the critical care community about the statistical design of RCTs, and indeed about which primary endpoints are chosen for them. It seems that chasing mortality is looking more and more like a fool's errand.
Caption for Figure 1:
Figure 1. Plot of observed versus predicted delta (with associated 95% confidence intervals for observed delta) of 38 trials included in the analysis. Point estimates of treatment effect (deltas) are represented by green circles for non-statistically significant trials and red triangles for statistically significant trials. Numbers within the circles and triangles refer to the trials as referenced in Additional file 1. The blue ‘unity line’ with a slope equal to one indicates perfect concordance between observed and predicted delta; for visual clarity and to reduce distortions, the slope is reduced to zero (and the x-axis is horizontally expanded) where multiple predicted deltas have the same value and where 95% confidence intervals cross the unity line. If predictions of delta were accurate and without bias, values of observed delta would be symmetrically scattered above and below the unity line. If there is directional bias, values will fall predominantly on one side of the line, as they do in the figure.
If you want a fair shake, you gotta get a trach (and the sooner the better)
What are the drawbacks to such an approach? Traditionally, a tracheostomy has been viewed by practitioners as the admission of a failure of medical care - we couldn't get you better fast, with a temporary airway, so we had to "resort" to a semi-permanent or permanent [surgical] airway. Moreover, a tracheostomy was traditionally a surgical procedure requiring transportation to the operating suite, although that has changed with the advent of the percutaneous dilatational approach. Nonetheless, whichever route is used to establish the tracheostomy, certain immediate and delayed risks are inherent in the procedure, and the costs are greater. So, the basic question we would like to answer is "are there benefits of tracheostomy that outweigh these risks?"
There were several criticisms of the Rumbak study which I will not elaborate upon here, but suffice it to say that the study did not lead to a sweeping change in practice with regard to the timing of tracheostomies, and thus additional studies were planned and performed. One such study, referenced by last week's JAMA article, only enrolled 25% of the anticipated sample of patients, with resulting wide confidence intervals. As a result, few conclusions can be drawn from that study, but it did not appear to show a benefit to earlier tracheostomy (http://journals.lww.com/ccmjournal/Abstract/2004/08000/A_prospective,_randomized,_study_comparing_early.9.aspx ). A meta-analysis which included "quasi-randomized" studies (GIGO: Garbage In, Garbage Out) concluded that while not reducing mortality or pneumonia, early tracheostomy reduced the duration of mechanical ventilation and ICU stay. (It seems likely to me that if you stay on the ventilator and in the ICU for a shorter period, since there is a time/dose-dependent effect of these things on complications such as catheter-related blood stream infections and ventilator-associated pneumonia (VAP), that these outcomes and outcomes further downstream such as mortality WOULD be affected by early tracheostomy - but, the further downstream an outcome is, the more it gets diluted out, and the larger a study you need to demonstrate a significant effect.)
Thus, to try to resolve these uncertainties, we have the JAMA study from last week. This study was technically "negative." But in it, every single pre-specified outcome (VAP, ventilator-free days, ICU-free days, mortality) trended (some significantly) in favor of early tracheostomy. The choice of VAP as a primary outcome (P-value for early versus delayed trach 0.07) is both curious and unfortunate. VAP is notoriously difficult to diagnose and differentiate from other causes of infection and pulmonary abnormalities in mechanically ventilated ICU patients (see http://content.nejm.org/cgi/content/abstract/355/25/2619 ) - it is a "soft" outcome for which no gold standard exists. Therefore, the signal-to-noise ratio for this outcome is liable to be low. What's perhaps worse, the authors used the Clinical Pulmonary Infection Score (CPIS, or Pugin Score: see Pugin et al, AJRCCM, 1991, Volume 143, 1121-1129) as the sole means of diagnosing VAP. This score, while conceptually appealing, has never been validated in such a way that its positive and negative predictive values are acceptable for routine use in clinical practice (it is not widely used), or for a randomized controlled trial (see http://ajrccm.atsjournals.org/cgi/content/abstract/168/2/173 ). Given this, and the other strong trends and significant secondary endpoints in this study, I don't think we can dichotomize it as "negative" - reality is just more complicated than that.
I feel about this trial, which failed its primary endpoint, much the same as I felt about the Levo versus Dopa article a few weeks back. Multiple comparisons, secondary endpoints, and marginal P-values notwithstanding, I think that from the perspective of a seriously ill patient or a provider, especially a provider with strong anecdotal experience that appears to favor earl(ier) tracheostomy, the choice appears to be clear: "If you want a fair shake, you gotta get a trach."
Tuesday, March 23, 2010
"Prospective Meta-analysis" makes as much sense as "Retrospective Randomized Controlled Trial"
The trials included in this meta-analysis lacked statistical precision for two principal reasons: 1.) they used the typical cookbook approach to sample size determination, choosing a delta of 10% without any justification whatever for this number (thus the studies were guilty of DELTA INFLATION); 2.) according to the authors of the meta-analysis, two of the three trials were stopped early for futility, thus further decreasing the statistical precision of already effectively underpowered trials. The resulting 95% CIs for delta in these trials thus ranged from (-)10% (in the ARDSnet ALVEOLI trial; i.e., high PEEP may increase mortality by up to 10%) to +10% (in the Mercat and Meade trials; i.e., high(er) PEEP may decrease mortality by upwards of 10%).
Because of the lack of statistical precision of these trials, the authors of the meta-analysis appropriately used individual patient data from the trials as meta-analytical fodder, with a likely useful result – high PEEP is probably best reserved for the sickest patients with ARDS, and avoided for those with ALI. (Why there is an interaction between severity of lung injury and response to PEEP is open for speculation, and is an interesting topic in itself.) What interests me more than this main result is the authors' and editorialist's suggestion that we should be doing “prospective meta-analyses” or at least design our trials so that they easily lend themselves to this application should we later decide to do so. Which raises the question: why not just make a bigger trial from the outset, choosing a realistic delta and disallowing early stopping for “futility”?
(It is useful to note that the term futility is unhappily married to, or better yet enslaved by, alpha (the threshold P-value for statistical significance). A trial is deemed futile if there is no hope of crossing the alpha/P-value threshold. But it is certainly not futile to continue enrolling patients if each additional accrual increases the statistical precision of the final result by narrowing the 95% CI of delta. Indeed, I’m beginning to think that the whole concept of “futility” is a specious one - unless you're a funding agency.)
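The precision argument is easy to quantify. A sketch under the normal approximation, with an illustrative 30% event rate in both arms:

```python
from math import sqrt

# Half-width of the 95% CI for an absolute risk difference with n patients
# per arm, assuming (illustratively) a 30% event rate in both arms.
def ci_half_width(n_per_arm, p0=0.30):
    return 1.96 * sqrt(2 * p0 * (1 - p0) / n_per_arm)

# Stopping "for futility" halfway costs real precision, whatever the P-value:
print(round(100 * ci_half_width(400), 1))  # ~6.4 percentage points
print(round(100 * ci_half_width(800), 1))  # ~4.5 percentage points
```

A trial stopped at 400 per arm leaves a CI roughly 40% wider than one carried to 800 per arm - real information foregone, regardless of whether alpha would ever have been crossed.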
Large trials may be cumbersome, but they are not impossible. The SAFE investigators (http://content.nejm.org/cgi/content/abstract/350/22/2247 ) enrolled ~7000 patients seeking a delta of 3% in a trial involving 16 ICUs in two countries. Moreover, a prospective meta-analysis doesn’t reduce the number of patients required, it simply splits the population into quanta and epochs, which will hinder homogeneity in the meta-analysis if enrollment and protocols are not standardized or if temporal trends in care and outcomes come into play. If enrollment and protocols ARE standardized, it is useful to ask “then why not just do one large study from the outset?” using a realistic delta and sample size? Why not coordinate all the data (American, French, Canadian, whatever) through a prospective RCT rather than a prospective meta-analysis?
Here’s my biggest gripe with the prospective meta-analysis - in essence, you are taking multiple looks at the data, one look after each trial is completed (I’m not even counting intra-trial interim analyses), but you’re not correcting for the multiple comparisons. And most likely, once there is a substantial positive trial, it will not be repeated, for a number of reasons, chief among them the hand-waving about how it would be unethical to repeat it and randomize patients to no treatment (notwithstanding that repeatability is one of the cardinal features of science). Think ARMA (http://content.nejm.org/cgi/content/extract/343/11/812 ). There were smaller trials leading up to it, but once ARMA was positive, no additional noteworthy trials sought to test low tidal volume ventilation for ARDS. So, if we’re going to stop conducting trials for our “prospective meta-analysis”, what will our early stopping rule be? When will we stop our sequence of trials? Will we require a P-value of 0.001 or less after the first look at the data (that is, after the first trial is completed)? Doubtful. As soon as a significant result is found in a soundly designed trial, further earnest trials of the therapy will cease and victory will be declared. Only when there is a failure or a “near-miss” will we want a “do-over” to create more fodder for our “prospective meta-analysis”. We will keep chasing the result we seek until we find it, nitpicking design and enrollment details of “failed” trials along the way to justify the continued search for the “real” result with a bigger and better trial.
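The uncorrected multiple-looks problem compounds quickly. Treating each trial in the sequence as an independent test at alpha = 0.05 (a simplifying assumption, since real sequential trials are not fully independent):

```python
# Family-wise Type I error rate across k sequential trials, each tested
# at alpha = 0.05 with no correction for the repeated looks:
alpha = 0.05
for k in (1, 2, 3, 5):
    print(k, "trials:", round(1 - (1 - alpha) ** k, 3))
```

By the third look the chance of at least one spurious "positive" trial is around 14%, and by the fifth it approaches 23% - yet the sequence stops, and victory is declared, at the first significant result.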
If we’re going to go to the trouble of coordinating a prospective meta-analysis, I don’t understand why we wouldn’t just coordinate an adequately powered RCT based on a realistic delta (itself based on an MCID or preliminary data), and carry it to its pre-specified enrollment endpoint, “futility stopping rules” be damned. With the statistical precision that would result, we could examine the 95% CI of the resulting delta to answer the practical questions that clinicians want answers for, even if our P-value were insufficient to satisfy the staunchest of statisticians. Perhaps the best thing about such a study is that the force of its statistical precision would incapacitate single center trialists, delta inflationists, and meta-analysts alike.
Friday, March 5, 2010
Levo your Dopa at the Door - how study design influences our interpretation of reality
Hopefully all docs with a horse in this race will take note of the outcome of this study. In its simplest and most straightforward and technically correct interpretation, levo was not superior to dopa in terms of an effect on mortality, but was indeed superior in terms of side effects, particularly cardiac arrhythmias (a secondary endpoint). The direction of the mortality trend was in favor of levo, consistent with observational data (the SOAP I study by many of the same authors) showing reduced mortality with levo compared with dopa in the treatment of shock. As followers of this blog also know, the interpretation of "negative" studies (that is, MOST studies in critical care medicine - more on that in a future post) can be more challenging than the interpretation of positive studies, because "absence of evidence is not evidence of absence".
We could go to the statistical analysis section, and I could harp on the choice of delta, the decision to base it on a relative risk reduction, the failure to predict a baseline mortality, etc. (I will note that at least the authors defended their delta based on prior data, something that is a rarity - again, a future post will focus on this.) But let's just be practical and examine the 95% CI of the mortality difference (the primary endpoint) and try to determine whether it contains or excludes any clinically meaningful values that may allow us to compare these two treatments. First, we have to go to the raw data and find the 95% CI of the ARR, because, as you know, the odds ratio can inflate small differences. That is, if the baseline is 1%, then a statistically significant odds ratio of 1.4 is not meaningful because it represents only a 0.4% absolute increase in the outcome - minuscule. With Stata, we find that the ARR is 4.0%, with a 95% CI of -0.76% (favors dopamine) to +8.8% (favors levo). Wowza! Suppose we say that a 3% difference in mortality in either direction is our threshold for CLINICAL significance. This 95% CI includes a whole swath of values between 3% and 8.8% that are of interest to us, and they are all in favor of levo. (Recall that perhaps the most lauded trial in critical care medicine, the ARDSnet ARMA study, reduced mortality by just 9%.) On the other side of the spectrum, the range of values in favor of dopa is quite narrow indeed - from 0% to -0.76%, all well below our threshold for clinical significance (that is, the minimal clinically important difference or MCID) of 3%. So indeed, this study surely seems to suggest that if we ever choose between these two widely available and commonly used agents, the cake goes to levo, hands down. I hardly need a statistically significant result with a 95% CI like this one!
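For readers who want to check the Stata result themselves, the Wald normal approximation gets essentially the same answer. The event rates and arm sizes below are my reading of the published figures (mortality of roughly 52.5% among 858 dopamine patients versus 48.5% among 821 norepinephrine patients), so treat them as approximate:

```python
from math import sqrt

# Absolute risk reduction and Wald 95% CI for the difference between two
# proportions. p_a/n_a = dopamine arm, p_b/n_b = norepinephrine (levo) arm.
def arr_ci(p_a, n_a, p_b, n_b):
    arr = p_a - p_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return arr, arr - 1.96 * se, arr + 1.96 * se

arr, lo, hi = arr_ci(0.525, 858, 0.485, 821)
print(round(100 * arr, 1), round(100 * lo, 2), round(100 * hi, 2))
```

This lands within rounding of the -0.76% to +8.8% interval computed in Stata from the raw counts.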
So, then, why was the study deemed "negative"? There are a few reasons. Firstly, the trial is probably guilty of "delta inflation" whereby investigators seek a pre-specified delta that is larger than is realistic. While they used, ostensibly, 7%, the value found in the observational SOAP I study, they did not account for regression to the mean, or allow any buffer for the finding of a smaller difference. However, one can hardly blame them. Had they looked instead for 6%, and had the 4% trend continued for additional enrollees, 300 additional patients in each group (or about 1150 in each arm) would have been required and the final P-value would have still fallen short at 0.06. Only if they had sought a 5% delta, which would have DOUBLED the sample size to 1600 per arm, would they have achieved a statistically significant result with 4% ARR, with P=0.024. Such is the magnitude of the necessary increase in sample size as you seek smaller and smaller deltas.
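The escalating sample sizes quoted above follow from the standard normal-approximation formula for comparing two proportions. Here is a sketch under assumed design parameters (two-sided alpha of 0.05, 80% power, baseline mortality around 52.5%); it is not a reconstruction of the investigators' actual calculation:

```python
from math import ceil

def n_per_arm(p_control, delta, alpha=0.05, power=0.80):
    """Approximate per-arm n to detect an absolute difference of delta
    between two proportions (normal approximation, two-sided alpha)."""
    z_alpha = {0.05: 1.960, 0.01: 2.576}[alpha]   # z for two-sided alpha
    z_beta = {0.80: 0.842, 0.90: 1.282}[power]    # z for the chosen power
    p_treat = p_control - delta
    var = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return ceil((z_alpha + z_beta) ** 2 * var / delta ** 2)
```

Under these assumptions a 7% delta needs about 800 patients per arm, while a 5% delta needs roughly 1,570 - the doubling described above. The required n grows with the inverse square of delta, which is exactly why delta inflation is so tempting.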
Which brings me to the second issue. If delta inflation leads to negative studies, and logistical and financial constraints prohibit the enrollment of massive numbers of patients, what is an investigator to do? Sadly, the poor investigator wishing to publish in the NEJM or indeed any peer reviewed journal is hamstrung by conventions that few these days even really understand anymore: namely, the mandatory use of 0.05 for alpha and "doubly significant" power calculations for hypothesis testing. I will not comment more on the latter other than to say that interested readers can google this and find some interesting, if arcane, material. As regards the former, a few comments.
The choice of 0.05 for the type 1 error rate, that is, the probability that we will reject the null hypothesis based on the data and falsely conclude that one therapy is superior to the other; and the choice of 10-20% for the type 2 error rate (power 80-90%), that is, the probability that the alternative hypothesis is really true but we will fail to detect the difference based on the data; derive from the traditional assumption, which is itself an omission bias, that it is better in the name of safety to keep new agents out of practice by having a more stringent requirement for accepting efficacy than the requirement for rejecting it. This asymmetry in the design of trials is of dubious rationality from the outset (because it is an omission bias), but it is especially nettlesome when the trial is comparing two agents already in widespread use. As opposed to the trial of a new drug compared to placebo, where we want to set the hurdle high for declaring efficacy, especially when the drug might have side effects - with levo versus dopa, the real risk is that we'll continue to consider them to be equivalent choices when there is strong reason to favor one over the other based either on previous or current data. This is NOT a trial of treatment versus no treatment of shock; this trial assumes that you're going to treat the shock with SOMETHING. In a trial such as this one, one could make a strong argument that a P-value of 0.10 should be the threshold for statistical significance. In my mind it should have been.
But as long as the perspicacious consumer of the literature and reader of this blog takes P-values with a grain of salt and pays careful attention to the confidence intervals and the MCID (whatever that may be for the individual), s/he will not be misled by the deeply entrenched convention of alpha at 0.05, power at 90%, and delta wildly inflated to keep the editors and funding agencies mollified.
Tuesday, February 9, 2010
Post hoc non ergo propter hoc extended: A is associated with B therefore A causes B and removal of A removes B
"...patients whose septic shock is treated with hydrocortisone commonly have blood glucose levels higher than 180. These levels have clearly been associated with marked increase in the risk of dying...Thus, we hypothesized that normalization of blood glucose levels with intensive insulin treatment may improve the outcome of adults with septic shock who are treated with hydrocortisone."
The normalization heuristic is at work again.
Endocrine interventions as adjunctive treatments in critical care medicine have a sordid history. Here are some landmarks. Rewind 25 years, and as Angus has recently described (http://jama.ama-assn.org/cgi/content/extract/301/22/2388 ) we had the heroic administration of high dose corticosteroids (e.g. gram doses of methylprednisolone) for septic shock, which therapy was later abandoned. In the 1990s, we had two concurrent trials of human growth hormone in critical illness, showing the largest statistically significant harms (increased mortality of ~20%) from a therapy in critical illness that I'm aware of (see http://content.nejm.org/cgi/content/abstract/341/11/785 ). Early in the new millennium, based on two studies that should by now be largely discredited by their successors, we had intensive insulin therapy for patients with hyperglycemia and low dose corticosteroid therapy for septic shock. It is fitting then, and at least a little ironic, that this new decade should start with publication of a study combining these latter two therapies of dubious benefit: The aptly named COIITSS study.
I know this sounds overly pessimistic, but some of these therapies, these two in particular, just need to die, but are being kept alive by the hope, optimism, and proselytizing of those few individuals whose careers were made on them or continue to depend upon them. And I lament the fact that, as a result of the promotional efforts of these wayward souls, we have been distracted from the actual data. Allow me to summarize these briefly:
1.) The original Annane study of steroids in septic shock (see: http://jama.ama-assn.org/cgi/content/abstract/288/7/862 ) utilized an adjusted analysis of a subgroup of patients not identifiable at the outset of the trial (responders versus non-responders). The entire ITT (intention to treat) population had an ADJUSTED P-value of 0.09. I calculated an unadjusted P-value of 0.29 for the overall cohort. Since you cannot know at the outset who's a responder and who's not, for a practitioner, the group of interest is the ITT population, and there was NO EFFECT in this population. Somehow, the enthusiasm for this therapy was so great that we lost sight of the reasons that I assume the NEJM rejected this article - an adjusted analysis of a subgroup. Seriously! How did we lose sight of this? Blinded by hope and excitement, and the simplicity of the hypothesis - if it's low, make it high, and everything will be better. Then Sprung and the CORTICUS folks came along (see: http://content.nejm.org/cgi/content/abstract/358/2/111 ), and, as far as I'm concerned, blew the whole thing out of the water.
2.) I remember my attending raving about the Van den Berghe article (see: http://content.nejm.org/cgi/content/abstract/345/19/1359 ) as a first year fellow at Johns Hopkins in late 2001. He said "this is either the greatest therapy ever to come to critical care medicine, or these data are faked!" That got me interested. And I still distinctly remember circling something in the methods section, which was in small print in those days, on the left hand column of the left page back almost 9 years ago - that little detail about the dextrose infusions. This therapy appeared to work in post-cardiac surgery patients on dextrose infusions at a single center. I was always skeptical about it, and then the follow-up study came out, and lo and behold, NO EFFECT! But that study is still touted by the author as a positive one! Because again, like in Annane, if you remove those pesky patients who didn't stay in the MICU for 3 days (again, like Annane, not identifiable at the outset), you have a SUBGROUP analysis in which IIT (intensive insulin therapy - NOT intention to treat, ITT is inimical to IIT) works. Then you had NICE-SUGAR (see: http://content.nejm.org/cgi/content/abstract/360/13/1283 ) AND Brunkhorst et al (see: http://content.nejm.org/cgi/content/abstract/358/2/125 ) showing that IIT doesn't work. How much more data do we need? Why are we still doing this?
Because old habits die hard and so do true believers. Thus it was perhaps inevitable that we would have COIITSS combine these two therapies into a single trial. Note that this trial does nothing to address whether hydrocortisone for septic shock is efficacious (it probably is NOT), but rather assumes that it is. I note also that it was started in 2006, just shortly before the second Van den Berghe study was published and well after the data from that study were known. Annane et al make no comment about whether those data impacted the conduct of their study, or about whether participants were informed that a repeat of the trial upon which the Annane trial was predicated had failed.
Annane did not use blinding for fludrocortisone in the current study, but this is minor. It is difficult to blind IIT, but usually what you do when you can't blind things adequately is you protocolize care. That was not obviously done in this trial; instead we are reassured that "everybody should have been following the Surviving Sepsis Campaign guidelines". (I'm paraphrasing.)
As astutely pointed out by Van den Berghe in the accompanying editorial, this trial was underpowered. It was just plain silly to assume (or play dumb) that a 3% ARR which is a ~25% RRR (since the baseline was under 10%) would translate into a 12.5% ARR with a baseline mortality of near 50%. Indeed, I don't know why we even talk about RRRs anymore, they're a ruse to inflate small numbers and rouse our emotions. (Her other comments, about "separation", which would be facilitated by having a very very intensive treatment and a very very lax control is reminiscent of what folks were saying about ARMA low/high Vt - namely that the trial was ungeneralizable because the "control" 12 cc/kg was unrealistic. Then you get into the Eichacker and Natanson arguments about U-shaped curves [to which there may be some truth] and how too much is bad, not enough is bad, but somewhere in the middle is the "sweet spot". And this is key. Would that I could know the sweet spot for blood sugar - and coax patients to remain there.)
Because retrospective power calculations are uncouth, I elected to calculate the 95% confidence interval (CI) for delta (the difference between the two groups) in this trial. The point estimate for delta is -2.96% (negative delta means the therapy was WORSE than control!) with a 95% confidence interval of -11.6% to +5.65%. It is most likely between 11% worse and 5% better, and any good betting man would wager that it's worse than control! But in either case, this confidence interval is uncomfortably wide and contains values for harm and benefit which should be meaningful to us, so in essence the data do not help us decide what to do with this therapy.
(And look at table 2, the main results, look they are still shamelessly reporting adjusted P-values! Isn't that why we randomize? So we don't have to adjust?)
To bring this saga full circle, I note that, as we saw in NICE-SUGAR, Brunkhorst, and Van den Berghe, severe hypoglycemia (<40!) was far more common in the IIT group. And severe hypoglycemia is associated with death (in most studies, but curiously not in this one). So, consistent with the hypothesis which was the impetus for this study (A is associated with B, thus A causes B and removal of A removes B), one conclusion from all these data is that hypoglycemia causes death, and should be avoided through avoidance of IIT.
Tuesday, December 29, 2009
How much Epi should we give, if we give Epi at all?
It's not as heretical as it sounds. In 2000, the NEJM reported the results of a Seattle study by Hallstrom et al (http://content.nejm.org/cgi/content/abstract/342/21/1546 ) showing that CPR appears to be as effective (and indeed perhaps more effective) when mouth-to-mouth ventilation is NOT performed along with chest compressions by bystanders. Other evidence with a similar message has since accumulated. With resuscitation, more effort, more intervention does not necessarily lead to better results. The normalization heuristic fails us again.
Several things can be learnt from the recent Norwegian trial. First, recall that RCTs are treasure troves of epidemiological data. The data from this trial reinforce what we practitioners already know, but which is not well-known among uninitiated laypersons: survival of out-of-hospital (OOH) cardiac arrest is dismal, on the order of 10% or so.
Next, looking at Table 2 of the outcomes data, we note that while survival to hospital discharge, the primary outcome, seems to be no different between the drug and no-drug groups, there are what appear to be important trends in favor of drug - there is more Return of Spontaneous Circulation (ROSC), there are more admissions to the ICU, there are more folks discharged with good neurological function. This is reminiscent of a series of studies in the 1990s (e.g., http://content.nejm.org/cgi/content/abstract/339/22/1595 ) showing that high dose epinephrine, while improving ROSC, did not lead to improved survival. Ultimately, the usefulness of any of these interventions hinges on what your goals are. If your goal is survival with good neurological function, epinephrine in any dose may not be that useful. But if your goal is ROSC, you might prefer to give a bunch of it. I'll leave it to you to determine what your goals are, and whether, on balance, you think they're worthy goals.
There are two other good lessons from this article. In this study, the survival rate in the drug group was 10.5% and that in the no-drug group was 9.2%, for a difference of 1.3% and this small difference was not statistically significant. Does that mean there's no difference? No, it does not, not necessarily. There might be a difference that this study failed to detect because of a Type II error. (The study was designed with 91% power, so there's a 9% chance that a true difference will be missed, and the chances are even greater since the a priori sample size was not achieved.) If you follow this blog, you know that if the study is negative, we need to look at the 95% confidence interval (CI) around the difference to see if it might include clinically meaningful values. The 95% CI for this difference (not reported by the authors, but calculated by me using Stata) was -5.2% to +2.8%. That is, no drug might be up to about 5% worse or up to about 3% better than drug. Would you stop giving Epi for resuscitation on the basis of this study? Is the CI narrow enough for you? Is a 5% decrease in survival with no drug negligible? I'll leave that for you to decide.
(I should not gloss over the alternative possibility which is that the results are also compatible with no-drug being 2.8% better than drug. But if you're playing the odds, methinks you are best off betting the other way, given table 2.)
Now, as an extension of the last blog post, let's look at the relative numbers. The 95% CI for the relative risk (RR) is 0.59 - 1.33. That means that survival might be reduced by as much as 41% with no drug! That sounds like a LOT doesn't it? This is why I consistently argue that relative numbers be avoided in appraising the evidence. RRs give unfair advantages to therapies targeting diseases with survivals closer to 0%. There is no rational reason for such an advantage. A 1% chance of dying is a 1% chance of dying no matter where it falls along the continuum from zilch to unity.
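For those curious how that relative risk interval is derived: it is computed on the log scale (the Katz method). Here is a sketch; the counts in the example are assumptions back-calculated from the approximate survival percentages above (roughly 38 of 418 no-drug survivors versus 45 of 433 drug survivors), not the trial's exact published denominators:

```python
from math import exp, log, sqrt

def rr_ci(events1, n1, events2, n2, z=1.96):
    """95% CI for the relative risk p1/p2, computed on the log scale."""
    p1, p2 = events1 / n1, events2 / n2
    rr = p1 / p2
    # standard error of log(RR): sqrt(1/a - 1/n1 + 1/c - 1/n2)
    se_log = sqrt((1 - p1) / events1 + (1 - p2) / events2)
    return rr, exp(log(rr) - z * se_log), exp(log(rr) + z * se_log)

# Assumed counts: survivors/patients, no-drug vs. drug
rr, lo, hi = rr_ci(38, 418, 45, 433)
```

With these assumed counts the interval lands near the 0.59 to 1.33 quoted above. Notice how a trivial absolute difference yields an alarming-sounding 41% on the relative scale.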
Lessons from this article: beware of pathophysiological reasoning, and translation from the hamster and molecule labs; determine the goals of your therapy and whether they are worthy goals; absence of evidence is not evidence of absence; look at CIs for the difference between therapies in "negative" trials and see if they include clinically meaningful values; and finally, beware of inflation of perceived benefit caused by presentation of relative risks rather than absolute risks.
Wednesday, December 16, 2009
Dabigatran and Dabigscam of non-inferiority trials, pre-specified margins of non-inferiority, and relative risks
Before we go on, I ask you to engage in a mental exercise of sorts that I'm trying to make a habit. (If you have already read the article and recall the design and the results, you will be biased, but go ahead anyway this time.) First, ask yourself what increase in an absolute risk of recurrent DVT/PE/death is so small that you consider it negligible for the practical purposes of clinical management. That is, what difference between two drugs is so small as to be pragmatically irrelevant? Next, ask yourself what RELATIVE increase in risk is negligible? (I'm purposefully not suggesting percentages and relative risks as examples here in order to avoid the pitfalls of "anchoring and adjustment": http://en.wikipedia.org/wiki/Anchoring .) Finally, assume that the baseline risk of VTE at 6 months is ~2% - with this "baseline" risk, ask yourself what absolute and relative increases above this risk are, for practical purposes, negligible. Do these latter numbers jibe with your answers to the first two questions which were answered when you had no particular baseline in mind?
Note how it is difficult to reconcile your "intuitive" instincts about what is a negligible relative and absolute risk with how these numbers might vary depending upon what the baseline risk is. Personally, I think about a 3% absolute increase in the risk of DVT at 6 months to be on the precipice of what is clinically significant. But if the baseline risk is 2%, a 3% absolute increase (to 5%) represents a 2.5x increase in risk! That's a 150% increase, folks! Imagine telling a patient that the use of drug ABC instead of XYZ "only doubles your risk of another clot or death". You can visualize the bewildered faces and incredulous, furrowed brows. But if you say, "the difference between ABC and XYZ is only 3%, and drug ABC costs pennies but XYZ is quite expensive, " that creates quite a different subjective impression of the same numbers. Of course, if the baseline risk were 10%, a 3% increase is only a 30% or 1.3x increase in risk. Conversely, with a baseline risk of 10%, a 2.5x increase in risk (RR=2.5) means a 15% absolute increase in the risk of DVT/PE/Death, and hardly ANYONE would argue that THAT is negligible. We know that doctors and laypeople respond better to, or are more impressed by, results that are described as RRR than ARR, ostensibly because the former inflates the risk because the number appears bigger (e-mail me if you want a reference for this). The bottom line is that what matters is the absolute risk. We're banking health dollars. We want the most dollars at the end of the day, not the largest increase over some [arbitrary] baseline. So I'm not sure why we're still designing studies with power calculations that utilize relative risks.
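The arithmetic connecting baseline risk, absolute increase, and relative increase is trivial, but making it explicit is instructive. A minimal sketch, using the same numbers as the paragraph above:

```python
def abs_increase(baseline, rr):
    """Absolute risk increase implied by a relative risk at a given baseline."""
    return baseline * (rr - 1)

def implied_rr(baseline, abs_inc):
    """Relative risk implied by an absolute increase at a given baseline."""
    return (baseline + abs_inc) / baseline

abs_increase(0.02, 2.5)   # 2% baseline, RR 2.5 -> 3% absolute increase
implied_rr(0.10, 0.03)    # 10% baseline, +3% absolute -> RR 1.3
abs_increase(0.10, 2.5)   # 10% baseline, RR 2.5 -> 15% absolute increase
```

Same numbers, wildly different rhetorical force depending on which scale you quote.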
With this in mind, let's check the assumptions of the design of this non-inferiority trial (NIT). It was designed with 90% power to exclude a hazard ratio (HR; similar to a relative risk for our purposes) of 2.75. That HR of 2.75 sure SOUNDS like a lot. But with a predicted baseline risk of 2% (which prediction panned out in the trial - the baseline risk with warfarin was 2.1%), that amounts to an event rate of only 5.78%, or an absolute increase of 3.68%, which I will admit is close to my own a priori negligibility level of 3%. The authors justify this assignment based on 4 referenced studies all prior to 1996. I find this curious. Because they are so dated and in a rather obscure journal, I have access only to the 1995 NEJM study (http://content.nejm.org/cgi/reprint/332/25/1661.pdf ). In this 1995 study, the statistical design is basically not even described, and there were 3 primary endpoints (ahhh, the 1990s). This is not exactly the kind of study that I want to model a modern trial after. In the table below, I have abstracted data from the 1995 trial and three more modern ones (all comparing two treatment regimens for DVT/PE) to determine both the absolute risks and relative risks that were observed in these trials.

Table 1. Risk reductions in several RCTs comparing treatment regimens for DVT/PE. Outcomes are the combination of recurrent DVT/PE/Death unless otherwise specified. *recurrent DVT/PE only; raw numbers used for simplicity in lieu of time to event analysis used by the authors
From this table we can see that in SUCCESSFUL trials of therapies for DVT/PE treatment, absolute risk reductions in the range of 5-10% have been demonstrated, with associated relative risk increases of ~1.75-2.75 (for placebo versus comparator - I purposefully made the ratio in this direction to make it more applicable to the dabigatran trial's null hypothesis [NH] that the 95% CI for dabigatran includes 2.75 HR - note that the NH in an NIT is the enantiomer of the NH in a superiority trial). Now, from here we must make two assumptions, one which I think is justified and the other which I think is not. The first is that the demonstrated risk differences in this table are clinically significant. I am inclined to say "yes, they are" not only because a 5-10% absolute difference just intuitively strikes me as clinically relevant compared to other therapies that I use regularly, but also because, in the cases of the 2003 studies, these trials were generally counted as successes for the comparator therapies. The second assumption we must make, if we are to take the dabigatran authors seriously, is that differences smaller than 5-10% (say 4% or less) are clinically negligible. I would not be so quick to make this latter assumption, particularly in the case of an outcome that includes death. Note also that the study referenced by the authors (reference 13 - the Schulman 1995 trial) was considered a success with a relative risk of 1.73, and that the 95% CI for the main outcome of the RE-COVER study ranged from 0.65-1.84 - it overlaps the Schulman point estimate of RR of 1.73, and the Lee point estimate of 1.83! Based on an analysis using relative numbers, I am not willing to accept the pre-specified margin of non-inferiority upon which this study was based/designed.
But, as I said earlier, relative differences are not nearly as important to us as absolute differences. If we take the upper bound of the HR in the RE-COVER trial (1.84) and multiply it by the baseline risk (2.1%) we get an upper 95% confidence bound for the risk of the outcome of 3.86%, which corresponds to an absolute risk difference of 1.76%. This is quite low, and personally it satisfies my requirement for very small differences between two therapies if I am to call them non-inferior to one another.
So, we have yet again a NIT which was designed upon precarious and perhaps untenable assumptions, but which, through luck or fate was nonetheless a success. I am beginning to think that this dabigatran drug has some merit, and I wager that it will be approved. But this does not change the fact that this and previous trials were designed in such a way as to allow a defeat of warfarin to be declared based on much more tenuous numbers.
I think a summary of sorts for good NIT design is in order:
• The pre-specified margin of non-inferiority should be smaller than the MCID (minimal clinically important difference), if there is an accepted MCID for the condition under study
• The pre-specified margin of non-inferiority should be MUCH smaller than statistically significant differences found in "successful" superiority trials, and ideally, the 95% CI in the NIT should NOT overlap with point estimates of significant differences in superiority trials
• NITs should disallow "asymmetry" of conclusions - see the last post on dabigatran. If the pre-specified margin of non-inferiority is a relative risk of 2.0 and the observed 95% CI must not include that value to claim non-inferiority, then superiority should not be declared unless the 95% confidence interval of the point estimate falls entirely beyond the mirrored margin - that is, beyond a relative risk of 0.5, the reciprocal of 2.0. What did you say? That's impossible, it would require a HUGE risk difference and a narrow CI for that to ever happen? Well, that's why you can't make your delta unrealistically large - you'll NEVER claim superiority, if you're being fair about things. If you make delta very large it's easier to claim non-inferiority, but you should also suffer the consequences by basically never being able to claim superiority either.
• We should concern ourselves with Absolute rather than Relative risk reductions
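To make the symmetry requirement in the third bullet concrete, here is a sketch of a decision rule that treats both arms even-handedly on the ratio scale. The margin and CI values in the comments are hypothetical, chosen only to resemble the dabigatran situation:

```python
def symmetric_nit_verdict(ci_lo, ci_hi, margin):
    """Classify a result symmetrically, given the 95% CI for the relative
    risk (new drug vs. comparator; RR < 1 favors the new drug) and a
    non-inferiority margin > 1. Superiority of either arm requires
    clearing the MIRRORED margin (1/margin), not merely excluding 1.0."""
    mirrored = 1.0 / margin
    if ci_hi < mirrored:
        return "new drug superior"         # whole CI beyond mirrored margin
    if ci_lo > margin:
        return "comparator superior"       # whole CI beyond the margin
    if ci_hi < margin and ci_lo > mirrored:
        return "non-inferior either way"   # neither arm clears a margin
    if ci_hi < margin:
        return "new drug non-inferior"
    if ci_lo > mirrored:
        return "comparator non-inferior"
    return "inconclusive"

# Hypothetical CI resembling a RE-LY-like result, margin RR = 1.46:
# symmetric_nit_verdict(0.53, 0.82, 1.46) -> "new drug non-inferior"
```

Note that under this even-handed rule, a CI like (0.53, 0.82) against a margin of 1.46 earns only "non-inferior", never "superior", because it fails to clear the mirrored margin of about 0.68. That is the price of choosing a wide delta, and it is a price the sponsors of asymmetric designs never pay.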
Monday, September 21, 2009
The unreliable asymmetric design of the RE-LY trial of Dabigatran: Heads I win, tails you lose

I'm growing weary of this. I hope it stops. We can adapt the diagram of non-inferiority shenanigans from the Gefitinib trial (see http://medicalevidence.blogspot.com/2009/09/theres-no-such-thing-as-free-lunch.html ) to last week's trial of dabigatran, which came on the scene of the NEJM with another ridiculously designed non-inferiority trial (see http://content.nejm.org/cgi/content/short/361/12/1139 ). Here we go again.
These jokers, lulled by the corporate siren song of Boehringer Ingelheim, had the utter unmitigated gall to declare a delta of 1.46 (relative risk) as the margin of non-inferiority! Unbelievable! To say that a 46% difference in the rate of stroke or arterial clot is clinically non-significant! Seriously!?
They justified this felonious choice on the basis of trials comparing warfarin to PLACEBO as analyzed in a 10-year-old meta-analysis. It is obvious (or should be to the sentient) that an ex-post difference between a therapy and placebo in superiority trials does not apply to non-inferiority trials of two active agents. Any ex-post finding could be simply fortuitously large and may have nothing to do with the MCID (minimal clinically important difference) that is SUPPOSED to guide the choice of delta in a non-inferiority trial (NIT). That warfarin SMOKED placebo in terms of stroke prevention does NOT mean that something that does not SMOKE warfarin is non-inferior to warfarin. This kind of duplicitous justification is surely not what the CONSORT authors had in mind when they recommended a referenced justification for delta.
That aside, on to the study and the figure. First, we're testing two doses, so there are multiple comparisons, but we'll let that slide for our purposes. Look at the point estimate and 95% CI for the 110 mg dose in the figure (let's bracket the fact that they used one-sided 97.5% CIs - it's immaterial to this discussion). There is a non-statistically significant difference between dabigatran and warfarin for this dose, with a P-value of 0.34. But note that in Table 2 of the article, they declare that the P-value for "non-inferiority" is <0.001 [I've never even seen this done before, and I will have to look to see if we can find a precedent for reporting a P-value for "non-inferiority"]. Well, apparently this just means that the RR point estimate for 110 mg versus warfarin is statistically significantly different from a RR of 1.46. It does NOT mean, though it misleadingly suggests, that the comparison between the two drugs on stroke and arterial clot is highly clinically significant. This "P-value for non-inferiority" is just an artificial comparison: had we set the margin of non-inferiority at an even more ridiculously large value, we could have made the "P-value for non-inferiority" as small as we like simply by inflating the margin of non-inferiority! So this is a useless number, unless your goal is to create an artificial and exaggerated impression of the difference between these two agents.
Now let's look at the 150 mg dose. Indeed, it is statistically significantly different than warfarin (I shall resist using the term "superior" here), and thus the authors claim superiority. But here again, the 95% CI is narrower than the margin of non-inferiority, and had the results gone the other direction, as in Scenarios 3 and 4, (in favor of warfarin), we would have still claimed non-inferiority, even though warfarin would have been statistically significantly "better than" dabigatran! So it is unfair to claim superiority on the basis of a statistically significant result favoring dabigatran, but that's what they do. This is the problem that is likely to crop up when you make your margin of non-inferiority excessively wide, which you are wont to do if you wish to stack the deck in favor of your therapy.
But here's the real rub. Imagine if the world were the mirror image of what it is now and dabigatran were the existing agent for prevention of stroke in A-fib, and warfarin were the new kid on the block. If the makers of warfarin had designed this trial AND GOTTEN THE EXACT SAME DATA, they would have said (look at the left of the figure and the dashed red line there) that warfarin is non-inferior to the 110 mg dose of dabigatran, but that it was not non-inferior to the 150 mg dose of dabigatran. They would NOT have claimed that dabigatran was superior to warfarin, nor that warfarin was inferior to dabigatran, because the 95% CI of the difference between warfarin and dabigatran 150 mg crosses the pre-specified margin of non-inferiority. And to claim superiority of dabigatran, the 95% CI of the difference would have to fall all the way to the left of the dashed red line on the left. (See Piaggio, JAMA, 2006.)
The claims that result from a given dataset should not depend on who designs the trial, and which way the asymmetry of interpretation goes. But as long as we allow asymmetry in the interpretation of data, they shall. Heads they win, tails we lose.
Tuesday, September 15, 2009
Plavix (clopidogrel), step aside, and prasugrel (Effient), watch your back: Ticagrelor proves that some "me-too" drugs are truly superior
I will rarely be using either of these drugs or Plavix because I rarely treat AMI or patients undergoing PCI. My interest in this trial and that of prasugrel stems from the fact that in the cases of these two agents, the sponsoring company indeed succeeded in making what is in essence a "me-too" drug that is superior to an earlier-to-market agent(s). They did not monkey around with this non-inferiority trial crap like anidulafungin and gefitinib and just about every antihypertensive that has come to market in the past 10 years, they actually took Plavix to task and beat it, legitimately. For this, and for the sheer size of the trial and its superb design, they deserve to be commended.
One take-home message here, and from other posts on this blog is "beware the non-inferiority trial". There are a number of reasons that a company will choose to do a non-inferiority trial (NIT) rather than a superiority trial. First, as in the last post (http://medicalevidence.blogspot.com/2009/09/theres-no-such-thing-as-free-lunch.html ) running a NIT often allows you to have your cake and eat it too - you can make it easy to claim non-inferiority (wide delta) AND make the criterion for superiority (of your agent) more lenient than the inferiority criterion, a conspicuous asymmetry that just ruffles my feathers again and again. Second, you don't run the risk of people saying after the fact "that stuff doesn't work," even though absence of evidence does not constitute evidence of absence. Third, you have great latitude with delta in a NIT and that's appealing from a sample size standpoint. Fourth, you don't actually have to have a better product which might not even be your goal, which is rather to get market share for an essentially identical product. Fifth, maybe you can't recruit enough patients to do a superiority trial. The ticagrelor trial recruited over 18,000 patients. You can look at this in two ways. One is that the difference they're trying to demonstrate is quite small, so what does it matter to you? (If you take this view, you should be especially dismissive of NITs, since they're not trying to show any difference at all.) The other is that if you can recruit 18,000 patients into a trial, even a multinational trial, the problem that is being treated must be quite prevalent, and thus the opportunity for impact from a superior treatment, even one with a small advantage, is much greater. 
It is much easier and more likely, in a given period of time, to treat 50 acute MIs and save a life with ticagrelor (compared to Plavix; NNT = 1/0.02 = 50) than it is to find 8 patients with invasive candidiasis and treat them with anidulafungin (compared to fluconazole; NNT = 1/0.12 ≈ 8; see Reboli et al: http://content.nejm.org/cgi/reprint/356/24/2472.pdf ), and in the latter case you're not saving a life but merely preventing a treatment failure. Thus, compared to anidulafungin, with its limited scope of application and limited impact, a drug like ticagrelor has much more public health impact. You should simply pay more attention to larger trials; there's more likely to be something important going on there. By inference, the conditions they are treating are likely to be a "bigger deal".
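The arithmetic behind those NNTs is just the reciprocal of the absolute risk reduction; a two-line sketch using the post's rounded figures:

```python
def nnt(arr):
    """Number needed to treat = 1 / absolute risk reduction (ARR)."""
    return 1.0 / arr

# The post's rounded absolute risk reductions:
print(round(nnt(0.02)))  # ticagrelor vs. clopidogrel: 50
print(round(nnt(0.12)))  # anidulafungin vs. fluconazole: 8
```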
Of course, perhaps I'm giving the industry too much credit in the cases of prasugrel and ticagrelor. Did they really have much of a choice? Probably not. Generally, when you do a non-inferiority trial, you try to show non-inferiority along with something like a preferable dosing schedule, reduced cost, or fewer side effects. That way, when the trial is done (if you have shown non-inferiority), you can say, "yeah, they have basically the same effect on xyz, but my drug has better [side effects, dosing, etc.]". Because of the enhanced potency of prasugrel and ticagrelor, the sponsors knew there would be more bleeding and that this would cause alarm. So they needed to show improved mortality (or similar) to show that that bleeding cost is worth paying. Regardless, it is refreshing to see that the industry is indeed designing drugs with demonstrable benefits over existing agents. I am highly confident that the FDA will find ticagrelor to be approvable, and I wager that it will quickly supplant prasugrel. I also wager that when clopidogrel goes generic (soon), it will be a boon for patients, who will know that they are sacrificing very little (2% efficacy compared to ticagrelor or prasugrel) for a large cost savings. For most people, this trade-off will be well worth it. For those fortunate enough to have insurance or another way of paying for ticagrelor, more power to them.
Sunday, September 6, 2009
There's no such thing as a free lunch - unless you're running a non-inferiority trial. Gefitinib for pulmonary adenocarcinoma

A 20% difference in some outcome is either clinically relevant, or it is not. If A is worse than B by 19% and that's NOT clinically relevant and significant, then A being better than B by 19% must also NOT be clinically relevant and significant. But that is not how the authors of trials such as this one see it: http://content.nejm.org/cgi/content/short/361/10/947 . According to Mok and co-conspirators, if gefitinib is no worse in regard to progression-free survival than Carboplatin-Paclitaxel based on a 95% confidence interval that does not include 20% (that is, it may be up to 19.9% worse, but no worse than that), then they call the battle a draw and say that the two competitors are equally efficacious. However, if the trend is in the other direction, that is, in favor of gefitinib BY ANY AMOUNT HOWEVER SMALL (as long as it's statistically significant), they declare gefitinib the victor and call it a day. It is only because of widespread lack of familiarity with non-inferiority methods that they can get away with a free lunch like this. A 19% difference is either significant, or it is not. I have commented on this before, and it should come as no surprise that these trials are usually used to test proprietary agents (http://content.nejm.org/cgi/content/extract/357/13/1347 ). Note also that in trials of adult critical illness, the most commonly sought mortality benefit is about 10% (more data on this forthcoming in an article soon to be submitted and hopefully published). So it's a difficult argument to sustain that something is "non-inferior" if it is less than 20% worse than something else. Designers of critical care trials will tell you that a 10% difference, often much less, is clinically significant.
I have created a figure to demonstrate the important nuances of non-inferiority trials using the gefitinib trial as an example. (I have adapted this from Piaggio et al's 2006 JAMA extension of the CONSORT statement for the reporting of non-inferiority trials - a statement that has been largely ignored: http://jama.ama-assn.org/cgi/content/abstract/295/10/1152?lookupType=volpage&vol=295&fp=1152&view=short .) The authors specified delta, the margin of non-inferiority, to be 20%. I have already made it clear that I don't buy this, but we needn't challenge this value to make war with their conclusions, although challenging it is certainly worthwhile, even if it is not my current focus. This 20% delta corresponds to a hazard ratio of 1.2, as seen in the figure demarcated by a dashed red line on the right. If the hazard ratio (for progression or death) demonstrated by the data in the trial were 1.2, that would mean that gefitinib is 20% worse than comparator. The purpose of a non-inferiority trial is to EXCLUDE a difference as large as delta, the pre-specified margin of non-inferiority. So, to demonstrate non-inferiority, the authors must show that the 95% confidence interval for the hazard ratio falls entirely to the left of that dashed red line at an HR of 1.2. They certainly achieved this goal. Their data, represented by the lowermost point estimate and 95% CI, fall entirely to the left of the pre-specified margin of non-inferiority (the right red dashed line). I have no arguments with this. Accepting ANY margin of non-inferiority (delta), gefitinib is non-inferior to the comparator. What I take exception to is the conclusion that gefitinib is SUPERIOR to comparator, a conclusion that is predicated in part on the chosen delta, to which we are beholden as we make such conclusions.
First, let's look at [hypothetical] Scenario 1. Because the chosen delta was 20% wide (and that's pretty wide - coincidentally, that's the exact width of the confidence interval of the observed data), it is entirely possible that the point estimate could have fallen as pictured for Scenario 1 with the entire CI between an HR of 1 and 1.2, the pre-specified margin of non-inferiority. This creates the highly uncomfortable situation in which the criterion for non-inferiority is fulfilled, AND the comparator is statistically significantly better than gefitinib!!! This could have happened! And it's more likely to happen the larger you make delta. The lesson here is that the wider you make delta, the more dubious your conclusions are. Deltas of 20% in a non-inferiority trial are ludicrous.
Now let's look at Scenarios 2 and 3. In these hypothetical scenarios, comparator is again statistically significantly better than gefitinib, but now we cannot claim non-inferiority because the upper CI falls to the right of delta (red dashed line on the right). But because our 95% confidence interval includes values of HR less than 1.2 and our delta of 20% implies (or rather states) that we consider differences of less than 20% to be clinically irrelevant, we cannot technically claim superiority of comparator over gefitinib either. The result is dubious. While there is a statistically significant difference in the point estimate, the 95% CI contains clinically irrelevant values and we are left in limbo, groping for a situation like Scenario 4, in which comparator is clearly superior to gefitinib, and the 95% CI lies all the way to the right of the HR of 1.2.
Pretend you're in Organic Chemistry again, and visualize the mirror image (enantiomer) of Scenario 4. That is what is required to show superiority of gefitinib over comparator: a point estimate for the HR whose 95% CI lies entirely to the left of negative delta (-20%), that is, entirely below an HR of 0.8. The actual results come close to Scenario 5, but not quite; therefore, the authors are NOT justified in claiming superiority. To do so is to try to have a free lunch, to have their cake and eat it too.
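The decision rules the figure walks through can be written out explicitly. This sketch implements the symmetric logic the post argues for, not standard regulatory practice (which claims superiority whenever the whole CI falls below 1.0); the gefitinib hazard ratio CI of roughly 0.65 to 0.85 is as I recall the published report and should be checked against the paper:

```python
def interpret_nit(ci_low, ci_high, margin=0.2):
    """Classify a hazard ratio 95% CI (new drug vs. comparator; HR < 1
    favors the new drug) under the post's symmetric rule: the margin of
    non-inferiority is HR = 1 + margin, and a superiority claim must
    clear the mirror-image margin at HR = 1 - margin."""
    if ci_high < 1 - margin:               # whole CI clears the mirror margin
        return "superior"
    noninferior = ci_high < 1 + margin     # whole CI excludes the NI margin
    if noninferior and ci_low > 1.0:
        return "paradox: non-inferior, yet comparator statistically better"
    if noninferior:
        return "non-inferior only"
    if ci_low > 1 + margin:                # whole CI beyond the NI margin
        return "comparator clearly superior"
    if ci_low > 1.0:                       # significant, but CI spans the margin
        return "comparator better, but CI contains clinically irrelevant values: limbo"
    return "inconclusive"

# Scenario 1: the whole CI sits between 1.0 and 1.2
print(interpret_nit(1.02, 1.18))
# Approximate published gefitinib result (HR 0.74, 95% CI ~0.65-0.85):
# non-inferior, but the CI does not clear 0.8, so no symmetric superiority claim
print(interpret_nit(0.65, 0.85))
```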
You see, the larger you make delta, the easier it is to achieve non-inferiority. But the more likely it also becomes that you will find a statistically significant difference favoring the comparator rather than your preferred drug, which creates a serious conundrum and paradox for you. At the very least, if you're going to make delta large, you should be bound by your honor and your allegiance to logic and science to make damned sure that, to claim superiority, your 95% confidence interval does not include negative delta. If not, shame on you. Eat your free lunch if you will, but know that the ireful brow of logic and reason is bent unfavorably upon you.
Saturday, September 5, 2009
Troponin I, Troponin T, Troponin is the Woe of Me
Thus I raised at least one brow slightly on August 27th when the NEJM reported two studies of highly sensitive troponin assays for the "early diagnosis of myocardial infarction" (wasn't troponin sensitive enough already? see: http://content.nejm.org/cgi/content/abstract/361/9/858 and http://content.nejm.org/cgi/content/abstract/361/9/868 ). Without commenting on the studies' methodological quality specifically, I will emphasize some pitfalls and caveats related to the adoption of this "advance" in clinical practice, especially outside the setting of an appropriately aged person with risk factors who presents to an acute care setting with SYMPTOMS SUGGESTIVE OF MYOCARDIAL INFARCTION.
In such a patient, say a 59 year old male with hypertension, diabetes and a family history of coronary artery disease, who presents to the ED with chest pain, we (and our cardiology colleagues) are justified in having a high degree of confidence in the results of this test based on these and a decade or more of other data. But I suspect that only the MINORITY of cardiac troponin tests at my institution are ordered for that kind of indication. Rather, it is used as a screening test for just about any patient presenting to the ED who is ill enough to warrant admission. And that's where the problem has its roots. Our confidence in the diagnostic accuracy of this test in the APPROPRIATE SETTING (read appropriate clinical pre-test probability) should not extend to other scenarios, but all too often it does, and it makes a real conundrum when it is positive in those other scenarios. Here's why.
Suppose that we have a pregnancy test that is evaluated in women who have had a sexual encounter and who have missed two menstrual periods, and it is found to be 99.9% sensitive and 99.9% specific. (I will bracket for now the possibility that you could have a 100% sensitive and/or specific test.) Now suppose that you administer this test to 10,000 MEN. Does a positive test mean that a man is pregnant? Heavens no! He probably has testicular cancer or some other malady. This somewhat silly example is actually quite useful for reinforcing the principle that no matter how good a test is, if it is not used in the appropriate scenario, the results are likely to be misleading. Likewise, consider this test's use in a woman who has not missed a menstrual cycle - does a negative test mean that she is not pregnant? Perhaps not, since the sensitivity was determined in a population that had missed 2 cycles. If a woman were obviously 24 weeks pregnant and the test was negative, what would we think? It is important to bear in mind that these tests are NOT direct tests for the conditions we seek to diagnose, but tests of ASSOCIATED biological phenomena, and insomuch as our understanding of those phenomena is limited or there is variation in them, the tests are liable to be fallible. A negative test in a woman with a fetus in utero may mean that the sample was mishandled, that the testing reagents were expired, that there is an interfering antibody, etc. Tests are not perfect, and indeed are highly prone to be misleading if not used in the appropriate clinical scenario.
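The same point can be made quantitatively: positive predictive value is just Bayes' theorem, and it collapses as the pre-test probability falls, no matter how good the test's operating characteristics are. A minimal sketch with the post's 99.9% figures and illustrative prevalences:

```python
def ppv(sens, spec, prevalence):
    """Positive predictive value via Bayes' theorem."""
    true_pos = sens * prevalence
    false_pos = (1 - spec) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# A 99.9% sensitive, 99.9% specific test:
print(ppv(0.999, 0.999, 0.50))   # appropriate population: ~0.999
print(ppv(0.999, 0.999, 0.001))  # low-prevalence screening: 0.5, a coin flip
print(ppv(0.999, 0.999, 0.0))    # the 10,000 men: 0.0, every positive is false
```

Even at one-in-a-thousand prevalence, half of all positives from this near-perfect test are false positives; at zero prevalence, all of them are.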
And thus we return to cardiac troponins. In the patients I'm called to admit to the ICU who have sepsis, PE, COPD, pneumonia, respiratory failure, renal failure, or metabolic acidosis, a mildly positive troponin - a COMMON occurrence - is almost ALWAYS an epiphenomenon of critical illness rather than an acute myocardial infarction. Moreover, the pursuit of diagnosis via cardiac catheterization, or empiric treatment with antiplatelet agents and anticoagulants, is almost always a therapeutic misadventure in these patients, who are at much greater risk of bleeding and renal failure from these interventions, which are expected to have a much reduced positive utility for them. More often than not, I would just rather not know the results of a troponin test outside the setting of isolated acute chest pain. Other practitioners should be acutely aware of the patient populations in which these tests are performed, and of the significant limitations of using these highly sensitive tests in other clinical scenarios.
Thursday, August 13, 2009
The enemy of good evidence is better evidence: Aspirin, colorectal cancer, and knowing when enough is enough
I should start by stating that there is biological plausibility to the hypothesis that ASA might influence the course of these cancers that express COX-2. I am no expert in this area, so I will take it as granted that the basic science evidence is sound enough to inflate the pre-test probability of an effect of ASA to a non-negligible level. Moreover, as pointed out by the authors, other smaller epidemiological investigations have suggested that ASA might improve outcomes from CRCA. The authors of the current investigation found a [marginally] statistically significant reduction of approximately 30% in the hazard of death among patients who took ASA after a diagnosis of CRCA, but not before.
Without delving into the details (knowing that one might find the devil there), I found the conclusions the authors made interesting, namely that additional investigations and randomized controlled trials will be needed before we can recommend ASA to patients diagnosed with CRCA. This caught me as a bit odd, depending upon what our goals are. If our goal is to further study the mechanisms of this disease in pursuit of the truth of the EFFICACY of ASA (see previous blog entry on vertebroplasty for the distinctions between efficacy and effectiveness research), then fine, we need a randomized controlled trial to eliminate all the potential confounding that is inherent in the current study, most notably the possibility that patients who took ASA are different from those who didn't in some important way that also influences outcome. But I'm prepared to accept that there is ample evidence that ASA benefits this condition and that if I had CRCA, the risks of not taking ASA far exceed the risks of taking it, and I would shun participation in any study in which I might be randomized to placebo. This may sound heretical, but allow me to explain my thinking.
I do worry that something that "makes sense" biologically and which is bolstered by epidemiological data might prove to be spurious, as happened in the decades-long saga of Premarin-prevention which came to a close with the Women's Health Initiative (WHI) study. But there are important differences here. Premarin had known side effects (clotting, increased risk of breast cancer) and it was being used long-term for the PREVENTION of remote diseases that would afflict women in the [distant] future. ASA has a proven safety profile spanning over a century, and patients with CRCA have a near-term risk of DEATH from it. So, even though both premarin and ASA might be used on the basis of fallible epidemiological data, there are important differences that we must consider. (I am also reminded of the ongoing debates and study of NAC for prevention of contrast nephropathy, which I think has gone on for far too long. There is ample evidence that it might help, and no evidence of adverse effects or excessive cost. When is enough enough?)
I just think we have become too beholden to certain mantras (like RCTs being the end-all-be-all or mortality being the only acceptable outcome measure), and we don't look at different situations with an independently critical eye. This is not low tidal volume ventilation where the critical care community needs unassailable evidence of efficacy to be convinced to administer it to patients who will have little say in the tidal volume their doctor uses to ventilate them. These are cognizant patients with cancer, this is a widely available over-the-counter drug, and this is a disease which makes people feel desperate, desperate enough to enroll in trials of experimental and toxic therapies. The minor side effects of ASA are the LEAST of their worries, especially considering that most of the patients in the cohort examined in this trial were using ASA for analgesia! If they are generally not concerned about side effects when it is used for arthritis, how can we justify withholding or not recommending it for patients with CANCER whose LIVES may be saved by it?
If I were a patient with CRCA, I would take ASA (in fact, I already take ASA!) and I would scoff at attempts to enroll me into a trial where I might receive placebo. The purists in pursuit of efficacy and mechanisms and the perfect trial be damned. I would much rather have a gastrointestinal hemorrhage than an early death from CRCA. That's just me. Others may appraise the risks and values of the various outcomes differently. And if they want to enroll in a trial, more power to them, so long as the investigators have adequately and accurately informed them of the existing data and the risks of both ASA and placebo, in the specific context of their specific disease and given the epidemiological data. Otherwise, their enrollment is probably ethically precarious, especially if they would go home and take an ASA for a more benign condition without another thought about it.
Tuesday, August 11, 2009
Vertebroplasty: Absence of Evidence Yields to Evidence of Absence. It Takes a Sham to Discover a Sham, but How Will I Get a Sham if I Need One?
In a beautiful extension of that line of critical thinking, two groups of investigators in last week's NEJM challenged the widely and ardently held assumption that vertebroplasty improves patient pain and symptom scores. (See http://content.nejm.org/cgi/content/abstract/361/6/557 ; and http://content.nejm.org/cgi/content/abstract/361/6/569 .) These two similar studies compared vertebroplasty to a sham procedure (control group) in order to control for the powerful placebo effect that accounts for part of the benefit of many medical and surgical interventions, and which is almost assuredly responsible for the reported and observed benefits of such "alternative and complementary medicines" as acupuncture.
There is no difference. In these adequately powered trials (80% power to detect a 2.5 and a 1.5 point difference on the pain scales, respectively), the 95% confidence intervals for delta (the difference between the groups in pain scores) were -0.7 to +1.8 at 3 months in the first study and -0.3 to +1.7 at 1 month in the second study. Given that the minimal clinically important difference in the pain score is considered to be 1.5 points, these two studies all but rule out a clinically significant difference between the procedure and sham. They also show that there is no statistically significant difference between the two, but the former is more important to us as clinicians given that the study is negative. And this is exactly how we should approach a negative study: by asking "does the 95% confidence interval for the observed delta include a clinically important difference?" If it does not, we can be reasonably assured that the study was adequately powered to answer the question that we as practitioners are most interested in. If it does include such a value, we must assume that, given our judgment of clinical value, the study is not helpful to us and essentially underpowered. Note also that by looking at delta this way, we can gauge the statistical precision (power) of the study - powerful studies will yield narrow(er) confidence intervals, and underpowered studies will yield wide(r) ones.
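The reader's check described here reduces to comparing the CI for the between-group difference against the MCID. Note that, applied strictly to the reported intervals, the upper limits (1.8 and 1.7 points) sit just above the 1.5-point MCID, which is why "all but rule out" is the operative phrase; a sketch:

```python
def negative_trial_verdict(ci_low, ci_high, mcid):
    """For a negative trial, ask whether the 95% CI for the difference
    excludes a minimal clinically important difference in either direction."""
    if -mcid < ci_low and ci_high < mcid:
        return "clinically important difference excluded"
    return "CI still contains a clinically important value"

# Vertebroplasty trials, MCID = 1.5 points on the pain scale:
print(negative_trial_verdict(-0.7, 1.8, 1.5))  # upper limit just above 1.5
print(negative_trial_verdict(-0.3, 1.7, 1.5))
```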
These results reinforce the importance of the placebo effect in medical care, and the limitations of inductive thinking in determining the efficacy of a therapy. We must be careful - things that "make sense" do not always work.
But there is a twist of irony in this saga, and something a bit concerning about this whole approach to determining the truth using studies such as these with impeccable internal validity: they lead beguilingly to the message that because the therapy is not beneficial compared to sham, it is of no use. But, very unfortunately and very importantly, that is not a clinically relevant question, because we will not now adopt sham procedures as an alternative to vertebroplasty! These data will either be ignored by the true believers in vertebroplasty, or touted by devotees of evidence based medicine as confirmation that "vertebroplasty doesn't work". If we fall in the latter camp, we will give patients medical therapy that, I wager, will not have as strong a placebo effect as surgery. And thus, an immaculately conceived study such as this becomes its own bugaboo: in achieving unassailable internal validity, it estranges its relevance to clinical practice insomuch as the placebo effect is powerful, useful, and desirable. What a shame, and what a quandary from which there is no obvious escape.
If I were a patient with such a fracture (and ironically I have indeed suffered 2 vertebral fractures [oh, the pain!]), I would try to talk my surgeon into performing a sham procedure (to avoid the costs and potential side effects of the cement).....but then I would know, and would the "placebo" really work?