This is a discussion forum for physicians, researchers, and other healthcare professionals interested in the epistemology of medical knowledge, the limitations of the evidence, how clinical trial evidence is generated, disseminated, and incorporated into clinical practice, how the evidence should optimally be incorporated into practice, and what the value of the evidence is to science, individual patients, and society.
Monday, May 20, 2013
It All Hinges on the Premises: Prophylactic Platelet Transfusion in Hematologic Malignancy
An article by the TOPPS investigators in the May 9th NEJM is very useful to remind us not to believe everything we read, to always check our premises, and that some data are so dependent on the perspective from which they're interpreted, or on the methods and stipulations of the analysis, that they can be used to support just about any viewpoint.
Friday, August 20, 2010
Heads I Win, Tails it's a Draw: Rituximab, Cyclophosphamide, and Revising CONSORT
The recent article by Stone et al in the NEJM (see: http://www.nejm.org/doi/full/10.1056/NEJMoa0909905 ), which appears to [mostly] conform to the CONSORT recommendations for the conduct and reporting of NIFTs (non-inferiority trials, often abbreviated NIFs, but I think NIFTs ["Nifties"] sounds cooler), allowed me to realize that I fundamentally disagree with the CONSORT statement on NIFTs (see JAMA, http://jama.ama-assn.org/cgi/content/abstract/295/10/1152 ) and indeed the entire concept of NIFTs. I have discussed previously in this blog my disapproval of the asymmetry with which NIFTs are designed such that they favor the new (and often proprietary) agent, but I will use this current article to illustrate why I think NIFTs should be done away with altogether and supplanted by equivalence trials.
This study rouses my usual and tired gripes about NIFTs: too large a delta, no justification for delta, use of intention-to-treat rather than per-protocol analysis, etc. It also describes a suspicious statistical maneuver which I suspect is intended to infuse the results (in favor of Rituximab/Rituxan) with extra legitimacy in the minds of the uninitiated: instead of simply stating (or showing with a plot) that the 95% CI excludes delta, thus making Rituxan non-inferior, the authors tested the hypothesis that the lower 95.1% CI boundary is different from delta, a test which results in a very small P-value (<0.001). This procedure adds nothing to the confidence interval in terms of interpretation of the results, but seems to imbue them with an unassailable legitimacy - the non-inferiority hypothesis is trotted around as if iron-clad by this minuscule P-value, which is really just superfluous and gratuitous.
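To see why that maneuver adds nothing, here is a minimal sketch with invented numbers (the difference, standard error, and margin below are hypothetical, not the trial's data): the 95% CI excluding delta and the one-sided "P-value for non-inferiority" falling below 0.025 are one and the same statement.

```python
# Hypothetical illustration: the NI p-value restates the CI, nothing more.
from scipy.stats import norm

diff, se = 0.11, 0.05  # hypothetical difference (new minus comparator) and its SE
delta = -0.20          # hypothetical NI margin (deficit allowed to the new agent)

ci_lower = diff - norm.ppf(0.975) * se  # lower bound of the two-sided 95% CI
z = (diff - delta) / se                 # z-statistic for H0: true difference <= delta
p_ni = norm.sf(z)                       # one-sided "P-value for non-inferiority"

print(f"95% CI lower bound {ci_lower:.3f} vs margin {delta}")  # bound excludes delta
print(f"P-value for non-inferiority: {p_ni:.1e}")              # 'impressively' tiny
# ci_lower > delta holds exactly when p_ni < 0.025: the tiny P-value is
# superfluous window dressing on the same comparison.
```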
But I digress - time to focus on the figure. Under the current standards for conducting a NIFT, in order to be non-inferior, you simply need a 95% CI for the preferred [and usually proprietary] agent with an upper boundary which does not include delta in favor of the comparator (scenario A in the figure). For your preferred agent to be declared inferior, the LOWER 95% CI for the difference between the two agents must exclude the delta in favor of the comparator (scenario B in the figure). For that to ever happen, the preferred/proprietary agent is going to have to be WAY worse than standard treatment. It is no wonder that such results are very, very rare, especially since deltas are generally much larger than is reasonable. I am not aware of any recent trial in a major medical journal where inferiority was declared. The figure shows you why this is the case.
Inferiority is very difficult to declare (the deck is stacked this way on purpose), but superiority is relatively easy to declare, because for superiority your 95% CI doesn't have to exclude an obese delta, but rather must just exclude zero with a point estimate in favor of the preferred therapy. That is, you don't need a mirror image of the 95% CI that you need for inferiority (scenario C in the figure), you simply need a point estimate in favor of the preferred agent with a 95% CI that does not include zero (scenario D in the figure). Looking at the actual results (bottom left in the figure), we see that they are very close to scenario D, and that they would only have had to shift a little further in favor of Rituxan for superiority to be declared. Under my proposal for symmetry (and fairness, justice, and logic), the results would have had to resemble scenario C, and Rituxan came nowhere near meeting criteria for superiority.
The reason it makes absolutely no sense to allow this asymmetry can be demonstrated by imagining a counterfactual (or two) - supposing that the results had been exactly the same, but had favored Cytoxan (cyclophosphamide) rather than Rituxan, that is, Cytoxan was associated with an 11% improvement in the primary endpoint. This is represented by scenario E in the figure; and since the 95% CI includes delta, the result is "inconclusive" according to CONSORT. So how can it be that the classification of the result changes depending on what we arbitrarily (a priori, before knowing the results) declare to be the preferred agent? That makes no sense, unless you're more interested in declaring victory for a preferred agent than you are in discovering the truth, and of course, you can guess my inferences about the motives of the investigators and sponsors in many/most of these studies. In another counterfactual example, scenario F in the figure represents the mirror image of scenario D, which represented the minimum result that would have allowed Stone et al to declare that Rituxan was superior. But if the results had favored Cytoxan by that much, we would have had another "inconclusive" result, according to CONSORT. Allowing this is just mind-boggling, maddening, and unjustifiable!
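To make the asymmetry concrete, here is a toy sketch of the decision rules just described (the function, margin, and intervals are my own hypothetical constructions, with positive differences favoring the "preferred" agent):

```python
# Toy version of the asymmetric CONSORT-style rules (hypothetical numbers).
def classify(ci_lo, ci_hi, delta):
    """Classify a 95% CI for (preferred minus comparator) with NI margin delta."""
    if ci_hi < -delta:
        return "inferior"      # scenario B: CI entirely beyond delta against preferred
    if ci_lo > 0:
        return "superior"      # scenario D: CI merely excludes zero, favoring preferred
    if ci_lo > -delta:
        return "non-inferior"  # scenario A: CI excludes delta in favor of comparator
    return "inconclusive"

delta = 0.20
print(classify(0.01, 0.21, delta))    # favors preferred       -> superior
print(classify(-0.21, -0.01, delta))  # the same CI, mirrored  -> inconclusive
# Identical data, opposite verdicts, depending only on which arm was
# designated "preferred" a priori.
```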
Given this "heads I win, tails it's a draw", it's no wonder that NIFTs are proliferating. It's time we stop accepting them, and require that non-inferiority hypotheses be symmetrical - in essence, making equivalence trials the standard operating procedure, and requiring the same standards for superiority as we require for inferiority.
Wednesday, December 16, 2009
Dabigatran and Dabigscam of non-inferiority trials, pre-specified margins of non-inferiority, and relative risks
Before we go on, I ask you to engage in a mental exercise of sorts that I'm trying to make a habit. (If you have already read the article and recall the design and the results, you will be biased, but go ahead anyway this time.) First, ask yourself what increase in the absolute risk of recurrent DVT/PE/death is so small that you consider it negligible for the practical purposes of clinical management. That is, what difference between two drugs is so small as to be pragmatically irrelevant? Next, ask yourself what RELATIVE increase in risk is negligible? (I'm purposefully not suggesting percentages and relative risks as examples here in order to avoid the pitfalls of "anchoring and adjustment": http://en.wikipedia.org/wiki/Anchoring .) Finally, assume that the baseline risk of VTE at 6 months is ~2% - with this "baseline" risk, ask yourself what absolute and relative increases above it are, for practical purposes, negligible. Do these latter numbers jibe with your answers to the first two questions, which you answered when you had no particular baseline in mind?
Note how difficult it is to reconcile your "intuitive" instincts about what is a negligible relative and absolute risk with how these numbers vary depending upon what the baseline risk is. Personally, I consider a 3% absolute increase in the risk of DVT at 6 months to be on the precipice of what is clinically significant. But if the baseline risk is 2%, a 3% absolute increase (to 5%) represents a 2.5x increase in risk! That's a 150% increase, folks! Imagine telling a patient that the use of drug ABC instead of XYZ "only doubles your risk of another clot or death". You can visualize the bewildered faces and incredulous, furrowed brows. But if you say, "the difference between ABC and XYZ is only 3%, and drug ABC costs pennies but XYZ is quite expensive," that creates quite a different subjective impression of the same numbers. Of course, if the baseline risk were 10%, a 3% increase is only a 30% or 1.3x increase in risk. Conversely, with a baseline risk of 10%, a 2.5x increase in risk (RR=2.5) means a 15% absolute increase in the risk of DVT/PE/Death, and hardly ANYONE would argue that THAT is negligible. We know that doctors and laypeople respond better to, or are more impressed by, results described as RRR rather than ARR, ostensibly because the relative number appears bigger (e-mail me if you want a reference for this). The bottom line is that what matters is the absolute risk. We're banking health dollars. We want the most dollars at the end of the day, not the largest increase over some [arbitrary] baseline. So I'm not sure why we're still designing studies with power calculations that utilize relative risks.
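A few lines of arithmetic make the baseline-dependence explicit (a sketch using the baselines assumed in the paragraph above):

```python
# Same absolute increase, wildly different relative increases (and vice versa).
absolute_increase = 0.03
for baseline in (0.02, 0.10):
    rr = (baseline + absolute_increase) / baseline
    print(f"baseline {baseline:.0%}: +3% absolute -> RR {rr:.1f} "
          f"(a {rr - 1:.0%} relative increase)")

for baseline in (0.02, 0.10):
    rr = 2.5
    print(f"baseline {baseline:.0%}: RR {rr} -> absolute increase "
          f"{baseline * rr - baseline:.0%}")
```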
With this in mind, let's check the assumptions of the design of this non-inferiority trial (NIT). It was designed with 90% power to exclude a hazard ratio (HR; similar to a relative risk for our purposes) of 2.75. That HR of 2.75 sure SOUNDS like a lot. But with a predicted baseline risk of 2% (a prediction that panned out in the trial - the baseline risk with warfarin was 2.1%), it amounts to a risk of only 5.78%, an absolute increase of 3.68%, which I will admit is close to my own a priori negligibility level of 3%. The authors justify this assignment based on 4 referenced studies, all prior to 1996. I find this curious. Because the others are so dated and in rather obscure journals, I have access only to the 1995 NEJM study (http://content.nejm.org/cgi/reprint/332/25/1661.pdf ). In this 1995 study, the statistical design is basically not even described, and there were 3 primary endpoints (ahhh, the 1990s). This is not exactly the kind of study that I want to model a modern trial after. In the table below, I have abstracted data from the 1995 trial and three more modern ones (all comparing two treatment regimens for DVT/PE) to determine both the absolute and relative risks that were observed in these trials.
Table 1. Risk reductions in several RCTs comparing treatment regimens for DVT/PE. Outcomes are the combination of recurrent DVT/PE/Death unless otherwise specified. *recurrent DVT/PE only; raw numbers used for simplicity in lieu of time to event analysis used by the authors
From this table we can see that in SUCCESSFUL trials of therapies for DVT/PE treatment, absolute risk reductions in the range of 5-10% have been demonstrated, with associated relative risk increases of ~1.75-2.75 (for placebo versus comparator - I purposefully made the ratio in this direction to make it more applicable to the dabigatran trial's null hypothesis [NH] that the 95% CI for dabigatran includes 2.75 HR - note that the NH in an NIT is the enantiomer of the NH in a superiority trial). Now, from here we must make two assumptions, one which I think is justified and the other which I think is not. The first is that the demonstrated risk differences in this table are clinically significant. I am inclined to say "yes, they are" not only because a 5-10% absolute difference just intuitively strikes me as clinically relevant compared to other therapies that I use regularly, but also because, in the cases of the 2003 studies, these trials were generally counted as successes for the comparator therapies. The second assumption we must make, if we are to take the dabigatran authors seriously, is that differences smaller than 5-10% (say 4% or less) are clinically negligible. I would not be so quick to make this latter assumption, particularly in the case of an outcome that includes death. Note also that the study referenced by the authors (reference 13 - the Schulman 1995 trial) was considered a success with a relative risk of 1.73, and that the 95% CI for the main outcome of the RE-COVER study ranged from 0.65-1.84 - it overlaps the Schulman point estimate of RR of 1.73, and the Lee point estimate of 1.83! Based on an analysis using relative numbers, I am not willing to accept the pre-specified margin of non-inferiority upon which this study was based/designed.
But, as I said earlier, relative differences are not nearly as important to us as absolute differences. If we take the upper bound of the HR in the RE-COVER trial (1.84) and multiply it by the baseline risk (2.1%), we get an upper 95% CI for the risk of the outcome of 3.86%, which corresponds to an absolute risk difference of 1.76%. This is quite low, and personally it satisfies my requirement for very small differences between two therapies if I am to call them non-inferior to one another.
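For the record, here are the conversions from the last two paragraphs in a few lines (baseline risk and HRs as quoted above; treating the HR as a simple risk ratio is an approximation):

```python
# Converting relative margins/bounds into absolute risks (figures from the post).
baseline = 2.1  # % risk of recurrent DVT/PE/death with warfarin at 6 months
for label, hr in (("design margin HR", 2.75),
                  ("RE-COVER upper 95% CI HR", 1.84)):
    risk = baseline * hr
    print(f"{label} {hr}: risk {risk:.2f}%, "
          f"absolute increase {risk - baseline:.2f}% over warfarin")
# design margin HR 2.75: risk 5.78%, absolute increase 3.68%
# RE-COVER upper 95% CI HR 1.84: risk 3.86%, absolute increase 1.76%
```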
So, we have yet again a NIT which was designed upon precarious and perhaps untenable assumptions, but which, through luck or fate was nonetheless a success. I am beginning to think that this dabigatran drug has some merit, and I wager that it will be approved. But this does not change the fact that this and previous trials were designed in such a way as to allow a defeat of warfarin to be declared based on much more tenuous numbers.
I think a summary of sorts for good NIT design is in order:
• The pre-specified margin of non-inferiority should be smaller than the MCID (minimal clinically important difference), if there is an accepted MCID for the condition under study
• The pre-specified margin of non-inferiority should be MUCH smaller than statistically significant differences found in "successful" superiority trials, and ideally, the 95% CI in the NIT should NOT overlap with point estimates of significant differences in superiority trials
• NITs should disallow "asymmetry" of conclusions - see the last post on dabigatran. If the pre-specified margin of non-inferiority is a relative risk of 2.0 and the observed 95% CI must not include that value to claim non-inferiority, then superiority cannot be declared unless the 95% confidence interval of the point estimate falls entirely beyond the mirror-image margin - on the relative-risk scale, below the reciprocal, 0.5 (see the sketch after this list). What did you say? That's impossible, it would require a HUGE risk difference and a narrow CI for that to ever happen? Well, that's why you can't make your delta unrealistically large - you'll NEVER claim superiority, if you're being fair about things. If you make delta very large it's easier to claim non-inferiority, but you should also suffer the consequences by basically never being able to claim superiority either.
• We should concern ourselves with Absolute rather than Relative risk reductions
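As a sketch of what the symmetric rule proposed in the third bullet could look like on the relative-risk scale (the function and numbers are hypothetical; delta > 1 is the margin, and its mirror image on a ratio scale is the reciprocal 1/delta):

```python
# Symmetric (equivalence-style) classification on the relative-risk scale.
def classify_symmetric(ci_lo, ci_hi, delta):
    """Classify a 95% CI for RR (new/comparator; RR < 1 favors the new agent)."""
    if ci_hi < 1 / delta:
        return "superior"      # exact mirror image of the inferiority criterion
    if ci_lo > delta:
        return "inferior"
    if ci_hi < delta:
        return "non-inferior"
    return "inconclusive"

delta = 2.0
print(classify_symmetric(0.65, 1.84, delta))  # RE-COVER-like CI -> non-inferior
print(classify_symmetric(0.30, 0.45, delta))  # what superiority would now demand
# With a fat delta of 2.0, superiority requires the whole CI below 0.5 -
# which is why an inflated margin should cost you any realistic claim to it.
```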
Monday, September 21, 2009
The unreliable asymmetric design of the RE-LY trial of Dabigatran: Heads I win, tails you lose
I'm growing weary of this. I hope it stops. We can adapt the diagram of non-inferiority shenanigans from the Gefitinib trial (see http://medicalevidence.blogspot.com/2009/09/theres-no-such-thing-as-free-lunch.html ) to last week's trial of dabigatran, which came on the scene of the NEJM with another ridiculously designed non-inferiority trial (see http://content.nejm.org/cgi/content/short/361/12/1139 ). Here we go again.
These jokers, lulled by the corporate siren song of Boehringer Ingelheim, had the utter unmitigated gall to declare a delta of 1.46 (relative risk) as the margin of non-inferiority! Unbelievable! To say that a 46% difference in the rate of stroke or arterial clot is clinically non-significant! Seriously!?
They justified this felonious choice on the basis of trials comparing warfarin to PLACEBO as analyzed in a 10-year-old meta-analysis. It is obvious (or should be to the sentient) that an ex-post difference between a therapy and placebo in superiority trials does not apply to non-inferiority trials of two active agents. Any ex-post finding could be simply fortuitously large and may have nothing to do with the MCID (minimal clinically important difference) that is SUPPOSED to guide the choice of delta in a non-inferiority trial (NIT). That warfarin SMOKED placebo in terms of stroke prevention does NOT mean that something that does not SMOKE warfarin is non-inferior to warfarin. This kind of duplicitous justification is surely not what the CONSORT authors had in mind when they recommended a referenced justification for delta.
That aside, on to the study and the figure. First, we're testing two doses, so there are multiple comparisons, but we'll let that slide for our purposes. Look at the point estimate and 95% CI for the 110 mg dose in the figure (let's bracket the fact that they used one-sided 97.5% CIs - it's immaterial to this discussion). There is a non-statistically significant difference between dabigatran and warfarin for this dose, with a P-value of 0.34. But note that in Table 2 of the article, they declare that the P-value for "non-inferiority" is <0.001 [I've never even seen this done before, and I will have to look to see if we can find a precedent for reporting a P-value for "non-inferiority"]. Well, apparently this just means that the RR point estimate for 110 mg versus warfarin is statistically significantly different from a RR of 1.46. It does NOT mean - though it misleadingly suggests - that the comparison between the two drugs on stroke and arterial clot is highly clinically significant. This "P-value for non-inferiority" is just an artificial comparison: had we set the margin of non-inferiority at an even more ridiculously large value, we could have made the "P-value for non-inferiority" as small as we liked simply by inflating the margin of non-inferiority! So this is a useless number, unless your goal is to create an artificial and exaggerated impression of the difference between these two agents.
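To show how mechanical this gimmick is, a sketch with invented numbers (the observed HR and standard error below are hypothetical, not RE-LY's): leaving the data untouched and merely widening the margin drives the "P-value for non-inferiority" toward zero.

```python
# The NI p-value is a function of the margin you chose, not just the data.
import math
from scipy.stats import norm

hr_hat, se_log = 1.10, 0.12  # hypothetical observed HR and SE of log(HR)
for margin in (1.25, 1.46, 2.0, 3.0):
    z = (math.log(margin) - math.log(hr_hat)) / se_log  # H0: true HR >= margin
    print(f"margin {margin}: one-sided NI P-value = {norm.sf(z):.1e}")
# The same point estimate looks ever more 'significantly non-inferior'
# as the margin inflates.
```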
Now let's look at the 150 mg dose. Indeed, it is statistically significantly different from warfarin (I shall resist using the term "superior" here), and thus the authors claim superiority. But here again, the 95% CI is narrower than the margin of non-inferiority, and had the results gone the other direction (in favor of warfarin), as in Scenarios 3 and 4, we would still have claimed non-inferiority, even though warfarin would have been statistically significantly "better than" dabigatran! So it is unfair to claim superiority on the basis of a statistically significant result favoring dabigatran, but that's what they do. This is the problem that is likely to crop up when you make your margin of non-inferiority excessively wide, which you are wont to do if you wish to stack the deck in favor of your therapy.
But here's the real rub. Imagine if the world were the mirror image of what it is now and dabigatran were the existing agent for prevention of stroke in A-fib, and warfarin were the new kid on the block. If the makers of warfarin had designed this trial AND GOTTEN THE EXACT SAME DATA, they would have said (look at the left of the figure and the dashed red line there) that warfarin is non-inferior to the 110 mg dose of dabigatran, but that it was not non-inferior to the 150 mg dose of dabigatran. They would NOT have claimed that dabigatran was superior to warfarin, nor that warfarin was inferior to dabigatran, because the 95% CI of the difference between warfarin and dabigatran 150 mg crosses the pre-specified margin of non-inferiority. And to claim superiority of dabigatran, the 95% CI of the difference would have to fall all the way to the left of the dashed red line on the left. (See Piaggio, JAMA, 2006.)
The claims that result from a given dataset should not depend on who designs the trial, and which way the asymmetry of interpretation goes. But as long as we allow asymmetry in the interpretation of data, they shall. Heads they win, tails we lose.