"When in doubt, cut it out" is one simplified heuristic (rule of thumb) of surgery. Extension (via inductive thinking) of the observation that removing a necrotic gallbladder or correcting some other anatomic aberration causes improvement in patient outcomes to other situations has misled us before. It is simply not always that simple. While it makes sense that arthroscopic removal of scar tissue in an osteoarthritic knee will improve patients' symptoms, alas, some investigators had the courage to challenge that assumption, and reported in 2002 that when compared to sham surgery, knee arthroscopy did not benefit patients. (See http://content.nejm.org/cgi/content/abstract/347/2/81.)
In a beautiful extension of that line of critical thinking, two groups of investigators in last week's NEJM challenged the widely and ardently held assumption that vertebroplasty improves patients' pain and symptom scores. (See http://content.nejm.org/cgi/content/abstract/361/6/557 and http://content.nejm.org/cgi/content/abstract/361/6/569 .) These two similar studies compared vertebroplasty to a sham procedure (control group) in order to control for the powerful placebo effect that accounts for part of the benefit of many medical and surgical interventions, and which is almost assuredly responsible for the reported and observed benefits of such "alternative and complementary medicines" as acupuncture.
There is no difference. In these adequately powered trials (80% power to detect a 2.5-point and a 1.5-point difference on the pain scales, respectively), the 95% confidence intervals for delta (the difference between the groups in pain scores) were -0.7 to +1.8 at 3 months in the first study and -0.3 to +1.7 at 1 month in the second study. Given that the minimal clinically important difference (MCID) in the pain score is considered to be 1.5 points, these two studies all but rule out a clinically significant difference between the procedure and sham. They also show that there is no statistically significant difference between the two, but the former is more important to us as clinicians given that the study is negative. And this is exactly how we should approach a negative study: by asking "does the 95% confidence interval for the observed delta include a clinically important difference?" If it does not, we can be reasonably assured that the study was adequately powered to answer the question that we as practitioners are most interested in. If it does include such a value, we must conclude that, given our judgment of clinical value, the study is not helpful and is essentially underpowered. Note also that by looking at delta this way, we can gauge the statistical precision (power) of the study: powerful studies will result in narrow(er) confidence intervals, and underpowered studies will result in wide(r) ones.
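To make that interpretive rule concrete, here is a minimal sketch in Python of the "check the confidence interval against the MCID" logic, using the intervals reported above. The 1.5-point MCID is the value cited in the text; note that both upper bounds (+1.8 and +1.7) sit just above it, which is why these trials "all but" rule out a clinically important benefit rather than excluding it outright.

```python
# Minimal sketch: interpret a negative trial by asking whether the 95% CI
# for delta (treatment minus sham) includes a clinically important difference.
MCID = 1.5  # minimal clinically important difference on the pain scale

trials = {
    "Study 1 (3 months)": (-0.7, 1.8),
    "Study 2 (1 month)": (-0.3, 1.7),
}

for label, (lo, hi) in trials.items():
    if hi < MCID:
        verdict = "CI excludes the MCID: clinically important benefit ruled out"
    else:
        verdict = (f"CI upper bound {hi:+.1f} exceeds the MCID by {hi - MCID:.1f}: "
                   "a clinically important benefit is not formally excluded")
    print(f"{label}: {verdict}")
```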
These results reinforce the importance of the placebo effect in medical care, and the limitations of inductive thinking in determining the efficacy of a therapy. We must be careful - things that "make sense" do not always work.
But there is a twist of irony in this saga, and something a bit concerning about this whole approach to determining the truth using studies such as these with impeccable internal validity: they lead beguilingly to the message that because the therapy is not beneficial compared to sham, it is of no use. But, very unfortunately and very importantly, that is not a clinically relevant question, because we will not now adopt sham procedures as an alternative to vertebroplasty! These data will either be ignored by the true believers in vertebroplasty, or touted by devotees of evidence-based medicine as confirmation that "vertebroplasty doesn't work". If we fall in the latter camp, we will give patients medical therapy that, I wager, will not have as strong a placebo effect as surgery. And thus an immaculately conceived study such as this becomes its own bugaboo: in achieving unassailable internal validity, it estranges its relevance to clinical practice, inasmuch as the placebo effect is powerful and useful and desirable. What a shame, and what a quandary from which there is no obvious escape.
If I were a patient with such a fracture (and ironically I have indeed suffered 2 vertebral fractures [oh, the pain!]), I would try to talk my surgeon into performing a sham procedure (to avoid the costs and potential side effects of the cement)...but then I would know, and would the "placebo" really work?
This is a discussion forum for physicians, researchers, and other healthcare professionals interested in the epistemology of medical knowledge, the limitations of the evidence, how clinical trial evidence is generated, disseminated, and incorporated into clinical practice, how that evidence should optimally be incorporated into practice, and what the value of the evidence is to science, individual patients, and society.
Monday, February 9, 2009
More Data on Dexmedetomidine - moving in the direction of a new standard
A follow-up study of dexmedetomidine (see previous blog: http://medicalevidence.blogspot.com/2007/12/dexmedetomidine-new-standard-in_16.html )
was published in last week's JAMA (http://jama.ama-assn.org/cgi/content/abstract/301/5/489 ) and hopefully serves as a prelude to future studies of this agent and indeed all studies in critical care. The recent study addresses one of my biggest concerns of the previous one, namely that routine interruptions of sedatives were not employed.
Ironically, it may be this difference between the studies that led to the failure to show a difference in the primary endpoint of the current study. The primary endpoint, the percentage of time within the target RASS (Richmond Agitation and Sedation Scale) range, was presumably chosen not only on the basis of its pragmatic utility, but also because it was one of the most statistically significant differences found among the secondary analyses of the previous study (percent of patients with a RASS score within one point of the physician goal: 67% versus 55%, p=0.008). It is possible, and I would argue likely, that daily interruptions in the current study obliterated the difference found in the previous study.
But that failure does not undermine the usefulness of the current study which showed that sedation comparable to routinely used benzos can be achieved with dexmed, probably with less delirium, and perhaps with shorter time on the ventilator and fewer infections. What I would like to see now, and what is probably in the works, is a study of dexmed which shows shorter time on the ventilator and/or reductions in nosocomial infections as primary study endpoints.
But to show endpoints such as these, we are going to need to carefully standardize our ascertainment of infections (difficult, to say the least) and also to standardize our approach to discontinuation of mechanical ventilation. In regard to the latter, I propose that we challenge some of our current assumptions about liberation from mechanical ventilation - namely, that a patient must be fully awake and following commands prior to extubation. I think that a status quo bias is at work here. We have many a patient with delirium in the ICU who is not already intubated, and we do not intubate them for delirium alone. Why, then, should we fail to extubate a patient in whom all indicators show resolution of critical illness, but who remains delirious? Is it possible that this is the main player in the causal pathway between sedation and extubation, and perhaps even nosocomial infections and mortality? (The protocols or lack thereof for assessing extubation readiness were not described in the current study, unless I missed them.) It would certainly be interesting and perhaps mandatory to know the extubation practices in the centers involved in this study, especially if we are going to take great stock in this secondary outcome.
Another thing I am interested in knowing is what PATIENT experiences are like in each group - whether there is greater recall or other differences in psychological outcomes between patients who receive different sedatives during their ICU experience.
I hope this study and others like it serve as a wake-up call to the critical care research community, which has heretofore been brainwashed into thinking that a therapy is only worthwhile if it improves mortality - a feat that is difficult to achieve not only because it is often unrealistic and because absurd power calculations and delta inflation run rampant in trial design, but also because of limitations in funding and logistical difficulties. This group has shown us repeatedly that useful therapies in critical care need not be predicated upon a mortality reduction. It's past time to start buying some stock in shorter times on the blower and in the ICU.
Monday, March 10, 2008
The CORTICUS Trial: Power, Priors, Effect Size, and Regression to the Mean
The long-awaited results of another trial in critical care were published in a recent NEJM: (http://content.nejm.org/cgi/content/abstract/358/2/111). Similar to the VASST trial, the CORTICUS trial was "negative" and low dose hydrocortisone was not demonstrated to be of benefit in septic shock. However, unlike VASST, in this case the results are in conflict with an earlier trial (Annane et al, JAMA, 2002) that generated much fanfare and which, like the Van den Berghe trial of the Leuven Insulin Protocol, led to widespread [and premature?] adoption of a new therapy. The CORTICUS trial, like VASST, raises some interesting questions about the design and interpretation of trials in which short-term mortality is the primary endpoint.
Jean Louis Vincent presented data at this year's SCCM conference with which he estimated that only about 10% of trials in critical care are "positive" in the traditional sense. (I was not present, so this is basically hearsay to me - if anyone has a reference, please e-mail me or post it as a comment.) Nonetheless, this estimate rings true. Few are the trials that show a statistically significant benefit in the primary outcome, and fewer still are the trials that confirm the results of those trials. This raises the question: are critical care trials chronically, consistently, and woefully underpowered? And if so, why? I will offer some speculative answers to these and other questions below.
The CORTICUS trial, like VASST, was powered to detect a 10% absolute reduction in mortality. Is this reasonable? At all? What is the precedent for a 10% ARR in mortality in a critical care trial? There are few, if any. No large, well-conducted trials in critical care that I am aware of have ever demonstrated (least of all consistently) a 10% or greater reduction in mortality from any therapy, at least not as a PRIMARY PROSPECTIVE OUTCOME. Low tidal volume ventilation? 9% ARR. Drotrecogin-alfa? 7% ARR in all-comers. I therefore argue that all trials powered to detect an ARR in mortality of greater than 7-9% are ridiculously optimistic, and that the trials that spring from this unfortunate optimism are woefully underpowered. It is no wonder that, as JLV purportedly demonstrated, so few trials in critical care are "positive". The prior probability is exceedingly low that ANY therapy will deliver a 10% mortality reduction. The designers of these trials are, by force of pragmatic constraints, rolling the proverbial trial dice and hoping for a lucky throw.
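To put numbers on that optimism, here is a rough sketch of a standard two-proportion sample size calculation (normal approximation, two-sided alpha of 0.05, 80% power). The 40% control-arm mortality is an illustrative assumption, not a figure taken from CORTICUS or VASST; the point is how quickly required enrollment grows as the detectable ARR shrinks toward the 5-7% range that history suggests is realistic.

```python
# Rough sketch: per-arm sample size for a two-proportion comparison,
# normal approximation, alpha = 0.05 (two-sided), 80% power.
# The 40% baseline mortality is an assumption for illustration only.
from math import ceil, sqrt
from scipy.stats import norm

def n_per_arm(p_control, arr, alpha=0.05, power=0.80):
    """Patients per arm needed to detect an absolute risk reduction
    `arr` from a control-arm event rate `p_control`."""
    p_treat = p_control - arr
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    p_bar = (p_control + p_treat) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p_control * (1 - p_control)
                        + p_treat * (1 - p_treat))) ** 2
    return ceil(num / arr ** 2)

for arr in (0.10, 0.07, 0.05):
    print(f"ARR {arr:.0%}: ~{n_per_arm(0.40, arr)} patients per arm")
# -> roughly 356, 742, and 1471 per arm: halving the detectable
#    effect roughly quadruples the trial.
```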
Then there is the issue of regression to the mean. Suppose that the alternative hypothesis (Ha) is indeed correct in the generic sense that hydrocortisone does beneficially influence mortality in septic shock. Suppose further that we interpret Annane's 2002 data as consistent with Ha. In that study, a subgroup of patients (non-responders) demonstrated a 10% ARR in mortality. We should be excused for getting excited about this result, because after all, we all want the best for our patients and eagerly await the next breakthrough, and the higher the ARR, the greater the clinical relevance, whatever the level of statistical significance. But shouldn't we regard that estimate with skepticism, since no therapy in critical care has ever shown such a large reduction in mortality as a primary outcome? Since no such result has ever been consistently repeated? Even if we believe in Ha, shouldn't we also believe that the 10% Annane estimate will regress to the mean on repeated trials?
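One way to formalize that skepticism is to shrink the observed subgroup effect toward a skeptical prior. The sketch below uses the conjugate normal-normal model on the ARR scale; the prior (mean 0%, SD 3%, loosely encoding the historical rarity of large mortality effects) and the 5% standard error assigned to the Annane subgroup estimate are illustrative assumptions, not published values.

```python
# Illustrative Bayesian shrinkage of an observed ARR toward a skeptical
# prior (normal-normal conjugate model). All inputs are assumptions
# chosen for illustration, not values reported by Annane et al.
prior_mean, prior_sd = 0.00, 0.03   # skeptical prior on the ARR
obs_arr, obs_se = 0.10, 0.05        # observed subgroup ARR, assumed SE

# Precision-weighted posterior mean and SD
w_prior = 1 / prior_sd ** 2
w_obs = 1 / obs_se ** 2
post_mean = (w_prior * prior_mean + w_obs * obs_arr) / (w_prior + w_obs)
post_sd = (w_prior + w_obs) ** -0.5

print(f"posterior ARR: {post_mean:.1%} +/- {post_sd:.1%}")
# -> roughly 2.6% +/- 2.6%: under a skeptical prior, the 10% estimate
#    shrinks sharply, which is the regression to the mean anticipated above.
```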
It may be true that therapies with robust data behind them become standard practice, equipoise dissipates, and the trials of the best therapies are not repeated - so they don't have a chance to be confirmed. But the knife cuts both ways - if you're repeating a trial, it stands to reason that the data in support of the therapy are not that robust, and you should become more circumspect in your estimates of effect size, taking prior probability and regression to the mean into account.
Perhaps we need to rethink how we're powering these trials. And funding agencies need to rethink the budgets they will allow for them. It makes little sense to spend so much time, money, and effort on underpowered trials, and to establish the track record we have established, where the majority of our trials are "failures" in the traditional sense and all include a sentence in the discussion section about how the current results should influence the design of subsequent trials. Wouldn't it make more sense to conduct one trial that is so robust that nobody would dare repeat it in the future? One that would provide a definitive answer to the question that is posed? Is there something to be learned from the long arc of the steroid pendulum that has been swinging with frustrating periodicity for many a decade now?
This is not to denigrate in any way the quality of the trials that I have referred to. The Canadian group in particular as well as other groups (ARDSnet) are to be commended for producing work of the highest quality which is of great value to patients, medicine, and science. But in keeping with the advancement of knowledge, I propose that we take home another message from these trials - we may be chronically underpowering them.