Evaluating Extraordinary Claims: Mind Over Matter? Or Mind Over Mind?
A relative of mine recently went in for minor surgery and sent out an email that
asked for supportive thoughts during the operation and thoughtfully noted that
since the operation was early in the morning when I might be sleeping, it
doesn't matter, according to Larry Dossey, M.D. in Healing Words,
whether you remember to do it at
the appropriate time or do it early or later. He says the action of mentally
projected thought or prayer is "non-local," i.e. not dependent on distance or
time, citing some 30+ experiments on human and non-human targets (including
yeast and even atoms), in which recorded results showed changes from average
or random to beyond-average or patterned even when the designated thought
group acted after the experiment was over.
I was perplexed. On the one hand, if there really was good evidence of
mind-over-matter (and operating backwards in time, no less) you'd think it would
be the kind of thing that would make the news, and I would have heard about it.
On the other hand, if there is no such evidence, why would seemingly sensible people like
Larry Dossey, M.D. believe there was? I had a vague idea that
there were some studies showing an effect of prayer and some showing no effect;
I thought it would be interesting to research the field. I was concurrently
working on an
essay
on experiment design, and this could serve as a good set of examples.
Our starting point will be a
January
2007 interview with Dossey. A sidebar in that interview highlights five
studies that Dossey puts forth as his best case that prayer and other
mental intentional acts can have an effect on the health of patients. The body of
the interview refers to three more studies after those five:
-
Achterberg, J et al.
"Evidence
for Correlations between Distant Intentionality and Brain Function in
Recipients: A Functional Magnetic Resonance Imaging Analysis."
Journal of Alternative and Complementary
Medicine. 2005; 11(6):965-971.
-
Byrd, R.
"Positive
therapeutic effects of intercessory prayer"
Southern Medical Journal. 1988;
81(7): 826-9.
-
Harris, W et al.
"A
randomized, controlled trial of the effects of remote, intercessory prayer
on outcomes in patients admitted to the coronary care unit."
Archives of Internal Medicine.
1999;159(19):2273-2278.
-
Tloczynski, J and Fritzsch, S.
"Intercessory
prayer in psychological well-being: using a multiple-baseline,
across-subjects design." Psychological
Reports. 2002;91(3 Pt 1): 731-41.
-
Cha, KY, et al.
"Does
prayer influence the success of in vitro fertilization-embryo transfer?
Report of a masked, randomized Trial."
J. Reproductive Medicine. September
2001; 46(9): 781-787.
-
Leibovici, L.
"Effects
of remote, retroactive intercessory prayer on outcomes in patients with
bloodstream infections: a controlled trial."
BMJ 2001;323: 1450-1.
-
Benor, DJ. "Survey of Spiritual Healing Research"
Complementary Healing Research. 1990; 4(3): 9-33.
-
Klingbeil, B and Klingbeil, J.
Unpublished manuscripts from the Spindrift Institute.
Dossey says there are other studies as well, but these apparently are his
favorites, so let's look at them first.
Study 1: Achterberg (Saybrook); Distant Intentionality
The
first
study is by Jeanne Achterberg of Saybrook Graduate School and
colleagues. She had eleven faith healers choose one subject each, who were
placed in an fMRI machine. Then the healers would alternately send and not
send positive energy to the recipients, in twelve periods of two minutes
each. The experiment measured the recipients' brain activity and attempted
to determine whether there were differences between the "send" and "no send"
states. When asked if there are any possible flaws in the study, Dossey
said "I can't find any." Let's see
if we can (the "warning signs" refer to my essay on experiment design).
-
Warning
Sign I1: the experiment is not reproducible from the report. You can't
tell what is being measured or how those measurements are compared.
The experiment has not been reproduced, by Achterberg or by others.
-
A confusing point: the Results section says "One of the two clusters was
highly statistically significant (p=0.000127)". But if you look
at Table 1, you see that the other cluster is also statistically significant
(p=8.51E-09, or 0.00000000851). Why wasn't that mentioned? Is it a
typo? Did the authors think the number was too good to be believed, so they
tried to ignore it? In personal communication, Dr. Achterberg said that
it was a typo, that it should have been 8.51. She also says that the
p value for cluster 1 in Table 1 is a typo; it says 0.00127 but
it should be 0.000127, as it appears in the RESULTS section.
But if they are typos, then they are paired typos, because the next
column gives the logarithms of these numbers, which are consistent with
8.51E-09 and 0.00127, not with the numbers that Dr. Achterberg claims.
-
Warning
Sign D1: lack of randomized controls. The best control would be to put a
similar
number of patients in the fMRI with no sender involved. But that is expensive,
so the next best thing is what the authors have chosen: a block design where there
are blocks of time with one kind of stimulus interspersed with the other kind,
in this case send and no-send.
The problem is that the pattern of send/no-send intervals was
the same for all subjects. The lack of randomization
on a subject-by-subject basis is a fundamental error.
For example, suppose that patients were nervous, or somehow otherwise in
a different brain state in the first two minutes of the procedure. All
of this difference would be recorded for the "no-send" state (if the
block pattern
had been randomized it would have been spread across "send" and "no-send" states).
I did a simulation where I generated 480 random numbers per subject, representing
the 480 time slices in the experiment. (I generated a single number taken from
a uniform distribution rather than the many voxel measurements recorded by the fMRI
machine, but the idea is the same.) I repeated this generation of random numbers
20 times. I compared the send and no-send intervals
of random numbers using a T-test; none of the 20 trials found a significant
difference at the p=5% level. I then repeated the simulation under
the assumption that subjects would be anxious when first entering the fMRI
machine, that this anxiety would last for between one and two minutes (chosen
at random uniformly), and that during the anxious period the fMRI signal
would be chosen from the top 20% of output levels as compared to the
non-anxious period. I repeated this simulation 20 times, and 12 of the trials
showed a significant difference between the send and no-send intervals
at the p=5% level; 5 were significantly different at the p=0.1% level.
This result calls into question whether the Achterberg et al. study is
actually measuring an effect of the send/no-send conditions, or whether it
is measuring a difference between the initial-state/non-initial-state (or
equivalently, maybe subjects get bored and have different brain states near
the end of the experiment). This confound could have been avoided if only they
had randomized the block order.
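A simulation of this kind can be sketched in a few lines of Python. This is an illustrative reconstruction, not the code behind the numbers above. It assumes one reading every 3 seconds (so 40 readings per two-minute block), a fixed pattern that starts with a no-send block, one 480-reading sequence per trial, and a Welch t statistic with a normal approximation for the p-value. The function names are made up for the example.

```python
import math
import random
import statistics

SAMPLES = 480   # assumed: one reading every 3 s over 24 minutes
BLOCK = 40      # a 2-minute block is 40 readings
# Fixed protocol assumed here: no-send, send, no-send, ... for every subject.
IS_SEND = [(i // BLOCK) % 2 == 1 for i in range(SAMPLES)]

def p_value(a, b):
    """Two-sided p for a difference in means (Welch t, normal approximation)."""
    se = math.sqrt(statistics.variance(a) / len(a) +
                   statistics.variance(b) / len(b))
    t = (statistics.mean(a) - statistics.mean(b)) / se
    return math.erfc(abs(t) / math.sqrt(2))

def one_trial(rng, anxious):
    """Simulate one sequence of readings and test send vs. no-send."""
    # Anxiety lasts 1 to 2 minutes, i.e. 20 to 40 readings at the start.
    n_anxious = rng.randint(20, 40) if anxious else 0
    # Anxious readings come from the top 20% of the output range.
    signal = [rng.uniform(0.8, 1.0) if i < n_anxious else rng.random()
              for i in range(SAMPLES)]
    send = [x for x, s in zip(signal, IS_SEND) if s]
    no_send = [x for x, s in zip(signal, IS_SEND) if not s]
    return p_value(send, no_send)

def count_significant(anxious, trials=200, alpha=0.05, seed=0):
    """Number of trials whose send/no-send difference has p < alpha."""
    rng = random.Random(seed)
    return sum(one_trial(rng, anxious) < alpha for _ in range(trials))

if __name__ == "__main__":
    print("pure noise:     ", count_significant(anxious=False), "of 200")
    print("initial anxiety:", count_significant(anxious=True), "of 200")
```

With pure noise, about 5% of trials should come out significant by construction. With the initial anxious period landing entirely in the first no-send block, the sketch should flag a much larger share, even though no "send" effect was simulated.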
-
Warning
Sign D6: it looks like the authors are trying to apply a statistical
model with one too many free parameters; a potentially fatal flaw. fMRI is a
great technology, but it relies on measuring changes in blood flow that take
a few seconds to register. This is called the
hemodynamic delay. Since we don't know for sure how long this delay will
be, we need to deal with it in some way. One standard way to deal with
the problem would be to ignore the first few seconds when we switch from
send to no-send or vice-versa. This is called a break period.
Also, another standard approach is to jitter
the stimulus by a second or so -- rather than start every stimulus exactly on
a boundary between recording slices, have some start in the middle of a slice.
Achterberg et al. apparently used neither breaks nor jitter. How did they
model the hemodynamic delay? The paper is not clear, but one sentence
bothers me: "A goodness of fit statistic (r squared) indicates the
degree of fit between the hemodynamic model and the actual brain activity
during the time course recorded." To me, this suggests that the length
of the hemodynamic delay is a parameter that is fit by some model;
their software computes the delay period that best fits the
model. Here's the flaw: with a free parameter, you can impose order on
a process where order does not exist.
To test this, I ran a simulation where I generated random numbers chosen
uniformly from the range -100 to +100. I generated 11 sequences of numbers,
corresponding to the 11 subjects in the experiment, and 480 numbers per
sequence to represent an fMRI reading once every 3 seconds of the 24-minute
experimental protocol. I then divided each sequence into twelve equal
parts, corresponding to the off-on parts of the experimental protocol.
I then computed the difference between the mean of the numbers in the on
parts and the mean of the numbers in the off parts. But I did this
allowing hemodynamic delays of 0 to 30 seconds. By choosing one
delay or another, I got the difference between the means to vary between
0.09 and 3.30. A difference of 0.09 is not significant
(p = 0.95) while a difference of 3.30
is significant (p = 0.03). So, even
with a sequence of random numbers, I can get a significant difference (or
not), just by choosing my hemodynamic delay. If the authors' search
for the hemodynamic delay parameter was anything like my simulation, then it is
possible that the results reflect an artifact of the parameter choice
rather than any pattern in the data. Unfortunately, one can't tell from the
article exactly what they did.
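The effect of a free delay parameter can be sketched as follows. Again, this is a reconstruction under stated assumptions, not the original simulation and certainly not the authors' analysis: readings are pooled across the 11 simulated subjects, a delay of d readings means each reading is labeled by the block that was in effect d readings earlier, and the p-value uses a Welch t statistic with a normal approximation. All names are hypothetical.

```python
import math
import random

SUBJECTS, SAMPLES, BLOCK = 11, 480, 40   # 40 readings = one 2-minute block
MAX_DELAY = 10                           # 10 readings = 30 seconds

def p_value(a, b):
    """Two-sided p for a difference in means (Welch t, normal approximation)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    t = (ma - mb) / math.sqrt(va / len(a) + vb / len(b))
    return math.erfc(abs(t) / math.sqrt(2))

def p_for_delay(data, delay):
    """Compare on vs. off readings, shifting the block labels by `delay`."""
    on, off = [], []
    for seq in data:
        for i in range(delay, SAMPLES):   # readings before the delay are dropped
            is_on = ((i - delay) // BLOCK) % 2 == 1
            (on if is_on else off).append(seq[i])
    return p_value(on, off)

def compare(trials=100, alpha=0.05, seed=0):
    """Count significant results for a fixed delay vs. the best-fitting delay."""
    rng = random.Random(seed)
    fixed = tuned = 0
    for _ in range(trials):
        data = [[rng.uniform(-100, 100) for _ in range(SAMPLES)]
                for _ in range(SUBJECTS)]
        p_values = [p_for_delay(data, d) for d in range(MAX_DELAY + 1)]
        fixed += p_values[0] < alpha      # delay chosen in advance
        tuned += min(p_values) < alpha    # delay chosen after seeing the data
    return fixed, tuned

if __name__ == "__main__":
    fixed, tuned = compare()
    print("significant with delay fixed at 0:", fixed, "of 100")
    print("significant with best delay:      ", tuned, "of 100")
```

Because the tuned analysis always includes the fixed delay as one of its candidates, it can never report fewer significant results than the fixed analysis. Any excess is produced by the parameter search alone, since the data are pure noise.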
-
Warning Sign D7:
overzealous data mining. It appears they were looking for changes in
any part of the brain. There are a
lot of parts of the brain; if you look at enough parts, you'd expect some to
differ just by chance. They should either decide ahead of time what
areas are likely receptors of the transmitted thoughts and look only at
those, or reproduce the study looking only at the areas found to be relevant
in a trial study. In personal communication, Dr. Achterberg writes:
We were "data mining," and doing absolutely the
appropriate study for a pilot endeavor. It seems we are in complete
agreement: this was the appropriate test to do first: try a few subjects,
and see which brain areas show a differential response. Where we
disagree is what to do next. If it were me,
I would say that the next step after the pilot
study is to do a real study and publish those results. If they
are positive, and can be replicated by other labs, then it is worth a
Nobel Prize in physics, or physiology, or both. If the results are
negative, at least it is a better paper. Dr. Achterberg and her colleagues
chose to pass on the Nobel, to not do the real study,
but to publish the pilot study instead.
-
Warning
Sign D8: lack of a theory. Since the experiment proposes an effect
that contradicts 2000 years of physics and physiology, there should be
some explanation of how the effect is achieved. The authors say that
"the results of this study may be interpreted as consistent with
the idea of entanglement in quantum mechanics theory." This is a
partial theory, but it raises more questions than it answers: how do
the particles of sender and subject get entangled? How does that
entanglement lead to a causal effect in the subject? How does a
causal effect in a particle of the subject lead to a macro-effect
on brain state? When we assess whether there is anything to
this study, we need to assess whether there are answers to these questions.
-
The final flaw: the results do not directly address the question of how the
period of time involving "sending" compares to "no sending".
Here is the entire RESULTS section, verbatim (my emphasis
added):
The FSL software produces a quantitative
table of cluster results that includes: cluster size, probability for each
cluster, z scores, x, y, z coordinates of the cluster in Talaraich space
and contrast of parameter estimates (see Table 1). If a cluster is
significant in a group analysis it means that there were
specific brain regions in which the
combined subjects had enough activation to raise the z score above the
noise level threshold. In other words, if all of the subjects had
random activation at different places in
the brain, then there would be no group activation. One of the two
clusters was highly statistically significant (p=0.000127).
Significant areas of apparent
activation in the group analysis and total number of pixels activated for
the group are reported in Table 2. A scan representing the group
activation as a whole appears in Figure 2.
It is clear what was done: add up and average all the activations in
the "send" condition, and subtract the activations from the "no send"
condition. Then you get a single set of summed activations, and look
for deviations from zero. Clearly, there were deviations, but that does
not really address the questions. We want to know how the "send" and
"no send" conditions compare; by lumping them all together, we've lost the
ability to do that. Consider Fig. 2, for example. We see a picture of total
activation regions in two areas of the brain, but what we really want to see
is a comparison: what does a brain (or an average over several brains) look like
in the send condition, and how is that different from the no-send condition?
The Results section is saying that the
subjects had activation at "specific brain regions" and "different places in
the brain", but says nothing about how those activations are correlated with
the time intervals of the conditions. Table 1, Table 2, and Figure 2
all mention spatial regions, but nothing about time intervals. In the
DISCUSSION section, they do say "the results show significant activation of
brain regions coincident with DI [distant intentionality] intervals."
But they don't say whether there was also significant activation coincident
with non-DI intervals. They correctly point out that "if all of the subjects
had random activation at different places in the brain, then there would be no
group activation." So they have proved that the activation in the brain is not
random. But they have not proved that it is non-random because of the
send vs. no-send conditions. It would have been easy to do a simple T-test
or other test comparing the send vs. no-send condition, but they elected not to
do that.
So not only have they failed to prove their
premise; they haven't even attempted to
address it in the Results section. In the abstract they do
say "Significant differences between experimental (send) and control (no
send) procedures were found." So either the abstract is wrong or they
forgot to mention their main result in the body of the paper. Either way, we
can't evaluate the paper because we don't know what it is claiming.
Study 2: Byrd (UCSF); Prayer for Cardiac Patients
Dossey says "The most famous prayer study was conducted by Dr. Randolph Byrd."
(Actually, I (and many
others)
think Benson's study is most famous. But Byrd's is certainly well known.)
Byrd's study is
Positive
therapeutic effects of intercessory prayer in a coronary care unit
population, and was published in the Southern Medical Journal in 1988 (after
being rejected by two other journals). Byrd studied 393 patients in a coronary
care unit, split them into a control group and an experimental group for which
a team of intercessors was given the subjects' names and told to pray for "a
rapid recovery and prevention of complications and death." Byrd then
measured the death rates and three variables related to "rapid recovery" (length
of hospital stay, length of coronary care unit stay, and re-admissions).
There were no statistical differences between the two groups for "rapid recovery" or for "death".
As for "prevention of complications," the story is more complicated. Byrd
measured 24 other variables (see
table
2 in the article); of these, 18 variables (such as requiring major surgery,
angina, and gastrointestinal bleeding) showed no significant difference at the
p=5% level, and 6 variables (such as
heart failure, pneumonia and requiring antibiotics) did. But Byrd didn't declare
before the study that these were the important variables. So it is a clear
Warning
Sign D7, overzealous data mining, to pick out the 6 significant variables
after the fact. Blow the whistle, wave the penalty flag; you can't do
that. Certainly you can use these 6 variables to inform another study, but
you can't count the results from this study. Furthermore, it is
Warning
Sign D6 to treat variables as independent when they are not. For
example, congestive heart failure automatically leads to a need for diuretics;
Byrd counts these as two separate significant results. That's like saying
you had a positive result on both "height in inches" and "height in
centimeters."
Byrd goes on to get a positive result with
p=0.1%, but he does this by inventing a
scoring method for adding up the various measures. This is
Warning
Sign D7 again: Byrd made up the
scoring method after the data was in; he should have made it part of the
hypothesis before the data was gathered. This is also a
Warning
Sign D2, lack of a double-blind study. Although the patients and Byrd were
blinded, Byrd was not blinded when he made up this scoring method, and his
assistant, Janet Greene, was not blinded throughout, even though she interacted
with the patients and did data entry.
This study appears to be professionally done, with good randomization and
controls, although with some blinding problems. The results are inconclusive.
When we discount the results from overzealous data mining there are no
significant differences left between the control and the prayer groups, but
there are some promising pointers for future research.
Study 3: Harris (University of Missouri); Prayer for Cardiac Patients
Given an inconclusive study like Byrd's, a good idea is to try to reproduce it.
Fortunately, an attempt was made by
Harris
et al. They replicated all the cases where Byrd found no significant
difference, such as length of hospital stay and mortality. For the
variables for which Byrd found a significant difference, Harris found no significant
difference. In fact, of the 35 variables listed in Harris' Table 3, only one,
"Swan-Ganz catheter" showed a significant difference at the
p=5% level. By random chance, you'd
expect 1.75 variables to show significance at the
p=5% level. But Harris also commits
Warning
Sign D7, overzealous data mining, by coming up with a different
scoring method from Byrd's that does show a significant difference, at the
p=4% level, and by ignoring two other
scoring methods that do not show significance.
Tessman
and Tessman have a further analysis.
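The chance expectation for 35 variables is simple binomial arithmetic. The sketch below assumes the 35 variables are independent and that each has exactly a 5% chance of a false positive. Real clinical variables are correlated, so treat the results as a rough guide only.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

N_VARIABLES = 35   # variables in Harris' Table 3
ALPHA = 0.05       # significance threshold

# Expected number of "significant" variables if prayer does nothing.
expected = N_VARIABLES * ALPHA

# Chance of seeing at least one, and at most one, significant variable.
p_at_least_one = 1 - binom_pmf(0, N_VARIABLES, ALPHA)
p_at_most_one = (binom_pmf(0, N_VARIABLES, ALPHA) +
                 binom_pmf(1, N_VARIABLES, ALPHA))

if __name__ == "__main__":
    print(f"expected false positives: {expected:.2f}")
    print(f"P(at least one): {p_at_least_one:.3f}")
    print(f"P(at most one):  {p_at_most_one:.3f}")
```

Under these assumptions, a single significant variable out of 35 is about what chance alone would produce.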
Here's another curiosity about the Harris study: Nicholas Humphrey
points
out that there were some patients who did so well that they recovered and
checked out of the hospital before any prayers could be organized for
them. It turns out that four times as many of the "to be prayed for" group
recovered quickly, compared to the "not to be prayed for" group. This is
significant at the p=0.1% level.
What conclusion can you draw from this? (1) someone was trying to slip more of
the healthier patients into the "to be prayed for" group; (2) God preferentially
heals people who are about to be prayed for, thereby causing them not to be
prayed for; or (3) if you collect enough numbers and do enough data mining,
you can find some statistically significant results in either direction. I
prefer (3), but you take your choice.
Although this study is cited in the sidebar of Dossey's interview as one of the
best studies, Dossey himself says in his book
Healing
Words that
"this study has missed the mark. . . . [W]e
would expect greater evidence than a few small percentage points of improvement.
We would want to see statistically significant life-or-death effects, which
simply did not occur." I think that's an accurate assessment of
this study, as it stands on its own. It should also be noted that this
study fails to replicate Byrd's findings, so we can add
Warning
Sign I1: lack of reproducibility to both Byrd's and Harris's problems. Taken
individually, Byrd and Harris are both professionally-done studies with mixed
results: they suggest prayer may have an effect on some variables but not on the
seemingly most important ones such as death rates. Taken together, they
show that we don't yet have any single variable for which intercessory prayer
reliably works. For those who believe in intercessory prayer to a responsive omnipotent
being, this is difficult to explain: God can't affect death rates nor speed of
recovery; all he can do is make you 5% less likely to need
antibiotics? It's like Woody Allen said: "If it turns out there is a
God ... the worst that you can say about him is that basically he's an
underachiever."
Study 4: Tloczynski (Bloomsburg Univ.); Prayer for Anxiety
Unfortunately, all I could find was an
abstract
of this study, not the full paper. But even from the abstract, we immediately
see a
Warning
Sign D3: too few subjects. There were only eight subjects in
the study. More serious is another
Warning
Sign D1 problem: lack of a control group. In lieu of a real control
group, all eight subjects start out in the control group, and then every two
weeks two more are moved to the prayed-for group. The conclusion was that
the prayed-for subjects had less anxiety, as measured on a standardized
test. This result could have come about because prayer was effective in
reducing the patients' anxiety. But it could also have come about if the
subjects as a whole felt less anxious over time, for whatever reason (perhaps
they were nervous at the start of the experiment, but then got used to
it). The graph at the right mirrors my understanding of the experimental
conditions. The weeks are along the x-axis, and the blue line represents
the increasing number of patients in the prayed-for group; the red line the
decreasing number in the control group. For this graph we assume that we are
tracking calmness on some scale, and for simplicity, every subject starts out
with an identical score of 10 and every subject increases their score by exactly
5% every week. The yellow-orange line tracks the weekly score of every
subject. The green line tracks the mean score of the prayed for group
while the grey line tracks the mean score of the control group. Why does
the prayed-for group do better, if each week every subject reports exactly the
same score as every other subject? It is because the scores are generally
trending up (yellow-orange line) and over time the prayed-for group gets more of
these higher-scoring subjects and the control group gets fewer.
I also did a simulation where each subject's calmness score was generated
randomly from a uniform distribution, such that the mean increased by 5% each
week (for every subject). I repeated this simulation 30 times; in every one of
the 30 trials the prayed-for group did significantly better. Then I tried
30 more times without the 5% increase (that is, every week's score had a mean of
10.0). This time the prayed-for group did better 17 times and the control
group did better 13 times; the difference was not significant. Unfortunately I
don't know for sure if this effect is present in the Tloczynski study, because I
couldn't find the full paper.
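The deterministic version of this argument is easy to check. The sketch below assumes a ten-week schedule (long enough for all eight subjects to cross over, two at a time, every two weeks) and compares the groups by pooling every subject-week score. Both are my guesses at the setup, not details taken from the paper.

```python
SUBJECTS = 8
WEEKS = 10   # assumed length of the study

def prayed_for_count(week):
    """Two more subjects join the prayed-for group every two weeks."""
    return min(SUBJECTS, 2 * (week // 2))

def group_means(weekly_growth):
    """Mean calmness score per group, pooled over all subject-weeks."""
    prayed, control = [], []
    for week in range(WEEKS):
        # Every subject has exactly the same score in a given week.
        score = 10.0 * (1 + weekly_growth) ** week
        n_prayed = prayed_for_count(week)
        prayed += [score] * n_prayed
        control += [score] * (SUBJECTS - n_prayed)
    return sum(prayed) / len(prayed), sum(control) / len(control)

if __name__ == "__main__":
    for growth in (0.05, 0.0):
        prayed, control = group_means(growth)
        print(f"growth {growth:.0%}: prayed-for {prayed:.2f}, control {control:.2f}")
```

With a 5% weekly trend the prayed-for mean comes out higher, even though no subject ever differs from any other. With no trend the two means are identical.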
Ladies and Gentlemen, you need a control group! If you think you have a
clever experimental design that gets around the need for a control group,
carefully write down all the reasons you think it works on a piece of
paper. Study the paper carefully. Then throw away the paper and get a
control group anyways.
Study 5: Cha (Cha Hospital, Korea); Prayer for Pregnancy
The
next
study, by Cha, Wirth, and Lobo in 2001, claimed that prayer
doubled pregnancy rates in IVF patients,
at p = 0.1%. This would be a remarkable
achievement, because no other techniques are known to double pregnancy rates
like that. However, the validity of the study has been called into question:
-
The authors did not obtain informed consent; neither the patients nor their
doctors were told about the experiment. This by itself does not dispute the
results of the study, but it is considered unethical, and launched an
investigation by the home institution, Columbia University.
-
In the wake of this controversy the lead author, Dr. Lobo, claimed that he
had not even heard of the study until 6 to 12 months after it had been
completed. Lobo withdrew his name from the paper.
-
The second author, Daniel Wirth, a parapsychologist with no medical
training, was the one who actually executed the experiment. In May
2004 he pleaded guilty in a Pennsylvania court to
thirty-five
counts of mail fraud, bank fraud, and other felonies. He fraudulently
obtained over $3 million using false identities. He was sentenced to five
years in federal prison. His convicted co-conspirator hanged himself in
prison.
-
The Journal of Reproductive Medicine
withdrew the paper in May 2004, but reinstated it in November 2004.
-
In February 2007, the remaining author, Cha, was
censured
by the journal Fertility and Sterility
for plagiarizing, almost word for word, a paper published by
Jeong-Hwan Kim in another journal, and perjuring himself by signing a
statement that the work was original. Cha was banned from publishing in
Fertility and Sterility for three
years and has left Columbia.
So we're left with zero authors associated with the study who have not been
jailed or penalized for fraud. Believing this study looks like a
Warning
Sign I8: believing liars and cheats. There is a legitimate question
whether the study was actually done at all--did Wirth just make up the data the
way he made up false identities and financial instruments?
Study 6: Leibovici (Rabin Medical Center, Israel); Retroactive Prayer
Now let's move on to the final study cited by Dossey as an "amazing
example." This is a
study
by Dr. Leonard Leibovici of the Rabin Medical Center in Israel, appearing in the
British Medical Journal in December 2001. In 2000, Leibovici looked at
patients admitted to the hospital for brief stays in 1990-96. He randomly
assigned them to one of two groups, and had prayers said for the members of one
group. The control group got no treatment. Mortality rates showed no
difference, but subjects in the prayed-for group had less fever and shorter
hospital stays, significant at the p=4%
level. Note that the praying was all done 4 to 10 years
after the patients had either recovered
or died. So Leibovici is making the extraordinary claim that prayers are
altering the past.
I think the most interesting
comment
on this is by Martin Bland of the University of York. He essentially has a
logical proof that an ethical study proving the
efficacy of retroactive prayer is logically impossible. The proof goes
like this (my words):
According to Clause 30 of
the
Declaration
of Helsinki,
"at the conclusion of the study, every patient entered into the study should
be assured of access to the best proven prophylactic, diagnostic and
therapeutic methods identified by the study." Now suppose you have done
a study proving retroactive prayer works. If you don't offer retroactive
prayer to the control group, you're being unethical. If you do offer it,
then the control group should be retroactively cured. Thus, in the end
there should be no difference between the control group and the treatment
group, and therefore the study cannot show an effect.
Dossey and Brian Olshansky rebut this point by saying that prayers cannot be
withheld as a treatment, because anyone can pray for any patient at any time.
There is something to this argument, but overall it is weak; if the best
treatment in an experiment turned out to be 8 glasses of water a day, and it was
not offered to all patients, I don't think the experimenters would get off the
hook by claiming that anyone might drink 8 glasses of water on their own.
In any case, it is interesting that Leibovici's purpose in publishing this
article was actually tongue-in-cheek. It was published in the Christmas
issue, which traditionally includes articles of a light-hearted or humorous
nature (such as the December 2003 paper recommending randomized controlled trials to establish
the efficacy of parachutes for treating "gravitational challenge"). Leibovici has stated:
The purpose of the article was to ask the
following question: Would you believe in a study that looks
methodologically correct but tests something that is
completely out of people's frame (or model) of the physical
world--for example, retroactive intervention or badly distilled water
for asthma?
Leibovici answers his own question as follows:
if the
pre-trial probability is infinitesimally low, the
results of the trial will not really change it, and the trial
should not be performed. This, to my mind, turns the article into a
non-study.
Here Leibovici is saying that his whole purpose in publishing the study was to
point out what I called
Warning
Sign I4 (confusing P(H|E) with
P(E|H)). His reasoning is as follows: what's the prior probability (before
the experiment) that thoughts can influence events in the past? We have no
evidence of it ever occurring, we have no theory for how it could occur, but we
have copious evidence every day and a fabulously accurate theory (called
physics) that says that it does not happen. So reasonable values for the
prior probability could be anywhere from one in a million to one in a
googolplex. If we believe the experiment, which comes in at a
p=4% level, we multiply the prior
probability by 25, but the result is still infinitesimally small, so Leibovici's
point is that the experiment doesn't matter.
Dossey disagrees with Leibovici's take on his own experiment. Dossey says
"Leibovici's auto-rejection
brings a dangerous level of arbitrariness to the
scientific process. Why disqualify one study and not another,
when both had acceptable methods?" I agree with Dossey that
there is no need to disqualify a study. We should accept the results of
this study for what they are worth--but no more. If before the study
Leibovici's belief in retroactive prayer was 1 in a trillion, then after the
study (which was significant at p=4%) he
should update his belief, not discard the study. If he believes there are no
systematic biases in the study, he should update to about 25 in a
trillion. If he believes there is a high probability of bias of one kind
or another, the update should be less, perhaps to 1.01 in a trillion. But it is
important not to discard the study completely, because if we have 10
independent, unbiased confirmatory studies at the p=4% level (with no
non-confirming studies), the factors of 25 compound to about 10^14, and we
should update all the way from 1 in a trillion to odds of roughly 95 to 1 in
favor.
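The update arithmetic is easiest to follow in odds form. The sketch below takes the simplest reading of "multiply by 25": a result at the p=4% level is treated as 1/0.04 = 25 times more likely if the hypothesis is true than if it is false. That likelihood ratio, the independence of the studies, and the function name are all assumptions made for illustration.

```python
def update(prior_prob, likelihood_ratio, n_studies=1):
    """Posterior probability after n independent studies (Bayes' rule, odds form)."""
    prior_odds = prior_prob / (1 - prior_prob)
    posterior_odds = prior_odds * likelihood_ratio ** n_studies
    return posterior_odds / (1 + posterior_odds)

PRIOR = 1e-12          # "1 in a trillion"
RATIO = 1 / 0.04       # assumed likelihood ratio for a p=4% result

if __name__ == "__main__":
    print("one unbiased study:  ", update(PRIOR, RATIO))
    print("heavily biased study:", update(PRIOR, 1.01))
    for n in (5, 7, 10):
        print(f"{n} unbiased studies:  ", update(PRIOR, RATIO, n))
```

For a tiny prior, odds and probability are nearly the same number, so one study moves 1 in a trillion to about 25 in a trillion. The factor compounds with each independent confirmation, which is why a single study barely matters but a run of them can.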
The other mistake Dossey makes is what I call
Warning
Sign D8 (lack of a theory). First Dossey claims that it is ok not to have a
theory:
No mechanism known today can account for the
effects of remote, retroactive intercessory prayer said for a group of
patients with a bloodstream infection. However, the significant results and
the flawless design prove that an effect was achieved. To quote Harris et al:
"when James Lind, by clinical trial, determined that lemons and limes cured
scurvy aboard the HMS Salisbury
in 1753, he not only did not know about ascorbic acid, he did not even
understand the concept of a `nutrient.' There was a natural explanation for
his findings that would be clarified centuries later, but his inability to
articulate it did not invalidate his observations."
He has a very good point that this experiment by Lind -- one of the first
applications of the scientific method to medicine -- was done without a modern
theory of vitamins or nutrients. However, as I point out in
D8,
there was a perfectly good partial theory
that said that what you eat can affect your health; Lind was following this
partial theory. He knew, for example, that what you eat now can affect your health in the
coming hours or days, but cannot affect your health in the past.
Dossey goes on to propose his own partial theory for retroactive
causality: "In one of the most profound discoveries in science, a new class
of phenomena was recognized: 'non-local events,' in which distant happenings are
eerily linked without crossing space, without decay, and without delay."
Here he is using the theory of Quantum Electrodynamics as a metaphor for
something eerie, but he's not saying exactly what. Dossey is correct in
pointing out that experiments on subatomic particles are consistent with the
idea of nonlocality, which Einstein called "spooky action at a distance."
Does quantum theory license Dossey's idea that anything goes--mind over matter,
prayer working backwards in time, whatever; it's all nonlocal? Dossey
seems to think that there's plenty of slop in Quantum theory; it all seems so
strange, so anything's possible; why couldn't thoughts affect a patient's health
a decade before?
Dossey's invocation of nonlocality may work
as a metaphor, but it doesn't work as physics. It is true that photons
act nonlocally, but what connection does that have to consciousness?
Or to macroscopic causation going backwards in time? There is
absolutely no evidence nor any theory that would account for Dossey's
claims. Furthermore, Quantum Electrodynamics is actually the exact
opposite of "anything goes." It is the most precise, most predictable
theory ever invented in any field. For example, before Quantum theory
the magnetic moment of the electron was defined as 1 Bohr
magneton. From the theory you can see that this is actually an
approximation, and the more precise value can be calculated as
1.0011596525 Bohr magnetons; this matches the experimentally derived
value of 1.0011596522 Bohr magnetons to 10 digits of accuracy. Compare that
to the gravitational constant, 6.67428 × 10^-11 N
m^2/kg^2, which has only 6 digits of
accuracy. According to physicists, Dossey would have a
ten-thousand-times better argument if he invoked gravity rather than
quantum theory as his metaphor. After all, gravity is nonlocal
as well. Two objects with mass are "eerily linked without crossing
space, ... and without delay" by gravity. But of course, gravity
does not sound as mysterious as quantum forces, so it would sound
silly to say "because of the nonlocal effects of gravity, thoughts can
influence events in the past." Nobel laureate physicist Steven
Weinberg said "quantum mechanics has been overwhelmingly important to
physics, but I cannot find any messages for human life in quantum
mechanics that are different in any important way from those of
Newtonian physics."
What nonlocality actually means is that to predict what happens to a particle
such as a photon or electron you need to consider a probability distribution
over several possible paths (as shown in the
Feynman
diagrams to the right). To compute how much light reflects off a piece
of glass, you have to consider that each photon might reflect off the top
surface, might reflect off the bottom surface, or might pass all the way
through. You can't look at just one of the options; you have to add up the
probability distributions for all three. Furthermore, when two particles
interact, their probability distributions are no longer independent.
That's all there is to it, except for some details (which you can learn from
seven years of advanced mathematics or from
a
fun popular book with no equations at all). There is no mystery at all
as to what actually happens--that can all be predicted out to the tenth decimal
place. The only "eerie" or "spooky" part is
why it happens that way. As Feynman
put it: "While I am describing to you how Nature
works, you won't understand why Nature works that way. But you see, nobody
understands that. I can't explain why Nature behaves in this peculiar way.
... Does that mean that physics, a science of great exactitude, has been reduced
to calculating only the probability of an event, and not predicting exactly what
will happen? Yes. That's a retreat, but that's the way it is: Nature
permits us to calculate only probabilities. Yet science has not
collapsed."
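To make the "add up all the options" idea concrete, here is a simplified sketch of partial reflection from a sheet of glass. In the standard treatment the quantities that get added are complex amplitudes, and the probability is the squared magnitude of the sum. The sketch keeps only the two reflecting paths (top surface and bottom surface). The amplitude of 0.2 per surface, the sign flip on the top-surface path, and the `phase` parameter standing in for the glass thickness are all illustrative assumptions.

```python
import cmath
import math

SURFACE_AMPLITUDE = 0.2  # assumed magnitude per surface (0.2**2 = 4% from one surface alone)

def reflection_probability(phase: float) -> float:
    """Add the amplitudes for the two reflecting paths, then square the magnitude."""
    front = -SURFACE_AMPLITUDE                        # top-surface path, with a sign flip
    back = SURFACE_AMPLITUDE * cmath.exp(1j * phase)  # bottom-surface path, delayed by `phase`
    return abs(front + back) ** 2

for phase in (0.0, math.pi / 2, math.pi):
    print(f"phase {phase:.2f}: reflection probability {reflection_probability(phase):.3f}")
```

Under these assumptions the reflection probability swings between 0 and about 16% as the phase changes, even though each surface on its own would reflect 4%. Nothing about it is vague: the sum fixes the answer exactly.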
Study 7: Benor (Wholistic Healing Publications); Spiritual Healing
Dossey also mentions a meta-study by Daniel J. Benor M.D. titled
Survey of Spiritual Healing Research from the journal
Complementary Healing Research. Benor looks at 131 studies and
finds that 56 have positive results at a significance level of
p=1%, another 21 at p=5%, and the remaining 54 show no
significant results. Does that constitute positive or negative
evidence? If, like Gary Posner, you're a stickler for
credentials, you'd say it constitutes no evidence at all, because
these are all either unpublished student theses or from parapsychology
journals, and not from established scientific journals. If you think
that spiritual healing constitutes a new physical phenomenon (rather
than just a medical treatment) then you, like Victor Stenger, would see
Warning Sign I6: accepting the wrong p value. You would
insist on the p=0.01% level of physics journals and would say
that these are 131 out of 131 negative results. If you are like
Leibovici, you would say that these results do not change our minds,
given the prior probabilities, and you would argue that the studies should
not have been done. If you're Benor or Dossey, you say the jury is
still out, we need more studies, but the fact that 56+21 is more than
half of 131 means that the preponderance of evidence favors spiritual
healing. If you're a serious statistician, you would note that Benor
uses a simplistic and error-prone technique called vote-counting. This
is a clear Warning Sign D6: the wrong
statistics. (See this overview, which states "conclusions
from vote-counting can be very misleading.")
Here's how I
look at a meta-analysis like Benor's. For the moment, assume all the
studies are fair, well-done studies with no systematic bias.
Furthermore, assume there is one thing called Spiritual Healing
that either works or doesn't. I know that's an over-simplification,
but bear with it. If spiritual healing doesn't work, the odds are
astronomically low that 77 fair studies would report it did, and if it
does work, the odds are also astronomically low that 54 fair studies
would say it doesn't (I'd need to have more data to say exactly how low,
but a rough guess is less than 10^-30). It seems reasonable
to reject both these possibilities.
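The first half of that rough guess can be checked with a binomial tail. This sketch rests on an assumption that goes beyond the text: if there is no real effect, each of the 131 studies independently has a 5% chance of reporting a significant result (lumping the p=1% and p=5% results together). The helper name `binomial_tail` is illustrative.

```python
from math import comb

def binomial_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Chance that at least 77 of 131 fair studies come out significant
# when there is no effect and the false-positive rate is 5%.
prob = binomial_tail(131, 77, 0.05)
print(prob < 1e-30)
```

The second half (54 fair studies missing a real effect) cannot be checked the same way without knowing the statistical power of the studies, which is the extra data the paragraph says it would need.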
So we have to
question our assumptions. One assumption was that there is a single
thing called Spiritual Healing. We could reject that
assumption, and insist that the meta-study be split: make a claim that
one type of healing works and another kind doesn't. That's a good way
out of the dilemma, but requires a theory about what works and what
doesn't.
The other assumption was that the studies
have no systematic bias. It seems reasonable to question that
assumption, since 6 out of 6 of Dossey's best studies had
pervasive problems. Here's one possibility: assume that 90% of these
papers are flawed to the point of being unreliable in some way, and
the other 10% are flawless. 10% of the 56 papers at p=1% is
about 6 and 10% of the 21 papers at p=5% is 2. So now there are
about 8 reliable studies to explain. You'd need 600 negative studies
to balance the 6 at p=1%, and 40 negative studies to balance
the 2 at p=5%. Now assume that in addition to the 54 published
papers with negative results, there are 586 other studies with
negative results that went unpublished due to the file-drawer effect
or publication bias. Under these assumptions my model fits exactly the
number of positive and negative results you would expect by random
chance if there were no effect of spiritual healing. Now I realize I
made many assumptions in my model, and I don't really believe in my
model much; maybe it has a one-in-a-million chance of being
accurate. But that's 10^24 times better than the alternatives
I outlined above.
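The bookkeeping in that what-if model is simple enough to write down. The sketch below follows the essay's rough accounting: it takes the quoted counts, keeps 10% of the positive papers as reliable, and asks how many studies of a non-existent effect would be needed to produce that many false positives on average. The function and constant names are illustrative.

```python
# Counts quoted in the essay.
POSITIVE_AT_1PCT = 56
POSITIVE_AT_5PCT = 21
PUBLISHED_NEGATIVE = 54
RELIABLE_FRACTION = 0.10  # the what-if assumption: only 10% of positive papers are reliable

def studies_needed(reliable_positives: int, alpha: float) -> int:
    """Studies of a null effect that would yield this many false positives on average."""
    return round(reliable_positives / alpha)

reliable_1pct = round(POSITIVE_AT_1PCT * RELIABLE_FRACTION)  # about 6
reliable_5pct = round(POSITIVE_AT_5PCT * RELIABLE_FRACTION)  # about 2

total_needed = studies_needed(reliable_1pct, 0.01) + studies_needed(reliable_5pct, 0.05)
unpublished = total_needed - PUBLISHED_NEGATIVE

print(reliable_1pct, reliable_5pct, total_needed, unpublished)
```

Changing `RELIABLE_FRACTION` shows how sensitive the file-drawer estimate is to the assumed share of flawed papers.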
Study 8: Klingbeil (Spindrift Institute); Yeast and Atoms
What about those "30+ experiments on human and
non-human targets (including yeast and even atoms), in which recorded results
showed changes from average or random to beyond-average or patterned even when
the designated thought group acted after the experiment was over"?
As far as I can tell, these are all the works of Christian Scientists Bruce and
John Klingbeil, who founded the
Spindrift
Institute in Oregon in 1969. They did experiments where they prayed
for yeast, seeds, and other things. They had a very good idea -- eliminate the
possibility of a placebo effect by using non-sentient subjects. Unfortunately
their results were never peer-reviewed or published in scientific journals,
making it difficult to evaluate them. The Klingbeils apparently did
self-publish some of their experiments, shortly before they committed suicide in
1993, but I have been unable to find copies. It appears these studies have not
been replicated by any other lab.
Other Studies
Now let's look at some studies that Dossey didn't mention. The first three
are often cited as the largest and best-controlled studies of intercessory
prayer; they all happen to be on prayer for cardiac patients:
-
Benson H, et al. (2006).
Study
of the therapeutic effects of intercessory prayer (STEP) in cardiac bypass
patients: A multicenter randomized trial of uncertainty and certainty of
receiving intercessory prayer. American Heart Journal
151:934-942.
-
Aviles JM et al. (2001).
Intercessory
prayer and cardiovascular disease progression in a coronary care unit
population: a randomized controlled trial. Mayo Clinic
Proceedings 76(12):1192-8.
-
Sloan RP, Ramakrishnan R (2005)
The
MANTRA II Study The Lancet 366(9499):1769-70
The next two are recent meta-analyses of intercessory prayer:
-
Masters S et al. (2006)
Are
there demonstrable effects of distant intercessory prayer?
Annals of Behavioral Medicine
Aug;32(1):21-6.
-
Hodge D (2007)
A
Systematic Review of the Empirical Literature on Intercessory Prayer
Research on Social Work Practice, Vol. 17, No. 2, 174-187
Study 9: Benson (Harvard, Templeton Foundation); Prayer for Cardiac Patients
The
John
Templeton Foundation financed
this
study, intending it to be the definitive answer on the efficacy of
prayer. They spent $2.4 million and enlisted 1.7 million people to pray. The
study looked at 1802 heart patients from several prestigious medical centers.
Like other studies this one had double-blinded prayed-for and non-prayed-for
groups. But they also added a third group, which was explicitly told
they would receive prayer. Intercessors were instructed to pray "for a
successful surgery with a quick, healthy recovery and no complications." By
all accounts the study was properly controlled and blinded (except for the
third group, who knew they would be prayed for). The conclusion of the
study was:
Intercessory prayer itself had no effect on complication-free recovery from
CABG [coronary artery bypass graft surgery], but certainty of receiving
intercessory prayer was associated with a higher incidence of
complications.
In other words, there was no significant difference between the blinded
prayed-for and non-prayed-for groups. However, the group that knew they
were going to be prayed for did worse, at a statistically significant
level. The experimenters speculate the patients may have felt that they
must be seriously sick, if they needed
prayer, and responded poorly because of that.
Study 10: Aviles (Mayo Clinic); Prayer for Cardiac Patients
This
study of 799 subjects was similar to the Benson study above. There was
no third group, but they did separate both the prayed-for and non-prayed-for
groups into low-risk and high-risk. The study appears to be
well-controlled and blinded. The conclusion was:
As delivered in this study, intercessory prayer had no
significant effect on medical outcomes after hospitalization in a coronary care
unit.
Study 11: Sloan (Duke); Prayer for Cardiac Patients
This
study of 748 heart patients differs from the other two in that it enlisted
12 different denominations to do the prayer: Jews, Muslims, Buddhists, and
various Christian denominations. The study also considered soothing music,
imagery, and touch therapy, known as MIT. Altogether there were four
groups: prayer or non-prayer crossed with MIT or non-MIT. The conclusion:
Neither therapy alone or combined showed any measurable treatment effect on the primary composite endpoint of major adverse cardiovascular events at the index hospital, readmission, and 6-month death or readmission.
Study 12: Masters (Syracuse); Meta-Analysis of Prayer
This
meta-analysis looked at 14 prayer studies. They looked for all the
studies published before August 2005 matching the terms "intercessory prayer" in
the PsycInfo and Medline databases, and also looked at references from these
articles and from review articles. The conclusion was:
There is no scientifically discernible effect of IP [intercessory prayer]
as assessed in controlled studies.
Furthermore, of the (non-significant) effect that did exist, 88% of it
disappears when Masters removes the discredited Cha/Wirth study.
Study 13: Hodge (Arizona); Meta-Analysis of Prayer
This
meta-analysis looked at 17 studies, including many of the same studies as
Masters. Hodge found that there was a significant difference in favor of
prayer, but that the difference is not clinically important:
Overall, the meta-analysis indicates that
prayer is effective. Is it effective enough to meet the standards of the
American Psychological Association's Division 12 for empirically validated
interventions? No. Thus, we should not be treating clients suffering with
depression, for example, only with prayer.
One serious problem with this meta-analysis is that it includes the discredited
Cha/Wirth study. If removing that study eliminated 88% of the effect in
Masters' study, it seems likely that removing it from Hodge's study would yield
no significant difference on the remaining studies. But I don't have
access to the numbers so I can't know that for sure.
Study 14: Galton (Royal Geographical Society); Prayer for British Royal Family
OK, so this prayer study,
Statistical
Inquiries into the Efficacy of Prayer, isn't one of the biggest, nor one of
the best, but it is the first, so I
couldn't resist slipping it in. It was done by a famous statistician,
Francis
Galton (1822-1911) and published in the
Fortnightly Review in 1872. Galton says:
There is not a single instance, to my
knowledge, in which papers read before statistical societies have recognized
the agency of prayer either on disease or on anything else. ...
Had prayers for the sick any notable effect,
it is incredible but that the doctors, who are always on the watch for such
things, should have observed it.
Then Galton proposes an experiment of his own. He notes that British
subjects frequently say prayers for the health of the Queen and other
royals. Are those prayers effective? He shows a chart of the average
lifespans of various groups:
| Group | Number | Avg. Life |
|---|---|---|
| Members of Royal houses | 97 | 64.0 |
| Clergy | 945 | 69.5 |
| Lawyers | 294 | 68.1 |
| Medical Profession | 244 | 67.3 |
| English aristocracy | 1,179 | 67.3 |
| Gentry | 1,632 | 70.2 |
| Trade and commerce | 513 | 68.7 |
| Officers in the Royal Navy | 366 | 68.4 |
| English literature and science | 395 | 67.6 |
| Officers of the Army | 659 | 67.1 |
| Fine Arts | 239 | 66.0 |
According to this table, the royals have the lowest life expectancy, so it
appears that all those prayers are not working. Galton's study has an
important place in the history of scientific reasoning, but it has several
serious flaws. First, Royals at the time, although generally receiving
good nutrition and health care, were subject to certain health
risks, such as hemophilia and beheadings, at a greater rate than the rest of
the population. Galton did not control for these variables.
More importantly, there is a selection bias
(Warning
Sign D4): a certain percentage of children are sickly from birth. If
such a child is a Royal, s/he counts as one from birth, and his or her likely
early death brings down the Royals' average. In contrast, a
sickly child born to a doctor, lawyer, or officer will likely not be able to
follow in his father's footsteps, and thus will not bring down the average of
those groups. So we can say that Galton's study does not provide a serious
result (perhaps he himself didn't think it was very serious either), while
still admiring it as one of the first examples of evidence-based medicine.
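The selection bias is easy to see with made-up numbers. In this sketch both groups have exactly the same underlying health; the only difference is who gets counted. The 10% sickly fraction and the lifespans of 10 and 68 years are hypothetical values chosen to show the direction of the bias, not estimates from Galton's data.

```python
def mean_lifespan(frac_sickly: float, sickly_life: float, healthy_life: float,
                  counted_from_birth: bool) -> float:
    """Average age at death for a group, depending on who gets counted as a member."""
    if counted_from_birth:
        # Royals: every child is a member, including the sickly ones.
        return frac_sickly * sickly_life + (1 - frac_sickly) * healthy_life
    # Professions: only those healthy enough to qualify are ever counted.
    return healthy_life

# Hypothetical numbers, chosen only to show the direction of the bias.
FRAC_SICKLY, SICKLY_LIFE, HEALTHY_LIFE = 0.10, 10.0, 68.0

royals = mean_lifespan(FRAC_SICKLY, SICKLY_LIFE, HEALTHY_LIFE, counted_from_birth=True)
lawyers = mean_lifespan(FRAC_SICKLY, SICKLY_LIFE, HEALTHY_LIFE, counted_from_birth=False)
print(f"royals {royals:.1f}, lawyers {lawyers:.1f}")
```

A gap of several years appears even though prayer, nutrition, and medical care play no part in the model.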
There have been other studies along similar lines to Galton's. For
example,
a
study by W.F. Simpson in 1989 found that graduates of
Principia
College, a school for Christian Scientists, who advocate prayer rather
than medical treatment, had significantly higher death rates than similar
students from secular colleges. This study does not have the confounds
of Galton's study; if anything you would expect that the Christian Scientist
students would have lower rates of dangerous practices such as drinking and
smoking and thus you would expect them to live longer.
Another
study by Asser and Swan looked at 172 children who died after their
parents refused medical care, preferring to rely on prayer. Asser and
Swan found that, if you assumed the typical cure rates of medical treatments
for the stated causes of death, at least 135 of these children would have
recovered. However, this conclusion is invalid because of selection
bias: it considers only the children who did die, and we don't have
recovery rates on the children who did not die.
Conclusion 1: Assessing the Evidence for Intercessory Prayer
Before doing the research for this essay, I had had a vague idea from
reading various newspaper reports that studies of the medical efficacy
of prayer were mixed: some studies showed a positive effect, some
not. In fact, after actually reading the 14 studies above, a different
picture emerges. We can
grade each study as positive (P), negative (N), or flawed (F):
-
(F) Achterberg clearly shows that brains do not have completely random
activation while encased in an fMRI machine for 24 minutes. However,
the variation is more likely due to anxiety or boredom or some other
cause, and not due to distant intentionality.
-
(P,N) Byrd is a mostly well-done study that is positive on some variables but negative on most; it is negative on all variables when you combine it
with Harris.
-
(P,N) Harris is a well-done study that is positive on some variables but negative on most; it is negative on all variables when you combine it
with Byrd.
-
(F) Tloczynski is poorly-designed and probably reflects an artifact of the
lack of controls rather than a real effect.
-
(F) Cha is the work of a felony fraudster and a plagiarist and can't be
taken seriously.
-
(F) Leibovici was intended to be taken as a joke and is best interpreted
that way.
-
(F) Benor is reporting on unpublished studies that are difficult to
evaluate, using the flawed methodology of vote-counting.
-
(F) Klingbeil never published any peer-reviewed studies.
-
(N) Benson is a well-done study with all negative results.
-
(N) Aviles is a well-done study with all negative results.
-
(N) Sloan is a well-done study with all negative results. Benson, Aviles, and
Sloan together are the three best.
-
(N) Masters is a meta-analysis with negative results.
-
(F) Hodge is a meta-analysis with some positive results, but the author
claims the results are not clinically important. Also, it seems likely that
the positive results stem solely from the discredited Cha study.
-
(F) Galton, like Leibovici, is best treated as a joke.
In sum we find 8 flawed studies (from which we can draw no solid
conclusion) and 4 clear negative studies (which say that for the
conditions they studied, there is no solid evidence that intercessory
prayer performs better than no prayer). There are also two good
studies, Byrd and Harris, which individually show a positive effect of
prayer on some variables, but taken together are negative: they show
that no variable has a consistent, repeatable effect. It certainly looks like
intercessory prayer, when measured in a scientific experiment, is not
effective. Why is the impression from actually reading the studies so
different from my original impression from reading newspaper reports?
It may be because, as I write
elsewhere, some reporters are more interested in giving "both
sides of the story" than in doing a little work to discover if there
is actually compelling evidence to support a point of view.
Happily, there's something in the results of these studies to please
everyone, regardless of their religious beliefs. The Biblical
Theist can point to Deuteronomy
6:16, which states "You shall not put the Lord your God to the
test" and claim that of course prayer doesn't work when part of an
experiment, but it does work otherwise. The Theist can also say that God
works in mysterious ways, and perhaps making the patient feel better right
away is just not in God's plan. The Deist can say the results
are consistent with a God who does not answer prayers but passively
oversees the Universe. And the Atheist can use the results as evidence
against any God. Note that the Sloan study, which used 12 religious
denominations, could have reported an outcome where one
religion works and the others don't. But instead it reported that all
12 are equally ineffective. You can take that as evidence that God is
great and loves all people equally, or as evidence that God is not
listening, or not responding, or does not exist. Statistics does not
help in deciding among the possibilities.
Conclusion 2: Assessing Dossey
Larry Dossey M.D. seems like a sensible, intelligent, nice guy.
How could he look at the same studies we have covered here
and see an overwhelming
preponderance of evidence for the efficacy of prayer, rather than the
reverse?
How could he say he found "no flaws" in Achterberg, when we could so easily
find six flaws? (I recognize that Achterberg is a co-author with Dossey's
wife.)
How could Dossey support the Cha/Wirth/Lobo study in
January 2007, when he must have known by then that Lobo had repudiated
the study and
Wirth had been convicted of fraud? How could Dossey omit Benson,
Aviles, and Sloan, the three biggest, most prestigious, and widely-publicized
studies on the topic of intercessory prayer? I realize
the
interview was a short piece, but citing the six studies Dossey does and
failing to mention Benson, Aviles and Sloan would be like an historian
arguing that Germany
is the greatest military power in the history of the world by citing the Seven Years' War and the
Battle of Leipzig -- and neglecting to mention World War I and World War II.
How could Dossey do that?
After reading Tavris and Aronson's book Mistakes Were Made (but not by me), I understand
how. Dossey has staked out a position in support of efficacious
prayer and mind-over-matter, and has invested a lot of his time and
energy in that position. He has gotten to the point where any
challenge to his position would bring cognitive dissonance: if his
position is wrong, then he is not a smart and wise person; he
believes he is smart and wise; therefore his position must be correct
and any evidence against it must be ignored. This pattern of
self-justification (and self-deception), Tavris and Aronson point
out, is common in politics and policy (as well as private life), and
it looks like Dossey has a bad case. Ironically, Dossey is able to
recognize this condition in other people -- he has a powerful essay that criticizes George W. Bush
for saying "We do not torture" when confronted with overwhelming
evidence that in fact Bush's policy is to torture. I applaud this essay, and I agree that Bush has
slipped into self-deception to justify himself and ward off cognitive
dissonance. Just like Dossey. Dossey may have a keen mind, but his mind has
turned against itself, not allowing him to see what he doesn't want to see.
This is a case of mind over mind, not mind over matter.
Conclusion 3: Final Lessons
-
If you aren't trained to recognize the types of errors I outline here and in
my other
essay, you can easily interpret a claim that "pneumonia was
significantly less in the prayed-for group at the
p=5% level" as meaning "there is a
95% chance that prayer is effective in reducing cases of pneumonia." After
reading these essays you should realize that the two statements are not at
all equivalent.
-
I really like that Dossey said "We need a single standard where we subject
both conventional and alternative medicine to the same high
standards." I agree with Dossey that the standard for publication in
medical journals should be more strict. Perhaps Victor Stenger is right in
saying that studies should be accepted only with
p closer to 0.1% rather than 5%, but I don't
think that relying solely on the p
value is the right way to decide.
-
In fact, I'm questioning the whole idea of
p values, or at least the idea that
they should be so prominent in publications. A
recent
article by J. Scott Armstrong says "I
briefly summarize prior research showing that tests of statistical
significance are improperly used even in leading scholarly journals.
Attempts to educate researchers to avoid pitfalls have had little success.
Even when done properly, however, statistical significance tests are of no
value." By that he means that it is always better to give confidence
intervals rather than significance levels. If I say that X is better than Y
with significance p=5%, you haven't
learned much about the difference between X and Y. But if I say the
95% confidence interval for X is 85-110 and the corresponding interval for Y
is 110-150, then you have a much better idea of the magnitude of the
difference.
-
I think it would be great if there were an online international registry of
experiments: when you start an experiment, you register your hypothesis and
methodology and get a timestamp. Your submission could be kept secret
for a year or two if you desire. When you go to publish, you need to show
that your experiment was properly registered ahead of time. If you fail to
publish, researchers still have a record of file-drawer effect experiments.
-
Anyone can learn to be a better judge of evidence. This essay and its
companion
attempt to teach the basics.
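The first lesson in the list above, that "significant at the p=5% level" is not the same as "95% chance the effect is real", can be illustrated with Bayes' rule. This is a sketch under stated assumptions: the 80% power figure and the three prior values are hypothetical, and the function name is illustrative.

```python
def prob_real_given_significant(prior: float, power: float, alpha: float) -> float:
    """P(effect is real | study came out significant), by Bayes' rule."""
    true_positive = prior * power          # real effect, and the study detects it
    false_positive = (1 - prior) * alpha   # no effect, but significant by chance
    return true_positive / (true_positive + false_positive)

ALPHA = 0.05   # the p=5% threshold
POWER = 0.80   # assumed chance that a study detects a real effect

for prior in (0.5, 0.01, 1e-6):
    posterior = prob_real_given_significant(prior, POWER, ALPHA)
    print(f"prior {prior:g} -> posterior {posterior:.4f}")
```

With a prior of 1 in 100, a significant result leaves the probability of a real effect at roughly 0.14, nowhere near 0.95. The answer depends on the prior and on the power of the study, neither of which appears in the p value.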
By the way, my relative pulled through the surgery with no
complications. Hmm, were all you readers praying for a good outcome?
Maybe there is something to this retroactive intercessory prayer thing after
all...
Peter Norvig