In our work to improve sleep and focus using audio content, we try to strike the perfect balance between science and art.
On the side of art, we have a brilliant audio team who somehow manage to put out multiple pieces of beautiful music every month.
On the science side, it’s my job as Lead Researcher to find studies that apply to our work. Then, I figure out a way to make the useful information in those studies available to the composer, voice artists, and developers.
I’m often asked how I find relevant scientific research, and how I know it’s the ‘real thing’: that is, how do I know it’s science and not pseudoscience?
The aim of this blog is to give a general overview of how I find real experimental research.
I like to begin my searches in databases that do some of the work for me.
Search engines like Google Scholar may have access to high volumes of information, but there is no telling how that information is selected.
So I start in places like PsycINFO, which is the American Psychological Association’s database, or PubMed, which is the National Institutes of Health’s database.
Because these are the research libraries of foundational scientific organizations, the studies I find here tend to be of a higher caliber.
These databases often screen for things like peer review, which means that studies must be closely examined by a team of fellow scientists before publication.
Peer review is a staple of scientific research. It ensures that a basic standard of methodology is met. Part of the scientific culture, exemplified in the peer review process, is the professional understanding that careful scrutiny and harsh criticism from peers is not only expected, but encouraged.
A good scientist expects that their methods and assumptions will be tested, and only the soundest of studies will make it to publication.
The peer review process brings us to the next line of defense against pseudoscience: the publishers and the journals themselves.
Certain publishers have a reputation for excellence which they have earned over time, so seeing their logo at the top of a study is a sign I’m in the right place.
Now this is not to say that a study published by a less established source is automatically untrustworthy; it just means that it does not have the seal of approval of an institution that is known for only putting out quality research.
You can still read research from lesser-known publishers; just proceed with a little extra caution.
Next comes the journal that the research is published in.
Some journals, like some publishers, have reputations that precede them. Certain journals have long histories of publishing the work of highly qualified and rigorous researchers, and these journals tend to be regularly read by other researchers and clinicians in the field.
This may seem like a vague distinction, but this kind of impact can actually be quantified.
The impact factor is a number that represents the ratio of citations received to articles published for a given journal, measured over a recent window (typically two years).
If the ratio is 1:1, this implies that the number of times the journal’s studies are cited is exactly the same as the number of studies it published. This is not ideal.
We would hope that the studies that are put out by a journal are impactful and useful enough that they are cited extensively in further research.
Going back to basic math, remember that ratios can be represented as fractions, and if the numerator (number of citations) is bigger than the denominator (number of studies published) you will get a value larger than one.
So we want the impact factor of a journal we’re reading to be above 1, indicating that, on average, each of the journal’s studies is cited more than once.
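If you’d like to see that arithmetic spelled out, here is a minimal sketch with made-up numbers (real impact factors are calculated by citation databases over a specific window, so treat this purely as an illustration):

```python
# Toy illustration of the impact-factor arithmetic (numbers are invented).
citations_received = 1800   # citations to the journal's recent articles
articles_published = 600    # citable articles the journal published in that window

impact_factor = citations_received / articles_published
print(f"Impact factor: {impact_factor:.1f}")  # 3.0 -> each article cited ~3 times on average
```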
Here is a page from the International Journal of Nursing’s website that shows their impact factor (scroll down, it’s on the left). You can also see, in the top right corner, the logo for Elsevier, a reputable publisher.
Like the publisher, the journal builds a reputation over time, and this is not always quantifiable, but requires years of work in a particular field.
A journal may not be cited as often, but still produce quality work. These are guidelines, not laws; no amount of criteria will substitute for experience in the field.
Another way of examining a journal is to look at the editors. If the editors are well established in their field, having worked for many years, and themselves produced high quality and impactful research, this is a good sign.
You can see, on the page linked above for the International Journal of Nursing, that the Editor-in-Chief is highly qualified (if you want, check them out for yourself).
The authors themselves are another way of verifying the legitimacy of research. Ideally you want at least one author, normally the lead (the one whose name is listed first or who has a number one next to their name), to be a researcher who has done quality work in the field.
Their previous work should have ideally been published in journals that meet the aforementioned criteria, and at least some of their research should be cited in other authors’ works in the field.
Because of the way scientists are trained, they almost always begin in the lab of an established researcher. This means that normally at least one person involved in a study is trained, experienced, and qualified enough to run their own lab.
Now sometimes there are groundbreaking theses (the final works done by a graduate student before achieving their degree and status as a fully fledged researcher).
However, these often require follow-ups and replication, because thesis research often has small sample sizes and exploratory methods. Again: guidelines, not rules.
A body of research refers to all the studies that examine one general theory and/or topic.
For instance, when I was examining research on music and sleep, it was immediately evident that there were a large number of studies done on the topic. This gave me confidence that general trends could be found, and that I would not have to base conclusions on any one lone study.
If there is a large enough body of quantitative research, then there will probably be at least one meta-analysis.
Meta-analyses take multiple studies and treat them each as one data point in a larger, combined analysis.
After condensing the data, statistical analyses are run across all the results to identify general trends and actually quantify them.
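As a rough sketch of that pooling idea, here is a simple fixed-effect, inverse-variance weighted average, with invented effect sizes and standard errors rather than numbers from any study cited here:

```python
# Minimal fixed-effect meta-analysis sketch: each (hypothetical) study's effect
# size is weighted by its precision (1 / standard error squared), then pooled.
studies = [
    (0.45, 0.20),  # (effect size, standard error) from made-up study 1
    (0.60, 0.15),  # made-up study 2
    (0.30, 0.25),  # made-up study 3
]

weights = [1 / se**2 for _, se in studies]
pooled = sum(w * es for (es, _), w in zip(studies, weights)) / sum(weights)
print(f"Pooled effect size: {pooled:.2f}")  # ~0.50 with these invented numbers
```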
Meta-analyses are really great for another reason, namely that they weed out studies that have faulty methodology and do not include them in the analysis.
This means that the studies included in a meta-analysis can be a very good start when examining individual studies. If the studies are included in a meta-analysis (and the meta-analysis meets the above publishing criteria), it’s a good indication that these studies employ sound methodology.
As an example, here is a meta-analysis of the effects of music on sleep, from a reputable publisher, an impactful journal, and written by highly qualified researchers (for example, this PhD epidemiologist who also just so happens to be a professor of evidence-based practice in mental-health care).
There is also replication. Ideally, a study will be reproduced to confirm its findings, showing that the results were not a statistical fluke (more on this later).
Failing exact reproduction, a study should hopefully be consistent with other studies that examine very similar questions, just in slightly different ways.
Take a look at two studies examining the use of music to improve sleep for older adults (here, and here).
The fact that both studies show the same result, that music is helpful for sleep disturbances, adds weight to this conclusion. Each study confirms the other’s results.
Notice how the publishers, the journals, and the authors all add legitimacy to these studies.
If there are multiple conflicting studies, this is a red flag, and means thorough investigation is required. Meta-analyses can be helpful here, because they aggregate the results of as many quality studies as they can find.
If a study is put out by a well established publisher, in a reputable journal with a high impact factor and respectable editors, and the study itself is performed by someone who has done previous meaningful work in the same field, and there is a strong body of research around the same topic (possibly including a meta-analysis), it’s a sign you are on the right track.
However, here’s the paradox. Research that is put out under these ideal conditions can still be seriously flawed, and research that does not meet these standards can still be excellent.
All we do by examining these questions of authorship, authority, and authenticity is increase the chances of a study being of high quality, but this is no guarantee.
So what do we do?
If a good study may come from relatively unknown sources, and a questionable study can come from sources with great reputations, where do we go from here? And what about brand new research that is done by relatively unknown authors?
Well, to answer these questions, you’ve got to examine the methodology of the study itself.
Is the study reproducible?
This question is asking, can other scientists test the reliability of the results by reading the research, and then doing the same or similar studies themselves?
When reading research, it should be crystal clear how everything was done, down to the exact technology used, time spent, and environmental conditions of the experiment. This should all be laid out in the Methods and/or Procedure sections of the study.
Are validated and reliable measures used?
This question is about what tests are being done, and what is being tested. If you want to know if music improves sleep, then the question should immediately arise: how do we measure sleep?
If researchers use a measure like the Pittsburgh Sleep Quality Index (PSQI), then you should be able to find studies that validate this measure, meaning that they have compared this measure to other ways of measuring the same or related factors.
Here’s a study validating this measure. The PSQI has been translated and validated in at least ten languages.
The measure should also be reliable, in the sense that if it is supposed to measure something stable, the results should be stable.
If I have you answer questions about your sleep last night, then ask the same questions a few hours later and your answers have changed, the test probably does not produce reliable answers.
Scientists can test and then retest later using the same measure to check its reliability.
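To make the test-retest idea concrete, here is a minimal sketch with invented scores (the PSQI’s actual reliability was established with far larger samples and formal psychometric methods):

```python
# Test-retest sketch: the same (invented) sleep-questionnaire scores collected
# twice from the same people should correlate highly if the measure is stable.
from scipy.stats import pearsonr

first_administration  = [7, 12, 5, 9, 14, 6, 10, 8]   # made-up scores, time 1
second_administration = [8, 11, 5, 10, 13, 6, 9, 8]   # same people, retested later

r, p = pearsonr(first_administration, second_administration)
print(f"Test-retest correlation r = {r:.2f}")  # values near 1.0 suggest a stable measure
```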
For an example of the PSQI’s reliability, check here. I was a little thrown off by the bright pink web page layout, but I checked the impact factor, publisher, and editors here (again, scroll down, it’s on the left).
An experiment as a whole needs to meet another standard of validity: ecological validity.
This means that a study will actually be useful outside of the lab. Studies that require intravenous drugs, or expensive machinery, may only be ecologically valid for hospitals but not for people in their homes. This becomes important when dealing with something like sleep.
Studies must randomize participants.
Who gets the pill and who gets the placebo has to be random; otherwise differences between participants might account for the results of the experiment. If all the people who got the pill already slept better, then the results of the study are based on pre-existing differences.
Randomly assigning participants to different experimental conditions means that these kinds of pre-existing differences are distributed across the conditions.
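As a minimal sketch of what random assignment looks like in practice (hypothetical participant IDs, not the procedure of any particular study):

```python
# Random assignment sketch: shuffling participants before splitting them into
# groups spreads pre-existing differences roughly evenly across conditions.
import random

participants = [f"participant_{i:02d}" for i in range(1, 21)]  # hypothetical IDs
random.shuffle(participants)

music_group   = participants[:10]   # experimental condition
control_group = participants[10:]   # comparison condition
```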
Another way to prevent participant differences from affecting results, and giving the researchers false impressions, is by controlling other factors.
Controls take different forms depending on the topic being studied, but the reasoning is the same across the board: we want to make sure there are no third factors (confounding variables) influencing the results.
Factor A may seem to cause Factor C, but Factor B may actually be causing both. Without proper controls, you cannot isolate causal relationships.
Controls also mean that there needs to be a comparison group, a condition where the intervention is not introduced.
If you find that people sleep well with music, but you never examine how participants slept compared to people who did not listen to music, then your results are meaningless. You have to have something to compare your results to, and this is the control or placebo group.
This study found that listening to music at bedtime improved sleep for older adults by helping them fall asleep faster, wake up fewer times in the night, and feel more rested during the day. The researchers used random selection.
They also controlled for eighty different variables that might confound the results. This means that the researchers excluded people with characteristics and behaviors that might influence sleep, and they used statistics to account for other factors. If these factors were not controlled for, it would be unclear what was actually improving sleep.
The researchers also instructed subjects to listen to music in their homes at bedtime, making the study ecologically valid, meaning that it tested a ‘real world,’ practical outcome.
Now we’ve all heard a lot about the placebo effect. One element that contributes to it is the expectancy effect: participants’ beliefs about what will happen can actually influence the outcomes.
This makes sense in some ways. If a participant believes that a given intervention will work, they tend to behave in ways that will promote the results they expect.
So ideally the research has an active control group. Don’t just compare music to silence, because then it will be clear to the music group that the music is being tested, and the silence group will be pretty sure they are the control group. You need something active, that might reasonably encourage expectations of change, to compare your intervention to.
For instance, comparing music to audiobooks or white noise is a good start.
Blinding refers to the process of hiding from participants which condition they are in, so that their expectations will not bias the results.
A study also needs to have a large enough number of participants, known as sample size. For statistical tests to be meaningful, there has to be a large enough amount of data.
Here is an example of a study that meets all the criteria we’ve discussed before (notice the reputable publisher, impactful journal, qualified author, validated tests, large sample size, as well as the randomized, controlled, and blinded methods).
The researchers’ expectations can also influence the results, so we need a way for the researchers not to interfere in their own study. This is done by hiding which condition is which from the scientists who interact with the participants and from those who analyze the raw data. This is the second blind. A study that prevents both the participant and the researcher expectancy effects is called a double-blind study.
So we’ve done everything to prevent bias and confounding variables. We move now to how researchers analyze results. The most basic standard is statistical significance, which means that a finding is most likely not due to chance.
This statistical test estimates how likely it is that results at least this extreme would appear if there were no real effect. The normal cut-off is .05, or 5%, meaning there is less than a 5% chance of seeing results like these by chance alone. The smaller this number, known as the p-value, the less likely the result is to be a fluke (but remember, we still want replications and even meta-analyses to confirm).
In the last study cited, the p-value was below 0.0001 (P < 0.0001). This, combined with the randomized, active-controlled, and blinded methods and the large sample size, indicates that it is very unlikely these results arose by chance.
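If you want to see where a p-value comes from, here is a minimal sketch using invented sleep-quality scores (lower is better) rather than data from any study cited above:

```python
# P-value sketch: compare two invented groups with an independent-samples t-test.
from scipy.stats import ttest_ind

music_group   = [5, 6, 4, 5, 7, 4, 6, 5, 4, 6]   # made-up sleep-quality scores
control_group = [8, 7, 9, 6, 8, 9, 7, 8, 7, 9]

t_stat, p_value = ttest_ind(music_group, control_group)
print(f"p = {p_value:.4f}")  # well below 0.05 here, so unlikely to be chance alone
```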
But just because a result is not random chance does not necessarily mean that it’s useful. A result may not be random, and yet the effect can still be too small to be helpful in understanding or solving a problem.
So to measure the actual size of the difference between the active control condition (say, the group that listens to an audiobook at bedtime) and the experimental condition (say, the group that listens to music at bedtime), we look at something called the effect size.
The effect size indicates the amount of overlap, and thus the amount of difference, between two distributions.
Sometimes individual studies won’t do this analysis, but a meta-analysis will give you an effect size across multiple studies, which is very useful. Effect sizes are typically described as small, moderate, or large, and depending on the situation, an effect size somewhere between small and moderate starts to be useful or meaningful (see the meta-analysis linked above).
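For the curious, here is a minimal sketch of Cohen’s d, one common effect-size measure, using the same invented scores as the p-value example above:

```python
# Cohen's d sketch: difference between group means divided by the pooled
# standard deviation (invented data; rough rule of thumb: 0.2 small,
# 0.5 moderate, 0.8 large).
import math
import statistics

music_group   = [5, 6, 4, 5, 7, 4, 6, 5, 4, 6]
control_group = [8, 7, 9, 6, 8, 9, 7, 8, 7, 9]

mean_diff = statistics.mean(control_group) - statistics.mean(music_group)
pooled_sd = math.sqrt(
    (statistics.variance(music_group) + statistics.variance(control_group)) / 2
)
cohens_d = mean_diff / pooled_sd
print(f"Cohen's d = {cohens_d:.2f}")  # large here, because the toy groups differ a lot
```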
So, to sum up, in two run-on compound sentences, how I tell if research is legitimate:
Ideally you want a study published by a reputable publisher, in a journal with a high impact factor and experienced editors; it must be written by qualified authors in the field, who perform proper randomization, controls, blinding, and statistical analysis (a p-value for statistical significance, and an effect size such as Cohen’s d for usefulness). A study should also build on a body of research, and should be reproducible.
I hope this helps. Keep in mind that not all of these criteria are laws of nature; they are general guidelines to help verify the quality of what you are reading.
Try your hand at finding relevant research. Feel free to send it my way if you have questions or find something interesting (especially in the sleep space!).
Try Pzizz. We design soothing audio that’s clinically proven to help you get better rest. Available on iOS and Android.