What do we know?
You’re at the zoo and you see a black and white striped animal.
You say, “That’s a zebra.” I’m standing next to you and I ask, “Do you know that’s a zebra?” Slightly irritated, you say, “Yes, of course. We’re in a zoo and the only black and white animal that looks like that in a zoo is a zebra. Duh.” At these low stakes, you certainly have enough information and context to be confident that you’re looking at a zebra.
I ask, “How do you know it’s not a donkey painted and trimmed to look like a zebra?” Now you’re irritated and stuck: you don’t have enough information to rule out a cleverly disguised donkey, at least not at this distance. Ruling out that relevant alternative would take additional information, including rules for how to test the claim. But, standing where you are, you do know:
If it is a zebra, then it is not a donkey
If it is not a donkey, then it is also not a cleverly disguised donkey
What you don’t know is whether it is a cleverly disguised donkey. As soon as I raise that possibility, however, the context shifts away from “zoo” to other, unforeseen situations, and that raises the requirements and standards for what counts as knowledge. Now, real zebras and fake donkeys are pretty small potatoes as far as knowledge goes. But the more you have at stake, the more evidence you need for something to count as knowledge. You can have a belief that happens to be true and still not have knowledge: glancing at a clock stopped at 10:07 when it is, in fact, 10:07 gives you a true belief about the time, and pointing out that the clock is usually very reliable gives you justification for that belief. Knowledge, however, is more than justified true belief.
And how do we know it?
Sciences get enough data to extrapolate from specific samples to general populations. They use statistics to describe, analyze, and predict. Medicine uses statistical (and summarized) data to apply general principles to specific cases. The statistics inform the diagnosis and additional tests can confirm that initial diagnostic estimate. Forensic science*, for the most part, can’t use generalizations because any entity (person, place, or thing) that becomes evidence is a member of a set with an unknown number of members. The world is too big. Except for DNA (ahem), we can’t say the chance of finding another source entity like this one is X in Y or that this entity occurs this often at random or whatever.
To get around this limitation, likelihood ratios (LRs) have been suggested. Basically, the LR compares how probable the observed evidence is if two entities share a common source to how probable it is if they do not. But, as Koehler and co-authors note:
At an abstract level, the LR is an appealing way to report forensic science evidence. In practice, however, it raises a set of challenges. Aside from a relative dearth of data, a significant obstacle to employing LRs to assess evidentiary weight is that it often is not obvious what values to use for the LR numerator and denominator. Even when LRs are computed using reliable data, human judgment usually plays a significant role. For example, reasonable people might disagree about the size and composition of the reference population used to inform the denominator of the LR. Consequently, the size of the LR may vary, sometimes by orders of magnitude.
We don’t have good numbers to put into LR formulas as a “seed” to base the strength of our testing outcomes on. Garbage in, garbage out. Let’s say you have a shoe and a shoe print that you think was made by that shoe: What number do you start with? Where do you get that number? How many other shoes are there out in the wild that might meet these criteria? How would you know? Are there donkeys painted to look like zebras? Whoops.
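To make that concrete, here is a minimal sketch of the problem. Every number in it is made up (none of this comes from Koehler et al. or anywhere else); the point is only that the LR for the very same shoe-print comparison swings by orders of magnitude depending on which reference population you assume for the denominator:

```python
# A purely illustrative sketch -- every number below is invented.
# LR = P(evidence | same source) / P(evidence | different source)

def likelihood_ratio(p_if_same_source: float, p_if_different_source: float) -> float:
    """Ratio of how probable the evidence is under the two competing hypotheses."""
    return p_if_same_source / p_if_different_source

# Assume the examiner judges the observed correspondence between the shoe and
# the print to be near-certain if that shoe really made the print.
p_same = 0.95

# The denominator is how often an unrelated shoe would show the same
# correspondence -- and that depends entirely on the assumed reference population.
candidate_denominators = {
    "shoes of this brand sold locally (assumed)": 1 / 500,
    "shoes of this size and tread nationally (assumed)": 1 / 50_000,
    "all shoes out in the wild (unknowable)": None,  # no defensible number exists
}

for population, p_diff in candidate_denominators.items():
    if p_diff is None:
        print(f"{population}: LR undefined -- no data for the denominator")
    else:
        print(f"{population}: LR is roughly {likelihood_ratio(p_same, p_diff):,.0f}")
```

Same shoe, same print, same examiner judgment in the numerator: the reported strength of the evidence moves from the hundreds to the tens of thousands purely on the choice of denominator, and collapses entirely when no defensible denominator exists.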
Verbal scales are also used in some forensic sciences* and elsewhere, but these are fraught with potential issues as well (Koehler et al.):
if studies show that people treat, say, a 10,000:1 LR as if it were a 100:1 LR when the term “more likely” is used, then a different qualitative phrase is needed. It is not appropriate to simply assign verbal labels to LRs without knowing how people interpret those labels.
Words matter, especially everyday words used as technical terms. Lacking statistical backing, forensic examiners use words like “rare,” “common,” “typical,” and so on when interpreting their results. This can be misleading and can suggest that the results are better justified than they actually are.
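Here is an equally hypothetical sketch of why that matters. The scale below is invented purely for illustration (it is not any agency’s actual standard); it just shows how a single verbal label can flatten LRs that differ by orders of magnitude, which is the information loss Koehler and co-authors describe:

```python
# A hypothetical verbal scale, invented only for illustration.

VERBAL_SCALE = [
    (1, 100, "provides support"),
    (100, 10_000, "provides strong support"),
    (10_000, float("inf"), "provides very strong support"),
]

def verbal_label(lr: float) -> str:
    """Map a numeric LR onto the invented scale above."""
    for low, high, label in VERBAL_SCALE:
        if low <= lr < high:
            return label
    return "provides no support"

# 150 and 9,999 differ by nearly two orders of magnitude, yet get the same
# phrase; 10,001 barely differs from 9,999 but gets a stronger one.
for lr in (150, 9_999, 10_001):
    print(f"LR {lr:>6,}: the evidence {verbal_label(lr)}")
```

The jury hears the phrase, not the number, so whatever distinctions the number carried are gone by the time the words reach them.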
I’ve always characterized “reliability” as “accuracy with precision.” Reliability, however, is necessary but not sufficient: Error rates are only part of the equation. Likewise, transparency and honesty are not enough:
…consider the practical consequences for a researcher who eagerly accepts the message of ethical and practical values of sharing and openness, but does not learn about the importance of data quality. He or she could then just be driving very carefully and very efficiently into a brick wall…To learn from errors, we want a system of science that facilitates and provides incentives for such learning; we don’t want an attitude that automatically links error to secrecy or dishonesty. [emphasis added]
This is especially true in the forensic sciences*. What do we say, then, on the stand? How do we accurately convey our findings and their strength (or weakness) without statistics? It’s not (necessarily) that forensic methods are inaccurate or fanciful: We either don’t (not really) or can’t (yet) know how reliable they are.
Interpretation of forensic science* outcomes is ripe for some real research innovation, but we’d rather investigate how to lower already low detection limits in existing methods. I guess that’s safer, huh?