23andMe's State-of-the-Art Geographic Ancestry Analysis
Ancestry Composition is a powerful, well-tested system for analyzing your ancestry based on your DNA. We believe it sets a new standard in the industry for rigor. Here we'll try to explain how the analysis works in an accessible way, and present some key test results. If you have questions, please post them in the 23andMe Community. You might also like to check out our more technical ASHG poster on the system.
All DNA ancestry analyses rely on the same signal to produce their results — they differ only in how they go about capturing this signal. The signal that they use is the association between a DNA marker and a geographic location. DNA markers vary widely in how strongly they are associated with a geographic location.
Here's an example to illustrate the idea. This image shows the frequency of the maternal haplogroup H around the world. You can see that H is very common in Europe, is found in Africa and Asia, and never seen in Australia or the Americas.
The association between this marker and geographic location works both ways: If you know you have European ancestry, we'd know that there's a decent chance you have the H haplogroup. On the other hand, if you have the H haplogroup, we'd know that it's unlikely that you're Native American.
Just on the basis of this one DNA marker, we wouldn't be able to locate your ancestry with much precision. Fortunately, there are quite a few DNA markers available on the 23andMe platform. If we combine the evidence from many markers like this haplogroup, each of which offers a little bit of information about where in the world you're from, we can develop a clear overall picture.
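To see why many weak markers add up to a clear picture, here is a minimal sketch in Python. It is not how Ancestry Composition works internally. It simply multiplies per-marker likelihoods, assuming the markers are independent and the populations are equally likely to begin with. The population names, marker names, and frequencies are all invented for illustration.

```python
import math

# Hypothetical frequencies of three markers in two made-up populations.
# These numbers are invented for illustration only.
MARKER_FREQ = {
    "pop_A": {"m1": 0.40, "m2": 0.70, "m3": 0.60},
    "pop_B": {"m1": 0.10, "m2": 0.50, "m3": 0.20},
}

def posterior(observed_markers):
    """Return P(population | markers), assuming independent markers
    and an equal prior probability for each population."""
    log_like = {
        pop: sum(math.log(freqs[m]) for m in observed_markers)
        for pop, freqs in MARKER_FREQ.items()
    }
    # Normalize in a numerically stable way.
    top = max(log_like.values())
    weights = {pop: math.exp(ll - top) for pop, ll in log_like.items()}
    total = sum(weights.values())
    return {pop: w / total for pop, w in weights.items()}

one_marker = posterior(["m1"])
three_markers = posterior(["m1", "m2", "m3"])
print(one_marker["pop_A"], three_markers["pop_A"])
```

With these made-up numbers, one marker alone favors pop_A at 80 percent, and adding two more markers pushes that above 90 percent.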
Wrinkle #1: People Usually Have Multiple Ancestries
If all your DNA was from one place in the world, figuring out where you're from would be very easy, and very accurate. Recent research has suggested that, when you're looking at someone of European ancestry that you know to have all their ancestry from one place, you can locate their ancestral home within a range of around 100 miles!
For most people, though, it just isn't true that all their ancestry comes from one place. The technical word for this is "admixture" — the genetic mixing together of previously-separate populations. It's common for European customers to be mixes, with contributions from all around Europe. Latino customers typically have Native American, European, and often some African DNA.
Wrinkle #2: We Don't Know Which DNA Comes From Mom and Dad
Recall that for each of your 23 chromosome pairs, one chromosome in each pair comes from your mom, and the other from your dad. Genotyping chips like the one 23andMe uses don't capture the information about which markers came from which parent.
Here's a quick example to illustrate. Say, for a short stretch of Chromosome 1, you inherited the following genotypes:
from Dad: A-T-C
from Mom: G-T-A
When we get your data and look at your genotype in this spot on Chromosome 1, we'll see the following:
You: A/G T/T A/C
The markers on the ends are jumbled up. Now there are two possibilities that are consistent with the data, and we don't know which it is. It could be:
A-T-A and G-T-C
which happens to be wrong, or it could be:
A-T-C and G-T-A
which is right. The technical term for knowing which markers are on the same chromosome together is "phase." Data from a genotyping chip like ours is called "unphased." There are methods to infer the phase, however.
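The ambiguity can be enumerated directly. This short Python sketch lists every pair of chromosomes consistent with the unphased genotype A/G T/T A/C from the example. The function name is ours, chosen for illustration.

```python
from itertools import product

def possible_phasings(genotype):
    """List every pair of haplotypes consistent with an unphased genotype.

    `genotype` is a list of (allele, allele) tuples, one per marker.
    """
    phasings = set()
    # At each marker, either allele could sit on the first chromosome.
    for choice in product([0, 1], repeat=len(genotype)):
        hap1 = "-".join(g[c] for g, c in zip(genotype, choice))
        hap2 = "-".join(g[1 - c] for g, c in zip(genotype, choice))
        # The order of the two chromosomes doesn't matter, so sort the pair.
        phasings.add(tuple(sorted((hap1, hap2))))
    return sorted(phasings)

unphased = [("A", "G"), ("T", "T"), ("A", "C")]
pairs = possible_phasings(unphased)
for hap1, hap2 in pairs:
    print(hap1, "and", hap2)
```

Only two distinct pairs come out, and the genotype data alone can't tell us which one is real. Each additional marker where the two alleles differ doubles the number of possibilities, which is why phasing needs statistical help.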
So what? This matters because runs of SNP markers are more informative about geography than are individual SNP markers. In the above example, the run A-T-C will generally say more about ancestry than do the A, T and C considered individually.
Ancestry Composition Overview
There are several different ways to deal with admixture and unknown phase. The approach we take at 23andMe is simple, but powerful.
We start off by phasing your chromosomes using an in-house adaptation of the well-known program BEAGLE. That's how we handle phasing.
Then we break up the chromosomes into short, non-overlapping, adjacent windows, like boxcars in a train. The idea is that you inherited all the DNA in that window from a single parent, grandparent, great-grandparent, etc., going back quite a few generations. Using these short windows is how we handle admixture.
We compare the DNA you have in each window to the DNA in the same window in our reference dataset, and assign the DNA in each window to the population whose DNA it's most similar to.
Then we process those assignments further, "smoothing" them out. For instance, if you have a long run of assignments from population A, interrupted by an assignment to population B, this process can correct that B to an A.
Finally, we calibrate the results to ensure the results are accurate at the confidence levels we report.
Prep 1: The Dataset
You need a lot of reference data in order to make an analysis like this work. We compiled a set of 10,418 people with known ancestry, from within 23andMe and from public sources. That's over 20,000 copies of each chromosome, since every individual carries one copy from their mother and one from their father. This is a big jump over the 210 individuals that powered our original Ancestry Painting feature.
Most of the reference dataset comes from 23andMe members just like you. When someone tells us that they have four grandparents all born in the same country, and the country isn't a colonial nation like the US, Canada or Australia, they become candidates for inclusion in the reference dataset. We filter out all but one of any set of closely-related people, since they can distort the results. And we remove "outliers," people whose genetic ancestry doesn't seem to match up with their survey answers. To ensure a clean dataset, we filter fairly aggressively: nearly ten percent of reference population candidates are removed.
The public reference datasets we've drawn from include the Human Genome Diversity Project, HapMap, and the 1000 Genomes project. We perform the same filtering on these public reference datasets as we do on the customer dataset.
Prep 2: Population Selection
Although customers report their ancestry at the country level, the populations we use in Ancestry Composition typically refer to several modern countries, as with "British and Irish."
We select populations in Ancestry Composition by studying the reference individuals, choosing candidate populations that appear to cluster together, and then evaluating whether we can distinguish the groups in practice. Using this method we refined the candidate reference populations until we arrived at a set that worked.
Here's an example of one of these diagnostic plots. The European reference set is laid out using principal components analysis, solely on the basis of their genetic distances to each other. We apply different plotting symbols and colors after the fact, based on their known ancestry. You can see that people from the same population tend to cluster together well. Some populations, like the blue-triangled Finns on the left, are relatively isolated from the other populations. Because Finns are so distinct from other populations, they actually get their own reference population in Ancestry Composition. Most country-level populations overlap to some degree, though. That's where we experimented with different groupings of the country-level populations.
Populations may be inherently difficult to distinguish because of historical mixing, or we might not have had enough data to tell them apart. As we obtain more data, populations will become easier to distinguish.
Step 1: Phasing
We phase customer and reference data using our own version of Brian Browning's BEAGLE software. With a tip of the hat to Darwin, we named our version "Finch." Finch uses statistical analysis to separate each parent's contribution to a person's DNA, without requiring the parent's DNA. It doesn't say which DNA is from your mother, and which is from your father — for that you do need a parent's DNA.
We wrote our own version of BEAGLE so it would work smoothly in our production environment. Because Finch and BEAGLE use the same underlying algorithm, Finch achieves phasing accuracy consistent with that of BEAGLE.
There's one important difference between Finch and BEAGLE. BEAGLE makes the assumption that all of the individuals that need to be phased are available when the program is run. That assumption is not true for the 23andMe database, since new customers join every day. To avoid the computational costs of re-running the analysis from scratch, we modified BEAGLE to efficiently handle customers that weren't present in the initial sample.
Step 2: Window Classification
After phasing customer genetic data, we segment the chromosomes into consecutive windows of about 100 markers. There are between 5,000 and 40,000 markers on a chromosome on the 23andMe platform, so this works out to 50 to 400 windows, depending on the chromosome's length. We then take each of those windows in turn and compare it against the reference populations to determine which population that window most likely comes from.
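The window arithmetic is easy to check. Here is a minimal Python sketch that cuts a chromosome's markers into consecutive, non-overlapping windows. The function name and the use of plain integer lists as stand-ins for markers are our own assumptions, not part of the production system.

```python
def split_into_windows(markers, window_size=100):
    """Cut a list of markers into consecutive, non-overlapping windows."""
    return [
        markers[start:start + window_size]
        for start in range(0, len(markers), window_size)
    ]

# Integer lists stand in for the markers on a short and a long chromosome.
short_chrom = list(range(5_000))
long_chrom = list(range(40_000))

n_short = len(split_into_windows(short_chrom))
n_long = len(split_into_windows(long_chrom))
print(n_short, n_long)  # 50 400
```

If the marker count isn't an exact multiple of the window size, this sketch simply leaves a shorter final window.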
There are many ways to approach this classification, and we tried several. The best-performing option was a well-known classification tool called a support vector machine, or SVM. An SVM can "learn" different classifications based on a set of training examples, and then assign new objects to a learned category.
In the case of Ancestry Composition, we supply the SVM with strings of DNA and tell it where those strings are from, e.g. "Finland," based on our reference database. Then, when we look at the DNA from a customer with unknown ancestry, we can ask the SVM to classify it for us based on the examples it has seen in the past.
We chose SVMs because they performed the best among techniques that we tried. SVMs are also very fast, which is critical for a large and growing database.
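To show the learn-from-examples, then-classify flow in miniature, here is a Python sketch. Note that it deliberately swaps in a much simpler rule than an SVM: it assigns a window to the population of the closest reference window, counting mismatched markers. The reference sequences and the labels X and Z are invented for illustration.

```python
# Toy reference windows: (haplotype string, population label).
# Sequences and labels are invented for illustration.
REFERENCE = [
    ("ATCGA", "X"),
    ("ATCGG", "X"),
    ("GTAGA", "Z"),
    ("GTACA", "Z"),
]

def hamming(a, b):
    """Count positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def classify_window(window):
    """Return the label of the closest reference window.

    This is a nearest-match stand-in, not a support vector machine.
    """
    _, label = min(REFERENCE, key=lambda ref: hamming(window, ref[0]))
    return label

labels = [classify_window(w) for w in ["ATCGA", "GTAGG", "ATCCA"]]
print(labels)  # ['X', 'Z', 'X']
```

A real classifier has to cope with ties, near-ties, and far more populations, which is part of why a trained model such as an SVM does better than a raw nearest-match rule.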
Step 3: Smoothing
The SVM classifies each window for us independently, giving us a "first draft" version of the customer's ancestry. For example, suppose we just had three populations: X, Y, and Z. What comes out of the SVM might look like this:
chromosome 1, parent 1: X - X - Y - X - Z - Z - Z - X - Z
chromosome 1, parent 2: Z - Z - Z - Z - X - Y - X - X - X
The smoother's job is to try to correct or "smooth" two kinds of mistakes, both of which we can see in this example.
The first kind of mistake is an unusual assignment amid a run of similar assignments. In the first line above, there's a run of Z's, interrupted by a single X: Z - Z - Z - X - Z. It may have been that that lone X was a close call between X and Z, that went the wrong way. But in context, we can see that it probably should have been a Z. The smoother can correct this to Z - Z - Z - Z - Z.
The second kind of mistake is one inherited from the phasing step. Finch can make a mistake known as a "switch error," where it mixes up the DNA of one parent with that of the other. The smoother can switch the ancestry assignments back between your two versions of a given chromosome. In this case, it looks like there's a switch error after the fourth window: it would be better to keep the runs of X's together, and the runs of Z's together.
The smoother uses a version of another well-known mathematical tool called a Hidden Markov Model. The Hidden Markov Model is useful for analyzing sequential data, such as biological sequences or recorded speech.
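To make the idea concrete, here is a minimal Python sketch of a Hidden Markov Model correcting the first kind of mistake on a single chromosome. It is a textbook Viterbi decoder, not the production smoother, and it does not handle switch errors. The two probabilities are assumptions chosen for illustration: neighboring windows share ancestry 90 percent of the time, and the per-window call is right 80 percent of the time.

```python
import math

STATES = ["X", "Y", "Z"]
P_STAY = 0.9     # assumed chance that neighboring windows share ancestry
P_CORRECT = 0.8  # assumed chance that a per-window call is right

def log_transition(prev, cur):
    p = P_STAY if prev == cur else (1 - P_STAY) / (len(STATES) - 1)
    return math.log(p)

def log_emission(state, call):
    p = P_CORRECT if state == call else (1 - P_CORRECT) / (len(STATES) - 1)
    return math.log(p)

def smooth(calls):
    """Viterbi decoding: the most likely true sequence given noisy calls."""
    # best[s] = (log-probability, path) of the best path ending in state s
    best = {
        s: (math.log(1 / len(STATES)) + log_emission(s, calls[0]), [s])
        for s in STATES
    }
    for call in calls[1:]:
        new_best = {}
        for cur in STATES:
            score, path = max(
                (best[prev][0] + log_transition(prev, cur), best[prev][1])
                for prev in STATES
            )
            new_best[cur] = (score + log_emission(cur, call), path + [cur])
        best = new_best
    return max(best.values())[1]

smoothed = smooth("Z Z Z X Z".split())
print(" - ".join(smoothed))  # Z - Z - Z - Z - Z
```

Under these assumed settings, overriding one doubtful call is much cheaper than believing in two ancestry changes in a row, so the lone X is smoothed away.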
The smoother will output something like this:
chromosome 1, parent 1: Z - Z - Z - Z - Z - Z - Z - Z - Z
chromosome 1, parent 2: X - X - X - X - X - X - X - X - X
This simplified example illustrates the purpose of the smoother. With real human data the picture is messier, and the answers are rarely so clean. So instead of recording a single population in each window as above, we record the smoother's estimates of the probabilities of each population occurring in each window. The following picture should make this more concrete:
This is the output of the smoother for an African-American customer's two copies of chromosome 2. The top panel shows the smoother's estimates for one parent's contribution — the chromosome the individual got from that parent. The bottom panel shows the other parent's contribution.
Let's look at the bottom panel. It starts off with a run of pink, then a run of green, then another run of pink. Pink is our color for African, and green is our color for Native American. The y-axis runs from 0 to 1, and is our estimate of the probability that the DNA in that region of the chromosome is from that population. These pink and green regions fill the entire vertical space, indicating that we are 100 percent confident that the DNA in these regions is African and Native American, respectively.
In the next region to the right, we encounter a stretch of multi-colored blue. The thickest strip is the dark teal corresponding to Britain and Ireland, meaning that it has the highest probability associated with it. We give it somewhere between a 50 percent chance and a 60 percent chance of being from Britain and Ireland. Smaller strips corresponding to Italy, Iberia, and France/Benelux/Germany are also present. If you think back to the haplogroup example above, this is not unexpected. Haplogroup H is found in lots of places, just more commonly in some places, and less commonly in others. Here, the system is saying that it's found the DNA the customer has on this chromosome all over Europe. We can very confidently say here that this stretch of DNA is of European origin, but the evidence doesn't allow us to say more than that.
By contrast, in the first "European" stretch of DNA from the left in the top panel, we are more confident that this DNA is of British origin, around 70 percent.
Step 4: Calibration
These plots are illuminating, but how do we know they are correct? This is what calibration of the results accomplishes.
First, we ran some tests just to establish whether there were any systematic biases. We simulated a large set of admixed individuals, so that we would know the "true" ancestry of each part of their genome. Then we ran these simulated individuals through the entire pipeline, and compared the results with the actual answers.
We found that most of the reference populations were already fairly well-calibrated, in the sense that we predicted ancestry in proportion to how often it actually occurred in the simulated dataset. But there were a few populations, in particular the Scandinavian and Balkans reference populations, that we found to be overrepresented relative to our simulated data.
In light of this finding, we developed a fairly simple recalibration scheme that adjusts the ancestry proportions seen in the previous section so that, over the entire set of simulations of admixed individuals, each assignment is reported in proportion to how often it actually occurred.
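The details of the recalibration scheme aren't given here, so the following Python sketch shows only one plausible shape for such an adjustment. The assumption is that each population's probability is scaled by the ratio of its true share to its predicted share in simulation, and then renormalized. All of the numbers are invented for illustration.

```python
# Invented totals from a hypothetical simulation run: the share of the
# genome assigned to each population versus its true share.
predicted_share = {"Scandinavian": 0.30, "Balkan": 0.25, "Italian": 0.45}
true_share = {"Scandinavian": 0.25, "Balkan": 0.20, "Italian": 0.55}

# A factor below 1 means the population was over-called in simulation.
factors = {pop: true_share[pop] / predicted_share[pop] for pop in true_share}

def recalibrate(window_probs):
    """Rescale one window's probabilities and renormalize to sum to 1."""
    scaled = {pop: p * factors[pop] for pop, p in window_probs.items()}
    total = sum(scaled.values())
    return {pop: p / total for pop, p in scaled.items()}

adjusted = recalibrate({"Scandinavian": 0.5, "Balkan": 0.1, "Italian": 0.4})
print(adjusted)
```

In this toy run the over-called Scandinavian probability drops below its original 0.5, and the under-called Italian probability rises above 0.4.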
Step 5: Aggregation & Reporting
As a last step we summarize the results for display. The way we do this is to apply a threshold to the probability plot, like so:
Here, we're showing a 70 percent confidence threshold. We'll run across the chromosome from left to right, and ask whether any population has confidence exceeding that threshold. You can see that with the exception of the blue European stretch, we exceed this threshold over the majority of the chromosome. These regions will contribute in proportion to their size to the overall ancestry percentages. For example, suppose the green Native American segment near the right end works out to be 0.26 percent of the entire genome. Even though there is some probability that the segment comes from other populations, the Native American probability exceeds the current 70 percent threshold, and so we'll add 0.26 percent Native American to the overall ancestry percentages.
In the case of the European segment, no single population exceeds our 70 percent threshold, so we won't report that DNA as coming from any of those populations. In this case, we refer to our hierarchy of reference populations. For example, we have a "Northern European" group that contains three reference populations: Britain & Ireland, Scandinavia, and France & Germany. We'll add up the contributions of each of these subgroups, and see if the total "Northern European" contribution exceeds the 70 percent threshold. If it did, then we'd report the region as "Broadly Northern European".
In this particular case, the Northern European reference populations do not exceed 70 percent. So we go up another step, and see if the contributions of all the European populations exceed 70 percent. They do, and so we would report this region as "Broadly European."
You can see the entire hierarchy in the Ancestry Composition table by clicking "Show all populations."
In regions where we go all the way to the top of our hierarchy, but no group of populations exceeds the threshold in place, we'll report "Unassigned."
We've built in three confidence thresholds to Ancestry Composition. These are Speculative (51 percent), Standard (75 percent), and Conservative (90 percent).
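The walk up the hierarchy can be sketched in a few lines of Python. This is an illustration of the logic described above, not the production code. The hierarchy is only a small slice, the Southern European membership is an assumption, and the window probabilities are invented.

```python
# Thresholds named in the text.
THRESHOLDS = {"Speculative": 0.51, "Standard": 0.75, "Conservative": 0.90}

# A small slice of a population hierarchy, from most to least specific.
# The Northern group follows the example in the text; the Southern
# membership is an assumption made for this sketch.
NORTHERN = ["British & Irish", "Scandinavian", "French & German"]
SOUTHERN = ["Italian", "Iberian"]
LEVELS = [
    {"Broadly Northern European": NORTHERN,
     "Broadly Southern European": SOUTHERN},
    {"Broadly European": NORTHERN + SOUTHERN},
]

def label_window(probs, threshold):
    """Return the most specific label whose probability exceeds the threshold."""
    best = max(probs, key=probs.get)
    if probs[best] > threshold:
        return best
    for level in LEVELS:
        for name, members in level.items():
            if sum(probs.get(m, 0.0) for m in members) > threshold:
                return name
    return "Unassigned"

# Invented probabilities for one window, loosely like the European stretch.
window = {"British & Irish": 0.55, "French & German": 0.10,
          "Italian": 0.20, "Iberian": 0.13, "West African": 0.02}

labels = {name: label_window(window, t) for name, t in THRESHOLDS.items()}
print(labels)
```

With these numbers the window is reported as British & Irish at the Speculative setting, but as Broadly European at Standard and Conservative, because the Northern European total of 65 percent doesn't clear either of those thresholds.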
Using Close Family Members
Ancestry Composition gets even more powerful if your parents or children are in the 23andMe database. You get higher-resolution results with either of these relatives added, and if you have a parent in the system you get an additional view into your results.
To enable this, you'll need to be sharing with your parents and/or children. Learn more about sharing.
When you have a parent or child in the system, you'll get a very high-quality phasing from Finch. That translates into better Ancestry Composition results, in the sense that you'll tend to see more assignment to the finer-resolution populations: more "Scandinavian," less "Northern European."
Why is that? When you think back to our smoother example, we had two kinds of mistakes to correct. We had mistakes along the chromosome, and we had mistakes between the chromosomes. When you're phased on the basis of a parent or child, it allows us to ignore mistakes between the chromosomes because they essentially do not occur, so we can be more confident in correcting the along-chromosome mistakes.
You get that boost in resolution with either a child or a parent in the system. It's slightly better to use a parent than to use a child, and it's slightly better to use two parents than one parent — however, you get the majority of the benefit with the first relative you add. The system will automatically use the best available configuration, and it will also automatically update your results if you share with a new relative. (As of this writing, it can take up to 48 hours for the results to update.)
If you have one or both parents in the system, you get an extra view. You'll be able to see "Split View," where we show the contribution of your mother to your ancestry on one side, and the contribution of your father to your ancestry on the other. We can't provide this view if you don't have a parent available, because a parent is necessary to orient the results.
Here's an example of what you can do with this unprecedented view into your ancestry. Say that you find you have a small amount of Ashkenazi Jewish ancestry in your results. When you turn to split view, you'll be able to see immediately which parent that came from.
Testing & Validation
Ancestry Composition has a lot of steps, and each step has to be tested. We've discussed a few of those tests along the way. In this section, we want to share some actual test results to give a sense of how accurate Ancestry Composition is. We emphasize the final test in the chain, because that integrates the performance of each of the steps into an overall picture.
We have to introduce a couple of technical terms here, "precision" and "recall." They both arise pretty naturally when you're trying to test how well a prediction system works, and are the standard measures that researchers use. "Precision" corresponds to the question "When the system predicts that a piece of DNA is from population A, how often is it actually from population A?" "Recall" corresponds to the question "Of the pieces of DNA that actually were from population A, how often did the system predict that they were from population A?"
Why is the balance of precision and recall important? A high-precision, low-recall system will be extremely picky about assigning, say, Scandinavian ancestry. It will only assign it when it's very, very sure. That will yield high precision since the assignment of Scandinavian is always correct, but low recall, because a lot of true Scandinavian ancestry is left unassigned.
With a low-precision, high-recall system the opposite problem exists. In this case, the system is liberal with assignments of Scandinavian ancestry. Any time it was indicated that a piece of DNA was Scandinavian, it would be assigned as such. This will yield high recall, as all genuine Scandinavian DNA will be assigned, but low precision, because Scandinavian ancestry will often be assigned incorrectly.
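Both measures are simple counts. Here is a minimal Python sketch with toy labels for ten pieces of DNA. The data are invented, and they mimic a cautious system that leaves much of the true Scandinavian DNA at a broader label.

```python
def precision_recall(truth, predicted, population):
    """Compute precision and recall for one population.

    `truth` and `predicted` are equal-length lists of labels, one per
    piece of DNA.
    """
    true_pos = sum(t == population and p == population
                   for t, p in zip(truth, predicted))
    n_predicted = sum(p == population for p in predicted)
    n_actual = sum(t == population for t in truth)
    precision = true_pos / n_predicted if n_predicted else float("nan")
    recall = true_pos / n_actual if n_actual else float("nan")
    return precision, recall

# Toy labels, invented for illustration.
truth = ["Scandinavian"] * 5 + ["British & Irish"] * 5
predicted = (["Scandinavian"] * 2
             + ["Broadly Northern European"] * 3
             + ["British & Irish"] * 5)

precision, recall = precision_recall(truth, predicted, "Scandinavian")
print(precision, recall)  # 1.0 0.4
```

Every Scandinavian call here is correct, so precision is 1.0, but only two of the five truly Scandinavian pieces were labeled as such, so recall is 0.4.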
Clearly the ideal system has both high precision and high recall. Let's see how Ancestry Composition does. What we've done here is to set apart 20% of the reference database, about 1500 individuals of known ancestry. Then we train the entire Ancestry Composition pipeline on the other 80% of the reference individuals and run it on the held-out 20%. These held-out individuals are like new 23andMe customers, with DNA that the system hasn't seen before. Since they have known ancestry, we can check to see how accurate their results are after running the test. We run this test five times, with a different 20% held out each time, and then average across the five tests to give the following results:
|Population|Precision|Recall|
|---|---|---|
|Central & South African|1.00|0.89|
|East Asian & Native American|0.99|0.99|
|British & Irish|0.90|0.39|
|French & German|0.78|0.08|
|Middle Eastern & North African|0.95|0.83|
As you can see, our precision numbers are very high across the board, mostly above 90%, in a few instances dipping down to and below 80%. That means that when the system assigns an ancestry to a piece of DNA, it is very likely to be accurate. You can also see that as you move up from the sub-regional level (e.g. Britain and Ireland) to the regional level (e.g. Northern Europe) to the continental level (e.g. Europe), the precision approaches 100%.
Ancestry Composition's recall is also impressive. The numbers tend to be lower than the precision. That means that it can be reluctant to assign ancestry to pieces of DNA when it's not sure enough. In the worst case, the French & German population, the recall is 8%, meaning that 92% of the actual French & German DNA was not labeled as such. Again, note that as with the precision, the recall values get better and better as you move from sub-regional to regional to continental.
Sometimes, poor recall doesn't mean bad results. Some populations, like Sardinian, are just hard to tell apart from others. It's important to note that when Ancestry Composition fails to assign Sardinian DNA, this does not necessarily mean that it incorrectly assigns it to something else, like Italian. If it did, that would show up as poor precision for the Italian population. Instead, Ancestry Composition will label this Sardinian DNA "Broadly Southern European" or "Broadly European."
Ancestry Composition's Future
Ancestry Composition, as you've seen, has a modular design. This was intentional, because it makes it possible to improve the components of the system independently. We can upgrade Finch's phasing reference database, or the SVM's reference database of people of known ancestry.
Most of us have become accustomed to the idea of semi-regular software updates, and we hope to apply this same model to Ancestry Composition. When we improve some component of the system or upgrade one of the reference databases, your results will automatically be updated and you'll see a note about what has changed and why.
Updated 29 May 2014