23andMe's State-of-the-Art Geographic Ancestry Analysis
23andMe Ancestry Composition is a powerful, well-tested system for analyzing ancestry based on DNA, and we believe it sets a standard for rigor in the genetic ancestry industry. We wrote this document to explain how our analysis works and to present some quality control test results.
Your Ancestry Composition report shows the percentage of your DNA that comes from each of 31 different ancestry populations worldwide. We calculate your Ancestry Composition by comparing your genome to the genomes of over 10,000 people with known ancestry. When a segment of your DNA matches the DNA from one of the 31 populations with high probability, we assign that ancestry to that segment of your DNA. We calculate the ancestry for individual segments of your genome separately, and then we add them together to get your overall Ancestry Composition.
DNA and Ancestry
Different DNA markers are found in different places across the world, and every marker has its own characteristic pattern of geographical variation. The 23andMe Ancestry Composition algorithm combines information about these patterns with the unique set of DNA markers in your genome to estimate your genetic ancestry.
Here's an example of a haplogroup, one special kind of DNA marker, that illustrates the idea. This map shows the frequency of the maternal haplogroup H around the world. Haplogroup H is very common in Europe, is also found in Africa and Asia, and is never seen in people native to Australia or the Americas.
The association between this marker and geographic location works in two ways: If you know you have European ancestry, we know that there's a decent chance you have the H haplogroup. And if you have the H haplogroup, we know that your genetic history likely includes at least one European ancestor.
Just on the basis of this one DNA marker, we can't locate your ancestry with much precision. Fortunately, we measure hundreds of thousands of DNA markers on the 23andMe platform. If we combine the evidence from many markers, each of which offers a little bit of information about where in the world you're from, we can develop a clear overall picture.
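To see how many weak clues can add up, here is a minimal Python sketch. The marker names, the two regions, and every frequency below are invented for illustration and are not 23andMe data. Treating markers as independent and giving both regions equal prior weight are simplifying assumptions made only for this example.

```python
import math

# Hypothetical frequencies of a variant at three markers in two regions.
# These numbers are invented for illustration only.
FREQ = {
    "region_A": {"m1": 0.60, "m2": 0.55, "m3": 0.70},
    "region_B": {"m1": 0.40, "m2": 0.45, "m3": 0.30},
}

def log_likelihood(region, carried):
    """Sum the log-probabilities of each observed marker state in a region.

    `carried` maps marker name -> True if the variant is present.
    Markers are treated as independent, which is a simplification.
    """
    total = 0.0
    for marker, present in carried.items():
        p = FREQ[region][marker]
        total += math.log(p if present else 1.0 - p)
    return total

def posterior(carried):
    """Return normalized probabilities for each region, assuming equal priors."""
    logs = {r: log_likelihood(r, carried) for r in FREQ}
    top = max(logs.values())
    weights = {r: math.exp(v - top) for r, v in logs.items()}
    norm = sum(weights.values())
    return {r: w / norm for r, w in weights.items()}

one_marker = posterior({"m1": True})
three_markers = posterior({"m1": True, "m2": True, "m3": True})
print(one_marker["region_A"], three_markers["region_A"])
```

With one marker, the evidence for region_A is only mildly better than a coin flip. With three markers that each lean the same way, the combined evidence is noticeably stronger.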
Wrinkle #1: People Usually Have Multiple Ancestries
If all of your DNA came from one place in the world, figuring out where you're from would be very easy and very accurate. Recent research has suggested that, for a European person whose whole family comes from the same place, genetic analysis can locate their ancestral home within a range of around 100 miles!
For most people, though, it just isn't true that all their ancestry comes from one place. The technical word for this is admixture — the genetic mixing together of previously separate populations. For example, it's common for people of European descent to have ancestry from all around Europe. Another example is Latino people, who typically have Native American, European, and often some African DNA.
Our Ancestry Composition algorithm handles the challenge of admixture by breaking up your chromosomes into short, non-overlapping, adjacent windows, like boxcars in a train. If the windows are small enough, then it is safe to assume that you inherited all the DNA in each window from a single parent, grandparent, great-grandparent, etc., going back many generations.
Wrinkle #2: We Don't Know Which DNA Comes From Which Parent
Recall that for each of your 23 chromosome pairs, one chromosome in each pair comes from your mom and the other from your dad. Genotyping chips like the one 23andMe uses don't capture information about which markers came from which parent.
Here's a quick example to illustrate this point. Say, for a short stretch of Chromosome 1, you inherited the following genotypes at three consecutive DNA markers:
from Dad: A-T-C
from Mom: G-T-A
When we look at your raw 23andMe data in this spot on Chromosome 1, we'll see the following:
You: A/G - T/T - A/C
The genotypes where you inherited different variants from mom and dad (in this case, the markers on the ends) are jumbled up. There are two possible pairs of sequences that are consistent with the raw data, and we don't know which one is your actual DNA sequence. It could be:
A-T-A and G-T-C
which happens to be wrong, or it could be:
A-T-C and G-T-A
which is right. The technical term for knowing which markers belong on the same chromosome together is phasing. DNA data like our raw data is called unphased.
So what? This matters because we can learn more from long runs of many DNA markers together than we can learn from individual DNA markers alone. In the above example, the sequence A-T-C will generally say more about your ancestry than the A, T, and C say when they are considered separately. Luckily, we can use statistical methods (in this case, an in-house adaptation of the well-known computational program BEAGLE) to estimate the phasing of your chromosomes. After your raw data are phased, the algorithm for calculating Ancestry Composition is executed separately on each phased chromosome.
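To make the phasing ambiguity concrete, the short Python sketch below lists every pair of chromosome sequences that is consistent with a set of unphased genotypes. The function name and the input format are assumptions made for this illustration. The sketch only enumerates the possibilities and does not choose among them, which is the job of a statistical phasing method such as BEAGLE.

```python
from itertools import product

def possible_phasings(genotypes):
    """List every pair of chromosome sequences consistent with unphased data.

    `genotypes` is a list of two-letter tuples, one per marker, e.g. ("A", "G").
    Swapping the two chromosomes gives the same pair, so the first
    heterozygous marker is held fixed to avoid counting each pair twice.
    """
    het_sites = [i for i, (a, b) in enumerate(genotypes) if a != b]
    pairs = []
    # Each heterozygous marker after the first can be oriented two ways.
    for flips in product([False, True], repeat=max(len(het_sites) - 1, 0)):
        flip_at = dict(zip(het_sites[1:], flips))
        first, second = [], []
        for i, (a, b) in enumerate(genotypes):
            if flip_at.get(i, False):
                a, b = b, a
            first.append(a)
            second.append(b)
        pairs.append(("-".join(first), "-".join(second)))
    return pairs

raw = [("A", "G"), ("T", "T"), ("A", "C")]
for pair in possible_phasings(raw):
    print(pair)
```

For the three-marker example this prints two candidate pairs. In general, the number of candidates doubles with every additional marker where the two chromosomes differ, which is why statistical help is needed.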
The Setup: Defining Ancestry Populations
Prep 1: The Datasets
The Ancestry Composition algorithm calculates your ancestry by comparing your genome to the genomes of people whose ancestries we already know. To make this work, we need a lot of reference data! Our reference datasets include genomes from 10,418 people who were carefully chosen to reflect populations that existed before transcontinental travel and migration were common (at least 500 years ago).
Most of the reference individuals are 23andMe customers who have consented to participate in research. When a 23andMe research participant tells us that they have four grandparents all born in the same country (and the country isn't a colonial nation like the US, Canada, or Australia), that person becomes a candidate for inclusion in the reference data. We filter out all but one of any set of closely related people, since including close relatives can distort the results. We also remove outliers: people whose genetic ancestry doesn't seem to match up with their survey answers. To ensure a clean dataset, we filter aggressively, and nearly ten percent of reference dataset candidates don't make the cut.
The public reference datasets we draw from include the Human Genome Diversity Project, HapMap, and the 1000 Genomes project. We perform the same filtering on these public reference data as we do on the 23andMe customer data.
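The first screening rule described above (four grandparents born in the same, non-colonial country) is simple enough to sketch in Python. The function name and the country set are assumptions for illustration. The set holds only the three examples named in the text and is not a complete list, and the relatedness and outlier filters are not shown because they depend on genetic data.

```python
# Only the three examples named in the text; not a complete list.
COLONIAL_EXAMPLES = {"US", "Canada", "Australia"}

def is_reference_candidate(grandparent_countries):
    """Return True if the survey answers pass the grandparent rule.

    `grandparent_countries` is a list of birth countries, one per grandparent.
    """
    countries = set(grandparent_countries)
    return (
        len(grandparent_countries) == 4
        and len(countries) == 1
        and not (countries & COLONIAL_EXAMPLES)
    )

print(is_reference_candidate(["Finland"] * 4))
print(is_reference_candidate(["US"] * 4))
```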
Prep 2: Population Selection
The 31 Ancestry Composition populations are defined by genetically similar groups of people with known ancestry. We select Ancestry Composition populations by studying the reference datasets, choosing candidate populations that appear to cluster together, and then evaluating whether we can distinguish those groups in practice. Using this method, we refined the candidate reference populations until we arrived at a set that worked.
Here's an example of one of the diagnostic plots we use to select populations. The genomes in the European reference datasets are plotted using principal component analysis, which shows their overall genetic distance from each other. Each point on the plot represents one person, and we labeled the points with different symbols and colors based on their known ancestry. You can see that people from the same population (labeled with the same symbol) tend to cluster together. Some populations, like the Finns (the blue triangles on the left), are relatively isolated from the other populations. Because Finns are so genetically distinct, they have their own reference population in Ancestry Composition. Most country-level populations overlap to some degree, though. In those cases, we experimented with different groupings of country-level populations to find combinations that we could distinguish with high confidence.
Some genetic ancestries are inherently difficult to tell apart because the people in those regions mixed throughout history or have shared history. As we obtain more data, populations will become easier to distinguish, and we will be able to report on more populations in the Ancestry Composition report.
Historically, biomedical research has disproportionately focused on participants of European descent. Due to this bias, and to the fact that a large proportion of 23andMe customers have unmixed European ancestry, we have the most reference data from European populations, and we are able to distinguish more sub-populations from Europe than from any other continent.
In light of this inequity, the 23andMe Research team is constantly working to get new data from diverse populations. Our mission at 23andMe is to help people access, understand, and benefit from the human genome. The best way we can do that for underserved populations is to include their genetic data in our research and in our Ancestry features — maximizing the granularity of Ancestry Composition for all of our customers and helping to combat disparities in genetic science. We have worked proactively to reduce bias in genetics research by initiating projects like the African Genetics Project and our NIH-funded genetic health resource for African Americans. The genetic information we collect through these initiatives and others like them will help to improve features like Ancestry Composition and will benefit the scientific community at large.
The Ancestry Composition Algorithm
There are several different ways to estimate your ancestry using your DNA. The approach we take at 23andMe is simple but powerful.
As described above, we use a computational method to estimate the phasing of your chromosomes. Next, we break up the chromosomes into short windows, and we compare your DNA sequence in each window to the DNA in the same window in our reference datasets. We assign your DNA to the ancestry whose reference DNA it's most similar to, and then we process those assignments computationally to "smooth" them out. Finally, we calibrate the results to ensure that they are accurate. All of the steps in this process are described in more detail in the following sections.
Step 1: Phasing
We use our own version of Brian Browning's BEAGLE software to phase your genome. With a tip of the hat to Darwin, we named our phasing software Finch. Finch uses statistical analysis to separate each parent's contribution to your DNA. It can't say which DNA you inherited from your mother, and which you inherited from your father — for that you do need to compare your genome to your parents' — but it does allow us to estimate phase for your chromosomes.
We wrote Finch to work smoothly in our production environment. Because Finch and BEAGLE use the same underlying algorithm, Finch achieves phasing accuracy consistent with that of BEAGLE.
There's one important difference between Finch and BEAGLE. BEAGLE makes the assumption that all of the individuals who need to be phased are available the first time the program is run. That assumption is not true for the 23andMe database, since new customers join every day. To avoid the computational costs of re-running the analysis from scratch every time a new person joins 23andMe, we modified BEAGLE to efficiently handle new genomes as they are added to our database.
Step 2: Window Classification
After phasing your chromosomes, we segment them into consecutive windows containing about 100 genetic markers each. We measure between 5,000 and 40,000 markers per chromosome, which equals 50 to 400 windows, depending on the chromosome's length. We look at each window in turn and compare your DNA against the reference DNA to determine what ancestry your DNA most likely came from.
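The window arithmetic can be checked with a few lines of Python. This is only a sketch: the function name is hypothetical, and keeping a shorter final window when the marker count is not a multiple of 100 is an assumption, since the text does not say how the remainder is handled.

```python
def split_into_windows(markers, window_size=100):
    """Split an ordered list of markers into consecutive, non-overlapping windows.

    The last window may be shorter than `window_size` (an assumption).
    """
    return [markers[i:i + window_size] for i in range(0, len(markers), window_size)]

small = split_into_windows(list(range(5000)))
large = split_into_windows(list(range(40000)))
print(len(small), len(large))
```

A chromosome with 5,000 markers yields 50 windows and one with 40,000 markers yields 400, matching the range given above.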
There are many ways to assign ancestry to DNA segments based on reference data, and we tried several. The best-performing option was a well-known classification tool called a support vector machine, or SVM. An SVM can "learn" different ancestry classifications based on a set of training examples and then assign new DNA segments to a learned category.
In the case of Ancestry Composition, we train the SVM with reference DNA sequences and tell it which ancestry population those sequences are from. Then, when we look at the DNA from a 23andMe customer with unknown ancestry (like you), we can ask the SVM to classify your DNA for us based on the reference datasets.
We chose an Ancestry Composition algorithm based on SVMs because it performed the best out of all the techniques that we tried. SVMs are also very fast, which is critical for a large and growing database.
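For readers who want to see the idea in code, here is a toy two-population linear SVM written in plain Python. It is a sketch under stated assumptions and not the production classifier: the 0/1 encoding of a window, the tiny four-marker reference windows, the populations X and Z, and the training settings (`epochs`, `lr`, `lam`) are all invented for illustration. The text does not specify the feature encoding, the kernel, or how the 31-population case is handled.

```python
def train_linear_svm(examples, labels, epochs=200, lr=0.1, lam=0.01):
    """Fit a two-class linear SVM by subgradient descent on the hinge loss.

    `examples` are equal-length lists of 0/1 marker values; `labels` are +1 or -1.
    """
    w = [0.0] * len(examples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            # Regularization shrinks the weights a little on every step.
            w = [wi * (1.0 - lr * lam) for wi in w]
            if y * score < 1.0:
                # The example is inside the margin: adjust so it scores better.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def classify(window, w, b):
    """Assign a window to population X (positive score) or Z (otherwise)."""
    score = sum(wi * xi for wi, xi in zip(w, window)) + b
    return "X" if score > 0 else "Z"

# Toy reference windows: 1 means the variant is present at that marker.
reference = [
    ([1, 1, 0, 0], "X"), ([1, 1, 1, 0], "X"), ([1, 0, 0, 0], "X"),
    ([0, 0, 1, 1], "Z"), ([0, 1, 1, 1], "Z"), ([0, 0, 0, 1], "Z"),
]
examples = [x for x, _ in reference]
labels = [1 if pop == "X" else -1 for _, pop in reference]
w, b = train_linear_svm(examples, labels)

# Classify a window that was not part of the training examples.
print(classify([1, 0, 1, 0], w, b))
```

Training happens once on the reference windows. After that, classifying a new window is just a weighted sum, which is one reason this kind of classifier is fast.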
Step 3: Smoothing
The SVM classifies each window of your genome independently, creating a "first draft" version of your ancestry result. We use another computational process, called the smoother, to smooth this raw SVM output. The smoother uses a version of a well-known mathematical tool called a Hidden Markov Model to correct, or "smooth," two kinds of mistakes. Hidden Markov Models are used to analyze sequential data, like biological sequences or recorded speech. As an example, suppose we had three ancestry populations: X, Y, and Z. An example of output from the SVM might look like this:
chromosome 1, parent 1: X - X - X - Z - Z - Z - Y - Z
chromosome 1, parent 2: Z - Z - Z - X - X - X - X - X
The first kind of mistake the smoother corrects is an unusual assignment in the middle of a run of similar assignments. In the first line above, there's a run of Z's interrupted by a lone Y: Z - Z - Z - Y - Z. It's possible that the lone Y was a close call between Y and Z that went the wrong way. If that's the case, then the smoother can correct it to Z - Z - Z - Z - Z.
The second kind of error corrected by the smoother comes from the phasing step. Finch can make a phasing mistake known as a switch error, where it mixes up the DNA of one parent with another. The smoother can switch the ancestry assignments between your mother and your father if it detects one of these errors. In this example, there may be a switch error after the third window. If the switch were reversed, then the runs of X's and the runs of Z's would stay together. In our simplified example, the smoother might output something like this:
chromosome 1, parent 1: Z - Z - Z - Z - Z - Z - Z - Z
chromosome 1, parent 2: X - X - X - X - X - X - X - X
This example illustrates the purpose of the smoother. But with real data the picture is much messier, and the answers are rarely so clean. So instead of assigning a single ancestry to each window like we did in this example, the smoother estimates the probabilities of each Ancestry Composition population matching each window of DNA. The following picture shows a concrete example:
This is the output of the smoother analysis of one copy of chromosome 2. Starting on the left, there is a short run of pink, then a wider run of green, then another run of pink. In this chart, pink is the color for Sub-Saharan African ancestry, and green is the color for Native American. The y-axis runs from 0 to 100 percent, and it shows the probability that the DNA in that region of the chromosome comes from each Ancestry Composition population. These pink and green regions fill the entire vertical space of the graph, which means that we are 100 percent confident that the DNA in those regions has Sub-Saharan African and Native American genetic ancestry, respectively.
The next region to the right, between positions 50 and 100 on the x-axis, is a stretch of multi-colored blue. The thickest strip at the bottom is dark teal, which is the color for British & Irish. This segment of DNA has somewhere between a 50 percent chance and a 60 percent chance of reflecting British & Irish ancestry. The other shades of blue show that the same DNA segment also has a chance of reflecting Italian, Iberian, or French & German ancestry. If you think back to the haplogroup example above, this result makes sense: it is normal for a DNA marker to match reference DNA from lots of places, even if it matches some places better than others. In this example, the result shows that this DNA segment matches reference DNA from all over Europe. We can very confidently conclude that this stretch of DNA reflects European ancestry, but the evidence isn't strong enough to assign it to one specific region of Europe with high confidence.
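The first kind of correction, fixing a lone odd call inside a run, can be sketched with a small Hidden Markov Model in Python. This is an illustration only. The probabilities `STAY` and `CORRECT` are assumed values, the sketch returns a single best sequence (the Viterbi path) rather than per-window probabilities like the real smoother, and it does not attempt to detect switch errors.

```python
import math

STATES = ["X", "Y", "Z"]
STAY = 0.9                   # assumed chance that neighboring windows share ancestry
SWITCH = (1.0 - STAY) / 2    # split evenly between the two other populations
CORRECT = 0.8                # assumed chance the SVM call matches the true ancestry
WRONG = (1.0 - CORRECT) / 2

def viterbi(calls):
    """Return the most probable true ancestry sequence for a list of SVM calls."""
    def trans(a, b):
        return math.log(STAY if a == b else SWITCH)

    def emit(state, call):
        return math.log(CORRECT if state == call else WRONG)

    # best[s] is the log-probability of the best path ending in state s.
    best = {s: math.log(1.0 / len(STATES)) + emit(s, calls[0]) for s in STATES}
    back = []
    for call in calls[1:]:
        pointers, nxt = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + trans(p, s))
            pointers[s] = prev
            nxt[s] = best[prev] + trans(prev, s) + emit(s, call)
        back.append(pointers)
        best = nxt
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return list(reversed(path))

print(viterbi(list("XXXZZZYZ")))
```

Under these assumed settings, the lone Y is replaced by Z because one misread window is more plausible than two ancestry changes in a row, while the longer change from X to Z is kept.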
Step 4: Re-calibration
This plot shows a lot of information, but how do we know it's correct? We use a calibration step to correct for systematic bias in our results.
First, we ran some tests just to establish whether we needed to correct for any bias at all. We simulated a large set of admixed individuals. Because we simulated them ourselves, we knew the "true" ancestry of each part of their genome. Then we ran these simulated individuals through the entire Ancestry Composition pipeline, and we compared their results with their true ancestries.
We found that most of the reference populations were already fairly well calibrated, which means we predicted ancestry in proportion to how often it actually occurred in the simulated dataset. But there were a few populations, in particular the Scandinavian and Balkan reference populations, that were overrepresented in the results, relative to our simulated data.
To correct for this bias, we developed a re-calibration step that adjusts the ancestry proportions produced by the smoother so that each population is assigned in proportion to how often it actually occurs.
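The text does not describe how the re-calibration itself is computed, so the Python sketch below shows only the check that motivates it: comparing predicted probabilities with how often they turn out to be right in simulated data. The function name, the number of bins, and the toy numbers are assumptions for illustration.

```python
def calibration_table(predictions, truths, n_bins=5):
    """Compare predicted probabilities with how often the prediction was true.

    `predictions` are probabilities in [0, 1] that a window belongs to one
    population; `truths` are True/False values known from the simulation.
    Returns (mean predicted, observed frequency, count) for each non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, t in zip(predictions, truths):
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, t))
    rows = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            observed = sum(1 for _, t in members if t) / len(members)
            rows.append((mean_pred, observed, len(members)))
    return rows

# Toy simulated windows: four confident calls (half of them wrong)
# and four low-probability calls (all correctly rejected).
predictions = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
truths = [True, True, False, False, False, False, False, False]
rows = calibration_table(predictions, truths)
for row in rows:
    print(row)
```

In this toy data the population is predicted with 90 percent probability but is correct only half the time, which is the kind of overrepresentation a calibration step is meant to correct.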
Step 5: Aggregation & Reporting
As a last step, we summarize the results to display them in your Chromosome Painting. The way we do this is to apply a threshold to the probability plot:
The horizontal line in this image shows a 70 percent confidence threshold, which we will talk about for this example. You can view your own Chromosome Painting at different confidence thresholds ranging from 50 percent (speculative) to 90 percent (conservative).
We look across the entire chromosome and ask whether any ancestry has an estimated probability exceeding the specified threshold (in this case 70 percent). In this example, with the exception of the blue European stretch, the ancestry estimates exceed 70 percent over the majority of the chromosome. Each region contributes to your overall Ancestry Composition in proportion to its size: For example, the green Native American segment near the end of this plot makes up about 0.26 percent of the entire genome. Even though there is some probability that the segment comes from a different population, the Native American proportion exceeds the 70 percent threshold, and so we add 0.26 percent Native American to the overall Ancestry Composition at this threshold.
In the case of the European segment, no single ancestry exceeds the 70 percent threshold, so we don't assign that DNA to any fine-grained ancestries. Instead, we refer to our hierarchy of ancestries. There is a Broadly Northern European ancestry that includes four fine-level ancestries: British & Irish, Scandinavian, Finnish, and French & German. When we add up the contributions of each of these subgroups, if the total Broadly Northern European contribution exceeds the 70 percent threshold, then we will report the region as Broadly Northern European.
In this example, the Broadly Northern European reference populations still don't exceed the 70 percent threshold, but the combined probabilities of all the European populations do. So this region is assigned Broadly European ancestry.
We use broad Ancestry Composition categories to avoid making assumptions about your ancestry when your DNA matches several different country-level populations. In regions where no ancestry — including the broad ancestries — exceeds the specified threshold, we report Unassigned ancestry. You can see the entire ancestry hierarchy in your Ancestry Composition report by clicking "See all 31 tested populations."
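The thresholding logic for a single region can be sketched in Python as follows. This is a simplified illustration: only a small slice of the hierarchy is included, the membership of the Broadly Southern European group is an assumption (the text lists the members of Broadly Northern European only), and the probabilities in `window` are made-up values chosen to resemble the European stretch discussed above.

```python
REGIONS = {
    "Broadly Northern European": [
        "British & Irish", "Scandinavian", "Finnish", "French & German",
    ],
    # Assumed membership, for illustration only.
    "Broadly Southern European": ["Italian", "Iberian", "Sardinian"],
}
CONTINENTS = {
    "Broadly European": ["Broadly Northern European", "Broadly Southern European"],
}

def assign_window(probs, threshold=0.7):
    """Pick the most specific label whose total probability exceeds the threshold."""
    # 1. Fine-grained populations.
    for population, p in probs.items():
        if p > threshold:
            return population
    # 2. Regional groups: add up their member populations.
    region_totals = {
        region: sum(probs.get(m, 0.0) for m in members)
        for region, members in REGIONS.items()
    }
    for region, total in region_totals.items():
        if total > threshold:
            return region
    # 3. Continental groups: add up their member regions.
    for continent, regions in CONTINENTS.items():
        if sum(region_totals[r] for r in regions) > threshold:
            return continent
    return "Unassigned"

window = {"British & Irish": 0.55, "French & German": 0.10,
          "Italian": 0.20, "Iberian": 0.15}
for threshold in (0.5, 0.6, 0.7):
    print(threshold, assign_window(window, threshold))
```

Raising the threshold pushes the same stretch of DNA from a fine-grained label toward a broader one, which is why a Chromosome Painting looks more general at conservative settings.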
Using Close Family Members
Ancestry Composition is even more powerful if you have a biological parent or child who is also in the 23andMe database. You get higher resolution results when you connect with these close family members, and if you connect with a parent, you also get additional results.
When you connect with a biological parent or child, you'll get a very high-quality chromosome phasing result from Finch. That translates into better Ancestry Composition results, in the sense that you might see more assignment to the fine-resolution ancestries: more Scandinavian, less Broadly Northern European.
Why is that? Remember, the smoother — which generates your final Ancestry Composition estimate — has to correct two kinds of errors. There are mistakes along the chromosome, and there are mistakes between the chromosomes. When your chromosomes are phased using genetic information from your parent or child, mistakes between the chromosomes (switch errors) are extremely rare, so the smoother can be more confident in correcting the along-chromosome mistakes.
Your results get that boost in resolution if you connect with either a child or a parent. It's slightly better to use a parent than to use a child, and it's slightly better to use two parents than one parent — however, you get the majority of the benefit with the first relative you add. Your Ancestry Composition report will automatically use the best available configuration, and your results will also automatically update when you share with a new relative. (As of this writing, the update usually takes three to five business days.)
If you connect with one or both of your biological parents, you will get an extra result. You'll be able to see the Parental Inheritance view, which shows the contribution of your mother to your ancestry on one side and the contribution of your father to your ancestry on the other. We can't provide this view if you don't have a parent connected because we need at least one of your parents to orient the results. Here's an example of what you can learn from Inheritance View: say that your Ancestry Composition includes a small amount of Ashkenazi Jewish ancestry. When you look at your Inheritance View, you'll be able to see from which parent you inherited your Ashkenazi ancestry.
Testing & Validation
Ancestry Composition includes a lot of steps, and each step has to be tested. We've discussed a few of those tests already while explaining our algorithm. In this section, we want to share some actual test results to give a sense of how well Ancestry Composition works. This section focuses on the final test we run, because that integrates the performance of each of the steps into an overall picture.
This test looks at two classic measures of model performance, called precision and recall. These are the standard measurements that researchers use to test how well a prediction system works. Precision answers the question "When the system predicts that a piece of DNA comes from population A, how often is the DNA actually from population A?" Recall answers the question "Of the pieces of DNA that actually are from population A, how often does the system correctly predict that they are from population A?"
There is a tradeoff between precision and recall, so we have to strike a balance between them. A high-precision, low-recall system will be extremely picky about assigning, say, Scandinavian ancestry. The system will only assign DNA as Scandinavian when it's very, very sure. That will yield high precision — since the assignment of Scandinavian is always correct — but low recall, because a lot of true Scandinavian ancestry is left unassigned.
With a low-precision, high-recall system the opposite problem exists. In this case, the system is liberal with assignments of Scandinavian ancestry. Any time a piece of DNA might be Scandinavian, it is assigned that ancestry. This will yield high recall, as all genuine Scandinavian DNA will be assigned correctly, but low precision, because non-Scandinavian DNA will often be assigned as Scandinavian incorrectly.
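Both measures are easy to compute once true and predicted labels are known. The Python sketch below uses made-up labels for eight pieces of DNA to contrast a picky system with a liberal one. The function name and the toy data are assumptions for illustration.

```python
def precision_recall(true_labels, predicted_labels, population):
    """Compute precision and recall for one population over labeled DNA pieces.

    Returns None for a measure whose denominator is zero.
    """
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == population and p == population)
    predicted = sum(1 for _, p in pairs if p == population)
    actual = sum(1 for t, _ in pairs if t == population)
    precision = true_pos / predicted if predicted else None
    recall = true_pos / actual if actual else None
    return precision, recall

truth = ["Scandinavian"] * 4 + ["Finnish"] * 4
# A picky system: assigns Scandinavian only once, but gets it right.
picky = ["Scandinavian", "Unassigned", "Unassigned", "Unassigned"] + ["Finnish"] * 4
# A liberal system: catches all Scandinavian DNA but mislabels some Finnish DNA.
liberal = ["Scandinavian"] * 4 + ["Scandinavian", "Scandinavian", "Finnish", "Finnish"]

print(precision_recall(truth, picky, "Scandinavian"))
print(precision_recall(truth, liberal, "Scandinavian"))
```

The picky system scores perfect precision with low recall, and the liberal system scores perfect recall with lower precision.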
The ideal system has both high precision and high recall, but that may be impossible in real life. Let's see how Ancestry Composition performs on these metrics. For this quality control test, we set apart 20 percent of the reference database, about 1,500 individuals of known ancestry. We trained and ran the entire Ancestry Composition pipeline on the remaining 80 percent of the reference individuals. Then we treated the held-out 20 percent like new 23andMe customers and used our Ancestry Composition pipeline to calculate their ancestry. Since we know these people's true ancestries, we can check to see how accurate their Ancestry Composition results are. We ran this test five times, with a different 20 percent held out each time, and then averaged across the five tests to give the following results:
| Population | Precision (%) | Recall (%) |
| --- | --- | --- |
| Central & South African | 100 | 89 |
| East Asian & Native American | 99 | 99 |
| British & Irish | 90 | 39 |
| French & German | 78 | 8 |
| Middle Eastern & North African | 95 | 83 |
This table shows that our precision numbers are very high across the board, mostly above 90 percent, and rarely dipping below 80 percent. That means that when the system assigns an ancestry to a piece of DNA, that assignment is very likely to be accurate. You can also see that as you move up from the sub-regional level (e.g. British & Irish) to the regional level (e.g. Northern European) to the continental level (e.g. European), the precision approaches 100 percent.
Ancestry Composition's recall is also impressive. Because we were willing to sacrifice some recall to ensure excellent precision, the recall numbers tend to be slightly lower than the precision numbers. That means that the system won't assign fine-level ancestry to pieces of DNA when it's not sure. In the worst case, the French & German recall is 8 percent. This means that 92 percent of real French & German DNA is not assigned to the French & German population. Again, it's important to note that — just like with the precision — the recall values get better and better as you move from sub-regional to regional to continental populations.
It's also important to realize that poor recall doesn't mean bad results. Some populations, like Sardinian, are just hard to tell apart from others. When Ancestry Composition fails to assign Sardinian DNA, this doesn't mean that DNA is incorrectly assigned to something else, like Italian. If it were, then the Italian population would have poor precision. Instead, Ancestry Composition often assigns Sardinian DNA to the Broadly Southern European or Broadly European populations.
The Future of Ancestry Composition
Ancestry Composition has a modular design. This was intentional, because it allows us to improve individual components of the system — like Finch's phasing reference database or the SVM's reference populations — without affecting any of the other steps in the analysis pipeline.
Most common software is updated regularly, and we hope to apply the same model to Ancestry Composition. When we improve some component of the system or upgrade the reference datasets, your results will automatically be updated. You will be able to see a list of those updates in the Change Log at the bottom of your Ancestry Composition Scientific Details.
Updated July 2017
- Novembre J et al. (2008). Genes mirror geography within Europe. Nature. 456(7218):98-101.
- Durand EY et al. (2014). Ancestry Composition: A novel, efficient pipeline for ancestry deconvolution. bioRxiv 010512.
- Do CB et al. (2012). A scalable pipeline for local ancestry inference using thousands of reference individuals. ASHG 2012 Annual Meeting.
- BEAGLE Genetic Analysis Software Package
- 23andMe Blog: African Genetics Project
- 23andMe Blog: Improving Diversity in Genetics Research