Saturday, 13 July 2019

Getting the most out of the Global25

The first thing you need to know about the Global25 is that I update the relevant datasheets regularly, usually every week or two, but they’re always at these links:

Global25 datasheet (scaled)
Global25 pop averages (scaled)
Global25 datasheet
Global25 pop averages

Each sample has a population code and an individual code. The population codes represent the countries, ethnic groups and/or archeological affinities of the samples, and I often modify these codes to suit my needs. On the other hand, most samples have a unique individual code, and I usually don’t change these. So if you’d like to know more details about the samples, try searching for their individual codes via a decent online search engine. Basic information about many of the samples is also available in the "anno" files here.

The main purpose of the Global25 is to provide data for mixture modeling; in other words, for estimating ancestry proportions, and in particular ancient ancestry proportions. This can be done on your computer with the R program and the nMonte R script, or online with the Global25 nMonte Runner, which I discuss below.

If you don’t have R installed on your computer, you can get it here, while nMonte is available here. For this tutorial please download nMonte and nMonte3, and store them in your main working folder (usually My Documents). Once you have R set up, make sure its working directory is the same place where you stored nMonte. You can check this in R by clicking on "File" and then "Change dir". Additionally, you’ll need two nMonte input files in the working directory titled "data" and "target". Examples of these files are available here. We’ll be using them to test the ancient ancestry proportions of a sample set from present-day England.

Before you can begin the analysis you need to first call the nMonte script by typing or copying and pasting source('nMonte.R') into the R console window, and then hitting "enter" on your keyboard. This is what you should see in the R console window afterwards.
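If source() complains that it can’t find the file, the working directory is the usual suspect. A quick check along the following lines can confirm that everything is in place. This is only a sketch: it assumes the files were saved under the names used in this tutorial (nMonte.R, nMonte3.R, data.txt, target.txt), and the helper name check_nmonte_files is made up for illustration.

```r
# Hypothetical helper: report which of the expected tutorial files
# are present in a given folder (defaults to the working directory).
check_nmonte_files <- function(dir = getwd(),
                               files = c("nMonte.R", "nMonte3.R",
                                         "data.txt", "target.txt")) {
  present <- file.exists(file.path(dir, files))
  names(present) <- files
  present
}

print(getwd())               # the folder R is currently working in
print(check_nmonte_files())  # TRUE for each file that was found
# setwd("path/to/your/folder")  # uncomment and edit if the folder is wrong
```

Any FALSE in the result points to a file that is missing or sitting in a different folder.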

To start the mixture modeling process, type or copy and paste getMonte('data.txt', 'target.txt') into the R console window, hit "enter", and wait for the results. After a short time, probably less than a minute or two, you should see this output.
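For anyone curious about what this kind of mixture modeling involves, the basic idea is to find a weighted blend of reference coordinates that sits as close as possible to the target. The sketch below is not the nMonte code. It’s a toy random search written purely for illustration, with a made-up function name (toy_mixture) and invented three-dimensional coordinates rather than real Global25 data.

```r
# Toy mixture fit: NOT the nMonte script, just an illustration of the idea.
# refs:   matrix, one row per reference population, one column per component
# target: numeric vector with one value per column of refs
toy_mixture <- function(refs, target, slots = 100, iters = 5000, seed = 1) {
  set.seed(seed)
  n <- nrow(refs)
  # Start with a random batch of reference rows; each slot is worth 1/slots
  batch <- sample(n, slots, replace = TRUE)
  dist_of <- function(b) {
    blend <- colMeans(refs[b, , drop = FALSE])
    sqrt(sum((blend - target)^2))  # Euclidean distance to the target
  }
  best <- dist_of(batch)
  for (i in seq_len(iters)) {
    trial <- batch
    trial[sample(slots, 1)] <- sample(n, 1)  # swap one slot at random
    d <- dist_of(trial)
    if (d < best) {                          # keep only improvements
      batch <- trial
      best <- d
    }
  }
  weights <- tabulate(batch, nbins = n) / slots
  names(weights) <- rownames(refs)
  list(weights = weights, distance = best)
}

# Invented coordinates for three hypothetical reference populations
refs <- rbind(Ref_A = c(0.10, 0.00, 0.02),
              Ref_B = c(-0.05, 0.08, 0.00),
              Ref_C = c(0.00, -0.04, 0.06))
# A target built as 50% Ref_A, 30% Ref_B, 20% Ref_C
target <- 0.5 * refs["Ref_A", ] + 0.3 * refs["Ref_B", ] + 0.2 * refs["Ref_C", ]

fit <- toy_mixture(refs, target)
print(round(fit$weights, 2))
print(fit$distance)
```

Because the target here is an exact blend of the references, the recovered weights should land close to the 50/30/20 split with a distance near zero. Real samples never fit this cleanly, which is why the reported distance matters when judging a model.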

The data and target files contain population averages, and, as you can see, the results that these population averages produced were in line with what one would expect from such a model focusing on the genetic shifts in Europe and surrounds during the Late Neolithic. Very similar ancient ancestry proportions have been reported for the English and other Northern Europeans recently in scientific literature.

However, when focusing on exceptionally fine-scale genetic variation that isn’t reflected too well in the Global25 population averages, a more effective strategy might be to use multiple individuals from each reference population and let nMonte3 aggregate and average the inferred ancestry proportions. This is often the case when attempting to model ancestry proportions for more recent periods, such as the Middle Ages. So let’s try this with the English sample set using a modified data file, which is available here. Replace the old data file with the new one in your working directory, and, like before, copy and paste the following two commands into the R console window, hitting "enter" after each one: source('nMonte3.R') and getMonte('data.txt', 'target.txt'). This is what you should eventually see.
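The aggregation step can be pictured as adding up the weights of individual reference samples that share a population code. Here’s a minimal sketch of that idea, which is not the nMonte3 code. The labels and weights are invented, and the "PopulationCode:IndividualCode" label layout is an assumption about how individual rows are named.

```r
# Invented individual-level weights (fractions summing to 1).
# Labels assume a "PopulationCode:IndividualCode" layout.
ind_weights <- c("Pop_X:ind01" = 0.20,
                 "Pop_X:ind02" = 0.15,
                 "Pop_Y:ind07" = 0.40,
                 "Pop_Z:ind03" = 0.25)

# Strip everything from the first colon onward to recover the population code
pop_code <- sub(":.*$", "", names(ind_weights))

# Sum the weights within each population and express them as percentages
pop_totals  <- tapply(ind_weights, pop_code, sum)
pop_percent <- sort(100 * pop_totals, decreasing = TRUE)
print(round(pop_percent, 1))
```

Summing by population code this way makes it easy to see the overall contribution of a group even when no single individual from it carries much weight.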

It’s difficult to say how accurate these estimates are. But they look more or less correct considering the limited and less than ideal reference samples. For instance, the individuals labeled SWE_Viking_Age_Sigtuna are supposed to be stand-ins for Danish and Norwegian Vikings, but they’re a relatively heterogeneous group from Sweden, possibly with some British or Irish ancestry, so they might be skewing the results. However, I’ll be adding many more ancient samples to the Global25 datasheets as they become available, including lots of new Vikings, which should greatly improve the accuracy of these sorts of fine-scale mixture models.

An alternative to the R-based approach is the online Global25 nMonte Runner. This is a free tool, and easy to work with via several drop-down menus, but users must become sponsors to unlock all of its available features. To run an analysis, follow these three steps:

1) use the first drop-down menu to pick the reference populations of your choice (up to four are allowed for free users)
2) move down to the second set of drop-down lists and either pick a test population that is already in the system or copy and paste a set of Global25 coordinates into the space labeled "Enter/Paste Sets of Coordinates — Scaled and Comma-separated"
3) feel free to experiment with the additional options if you’re game and willing to part with a little cash to help pay for the site.
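If your coordinates are sitting in R rather than in a text file, a comma-separated line for pasting can be put together with paste(). This is a small sketch with an invented label and only three made-up values for brevity. Putting the label first is an assumption based on the datasheet layout, so check it against what the site expects.

```r
# Invented example: a sample label plus made-up scaled coordinates.
# A real Global25 row has 25 values; three are shown to keep this short.
label  <- "My_Sample"
coords <- c(0.1234, -0.0567, 0.0089)

# Format the numbers without scientific notation or padding,
# then join label and values with commas
values <- format(coords, scientific = FALSE, trim = TRUE)
line   <- paste(c(label, values), collapse = ",")
cat(line, "\n")
```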

However, it’s important to note that the Global25 is a Principal Component Analysis (PCA), so it makes good sense to also use it for producing PCA graphs. To do this, just plot any combination of two or three of its Principal Components (PCs) to create 2D or 3D graphs, respectively. This can be done with a wide variety of programs, including PAST, which is freely available here. To produce a 2D graph, open a Global25 datasheet in PAST, choose comma as the separator, highlight any two columns of data, click on the "Plot" tab and, from the drop-down list, pick "XY graph". Below is a series of graphs that I created in exactly this way. I also color coded the samples according to their geographic origins. This was done by ticking the "Row attributes" tab.
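The same kind of 2D graph can also be drawn in R instead of PAST. Below is a minimal sketch using invented coordinates in place of a real datasheet. The column names (PC1, PC2), the population labels and the region grouping are all assumptions for illustration.

```r
# Invented stand-in for a few datasheet rows: label, region, two components
g25 <- data.frame(
  label  = c("Pop_A", "Pop_B", "Pop_C", "Pop_D"),
  region = c("North", "North", "South", "South"),
  PC1    = c(0.12, 0.11, -0.03, -0.05),
  PC2    = c(0.01, 0.03, 0.09, 0.07),
  stringsAsFactors = FALSE
)

# One colour per region, matched to each row
region_cols <- c(North = "steelblue", South = "firebrick")
point_cols  <- unname(region_cols[g25$region])

# Scatter plot of the two chosen components, labelled by population
plot(g25$PC1, g25$PC2, col = point_cols, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "Two components plotted against each other")
text(g25$PC1, g25$PC2, labels = g25$label, pos = 3, cex = 0.8)
legend("topright", legend = names(region_cols), col = region_cols, pch = 19)
```

With a real datasheet, the data frame would come from read.csv() rather than being typed in, and any pair of component columns could be swapped in for the two used here.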

PAST can also be used to run PCA on subsets of the Global25 scaled data to produce remarkably accurate plots of fine-scale population structure. To try this, create a new text file with your choice of populations from the Global25 scaled datasheet, open it with PAST and choose Multivariate > Ordination > Principal Components Analysis. I’ve already put together several datasheets limited to European, Northern European, West Eurasian and South Asian populations. They’re available at the links below along with more details on how to run them with PAST.
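If you’d rather stay in R, base R’s prcomp() can run a PCA on a subset in much the same spirit. This is a sketch with invented numbers and only four columns instead of 25. It also assumes that the already scaled columns should be centred but not standardized again (scale. = FALSE); PAST’s own settings may differ, so the two won’t necessarily give identical plots.

```r
# Invented subset: rows are populations, columns are scaled components
subset_mat <- rbind(
  Pop_A = c(0.120, 0.010, -0.004, 0.002),
  Pop_B = c(0.115, 0.030, -0.001, 0.004),
  Pop_C = c(0.090, 0.050, 0.006, -0.003),
  Pop_D = c(0.060, 0.080, 0.010, -0.006),
  Pop_E = c(0.030, 0.095, 0.002, 0.001)
)
colnames(subset_mat) <- paste0("PC", 1:4)

# PCA on the subset: centre the columns but keep their existing scaling
pca <- prcomp(subset_mat, center = TRUE, scale. = FALSE)

# Share of variance captured by each new component
var_share <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_share, 3))

# Plot the first two components of the subset PCA
plot(pca$x[, 1], pca$x[, 2], pch = 19,
     xlab = "Subset PC1", ylab = "Subset PC2")
text(pca$x[, 1], pca$x[, 2], labels = rownames(subset_mat), pos = 3, cex = 0.8)
```

Re-running PCA on a subset lets the leading components describe variation within that subset only, which is why the resulting plots can pick up structure that is hard to see when plotting the global components directly.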

Global25 workshop 1: that classic West Eurasian plot
Global25 workshop 2: intra-European variation
Global25 workshop 3: genes vs geography in Northern Europe
The South Asian cline that no longer exists

And if you’re fond of tree-like structures as a means to describe fine-scale genetic variation, please check out this blog post…

Global25 workshop 4: a neighbour joining tree
