Performing a genome scan for Gst

In this section, we will see how to calculate various profiles along the genome. During the practicals we will focus on a single chromosome (chromosome IV), but the method can be applied to every chromosome to obtain a genome-wide picture.

Two types of data can be used as input:

  • presence data: we have the locations of markers of interest along the genome, and we are interested in their kernel density profile, i.e. their distribution
  • quantitative data: we have the locations of markers of interest along the genome and, for each marker, an associated numerical value (e.g. Gst). We are not interested in the distribution of the markers per se, but in the smoothed profile of the quantitative variable along the genome. In this case we perform a kernel smoothing.

Density smoothing: consensus distribution

TODO Distribution of consensus along chromosome IV

Here we simply plot the positions of the consensus sequences along chromosome IV. There are several ways to do this in R:

# R script

# Load the table
d = read.table("blastn-vs-3spine-dna.bestHits", header = TRUE, sep = "\t")

# Keep the unambiguous matches
d2 = subset(d, blastndeltaLogBestEvalues >= 5)

# Keep only matches on groupIV
d3 = subset(d2, blastnsseqid == "groupIV")
dim(d3)

# Plot the ordered positions with a cumulative curve
positions = d3$blastnsstart
plot(sort(positions), 1:length(positions), pch = ".")

# Rug plot
rug(positions)

TODO Calculation of the kernel density profile

The density profile can be generated by adding, for each marker location, a small Gaussian-shaped curve centred on that location. The Gaussian curves add up in regions where markers are clustered, producing a high profile, whereas an isolated marker only produces a small bump.

The kernel is the shape of the curve added for each marker. Here we use a Gaussian kernel, but other kernels exist.
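
To see what density() does under the hood, here is a minimal sketch that builds a profile by summing one Gaussian per marker. The positions, bandwidth and grid below are arbitrary illustration values, not the consensus data:

# Sum one Gaussian per marker "by hand" (toy positions, arbitrary bandwidth)
positions_toy = c(10, 12, 13, 40)
bw = 2                               # bandwidth = sd of each Gaussian
grid = seq(0, 50, length.out = 500)

# At each grid point, add up the contributions of all markers
profile = sapply(grid, function(g) sum(dnorm(g, mean = positions_toy, sd = bw)))
profile = profile / length(positions_toy)  # normalise so the curve integrates to 1

plot(grid, profile, type = "l")
rug(positions_toy)

# Compare with the built-in estimate
lines(density(positions_toy, kernel = "gaussian", bw = bw), col = "red", lty = 2)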

# Kernel density profile
prf = density(d3$blastnsstart, kernel = "gaussian")
plot(prf)

# Change the bandwidth
prf = density(d3$blastnsstart, kernel = "gaussian", bw = 10000)
plot(prf)

# Try other bandwidths
prf = density(d3$blastnsstart, kernel = "gaussian", bw = 100000)
plot(prf)

# Try different kernels on sparse data
x = runif(5) # generate random data
plot(density(x, kernel = "gaussian", bw = 0.05))  # gaussian
rug(x)                                             # see where the markers are
plot(density(x, kernel = "triangular", bw = 0.05)) # triangular
plot(density(x, kernel = "triangular", bw = 0.01)) # triangular, small bw

Try different bandwidths with the consensus data on chromosome IV. How does the bandwidth influence the result? Try different kernels for the consensus data. How does the kernel choice influence the result?

We have plotted the distribution of the consensus sequences. We could also plot the distribution of SNPs (which are present in a subset of the consensus sequences). How would you test for SNP-rich and SNP-poor regions?

Calculation of Gst

We pooled the freshwater and the marine populations to obtain two groups that we would like to compare. We calculated rough Gst estimates for each SNP based on the pool data.

Disclaimer: there are ways to do this much more rigorously, for example by using the quality score information all the way through genotype calling and allele frequency estimation, with likelihood or Bayesian methods, and, above all, by increasing the coverage!
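
To make the idea concrete, here is a minimal sketch of such a rough estimate for a single biallelic SNP, based on Gst = (Ht - Hs) / Ht. The function and the frequencies below are hypothetical illustrations, not the exact script used to produce the Gst files; p1 and p2 stand for the frequencies of the same allele in the marine and freshwater pools:

# Rough Gst estimate for one biallelic SNP from two pooled allele frequencies
# (a sketch of the idea, not the actual pipeline used for the data files)
rough_gst = function(p1, p2) {
  hs = mean(c(2 * p1 * (1 - p1), 2 * p2 * (1 - p2)))  # mean within-group heterozygosity
  p_bar = mean(c(p1, p2))                             # mean allele frequency
  ht = 2 * p_bar * (1 - p_bar)                        # total expected heterozygosity
  (ht - hs) / ht
}

rough_gst(0.9, 0.2)  # strongly differentiated SNP
rough_gst(0.5, 0.6)  # weakly differentiated SNP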

We now want to trace the Gst profile along the genome, and test for regions with aggregation of high or of low Gst values. But first, we will work on an R example dataset to illustrate the methods we will use.

TODO Kernel smoothing on geyser data

We get the x and y values from the geyser dataset. Note that it makes little sense to perform a kernel smoothing on this particular dataset: we just use its numerical values, without attaching any physical meaning to them.

data(geyser, package = "MASS") # make the data accessible
x = geyser$duration
y = geyser$waiting
plot(x, y)

A kernel smoothing is a non-parametric regression similar to a sliding-window average. However, instead of assigning weights of 0 or 1 to the data points depending on whether they fall inside the window, each point receives a weight determined by the shape of the kernel.
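
To make the analogy concrete, here is a minimal sketch of the smoothed value at a single point x0, first with a plain sliding window (0/1 weights) and then with Gaussian kernel weights. This only illustrates the underlying idea; locpoly itself fits local polynomials rather than a plain weighted mean:

# Smoothed value at one point x0: sliding window vs Gaussian kernel weights
x0 = 4    # point at which we estimate the smoothed value
h = 0.3   # window half-width / kernel bandwidth

# Sliding window: weights are 0 or 1
in_window = abs(x - x0) <= h
mean(y[in_window])

# Kernel weights: they decrease smoothly with the distance to x0
w = dnorm(x, mean = x0, sd = h)
sum(w * y) / sum(w)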

Let's perform a simple kernel smoothing with a Gaussian kernel. We will use locpoly from the KernSmooth package, which fits a local polynomial regression with kernel weights.

library(KernSmooth)
plot(x, y)

# Wide bandwidth
ks = locpoly(x, y, bandwidth = 5)
lines(ks, col = "red", lwd = 2)

# Narrow bandwidth
ks = locpoly(x, y, bandwidth = 0.1)
lines(ks, col = "blue", lwd = 2)

# Intermediate
ks = locpoly(x, y, bandwidth = 0.2)
lines(ks, col = "purple", lwd = 2)

How would you choose the bandwidth? What is the issue when the "markers" are heterogeneously distributed along the x axis?
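
One common data-driven answer (among others, such as cross-validation) is the direct plug-in bandwidth selector dpill() from the KernSmooth package; treat the result as a starting point rather than a definitive choice:

# One possible data-driven bandwidth: the direct plug-in selector from KernSmooth
h = dpill(x, y)
h
ks = locpoly(x, y, bandwidth = h)
plot(x, y)
lines(ks, col = "darkgreen", lwd = 2)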

We used a fixed bandwidth above. However, there are methods for which the bandwidth is locally adapted. We can use the lokerns function from the lokern package.

library(lokern)
plot(x, y)

# Locally adapted bandwidth
lk = lokerns(x, y)
lines(lk)

How does this compare with the fixed-bandwidth approach? Which one seems the most appropriate? Note: the lokerns approach can give very wiggly profiles depending on the data.

TODO Permutation profiles on geyser data

We can now calculate a smoothed profile of our data along one dimension. We would like to test for regions where the profile is higher or lower than expected by chance. We do not have any a priori expectation for the distribution of profiles, but we can use the data we have to perform permutation testing.

We can shuffle the locations of the data points along the x axis and generate the corresponding smoothed profile. By generating numerous random profiles, we can build an empirical distribution of the profiles that can be obtained from our data. We can then compare the real, observed profile with this distribution of random profiles.

Let's generate one random profile:

# Observed profile
plot(x, y, pch = 16, col = "cornflowerblue")
lk = lokerns(x, y)
lines(lk, col = "cornflowerblue", lwd = 2)

# Shuffle the data along the x axis
shuffled_x = sample(x)
points(shuffled_x, y, pch = 16, col = "salmon")
shuffled_lk = lokerns(shuffled_x, y)
lines(shuffled_lk, col = "salmon", lwd = 2)

Let's do a few more permutations:

# Shuffle the data along the x axis
shuffled_x = sample(x)
points(shuffled_x, y, pch = 16, col = "purple")
shuffled_lk = lokerns(shuffled_x, y)
lines(shuffled_lk, col = "purple", lwd = 2)

# Shuffle the data along the x axis
shuffled_x = sample(x)
points(shuffled_x, y, pch = 16, col = "green")
shuffled_lk = lokerns(shuffled_x, y)
lines(shuffled_lk, col = "green", lwd = 2)

It would be convenient to have one function to do the permutation and generate the profile for us:

# Function
permute_lokerns = function(x, y) {
  x_shf = sample(x)
  lk_shf = lokerns(x_shf, y)
  return(lk_shf)
}

# Original plot
plot(x, y)
lk = lokerns(x, y)
lines(lk)

# Shuffled profiles
# (we don't plot the shuffled points, only the resulting profiles)
lines(permute_lokerns(x, y), col = "red")
lines(permute_lokerns(x, y), col = "orange")
lines(permute_lokerns(x, y), col = "pink")

Interesting, but we can do even better by putting the permutation step in a loop:

# Original plot
plot(x, y)
lk = lokerns(x, y)
lines(lk)

# Many permutations
n = 1000
for (i in 1:n) {
  lines(permute_lokerns(x, y), col = rgb(i/n, 0, 1 - i/n))
}

# Original profile again, on top
lines(lk, lwd = 2)

Run the loop with 1000 permutations. Can you see any particular pattern in the distribution of the permuted profiles? How does the variability of the profiles relate to the original distribution of the points? Can you distinguish regions of "high statistical power" from regions of "low statistical power"?

How would you use the permutation information to estimate p-values for high/low profile regions?

TODO Permutation testing with the geyser data

We can calculate a p-value for high or low profile regions by counting, at each position along the profile, how many permutation profiles lie above or below the observed curve.

Let's write some R code to do that:

permutation_test = function(x, y, nperm) {

  # Calculate the observed profile
  obs_lk = lokerns(x, y)

  # Store the coordinates of the profile
  x_obs_lk = obs_lk$x.out
  p_obs_lk = obs_lk$est

  # Initialize variables to count the number of profiles below or above the
  # observed profile
  perm_higher = numeric(length(x_obs_lk))
  perm_lower = numeric(length(x_obs_lk))

  # Loop to perform the permutations
  for (i in 1:nperm) {
    
    # Permute the values
    x_shf = sample(x)

    # Calculate the profile
    lk_shf = lokerns(x_shf, y)

    # Extract the coordinates of the profile
    x_shf_lk = lk_shf$x.out
    p_shf_lk = lk_shf$est

    # Check that the x are consistent
    stopifnot(all(x_obs_lk == x_shf_lk))

    # Compare and count
    higher = as.numeric(p_shf_lk >= p_obs_lk)
    lower = as.numeric(p_shf_lk <= p_obs_lk)
    perm_higher = perm_higher + higher
    perm_lower = perm_lower + lower
  }

  # Calculate the p-values
  pval_high_region = perm_higher / nperm
  pval_low_region = perm_lower / nperm

  # Return
  return(data.frame(x = x_obs_lk, obs = p_obs_lk, 
         pval_high = pval_high_region,
         pval_low = pval_low_region))    
}

This is a big function! Let's try it and see whether we can find regions of the example dataset in which high-value or low-value points are significantly aggregated:

# Observed data
plot(x, y)
lk = lokerns(x, y)
lines(lk)

# Perform the test
p = permutation_test(x, y, 100)

# Prepare for 3 plots
par(mfrow = c(3, 1))
par(mar = c(3, 2.5, 1, 1))

# Plot observed data
plot(p$x, p$obs, type = "l")

# Plot some permutations
# (not exactly the same as the ones used for p-values!)
n = 1000
for (i in 1:n) {
  lines(permute_lokerns(x, y), col = rgb(i/n, 0, 1 - i/n, 0.5))
}

# Add the original points on top
points(x, y, pch = 16)

# Plot the p-value profiles
plot(p$x, -log10(p$pval_high), type = "l")
plot(p$x, -log10(p$pval_low), type = "l")
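
Note that with a finite number of permutations a p-value can be exactly 0, which makes -log10() infinite. A common adjustment (one option among others, not part of the function above) is to add a pseudo-count to both the numerator and the denominator:

# Avoid p-values of exactly 0 (and infinite -log10 values) with a pseudo-count
# nperm must match the number of permutations used in permutation_test() above
nperm = 100
pval_high_adj = (p$pval_high * nperm + 1) / (nperm + 1)
pval_low_adj  = (p$pval_low  * nperm + 1) / (nperm + 1)
plot(p$x, -log10(pval_high_adj), type = "l")
plot(p$x, -log10(pval_low_adj), type = "l")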

TODO Kernel smoothing on Gst data

Add the data files
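
Once the data files are available, the approach is the same as for the geyser example. Here is a sketch, assuming a hypothetical tab-separated file "gst-groupIV.tsv" with one row per SNP on chromosome IV and columns named position and Gst (the real file and column names may differ):

# Sketch, assuming a hypothetical table "gst-groupIV.tsv" with columns
# "position" and "Gst" (actual file and column names may differ)
gst = read.table("gst-groupIV.tsv", header = TRUE, sep = "\t")
plot(gst$position, gst$Gst, pch = ".")
lk_gst = lokerns(gst$position, gst$Gst)
lines(lk_gst, col = "red", lwd = 2)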

TODO Permutation testing on Gst data
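
The permutation_test function written above can be reused directly. A sketch, under the same assumptions as the previous block:

# Sketch: reuse permutation_test() on the hypothetical Gst table from above
p_gst = permutation_test(gst$position, gst$Gst, 1000)
plot(p_gst$x, -log10(p_gst$pval_high), type = "l")  # candidate high-Gst regions
plot(p_gst$x, -log10(p_gst$pval_low), type = "l")   # candidate low-Gst regions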

Differences and complementarity with outlier tests

Note: kernel smoothing and permutation testing are only possible when marker locations are known. Locations can be physical positions or genetic distances in cM, but we have to understand what we are doing!

Difference in philosophy with the outlier tests
