# Scaling of SNP coefficients when simulating phenotypes

When developing a new statistical method in the context of genetic association testing, it’s common to simulate phenotypes on which to test the new method. This generally involves the following steps:

1. Taking real genotype data (or simulating some from scratch, for example with msprime);
2. Choosing a set of genetic variants which will be used to build the phenotype;
3. Simulating coefficients for these genetic variants; and
4. Adding noise to achieve a desired level of heritability.

This post will focus on step 3, that of simulating coefficients for the variants (SNPs) chosen. More specifically, we will consider the question of whether and how per-SNP heritability should vary with the SNP’s minor allele frequency (MAF).

# 1. The problem: per-SNP heritability increases with MAF

Suppose we have chosen our set of SNPs in step 2 and now want to simulate their coefficients. A simple approach would be to simply draw them independently from a standard Normal distribution (mean zero, variance one). The problem with this approach is that it ignores an important piece of information about each SNP: their minor allele frequency (MAF).

In real populations, some mutations are more common than others. Moreover, because of natural selection, more common mutations generally have smaller effect sizes that rarer mutations. This has been observed empirically in GWAS (see for example this post by Sasha Gusev on Twitter) and can be explained theoretically by a model in which a trait is assumed to be at its optimal value in the population – then, any mutations that arise will be harmful and so the only mutations which are allowed to increase in frequency and become common are those with small effect sizes.

Since this inverse relationship between allele frequency and effect size is a key feature of real-world assocations between genotypes and phenotypes, it’s important that we mimic it in our simulations. Let’s now compute the per-SNP heritability (that is, the expected contribution of each SNP to the variance of the trait) to examine what this relationship looks like if we simply simulate coefficients from independent standard Normals.

We can compute the per-SNP heritability as follows:

\begin{align*} \text{Var}(\beta_j x_j) &= \text{Var}(\beta_j) \text{Var}(x_j) + \text{Var}(\beta_j) {\left[\mathbb{E}(x_j)\right]}^2 + {\left[\mathbb{E}(\beta_j)\right]}^2 \text{Var}(x_j). \end{align*}

Since $$\mathbb{E}(\beta_j) = 0$$, the final term goes away. And by mean-centring the genotypes (which we haven’t done, but just for simplicity here) the second term is also zero. We continue:

\begin{align} \text{Var}(\beta_j x_j) &= \underbrace{\text{Var}(\beta_j)}_{=1} \text{Var}(x_j) \label{eq:varbx} \tag{1}\\ &= 2p_j(1-p_j), \nonumber \end{align}

where $$p_j$$ is the minor allele frequency of $$x_j$$, since $$x_j$$ is binomial.

This implies that the heritability explained by each SNP varies across SNPs and increases linearly with the MAF (i.e., more common SNPs explain more heritability on average).

Since we expect rarer SNPs to have larger effect sizes (so a larger value for $$\text{Var}(\beta_j)$$), we would like the linear, increasing relationship between heritability and frequency above to actually be flat (i.e., all SNPs explain the same amount of heritability in expectation) or at least less pronounced (as it is, we have a one-to-one relationship between MAF and heritability explained).

This leads us to the most common approach to simulate coefficients, which is to make them have constant per-SNP heritability regardless of frequency.

# 2. The most common approach: constant per-SNP heritability

The most common way to simulate coefficients when building artificial phenotypes is to make them have a fixed contribution to heritability which doesn’t depend on the MAF. This can be achieved in several ways.

## 2.1 Scaling the variance of the Normal distribution

We can draw our coefficients from a Normal distribution whose variance is scaled by the reciprocal of the variance of the corresponding genotype:

\begin{align*} \beta_j \sim \text{Normal}\left(0, \frac{1}{2p_j(1-p_j)}\right). \end{align*}

Equation \eqref{eq:varbx} then becomes:

\begin{align*} \text{Var}(\beta_j x_j) &= \frac{1}{2p_j(1-p_j)} 2p_j(1-p_j) = 1, \end{align*}

achieving constant heritability explained across SNPs regardless of MAF.

This is the approach used for the simulations in the LD Score regression paper,1 in which per-allele effect sizes are drawn from a $$\text{Normal}(0, h^2 / (2Mp_j(1-p_j)))$$ distribution, where $$h^2$$ is the desired level of heritability2 and $$M$$ the number of causal SNPs.

## 2.2 Drawing the coefficients from a standard Normal and then rescaling them

An alternative and equivalent way to obtain a fixed heritability per SNP is to first draw the coefficients from a standard normal as in section 1 and then rescale them by the square root of the reciprocal of the variance. This amounts to replacing the coefficients $$\beta_j$$, drawn from a standard Normal, with scaled coefficients $$\delta_j$$:

\begin{align*} \delta_j = \frac{1}{\sqrt{2p_j(1-p_j)}} \beta_j. \end{align*}

We can then write:

\begin{align*} \text{Var}(\delta_j x_j) &= \text{Var}\left(\frac{1}{\sqrt{2p_j(1-p_j)}}\beta_j\right) \text{Var}(x_j)\\ &= \frac{1}{2p_j(1-p_j)}2p_j(1-p_j) = 1. \end{align*}

This is the setup used in the LDpred-funct paper,3 which incorporates functional priors to improve the accuracy of LDPred.

## 2.3 Drawing the coefficients from a standard Normal and normalising the genotypes instead

Finally, we can achieve the same result by simply normalising the genotypes (i.e., dividing them by $$\sqrt{2p_j(1-p_j)}$$) while still drawing our coefficients from a standard Normal. Writing $$g_j = x_j / \sqrt{2p_j(1-p_j)}$$ as the standardised genotypes, equation \eqref{eq:varbx} now becomes:

\begin{align*} \text{Var}(\beta_j g_j) &= \text{Var}(\beta_j) \text{Var}\left(\frac{x_j}{\sqrt{2p_j(1-p_j)}}\right)\\ &= \frac{1_j}{2p_j(1-p_j)} \cdot \underbrace{\text{Var}(x_j)}_{=2p_j(1-p_j)} = 1. \end{align*}

This approach seems the most common of the three and is used, for example, in the GCTA paper,4 in the BOLT-LMM paper5 and in the REGENIE paper.6

# 3. A more realistic approach: a scaled relationship between heritability and MAF

More recently, Doug Speed et al. proposed the LDAK model under which the expected (relative) heritability contribution of each SNP is given by the following formula:7

\begin{align} \mathbb{E}(h_j^2) \propto {[f_j(1-f_j)]}^{1+\alpha} \cdot w_j r_j, \label{eq:ldak} \tag{2} \end{align}

where the relationship between heritability and the MAF is controlled by the parameter $$\alpha$$, $$w_j$$ is a SNP weight computed based on local levels of LD (it’s higher for SNPs in regions of low LD) and $$r_j \in [0, 1]$$ is an information score measuring confidence in a SNPs imputation ($$r_j$$ is 1 for directly genotyped SNPs and is otherwise an estimate of the squared correlation between the true and imputed genotypes).

## 3.1 Deriving the heritability expression

Equation \eqref{eq:ldak} arises from the following model for the phenotype $$Y_i$$:

\begin{align*} Y_i = \sum_k \theta_k Z_{ik} + \sum_j \beta_j X_{ij} + e_i, \end{align*}

with $$\beta_j \sim \text{Normal}(0, r_j w_j\sigma_g^2 / W)$$, $$e_i \sim \text{Normal}(0, \sigma_e^2)$$ and $$W = \sum_j r_j w_j [2f_j(1-f_j)]^{1 + \alpha}$$. The variables $$Z_{ik}$$ are fixed-effect covariates and the $$X_{ij}$$ are the centred and scaled allele counts (i.e., $$X_{ij} = (S_{ij} - 2f_j) \cdot {[2f_j(1-f_j)]}^{\alpha/2}$$, with $$S_{ij}$$ being the raw allele count).

Under this model, we can derive the expected (relative) heritability of SNP $$j$$ as follows:

\begin{align*} \mathbb{E}(h_j^2) &= \frac{\text{Var}(\beta_jX_j)}{\text{Var}(Y)} \\ &= \frac{\mathbb{E}(\beta_j^2) \text{Var}(X_j)}{\text{Var}(Y)}\\ &= \frac{r_jw_j\sigma_g^2/W \cdot [2f_j(1-f_j)]^\alpha [2f_j(1-f_j)]}{\text{Var}(Y)}\\ &= \frac{r_jw_j\sigma_g^2/W \cdot [2f_j(1-f_j)]^{1 + \alpha}}{\text{Var}(Y)}. \end{align*}

Since $$\sigma_g^2$$, $$W$$ and $$\text{Var}(Y)$$ are the same for all SNPs, we can consider them scaling constants and so obtain equation \eqref{eq:ldak}.

## 3.2 Relation to previous approaches

Modelling the contribution of imputation quality (through $$r_j$$) and the levels of local LD (through $$w_j$$) is new and so does not fit within the framework of section 2. If we ignore these two factors (and so set $$r_j = w_j = 1$$), we see that Speed et al.’s approach is equivalent to the constant heritability model for $$\alpha = -1$$.

In their paper, Speed et al. fitted their model to data for 42 quantitative traits assuming different values for $$\alpha$$ ($$-1.25, -1, -0.75, -0.5, -0.25, 0$$ and $$0.25$$) to determine which value led to the best model fit (likelihood). They found $$\alpha = -0.25$$ to provide the best fit overall, although a slightly better fit could be achieved for some traits with other values.

With $$\alpha = -0.25$$, equation \eqref{eq:ldak} becomes:

\begin{align*} \mathbb{E}(h_j^2) \propto {[f_j(1-f_j)]}^{0.75} \cdot w_j r_j. \end{align*}

As can be seen in the figure below, this implies a positive relationship between MAF and per-SNP heritability which grows more slowly as the MAF rises.

Relationship between heritability and MAF (Fig. 2a from Speed et al.)

Caption in the paper: 'The parameter $$\alpha$$ specifies the assumed relationship between heritability and MAF: in human genetics, $$\alpha = -1$$ is typically used (solid blue line), while in animal and plant genetics $$\alpha = 0$$ is more common (orange); we instead found that $$\alpha = -0.25$$ (red) provides a better fit to real data. The gray bars report (relative) estimates of the per-SNP heritability for SNPs with $$\text{MAF} < 0.1$$ and $$\text{MAF} \geq 0.1$$, averaged across the 19 GWAS traits (vertical lines provide 95% confidence intervals); the dashed lines indicate the per-SNP heritability predicted by each $$\alpha$$ value.'

Comparing this setup to the simplest approach in section 1, we see that they are similar, with only a slightly smaller coefficient here. The two approaches would be equivalent for $$\alpha=0$$, again ignoring $$r_j$$ and $$w_j$$.

# 4. Which approach should I use?

As mentioned above, the second approach of making heritability independent of MAF (equivalent to setting $$\alpha = -1$$ in the LDAK framework) is the most commonly used in practice. However, in light of Speed et al.’s more recent empirical analysis, it may make sense to build phenotypes using different values for $$\alpha$$ to ensure the simulated scenarios cover a range of plausible architectures.

This is the approach taken by Brian Zhang et al. in their recent paper proposing ARG-Needle, a new method to infer ancestral recombination graphs from large samples. In their simulations, they set $$\alpha \in \{0, -0.5, -1\}$$ (see Figure S6 in the Supplementary Material) thus covering the two scenarios in sections 1 and a 2 as well as an intermediate setting.

# 5. Summary

We have seen how the naive approach of simply simulating coefficients from identical distributions that don’t take their allele frequency into account leads to more common SNPs contributing disproportionately more to the heritability of a trait (i.e., expected heritability explained increases one-to-one with allele frequency).

We then saw how the common alternative to this setup in simulation studies in the literature is to have the expected heritability explained be constant regardless of MAF, and the multiple ways in which this can be achieved.

Lastly, we considered a newer model in which heritability does increase with frequency but in a less pronounced way. This model appears to be favoured by empirical data.

Going forward, one could consider building different simulation settings for multiple values of $$\alpha$$ in the LDAK framework as some researchers in the community have already started doing!

1. See section titled Simulations with polygenic genetic architectures in the main text.

2. By drawing the residuals from a distribution with variance $$(1 - h^2)$$ – see Section 1.1 of the Supplementary Note, although it’s a theory section and doesn’t refer to the simulations – the SNP heritability under this setup is $$h^2 / (h^2 + 1 - h^2) = h^2$$.

3. See section titled Simulations in the main text.

4. See section titled GWAS Simulation in the main text.

5. See section 5.1.2 Simulated phenotypes in the Supplementary Text.

6. See section titled Data simulation in the main text.

7. We adopt the paper’s notation and now denote the MAF by $$f_j$$.