Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals

Gene discovery

perform within-family association analyses that probe the robustness of our results.Our biological annotation analyses, which focus on the results from the autosomal GWAS, reinforce the main findings from earlier GWAS in smaller samples, such as the role of many of the prioritized genes in brain development.However, the newly identified SNPs also lead to several new findings.For example, they strongly implicate genes involved in almost all aspects of neuron-toneuron communication.
We found that a polygenic score derived from our results explains around 11% of the variance in educational attainment.We also report additional GWAS of three phenotypes that are highly genetically correlated with educational attainment: cognitive (test) performance (n = 257,841), self-reported math ability (n = 564,698) and hardest math class completed (n = 430,445).We identify 225, 618 and 365 lead SNPs, respectively.When we jointly analyze all four phenotypes using a recently developed method 11 , we found that the explanatory power of polygenic scores based on the resulting summary statistics increases, to 12% for educational attainment and 7-10% for cognitive performance.

Results
Primary GWAS of educational attainment.In our primary GWAS, we study educational attainment, which is measured as the number of years of schooling that individuals completed (EduYears).All association analyses were performed at the cohort level in samples restricted to European-descent individuals.We applied a uniform set of quality-control procedures to all cohort-level results.Our final sample-size-weighted meta-analysis produced association statistics for around 10 million SNPs from phase 3 of the 1000 Genomes Project 12 .
The quantile-quantile plot of the meta-analysis (Supplementary Fig. 1) exhibits substantial inflation (λ GC = 2.04).According to our linkage disequilibrium (LD) score regression 13 estimates, only a small share (approximately 5%) of this inflation is attributable to bias (Supplementary Fig. 2 and Supplementary Table 1).We used the estimated LD score intercept (1.11) to generate inflationadjusted test statistics.
Figure 1 shows the Manhattan plot of the resulting P values.We identified 1,271 approximately independent (pairwise r 2 < 0.1) SNPs at genome-wide significance (P < 5 × 10 −8 ), 995 of which remain if we adopt the stricter significance threshold (P < 1 × 10 −8 ) proposed in a recent study 14 (Supplementary Table 2, see Methods for a description of the clumping algorithm).The results from a conditional-joint analysis 15 are reported in the Supplementary Note and Supplementary Table 3.
We used a Bayesian statistical framework to calculate winner'scurse-adjusted posterior distributions of the effect sizes of the lead SNPs (Methods).We found that the median effect size of the lead SNPs corresponds to 1.7 weeks of schooling per allele; at the 5th and 95th percentiles, 1.1 and 2.6 weeks, respectively.We also examined the replicability of the 162 single-SNP associations (P < 5 × 10 −8 ) that were reported in the combined discovery and replication sample (n = 405,073) of the largest previous study 10 .In the subsample of our data (n = 726,808) that did not contribute to the analyses of the previous study, the SNPs replicate at a rate that closely matches theoretical projections derived from our Bayesian framework (Supplementary Fig. 3).
Within-family association analyses.We conducted within-family association analyses in four sibling cohorts (22,135 sibling pairs) and compared the resulting estimates to those from a meta-analysis that excluded the siblings (n = 1,070,751).The latter association statistics were adjusted for stratification bias using the LD score intercept.Figure 2 shows the observed sign concordance for three sets of approximately independent SNPs, selected using P value cutoffs of 5 × 10 −3 , 5 × 10 −5 and 5 × 10 −8 .The concordance is substantially greater than expected by chance but weaker than predicted by our Bayesian framework, even after we extend the framework to account for inflation in GWAS coefficients owing to assortative mating.In a second analysis based on all SNPs, we estimate that within-family effect sizes are roughly 40% smaller than GWAS effect sizes and that our assortative-mating adjustment explains at most one third of this deflation.(For comparison, when we apply the same method to height, we found that the assortative-mating adjustment fully explains the deflation of the within-family effects.) The Supplementary Note contains analyses and discussion of the possible causes of the remaining deflation we observe for EduYears.Although the evidence is not conclusive, it suggests that the GWAS effect-size estimates may be biased upward by correlation between educational attainment and a rearing environment conducive to educational attainment.Consistent with this hypothesis, a recent paper 16 reports that a polygenic score for EduYears based entirely on the non-transmitted alleles of the parents is approximately 30% as predictive as a polygenic score based on transmitted alleles.(For height, the analogous estimate is only 6%.)The non-transmitted alleles affect the educational attainment of the parents but can only influence the educational attainment of the child indirectly.If greater parental educational attainment positively influences the rearing environment, then GWAS that control imperfectly for rearing environment will yield inflated estimates.The LD score regression intercept does not capture this bias because the bias scales with the LD score in the same way as a direct genetic effect.Heterogeneous effect sizes.Because educational institutions vary across places and time, the effects of specific SNPs may vary across environments.Consistent with such heterogeneity, for the lead SNPs, we reject the joint null hypothesis of homogeneous cohortlevel effects (P = 9.7 × 10 −12 ; Supplementary Fig. 4).Moreover, we found that the inverse-variance-weighted mean genetic correlation of EduYears across pairs of cohorts in our sample is 0.72 (s.e.= 0.14), which is statistically distinguishable from one (P = 0.03).
Our finding of an imperfect genetic correlation replicates earlier results from smaller samples 17,18 .This imperfect genetic correlation is an important factor to consider in power calculations and study design.In the Supplementary Note, we report exploratory analyses that aim to identify specific sources of measurement heterogeneity or gene-environment interactions that may explain the imperfect genetic correlation.Unfortunately, the estimates are noisy, and the only robust finding was that SNP heritability was smaller in cohorts for which the measurement of EduYears was derived from questions with fewer response categories.X-chromosome GWAS results.We supplemented our autosomal analyses with association analyses of SNPs on the X chromosome.We first conducted separate association analyses of males (n = 152,608) and females (n = 176,750) in the UK Biobank.We found a male-female genetic correlation close to unity.We also found nearly identical SNP heritability estimates for men and women, which is consistent with partial dosage compensation (that is, on average the per-allele effect sizes are smaller in women) and indicates that any contribution of common variants on the X chromosome to sex differences in the normal-range variance of cognitive phenotypes 19 is quantitatively negligible.
Next, we conducted a large (n = 694,894) meta-analysis of summary statistics from mixed-sex analyses (Supplementary Fig. 5).We identified 10 lead SNPs and estimated a SNP heritability due to the X chromosome of approximately 0.3% (Supplementary Table 4).This heritability is lower than that expected for an autosome of similar length (Supplementary Fig. 6 and Supplementary Table 5).We cannot distinguish whether the lower heritability is due to smaller per-allele effect sizes for SNPs on the X chromosome or to the combination of haploidy in males and (partial) X inactivation in females.

Biological annotation.
For biological annotation, we focus on the results from the autosomal meta-analysis of EduYears.Across an extensive set of analyses (see Supplementary Fig. 7 for a flow chart), all major conclusions from the largest previous GWAS of EduYears 10 continue to hold but are statistically stronger.For example, we applied the bioinformatic tool DEPICT 20 and found that, relative to other genes, genes near our lead SNPs were overwhelmingly enriched for expression in the central nervous system (Fig. 3a and Supplementary Table 6).
There are also many novel findings associated with the large number of genes newly implicated by our analyses.At the standard false discovery rate (FDR) threshold of 5%, the bioinformatic tool DEPICT 20 prioritizes 1,838 genes (Supplementary Table 7), a tenfold increase relative to the DEPICT results from an earlier GWAS of EduYears 10 .In the following paragraphs, we distinguish between the 1,703 'newly prioritized' genes and the 135 'previously prioritized' genes.
An extensive analysis of many of the newly prioritized genes and their brain-related functions are described in the Supplementary Note.Here we highlight two especially noteworthy regularities.First, whereas previously prioritized genes exhibited especially high expression in the brain prenatally, the newly prioritized genes show elevated levels of expression both pre-and postnatally (Fig. 3b).Many of the newly prioritized genes encode proteins that carry out neurophysiological functions such as neurotransmitter secretion, the activation of ion channels and metabotropic pathways, and synaptic plasticity (Supplementary Fig. 8).
Second, even though glial cells are at least as numerous as neurons in the human brain 21 , gene sets related to glial cells (astrocytes, myelination and positive regulation of gliogenesis) are absent from those identified as positively enriched (Supplementary Table 8).Furthermore, using stratified LD score regression 22 , we estimated relatively weak enrichment of genes highly expressed in glial cells (Supplementary Table 9): 1.08-fold for astrocytes (P = 0.07) and 1.09-fold for oligodendrocytes (P = 0.06) versus 1.33-fold for neurons (P = 2.89 × 10 −11 ).Because myelination increases the speed with which signals are transmitted along axons 23 , the absence of enrichment of genes related to glial cells may weigh against the hypothesis that differences across people in cognition are driven by differences in transmission speed.
The results also raise a number of possible targets for functional studies.Among SNPs within 50 kb of lead SNPs, 127 of them are identified by the fine-mapping tool CAVIARBF 24 as likely causal SNPs (posterior probability > 0.9; Supplementary Table 10).Eight of these are non-synonymous, and one of these eight (rs61734410) is located in CACNA1H (Supplementary Fig. 9), which encodes the pore-forming subunit of a voltage-gated calcium channel that has been implicated in the trafficking of N-methyl-D-aspartate receptors 25 .
Polygenic prediction.Polygenic predictors derived from earlier GWAS of EduYears have proven to be a valuable tool for researchers, especially in the social sciences 6,7 .We constructed polygenic scores for individuals of European ancestry in two prediction cohorts: the National Longitudinal Study of Adolescent to Adult Health (Add Health, n = 4,775), a representative sample of American adolescents; and the Health and Retirement Study (HRS, n = 8,609), a representative sample of Americans over the age of 50.We measure prediction accuracy by the 'incremental R 2 ' statistic: the gain in the coefficient of determination (R 2 ) when the score is added as a covariate to a regression of the phenotype on a set of baseline controls (sex, birth year, their interaction and 10 principal components of the genetic relatedness matrix).
All scores are based on the results from a meta-analysis that excluded the prediction cohorts.Our first four scores were constructed from sets of LD-pruned SNPs associated with EduYears at various P-value thresholds: 5 × 10 −8 , 5 × 10 −5 , 5 × 10 −3 and 1 (that is, all SNPs).In both cohorts, the predictive power is greater for scores b, Sign concordance for LD-pruned SNPs reaching P < 5 × 10 −5 (4,594 SNPs).c, Sign concordance for LD-pruned SNPs reaching P < 5 × 10 −8 (that is, lead SNPs; 1,318 SNPs).Each panel compares the observed sign concordance between within-family and GWAS estimates to the distributions expected by chance alone (pink); according to a Bayesian framework that adjusts the GWAS estimates for bias due to winner's curse (green); and according to the same framework with an additional adjustment for bias due to assortative mating (blue).These results are based on a GWAS sample size of 1,070,751 individuals and a within-family sample of 22,135 sibling pairs (44,270 individuals).
constructed with less stringent thresholds (Supplementary Fig. 10).The sample-size-weighted mean incremental R 2 increases from 3.2% at P < 5 × 10 −8 to 9.4% at P ≤ 1.Our fifth score was generated from HapMap3 SNPs using the software LDpred 26 .Rather than removing SNPs that are in LD with each other, LDpred is a Bayesian method that weights each SNP by (an approximation to) the posterior mean of its conditional effect, given other SNPs.This score was the most predictive in both cohorts, with an incremental R 2 of 12.7% in AddHealth and 10.6% in HRS (and a sample-size weighted mean of 11.4%).
To put the predictive power of this score in perspective, Fig. 4a shows the mean college completion rate by polygenic-score quintile.The difference between the bottom and top quintiles in Add Health and HRS is, respectively, 45 and 36 percentage points (see Supplementary Fig. 11 for analogous analyses of high school completion and grade retention).Figure 4b compares the incremental R 2 of the score to that of standard demographic variables.The score is a better predictor of EduYears than household income and a worse predictor than the educational attainment of the mother or father.Controlling for all the demographic variables jointly, the score's incremental R 2 is 4.6% (Supplementary Fig. 12).
We also found that the score has substantial predictive power for a variety of other cognitive phenotypes measured in the prediction cohorts (Supplementary Fig. 13).For example, it explains 9.2% of the variance in overall grade point average in Add Health.
Because the discovery sample used to construct the score consisted of individuals of European ancestry, we would not expect the predictive power of our score to be as high in other ancestry groups 7,27,28 .Indeed, when our score was used to predict EduYears in a sample of African-Americans from the HRS (n = 1,519), the score only has an incremental R 2 of 1.6%, implying an attenuation of 85%.The Supplementary Note shows that this amount of attenuation is typical of what has been reported in previous studies.

Related cognitive phenotypes and multi-trait analysis of GWAS.
We performed GWAS on three complementary phenotypes: cognitive performance (n = 257,841), self-reported math ability (n = 564,698), and highest math class taken (highest math, n = 430,445).For cognitive performance, we meta-analyzed published results from the COGENT consortium 29 with results based on new analyses of the UK Biobank (UKB), as did another study 30 .For the two math phenotypes, we studied new genome-wide analyses using samples of research participants from 23andMe.We identified 225, 618 and 365 genome-wide significant SNPs for cognitive performance, math ability and highest math, respectively (Supplementary Figs.14-16 and Supplementary Tables 11-13).
We conducted a multi-trait analysis of EduYears and our supplementary phenotypes to improve polygenic prediction accuracy.These phenotypes are well suited to joint analysis because their pairwise genetic correlations are high, in all cases exceeding 0.5 (Supplementary Table 14).We applied a recently developed method, multi-trait analysis of GWAS (MTAG) 11 to summary statistics for the four phenotypes from meta-analyses that exclude the prediction cohorts.For all four phenotypes, MTAG increases the number of lead SNPs identified at genome-wide significance (Supplementary Figs.17-20 and Supplementary Table 15).Figure 4c shows the incremental R 2 for the polygenic scores based on GWAS and MTAG association statistics (but otherwise constructed using identical methods) when the target phenotype is either EduYears (left panel) or cognitive performance (right panel).
In Add Health, in which our measure of cognitive performance is the respondent's score on a test of verbal cognition, the incremental R 2 values of the GWAS and MTAG scores are 5.1% and 6.9%, respectively.To obtain a better measure of prediction accuracy for cognitive performance, we used an additional validation cohort, the Wisconsin Longitudinal Study (WLS), which administered a cognitive test with excellent retest reliability and psychometric properties that were similar to those used in our discovery GWAS of cognitive performance.In the WLS, the MTAG score predicts 9.7% of the variance in cognitive performance, a substantial improvement over the 7.0% predicted by the GWAS score and approximately double the prediction accuracy reported in three recent GWAS analyses of cognitive performance [30][31][32] .Fig. 3 | Tissue-specific expression of genes in DEPICT-defined loci.a, We took microarray measurements from the Gene Expression Omnibus 20 and determined whether the genes overlapping EduYears-associated loci (as defined by DEPICT) are significantly overexpressed (relative to genes in random sets of loci) in each of 180 tissues or cell types.These types are grouped by first-level terms according to the medical subject headings (MeSH).The y axis is the one-sided P value from DEPICT on a -log 10 scale.The 28 dark bars correspond to tissues or cell types in which the genes are significantly overexpressed (FDR < 0.01), including all 22 classified as part of the central nervous system (see Supplementary Table 6 for identifiers of all tissues and cell types).b, Whereas genes prioritized by DEPICT in a previous analysis based on a smaller sample 10 tend to be more strongly expressed in the brain prenatally (red curve), the 1,703 newly prioritized genes show a flat trajectory of expression across development (blue curve).Both groups of DEPICTprioritized genes show elevated levels of expression relative to protein-coding genes that are not prioritized (gray curve).Analyses were based on RNA-sequencing data from the BrainSpan Developmental Transcriptome 35 .These results are based on the full GWAS sample of 1,131,881 individuals.
Error bars represents 95% confidence intervals.RPKM, reads per kilobase of transcript per million reads mapped.

Articles
Nature GeNetics

Discussion
The results of this study illustrate what the advocates of GWAS anticipated: as sample sizes get larger, thousands of lead SNPs will be identified, and polygenic predictors will attain non-trivial levels of predictive power.However, theoretical projections that failed to consider heterogeneity of effect sizes were optimistic 4 .Our and others' findings 17,18 suggest that imperfect genetic correlation across cohorts will be the norm for phenotypes, such as educational attainment, that are environmentally contingent.For research at the intersection of genetics and neuroscience, the set of 1,271 lead SNPs that we identify is a treasure trove for future analyses.For research in social science and epidemiology, the polygenic scores that we construct-which explain 11-13% and 7-10% of the variance in educational attainment and cognitive performance, respectively-will prove useful across at least three types of applications.
First, by examining associations between the scores and highquality measures of endophenotypes, researchers may be able to disentangle the mechanisms by which genetic factors affect educational attainment and cognitive phenotypes.Such studies are already being conducted with polygenic scores from earlier GWAS of educational attainment 6,7 , but they can now be well powered in samples as small as those from laboratory experiments.For example, if our polygenic score explains 10% of the variance in an endophenotype, then its effect can be detected at a 5% significance threshold with 80% power in a sample of only 75 individuals.Second, the polygenic scores can be used as control variables in randomized controlled trials (RCTs) of interventions that aim to improve academic and cognitive outcomes.Given the current levels of predictive power of the scores, such use can now generate non-trivial gains in statistical power for the RCT.For example, if adding the polygenic score to the set of control variables in an RCT increases their joint explanatory power from 10% to 20%, then the gain in power from including the polygenic score is equivalent to increasing the sample size of the RCT by 11% (for such calculations, see the supplementary online material of a previous study 4 ).Third, the polygenic scores can be used as a tool for exploring gene-environment interactions 33 , which are known to be important for genetic effects on educational attainment and cognitive performance 1,34 .
Our results also highlight two caveats to the use of the polygenic scores in research.First, our within-family analyses suggest that GWAS estimates may overstate the causal effect sizes: if educational attainment-increasing genotypes are associated with parental educational attainment-increasing genotypes, which are in turn associated with rearing environments that promote educational attainment, then failure to control for rearing environment will bias GWAS estimates.If this hypothesis is correct, some of the predictive power of the polygenic score reflects environmental amplification of the genetic effects.Without controls for this bias, it is therefore inappropriate to interpret the polygenic score for educational attainment as a measure of genetic endowment.
Second, we found that our score for educational attainment has much lower predictive power in a sample of African-American individuals than in a sample of individuals with an European ancestry, and we anticipate that the score would also have reduced predictive power in other samples of individuals with a non-European ancestry.Therefore, until polygenic scores are available that have as much predictive power in other ancestry groups, the score will be most useful in research that is focused on samples of individuals with an European ancestry.

Fig. 1 |
Fig. 1 | Manhattan Plot for GWAS of EduYears.The P value and mean χ 2 value are based on inflation-adjusted test statistics.The x axis is chromosomal position and the y axis is the significance on a -log 10 scale.The dashed line marks the threshold for genome-wide significance (P = 5 × 10 −8 ) (n = 1,131,881).

Fig. 2 |
Fig.2| Sign concordance in within-family association analyses.a, Sign concordance for LD-pruned SNPs reaching P < 5 × 10 −3 (14,670 SNPs).b, Sign concordance for LD-pruned SNPs reaching P < 5 × 10 −5 (4,594 SNPs).c, Sign concordance for LD-pruned SNPs reaching P < 5 × 10 −8 (that is, lead SNPs; 1,318 SNPs).Each panel compares the observed sign concordance between within-family and GWAS estimates to the distributions expected by chance alone (pink); according to a Bayesian framework that adjusts the GWAS estimates for bias due to winner's curse (green); and according to the same framework with an additional adjustment for bias due to assortative mating (blue).These results are based on a GWAS sample size of 1,070,751 individuals and a within-family sample of 22,135 sibling pairs(44,270 individuals).

Fig. 4 |
Fig. 4 | Prediction Accuracy.a, Mean prevalence of college completion by EduYears polygenic score (PGS) quintile.Data are mean ± 95% confidence interval.b, Incremental R 2 of the EduYears PGS compared to that of other variables.c, Incremental R 2 of the PGS for EduYears and cognitive performance constructed from the respective GWAS or MTAG summary statistics.Error bars for the R 2 values show bootstrapped 95% confidence intervals with 1,000 iterations each.Sample sizes are n = 4,775 for Add Health and n = 8,609 for HRS.