Ethics statement

All data for this research project were from individuals who provided prior informed consent to participate in AncestryDNA’s Human Diversity Project, as reviewed and approved by our external institutional review board, Advarra (formerly Quorum). All data were deidentified before use.

Study population

Self-reported COVID-19 outcomes were collected through the Personal Discoveries Project, a survey platform available to AncestryDNA customers via the web and mobile applications. The COVID-19 survey ranged from 39 to 71 questions, depending on the initial COVID-19 test result reported. Supplementary Fig. 1 describes the flow of the topics assessed in each section of the survey, and the complete set of questions is in the Supplementary Note 1. Analyses presented here were performed with data collected between 22 April and 3 August 2020.

To participate in the COVID-19 survey, participants were required to meet the following criteria: 18 years of age or older, resident of the United States, existing AncestryDNA customer who has consented to participate in research and able to complete a short survey. The survey was designed to assess self-reported COVID-19 positivity and severity, as well as susceptibility and known risk factors, including community exposure and known contacts with individuals diagnosed with COVID-19.

Binary phenotype definitions

In total, we assessed eight phenotypes, which are summarized in Table 2. The full survey is available in Supplementary Note 1, but key definitions for this work include testing positive or negative, hospitalization, asymptomatic cases and cohabitant exposure. COVID-19 positivity or negativity was assessed by the question ‘Have you been swab tested for COVID-19, commonly referred to as coronavirus?’. Hospitalization due to COVID-19 illness was used as one binary measure of severity and was assessed with the question ‘Were you hospitalized due to these symptoms?’. Asymptomatic individuals were defined as those that were positive for COVID-19 and either answered ‘No’ to the question ‘Did you experience symptoms as a result of your condition?’ or answered one of ‘None’, ‘Very mild’ or ‘Mild’ to all 15 questions related to symptom severity. High exposure to COVID-19 was assessed through having a COVID-19 positive cohabitating person, assessed by the question ‘Has someone in your household tested positive for COVID-19?’.

Continuous severity phenotype creation

A continuous severity score was derived by computing the first principal component across nine survey fields related to COVID-19 clinical outcomes. Six of the nine questions were binary: hospitalization, ICU admittance with oxygen, ICU admittance with ventilation, septic shock, respiratory failure and organ failure due to COVID-19. Binary responses were encoded as 0 for ‘No’ and 1 for ‘Yes’. Three questions related to shortness of breath, fever and nausea/vomiting symptoms were encoded as a unit-scaled variable based on the following mapping: 0 = none, 0.2 = very mild, 0.4 = mild, 0.6 = moderate, 0.8 = severe and 1.0 = very severe. These three symptoms were chosen based on prior literature indicating their positive association with COVID-19 hospitalization7. The following assumptions were made so that a score could be calculated for most participants who reported a positive COVID-19 test result: (1) participants who responded ‘No’ to the question ‘Did you experience symptoms as a result of your condition?’ were not presented with additional questions regarding symptomatology or hospitalization and thus were encoded as 0 for all individual symptoms (shortness of breath, fever or nausea/vomiting), hospitalization, ICU admittance and severe complications due to COVID-19 illness; (2) participants who responded ‘No’ to the question ‘Were you hospitalized due to these symptoms?’ were not presented any further questions regarding hospitalization and thus were encoded as 0 for ICU admittance and supplemental oxygen; and (3) participants who declined to answer a question about complications due to COVID-19 illness such as septic shock, respiratory failure and organ failure were encoded as 0 for those complications (<2% of all participants for whom continuous severity was scored).

Power analysis

Power analysis was performed with the Purcell Power Calculator for case–control discrete traits (https://zzz.bwh.harvard.edu/gpc/cc2.html, created 24 October 2008). Power was not computed for the Continuous_Severity_Score, because this phenotype is continuous and thus the effect size is on a different scale that is not comparable to the other seven binary phenotypes. Allelic power for each phenotype was computed based on the ‘Prevalence’, ‘Minority Class Cases’, ‘Control:Case’ and ‘Unselected controls?’ fields as defined in Supplementary Table 3. For all seven phenotypes, the same effect size, type I error rates and MAFs were assumed: (1) ‘Genotype relative risk Aa’ = 1.25 (equivalent protective RR Aa = 0.80), (2) ‘Genotype relative risk AA’ = 1.252 = 1.56, (3) ‘User-defined type I error rate’ = 0.05 and (4) ‘High risk allele frequency (A)’ {0.01, 0.05, 0.10, 0.20, 0.30, 0.40, 0.50}. For all phenotypes, perfect linkage disequilibrium (LD) tagging of the causal variant was assumed, and thus ‘D-prime’ = 1 and ‘Marker allele frequency (B)’ = ‘High risk allele frequency (A)’.


Customer genotype data for this study were generated using one of ten versions of a custom Illumina Human OmniExpress genotyping array. Array-based genotyping and SNP calling were performed by either Illumina with the GenotypeStudio Platform or Quest/Athena Diagnostics, a Clinical Laboratory Improvement Amendments-certified genotyping lab. These providers return called genotypes. To ensure quality of each dataset, a sample passes a number of quality control checks, including identifying duplicate samples, removing individuals with a per-sample call rate <98%, and identifying discrepancies between reported sex and genetically inferred sex. Samples that pass all quality control tests proceed to the analysis pipeline; samples that fail one or more tests must be recollected or manually cleared for analysis by lab technicians. Array markers with per-variant call rate <0.98 and array markers that had overall allele frequency differences of >0.10 between any two array versions were additionally removed before downstream analyses.

Defining ancestry cohorts

We defined three separate ancestry cohorts: EUR, LAT and AA (Extended Data Fig. 1). We assigned COVID-19 survey respondents to one of these ancestry groups with a proprietary algorithm that estimates continental admixture proportions. Briefly, this algorithm uses a hidden Markov model to estimate unphased diploid ancestry across the genome by comparing haplotype structure to a reference panel of ~56,000 samples that represent 77 global regions (details available in the Ancestry DNA Ethnicity Estimate White Paper at https://www.ancestrycdn.com/support/us/2021/09/ethnicity2021whitepaper.pdf). The reference panel consists of a combination of AncestryDNA customers and publicly available datasets and is designed to reflect global diversity. From our total cohort of 736,723 individuals who participated in the COVID-19 survey as of 3 August 2020, 537,512 (73%) individuals were designated to the EUR group, 22,464 (3%) to the AA group and 47,301 (6%) to the LAT group, and the remainder were not assigned to any ancestry group (Table 2 and Supplementary Table 2).

Removal of related individuals

AncestryDNA’s identity-by-descent inference algorithm was used to estimate the relationship between pairs of individuals (details available in the Ancestry DNA Matching White Paper at https://www.ancestry.com/dna/resource/whitePaper/AncestryDNA-Matching-White-Paper.pdf). Pairs with estimated separation of fewer than four meioses were considered close relatives. For all close relative pairs, one individual was randomly selected for exclusion from our study. In total, we excluded 60,379 (~8%) individuals from analysis due to relatedness.

Calculation of principal components

For each population described above, genetic principal components were calculated to include in the association studies to control residual population structure and were computed using FlashPCA 2.0.2 (ref. 24). Input genotypes were LD-pruned and filtered using the PLINK 1.9 indep-pairwise command with a window size of 100 SNPs, step size of 5 SNPs and r2 threshold of 0.2 and were filtered to remove SNPs with MAF < 0.05 and missing call rate >0.001.


Samples were imputed to the Haplotype Reference Consortium reference panel version 1.1, which consists of 27,165 total individuals and 36 million variants. The Haplotype Reference Consortium reference panel does not include indels; consequently, indels are not present in the results of our analyses. We determined best-guess haplotypes with Eagle25 version 2.4.1 and performed imputation with Minimac4 version 1.0.1. We used 1,077,214 unique variants as input and 8,187,660 imputed variants were retained in the final dataset. For these variants, we conservatively restricted our analyses to variants with MAF > 0.01 and Minimac4 R2 > 0.30 using imputed dosages for all variants regardless of whether they were originally genotyped.

GWASs in each Ancestry population

We conducted separate GWASs for the EUR, LAT and AA populations described above. For the EUR population only, we conducted sex-stratified GWASs and meta-analyzed the results via inverse-variance weighting implemented in METAL26 (version released 25 March 2011).

For each individual population (EUR female, EUR male, LAT and AA) and each of the eight phenotypes, we conducted a GWAS assuming an additive genetic model with PLINK2.0. Imputed genotype dosage value was the primary predictor. The following were included as fixed-effect covariates: principal components 1–25 (described above), array platform, orthogonal age and orthogonal age2. Orthogonal polynomials were used to eliminate collinearity between age and age2, and were calculated in R version 3.6.0 with base function poly(age, degree = 2). We additionally used PLINK2.0 to remove variants with Minimac4 imputation quality R2 < 0.3 or with MAF < 0.01. See the Supplementary Methods for a list of PLINK2.0 flags used for each analysis. All variant effect estimates are adjusted for the 28 covariates described above.

Trans-ancestry meta-analysis

For each phenotype, we additionally performed a trans-ancestry meta-analysis of the sex-combined EUR, AA and LAT summary statistics, again using fixed-effect inverse-variance weighting implemented in METAL26 (version released 25 March 2011). These summary statistics were used to assess replication of the 12 SNPs and investigate enrichment of protective effects, as defined in the next sections. In the genome-wide analysis, we only examined SNPs that passed quality control in all individual populations. We implemented a stringent, Bonferroni-corrected significance threshold by dividing a trans-ancestry genome-wide significance threshold of P < 1 × 10−8 by the eight phenotypes, which results in P < 1.25 × 10−9. Suggestive significance followed the definition used by the HGI consortium of P < 1 × 10−5. All reported P values are two sided. Manhattan plots for each of the eight trans-ancestry meta-analyses are provided in Extended Data Fig. 3, and corresponding quantile–quantile plots and genomic inflation factors are provided in Extended Data Fig. 4.

Replication of 12 independent SNPs from previous studies

We manually curated a list of 12 independent SNPs that represent lead loci identified by either HGI or Horowitz et al. Eight of the 12 SNPs were lead SNPs in HGI’s October 2020 data freeze 4, without 23andMe data included the data release. These eight SNPs were the most-associated marker at any locus achieving P < 5 × 10−8 in the European hospitalization versus population (‘ANA_B2’) or European COVID-19+ versus population (‘ANA_C2’) study. The remaining four SNPs were selected from Fig. 1 of a recent trans-ancestry meta-analysis by Horowitz et al.27 We note that a subset of AncestryDNA survey respondents overlap those included in the large meta-analyses conducted by HGI and Horowitz et al., and thus, replication in our study is not completely independent (Supplementary Fig. 2). All 12 SNPs in the final list are independent of one another (r2 < 0.05 in the EUR, AA and LAT cohorts) and represent ten positionally distinct loci (>500 kb apart). One of the ten loci encompasses three independent SNPs that span a 54-kb region near SLC6A20/LZTFL1 on chr3. For these 12 index SNPs, we extracted corresponding summary statistics from the trans-ancestry meta-analysis for each phenotype. We computed two-sided −log10(P value) from the trans-ancestry meta-analysis, setting any trans-ancestry P > 0.05 or with inconsistent directions of effect compared to the previous study equal to zero. From the resulting matrix of −log10(P values), we generated a heatmap with R package pheatmap version 1.0.12 and used hierarchical clustering to order the phenotype rows and the SNP columns in an unsupervised fashion. We generated a forest plot of corresponding minor allele effect estimates and 95% confidence intervals with the R package ggforestplot version 0.1.0.

Analysis of enrichment of protective effects

For each of the eight phenotypes, we identified all SNPs that were associated at different P-value significance thresholds (α), where α {0.01, 0.001, 1 × 10−4, 1 × 10−5}. For each suggestive association, we designated the SNP with the lowest trans-ancestry P value within a 500-kb window as the index SNP. We recorded the total number of associated index SNPs and the proportion of index SNPs for which the minor allele was associated with a protective direction of effect. Each index SNP was also binned into one of six MAF bins: [0.01,0.05], (0.05,0.1], (0.1,0.2], (0.2,0.3] (0.3,0.4], (0.4,0.5].

To test for enrichment of protective or risk effects for each phenotype and α threshold, we used a Cochran–Mantel–Haenszel (CMH) test of two proportions with the base R (version 3.6.0) function mantelhaen.test. We compared the proportion of protective index SNPs for each phenotype to the proportion of protective index SNPs identified in the remaining seven phenotypes at the same α. MAF bins were used as strata. We considered a Bonferroni-corrected two-sided CMH P value < 0.05/(8 phenotypes × 4 α thresholds) = 0.00156 evidence for enrichment of risk or protective effects relative to the other phenotypes. We plotted the MAF-adjusted CMH ORs in Fig. 3b, where positive numbers represent enrichment for protective effects. In cases where the protective enrichment OR was less than 1, we reported −(1/OR), so that lower numbers represent greater enrichment of risk effects.

Statistics and reproducibility

This is a cross-sectional genetic association study using extant genetic data from AncestryDNA customers and disease data collected via survey over an approximately 3.5-month period early in the COVID-19 pandemic. No statistical method was used to predetermine sample size. Individuals who could not be assigned to one of the three ancestry cohorts were excluded from analyses. For all close relative pairs, one randomly selected individual was excluded from analyses. The experiments were otherwise not randomized. This was an observational study from self-reported data, and as such, the investigators were not blinded to allocation during experiments and outcome assessment.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.


Source link