Sequencing data

We used publicly available sequencing data from the GIAB consortium45, 1000 Genomes Project high-coverage data46 and Human Genome Structural Variation Consortium (HGSVC)4. All datasets include only samples consented for public dissemination of the full genomes.

Statistics and reproducibility

For generating the assemblies, we used all 14 samples for which PacBio HiFi data were available. For variant calling, the three children (HG00733, HG00514 and NA19240) were used for quality control and were not included in the final callsets/graphs, because they do not provide any additional information for genotyping. Code and pipelines to reproduce our analysis are available on Zenodo68,69.

Variant calling and pangenome construction


Fully phased assemblies for 14 samples (HG00731, HG00732, HG00733, HG00512, HG00513, HG00514, NA19238, NA19239, NA19240, NA12878, HG02818, HG03125, NA24385 and HG03486) were generated using a development version of the PGAS pipeline2,4 (parameter settings v.13). Compared with the previous PGAS production release (v.12 used in the HGSVC project4), this PGAS development update included a new version of the SaaRclust package70 (v.6cb8c96), controlled for adapter contamination in the input HiFi reads (reimplementation of the process published at https://github.com/sheinasim/HiFiAdapterFilt), and employed hifiasm71 v.0.15.2 as the default assembler. In direct comparison to the previously used HiFi assembler Peregrine72, hifiasm substantially reduces the number of sequence collapses, leading to overall more correct assemblies (see the evaluation in Cheng et al.71). We provide assembly statistics in Supplementary Table 1.

Variant calling

We used haplotype-resolved assemblies of all 14 samples to call variants (Extended Data Fig. 1a). The three child samples (HG00733, HG00514 and NA19240) were used only for quality control and filtering, and thus were not part of our final callset/graph. For each sample, we separately mapped contigs of each haplotype (Supplementary Table 1) to the reference genome (GRCh38). This was done using minimap2 (ref. 53) (v.2.18) with parameters -cx asm20 -m 10000 -z 10000,50 -r 50000 --end-bonus=100 -O 5,56 -E 4,1 -B 5 --cs. In the next step, we called variants on each haplotype of all autosomes and chromosome X using paftools (https://github.com/lh3/minimap2/tree/master/misc) with default parameters. We generated a biallelic VCF file containing variant calls made across all 11 unrelated samples (Extended Data Fig. 1a). If a region was not covered by any contig alignment in a sample, or the sample had multiple overlapping contig alignments, we set all its genotypes in this region to missing (“./.”), because it is unclear what the true genotype alleles are in this case. Furthermore, we removed variants from our callset for which >20% of the samples have missing genotype information. The remaining regions covered 91.8% (2.8 Gbp) of chromosomes 1–22 and chromosome X. Of the 8.2% of regions not covered, 48.3% were gaps in GRCh38 and 24.0% were centromeres.
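The missing-genotype filter described above can be sketched as follows. This is a simplified illustration with hypothetical inputs (`coverages` holds the number of overlapping contig alignments per sample in the region); it is not the published pipeline code:

```python
def filter_missing(genotypes, coverages, max_missing=0.2):
    """Set a sample's genotype to missing ('./.') when the region is
    covered by zero or by multiple overlapping contig alignments, then
    drop the variant (return None) if more than max_missing of the
    samples end up missing."""
    gts = [gt if cov == 1 else "./." for gt, cov in zip(genotypes, coverages)]
    missing_fraction = gts.count("./.") / len(gts)
    return gts if missing_fraction <= max_missing else None
```

For example, with 10 samples of which one has no contig alignment, the variant is kept with one genotype set to missing; with three uncovered samples (30% missing) the variant is dropped.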

We computed the Mendelian consistency for the Puerto Rican (HG00731, HG00732, HG00733), Chinese (HG00512, HG00513, HG00514) and Yoruban (NA19238, NA19239, NA19240) trios and observed that 97.9%, 96.8% and 97.6% of all variants were consistent with Mendelian laws, respectively. We removed a variant from our callset if there was a Mendelian conflict in at least one of the three trios. We show the number of variants in our final callset and the intermediate stages of variant calling in the first three columns of Supplementary Table 2.
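The Mendelian-consistency filter can be illustrated with a short sketch (genotypes as unphased allele tuples; the function names are ours, not part of the published pipeline):

```python
from itertools import product

def mendelian_consistent(child, mother, father):
    """A trio of diploid genotypes (tuples of alleles, e.g. (0, 1)) is
    Mendelian consistent if the child's genotype can be formed by taking
    one allele from each parent."""
    return any(
        sorted(child) == sorted((m, f))
        for m, f in product(mother, father)
    )

def keep_variant(trio_genotypes):
    """A variant is removed if any trio shows a Mendelian conflict."""
    return all(mendelian_consistent(c, m, f) for c, m, f in trio_genotypes)
```

For instance, child 0/1 with parents 0/0 and 1/1 is consistent, whereas child 1/1 with parents 0/0 and 0/1 is a conflict that removes the variant.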

Pangenome construction

Given the filtered variant calls, our goal was to construct an acyclic and directed graph by inserting the variants of all haplotypes into the linear reference genome. Variants produce bubbles in the graph with branches that define the corresponding alleles. The input haplotypes can be represented as paths through the resulting pangenome. When constructing the graph, we represent sets of variants overlapping across haplotypes as a single bubble, with potentially multiple branches reflecting all the allele sequences observed in the haplotypes in the respective genomic region (Extended Data Fig. 1b). The total number of bubbles in the resulting graph is presented in the last column of Supplementary Table 2. We represent the pangenome in terms of a fully phased, multisample VCF file that contains one entry for each bubble in the graph (Extended Data Fig. 1b). At each site, the number of branches of the bubble is limited by the number of input haplotype sequences, and the genotypes of each sample define two paths through this graph, corresponding to the respective haplotypes. We keep track of which individual input variants contribute to each bubble in the graph, so that we can convert our pangenome graph representation back to the set of input variants. In this way, we can convert genotypes computed by a genotyper for all these bubbles to genotypes for each individual callset variant.
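The bookkeeping that projects a bubble genotype back onto the individual input variants might look like the following sketch (the `allele_to_variants` mapping and all identifiers are hypothetical, chosen for illustration):

```python
def bubble_to_variant_genotypes(bubble_gt, allele_to_variants):
    """Project a genotype over a multiallelic bubble back onto the
    individual input variants it was built from. bubble_gt is a pair of
    branch indices; allele_to_variants[a] is the set of variant IDs
    carried by bubble branch a (branch 0 is the reference path)."""
    variants = set().union(*allele_to_variants)
    out = {}
    for var in variants:
        # each chosen branch contributes allele 1 if it carries the variant
        alleles = [1 if var in allele_to_variants[a] else 0 for a in bubble_gt]
        out[var] = "/".join(map(str, sorted(alleles)))
    return out
```

For a bubble whose branch 1 carries a deletion and whose branch 2 carries the same deletion plus an SNV, the genotype 1/2 converts to a homozygous deletion and a heterozygous SNV.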

PanGenie’s genotyping algorithm

We define a hidden Markov model that can be used to compute the two most likely haplotype sequences of a given sample based on known haplotype paths and the sample reads. The new haplotype sequences are combinations of the existing paths through the graph and are computed based on the copy numbers of unique k-mers observed in the sequencing reads provided for the sample to be genotyped.

Identifying unique k-mers

Sets of bubbles that are less than the k-mer size apart (we use k = 31) are combined and treated as a single bubble. The alleles corresponding to such a combined bubble are defined by the haplotype paths in the respective region. For each bubble position v, we determine a set of k-mers, kmersv, that uniquely characterize the region. This is done by counting all k-mers along haplotype paths in the pangenome graph using Jellyfish73 (v.2.2.10), and then determining, for each bubble, a set of k-mers that occur at most once within a single allele sequence and are not found anywhere outside the variant bubble. We additionally counted all k-mers of the graph in the sequencing reads. This allows us to compute the mean k-mer coverage of the data, which we use later to compute emission probabilities (see Observable states).
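A simplified sketch of this k-mer selection, assuming toy data structures (a plain counter stands in for the Jellyfish counts over all haplotype paths; names are ours):

```python
from collections import Counter

K = 31  # k-mer size used in the paper

def kmers(seq, k=K):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def unique_kmers(allele_seqs, graph_counts, k=K):
    """Select k-mers that characterize a bubble: each occurs at most
    once within any single allele sequence, and its count along all
    haplotype paths (graph_counts) does not exceed its count inside the
    bubble, i.e. it is absent elsewhere in the graph."""
    bubble_counts = Counter(km for seq in allele_seqs for km in kmers(seq, k))
    selected = set()
    for seq in allele_seqs:
        for km, count in Counter(kmers(seq, k)).items():
            if count == 1 and graph_counts[km] <= bubble_counts[km]:
                selected.add(km)
    return selected
```

A k-mer that also occurs outside the bubble has a higher graph-wide count than its bubble-internal count and is therefore excluded.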

Hidden states and transitions

We assume we are given N haplotype paths Hi, i = 1, …, N, through the graph. Furthermore, for each bubble v, v = 1, …, M, we are given a vector of k-mers, kmersv, that uniquely characterize the alleles of a bubble. We assume some (arbitrary) order of the elements in kmersv and refer to the ith k-mer as kmersv[i]. In addition, we are given sequencing data of the sample to be genotyped and corresponding k-mer counts for all k-mers in kmersv. For each bubble v, we define a set of hidden states $\eta_v = \{H_{v,i,j} \mid i,j \le N\}$, which contains a state for each possible pair of the N given haplotype paths in the graph. Each such state Hv,i,j induces an assignment of copy numbers to all k-mers in kmersv. We define a vector av,i,j such that the kth position contains the copy number assigned to the kth k-mer in kmersv:

$$\mathbf{a}_{v,i,j}[k] = \begin{cases} 0 & kmers_v[k] \notin H_i \cup H_j \\ 1 & kmers_v[k] \in H_i \setminus H_j \\ 1 & kmers_v[k] \in H_j \setminus H_i \\ 2 & kmers_v[k] \in H_i \cap H_j \end{cases} \qquad \forall k = 1, \ldots, |kmers_v|$$

The idea here is that we expect to see copy number 2 for all k-mers occurring on both haplotype paths. In case only one of the haplotypes contains a k-mer, its copy number must be 1 and k-mers that do not appear in any of the two paths must have copy number 0. From each state Hv,i,j in ηv that corresponds to bubble position v, there is a transition to each state corresponding to the next position, v + 1. In addition, there is a start state, from which there is a transition to each state of the first bubble, and an end state, to which there is a transition from each state that corresponds to the last bubble.
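The copy-number assignment reduces to a membership test per k-mer, as in this sketch (data structures are ours, for illustration):

```python
def copy_numbers(bubble_kmers, hap_i_kmers, hap_j_kmers):
    """Expected copy number a_{v,i,j}[k] for each unique k-mer of a
    bubble, given the k-mer contents of haplotype paths H_i and H_j:
    2 if the k-mer lies on both paths, 1 on exactly one, 0 on neither."""
    return [
        (km in hap_i_kmers) + (km in hap_j_kmers)
        for km in bubble_kmers
    ]
```

For example, a k-mer present on both chosen paths gets copy number 2, one private to a single path gets 1, and one absent from both gets 0.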

Transition probabilities

Transition probabilities are computed following the Li–Stephens model37. Given a recombination rate r, the effective population size Ne and the distance x (in basepairs) between two consecutive bubbles v − 1 and v, we define:

$$d = x \times \frac{1}{1{,}000{,}000} \times 4 r N_e.$$

We compute the Li–Stephens transition probabilities as:

$$p_r = \left(1 - \exp\left(-\frac{d}{N}\right)\right) \times \frac{1}{N}$$

$$q_r = \exp\left(-\frac{d}{N}\right) + p_r.$$

Finally, the transition probability from state Hv–1,k,l to state Hv,i,j is computed as shown below:

$$P(H_{v,i,j} \mid H_{v-1,k,l}) = \begin{cases} q_r \times q_r & i = k, j = l \\ q_r \times p_r & i = k, j \ne l \\ q_r \times p_r & i \ne k, j = l \\ p_r \times p_r & i \ne k, j \ne l. \end{cases}$$
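These transition probabilities can be computed as in the following sketch. The defaults for the recombination rate and effective population size are illustrative placeholders, not the values fixed by PanGenie:

```python
import math

def transition_prob(i, j, k, l, x, N, r=1.26, Ne=25000):
    """Li-Stephens transition probability P(H_{v,i,j} | H_{v-1,k,l}):
    x is the distance in bp between consecutive bubbles, N the number
    of haplotype paths; r (recombination rate) and Ne (effective
    population size) defaults are illustrative only."""
    d = x / 1_000_000 * 4 * r * Ne
    p_r = (1 - math.exp(-d / N)) * (1 / N)
    q_r = math.exp(-d / N) + p_r
    # the two haplotypes switch paths independently
    return (q_r if i == k else p_r) * (q_r if j == l else p_r)
```

Because each haplotype switches paths independently and q_r + (N − 1)p_r = 1, the probabilities over all N × N target states sum to one.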

Observable states

Each hidden state Hv,i,j in ηv outputs a count for each k-mer in kmersv. Let Ov be a vector of length |kmersv| for bubble v such that Ov[k] contains the observed k-mer count of the kth k-mer in the sequencing reads. To define the emission probabilities, we first need to model the distribution of k-mer counts for each copy number, $P(\mathbf{O}_v[k] \mid \mathbf{a}_{v,i,j}[k] = c)$, $c = 0, 1, 2$. For copy number 2, we use a Poisson distribution with mean λ, which we set to the mean k-mer coverage that we compute from the k-mer counts of all graph k-mers. Similarly, we approximate the k-mer count distribution for copy number 1 in terms of a Poisson distribution with mean $\frac{\lambda}{2}$. For copy number 0, we need to model the erroneous k-mers that arise from sequencing errors. This is done using a geometric distribution, the parameter p of which we choose based on the mean k-mer coverage. Finally, we compute the emission probability for a given state and given observed read k-mer counts as shown below, making the assumption that the k-mer counts are independent:

$$P(\mathbf{O}_v \mid H_{v,i,j}) = \prod_{l=1}^{|kmers_v|} P(\mathbf{O}_v[l] \mid \mathbf{a}_{v,i,j}[l]).$$
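A sketch of the emission model under the stated independence assumption (the fixed `p_err` for the geometric error distribution is an illustrative choice; as described above, PanGenie derives it from the mean k-mer coverage):

```python
import math

def poisson_pmf(count, lam):
    return lam ** count * math.exp(-lam) / math.factorial(count)

def geometric_pmf(count, p):
    # support 0, 1, 2, ...: counts of erroneous k-mers
    return p * (1 - p) ** count

def emission_prob(observed_counts, expected_cn, mean_cov, p_err=0.05):
    """P(O_v | H_{v,i,j}) as a product over the bubble's unique k-mers:
    Poisson(mean_cov) for copy number 2, Poisson(mean_cov / 2) for copy
    number 1, and a geometric error distribution for copy number 0."""
    prob = 1.0
    for obs, cn in zip(observed_counts, expected_cn):
        if cn == 2:
            prob *= poisson_pmf(obs, mean_cov)
        elif cn == 1:
            prob *= poisson_pmf(obs, mean_cov / 2)
        else:
            prob *= geometric_pmf(obs, p_err)
    return prob
```

For a k-mer with expected copy number 2 at mean coverage 30, an observed count near 30 is far more likely than a count of 0, which is the signal the model exploits.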

Genotypes and haplotypes

In this model, genotypes correspond to pairs of given haplotype paths at each bubble position. Genotype likelihoods can be computed using the forward–backward algorithm.

Forward–backward algorithm

The initial distribution of our HMM is such that we assign probability 1 to the start state and 0 to all others. Forward probabilities αv() are computed in the following way:

$$\alpha_0(\mathrm{start}) = 1.$$

For states corresponding to bubbles v = 1, …, M, the forward probabilities are computed as shown below. The set of observed k-mer counts at position v is given by Ov:

$$\alpha_v(H_{v,i,j}) = \sum_{H_{v-1,s,t} \in \eta_{v-1}} \alpha_{v-1}(H_{v-1,s,t}) \times P(H_{v,i,j} \mid H_{v-1,s,t}) \times P(\mathbf{O}_v \mid H_{v,i,j}) \quad \forall i,j.$$

The transition probabilities are computed as described above, except for transitions from the start state to all states in the first column, which we assume to have uniform probabilities. Backward probabilities are computed in a similar manner. We set:

$$\beta_M(\mathrm{end}) = 1.$$

For $v = 1, \ldots, M - 1$, we compute them as:

$$\beta_v(H_{v,i,j}) = \sum_{H_{v+1,s,t} \in \eta_{v+1}} \beta_{v+1}(H_{v+1,s,t}) \times P(H_{v+1,s,t} \mid H_{v,i,j}) \times P(\mathbf{O}_{v+1} \mid H_{v+1,s,t}) \quad \forall i,j.$$

Finally, posterior probabilities for the states can be computed:

$$P(H_{v,i,j} \mid \mathbf{O}_1, \mathbf{O}_2, \ldots, \mathbf{O}_M) = \frac{\alpha_v(H_{v,i,j}) \times \beta_v(H_{v,i,j})}{\sum_{h \in \eta_v} \alpha_v(h) \beta_v(h)}.$$

Several states at a bubble position v can correspond to the same genotype, because different paths can cover the same allele. Also, the alleles in a genotype are unordered; therefore, states Hv,i,j and Hv,j,i always lead to the same genotype. To compute genotype likelihoods, we sum up the posterior probabilities for all states that correspond to the same genotype. In this way, we can compute genotype likelihoods for all genotypes at a bubble position, based on which a genotype prediction can be made.
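The forward–backward passes and the aggregation of state posteriors into genotype likelihoods can be sketched as follows (uniform start probabilities as stated above; the data structures and names are ours, not PanGenie's implementation):

```python
from collections import defaultdict

def genotype_likelihoods(emissions, trans, alleles):
    """Forward-backward over the bubble HMM.
    emissions[v][(i, j)] = P(O_v | H_{v,i,j});
    trans(v, (k, l), (i, j)) = P(H_{v,i,j} | H_{v-1,k,l});
    alleles[v][i] = allele covered by haplotype path i at bubble v.
    Returns, per bubble, likelihoods of unordered genotypes."""
    M = len(emissions)
    states = list(emissions[0].keys())
    # forward pass (uniform transition out of the start state)
    fwd = [{s: emissions[0][s] / len(states) for s in states}]
    for v in range(1, M):
        fwd.append({s: sum(fwd[v - 1][t] * trans(v, t, s) for t in states)
                       * emissions[v][s] for s in states})
    # backward pass
    bwd = [dict() for _ in range(M)]
    bwd[M - 1] = {s: 1.0 for s in states}
    for v in range(M - 2, -1, -1):
        bwd[v] = {s: sum(trans(v + 1, s, t) * emissions[v + 1][t] * bwd[v + 1][t]
                         for t in states) for s in states}
    # normalized posteriors, summed over states with the same genotype
    result = []
    for v in range(M):
        post = {s: fwd[v][s] * bwd[v][s] for s in states}
        norm = sum(post.values())
        gls = defaultdict(float)
        for (i, j), p in post.items():
            gt = tuple(sorted((alleles[v][i], alleles[v][j])))
            gls[gt] += p / norm
        result.append(dict(gls))
    return result
```

Summing the posteriors of Hv,i,j and Hv,j,i (and of any other path pair covering the same allele pair) yields one likelihood per unordered genotype, normalized to sum to one at each bubble.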

Comparison to existing genotyping methods

We conducted a ‘leave-one-out’ experiment to mimic a realistic scenario in which we genotyped variants detected from haplotype-resolved assemblies of a set of known samples in a new, unknown sample. We collected variants called across all but one sample and used them as input for genotyping the left-out sample (we refer to this set as known variants in the following). We used the set of variants called from the assemblies of the left-out sample for evaluation (evaluation variants). We ran this experiment twice, removing samples NA12878 and NA24385, respectively. As input for PanGenie (commit 1f3d2d2 (ref. 68)), BayesTyper (v.1.5) and Paragraph (v.2a), we constructed a pangenome graph representation based on the known variants in the same way as described in Pangenome construction. We kept track of which variant alleles each resulting bubble consists of, so that genotypes derived for all bubbles can later be converted back to the original variant representation. For the other genotypers tested (GATK, Platypus 0.8.1, GraphTyper 2.7.1 and Giraffe v.1.30.0), we directly used the set of known variants as input, without generating the graph representation first, because we observed that these tools could better handle variants represented in this way. As a result of running all genotypers, we had one VCF file per tool containing genotypes for all our known variants. We used the evaluation variants to evaluate the genotype predictions of all tools. Extended Data Fig. 2 provides an illustration of the leave-one-out experiment.

Note that re-genotyping a set of known variants in a new sample is different from variant detection: variants present in the new sample but absent from the callset samples cannot be genotyped, because re-genotyping is restricted to previously observed variants. We provide the number of unique variants of each panel sample in Supplementary Table 3. Most genotyping metrics (weighted genotyping concordance, adjusted precision/recall), explained in detail in the Supplementary Note, exclude these variants.

Besides re-genotyping our callset variants, we additionally ran GATK and Platypus in discovery mode to detect and genotype their own variants. We evaluated the results by computing precision/recall based on our ground-truth variants (Supplementary Figs. 10 and 11).

Evaluation regions

Some genomic regions are more difficult to genotype than others; SVs, for example, tend to be located in repetitive and more complex regions of the genome. Therefore, we looked at variants located inside and outside of STR/VNTR regions, which we obtained from the UCSC genome browser (Simple Repeats Track for GRCh38)47. In addition, we classified the genome into ‘complex’ and ‘biallelic’ regions based on the bubble structure of our pangenome graph: all variants located inside of complex bubbles, that is, bubbles with more than two branches, fell into the first category, and the remaining regions into the second. Consider Extended Data Fig. 1 for an example: the first and third bubbles are complex, thus all variants contained inside these bubbles fall into the category ‘complex’. The second bubble is biallelic and therefore the corresponding SNP variant is considered ‘biallelic’.

For our ‘leave-one-out’ experiment for sample NA12878, we show the number of variants falling into the different categories in Fig. 3, Extended Data Figs. 3–8 and Supplementary Table 4. It can be observed that most complex bubbles are located inside STR/VNTR regions (Supplementary Table 4). In addition, more than half of all midsize and large variants are located in these repetitive regions.

Genotyping larger cohorts

We randomly selected 100 trios (20 of each superpopulation: AFR, AMR, EAS, EUR, South Asian (SAS)) that are part of the 1000 Genomes Project and genotyped all our variant calls across these 300 samples. We used our pangenome graph representation containing all 11 assembly samples as an input for PanGenie, genotyped all bubbles and later converted the resulting genotypes back to obtain genotypes for the individual callset variants. Our callset might contain variants that are difficult to genotype correctly. To identify a high-quality subset of variants that we could reliably genotype, we defined different filters based on the predicted genotypes that we list below. One metric used for defining filters is the Mendelian consistency. We computed the Mendelian consistency for each variant by counting the number of trios for which the predicted genotypes are consistent with Mendelian laws. We considered only trios with at least two different genotypes, that is, we excluded a trio if all three genotypes were 0/0, 0/1 or 1/1. This resulted in a stricter definition of Mendelian consistency (Supplementary Fig. 14). In addition to genotyping all 300 trio samples, we also genotyped all 11 panel samples using the full input panel. Genotyping samples that are also in the panel helped us to find cases where panel haplotypes and reads disagreed and thus was another useful filter criterion.
We defined filters as follows: (1) ac0-fail: a variant fails this filter if it was genotyped with AF 0.0 across all samples; (2) mendel-fail: a variant fails this filter if the fraction of Mendelian consistent trios was <90% (our definition of Mendelian consistency excludes all trios with all 0/0, all 0/1 or all 1/1 genotypes and only considers such with at least two different genotypes); (3) gq-fail: a variant failed this filter if it was genotyped with a genotype quality <200 in >5 samples; (4) self-fail: in addition to the 100 trios, we also genotyped the 11 panel samples; a variant failed this filter if the genotype concordance across all panel samples was <90%; and (5) non-ref-fail: the variant was genotyped as 0/0 across all panel samples.
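The five filters reduce to simple per-variant threshold checks, sketched below with the thresholds from the text (the summary statistics passed in are hypothetical inputs computed beforehand from the genotyped VCFs):

```python
def variant_filters(af, mendel_frac, low_gq_samples,
                    panel_concordance, panel_all_homref):
    """Return the set of failed filter names for one variant.
    af: allele frequency across all genotyped samples;
    mendel_frac: fraction of Mendelian-consistent trios (strict definition);
    low_gq_samples: number of samples with genotype quality < 200;
    panel_concordance: genotype concordance across the 11 panel samples;
    panel_all_homref: True if all panel samples were genotyped 0/0."""
    failed = set()
    if af == 0.0:
        failed.add("ac0-fail")
    if mendel_frac < 0.9:
        failed.add("mendel-fail")
    if low_gq_samples > 5:
        failed.add("gq-fail")
    if panel_concordance < 0.9:
        failed.add("self-fail")
    if panel_all_homref:
        failed.add("non-ref-fail")
    return failed
```

A variant enters the strict set only when this returns an empty set, that is, when it passes all five filters.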

For all combinations of filters, we show the number of large deletions and large insertions in each category in Supplementary Fig. 15. To define a strict, high-quality set of variants, we selected all that passed all five filters (Supplementary Table 7).

For quality control, we analyzed allele frequencies and the fraction of heterozygous genotypes for all variants contained in our unfiltered and strict sets (Supplementary Figs. 16 and 17). In addition, we used VCFTools74 (v.0.1.16) to test the genotype predictions of all variants typed with an AF > 0.0 for conformance with the HWE and corrected for multiple hypothesis testing by applying the Benjamini–Hochberg correction75 (α = 0.05).
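The Benjamini–Hochberg step-up procedure applied to the HWE p values can be sketched as follows (a plain-Python illustration, not the VCFtools implementation):

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: returns a boolean list
    marking which hypotheses are rejected at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # largest rank k with p_(k) <= k/m * alpha
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            k_max = rank
    # reject the k_max smallest p values
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject
```

Variants whose HWE test is rejected at α = 0.05 after this correction are flagged as deviating from Hardy–Weinberg expectations.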

In addition to defining a strict set, we constructed a more lenient set for our SV calls (≥50 bp) using a machine-learning approach based on support vector regression. We used the strict set as a positive set and defined a negative set consisting of all variants that were typed with an AF > 0.0 and failed at least three filters. For large insertions, the negative set contained 2,611 variants, and for large deletions 1,125. The model then predicted scores between −1 (worst) and 1 (best) for all variants that were in neither the positive nor the negative set. We show the distribution of scores for our variant calls in Supplementary Fig. 18. The lenient set was then constructed by adding all variants with a score >−0.5 to our strict SV set (Supplementary Table 8 and Supplementary Fig. 18).

LD analysis

We performed an LD analysis based on the genotypes we obtained across all 200 unrelated samples. We used gatk4 (ref. 16) to annotate the calls with variant IDs from dbSNP (build 154)76. We selected variants that are contained in the NHGRI-EBI GWAS catalog58 and used plink77 (v.1.90b6.18) to determine SVs that are in LD with the GWAS variants (r2 ≥ 0.8). For comparison with nonhuman primates, the human genomic sequence (GRCh38; chr9:133278657-133279020) corresponding to 50 bp flanking the annotated LTR10B2 VNTR was used to retrieve the corresponding orthologous sequence from primate genomes62 or from PacBio HiFi sequence data from nonhuman primates63. Multiple sequence alignments were constructed using MAFFT and manually inspected for VNTR copy number.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

