Mark J.P. Chaisson1, John Huddleston1,2, Megan Y. Dennis1, Peter H. Sudmant1, Maika Malig1, Fereydoun Hormozdiari1, Francesca Antonacci3, Urvashi Surti4, Richard Sandstrom1, Matthew Boitano5, Jane M. Landolin5, John A. Stamatoyannopoulos1, Michael W. Hunkapiller5, Jonas Korlach5, and Evan E. Eichler1,2
All sequence data including whole-genome sequence (WGS) and clones are available through NCBI's Sequence Read Archive (SRA) and GenBank.
All analysis resources from Chaisson et al. 2014 are organized below by the reference genome against which the analyses were performed.
All structural variants ≥50 bp called in CHM1 against GRCh37 are available below in BED format.
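As a rough illustration of working with such a file, the sketch below reads BED intervals and keeps those spanning at least 50 bp of the reference. It assumes only the three standard BED columns (chromosome, 0-based start, exclusive end); the names `BedInterval`, `parse_bed`, and `spans_at_least` are hypothetical, and any further columns in the released files are kept verbatim rather than interpreted. For insertions the reference span may be shorter than the event itself, so the event length may need to be read from another column.

```python
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class BedInterval(NamedTuple):
    chrom: str
    start: int                # 0-based, inclusive
    end: int                  # 0-based, exclusive
    extra: Tuple[str, ...]    # any remaining columns, kept verbatim


def parse_bed(lines: Iterable[str]) -> Iterator[BedInterval]:
    """Yield intervals from BED text, skipping blank, comment, and header lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        yield BedInterval(fields[0], int(fields[1]), int(fields[2]), tuple(fields[3:]))


def spans_at_least(intervals: Iterable[BedInterval], min_bp: int = 50) -> List[BedInterval]:
    """Keep intervals whose reference span (end - start) is at least min_bp."""
    return [iv for iv in intervals if iv.end - iv.start >= min_bp]
```

With a real file, `parse_bed(open(path))` would stream the intervals; the path and any column layout beyond the first three fields depend on the downloaded resource.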
CHM1 contains more than 10,000 expansions of short-tandem repeats (STRs) that are not present in GRCh37. The insertion coordinates and sequences for these STRs are available below in BED and FASTA formats, respectively.
Gap closures and extensions were assembled with PacBio reads mapping adjacent to annotated gaps in GRCh37. For this reason, the complete closure and extension assemblies include gap-flanking sequence from the reference in addition to the novel sequences inside the gaps.
We provide coordinates and sequences for both the complete assemblies and the novel sequences.
Region type | Complete assemblies | Novel sequence only |
---|---|---|
Closures | | |
Extensions | | |
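The relationship between a complete assembly and its novel sequence can be sketched as a simple trim of the reference-derived flanks. This is only an illustration: the function name `novel_sequence` is hypothetical, and it assumes the flank lengths are already known (for example, from the provided coordinates). An extension would have a flank on one side only, so the other flank length is zero.

```python
def novel_sequence(assembly: str, left_flank: int, right_flank: int) -> str:
    """Return the part of an assembly that lies between its reference flanks.

    left_flank and right_flank are the numbers of bases at each end of the
    assembly that come from gap-flanking reference sequence (assumed known).
    """
    if left_flank < 0 or right_flank < 0:
        raise ValueError("flank lengths must be non-negative")
    if left_flank + right_flank > len(assembly):
        raise ValueError("flanks are longer than the assembly")
    # Use an explicit end index so that right_flank == 0 keeps the full tail.
    return assembly[left_flank:len(assembly) - right_flank]
```

For a closure both flank lengths are positive; for an extension one of them is `0`.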
Inaccessible regions are regions of GRCh37 that have no deletion calls and fewer than six high-quality alignments. These regions indicate potential reference misassemblies.
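The definition above can be expressed as a small filter. The sketch below is not the pipeline used for the published calls; it assumes the genome has already been divided into windows with a precomputed count of high-quality alignments, and the names `overlaps` and `inaccessible_windows` are hypothetical. Windows overlapping a deletion call are excluded, on the assumption that a deletion in CHM1 already explains the low alignment count there.

```python
from typing import List, Sequence, Tuple

Window = Tuple[str, int, int, int]   # chrom, start, end, high-quality alignment count
Interval = Tuple[str, int, int]      # chrom, start, end


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def inaccessible_windows(
    windows: Sequence[Window],
    deletions: Sequence[Interval],
    min_alignments: int = 6,
) -> List[Interval]:
    """Flag windows with fewer than min_alignments alignments and no deletion call."""
    flagged = []
    for chrom, start, end, n_aln in windows:
        has_deletion = any(
            d_chrom == chrom and overlaps(start, end, d_start, d_end)
            for d_chrom, d_start, d_end in deletions
        )
        if n_aln < min_alignments and not has_deletion:
            flagged.append((chrom, start, end))
    return flagged
```

The linear scan over deletions is fine for a sketch; a genome-scale run would more likely use an interval index or an existing interval tool.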
To enable analysis of novel sequences from CHM1 in the context of the whole genome, we created a patched version of GRCh37 with all STR expansions >1 kbp and all gap closures and extensions.
The following resource includes the complete patched reference, coordinates of inserted novel sequences, and a UCSC Genome Browser track hub to facilitate exploration of these new sequences.
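Because patching changes the length of each chromosome, coordinates downstream of an inserted sequence shift relative to GRCh37. The sketch below shows one way such a patched sequence and the coordinates of its inserted segments could be derived together. It is an illustration under stated assumptions, not the procedure used to build the released reference: `apply_patches` is a hypothetical name, and each patch is modeled as replacing a half-open reference interval (a gap for closures, an empty interval for pure insertions) with new sequence.

```python
from typing import List, Sequence, Tuple

Patch = Tuple[int, int, str]  # reference start, reference end (exclusive), new sequence


def apply_patches(reference: str, patches: Sequence[Patch]) -> Tuple[str, List[Tuple[int, int]]]:
    """Replace reference intervals with new sequence.

    Returns the patched sequence and, for each patch in coordinate order,
    the (start, end) of its new sequence in patched coordinates.
    """
    pieces = []
    inserted = []
    cursor = 0    # next unconsumed position in the original reference
    out_len = 0   # length of the patched sequence built so far
    for start, end, new_seq in sorted(patches):
        if start < cursor or end < start:
            raise ValueError("patches must be well-formed and non-overlapping")
        pieces.append(reference[cursor:start])   # unchanged reference sequence
        out_len += start - cursor
        pieces.append(new_seq)                   # novel sequence
        inserted.append((out_len, out_len + len(new_seq)))
        out_len += len(new_seq)
        cursor = end                             # skip the replaced interval
    pieces.append(reference[cursor:])
    return "".join(pieces), inserted
```

For example, closing a four-base gap in `"AAAANNNNCCCC"` with a six-base sequence moves everything after the gap two bases to the right in the patched coordinates.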
In addition to the complete analysis of GRCh37, we also closed gaps in GRCh38 and analyzed heterochromatic sequences in the CHM1 PacBio reads using GRCh38's updated centromeric and acrocentric sequences.
We identified heterochromatic sequence in the CHM1 PacBio reads with three approaches. First, we mapped the PacBio reads to GRCh38 and identified all reads mapping to regions annotated as satellites by RepeatMasker, including the new centromeric models. Second, we ran RepeatMasker on all reads that did not map to GRCh38 and identified those annotated as satellites. Finally, we identified the longest single read extending into each telomere and centromere in GRCh37, where alignments are not influenced by GRCh38's centromeric models.
We provide the GRCh38 coordinates of all mapped reads identified as satellites, classified by satellite class, and the read names of all unmapped reads containing satellites, as a resource for further exploration of heterochromatic regions of the human genome.
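The first approach, grouping mapped reads by the satellite class they overlap, can be sketched as an interval-overlap join. This is a minimal illustration, not the original analysis code: `classify_reads` is a hypothetical name, the inputs are assumed to be pre-parsed tuples, and the class labels in the usage note are examples only.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Alignment = Tuple[str, str, int, int]   # read name, chrom, start, end
Satellite = Tuple[str, int, int, str]   # chrom, start, end, satellite class


def classify_reads(
    alignments: Sequence[Alignment],
    satellites: Sequence[Satellite],
) -> Dict[str, List[Alignment]]:
    """Group alignments by each satellite class they overlap.

    An alignment is listed once per class, even if it overlaps several
    annotations of that class.
    """
    by_class = defaultdict(list)
    for aln in alignments:
        _, chrom, start, end = aln
        classes_hit = {
            s_class
            for s_chrom, s_start, s_end, s_class in satellites
            if s_chrom == chrom and start < s_end and s_start < end
        }
        for s_class in sorted(classes_hit):
            by_class[s_class].append(aln)
    return dict(by_class)
```

Given a read aligned at `chr1:0-1000` and an annotation at `chr1:900-1200` labeled `"ALR/Alpha"`, the read would be reported under that class; reads overlapping no annotation are omitted from the result.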
We also provide the sequences of the longest reads extending into telomeres and centromeres of GRCh37 and dotplots of the longest reads aligned against themselves for visualization of higher order repeats in the reads.
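For readers who want to reproduce this kind of visualization on their own reads, a self-dotplot can be approximated by listing pairs of positions that share an identical k-mer. The sketch below is a simplified stand-in for a dotplot tool, not the software used to generate the provided plots; `self_dotplot` and the default `k` are hypothetical choices. In a tandem or higher-order repeat, the matching pairs fall on off-diagonal lines whose offsets are multiples of the repeat period.

```python
from collections import defaultdict
from typing import List, Tuple


def self_dotplot(seq: str, k: int = 8) -> List[Tuple[int, int]]:
    """Return sorted (i, j) pairs, i < j, where the k-mers starting at i and j match."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)   # positions are appended in increasing order

    dots = []
    for hits in positions.values():
        for a in range(len(hits)):
            for b in range(a + 1, len(hits)):
                dots.append((hits[a], hits[b]))
    return sorted(dots)
```

The resulting pairs can be passed to any scatter-plot routine. The number of pairs grows quadratically with the copy number of each k-mer, so very long satellite reads may need a larger `k` or subsampling.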