Resolving the complexity of the human genome using single-molecule sequencing

Mark J.P. Chaisson1, John Huddleston1,2, Megan Y. Dennis1, Peter H. Sudmant1, Maika Malig1, Fereydoun Hormozdiari1, Francesca Antonacci3, Urvashi Surti4, Richard Sandstrom1, Matthew Boitano5, Jane M. Landolin5, John A. Stamatoyannopoulos1, Michael W. Hunkapiller5, Jonas Korlach5, and Evan E. Eichler1,2

  1. Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA
  2. Howard Hughes Medical Institute, University of Washington, Seattle, WA 98195, USA
  3. Dipartimento di Biologia, Università degli Studi di Bari “Aldo Moro”, Bari 70125, Italy
  4. Department of Pathology, University of Pittsburgh, Pittsburgh, PA 15261, USA
  5. Pacific Biosciences of California, Inc., Menlo Park, CA 94025, USA

Sequence data

All sequence data, including whole-genome sequence (WGS) and clone sequences, are available through NCBI's Sequence Read Archive (SRA) and GenBank.

Data set                                     SRA accession
CHM1 PacBio whole-genome sequence (54X)      SRX533609
CHM1 Illumina whole-genome sequence (40X)    SRP044331
Clone accessions                             See Supplemental Table 34 ("Data Accession")

Sequence analyses

All analysis resources from Chaisson et al. 2014 are organized below by the reference genome on which they were performed.

GRCh37

Structural variation calls

All structural variants ≥50 bp called in CHM1 against GRCh37 are available below in BED format.
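As a minimal sketch of how such a BED file might be read, the following Python parses the three standard BED columns and tallies calls per chromosome. The function names are hypothetical, and the layout of any columns beyond the first three is an assumption, since it depends on the specific file. Note that for insertions the reference span may not reflect the inserted length, so the span is reported only as a reference footprint.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple, Tuple


class BedRecord(NamedTuple):
    chrom: str
    start: int                # 0-based, inclusive (BED convention)
    end: int                  # 0-based, exclusive (BED convention)
    extra: Tuple[str, ...]    # remaining columns; meaning is file-specific


def parse_bed(lines: Iterable[str]) -> Iterator[BedRecord]:
    """Yield one record per data line, skipping comments and track headers."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith(("#", "track", "browser")):
            continue
        cols = line.split("\t")
        yield BedRecord(cols[0], int(cols[1]), int(cols[2]), tuple(cols[3:]))


def count_by_chrom(records: Iterable[BedRecord]) -> dict:
    """Number of calls on each chromosome."""
    return dict(Counter(r.chrom for r in records))


def reference_span(record: BedRecord) -> int:
    """Length of the reference interval covered by a call."""
    return record.end - record.start
```

With a real file, the records could be obtained by passing an open file handle to `parse_bed`.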

STR expansions

CHM1 contains more than 10,000 expansions of short tandem repeats (STRs) that are not present in GRCh37. The insertion coordinates and sequences for these STRs are available below in BED and FASTA formats, respectively.
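The sketch below shows one way to read STR insertion sequences from a FASTA file and select those above a length threshold, for example the expansions longer than 1 kbp. The function names are hypothetical, and the assumption that each FASTA record holds exactly one inserted sequence is ours, not something stated by the resource.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs; the name is the first word of the header."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            words = line[1:].split()
            name, chunks = (words[0] if words else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def longer_than(records: Iterable[Tuple[str, str]], min_len: int = 1000) -> Dict[str, int]:
    """Map record name to sequence length for sequences strictly longer than min_len."""
    return {name: len(seq) for name, seq in records if len(seq) > min_len}
```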

Gap closures

Gap closures and extensions were assembled from PacBio reads that map adjacent to annotated gaps in GRCh37. As a result, the complete closure and extension assemblies include gap-flanking sequence from the reference in addition to the novel sequence inside the gaps.

We provide coordinates and sequences for both the complete assemblies and the novel sequences.

Region type Complete assemblies Novel sequence only
Closures
Extensions

Inaccessible regions

Inaccessible regions are regions of GRCh37 that have no deletion calls and are covered by fewer than six high-quality alignments. Such regions indicate potential reference misassemblies.
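The definition above can be illustrated with a small coverage sweep. This is only a sketch under stated assumptions, not the pipeline used for the published calls: alignments and deletion calls are given as half-open (start, end) intervals on a single chromosome, "high-quality" filtering has already been applied upstream, and a low-coverage region is discarded entirely if it overlaps any deletion call. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # half-open: [start, end)


def _append(regions: List[Interval], start: int, end: int) -> None:
    """Add an interval, merging it with the previous one if they touch."""
    if regions and regions[-1][1] == start:
        regions[-1] = (regions[-1][0], end)
    else:
        regions.append((start, end))


def low_depth_intervals(chrom_len: int, alignments: Sequence[Interval],
                        min_depth: int = 6) -> List[Interval]:
    """Intervals of the chromosome covered by fewer than min_depth alignments."""
    events = {}
    for s, e in alignments:
        s, e = max(0, s), min(chrom_len, e)
        if s >= e:
            continue
        events[s] = events.get(s, 0) + 1   # an alignment starts here
        events[e] = events.get(e, 0) - 1   # an alignment ends here
    regions: List[Interval] = []
    depth, prev = 0, 0
    for pos in sorted(events):
        if pos > prev and depth < min_depth:
            _append(regions, prev, pos)
        depth += events[pos]
        prev = pos
    if prev < chrom_len and depth < min_depth:
        _append(regions, prev, chrom_len)
    return regions


def drop_overlapping(regions: Sequence[Interval],
                     deletions: Sequence[Interval]) -> List[Interval]:
    """Keep only regions that overlap no deletion call."""
    return [r for r in regions
            if not any(r[0] < de and ds < r[1] for ds, de in deletions)]
```

Dropping a whole region when it overlaps a deletion is one reading of "without deletion calls"; subtracting only the deleted bases would be an equally plausible alternative.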

GRCh37 patched

To enable analysis of novel sequences from CHM1 in the context of the whole genome, we created a patched version of GRCh37 with all STR expansions >1 kbp and all gap closures and extensions.

The following resource includes the complete patched reference, coordinates of inserted novel sequences, and a UCSC Genome Browser track hub to facilitate exploration of these new sequences.
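To illustrate how patching shifts coordinates, the sketch below applies a set of non-overlapping patches to a reference sequence and reports where each inserted sequence lands in the patched result. It is a simplified model, not the procedure used to build the released reference: each patch is assumed to be a hypothetical (start, end, sequence) tuple that replaces the half-open reference interval, so a pure insertion has start equal to end and a gap closure replaces the gap interval.

```python
from typing import List, Sequence, Tuple

Patch = Tuple[int, int, str]  # (ref_start, ref_end, replacement), half-open


def apply_patches(ref: str, patches: Sequence[Patch]) -> Tuple[str, List[Tuple[int, int]]]:
    """Return the patched sequence and the new coordinates of each inserted sequence."""
    pieces: List[str] = []
    inserted: List[Tuple[int, int]] = []
    cursor = 0    # next unprocessed position in the original reference
    new_pos = 0   # current length of the patched sequence
    for start, end, seq in sorted(patches):
        if start < cursor or end < start:
            raise ValueError("patches must be valid and non-overlapping")
        pieces.append(ref[cursor:start])   # unchanged reference before the patch
        new_pos += start - cursor
        pieces.append(seq)
        inserted.append((new_pos, new_pos + len(seq)))
        new_pos += len(seq)
        cursor = end                       # skip the replaced reference bases
    pieces.append(ref[cursor:])
    return "".join(pieces), inserted
```

For example, replacing a four-base gap with six bases shifts every downstream coordinate by two, which is why the inserted coordinates are reported in the patched frame rather than the original one.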

GRCh38

In addition to the complete analysis of GRCh37, we also closed gaps in GRCh38 and analyzed heterochromatic sequences in the CHM1 PacBio reads using GRCh38's updated centromeric and acrocentric sequences.

Gap closures

Heterochromatic sequence analysis

We identified heterochromatic sequence in the CHM1 PacBio reads with three approaches. First, we mapped the PacBio reads to GRCh38 and identified all reads mapping to regions annotated as satellites by RepeatMasker, including the new centromeric models. Second, we ran RepeatMasker on all reads that did not map to GRCh38 and identified all reads that were annotated as satellites. Finally, we identified the longest single read extending into telomeres and centromeres in GRCh37, where alignments are not influenced by GRCh38's centromeric models.

As a resource for further exploration of heterochromatic regions of the human genome, we provide the GRCh38 coordinates of all mapped reads identified as satellites, classified by satellite class, and the read names of all unmapped reads containing satellites.
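The first approach, assigning mapped reads to satellite classes, amounts to an interval overlap between read alignments and satellite annotations. The sketch below shows that idea only; the tuple layouts, the class labels in the usage note, and the function name are assumptions made for illustration and do not describe the format of the released files.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Alignment = Tuple[str, str, int, int]   # (read_name, chrom, start, end), half-open
Satellite = Tuple[str, int, int, str]   # (chrom, start, end, satellite_class)


def satellite_classes_by_read(alignments: Iterable[Alignment],
                              satellites: Iterable[Satellite]) -> Dict[str, Set[str]]:
    """Map each read to the set of satellite classes its alignment overlaps."""
    by_chrom = defaultdict(list)
    for chrom, start, end, sat_class in satellites:
        by_chrom[chrom].append((start, end, sat_class))
    hits: Dict[str, Set[str]] = defaultdict(set)
    for read, chrom, start, end in alignments:
        for s, e, sat_class in by_chrom.get(chrom, ()):
            if start < e and s < end:   # half-open intervals overlap
                hits[read].add(sat_class)
    return dict(hits)
```

This pairwise comparison is quadratic per chromosome; for genome-scale inputs a sorted sweep or an interval tree would be a more practical choice.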

We also provide the sequences of the longest reads extending into telomeres and centromeres of GRCh37, along with dotplots of the longest reads aligned against themselves for visualization of higher-order repeats in the reads.