Resolving the complexity of the human genome using single-molecule sequencing

Sequence data

Data set	SRA accession
CHM1 PacBio whole-genome sequence (54X)	SRX533609
CHM1 Illumina whole-genome sequence (40X)	SRP044331
Clone accessions	See Supplemental Table 34 ("Data Accession")

Sequence analyses

All analysis resources from Chaisson et al. 2014 are organized below by the reference genome against which the analyses were performed.

GRCh37

Structural variation calls

All structural variants ≥50 bp called in CHM1 against GRCh37 are available below in BED format.

Insertions (15373)
- ALR (622)
- Alu Mosaic (175)
- AluS simple (108)
- Alu STR (112)
- AluY simple (859)
- Beta (3)
- Complex (1115)
- HERV simple (58)
- HSAT (46)
- L1 (107)
- L1HS simple (145)
- L1P (128)
- MER (117)
- Unannotated (2386)
- Singletons (168)
- STR (6007)
- SVA simple (457)
- TRF (2760)
Deletions (11311)
- ALR (769)
- Alu Mosaic (122)
- AluS simple (186)
- Alu STR (64)
- AluY simple (859)
- Beta (0)
- Complex (317)
- HERV simple (60)
- HSAT (48)
- L1 (141)
- L1HS simple (141)
- L1P (129)
- MER (120)
- Unannotated (2312)
- Singletons (277)
- STR (2986)
- SVA simple (382)
- TRF (2398)
Inversions (33)
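
As a quick consistency check, the per-class counts listed above can be summed and compared with the stated totals. The sketch below hard-codes the counts from this page; the dictionary and function names are illustrative and are not part of the released resources.

```python
# Per-class counts copied from the lists above (names are illustrative only).
INSERTIONS = {
    "ALR": 622, "Alu Mosaic": 175, "AluS simple": 108, "Alu STR": 112,
    "AluY simple": 859, "Beta": 3, "Complex": 1115, "HERV simple": 58,
    "HSAT": 46, "L1": 107, "L1HS simple": 145, "L1P": 128, "MER": 117,
    "Unannotated": 2386, "Singletons": 168, "STR": 6007,
    "SVA simple": 457, "TRF": 2760,
}
DELETIONS = {
    "ALR": 769, "Alu Mosaic": 122, "AluS simple": 186, "Alu STR": 64,
    "AluY simple": 859, "Beta": 0, "Complex": 317, "HERV simple": 60,
    "HSAT": 48, "L1": 141, "L1HS simple": 141, "L1P": 129, "MER": 120,
    "Unannotated": 2312, "Singletons": 277, "STR": 2986,
    "SVA simple": 382, "TRF": 2398,
}
STATED_TOTALS = {"Insertions": 15373, "Deletions": 11311, "Inversions": 33}


def matches_stated_total(counts, stated):
    """Return True when the per-class counts add up to the stated total."""
    return sum(counts.values()) == stated


insertions_ok = matches_stated_total(INSERTIONS, STATED_TOTALS["Insertions"])
deletions_ok = matches_stated_total(DELETIONS, STATED_TOTALS["Deletions"])
all_calls = sum(STATED_TOTALS.values())  # insertions + deletions + inversions
print(insertions_ok, deletions_ok, all_calls)
```

Both per-class lists add up to their stated totals.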

STR expansions

CHM1 contains more than 10,000 expansions of short tandem repeats (STRs) that are not present in GRCh37. The insertion coordinates and sequences for these STRs are available below in BED and FASTA formats, respectively.

Insertion coordinates (BED)
Insertion sequences (FASTA)
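
A minimal sketch of how the insertion sequences might be summarized once downloaded. It assumes only the standard FASTA layout (a ">" header line followed by sequence lines); the record names in the toy input are hypothetical and do not reflect the naming used in the released file.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token of the header.
            name = (line[1:].split() or [""])[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def insertion_lengths(handle):
    """Map each record name to the length of its inserted sequence in bp."""
    return {name: len(seq) for name, seq in read_fasta(handle)}


# Toy input with hypothetical record names.
example = StringIO(">str_1\nCACACACA\nCACA\n>str_2\nGGC\n")
lengths = insertion_lengths(example)
print(lengths)
```

For the real file, the `StringIO` object would be replaced with `open(path)` on the downloaded FASTA.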

Gap closures

Gap closures and extensions were assembled from PacBio reads that map adjacent to annotated gaps in GRCh37. As a result, the complete closure and extension assemblies include gap-flanking sequence from the reference in addition to the novel sequence inside the gaps.

We provide coordinates and sequences for both the complete assemblies and the novel sequences.

Region type	Complete assemblies	Novel sequence only
Closures	Coordinates (BED) Sequences (FASTA)	Coordinates (BED) Sequences (FASTA)
Extensions	Coordinates (BED) Sequences (FASTA)	Coordinates (BED) Sequences (FASTA)
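
The relationship between a complete assembly and its novel sequence can be illustrated with a short sketch. It assumes, hypothetically, that the novel interval is given in the assembly's own coordinates using the BED convention (0-based start, exclusive end); the coordinate system of the released files should be checked before applying anything like this.

```python
def split_assembly(assembly, novel_start, novel_end):
    """Split a complete assembly into (left flank, novel sequence, right flank).

    Coordinates are assumed to be 0-based, half-open, and relative to the
    assembly itself, which is an assumption made for illustration.
    """
    if not 0 <= novel_start <= novel_end <= len(assembly):
        raise ValueError("novel interval falls outside the assembly")
    return (
        assembly[:novel_start],
        assembly[novel_start:novel_end],
        assembly[novel_end:],
    )


# Toy assembly: reference flanks in upper case, novel sequence in lower case.
toy_assembly = "ACGT" + "ttttt" + "GGCC"
left, novel, right = split_assembly(toy_assembly, 4, 9)
print(left, novel, right)
```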

Inaccessible regions

Inaccessible regions are regions of GRCh37 that have no deletion calls and fewer than six high-quality alignments. These regions indicate potential misassemblies in the reference.

Inaccessible regions (BED)
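
The definition above can be sketched as a depth scan followed by an interval subtraction. This is an illustration under stated assumptions, not the pipeline used to produce the BED file: alignments are assumed to be pre-filtered to high quality, intervals are 0-based and half-open, and "without deletion calls" is interpreted here as removing deletion intervals from the low-depth regions.

```python
def low_depth_regions(length, alignments, min_depth=6):
    """Return intervals of [0, length) covered by fewer than min_depth alignments.

    Alignments are (start, end) pairs assumed to lie within [0, length].
    """
    delta = {}
    for start, end in alignments:
        delta[start] = delta.get(start, 0) + 1
        delta[end] = delta.get(end, 0) - 1
    regions, depth, prev = [], 0, 0
    for pos in sorted(set(delta) | {length}):
        if pos > prev and depth < min_depth:
            if regions and regions[-1][1] == prev:
                regions[-1] = (regions[-1][0], pos)  # extend adjacent region
            else:
                regions.append((prev, pos))
        depth += delta.get(pos, 0)
        prev = pos
    return regions


def subtract_intervals(regions, deletions):
    """Remove deletion-call intervals from the given regions."""
    result = []
    for start, end in regions:
        cursor = start
        for d_start, d_end in sorted(deletions):
            if d_end <= cursor or d_start >= end:
                continue
            if d_start > cursor:
                result.append((cursor, d_start))
            cursor = max(cursor, d_end)
        if cursor < end:
            result.append((cursor, end))
    return result


# Toy chromosome of 100 bp with a poorly covered stretch in the middle.
toy_alignments = [(0, 40)] * 6 + [(60, 100)] * 6 + [(30, 70)] * 2
low = low_depth_regions(100, toy_alignments)
inaccessible = subtract_intervals(low, [(45, 50)])
print(low, inaccessible)
```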

GRCh37 patched

To enable analysis of novel sequences from CHM1 in the context of the whole genome, we created a patched version of GRCh37 that incorporates all STR expansions >1 kbp and all gap closures and extensions.

The following resources include the complete patched reference, coordinates of inserted novel sequences, and a UCSC Genome Browser track hub to facilitate exploration of these new sequences.

Complete reference sequence (2.9 GB FASTA)
UCSC-style chromosome lengths (TAB-delimited)
Coordinates of inserted STRs and gaps
- STR expansions (BED)
- Gap closures and extensions (BED)
UCSC track hub with gap, STR, and gene annotations
- View the GRCh37 patched genome on UCSC's public browser
- View the raw track hub entry point
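
Patching a reference shifts every downstream coordinate, which is why separate coordinate and chromosome-length files accompany the patched sequence. The sketch below illustrates the bookkeeping on a toy string; it is not the procedure used to build the released reference. Each patch is a hypothetical `(start, end, new_seq)` tuple that replaces `reference[start:end]`, so a pure insertion has `start == end` and a gap closure replaces the gap's `N` bases.

```python
def patch_sequence(reference, patches):
    """Apply non-overlapping patches and report where the new sequence lands.

    Returns the patched sequence and the 0-based, half-open intervals of the
    inserted sequences in patched coordinates.
    """
    pieces, intervals = [], []
    cursor, offset = 0, 0
    for start, end, new_seq in sorted(patches):
        if start < cursor or end < start:
            raise ValueError("patches must be valid and non-overlapping")
        pieces.append(reference[cursor:start])
        new_start = start + offset
        pieces.append(new_seq)
        intervals.append((new_start, new_start + len(new_seq)))
        # Downstream coordinates shift by the net change in length.
        offset += len(new_seq) - (end - start)
        cursor = end
    pieces.append(reference[cursor:])
    return "".join(pieces), intervals


# Toy reference with a 4 bp gap; one insertion and one gap closure.
toy_reference = "AAAANNNNCCCC"
patched, inserted = patch_sequence(
    toy_reference, [(4, 8, "ttt"), (2, 2, "gg")]
)
print(patched, inserted, len(patched))
```

The length of the patched sequence is what a UCSC-style chromosome lengths file would record for the patched chromosome.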

GRCh38

In addition to the complete analysis of GRCh37, we also closed gaps in GRCh38 and analyzed heterochromatic sequences in the CHM1 PacBio reads using GRCh38's updated centromeric and acrocentric sequences.

Gap closures

Complete assemblies
- Coordinates (BED)
- Sequences (FASTA)
Novel sequence only
- Coordinates (BED)
- Sequences (FASTA)

Heterochromatic sequence analysis

We identified heterochromatic sequence in the CHM1 PacBio reads using three approaches. First, we mapped the PacBio reads to GRCh38 and identified all reads mapping to regions annotated as satellites by RepeatMasker, including the new centromeric models. Second, we ran RepeatMasker on all reads that did not map to GRCh38 and identified all reads that were annotated as satellites. Finally, we identified the longest single read extending into telomeres and centromeres in GRCh37, where alignments are not influenced by GRCh38's centromeric models.

As a resource for further exploration of heterochromatic regions of the human genome, we provide the GRCh38 coordinates of all mapped reads identified as satellites, grouped by satellite class, and the read names of all unmapped reads containing satellites.

Reads mapped to GRCh38 by satellite class
- Centromeric (BED)
- Telomeric (BED)
- Acrocentric (BED)
- Other satellites (BED)
Unmapped reads with satellites
- Centromeric (text)
- Telomeric (text)
- Acrocentric (text)
- Other satellites (text)
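
The first approach, assigning mapped reads to a satellite class by overlap with annotated regions, can be sketched as follows. The tuple layouts, the class labels in the toy data, and the "any overlap counts" rule are assumptions for illustration; the criteria used in the published analysis may differ.

```python
from collections import defaultdict


def classify_reads(alignments, satellites):
    """Group read names by the class of any satellite annotation they overlap.

    alignments: (chrom, start, end, read_name) tuples
    satellites: (chrom, start, end, satellite_class) tuples
    Intervals are 0-based and half-open, as in BED.
    """
    by_class = defaultdict(set)
    for chrom, start, end, read in alignments:
        for s_chrom, s_start, s_end, s_class in satellites:
            if chrom == s_chrom and start < s_end and s_start < end:
                by_class[s_class].add(read)
    return {s_class: sorted(reads) for s_class, reads in by_class.items()}


# Toy annotations and alignments with hypothetical names.
toy_satellites = [
    ("chr1", 100, 200, "Centromeric"),
    ("chr1", 0, 10, "Telomeric"),
]
toy_reads = [
    ("chr1", 150, 400, "read_a"),
    ("chr1", 5, 50, "read_b"),
    ("chr2", 150, 180, "read_c"),  # different chromosome, no overlap
]
classified = classify_reads(toy_reads, toy_satellites)
print(classified)
```

The nested loop is quadratic and only suitable for small inputs; a genome-scale run would typically use sorted intervals or an interval tree.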

We also provide the sequences of the longest reads extending into telomeres and centromeres of GRCh37, along with dotplots of the longest reads aligned against themselves to visualize higher-order repeats in the reads.

Dotplots of read extensions (PDF)
Telomeric extensions
- chr1p
- chr2p
- chr3p
- chr4p
- chr5p
- chr7p
- chr8p
- chr9p
- chr10p
- chr11p
- chr12p
- chr16p
- chr18p
- chr19p
- chr20p
- chr1q
- chr2q
- chr3q
- chr4q
- chr5q
- chr6q
- chr7q
- chr8q
- chr9q
- chr10q
- chr11q
- chr12q
- chr13q
- chr14q
- chr15q
- chr17q
- chr18q
- chr19q
- chr21q
- chrXq
Centromeric extensions
- chr1p
- chr2p
- chr3p
- chr4p
- chr5p
- chr6p
- chr7p
- chr9p
- chr10p
- chr11p
- chr12p
- chr16p
- chr18p
- chr19p
- chr20p
- chr21p
- chr1q
- chr2q
- chr3q
- chr4q
- chr5q
- chr6q
- chr9q
- chr10q
- chr11q
- chr12q
- chr13q
- chr14q
- chr15q
- chr17q
- chr18q
- chr19q
- chr20q
- chr21q
- chr21q
- chrXq
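
To illustrate why a self-alignment dotplot reveals repeat structure, the sketch below computes the points of a simple dotplot from exact k-mer matches. It is a simplified stand-in, not the method used to produce the PDFs: real PacBio reads contain sequencing errors, so exact matching with a fixed `k` (an illustrative parameter) would miss many true matches.

```python
from collections import defaultdict


def self_dotplot(seq, k=4):
    """Return sorted (i, j) pairs where the k-mers starting at i and j are identical."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return sorted(
        (i, j) for hits in positions.values() for i in hits for j in hits
    )


# Toy tandem repeat: three copies of a 4 bp unit.
points = self_dotplot("ACGT" * 3, k=4)

# Off-diagonal lines appear at offsets that are multiples of the repeat unit.
offsets = sorted({j - i for i, j in points if j > i})
print(offsets)
```

Plotting the `(i, j)` pairs as a scatter plot gives the main diagonal plus parallel lines whose spacing corresponds to the repeat period.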
