fetchExtendedChromInfoFromUCSC {GenomeInfoDb}    R Documentation

Fetching chromosomes info for some of the UCSC genomes

Description

Fetch chromosome information (e.g. names and lengths) for some of the UCSC genomes. Only hg38, hg19, mm10, dm3, and sacCer3 are supported at the moment.

Usage

fetchExtendedChromInfoFromUCSC(genome,
        goldenPath_url="http://hgdownload.cse.ucsc.edu/goldenPath")

Arguments

genome

A single string specifying the UCSC genome e.g. "sacCer3".

goldenPath_url

A single string specifying the URL to the UCSC goldenPath location. This URL is used internally to build the full URL to the 'chromInfo' MySQL dump containing the chromosome information for 'genome'. See the Details section below.

Details

Chromosome information (e.g. names and lengths) for any UCSC genome is stored in the 'chromInfo' table of the UCSC database and is normally available as a MySQL dump at:

  goldenPath_url/<genome>/database/chromInfo.txt.gz

fetchExtendedChromInfoFromUCSC downloads that table, imports it into a data frame, and keeps only the first two columns, which it renames UCSC_seqlevels and UCSC_seqlengths. It then looks up the NCBI assembly report corresponding to that genome (e.g. the GRCh38 assembly for hg38), extracts the seqlevels and GenBank accession numbers from the report, matches them to the UCSC seqlevels (using some heuristic), and adds them to the returned data frame.
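
The steps above can be sketched on toy data as follows. This is only an illustration under stated assumptions, not the function's actual implementation: build_chrominfo_url() is a hypothetical helper, the three-column layout of the dump (chrom, size, fileName) is assumed, the rows and accession strings are placeholders, the GenBankAccn column name is made up for the sketch, and the "drop the chr prefix" rule merely stands in for the unspecified heuristic. No download is performed.

```r
## Hypothetical helper (not part of GenomeInfoDb): build the URL of the
## 'chromInfo' dump from the goldenPath location and the genome name.
build_chrominfo_url <- function(genome,
        goldenPath_url="http://hgdownload.cse.ucsc.edu/goldenPath")
{
    paste0(goldenPath_url, "/", genome, "/database/chromInfo.txt.gz")
}
chrominfo_url <- build_chrominfo_url("sacCer3")
chrominfo_url

## Illustrative stand-in for the downloaded dump (tab-delimited, no header;
## the column layout is an assumption).
dump_lines <- c("chrI\t230218\t/gbdb/sacCer3/sacCer3.2bit",
                "chrII\t813184\t/gbdb/sacCer3/sacCer3.2bit",
                "chrM\t85779\t/gbdb/sacCer3/sacCer3.2bit")
raw_table <- read.table(text=dump_lines, sep="\t",
                        col.names=c("chrom", "size", "fileName"),
                        stringsAsFactors=FALSE)

## Keep only the first two columns and rename them.
ucsc_info <- setNames(raw_table[ , c("chrom", "size")],
                      c("UCSC_seqlevels", "UCSC_seqlengths"))

## Toy stand-in for the NCBI assembly report (hypothetical column names,
## placeholder accession strings).
report <- data.frame(SequenceName=c("I", "II", "MT"),
                     GenBankAccn=c("ACCN_1", "ACCN_2", "ACCN_3"),
                     stringsAsFactors=FALSE)

## Illustrative heuristic only: drop the "chr" prefix and treat "M" as "MT".
guess <- sub("^chr", "", ucsc_info$UCSC_seqlevels)
guess[guess == "M"] <- "MT"
m <- match(guess, report$SequenceName)

## Add the matched NCBI information to the data frame.
ucsc_info$NCBI_seqlevels <- report$SequenceName[m]
ucsc_info$GenBankAccn <- report$GenBankAccn[m]
ucsc_info
```

Using match() keeps the result aligned with the UCSC rows whatever the order of the report, and leaves NA where no counterpart is found.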

Value

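A data frame with one row per seqlevel in the UCSC genome. Its columns include UCSC_seqlevels and UCSC_seqlengths (from the 'chromInfo' table) and NCBI_seqlevels, plus the GenBank accession numbers extracted from the NCBI assembly report.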

Note that the rows are not sorted in any particular order.
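
Because the row order is not guaranteed, it is probably safer to look rows up by name than by position. The sketch below shows one way to do that with match(). The data frame is a toy stand-in built by hand (illustrative values, only three columns), not actual output of fetchExtendedChromInfoFromUCSC.

```r
## Toy stand-in for the returned data frame, deliberately out of order.
chrominfo <- data.frame(
    UCSC_seqlevels=c("chrM", "chrII", "chrI"),
    UCSC_seqlengths=c(85779L, 813184L, 230218L),
    NCBI_seqlevels=c("MT", "II", "I"),
    stringsAsFactors=FALSE
)

## Look up rows by NCBI seqlevel rather than by position.
wanted <- c("I", "II", "MT")
idx <- match(wanted, chrominfo$NCBI_seqlevels)
stopifnot(!anyNA(idx))  # every requested seqlevel must be present

## UCSC seqlevels in the requested order, named by NCBI seqlevel.
ucsc_names <- setNames(chrominfo$UCSC_seqlevels[idx], wanted)
ucsc_names

## The whole data frame reordered to follow 'wanted'.
sorted_chrominfo <- chrominfo[idx, ]
sorted_chrominfo
```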

Note

Only supports the hg38, hg19, mm10, dm3, and sacCer3 genomes at the moment. More will come...

Author(s)

H. Pages

See Also

Examples

## ---------------------------------------------------------------------
## A. BASIC EXAMPLE
## ---------------------------------------------------------------------
chrominfo <- fetchExtendedChromInfoFromUCSC("sacCer3")
chrominfo

## ---------------------------------------------------------------------
## B. USING fetchExtendedChromInfoFromUCSC() TO PUT UCSC SEQLEVELS ON
##    THE GRCh38 GENOME
## ---------------------------------------------------------------------

## Load the BSgenome.Hsapiens.NCBI.GRCh38 package:
library(BSgenome)
genome <- getBSgenome("GRCh38")  # this loads the
                                 # BSgenome.Hsapiens.NCBI.GRCh38 package

## A quick look at the GRCh38 seqlevels:
length(seqlevels(genome))
head(seqlevels(genome), n=30)

## Fetch the extended chromosomes info for the hg38 genome:
hg38_chrominfo <- fetchExtendedChromInfoFromUCSC("hg38")
dim(hg38_chrominfo)
head(hg38_chrominfo, n=30)

## 2 sanity checks:
##   1. Check the NCBI seqlevels:
stopifnot(setequal(hg38_chrominfo$NCBI_seqlevels, seqlevels(genome)))
##   2. Check that the sequence lengths in 'hg38_chrominfo' (which are
##      coming from the same 'chromInfo' table as the UCSC seqlevels)
##      are the same as in 'genome':
stopifnot(
  identical(hg38_chrominfo$UCSC_seqlengths,
            unname(seqlengths(genome)[hg38_chrominfo$NCBI_seqlevels]))
)

## Extract the hg38 seqlevels and put the GRCh38 seqlevels on them as
## the names:
hg38_seqlevels <- setNames(hg38_chrominfo$UCSC_seqlevels,
                           hg38_chrominfo$NCBI_seqlevels)

## Set the hg38 seqlevels on 'genome':
seqlevels(genome) <- hg38_seqlevels[seqlevels(genome)]
head(seqlevels(genome), n=30)

[Package GenomeInfoDb version 1.0.2 Index]