RiboSeeker is an R package for ribosome profiling data analysis. It has two components:
Here, we focus on using the RiboSeeker R package for Ribo-seq data analysis. The example data used are from a human K562 ribosome profiling dataset (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3566410). We run through the Ribomake workflow using hg38 human genome and Ensembl 98 annotation.
First install and load RiboSeeker and several other packages to run the example data.
library(GenomicAlignments) # deals with bam alignments
library(GenomicFeatures) # deals with genome annotations
library(GenomicRanges) # deals with general genomic data
library(ggplot2) # deals with visualizations
library(RiboSeeker)
Then specify the example data files. We only take the reads aligned to chromosome 1 to minimize the running time and storage space. For genome annotation, we also only take chromosome 1 annotations.
# example data folder
example_folder = system.file('extdata', package='RiboSeeker')
# bam file
bam_file = file.path(example_folder, 'GSM3566410.chr1.sorted.bam')
# genome annotation txdb file
txdb_file = file.path(example_folder, 'hg38.Ens_98.chr1.txdb.sqlite')
# # alternatively, you can use bioconductor human annotation package
# if (!requireNamespace('BiocManager', quietly=TRUE)) {
# install.packages("BiocManager")
# }
# BiocManager::install('TxDb.Hsapiens.UCSC.hg38.knownGene')
# library(TxDb.Hsapiens.UCSC.hg38.knownGene)
Next load BAM and TxDb files. We use loadBam
function to load an indexed bam file and return a GAlignments
object of aligned reads. The loadBam
function is essentially a wrapper of readGAlignments
function in GenomicAlignments
package. Note that Ribo-seq data are typically single-end, the paired RNA-seq data can be paired-end. Since the single- or paired-end information is not very important in Ribo-seq data analysis, we simply treat all BAM files as single-end. A message will be printed out if paired-end BAM file is detected.
# load bam file
bam = loadBam(bam_file)
bam
## GAlignments object with 586492 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr1 - 28M 28 15823 15850 28
## [2] chr1 - 28M 28 15874 15901 28
## [3] chr1 - 29M 29 17508 17536 29
## [4] chr1 - 29M 29 17508 17536 29
## [5] chr1 - 28M 28 17508 17535 28
## ... ... ... ... ... ... ... ...
## [586488] chr1 + 29M 29 248918346 248918374 29
## [586489] chr1 + 29M 29 248918346 248918374 29
## [586490] chr1 - 28M 28 248935323 248935350 28
## [586491] chr1 + 25M 25 248936965 248936989 25
## [586492] chr1 + 28M 28 248942310 248942337 28
## njunc
## <integer>
## [1] 0
## [2] 0
## [3] 0
## [4] 0
## [5] 0
## ... ...
## [586488] 0
## [586489] 0
## [586490] 0
## [586491] 0
## [586492] 0
## -------
## seqinfo: 194 sequences from an unspecified genome
# load txdb file
txdb = loadDb(txdb_file)
# alternatively, you can use bioconductor human annotation txdb object
# txdb = TxDb.Hsapiens.UCSC.hg38.knownGene
Ribo-seq reads, or ribosome protected fragments (RPFs) tend to have a specific length distribution. Usually the RPFs are around 30 bp. We implement readLengthDist
function to count the number and percentage of reads for each read length. The function readLengthDistPlot
can be used to plot the read length distribution (percentage or actual read counts) as a barplot.
# read length distribution
read_len_dist = readLengthDist(bam)
read_len_dist
## readLen count pct
## 1 25 30437 5.1896701
## 2 26 46103 7.8608063
## 3 27 101916 17.3772191
## 4 28 249421 42.5276048
## 5 29 83834 14.2941421
## 6 30 25058 4.2725220
## 7 31 24984 4.2599047
## 8 32 11563 1.9715529
## 9 33 7672 1.3081167
## 10 34 3577 0.6098975
## 11 35 1927 0.3285637
# read length distribution plot
p = readLengthDistPlot(read_len_dist, color='steelblue', type='pct')
p
If you run through the Ribomake workflow, it will also generate a read length distribution plot. You might notice the distribution is somehow different than presented here. There might be two reasons: 1. the example data presented here only include reads aligned to chromosome 1; and 2. here we take all aligned reads regardless of where they are mapped (for example, reads aligned to intergenic regions are also counted here). But the Ribomake workflow only takes reads that are aligned to transcripts.
It may also be interesting to see where the reads are aligned on the genome. We implement readGenomeDist
function to assign reads to genomic features such as CDS, UTR, Intron, and Intergenic. This function resizes reads to its 5’-end positions to minimize overlap with multiple features. It returns a list of two elements, the first is the genomic feature assignment for each read, and the second is the summary statistics for user specified genomic features.
# read genomic feature distribution
read_feature_dist = readGenomeDist(bam, txdb, category=c('CDS', 'UTR', 'Intron', 'Intergenic'),
fiveEndOnly=TRUE, ignoreStrand=TRUE)
read_feature_dist$summary
## feature count pct
## 1 CDS 358224 63.084265
## 2 Intergenic 9920 1.746940
## 3 Intron 146884 25.866690
## 4 UTR 52822 9.302104
# read genomic feature distribution plot
p = readGenomeDistPlot(read_feature_dist$summary, type='pct')
p
We can see that for this example dataset, most reads are assigned to CDS and UTR. What is interesting about the full dataset is that if we take all the reads aligned to all chromosomes, the most majority of reads are actually assigned to intergenic regions.
A very useful and important plot in Ribo-seq data analysis is the metagene plot, especially around CDS start (start codon) and CDS end (stop codon) regions. Due to the large size of the ribosome, the 5’-end of RPFs are actually upstream of the ribosome P-site. This shift is called P-site offset. Metagene plot is a good way to determine the P-site offset.
We implement calcMetagene
and metagenePlot
function to compute the metagene matrix and make the plot. For metagene matrix calculation, we require a BAM alignment and a metagene region. Similar with genomic feature plot, we also take only the 5’-end position for each read. If the region is not specified, we will extract all CDS start and end regions from the TxDb annotation. Then the metagene matrix is calculated using either user specified region or the extracted two CDS start and end regions. Each row in the metagene matrix represents a transcript, and each column represents a position in the metagene region. We also provide the option to normalize row-wise the counts to the total counts in each row.
For metagene plot, one plot will be made if the metagene region is specified by users. Otherwise, two plots around CDS start and end regions will be made. Position 0 in CDS start means the first base of the start codon, and position 0 in CDS end means the last base of the stop codon.
# metagene
metagene = calcMetagene(bam, txdb=txdb)
# metagene plot
p = metagenePlot(metagene, metageneFun='mean')
p
From the metagene plot, we do not specify the metagene region and use the extracted CDS start and end regions. We can see a large peak at -12 position in CDS start regions. Therefore, we can determine that the P-site offset for this example dataset is 12 bp. In fact, the 12 bp offset is a very common number for many Ribo-seq experiments. For CDS end regions, we can also see a large peak at -17 position. This also makes sense, because if we shift the reads 12 bp downstream, this peak will be at -5 position, which is exactly the P-site position since position -5, -4, -3 is the last codon before stop codon, and position -2, -1, and 0 is the stop codon.
From the read length distribution and the metagene plots, we can determine the most abundant read lengths and the P-site offset. For this example dataset, we can take 28 bp reads with 12 bp P-site offset. Before downstream analysis, reads needs to be shifted. We implement shiftReads
function to select certain read lengths and shift reads towards downstream direction.
# shift reads
# here we do not select read length but shift all reads 12 bp downstream
# set fiveEndOnly to FALSE will return a GAlignments object instead of a GRanges object
bam_shift = shiftReads(bam, shiftLens=12, fiveEndOnly=FALSE)
bam_shift
## GAlignments object with 586492 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr1 - 13M 13 17681 17693 13
## [2] chr1 - 13M 13 17960 17972 13
## [3] chr1 - 13M 13 91248 91260 13
## [4] chr1 - 10M8277N3M 13 259016 267305 8290
## [5] chr1 - 13M 13 268702 268714 13
## ... ... ... ... ... ... ... ...
## [586488] chr1 - 23M 23 244863904 244863926 23
## [586489] chr1 - 23M 23 244863904 244863926 23
## [586490] chr1 - 23M 23 247199336 247199358 23
## [586491] chr1 + 23M 23 247857649 247857671 23
## [586492] chr1 - 23M 23 248856369 248856391 23
## njunc
## <integer>
## [1] 0
## [2] 0
## [3] 0
## [4] 1
## [5] 0
## ... ...
## [586488] 0
## [586489] 0
## [586490] 0
## [586491] 0
## [586492] 0
## -------
## seqinfo: 194 sequences from an unspecified genome
# shift reads
# here we select reads with 28 bp and shift 12 bp downstream
# set fiveEndOnly to FALSE will return a GAlignments object instead of a GRanges object
bam_shift_28bp = shiftReads(bam, readLens=28, shiftLens=12, fiveEndOnly=FALSE)
bam_shift_28bp
## GAlignments object with 249421 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr1 - 16M 16 15823 15838 16
## [2] chr1 - 16M 16 15874 15889 16
## [3] chr1 - 16M 16 17508 17523 16
## [4] chr1 - 16M 16 17684 17699 16
## [5] chr1 - 16M 16 17688 17703 16
## ... ... ... ... ... ... ... ...
## [249417] chr1 + 16M 16 248918253 248918268 16
## [249418] chr1 + 16M 16 248918262 248918277 16
## [249419] chr1 + 16M 16 248918289 248918304 16
## [249420] chr1 - 16M 16 248935323 248935338 16
## [249421] chr1 + 16M 16 248942322 248942337 16
## njunc
## <integer>
## [1] 0
## [2] 0
## [3] 0
## [4] 0
## [5] 0
## ... ...
## [249417] 0
## [249418] 0
## [249419] 0
## [249420] 0
## [249421] 0
## -------
## seqinfo: 194 sequences from an unspecified genome
We can also make a metagene plot using the shifted reads. In this case, we can see the large peak should be aligned at exactly position 0, which is the start codon.
# metagene plot
p = metagenePlot(calcMetagene(bam_shift, txdb=txdb), metageneFun='mean')
p
After shifting reads, we can perform further analyses. Here, we show the calculation of ORFScore and ORF expression.
ORFScore is firstly defined in Bazzini et al., 2014 (PMID: 24705786), and is used to discover novel open reading frames (ORFs) or rank ORFs showing active translation. ORFScore counts reads falling into the three frames (represented as frame 1, 2, and 3). Then, a Chi-squared test statistic is computated by comparing the read counts with an equal null distribution p = c(1/3, 1/3, 1/3). Also, if frame 1 has more reads than frame 2 and 3, the sign of ORFScore will be positive, and negative otherwise. So in summary, ORFScore is a simple way to check if the reads follows the three-nucleotide periodicity.
We implement calcORFScore
function to calculate ORFScore given aligned reads and ORF coordinates.
The aligned reads should be output from shiftReads
function.
The ORF coordinates should be a GRangesList
object and each element is a GRanges
object representing an ORF. Also, we recommend assigning a unique name to each ORF. We perform some filtering and clearning steps to the ORFs user provides. 1. If the names are NULL, rename each element as “orf_1”, “orf_2”, etc; 2. Strands marked as "*" are replaced with “+”; 3. Remove elements with multiple chromosomes or strands (one ORF is on multiple chromosomes or different strands); 4. Remove elements where the ORF length is not divisible by 3; and 5. MOST IMPORTANTLY, if an ORF is on positive strand, sort by coordinates (seqnames, start, end) in ascending order. Otherwise, sort by coordinates (seqnames, end, start) in descending order. The purpose is to achieve the same behavior as cdsBy function in GenomicFeatures
package.
By default, we assume the first position in each ORF is frame 1, the second position is frame 2, the third is frame 3, and repeat this pattern afterwards. We also assume the frame 1 is expected to have higher counts, and the null hypothesis is equal distribution on the three frames.
Note that there are many packages designed for predicting ORFs from genome sequences. We plan in the near future to make another tutorial on extracting potential ORFs and converting them into a GRangesList
object.
Here, we take the shifted 28 bp reads and use canonical ORFs to illustrate ORFScore calculation.
canonical_orfs = cdsBy(txdb, by='tx', use.names=TRUE) # use transcript names
orfscore = calcORFScore(bam_shift_28bp, canonical_orfs)
rmarkdown::paged_table(orfscore)
The result is a dataframe of ORFScore related stats, including the read counts for the three frames, the raw ORFScore (Chi-squared test statistic) and the log2 transformed ORFScore.
The three columns (frame1PosPct, frame2PosPct, and frame3PosPct) summarize the proportion of frame 1, 2, and 3 positions having non-zero read counts. The purpose of these three columns is to help filtering ORFs with high ORFScore, but the reads only show up in very few positions in the target frame. An example would be an ORF has 300 positions. Frame 1 has 100 read counts, and frame 2 and 3 has 0 read counts. But all the 100 read counts for frame 1 are located in the same position. In this case, the ORFScore will be large (if frame 1 is the target frame), but frame1PosPct is small (only 1%). This ORF might be more likely to be a false positive.
We can also calculate the total counts and normalized expression (e.g. read counts per kilo base per million reads, RPKM). We implement calcRPKM
function to calculate read counts and RPKM values given aligned reads and ORF coordinates.
Also, from the above metagene plots, reads tend to stack at coding start and end regions for Ribo-seq data. So we add the option to trim from both ORF ends. By default, 6 bp or two codons are trimmed from ORF start and end.
# we use all shifted reads here
orf_rpkm = calcRPKM(bam_shift, canonical_orfs)
rmarkdown::paged_table(orf_rpkm)
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] RiboSeeker_0.0.1.0 ggplot2_3.3.4
## [3] GenomicFeatures_1.42.2 AnnotationDbi_1.52.0
## [5] GenomicAlignments_1.26.0 Rsamtools_2.6.0
## [7] Biostrings_2.58.0 XVector_0.30.0
## [9] SummarizedExperiment_1.20.0 Biobase_2.50.0
## [11] MatrixGenerics_1.2.1 matrixStats_0.58.0
## [13] GenomicRanges_1.42.0 GenomeInfoDb_1.26.4
## [15] IRanges_2.24.1 S4Vectors_0.28.1
## [17] BiocGenerics_0.36.0
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.2 jsonlite_1.7.2 bit64_4.0.5
## [4] assertthat_0.2.1 askpass_1.1 highr_0.8
## [7] BiocFileCache_1.14.0 blob_1.2.1 GenomeInfoDbData_1.2.4
## [10] yaml_2.2.1 progress_1.2.2 pillar_1.5.1
## [13] RSQLite_2.2.4 lattice_0.20-41 glue_1.4.2
## [16] digest_0.6.27 colorspace_2.0-1 htmltools_0.5.1.1
## [19] Matrix_1.3-2 XML_3.99-0.6 pkgconfig_2.0.3
## [22] biomaRt_2.46.3 zlibbioc_1.36.0 purrr_0.3.4
## [25] scales_1.1.1 BiocParallel_1.24.1 tibble_3.1.0
## [28] openssl_1.4.3 farver_2.1.0 generics_0.1.0
## [31] ellipsis_0.3.1 cachem_1.0.4 withr_2.4.1
## [34] magrittr_2.0.1 crayon_1.4.1 memoise_2.0.0
## [37] evaluate_0.14 fansi_0.4.2 xml2_1.3.2
## [40] tools_4.0.2 prettyunits_1.1.1 hms_1.0.0
## [43] lifecycle_1.0.0 stringr_1.4.0 munsell_0.5.0
## [46] DelayedArray_0.16.2 compiler_4.0.2 rlang_0.4.10
## [49] grid_4.0.2 RCurl_1.98-1.3 rappdirs_0.3.3
## [52] labeling_0.4.2 bitops_1.0-6 rmarkdown_2.9
## [55] gtable_0.3.0 DBI_1.1.1 curl_4.3
## [58] R6_2.5.0 knitr_1.31 dplyr_1.0.5
## [61] rtracklayer_1.50.0 fastmap_1.1.0 bit_4.0.4
## [64] utf8_1.2.1 stringi_1.5.3 Rcpp_1.0.6
## [67] vctrs_0.3.6 dbplyr_2.1.0 tidyselect_1.1.0
## [70] xfun_0.24