Introduction

RiboSeeker is an R package for ribosome profiling data analysis. It has two components:


Quick Start

Here, we focus on using the RiboSeeker R package for Ribo-seq data analysis. The example data used are from a human K562 ribosome profiling dataset (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3566410). We run through the Ribomake workflow using hg38 human genome and Ensembl 98 annotation.


Set up packages and data

First install and load RiboSeeker and several other packages to run the example data.

library(GenomicAlignments) # deals with bam alignments
library(GenomicFeatures) # deals with genome annotations
library(GenomicRanges) # deals with general genomic data
library(ggplot2) # deals with visualizations

library(RiboSeeker)

Then specify the example data files. We only take the reads aligned to chromosome 1 to minimize the running time and storage space. For genome annotation, we also only take chromosome 1 annotations.

# example data folder
example_folder = system.file('extdata', package='RiboSeeker')

# bam file
bam_file = file.path(example_folder, 'GSM3566410.chr1.sorted.bam')
# genome annotation txdb file
txdb_file = file.path(example_folder, 'hg38.Ens_98.chr1.txdb.sqlite')

# # alternatively, you can use bioconductor human annotation package
# if (!requireNamespace('BiocManager', quietly=TRUE)) {
#   install.packages("BiocManager")
# }
# BiocManager::install('TxDb.Hsapiens.UCSC.hg38.knownGene')
# library(TxDb.Hsapiens.UCSC.hg38.knownGene)

Next load BAM and TxDb files. We use loadBam function to load an indexed bam file and return a GAlignments object of aligned reads. The loadBam function is essentially a wrapper of readGAlignments function in GenomicAlignments package. Note that Ribo-seq data are typically single-end, the paired RNA-seq data can be paired-end. Since the single- or paired-end information is not very important in Ribo-seq data analysis, we simply treat all BAM files as single-end. A message will be printed out if paired-end BAM file is detected.

# load bam file
bam = loadBam(bam_file)
bam
## GAlignments object with 586492 alignments and 0 metadata columns:
##            seqnames strand       cigar    qwidth     start       end     width
##               <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
##        [1]     chr1      -         28M        28     15823     15850        28
##        [2]     chr1      -         28M        28     15874     15901        28
##        [3]     chr1      -         29M        29     17508     17536        29
##        [4]     chr1      -         29M        29     17508     17536        29
##        [5]     chr1      -         28M        28     17508     17535        28
##        ...      ...    ...         ...       ...       ...       ...       ...
##   [586488]     chr1      +         29M        29 248918346 248918374        29
##   [586489]     chr1      +         29M        29 248918346 248918374        29
##   [586490]     chr1      -         28M        28 248935323 248935350        28
##   [586491]     chr1      +         25M        25 248936965 248936989        25
##   [586492]     chr1      +         28M        28 248942310 248942337        28
##                njunc
##            <integer>
##        [1]         0
##        [2]         0
##        [3]         0
##        [4]         0
##        [5]         0
##        ...       ...
##   [586488]         0
##   [586489]         0
##   [586490]         0
##   [586491]         0
##   [586492]         0
##   -------
##   seqinfo: 194 sequences from an unspecified genome
# load txdb file
txdb = loadDb(txdb_file)

# alternatively, you can use bioconductor human annotation txdb object
# txdb = TxDb.Hsapiens.UCSC.hg38.knownGene


Read length distribution

Ribo-seq reads, or ribosome protected fragments (RPFs) tend to have a specific length distribution. Usually the RPFs are around 30 bp. We implement readLengthDist function to count the number and percentage of reads for each read length. The function readLengthDistPlot can be used to plot the read length distribution (percentage or actual read counts) as a barplot.

# read length distribution
read_len_dist = readLengthDist(bam)
read_len_dist
##    readLen  count        pct
## 1       25  30437  5.1896701
## 2       26  46103  7.8608063
## 3       27 101916 17.3772191
## 4       28 249421 42.5276048
## 5       29  83834 14.2941421
## 6       30  25058  4.2725220
## 7       31  24984  4.2599047
## 8       32  11563  1.9715529
## 9       33   7672  1.3081167
## 10      34   3577  0.6098975
## 11      35   1927  0.3285637
# read length distribution plot
p = readLengthDistPlot(read_len_dist, color='steelblue', type='pct')
p

If you run through the Ribomake workflow, it will also generate a read length distribution plot. You might notice the distribution is somehow different than presented here. There might be two reasons: 1. the example data presented here only include reads aligned to chromosome 1; and 2. here we take all aligned reads regardless of where they are mapped (for example, reads aligned to intergenic regions are also counted here). But the Ribomake workflow only takes reads that are aligned to transcripts.


Genomic feature distribution of aligned reads

It may also be interesting to see where the reads are aligned on the genome. We implement readGenomeDist function to assign reads to genomic features such as CDS, UTR, Intron, and Intergenic. This function resizes reads to its 5’-end positions to minimize overlap with multiple features. It returns a list of two elements, the first is the genomic feature assignment for each read, and the second is the summary statistics for user specified genomic features.

# read genomic feature distribution
read_feature_dist = readGenomeDist(bam, txdb, category=c('CDS', 'UTR', 'Intron', 'Intergenic'), 
  fiveEndOnly=TRUE, ignoreStrand=TRUE)
read_feature_dist$summary
##      feature  count       pct
## 1        CDS 358224 63.084265
## 2 Intergenic   9920  1.746940
## 3     Intron 146884 25.866690
## 4        UTR  52822  9.302104
# read genomic feature distribution plot
p = readGenomeDistPlot(read_feature_dist$summary, type='pct')
p

We can see that for this example dataset, most reads are assigned to CDS and UTR. What is interesting about the full dataset is that if we take all the reads aligned to all chromosomes, the most majority of reads are actually assigned to intergenic regions.


Metagene

A very useful and important plot in Ribo-seq data analysis is the metagene plot, especially around CDS start (start codon) and CDS end (stop codon) regions. Due to the large size of the ribosome, the 5’-end of RPFs are actually upstream of the ribosome P-site. This shift is called P-site offset. Metagene plot is a good way to determine the P-site offset.

We implement calcMetagene and metagenePlot function to compute the metagene matrix and make the plot. For metagene matrix calculation, we require a BAM alignment and a metagene region. Similar with genomic feature plot, we also take only the 5’-end position for each read. If the region is not specified, we will extract all CDS start and end regions from the TxDb annotation. Then the metagene matrix is calculated using either user specified region or the extracted two CDS start and end regions. Each row in the metagene matrix represents a transcript, and each column represents a position in the metagene region. We also provide the option to normalize row-wise the counts to the total counts in each row.

For metagene plot, one plot will be made if the metagene region is specified by users. Otherwise, two plots around CDS start and end regions will be made. Position 0 in CDS start means the first base of the start codon, and position 0 in CDS end means the last base of the stop codon.

# metagene
metagene = calcMetagene(bam, txdb=txdb)
# metagene plot
p = metagenePlot(metagene, metageneFun='mean')
p

From the metagene plot, we do not specify the metagene region and use the extracted CDS start and end regions. We can see a large peak at -12 position in CDS start regions. Therefore, we can determine that the P-site offset for this example dataset is 12 bp. In fact, the 12 bp offset is a very common number for many Ribo-seq experiments. For CDS end regions, we can also see a large peak at -17 position. This also makes sense, because if we shift the reads 12 bp downstream, this peak will be at -5 position, which is exactly the P-site position since position -5, -4, -3 is the last codon before stop codon, and position -2, -1, and 0 is the stop codon.


Shift reads

From the read length distribution and the metagene plots, we can determine the most abundant read lengths and the P-site offset. For this example dataset, we can take 28 bp reads with 12 bp P-site offset. Before downstream analysis, reads needs to be shifted. We implement shiftReads function to select certain read lengths and shift reads towards downstream direction.

# shift reads
# here we do not select read length but shift all reads 12 bp downstream
# set fiveEndOnly to FALSE will return a GAlignments object instead of a GRanges object
bam_shift = shiftReads(bam, shiftLens=12, fiveEndOnly=FALSE)
bam_shift
## GAlignments object with 586492 alignments and 0 metadata columns:
##            seqnames strand       cigar    qwidth     start       end     width
##               <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
##        [1]     chr1      -         13M        13     17681     17693        13
##        [2]     chr1      -         13M        13     17960     17972        13
##        [3]     chr1      -         13M        13     91248     91260        13
##        [4]     chr1      -  10M8277N3M        13    259016    267305      8290
##        [5]     chr1      -         13M        13    268702    268714        13
##        ...      ...    ...         ...       ...       ...       ...       ...
##   [586488]     chr1      -         23M        23 244863904 244863926        23
##   [586489]     chr1      -         23M        23 244863904 244863926        23
##   [586490]     chr1      -         23M        23 247199336 247199358        23
##   [586491]     chr1      +         23M        23 247857649 247857671        23
##   [586492]     chr1      -         23M        23 248856369 248856391        23
##                njunc
##            <integer>
##        [1]         0
##        [2]         0
##        [3]         0
##        [4]         1
##        [5]         0
##        ...       ...
##   [586488]         0
##   [586489]         0
##   [586490]         0
##   [586491]         0
##   [586492]         0
##   -------
##   seqinfo: 194 sequences from an unspecified genome
# shift reads
# here we select reads with 28 bp and shift 12 bp downstream
# set fiveEndOnly to FALSE will return a GAlignments object instead of a GRanges object
bam_shift_28bp = shiftReads(bam, readLens=28, shiftLens=12, fiveEndOnly=FALSE)
bam_shift_28bp
## GAlignments object with 249421 alignments and 0 metadata columns:
##            seqnames strand       cigar    qwidth     start       end     width
##               <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
##        [1]     chr1      -         16M        16     15823     15838        16
##        [2]     chr1      -         16M        16     15874     15889        16
##        [3]     chr1      -         16M        16     17508     17523        16
##        [4]     chr1      -         16M        16     17684     17699        16
##        [5]     chr1      -         16M        16     17688     17703        16
##        ...      ...    ...         ...       ...       ...       ...       ...
##   [249417]     chr1      +         16M        16 248918253 248918268        16
##   [249418]     chr1      +         16M        16 248918262 248918277        16
##   [249419]     chr1      +         16M        16 248918289 248918304        16
##   [249420]     chr1      -         16M        16 248935323 248935338        16
##   [249421]     chr1      +         16M        16 248942322 248942337        16
##                njunc
##            <integer>
##        [1]         0
##        [2]         0
##        [3]         0
##        [4]         0
##        [5]         0
##        ...       ...
##   [249417]         0
##   [249418]         0
##   [249419]         0
##   [249420]         0
##   [249421]         0
##   -------
##   seqinfo: 194 sequences from an unspecified genome

We can also make a metagene plot using the shifted reads. In this case, we can see the large peak should be aligned at exactly position 0, which is the start codon.

# metagene plot
p = metagenePlot(calcMetagene(bam_shift, txdb=txdb), metageneFun='mean')
p


Further analysis

After shifting reads, we can perform further analyses. Here, we show the calculation of ORFScore and ORF expression.


ORFScore

ORFScore is firstly defined in Bazzini et al., 2014 (PMID: 24705786), and is used to discover novel open reading frames (ORFs) or rank ORFs showing active translation. ORFScore counts reads falling into the three frames (represented as frame 1, 2, and 3). Then, a Chi-squared test statistic is computated by comparing the read counts with an equal null distribution p = c(1/3, 1/3, 1/3). Also, if frame 1 has more reads than frame 2 and 3, the sign of ORFScore will be positive, and negative otherwise. So in summary, ORFScore is a simple way to check if the reads follows the three-nucleotide periodicity.

We implement calcORFScore function to calculate ORFScore given aligned reads and ORF coordinates.

  • The aligned reads should be output from shiftReads function.

  • The ORF coordinates should be a GRangesList object and each element is a GRanges object representing an ORF. Also, we recommend assigning a unique name to each ORF. We perform some filtering and clearning steps to the ORFs user provides. 1. If the names are NULL, rename each element as “orf_1”, “orf_2”, etc; 2. Strands marked as "*" are replaced with “+”; 3. Remove elements with multiple chromosomes or strands (one ORF is on multiple chromosomes or different strands); 4. Remove elements where the ORF length is not divisible by 3; and 5. MOST IMPORTANTLY, if an ORF is on positive strand, sort by coordinates (seqnames, start, end) in ascending order. Otherwise, sort by coordinates (seqnames, end, start) in descending order. The purpose is to achieve the same behavior as cdsBy function in GenomicFeatures package.

  • By default, we assume the first position in each ORF is frame 1, the second position is frame 2, the third is frame 3, and repeat this pattern afterwards. We also assume the frame 1 is expected to have higher counts, and the null hypothesis is equal distribution on the three frames.

Note that there are many packages designed for predicting ORFs from genome sequences. We plan in the near future to make another tutorial on extracting potential ORFs and converting them into a GRangesList object.

Here, we take the shifted 28 bp reads and use canonical ORFs to illustrate ORFScore calculation.

canonical_orfs = cdsBy(txdb, by='tx', use.names=TRUE) # use transcript names
orfscore = calcORFScore(bam_shift_28bp, canonical_orfs)

rmarkdown::paged_table(orfscore)


The result is a dataframe of ORFScore related stats, including the read counts for the three frames, the raw ORFScore (Chi-squared test statistic) and the log2 transformed ORFScore.

The three columns (frame1PosPct, frame2PosPct, and frame3PosPct) summarize the proportion of frame 1, 2, and 3 positions having non-zero read counts. The purpose of these three columns is to help filtering ORFs with high ORFScore, but the reads only show up in very few positions in the target frame. An example would be an ORF has 300 positions. Frame 1 has 100 read counts, and frame 2 and 3 has 0 read counts. But all the 100 read counts for frame 1 are located in the same position. In this case, the ORFScore will be large (if frame 1 is the target frame), but frame1PosPct is small (only 1%). This ORF might be more likely to be a false positive.


ORF expression

We can also calculate the total counts and normalized expression (e.g. read counts per kilo base per million reads, RPKM). We implement calcRPKM function to calculate read counts and RPKM values given aligned reads and ORF coordinates.

Also, from the above metagene plots, reads tend to stack at coding start and end regions for Ribo-seq data. So we add the option to trim from both ORF ends. By default, 6 bp or two codons are trimmed from ORF start and end.

# we use all shifted reads here
orf_rpkm = calcRPKM(bam_shift, canonical_orfs)

rmarkdown::paged_table(orf_rpkm)


Session Info

## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS  10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] RiboSeeker_0.0.1.0          ggplot2_3.3.4              
##  [3] GenomicFeatures_1.42.2      AnnotationDbi_1.52.0       
##  [5] GenomicAlignments_1.26.0    Rsamtools_2.6.0            
##  [7] Biostrings_2.58.0           XVector_0.30.0             
##  [9] SummarizedExperiment_1.20.0 Biobase_2.50.0             
## [11] MatrixGenerics_1.2.1        matrixStats_0.58.0         
## [13] GenomicRanges_1.42.0        GenomeInfoDb_1.26.4        
## [15] IRanges_2.24.1              S4Vectors_0.28.1           
## [17] BiocGenerics_0.36.0        
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.2             jsonlite_1.7.2         bit64_4.0.5           
##  [4] assertthat_0.2.1       askpass_1.1            highr_0.8             
##  [7] BiocFileCache_1.14.0   blob_1.2.1             GenomeInfoDbData_1.2.4
## [10] yaml_2.2.1             progress_1.2.2         pillar_1.5.1          
## [13] RSQLite_2.2.4          lattice_0.20-41        glue_1.4.2            
## [16] digest_0.6.27          colorspace_2.0-1       htmltools_0.5.1.1     
## [19] Matrix_1.3-2           XML_3.99-0.6           pkgconfig_2.0.3       
## [22] biomaRt_2.46.3         zlibbioc_1.36.0        purrr_0.3.4           
## [25] scales_1.1.1           BiocParallel_1.24.1    tibble_3.1.0          
## [28] openssl_1.4.3          farver_2.1.0           generics_0.1.0        
## [31] ellipsis_0.3.1         cachem_1.0.4           withr_2.4.1           
## [34] magrittr_2.0.1         crayon_1.4.1           memoise_2.0.0         
## [37] evaluate_0.14          fansi_0.4.2            xml2_1.3.2            
## [40] tools_4.0.2            prettyunits_1.1.1      hms_1.0.0             
## [43] lifecycle_1.0.0        stringr_1.4.0          munsell_0.5.0         
## [46] DelayedArray_0.16.2    compiler_4.0.2         rlang_0.4.10          
## [49] grid_4.0.2             RCurl_1.98-1.3         rappdirs_0.3.3        
## [52] labeling_0.4.2         bitops_1.0-6           rmarkdown_2.9         
## [55] gtable_0.3.0           DBI_1.1.1              curl_4.3              
## [58] R6_2.5.0               knitr_1.31             dplyr_1.0.5           
## [61] rtracklayer_1.50.0     fastmap_1.1.0          bit_4.0.4             
## [64] utf8_1.2.1             stringi_1.5.3          Rcpp_1.0.6            
## [67] vctrs_0.3.6            dbplyr_2.1.0           tidyselect_1.1.0      
## [70] xfun_0.24