Detecting editing sites in NGS data of miRNAs
Supplementary protocol for the paper:
 'Systematic identification of edited microRNAs in the human brain'
Notes:
A) This protocol refers to the output of Illumina sequencing machines. Modifications are needed to achieve compatibility for different platforms.
B) The scripts attached are written in Perl. Running these files requires pre-installation of Perl.
Step 0: Acquiring the resulting output of the sequencing run
As the following analysis relies on the quality scores of the sequencing reads, the Fastq formatted output of the sequencing run is required. The Fastq file could be the raw reads without any filter, or the filtered Illumina software tools output. In the latter case, proceed directly to Step 2. Otherwise, continue with Step 1.
Step 1: Filtering low quality reads and trimming sequence adaptors
Raw sequencing reads likely contain parts of the adapter sequence. Therefore, these sequences must be identified and removed. Moreover, low quality reads (as defined by the reads quality score) are unlikely to be informative and therefore should be removed. There are several published tools for raw Fastq filtering, or one can use our in-house filtering script. The script allows the user to define 'low quality reads' by choosing: a) a quality score cutoff and b) the maximum number of locations allowed to have lower quality score compared to the chosen cutoff. For example, in the paper we have used a cutoff of 20 for the phred quality score and the maximum number of locations was set to 3. The script also trims the adapters and the resulting trimmed read can be filtered if it is too long or too short (as defined by the user). For example, in the paper we have removed reads with length longer than 28 bases or shorter than 15 bases.
Step 2: Aligning the reads against the genome
The filtered and trimmed reads are aligned against the genome of interest and not against a miRNAs sequences database. We require unique best alignment, that is, the reads cannot be aligned to other locations in the genome with the same number of mismatches. Lastly, we recommend using only alignments with up to one mismatch (in our datasets, allowing up to two mismatches did not aid in detecting more editing sites, instead, it added unreliable alignments). These steps, taken together solve, by and large, the cross mapping problem that significantly hinders identification of true editing sites in mature miRNAs.
The last two bases (the 3' end) of mature miRNA undergo extensive adenylation and uridylation. It is thus recommended that these bases will not be considered in the alignment. Naturally, doing so prevents detection of editing in these locations. However, not taking this measure and still demanding low number of mismatches will severely reduce the number of alignments obtained.
We have used Bowtie for aligning the reads against the human and mouse DNA. Bowtie executable, as well as pre-build indexes for the human and mouse genomes, can be found here.
The following command line runs bowtie in a manner compatible with the above considerations:
bowtie -n 1 -e 50 -a -m 1 --best --strata --trim3 2 The_bowtie_folder/The_genome_indexes The_filtered_fastq_file > The_output_file
Note that the allowed phred score in the mismatch position is not limited (-e 50). This means that even mismatches with high quality score will be allowed to align.
Step 3: Mapping the mismatches against the pre-miRNA sequences
The purpose of this step is to move from reads aligned against the genome (the end-point of the previous step) to counts of each of the four possible nucleotides at each position along the pre-miRNA sequence, for all the pre-miRNAs. Performing this transformation will allow us to focus our analysis only on bona fide miRNA and to use, in the following step, binomial statistics in order to detect significant modifications inside them.
We used two pre-built files for the transformation: the alignment of the pre-miRNA sequences against the genome and the mature/star sequences of miRNAs. The miRNA sequences can be downloaded from miRBase, and the alignment of the pre-miRNA sequence against the genome can be achieved using bowtie. Taken together, these datasets give the location of the pre-miRNA and the mature/star miRNA inside the genome. We also used RNAfold to identify the secondary structure of the pre-miRNA. We give these files for human and mouse (using miRBase release 17): RNAfold human pre-miRNA, RNAfold mouse pre-miRNA, mature miRNA sequence, human pre-miRNA aligned against the genome and mouse pre-miRNA aligned against the genome. Files for human and mouse using miRBase release 18 can be found here.
These files above and the bowtie output file from the previous step are the input for this script. This script transforms the data as described above. A key user-defined input for this script is the minimum quality score allowed in the location of the mismatch in order for it to be counted. Naturally, the higher this number is, the lower the probability for sequencing error (see also below). However, taking very high quality score filter will give small number of counted modifications. In the paper we have used 30 as the minimum quality score allowed.
The main output of this script is a text file (termed 'main_output.txt') containing the counts of each of the four possible nucleotides at each position along all the pre-miRNA sequences. This output file is later used for the binomial test. This output representation can also be used to detect significant variation from the miRBAse mature/star definitions (in terms of beginning and end locations). Only modifications inside the mature/star sequence as defined in miRBAse are analyzed in the script. Additional output file is a text file containing the number of reads for each mature/star miRNA, this file (termed 'miRNA_expression.txt') can be used for expression analysis.
The binomial statistics requires the number of successes (the number of mismatches of a given type in a given position), the number of trials (the total number of counts in the given position) and the sequencing error probability. The sequencing error probability can be estimated using the phred score. For example, if only mismatches with phred score of 30 are allowed, the expected base call error rate should be 0.1% in each position. The aligned read data can be used in order to examine the actual sequencing error in the retained reads. Merging the data from all the mature and star miRNAs the rate of any type of mismatch from the 12 possible mismatches can be calculated. In this calculation the location of the mismatch should also be taken into account, as biological modifications appear in the 3' of the miRNA and sequencing quality is lower towards the end of the read. Therefore, the calculation is performed separately in each position along the mature/star miRNA and the results are also presented in the file 'main_output.txt'. The rate relevant for the statistical analysis to follow is the maximal mismatch rate of any single type of mismatch. If it is higher than the rate expected from the phred score, the higher estimate should be used.
Note that running this script on large sequencing datasets (large bowtie output files) requires extensive internal memory resources. A way around this hurdle is to divide the sequencing dataset into several smaller files; afterwards the output files can be merged.
Step 4: Using binomial statistics to remove sequencing errors
In this step binomial statistics is performed on the output file 'main_output.txt' in order to separate sequencing errors from statistically significant modifications.
Importantly, binomial statistics do not require any arbitrary expression level filter. It is well suited even for low expressed miRNAs with low number of sequencing reads, and the p-values computed reflect the absolute numbers of reads detected, small or large as the case may be.
This analysis is performed for every position (except the last two positions of the miRNA due to the extensive adenylation and uridylation) in every mature/star miRNA separately. As multiple tests are performed, the resulting p-value for each position must be corrected accordingly. The script performing this analysis gives the user an option to use either Bonferroni or Benjamini-Hochberg corrections.
We note that some low-abundance isomirs display 5' sequence modifications similar to the biological modifications reported at the 3' of mature miRNAs. We identified several isomers that start one or two bases upstream from a different isomer and display sequence modifications in the 5' end in the form of adenylation and uridylation. In all these events the abundance of the modified isomer was at least two orders of magnitude lower than the unmodified isomer. We have discarded all these events from our analysis.
The script performing this analysis is given here. The script requires the Perl package Math-CDF.
Step 5: Removing SNPs from the list of statistically significant modifications
Known SNPs should be filtered from the statistically significant modifications detected in the previous step. This can be done manually examining the sites obtained in the UCSC genome browser, or automated using the SNP tables (downloadable from UCSC site).