User guide

Parameters

To view command-line parameters type miRge3.0 -h:

usage: miRge3.0 [options]

miRge3.0 (Comprehensive analysis of small RNA sequencing Data)

optional arguments:
  -h, --help  show this help message and exit
  --version   show program's version number and exit

Options:
  -s,    --samples            list of one or more samples separated by comma or a file with list of samples separated by new line (accepts *.fastq, *.fastq.gz) 
  -db,   --mir-DB             the reference database of miRNA. Options: miRBase and miRGeneDB (Default: miRBase) 
  -lib,  --libraries-path     the path to miRge libraries 
  -on,   --organism-name      the organism name can be human, mouse, fruitfly, nematode, rat or zebrafish
  -ex,   --crThreshold        the threshold of the proportion of canonical reads for the miRNAs to retain. Range for ex (0 - 0.5), (Default: 0.1)
  -phr,  --phred64            phred64 format (Default: 33)
  -spk,  --spikeIn            switch to annotate spike-ins if spike-in bowtie index files are located at the path of bowtie's index files (Default: off)
  -ie,   --isoform-entropy    switch to calculate isomir entropy (default: off)
  -cpu,  --threads            the number of processors to use for trimming, qc, and alignment (Default: 1)
  -ai,   --AtoI               switch to calculate A to I editing (Default: off)
  -tcf   --tcf-out            switch to write trimmed and collapsed fasta file (Default: off)
  -gff   --gff-out            switch to output isomiR results in gff format (Default: off) 
  -bam   --bam-out            switch to output results in bam format (Default: off) 
  -trf   --tRNA-frag          switch to analyze tRNA fragment and halves (Default: off)
  -o     --outDir             the directory of the outputs (Default: current directory) 
  -dex   --diffex             perform differential expression with DESeq2 (Default: off)
  -mdt   --metadata           the path to metadata file (Default: off, require '.csv' file format if -dex is opted)
  -cms   --chunkmbs           chunk memory in megabytes per thread to use during bowtie alignment (Default: 256)
  -spl   --save-pkl           save collapsed reads in binary format for later runs (Default: off)
  -rr    --resume             resume from collapsed reads (Default: off)
  -shh   --quiet              enable quiet/silent mode, only show warnings and errors (Default: off)

Data pre-processing:
  -a,    --adapter            Sequence of a 3' adapter. The adapter and subsequent bases are trimmed
  -g,    --front              Sequence of a 5' adapter. The adapter and any preceding bases are trimmed
  -u,    --cut                Remove bases from each read. If LENGTH is positive, remove bases from the beginning. If LENGTH is negative, remove bases from the end
  -nxt,  --nextseq-trim       NextSeq-specific quality trimming (each read). Trims also dark cycles appearing as high-quality G bases
  -q,    --quality-cutoff     Trim low-quality bases from 5' and/or 3' ends of each read before adapter removal. If one value is given, only the 3' end is trimmed
                              If two comma-separated cutoffs are given, the 5' end is trimmed with the first cutoff, the 3' end with the second
  -l,    --length             Shorten reads to LENGTH. Positive values remove bases at the end while negative ones remove bases at the beginning. This and the following
                              modifications are applied after adapter trimming
  -NX,   --trim-n             Trim N's on ends of reads
  -m,    --minimum-length     Discard reads shorter than LEN. (Default: 16)
  -umi,  --uniq-mol-ids       Trim nucleotides of specific length at 5’ and 3’ ends of the read, after adapter trimming. eg: 4,4 or 0,4. (Use -udd to remove PCR duplicates)  
  -udd,  --umiDedup           Specifies argument to removes PCR duplicates (Default: False); if TRUE it will remove UMI and remove PCR duplicates otherwise it only remove UMI and keep the raw counts (Require -umi option)
  -qumi, --qiagenumi          Removes PCR duplicates of reads obtained from Qiagen platform (Default: Illumina; "-umi x,y " Required)
  
  
miRNA Error Correction:
  microRNA correction method for single base substitutions due to sequencing errors (Note: Refines reads at the expense of time)
  -mEC,  --miREC              Enable miRNA error correction (miREC)
  -kh,   --threshold          the value for frequency threshold τ (Default kh = 5)
  -ks,   --kmer-start         kmer range start value (k_1, default 15) 
  -ke,   --kmer-end           kmer range end value (k_end, default 20)

Predicting novel miRNAs:
  The predictive model for novel miRNA detection is trained on human and mouse!
  -nmir, --novel-miRNA        include prediction of novel miRNAs
  -minl, --minLength          the minimum length of the retained reads for novel miRNA detection (default: 16)
  -maxl, --maxLength          the maximum length of the retained reads for novel miRNA detection (default: 25)
  -c,    --minReadCounts      the minimum read counts supporting novel miRNA detection (default: 2)
  -mloc, --maxMappingLoci     the maximum number of mapping loci for the retained reads for novel miRNA detection (default: 3)
  -sl,   --seedLength         the seed length when invoking Bowtie for novel miRNA detection (default: 25)
  -olc,  --overlapLenCutoff   the length of overlapped seqence when joining reads into longer sequences based on the coordinate 
                              on the genome for novel miRNA detection (default: 14)
  -clc,  --clusterLength      the maximum length of the clustered sequences for novel miRNA detection (default: 30)

Optional PATH arguments:
  -pbwt, --bowtie-path        the path to system's directory containing bowtie binary
  -psam, --samtools-path      the path to system's directory containing samtools binary
  -prf,  --RNAfold-path       the path to system's directory containing RNAfold binary

miRge3.0 libraries

miRge3.0 pipeline aligns the raw reads against a set of small-RNA annotation libraries. The libraries specific to the organism of interest can be obtained from SourceForge. Downloading the libraries on terminal:

Command-line Interface (CLI)

We recommend to create a directory miRge3_Lib and download using wget as shown below,

mkdir miRge3_Lib
cd miRge3_Lib
wget -O human.tar.gz "https://sourceforge.net/projects/mirge3/files/miRge3_Lib/human.tar.gz/download"
wget -O mouse.tar.gz "https://sourceforge.net/projects/mirge3/files/miRge3_Lib/mouse.tar.gz/download"
wget -O rat.tar.gz "https://sourceforge.net/projects/mirge3/files/miRge3_Lib/rat.tar.gz/download"
wget -O nematode.tar.gz "https://sourceforge.net/projects/mirge3/files/miRge3_Lib/nematode.tar.gz/download"
wget -O fruitfly.tar.gz "https://sourceforge.net/projects/mirge3/files/miRge3_Lib/fruitfly.tar.gz/download"
wget -O zebrafish.tar.gz "https://sourceforge.net/projects/mirge3/files/miRge3_Lib/zebrafish.tar.gz/download"
wget -O hamster.tar.gz "https://sourceforge.net/projects/mirge3/files/miRge3_Lib/hamster.tar.gz/download"

Users can download only what is necessary. Unzip the files once downloaded by the following command:

tar -xzf human.tar.gz

Replace human with the organism of interest. If you want to extract all the files at once, you could use tar -xzf *.tar.gz instead.

Direct download

If you are having trouble downloading files through SourceForge, please use the direct link to download the library by clicking on links: Human, Mouse, Rat, Zebrafish, Nematode, Fruitfly, Golden Hamster and md5sum.

Graphical User Interface (GUI)

We recommend to create a folder miRge3_Lib and download the libraries directly from SourceForge. Once downloaded, extract/unzip the compressed files.

Building new libraries

If you are interested in creating specific library for an organism that is not part of this set then please refer to miRge3_build.

CLI - Example usage

Example command usage:

miRge3.0 -s SRR772403.fastq,SRR772404.fastq,SRR772405.fastq,SRR772406.fastq -lib miRge3_Lib -on human -db mirgenedb -o output_dir -gff -nmir -trf -ai -cpu 12 -a illumina 

Output command line:

bowtie version: 1.2.3
Samtools version: 1.7
RNAfold version: 2.4.14
Collecting and validating input files...

miRge3.0 will process 4 out of 4 input file(s).

Cutadapt finished for file SRR772403 in 2.5358 second(s)
Collapsing finished for file SRR772403 in 0.0126 second(s)

Cutadapt finished for file SRR772404 in 7.3542 second(s)
Collapsing finished for file SRR772404 in 0.2786 second(s)

Cutadapt finished for file SRR772405 in 11.0667 second(s)
Collapsing finished for file SRR772405 in 0.8585 second(s)

Cutadapt finished for file SRR772406 in 3.5771 second(s)
Collapsing finished for file SRR772406 in 0.8677 second(s)

Matrix creation finished in 0.3838 second(s)

Data pre-processing completed in 27.2443 second(s)

Alignment in progress ...
Alignment completed in 15.8305 second(s)

Summarizing and tabulating results...
The number of A-to-I editing sites for is less than 10 so that no heatmap is drawn.
Summary completed in 71.4691 second(s)

Predicting novel miRNAs


Performing prediction of novel miRNAs...
Start to predict
Prediction of novel miRNAs Completed (104.83 sec)

The analysis completed in 222.2487 second(s)

Test

The test case illustrates the usage of miRge3.0 with a sample dataset, mapping to human reference libraries.

First download human miRge libraries as shown below:

mkdir miRge3_Lib
cd miRge3_Lib
wget -O human.tar.gz "https://sourceforge.net/projects/mirge3/files/miRge3_Lib/human.tar.gz/download"
tar -xzf human.tar.gz
cd ..

Download the sample file from Source Forge, SRR772403

You can download to your working directory as shown below:
wget -O SRR772403.fastq.gz "https://sourceforge.net/projects/mirge3/files/test/SRR772403.fastq.gz/download"

Run basic miRge3.0 command to annotate and report isomiRs

miRge3.0 -s SRR772403.fastq.gz -lib /mnt/d/Halushka_lab/Arun/miRge3_Lib -a illumina -on human -db mirbase -o output_dir -gff -cpu 8

bowtie version: 1.3.0
cutadapt version: 3.1
Samtools version: 1.11
Collecting and validating input files...

miRge3.0 will process 1 out of 1 input file(s).

Cutadapt finished for file SRR772403 in 3.4343 second(s)
Collapsing finished for file SRR772403 in 0.0216 second(s)

Matrix creation finished in 0.0263 second(s)

Data pre-processing completed in 3.5111 second(s)

Alignment in progress ...
Alignment completed in 8.1488 second(s)

Summarizing and tabulating results...
Summary completed in 2.27 second(s)


The analysis completed in 15.2276 second(s)

Output folder, sample output can be accessed here

miRge creates a subfolder inside the folder "output_dir" and all the files will be stored there. The test output can be accessed at the following link:
https://sourceforge.net/projects/mirge3/files/test/output_dir/miRge.2021-06-25_15-16-58/

Trimming both 5’ and 3’ adapters - Linked adapters

If the data contains adapters at both 5’ and 3’ ends of the reads and both the adapters need to be removed then you should perform linked adapter trimming. This is part of Cutadapt and more about linked adapters can be found here.

Example:

miRge3.0 -s DRR013811.fastq -lib /mnt/d/Halushka_lab/Arun/GTF_Repeats_miRge2to3/miRge3_Lib/revised_hsa  -on human -db mirbase -o output_dir -g "TTAGGC...TGGAATTCTCGGGTGCCAAGGAACTCCAGT"

Description of adapter: "TTAGGC...TGGAATTCTCGGGTGCCAAGGAACTCCAGT", where TTAGGC is the 5’ adapter and TGGAATTCTCGGGTGCCAAGGAACTCCAGT is the 3’ adapter sequence.

Note: Complete adapter sequence must be provided (mandatory) i.e., simply specifying illumina will not be decoded to its actual adapter sequence.
This will NOT WORK: -g "TTAGGC...illumina"
This will WORK: -g "TTAGGC...TGGAATTCTCGGGTGCCAAGGAACTCCAGT"

Save and resume functions

Saving collapsed reads and accessory files in binary (pickle) format

For researchers interested in trying different parameters without redoing the entire run, the post-collapsed reads datafile can be saved. The parameter -spl/--save-pkl (save pickle) should be specified to save the pickle files. By default the internal variables such as the Pandas dataframe containing collapsed reads before alignment, read summary and sample information is saved as two different pickle files namely collapsed.pkl for collapsed read counts and collapsed_accessories.pkl for accessory files (read summary, sample information etc). An example usage is described below:

miRge3.0 -s SRR772403.fastq,SRR772404.fastq -a illumina -lib miRge3_Lib -on human -db mirbase -o output_dir -spl
bowtie version: 1.3.0
cutadapt version: 4.1
Samtools version: 1.3.1
Collecting and validating input files...

miRge3.0 will process 2 out of 2 input file(s).

Cutadapt finished for file SRR772403 in 3.8598 second(s)
Collapsing finished for file SRR772403 in 0.0259 second(s)

Cutadapt finished for file SRR772404 in 13.5832 second(s)
Collapsing finished for file SRR772404 in 0.3531 second(s)

Matrix creation finished in 0.1652 second(s)

Data pre-processing completed in 18.113 second(s)

Alignment in progress ...
Alignment completed in 20.1637 second(s)

Summarizing and tabulating results...
Summary completed in 1.9921 second(s)


The path to output directory: /mnt/d/Halushka_lab/Arun/datasets/output_dir/miRge.2022-07-07_13-59-51

The analysis completed in 43.278 second(s)

Resuming from collapsed reads and try out different miRge3.0 parameters

The sample execution previously run with -spl option can only be used to resume miRge3.0 with different parameters. The sample parameter -s takes the path to the previous output folder (specified earlier as -o). Include the -rr/--resume (re-run or resume) parameter to indicate that you want to re-run miRge3.0 with different parameters. An example usage is described below:

miRge3.0 -s /mnt/d/Halushka_lab/Arun/datasets/output_dir/miRge.2022-07-07_13-59-51 -lib miRge3_Lib -on human -db mirbase -o output_dir -rr -gff
bowtie version: 1.3.0
cutadapt version: 4.1
Samtools version: 1.3.1
Collecting and validating input files...

miRge3.0 will process 2 saved run(s) from binary pickle file.

Alignment in progress ...
Alignment completed in 19.9428 second(s)

Summarizing and tabulating results...
Summary completed in 7.6734 second(s)


The path to output directory: /mnt/d/Halushka_lab/Arun/datasets/output_dir/miRge.2022-07-07_14-12-03

The analysis completed in 30.6275 second(s)

Running samples with UMI

Qiagen - based UMI

Testing sample data run on UMI obtained from Qiagen platform. Important parameters are (-umi, --qiagenumi and -udd)

miRge3.0 -s SRR13077007.fastq -db miRBase -lib miRge3_Lib -on human -a AACTGTAGGCACCATCAAT --qiagenumi -umi 0,12 -o output_dir -cpu 10 -udd

Please note: As of July, 2021, the standard internal 3’ adapter was AACTGTAGGCACCATCAAT ligated to 12 nucleotide UMI sequence followed by external 3’ adapter sequence. If you have different internal adapter other than AACTGTAGGCACCATCAAT, then please provide that.

Example of reads, UMI and adapters for hsa-let-7a (sequence left to right in the order mentioned below with-in angular brackets):

<hsa-let-7a-5p: TGAGGTAGTAGGTTGTATAGTT><Internal 3’ adapter:AACTGTAGGCACCATCAAT><12 nt UMI><external 3’ adapter AGATCGGAAGAGCACACGTCT>

TGAGGTAGTAGGTTGTATAGTTAACTGTAGGCACCATCAATGTTAGACCTGCAAGATCGGAAGAGCACACGTCTG
TGAGGTAGTAGGTTGTATAGTTAACTGTAGGCACCATCAATCAATGACGATTTAGATCGGAAGAGCACACGTCTG
TGAGGTAGTAGGTTGTATAGTTAACTGTAGGCACCATCAATAAACAAAGATCCAGATCGGAAGAGCACACGTCTG
TGAGGTAGTAGGTTGTATAGTTAACTGTAGGCACCATCAATCGCATCGCCGACAGATCGGAAGAGCACACGTCTG
TGAGGTAGTAGGTTGTATAGTTAACTGTAGGCACCATCAATTTTGCCATTACTAGATCGGAAGAGCACACGTCTG

Illumina - based UMI/4N method

Testing sample data run on UMI/4N obtained from Illumina or similar platform. Important parameters are (-umi and -udd)

miRge3.0 -s SRR6379839.fastq -db miRBase -lib miRge3_Lib -on human -a illumina -umi 4,4 -o output_dir -cpu 10 -udd

<04 nt UMI><hsa-let-7a-5p: TGAGGTAGTAGGTTGTATAGTT><04 nt UMI><3’ adapter:TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACCGGAATATCTCG>

TACATGAGGTAGTAGGTTGTATAGTTCCTCTGGAATTCTCGGGTGCCAAGGAACTCCAGTCACCGGAATATCTCG
TACCTGAGGTAGTAGGTTGTATAGTTACTATGGAATTCTCGGGTGCCAAGGAACTCCAGTCACCGGAATATCTCG
CAGGTGAGGTAGTAGGTTGTATAGTTGGTATGGAATTCTCGGGTGCCAAGGAACTCCAGTCACCGGAATATCTCG
AGAATGAGGTAGTAGGTTGTATAGTTACTATGGAATTCTCGGGTGACAAGGAACTCCAGTCACCGGAATATCTCG
AGGTTGAGGTAGTAGGTTGTATAGTTACTATGGAATTCTCGGGTGCCAAGGAACTCCAGTCACCGGAATATCTCG

Performing differential expression analysis

Download example datasets from NCBI SRA (Note: Tutorial on how to download SRA files is below).
Prepare metadata information in CSV format as shown below. For this tutorial, download the file from here.

id,group
SRR8497647,Control
SRR8497648,Control
SRR8497649,Control
SRR8497650,Control
SRR8497651,treated
SRR8497652,treated
SRR8497653,treated
SRR8497654,treated

Execute the following command:

miRge3.0 -s SRR8497647.fastq,SRR8497648.fastq,SRR8497649.fastq,SRR8497650.fastq,SRR8497651.fastq,SRR8497652.fastq,SRR8497653.fastq,SRR8497654.fastq -lib miRge3_Lib -on human -db miRGeneDB -o differential_Exp -a TGGAATTCTCGG -cpu 12 -dex -mdt DESmetadata.csv

The result files for the above miRge3.0 run can be found at SourceForge

Tutorial on how to download SRA files:
This turorial is only brief introduction and doesn’t cover all the details of downloading NCBI SRA files. You could find YouTube tutorials on how to download SRA files.

Download and install NCBI SRA toolkit: You could refer to NCBI SRA Handbook or GitHub
Download command: One could use fasterq-dump -t temp -e 10 SRR8497647 or simply fastq-dump SRR8497647. The only difference being that the fasterq-dump is faster. Similarly, download all other Runs (i.e., SRR8497648, SRR8497649 etc.)

miRge3.0 GUI

The application is cross platform, the image below is a screenshot of the software from MacOS
The software is easy to use with default parameters. The parameters are tabulated into four groups such as basic, trimming parameters, novel miRNA prediction and other optional parameters.
Screenshot with basic parameters
Screenshot with trimming parameters
Screenshot with novel miRNA predictions
Screenshot with other optional parameters

Resources

Lu, Y., et al., miRge 2.0 for comprehensive analysis of microRNA sequencing data. 2018. BMC Bioinformatics. PMID.
Baras, S. A., et al., miRge - A Multiplexed Method of Processing Small RNA-Seq Data to Determine MicroRNA Entropy. 2015. PLoS One. PMID.