mrFAST
We moved to SourceForge. Update your bookmarks: http://mrfast.sourceforge.net for mrFAST
Our new copy number calling implementation (requires mrFAST or mrsFAST version > 2.0): http://mrcanavar.sourceforge.net
This page is kept for mrFAST version 1 only.
mrfast reports the anchoring position only; and takes FASTA files as input.
mrfast-fq reports the mismatch and indel
information, and takes FASTQ files as input; however it is
limited by edit distance 2.
Installation HOWTO
1 - Download the package.
2 - Compile the auxiliary stuff:
gcc fastabreak.c -o fastabreak
gcc fastafrac.c -o fastafrac
gcc runmrfast.c -o runmrfast
mpicc mrfastbatch.c -o mrfastbatch
NOTE: the MPI job distributor assumes that your SGE system creates a temp directory
it should be
something like ("/tmp/%d.1.all.q", job_id);
3 - Mask your genome the way you like it; and put it under a directory "mygenome".
4 - Create a start.idx file something like this. If you prefer 0-based coordinates, change the 1's to 0's.
5 - fractionate your genome:
mkdir fracgenome;
awk '{print "fastafrac -i " $1 " -s " $2 " -offset "
$3 " -outfolder fracgenome"}' start.idx | sh
6 - Concatenate the log files created under your "mygenome" directory;
cat mygenome/*log > mygenome.idx
And fix the path in the first column if needed & create a second
copy with the path name under /tmp or /var/tmp directories for
cluster use. It is highly recommended to copy the indices to the local
disks
of each node.
7 - Create your indices:
cd fracgenome;
for i in `ls`; do mrfast -f $i; done
NOTE: There's an optional step which may gain 10-15% speed up. If you
expect all your inputs will be a certain length (i.e. 36bp), create your
indices like:
for i in `ls`; do mrfast -f $i -l 36; done
Those indices will still work for queries of different length, but won't benefit the extra speed up.
8 - Move your indices to your final destination for them (mv *index /what/ever)
9 - Single CPU run:
mkdir myoutfolder;
mrfast -loadidx mygenome.idx -outfolder myoutfolder/ -search mysequences.fa -textout
if you want to search in a single chromosome, or a few of them, just
make a smaller "mygenome.idx" file with grep and such.
10 - Cluster run:
Break your fasta files to batches with fastabreak:
cat myfasta/* | fastabreak -n 1000000 -out mybrokenfastas/
this will generate batches of 1 million sequences each
Download the sample SGE script : here
Change the options depending on your cluster.
Change the email address line "#$ -M your@email.address"
Change the lines:
# Specify the location of the output
#$ -o /output/directory
#$ -e /output/directory
Change the lines:
/usr/bin/uniq $TMPDIR/machines | /mnt/local/bin/rgang --rcpto
8000 -c --rcp="/usr/bin/rcp -r" - my/home/build35_tmp.idx
/tmp/FASTDB
/usr/bin/uniq $TMPDIR/machines |
/mnt/local/bin/rgang --rcpto 8000 -c --rcp="/usr/bin/rcp
-r" - /my/home/BUILD35 /tmp/FASTDB/HUMAN
these will copy your .idx file and the indices directory to /tmp/FASTDB
and /tmp/FASTDB/HUMAN. Before running the script, create those
directories in all
the nodes:
cexec mkdir /tmp/FASTDB
cexec mkdir /tmp/FASTDB/HUMAN
Finally modify the last line; correct the paths, etc.
/software/mpich-1.2.7-amd64/bin/mpirun
-np $NSLOTS -machinefile $TMPDIR/machines /mpiprograms/mrfastbatch
"/bin/runmrfast -genome /tmp/PFASTDB/build35_tmp.idx"
/home/mybrokenfasta /output/directory/myout $JOB_ID
don't forget to create the outout directory "/output/directory/myout"
run it:
qsub mrfast-sge.sh
good luck and have fun debugging :-)
Output Format
mrFAST:
In order, tab-delimited:
Column 1: Read name
Column 2: Start position
Column 3: End position
Column 4: Edit distance
Column 5: Strand (F: + , R: -)
Column 6: Chromosome
mrfast-fq.
Here the things get messy.
First 6 columns are the same as above.
If edit distance (column 4) is equal to 0 then
Column7: FASTQ quality string (if mapped to R strand, this is reversed)
If you have editdistance (ed)>0:
First, the info for the first read comes in in the order of:
Column 7: error type 1
Column 8: error position 1
Column 9: error character 1
Column 10: error quality (FASTQ char) 1
If edit distance = 2 then this repeats for
the second error as errtype2, errpos2, errchr2, errqual2.
IF errtype = 'D' (deletion)
then there is no quality value of a nonexisting base; so to make the
parsing easier; we report 'X' as the phredChar. After the error info,
the full quality string comes in.
errtype='S' substitution (SNP)
errtype='I' insertion
errtype='D' deletion