mrFAST

We moved to SourceForge. Update your bookmarks: http://mrfast.sourceforge.net for mrFAST

Our new copy number calling implementation (requires mrFAST or mrsFAST version > 2.0): http://mrcanavar.sourceforge.net


This page is kept for mrFAST version 1 only.



 mrfast reports the anchoring position only; and takes FASTA files as input.
 mrfast-fq  reports the mismatch and indel information,  and takes FASTQ files as input; however it is limited by edit distance 2.


Installation HOWTO


1 - Download the package.

2 - Compile the auxiliary stuff:

        gcc fastabreak.c -o fastabreak

        gcc fastafrac.c -o fastafrac
   
        gcc runmrfast.c -o runmrfast

        mpicc mrfastbatch.c -o mrfastbatch

        NOTE: the MPI job distributor assumes that your SGE system creates a temp directory
                it should be something like ("/tmp/%d.1.all.q", job_id);  

3 - Mask your genome the way you like it; and put it under a directory "mygenome".

4 - Create a start.idx file something like this. If you prefer 0-based coordinates, change the 1's to 0's.

5 - fractionate your genome:

        mkdir fracgenome;

        awk '{print "fastafrac -i " $1 " -s " $2 " -offset " $3 " -outfolder fracgenome"}' start.idx | sh

6 - Concatenate the log files created under your "mygenome" directory;

    cat mygenome/*log > mygenome.idx

    And fix the path in the first column if needed & create a second
    copy with the path name under /tmp or /var/tmp directories  for cluster use. It is highly recommended to copy the indices to the local disks
    of each node.

7 - Create your indices:

    cd fracgenome;
    for i in `ls`; do mrfast -f $i; done

    NOTE: There's an optional step which may gain 10-15% speed up. If you expect all your inputs will be a certain length (i.e. 36bp), create your
    indices like:
       
        for i in `ls`; do mrfast -f $i -l 36; done

    Those indices will still work for queries of different length, but won't benefit the extra speed up.

8 - Move your indices to your final destination for them (mv *index /what/ever)

9 - Single CPU run:

    mkdir myoutfolder;
    mrfast -loadidx mygenome.idx -outfolder myoutfolder/ -search mysequences.fa -textout

   if you want to search in a single chromosome, or a few of them, just make a smaller "mygenome.idx" file with grep and such.

10 - Cluster run:


    Break your fasta files to batches with fastabreak:

    cat myfasta/* | fastabreak -n 1000000 -out mybrokenfastas/

    this will generate batches of 1 million sequences each

    Download the sample SGE script : here

    Change the options depending on your cluster.
    Change the email address line "#$ -M your@email.address"
   
    Change the lines:
    # Specify the location of the output
    #$ -o /output/directory
    #$ -e /output/directory
   
    Change the lines:

    /usr/bin/uniq $TMPDIR/machines | /mnt/local/bin/rgang --rcpto 8000  -c --rcp="/usr/bin/rcp -r"  - my/home/build35_tmp.idx /tmp/FASTDB
    /usr/bin/uniq $TMPDIR/machines | /mnt/local/bin/rgang --rcpto 8000  -c --rcp="/usr/bin/rcp -r"  - /my/home/BUILD35 /tmp/FASTDB/HUMAN

    these will copy your .idx file and the indices directory to /tmp/FASTDB and /tmp/FASTDB/HUMAN. Before running the script, create those directories in all
    the nodes:

    cexec mkdir /tmp/FASTDB
    cexec mkdir /tmp/FASTDB/HUMAN

    Finally modify the last line; correct the paths, etc.

/software/mpich-1.2.7-amd64/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines /mpiprograms/mrfastbatch "/bin/runmrfast -genome /tmp/PFASTDB/build35_tmp.idx" /home/mybrokenfasta /output/directory/myout $JOB_ID

    don't forget to create the outout directory "/output/directory/myout"


    run it:

    qsub mrfast-sge.sh


    good luck and have fun debugging :-)



   

Output Format

mrFAST: In order, tab-delimited: Column 1: Read name
Column 2: Start position
Column 3: End position
Column 4: Edit distance
Column 5: Strand (F: + , R: -)
Column 6: Chromosome


mrfast-fq.

Here the things get messy.

First 6 columns are the same as above.

If edit distance (column 4) is equal to 0 then
   Column7: FASTQ quality string (if mapped to R strand, this is reversed)
If you have editdistance (ed)>0:

First, the info for the first read comes in in the order of:

Column 7: error type 1
Column 8: error position 1
Column 9: error character 1
Column 10: error quality (FASTQ char) 1

If edit distance = 2 then this repeats for
the second error as errtype2, errpos2, errchr2, errqual2.

IF errtype = 'D' (deletion)
then there is no quality value of a nonexisting base; so to make the
parsing easier; we report 'X' as the phredChar. After the error info,
the full quality string comes in.
errtype='S' substitution (SNP)
errtype='I' insertion
errtype='D' deletion