User Guide for CloudBurst on Kbase

Michael Schatz, the author of CloudBurst, has made the code available on SourceForge and introduces CloudBurst as follows:

“Next-generation DNA sequencing machines are generating an enormous amount of sequence data, placing unprecedented demands on traditional single-processor read mapping algorithms. CloudBurst is a new parallel read-mapping algorithm optimized for mapping next-generation sequence data to the human genome and other reference genomes, for use in a variety of biological analyses including SNP discovery, genotyping, and personal genomics. It is modeled after the short read mapping program RMAP, and reports either all alignments or the unambiguous best alignment for each read with any number of mismatches or differences. This level of sensitivity could be prohibitively time consuming, but CloudBurst uses the open-source Hadoop implementation of MapReduce to parallelize execution using multiple compute nodes.”

Please read his paper for the internal details of the CloudBurst implementation, and cite it if CloudBurst is useful in your research.

We have successfully installed and tested the latest version of CloudBurst. Below is a modified version of the software’s README file with instructions specific to Kbase. You can follow the steps with your own data files or with the sample files included under /shared/apps/CloudBurst/cloudburst_sample_data.

1. Convert your reference and read data to binary format:

$ java -jar /shared/apps/cloudburst/ConvertFastaForCloud.jar ref.fa ref.br

$ java -jar /shared/apps/cloudburst/ConvertFastaForCloud.jar qry.fa qry.br

Keep track of the minimum read length in qry.fa, as this value will be needed in step 3.
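If you do not already know the minimum read length, it can be computed from the FASTA file. The following is a minimal sketch, not part of CloudBurst; the function name min_read_length is my own, and it assumes a standard FASTA layout in which header lines start with ">" and a sequence may be wrapped over several lines.

```shell
# Print the length of the shortest sequence in a FASTA file.
# Usage: min_read_length qry.fa
min_read_length() {
    awk '
        # Called whenever a record ends: keep the smallest length seen so far.
        function record_done() {
            if (seen && (!have || len < min)) { min = len; have = 1 }
        }
        /^>/ { record_done(); len = 0; seen = 1; next }   # header starts a new record
        { gsub(/[ \t\r]/, ""); len += length($0) }        # sequence line, whitespace ignored
        END { record_done(); if (have) print min }
    ' "$1"
}
```

The printed number is the value to use for the minimum read length when launching CloudBurst.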

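2. Copy the data files into HDFS (note that /path/to/data is a path within HDFS; ref.br and qry.br are the binary files produced by the conversion in step 1):

$ hadoop fs -put ref.br /path/to/data/ref.br

$ hadoop fs -put qry.br /path/to/data/qry.br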

3. Launch CloudBurst (the sample command assumes 36bp reads, 3 mismatches, 24 nodes, best-alignment only):

$ hadoop jar /shared/apps/cloudburst/CloudBurst.jar /path/to/data/ref.br /path/to/data/qry.br /path/to/results 36 36 3 0 1 240 48 24 24 128 16 >& cloudburst.err

cloudburst.err is a text file containing the status information for your run.
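The launch command takes a long run of positional values that are easy to misorder. The sketch below assigns them to named variables and only prints the resulting command (a dry run) so it can be inspected before use. The function and variable names are my own. The read length and mismatch labels follow the description in this guide; the labels marked "assumed" reflect my reading of the CloudBurst README and should be checked against it before you change those values.

```shell
# Dry run: print the CloudBurst launch command built from named values.
# Usage: cloudburst_cmd REF_PATH QRY_PATH OUT_PATH   (all paths within HDFS)
cloudburst_cmd() {
    ref=$1; qry=$2; out=$3
    min_len=36; max_len=36          # read length bounds (36bp reads)
    mismatches=3                    # allowed mismatches
    allow_diff=0                    # assumed: 0 = mismatches only, no indels
    filter=1                        # assumed: 1 = report best alignment only
    mappers=240; reducers=48        # assumed: map and reduce task counts
    fmappers=24; freducers=24       # assumed: task counts for the filtering pass
    block=128; redundancy=16        # assumed: tuning values, left at the sample settings
    echo hadoop jar /shared/apps/cloudburst/CloudBurst.jar \
        "$ref" "$qry" "$out" \
        "$min_len" "$max_len" "$mismatches" "$allow_diff" "$filter" \
        "$mappers" "$reducers" "$fmappers" "$freducers" "$block" "$redundancy"
}
```

Once the printed command looks right, it can be run directly with its output redirected to cloudburst.err as in the step above.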

4. Copy the results back to the regular filesystem and convert them to a tab-delimited file:

$ hadoop fs -get /path/to/results results

$ java -jar /shared/apps/cloudburst/PrintAlignments.jar results > results.txt
