User Guide for Crossbow on Kbase

This guide explains how to use Crossbow, a Hadoop-enabled software pipeline for whole-genome resequencing analysis, on Kbase, our private cloud infrastructure. Crossbow is a scalable, portable, and automatic cloud computing tool for finding SNPs in genomes from short read data. It employs modified versions of Bowtie and SOAPsnp to perform the short read alignment and SNP calling, respectively. According to the authors' benchmarks, the pipeline can accurately analyze over 35x coverage of a human genome in one day on a 10-node local cluster, or in 3 hours for about $100 using a 40-node, 320-core cluster rented from Amazon’s EC2 utility computing service.

Links to the Crossbow main site and reference manual:

http://bowtie-bio.sourceforge.net/crossbow/index.shtml

http://bowtie-bio.sourceforge.net/crossbow/manual.shtml

Please read the authors' paper for the internal details of the Crossbow implementation:

Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL. Searching for SNPs with cloud computing. Genome Biol 10:R134.

We are not the authors of this tool and are grateful to the authors for making it open source. We run it on our private Hadoop cloud, and some of these instructions have been compiled from the Crossbow manual and its internal source code documentation so that command-line users can run their own jobs.

Checking the environment

Check that the following commands are in your path by typing them at the command line. They should be part of your environment once you log in.

            hadoop

            bowtie-build

            crossbow.pl
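
The check above can also be scripted. The following is a minimal sketch, assuming a POSIX shell; the function name check_tools is hypothetical and is not part of Crossbow or Hadoop.

```shell
# Report which of the given commands can be found on PATH.
# Prints one line per command; returns non-zero if any is missing.
# Usage: check_tools hadoop bowtie-build crossbow.pl
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

If any line reports MISSING, the environment is not set up as this guide expects.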

Data gathering and curation

Once this is confirmed, assemble the data to be analyzed. In addition to the next-generation sequencing reads (FASTQ files, for example from the Short Read Archive), Crossbow uses three kinds of reference files, the third of which is optional:

  1. The reference sequences as FASTA files, one file per sequence.
  2. A Bowtie index of the reference sequences, one index for all sequences combined.
  3. (Optional): Files describing known SNPs and allele frequencies for each reference sequence, one file per sequence.

The FASTA files in the sequences subdirectory must each be named chrX.fa, where X is the 0-based numeric id of the chromosome or sequence in the file. For example, for a human reference, chromosome 1’s FASTA file could be named chr0.fa, chromosome 2 named chr1.fa, etc, all the way up to chromosomes 22, X and Y, named chr21.fa, chr22.fa and chr23.fa. Also, the names of the sequences within the FASTA files must match the number in the file name, i.e., the first line of the FASTA file chr0.fa must be >0.
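
The naming rule above is easy to get wrong, so a quick check before building the index may help. This is a sketch only: the function name check_fasta_headers is hypothetical, and it checks just the first line of each file, as described above.

```shell
# For every chr<N>.fa in a directory, check that the first line is ">N".
# Prints one line per mismatch; returns non-zero if any file fails.
# Usage: check_fasta_headers sequences
check_fasta_headers() {
    dir=$1
    status=0
    for fa in "$dir"/chr*.fa; do
        [ -e "$fa" ] || continue        # skip the literal pattern if nothing matches
        base=$(basename "$fa" .fa)      # e.g. chr0
        id=${base#chr}                  # e.g. 0
        header=$(head -n 1 "$fa")
        if [ "$header" != ">$id" ]; then
            echo "$fa: expected header >$id, found $header"
            status=1
        fi
    done
    return "$status"
}
```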

The index files in the index subdirectory must have the basename index, i.e., the index subdirectory must contain these files:

index.1.ebwt

index.2.ebwt

index.3.ebwt

index.4.ebwt

index.rev.1.ebwt

index.rev.2.ebwt

The index must be built using the bowtie-build tool distributed with Bowtie. When bowtie-build is executed, the FASTA files specified on the command line must be listed in ascending order of numeric id. For instance, for a set of FASTA files encoding human chromosomes 1,2,…,22,X,Y as chr0.fa,chr1.fa,…,chr21.fa, chr22.fa,chr23.fa, the command for bowtie-build must list the FASTA files in that order:

bowtie-build chr0.fa,chr1.fa,…,chr23.fa index
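
Typing the list by hand is error-prone, and a plain shell glob sorts chr10.fa before chr2.fa. The sketch below builds the comma-separated list in ascending numeric order; the function name fasta_list is hypothetical, and it assumes every file follows the chr<N>.fa naming described above.

```shell
# Print the chr<N>.fa file names in a directory as one comma-separated
# line, sorted by numeric id rather than alphabetically.
# Usage (from inside the FASTA directory): bowtie-build "$(fasta_list .)" index
fasta_list() {
    for fa in "$1"/chr*.fa; do
        [ -e "$fa" ] || continue        # skip the literal pattern if nothing matches
        basename "$fa" .fa              # chr0, chr1, chr10, ...
    done |
        sed 's/^chr//' |                # keep only the numeric id
        sort -n |                       # ascending numeric order
        sed 's/.*/chr&.fa/' |           # restore the file name
        paste -sd, -                    # join the lines with commas
}
```

Because the function prints bare file names, run bowtie-build from the directory that holds the FASTA files.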

The SNP description files in the snps subdirectory must also have names that match the corresponding FASTA files in the sequences subdirectory, but with extension .snps. E.g. if the sequence file for human Chromosome 1 is named chr0.fa, then the SNP description file for Chromosome 1 must be named chr0.snps. SNP description files may be omitted for some or all chromosomes.

The format of the SNP description files must match the format expected by SOAPsnp’s -s option. The format consists of 1 SNP per line, with the following tab-separated fields per SNP:

  1. Chromosome ID
  2. 1-based offset into chromosome
  3. Whether SNP has allele frequency information (1 = yes, 0 = no)
  4. Whether SNP is validated by experiment (1 = yes, 0 = no)
  5. Whether SNP is actually an indel (1 = yes, 0 = no)
  6. Frequency of A allele, as a decimal number
  7. Frequency of C allele, as a decimal number
  8. Frequency of G allele, as a decimal number
  9. Frequency of T allele, as a decimal number
  10. SNP id (e.g. a dbSNP id such as rs9976767)
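
A basic sanity check of a SNP description file against the field list above might look like the following. It is a sketch under stated assumptions: the function name check_snps is hypothetical, and it checks only the field count, the offset, and the three 0/1 flags, not the allele frequencies or the SNP id.

```shell
# Check a SNP description file: 10 tab-separated fields per line,
# a 1-based integer offset in field 2, and 0/1 flags in fields 3-5.
# Prints one message per problem; returns non-zero if any line fails.
# Usage: check_snps chr0.snps
check_snps() {
    awk -F '\t' '
        NF != 10 {
            print "line " NR ": expected 10 fields, found " NF
            bad = 1
            next
        }
        ($2 !~ /^[0-9]+$/) || ($2 < 1) {
            print "line " NR ": offset must be a 1-based integer"
            bad = 1
        }
        ($3 !~ /^[01]$/) || ($4 !~ /^[01]$/) || ($5 !~ /^[01]$/) {
            print "line " NR ": fields 3-5 must be 0 or 1"
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```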

Execution

 1. Upload the reference data to HDFS, into the index directory.

            $ hadoop fs -mkdir <subdir>/index

             example   $ hadoop fs -mkdir human/index

            If your files are in a local directory named index then execute the following

            $ hadoop fs -put index/*  human/index

            This directory should contain: chr*.fa, chr*.snps, index.*.ebwt files.

 2. Upload the reads. These are the reads (.fastq files) from the Short Read Archive (SRA) or elsewhere.

            $ hadoop fs -mkdir <subdir>/reads

            example    $ hadoop fs -mkdir human/reads

            If your reads are in a local directory named reads then execute the following

            $ hadoop fs -put reads/* human/reads

 3. Check that the files were uploaded properly using the hadoop fs -ls command. This also lists the complete path to each file, which is needed to create the manifest file.
 4. Create the read manifest file

            It contains the full path (with hdfs://) to each read file, with mated files on the same line:

            hdfs://kbase-ib:8020/user/<yourusername>/human/reads/sim_paired_1_1_1.fq.gz 0 hdfs://kbase-ib:8020/user/<yourusername>/human/reads/sim_paired_1_1_2.fq.gz 0

            As an example of steps 3 and 4:

            [akd@kbase ~]$ hadoop fs -ls crossbow/reads

            Found 1 items

            -rw-rw-rw-   3 akd akd  148098692 2010-06-25 15:28 /user/akd/crossbow/reads/SRR036930.fastq

            [akd@kbase ~]$ cat crossbow_trial_data/manifest

            hdfs://kbase-ib:8020/user/akd/crossbow/reads/SRR036930.fastq 0

 5. Run Crossbow

            Preprocess reads

            $ crossbow.pl -pre -readlist /local/path/to/manifest human

            Run Bowtie

            $ crossbow.pl -bowtie human

            Run SOAPsnp

            $ crossbow.pl -snps -fetchsnps human

            After your run, Crossbow will create a human.snps file on the local filesystem. See the Crossbow manual for a description of the file format.
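
The manifest and run steps above can be scripted along the following lines. This is a sketch, not part of Crossbow: the function names and the CROSSBOW variable are hypothetical. It assumes that manifest fields are separated by tabs and that the 0 after each path is a placeholder that can be written as-is, as in the examples above; check both against the Crossbow manual. The crossbow.pl options are the ones shown in this guide.

```shell
# Print one manifest line per unpaired read file: <url><TAB>0
# Usage: manifest_unpaired hdfs://kbase-ib:8020/user/akd/crossbow/reads SRR036930.fastq > manifest
manifest_unpaired() {
    prefix=$1; shift
    for f in "$@"; do
        printf '%s/%s\t0\n' "$prefix" "$f"
    done
}

# Print one manifest line per mate pair: <url-1><TAB>0<TAB><url-2><TAB>0
# Arguments after the prefix are read two at a time (mate 1, mate 2).
manifest_paired() {
    prefix=$1; shift
    while [ "$#" -ge 2 ]; do
        printf '%s/%s\t0\t%s/%s\t0\n' "$prefix" "$1" "$prefix" "$2"
        shift 2
    done
}

# Run the three Crossbow stages in order, stopping at the first failure.
# CROSSBOW is a hypothetical override for the program name, useful for
# testing; it defaults to crossbow.pl.
# Usage: run_crossbow /local/path/to/manifest human
run_crossbow() {
    manifest=$1
    name=$2
    cb=${CROSSBOW:-crossbow.pl}
    "$cb" -pre -readlist "$manifest" "$name" &&
    "$cb" -bowtie "$name" &&
    "$cb" -snps -fetchsnps "$name"
}
```

Chaining the stages with && means that a failed preprocessing or alignment stage does not go on to SNP calling with incomplete input.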
