This guide describes how to use Crossbow, a Hadoop-enabled software pipeline for whole-genome resequencing analysis, on Kbase, our private cloud infrastructure. Crossbow is a scalable, portable, and automatic cloud computing tool for finding SNPs in genomes from short-read data. It uses modified versions of Bowtie and SOAPsnp to perform the short-read alignment and the SNP calling, respectively. According to the authors' benchmarks, the pipeline can accurately analyze over 35x coverage of a human genome in one day on a 10-node local cluster, or in 3 hours for about $100 using a 40-node, 320-core cluster rented from Amazon’s EC2 utility computing service.
The Crossbow main site and reference manual are available here:
http://bowtie-bio.sourceforge.net/crossbow/index.shtml
http://bowtie-bio.sourceforge.net/crossbow/manual.shtml
For the internal details of the Crossbow implementation, please read the authors' paper:
Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL. Searching for SNPs with cloud computing. Genome Biol 10:R134.
We are not the authors of this tool and are grateful to them for making it open source. We run it on our private Hadoop cloud; some of the instructions below were compiled from the Crossbow manual and from the documentation in its source code, for command-line users who want to run their own jobs.
Checking the environment
Check that the following commands are in your path by typing them at the command line. They should be part of your environment once you log in.
hadoop
bowtie-build
crossbow.pl
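The check above can also be scripted. The following is a minimal sketch, assuming a POSIX shell; the function name check_tools is made up for this guide and is not part of Crossbow or Hadoop.

```shell
# Report which of the given commands can be found on PATH.
# Prints one line per command and returns non-zero if any is missing.
# check_tools is a hypothetical helper name, not a Crossbow command.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Usage after logging in:
#   check_tools hadoop bowtie-build crossbow.pl
```

If any line reports MISSING, the environment is not set up as this guide expects.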
Data gathering and curation
Once this is confirmed, assemble the data to be analyzed. In addition to the next-generation sequencing reads (FASTQ files, for example from the Short Read Archive), Crossbow requires three kinds of files:
- The reference sequences as FASTA files, one file per sequence.
- A Bowtie index of the reference sequences, one index for all sequences combined.
- (Optional): Files describing known SNPs and allele frequencies for each reference sequence, one file per sequence.
The FASTA files in the sequences subdirectory must each be named chrX.fa, where X is the 0-based numeric id of the chromosome or sequence in the file. For example, for a human reference, chromosome 1’s FASTA file could be named chr0.fa, chromosome 2 named chr1.fa, etc, all the way up to chromosomes 22, X and Y, named chr21.fa, chr22.fa and chr23.fa. Also, the names of the sequences within the FASTA files must match the number in the file name, i.e., the first line of the FASTA file chr0.fa must be >0.
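The naming rule above can be checked before building the index. This is a small sketch, assuming a POSIX shell and that the first line of each file must be exactly >N with nothing after the number, as described above; check_fasta_header is a hypothetical helper name.

```shell
# Check that a FASTA file is named chrN.fa and that its first line is ">N".
# check_fasta_header is a hypothetical helper, not part of Crossbow.
check_fasta_header() {
    file=$1
    base=$(basename "$file" .fa)   # e.g. chr0
    id=${base#chr}                 # e.g. 0
    # Reject anything that is not purely numeric after the chr prefix.
    case "$id" in
        ''|*[!0-9]*) id='' ;;
    esac
    if [ -z "$id" ] || [ "chr$id" != "$base" ]; then
        echo "$file: name is not chr<number>.fa"
        return 1
    fi
    header=$(head -n 1 "$file")
    if [ "$header" = ">$id" ]; then
        echo "$file: ok"
    else
        echo "$file: first line is '$header', expected '>$id'"
        return 1
    fi
}

# Usage:
#   for f in chr*.fa; do check_fasta_header "$f"; done
```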
The index files in the index subdirectory must have the basename index, i.e., the index subdirectory must contain these files:
index.1.ebwt
index.2.ebwt
index.3.ebwt
index.4.ebwt
index.rev.1.ebwt
index.rev.2.ebwt
The index must be built using the bowtie-build tool distributed with Bowtie. When bowtie-build is executed, the FASTA files specified on the command line must be listed in ascending order of numeric id. For instance, for a set of FASTA files encoding human chromosomes 1,2,…,22,X,Y as chr0.fa,chr1.fa,…,chr21.fa, chr22.fa,chr23.fa, the command for bowtie-build must list the FASTA files in that order:
bowtie-build chr0.fa,chr1.fa,…,chr23.fa index
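Typing the full list by hand is error-prone, and a shell glob such as chr*.fa sorts lexicographically, so chr10.fa would come before chr2.fa. One way to get the numeric order is to generate the list with a counter. The sketch below assumes a POSIX shell and that the files are numbered without gaps starting at 0; fasta_list is a made-up helper name.

```shell
# Print chr0.fa,chr1.fa,...,chr<count-1>.fa as one comma-separated list,
# in ascending numeric order. fasta_list is a hypothetical helper name.
fasta_list() {
    count=$1
    i=0
    list=""
    while [ "$i" -lt "$count" ]; do
        # Add a comma only when the list already has an entry.
        list="${list}${list:+,}chr${i}.fa"
        i=$((i + 1))
    done
    echo "$list"
}

# Usage for a reference with 24 sequences (chr0.fa to chr23.fa):
#   bowtie-build "$(fasta_list 24)" index
```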
The SNP description files in the snps subdirectory must also have names that match the corresponding FASTA files in the sequences subdirectory, but with extension .snps. E.g. if the sequence file for human Chromosome 1 is named chr0.fa, then the SNP description file for Chromosome 1 must be named chr0.snps. SNP description files may be omitted for some or all chromosomes.
The format of the SNP description files must match the format expected by SOAPsnp’s -s option. The format consists of one SNP per line, with the following tab-separated fields per SNP:
- Chromosome ID
- 1-based offset into chromosome
- Whether SNP has allele frequency information (1 = yes, 0 = no)
- Whether SNP is validated by experiment (1 = yes, 0 = no)
- Whether SNP is actually an indel (1 = yes, 0 = no)
- Frequency of A allele, as a decimal number
- Frequency of C allele, as a decimal number
- Frequency of G allele, as a decimal number
- Frequency of T allele, as a decimal number
- SNP id (e.g. a dbSNP id such as rs9976767)
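A quick sanity check of a .snps file against the field list above can be done with awk. This is only a sketch of the ten-field layout described here, not SOAPsnp's own parser: it assumes plain decimal notation for the frequencies (no exponents), and check_snps_file is a hypothetical helper name.

```shell
# Validate the field layout of a SNP description file.
# Prints one message per problem and returns non-zero if any was found.
# check_snps_file is a hypothetical helper, not part of SOAPsnp or Crossbow.
check_snps_file() {
    awk -F '\t' '
        function bad(msg) { printf "line %d: %s\n", NR, msg; errors++ }
        # Every line needs exactly 10 tab-separated fields.
        NF != 10 { bad("expected 10 tab-separated fields, found " NF); next }
        # Field 2: 1-based offset, so a positive integer.
        $2 !~ /^[1-9][0-9]*$/ { bad("offset must be a positive integer") }
        # Fields 3-5: yes/no flags.
        $3 !~ /^[01]$/ || $4 !~ /^[01]$/ || $5 !~ /^[01]$/ {
            bad("fields 3-5 must each be 0 or 1")
        }
        # Fields 6-9: A, C, G, T allele frequencies.
        {
            for (i = 6; i <= 9; i++)
                if ($i !~ /^[0-9]*\.?[0-9]+$/)
                    bad("field " i " is not a decimal number")
        }
        END { exit (errors > 0) }
    ' "$1"
}

# Usage:
#   check_snps_file chr0.snps
```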
Execution
1. Upload the reference data to HDFS, into an index directory.
$ hadoop fs -mkdir <subdir>/index
Example: $ hadoop fs -mkdir human/index
If your files are in a local directory named index, then execute the following:
$ hadoop fs -put index/* human/index
After the upload, this directory should contain the chr*.fa, chr*.snps, and index.*.ebwt files.
2. Upload the reads. These are the reads (.fastq files) from the Short Read Archive (SRA) or elsewhere.
$ hadoop fs -mkdir <subdir>/reads
Example: $ hadoop fs -mkdir human/reads
If your reads are in a local directory named reads, then execute the following:
$ hadoop fs -put reads/* human/reads
3. Check that the files were uploaded properly using the hadoop fs -ls command. It also lists the complete path to each file, which is needed to create the manifest file.
4. Create the read manifest file
The manifest contains the full URL (including hdfs://) of each read file, with mated files on the same line:
hdfs://kbase-ib:8020/user/<yourusername>/human/reads/sim_paired_1_1_1.fq.gz 0 hdfs://kbase-ib:8020/user/<yourusername>/human/reads/sim_paired_1_1_2.fq.gz 0
As an example of checking the upload and of the resulting manifest:
[akd@kbase ~]$ hadoop fs -ls crossbow/reads
Found 1 items
-rw-rw-rw- 3 akd akd 148098692 2010-06-25 15:28 /user/akd/crossbow/reads/SRR036930.fastq
[akd@kbase ~]$ cat crossbow_trial_data/manifest
hdfs://kbase-ib:8020/user/akd/crossbow/reads/SRR036930.fastq 0
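For more than a handful of read files, the manifest lines can be generated rather than typed. The sketch below assumes a POSIX shell and follows the layout of the examples above, with a 0 after each URL. It separates fields with a tab, which is an assumption on our part; check the Crossbow manual if your manifest is rejected. The function names are made up for this guide.

```shell
# Print one manifest line per unpaired read file.
# $1 is the HDFS prefix (e.g. hdfs://kbase-ib:8020); the remaining
# arguments are absolute HDFS paths as printed by "hadoop fs -ls".
# Both function names are hypothetical helpers, not Crossbow commands.
manifest_unpaired() {
    prefix=$1
    shift
    for path in "$@"; do
        printf '%s%s\t0\n' "$prefix" "$path"
    done
}

# Print one manifest line for a pair of mated read files.
# $1 is the HDFS prefix, $2 and $3 are the paths of mate 1 and mate 2.
manifest_paired() {
    printf '%s%s\t0\t%s%s\t0\n' "$1" "$2" "$1" "$3"
}

# Usage:
#   manifest_unpaired hdfs://kbase-ib:8020 \
#       /user/akd/crossbow/reads/SRR036930.fastq > manifest
```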
5. Run Crossbow
Preprocess reads
$ crossbow.pl -pre -readlist /local/path/to/manifest human
Run Bowtie
$ crossbow.pl -bowtie human
Run SOAPsnp
$ crossbow.pl -snps -fetchsnps human
After the run, Crossbow creates a human.snps file on the local filesystem. See the Crossbow manual for a description of the file format.
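The three stages above can be chained so that a failed stage stops the run. This is a minimal sketch, assuming a POSIX shell and using only the crossbow.pl options shown in this guide; run_stage, run_crossbow, and the DRY_RUN variable are names invented here. With DRY_RUN=1 the commands are printed instead of executed, which is useful for checking them first.

```shell
# Run one command, or just print it when DRY_RUN=1.
# run_stage, run_crossbow and DRY_RUN are hypothetical names for this guide.
run_stage() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run preprocessing, Bowtie and SOAPsnp in order, stopping at the first failure.
# $1 is the local path to the manifest, $2 is the name used in the
# commands above ("human" in this guide).
run_crossbow() {
    manifest=$1
    job=$2
    run_stage crossbow.pl -pre -readlist "$manifest" "$job" &&
    run_stage crossbow.pl -bowtie "$job" &&
    run_stage crossbow.pl -snps -fetchsnps "$job"
}

# Usage:
#   DRY_RUN=1; run_crossbow /local/path/to/manifest human   # print only
#   DRY_RUN=0; run_crossbow /local/path/to/manifest human   # run for real
```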