Conversion from Hadoop to OpenStack – An Experiment with a Scientific Cloud

Building an efficient cloud infrastructure to provision scientific applications takes a lot of forethought. There are many dimensions along which the cloud architecture can be laid out: scalability, bandwidth, I/O performance, CPU utilization, storage type and architecture, usability, the provisioning interface, security, and so on are some of the basic dimensions that matter in building a successful cloud. As crucial as these measures are, the type of hardware and the quality of the interconnect can have just as severe an effect on the overall health of the cloud.

Over the next few weeks and months we’ll dive into the available technologies to explore the best possible combination of cloud infrastructure for scientific applications. We’ll discuss the various aspects of deploying a cloud infrastructure and blog about the whole process as we go.

Ride on.

Posted in Uncategorized | Leave a comment

Some Images of the Kandinsky Cluster

Here are some images I took of the cluster, not long after it was delivered and installed.

Posted in Hardware | Leave a comment

User Guide for CloudBurst on Kbase

Michael Schatz, the author of CloudBurst, has made his code available on its SourceForge site and gives a brief introduction to CloudBurst, which goes as follows:

“Next-generation DNA sequencing machines are generating an enormous amount of sequence data, placing unprecedented demands on traditional single-processor read mapping algorithms. CloudBurst is a new parallel read-mapping algorithm optimized for mapping next-generation sequence data to the human genome and other reference genomes, for use in a variety of biological analyses including SNP discovery, genotyping, and personal genomics. It is modeled after the short read mapping program RMAP, and reports either all alignments or the unambiguous best alignment for each read with any number of mismatches or differences. This level of sensitivity could be prohibitively time consuming, but CloudBurst uses the open-source Hadoop implementation of MapReduce to parallelize execution using multiple compute nodes.”

Please read his paper for the internal details of the CloudBurst implementation, and cite it if it is useful in your research.

We have successfully installed and tested the latest version of CloudBurst. Presented below is a modified version of the software’s README file with specific instructions pertaining to Kbase. You can use your own data files for the instructions below or use the sample files included under /shared/apps/CloudBurst/cloudburst_sample_data.

1. Convert your reference and read data to binary format:

$ java -jar /shared/apps/cloudburst/ConvertFastaForCloud.jar ref.fa ref.br

$ java -jar /shared/apps/cloudburst/ConvertFastaForCloud.jar qry.fa qry.br

Keep track of the minimum read length in qry.fa, as this value will be needed for step 3.
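If you are not sure of the minimum read length, one quick way to estimate it is with awk, assuming each read in qry.fa occupies a single line (typical for unwrapped short-read FASTA):

$ awk '!/^>/ { if (min == "" || length($0) < min) min = length($0) } END { print min }' qry.fa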

2. Copy the data files into HDFS (note: /path/to/data is the path within HDFS):

$ hadoop fs -put ref.br /path/to/data

$ hadoop fs -put qry.br /path/to/data

3. Launch CloudBurst (assumptions for the sample-data command: 36 bp reads, 3 mismatches, 24 nodes, best alignment only):

$ hadoop jar /shared/apps/cloudburst/CloudBurst.jar /path/to/data/ref.br /path/to/data/qry.br /path/to/results 36 36 3 0 1 240 48 24 24 128 16 >& cloudburst.err

The file cloudburst.err is a text file containing the status information for your run.
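While the job is running, you can watch its progress with, for example:

$ tail -f cloudburst.err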

4. Copy the results back to the regular filesystem and convert them to a tab-delimited file:

$ hadoop fs -get /path/to/results results

$ java -jar /shared/apps/cloudburst/PrintAlignments.jar results > results.txt
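Putting these steps together, a minimal end-to-end run using the bundled sample data might look like the sketch below. The file names ref.fa and qry.fa and the HDFS directory cloudburst_data are assumptions for illustration; substitute the actual sample file names (or your own data) and adjust the numeric arguments to your read length and cluster size as described in step 3.

$ mkdir ~/cloudburst_run && cd ~/cloudburst_run
$ cp /shared/apps/CloudBurst/cloudburst_sample_data/ref.fa /shared/apps/CloudBurst/cloudburst_sample_data/qry.fa .
$ java -jar /shared/apps/cloudburst/ConvertFastaForCloud.jar ref.fa ref.br
$ java -jar /shared/apps/cloudburst/ConvertFastaForCloud.jar qry.fa qry.br
$ hadoop fs -mkdir cloudburst_data
$ hadoop fs -put ref.br qry.br cloudburst_data
$ hadoop jar /shared/apps/cloudburst/CloudBurst.jar cloudburst_data/ref.br cloudburst_data/qry.br cloudburst_results 36 36 3 0 1 240 48 24 24 128 16 >& cloudburst.err
$ hadoop fs -get cloudburst_results results
$ java -jar /shared/apps/cloudburst/PrintAlignments.jar results > results.txt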

Posted in User Guide | Leave a comment

User Guide for Crossbow on Kbase

This guide gives you directions for using Crossbow, a Hadoop-enabled software pipeline for whole-genome resequencing analysis, on Kbase, our private cloud infrastructure. Crossbow is a scalable, portable, and automatic cloud-computing tool for finding SNPs in genomes from short-read data. Crossbow employs modified versions of Bowtie and SOAPsnp to perform the short-read alignment and SNP calling, respectively. According to the authors’ benchmarks, the pipeline can accurately analyze over 35x coverage of a human genome in one day on a 10-node local cluster, or in 3 hours for about $100 using a 40-node, 320-core cluster rented from Amazon’s EC2 utility computing service.

Here are links to their main site and reference manual:

http://bowtie-bio.sourceforge.net/crossbow/index.shtml

http://bowtie-bio.sourceforge.net/crossbow/manual.shtml

Please read their paper for the internal details of the Crossbow implementation:

Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL. Searching for SNPs with cloud computing. Genome Biology 2009, 10:R134.

We are not the authors of this tool and are grateful to the authors for making it open source. We have it running on our private Hadoop cloud, and some of these instructions have been compiled from their manual and internal source-code documentation to help command-line users run their jobs.

Checking the environment

Check that the following commands are in your path by typing them at the command line. They should be part of your environment once you log in.

            hadoop

            bowtie-build

            crossbow.pl
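A quick way to confirm all three at once is with the standard which utility, which prints the full path of each command it finds (any command that is missing will produce an error instead):

            $ which hadoop bowtie-build crossbow.pl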

Data gathering and curation

Once this is confirmed, you need to assemble the data to be analyzed. Crossbow requires three kinds of reference files plus the next-generation sequencing reads (FASTQ files) from the Short Read Archive.

  1. The reference sequences as FASTA files, one file per sequence.
  2. A Bowtie index of the reference sequences, one index for all sequences combined.
  3. (Optional): Files describing known SNPs and allele frequencies for each reference sequence, one file per sequence.

The FASTA files in the sequences subdirectory must each be named chrX.fa, where X is the 0-based numeric id of the chromosome or sequence in the file. For example, for a human reference, chromosome 1’s FASTA file could be named chr0.fa, chromosome 2 named chr1.fa, etc, all the way up to chromosomes 22, X and Y, named chr21.fa, chr22.fa and chr23.fa. Also, the names of the sequences within the FASTA files must match the number in the file name, i.e., the first line of the FASTA file chr0.fa must be >0.
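A quick sanity check on this naming convention, assuming the FASTA files are in your current directory, is to print the header line of each file and confirm it matches the number in the file name:

$ for f in chr*.fa; do echo "$f: $(head -1 "$f")"; done

Each line of output should look like chr0.fa: >0, chr1.fa: >1, and so on.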

The index files in the index subdirectory must have the basename index, i.e., the index subdirectory must contain these files:

index.1.ebwt

index.2.ebwt

index.3.ebwt

index.4.ebwt

index.rev.1.ebwt

index.rev.2.ebwt

The index must be built using the bowtie-build tool distributed with Bowtie. When bowtie-build is executed, the FASTA files specified on the command line must be listed in ascending order of numeric id. For instance, for a set of FASTA files encoding human chromosomes 1,2,…,22,X,Y as chr0.fa,chr1.fa,…,chr21.fa, chr22.fa,chr23.fa, the command for bowtie-build must list the FASTA files in that order:

bowtie-build chr0.fa,chr1.fa,…,chr23.fa index
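If typing the full comma-separated list by hand is error-prone, a small bash sketch like the following builds it in the required ascending order (this assumes exactly the 24 human sequence files chr0.fa through chr23.fa described above):

$ files=$(for i in $(seq 0 23); do printf "chr%d.fa," "$i"; done)
$ bowtie-build "${files%,}" index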

The SNP description files in the snps subdirectory must also have names that match the corresponding FASTA files in the sequences subdirectory, but with extension .snps. E.g. if the sequence file for human Chromosome 1 is named chr0.fa, then the SNP description file for Chromosome 1 must be named chr0.snps. SNP description files may be omitted for some or all chromosomes.

The format of the SNP description files must match the format expected by SOAPsnp's -s option. The format consists of 1 SNP per line, with the following tab-separated fields per SNP:

  1. Chromosome ID
  2. 1-based offset into chromosome
  3. Whether SNP has allele frequency information (1 = yes, 0 = no)
  4. Whether SNP is validated by experiment (1 = yes, 0 = no)
  5. Whether SNP is actually an indel (1 = yes, 0 = no)
  6. Frequency of A allele, as a decimal number
  7. Frequency of C allele, as a decimal number
  8. Frequency of G allele, as a decimal number
  9. Frequency of T allele, as a decimal number
  10. SNP id (e.g. a dbSNP id such as rs9976767)
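For illustration only, a line describing the dbSNP entry rs9976767 mentioned above might look like the following; the chromosome ID, position, and allele frequencies here are made up, and in the real file the fields must be separated by tabs:

0    1234567    1    1    0    0.75    0.00    0.25    0.00    rs9976767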

Execution

 1. Upload the reference data into an index directory in HDFS.

            $ hadoop fs -mkdir <subdir>/index

            For example:   $ hadoop fs -mkdir human/index

            If your files are in a local directory named index, then execute the following:

            $ hadoop fs -put index/*  human/index

            This directory should contain the chr*.fa, chr*.snps, and index.*.ebwt files.

 2. Upload the reads. These are the reads (.fastq files) from the Short Read Archive (SRA) or elsewhere.

            $ hadoop fs -mkdir <subdir>/reads

            For example:   $ hadoop fs -mkdir human/reads

            If your reads are in a local directory named reads, then execute the following:

            $ hadoop fs -put reads/* human/reads

  3. Check that the files are uploaded properly using the hadoop fs -ls command. This also lists the complete path to each file, which is needed for creating the manifest file.
  4. Create the read manifest file.

            The manifest file contains the full path (with the hdfs:// prefix) to each read file, with mated files on the same line:

            hdfs://kbase-ib:8020/user/<yourusername>/human/reads/sim_paired_1_1_1.fq.gz 0 hdfs://kbase-ib:8020/user/<yourusername>/human/reads/sim_paired_1_1_2.fq.gz 0

                        As an example of steps 3 and 4:

            [akd@kbase ~]$ hadoop fs -ls crossbow/reads

            Found 1 items

            -rw-rw-rw-   3 akd akd  148098692 2010-06-25 15:28 /user/akd/crossbow/reads/SRR036930.fastq

            [akd@kbase ~]$ cat crossbow_trial_data/manifest

            hdfs://kbase-ib:8020/user/akd/crossbow/reads/SRR036930.fastq 0

  5. Run Crossbow

                        Preprocess reads

                     $ crossbow.pl -pre -readlist /local/path/to/manifest human

             Run Bowtie

                                    $ crossbow.pl -bowtie human

             Run SOAPsnp

                                    $ crossbow.pl -snps -fetchsnps human

             After your run, Crossbow will create a human.snps file on the local filesystem. See the Crossbow manual for a description of the file format.

Posted in User Guide | Leave a comment

Accessing the Hadoop filesystem (HDFS)

Hadoop has its own distributed file system called HDFS, which can be accessed with the hadoop utility. The command to access HDFS is the file system user client command "fs".

Type "hadoop fs" on the command line to see the generic options and commands supported by this utility. Here are a few steps to upload a file, run some MapReduce code on it, and download the results from HDFS.

Type hadoop fs -ls to get a listing of your default directory on HDFS. It should be /user/<username>.

Create an input directory in your default HDFS directory by using:

"hadoop fs -mkdir grep_input"

Upload a file to the input directory. I selected the MapReduce HTML tutorial file, which is included in the Hadoop distribution; you can select anything you want from your local file system.

hadoop fs -put /opt/hadoop/docs/mapred_tutorial.html grep_input

Check with:

hadoop fs -ls grep_input
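If you want to try a simple MapReduce job on this input right away, the grep example that ships with Hadoop works well. The exact name and location of the examples jar vary between Hadoop versions, so treat the path below as an assumption and adjust it to whatever is under /opt/hadoop:

hadoop jar /opt/hadoop/hadoop-*-examples.jar grep grep_input grep_output 'MapReduce'

The results land in grep_output on HDFS; you can inspect them with hadoop fs -cat grep_output/part-* or fetch them with the get command described next.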

You can download a file from HDFS onto your local system by using the get command. Assuming you want to download it to the current directory:

hadoop fs -get grep_input/mapred_tutorial.html .

Check with:

ls -lt

In the next posting we will discuss how to run the standard MapReduce programs distributed with the Hadoop installation, and also discuss job management and tracking.

Posted in User Guide | Leave a comment

Logging into the system

The preferred way to access the Systems Biology Knowledgebase computer (sbkbase.ornl.gov), which is a Linux server, is with a Secure Shell (SSH) utility. There are various SSH utilities available depending on the machine you are connecting from:

  1. If you are connecting from a Linux/Unix machine, make sure you have the OpenSSH client installed and use that.
  2. If you are connecting from a Mac, it ships with OpenSSH, so you just need to open Terminal to get access to the ssh command.
  3. If you are connecting from a Windows machine, there are a couple of recommended options:
    1. Install the PuTTY SSH client and use it to connect directly to sbkbase.ornl.gov. (http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html)
    2. Or install Cygwin and OpenSSH on Windows and then run ssh from within Cygwin. (http://pigtail.net/LRP/printsrv/cygwin-sshd.html)

Once your account is set up on sbkbase, you have a username assigned to you, and SSH is properly configured, go to the appropriate terminal and connect as follows:

ssh <username>@sbkbase.ornl.gov

Once you are authenticated and have access to the system, you will be on the master node in your home directory. Hadoop is installed in the /opt directory. Make sure you have the following line in the .bashrc file in your home directory:

export PATH=$PATH:/opt/hadoop/bin

Source the file to have the Hadoop binaries in your PATH.
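For example, assuming the bash shell:

source ~/.bashrc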

The main utility for accessing the Hadoop system is "hadoop". You can type this on the command line to get a list of the commands that the hadoop utility can run.

Posted in User Guide | Leave a comment

Guidelines for using the system

We know how to play together nicely.

Posted in Governance | Leave a comment

User accounts

To log into the system, you will need what is referred to at ORNL as an XCAMS account. Please visit the following link and follow the instructions for creating an XCAMS account. https://xcams.ornl.gov/xcams/groups/scicomp/

Posted in User accounts | Leave a comment

Hardware specification and details

Kandinsky is a Linux system running CentOS 5 with the following major hardware elements:

  • 68 compute nodes, a management node and a backup management node, and a data staging storage system.
  • Each compute node has dual AMD Opteron 6134, 2.3 GHz processors (8 cores/CPU, 16 cores/node, 1088 cores/cluster)
  • The memory controller supports 4 channel DDR3
  • Each compute node has 64 GB DDR3 1333 MHz registered ECC memory (4 GB/core, 4.4 TB/cluster)
  • All memory slots are utilized, achieving full memory bandwidth
  • Each compute node has 4 x 2 TB disk drives (8TB/node, 544 TB/cluster)
  • The management head node also has dual AMD Opteron 6134 processors, 32 GB memory and 500 GB mirrored disk.
  • The backup management node has the same configuration as the head node.
  • Data staging storage with a 36 TB RAID array.
  • 20 Gbps InfiniBand interconnect and 1 Gbps Gigabit Ethernet management and maintenance network.
Posted in Hardware | 1 Comment