Downloading all the nonredundant maximal trees from the current release

With release 184 we are making available in one file a set of 22,165 trees from nonredundant maximal clusters. A maximal cluster is a cluster that has no parent clusters (recall that we stop clustering at a certain depth in the NCBI tree). Trees in this set are built from disjoint sets of sequence data, and so are nonredundant with respect to the underlying data. Each tree is a RAxML optimal tree built from a MUSCLE alignment (branch lengths included). The file has two columns. The first entry in each row is a unique identifier that can be parsed to give the PhyLoTA taxon ID and cluster ID for the cluster ('tiXXX_ciXXX'). The second entry is a Newick string in which all spaces in taxon names have been converted to underscores. Each taxon name also includes the NCBI taxon ID and GI number.
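
The two-column layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the PhyLoTA distribution: the function names are hypothetical, and it assumes the two columns are separated by whitespace (a tab or spaces), which the description does not state explicitly. Because spaces in taxon names have been converted to underscores, splitting on the first run of whitespace should leave the Newick string intact.

```python
import re
from typing import Iterator, TextIO, Tuple

# Identifier layout given in the text: 'tiXXX_ciXXX'
# (ti = PhyLoTA taxon ID, ci = cluster ID).
ID_PATTERN = re.compile(r"^ti(\d+)_ci(\d+)$")


def parse_tree_line(line: str) -> Tuple[int, int, str]:
    """Split one row into (taxon ID, cluster ID, Newick string)."""
    # Assumption: columns are whitespace-separated. Only the first
    # run of whitespace is used as the column boundary.
    identifier, newick = line.strip().split(None, 1)
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"unexpected identifier: {identifier!r}")
    return int(match.group(1)), int(match.group(2)), newick.strip()


def read_trees(handle: TextIO) -> Iterator[Tuple[int, int, str]]:
    """Yield one parsed record per non-blank line of the tree file."""
    for line in handle:
        if line.strip():
            yield parse_tree_line(line)
```

For example, a made-up row such as "ti1234_ci5" followed by a tab and "(a:0.1,b:0.2);" would be returned as the taxon ID 1234, the cluster ID 5, and the unchanged Newick string, which can then be handed to any tree library.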

PLEASE NOTE. A bug in the script generating this file was fixed on Sept. 14, 2012. If you downloaded and used this file before that date, the identifier in the first column may contain an incorrect taxon ID number (the Newick trees themselves are fine).

Download trees

Downloading the entire database or associated scripts

Current and previous releases of the PhyLoTA Browser database can be downloaded in a format suitable for rebuilding it locally in a MySQL database (or the equivalent). The database is exported as a set of MySQL commands using the 'mysqldump' utility. The file is named in the following format:

pb.bu.rel###.date.gz

where rel### is the GenBank release upon which the database was built, and date is the date the database was exported. This very large file is then broken into several 250 MB pieces with file names as above, but with the suffix 'partxx' replacing the .gz. These can be downloaded separately. To join them again under Unix/Linux the command would be (for example):

cat pb.bu.rel168.8.18.2009.part* > pb.bu.rel168.8.18.2009.gz

Once the file is uncompressed it can be imported into MySQL directly. The uncompressed files are plain text, so they can also be parsed easily by other software.
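
On systems without the Unix tools, the same join-and-uncompress steps can be done in Python. The following is a minimal sketch using only the standard library. The function names and the output file names are hypothetical, and it assumes the part suffixes are fixed-width (for example 'part00', 'part01'), so that sorting the file names alphabetically reproduces the order used by the shell wildcard.

```python
import glob
import gzip
import shutil
from typing import List


def join_parts(prefix: str, joined_path: str) -> List[str]:
    """Concatenate prefix.part* in name order, like `cat prefix.part* > joined`."""
    # Assumption: fixed-width part suffixes, so sorted() gives the right order.
    parts = sorted(glob.glob(glob.escape(prefix) + ".part*"))
    if not parts:
        raise FileNotFoundError(f"no files match {prefix}.part*")
    with open(joined_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # streams, so large parts are fine
    return parts


def gunzip(joined_path: str, sql_path: str) -> None:
    """Uncompress the joined .gz file into a plain-text SQL dump."""
    with gzip.open(joined_path, "rb") as src, open(sql_path, "wb") as out:
        shutil.copyfileobj(src, out)
```

With the example above, join_parts("pb.bu.rel168.8.18.2009", "pb.bu.rel168.8.18.2009.gz") followed by gunzip("pb.bu.rel168.8.18.2009.gz", "pb.bu.rel168.8.18.2009.sql") would leave a text file ready for import. Both steps copy in chunks rather than reading whole files into memory, which matters for a dump of this size.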

Go to the download directory

Notes on the database schema, and scripts used to build the database and run the browser

The database schema

We use a relational schema implemented in MySQL with a small number of tables. The 'seqs' table is used across all releases of the database. Other tables have a suffix consisting of '_xx' indicating the GenBank release number.

The 'nodes_xx' table is constructed in part from NCBI's taxonomy flatfiles and in part from calculations and summaries built by us. The 'seqs' table contains data taken directly from GenBank sequence flatfiles. The 'clusters_xx' table contains summary information obtained by the clustering pipeline, and information about individual clusters is stored in 'cigi_xx'. Summary statistics on the entire cluster set are calculated and stored in 'summary_stats'.
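
When querying a local copy, the naming convention above can be captured in a small helper. This is a sketch under stated assumptions: the helper name is hypothetical, the suffix is assumed to be the plain release number (for example 'nodes_184' for release 184), and 'summary_stats' is deliberately left out because the text does not make clear whether it carries a release suffix.

```python
# 'seqs' is shared across releases; the other tables named in the
# text carry a '_xx' suffix for the GenBank release number.
SHARED_TABLES = {"seqs"}
PER_RELEASE_TABLES = {"nodes", "clusters", "cigi"}


def table_name(base: str, release: int) -> str:
    """Return the table name to query for a given GenBank release."""
    if base in SHARED_TABLES:
        return base
    if base in PER_RELEASE_TABLES:
        # Assumption: the suffix is the bare release number.
        return f"{base}_{release}"
    raise KeyError(f"unknown table: {base!r}")
```

Checking the generated name against the tables actually present in the imported dump is advisable before relying on it.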

Source code for the scripts that implement the clustering pipeline is provided below. Note, however, that it relies on additional software installations, including BLAST (available from NCBI) and our PhyLoTA project programs blink and blast2blink, which can be downloaded from the PhyLoTA project software page.

The BLAST search scripts can be distributed across cluster architectures and are written in Perl as Sun Grid Engine compliant code. However, it is straightforward to modify these scripts to run on single workstations.

The browser

The web interface seen by the user is implemented as a set of Perl CGI scripts. These write dynamic web pages in response to user queries of various kinds. They access the MySQL database via the Perl DBI module. A configuration file is included in the source code that can be modified for local MySQL login parameters.

Perl scripts for the clustering pipeline (gzip compressed)
Perl CGI scripts for the browser interface (gzip compressed)