The database schema
We use a relational schema implemented in MySQL with a small number of tables. The 'seqs' table is used across all releases of the database. Other tables have a suffix consisting of '_xx' indicating the GenBank release number.
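The naming convention above can be illustrated with a short sketch. It is written in Python purely for illustration and is not part of the PhyLoTA source (the pipeline itself is in Perl). The helper name `table_name` and the release number 184 are hypothetical; only the '_xx' suffix rule and the shared 'seqs' table come from the description above.

```python
import re

# Per the description above, 'seqs' is shared across all releases.
SHARED_TABLES = {"seqs"}


def table_name(base: str, release: int) -> str:
    """Return the table name to query for a given GenBank release.

    Hypothetical helper: shared tables are returned unchanged, all
    other tables get the '_<release>' suffix.
    """
    # Table names cannot be bound as SQL parameters, so validate them
    # before they are interpolated into a query string.
    if not re.fullmatch(r"[A-Za-z_]+", base):
        raise ValueError(f"unexpected table name: {base!r}")
    if base in SHARED_TABLES:
        return base
    return f"{base}_{int(release)}"


nodes = table_name("nodes", 184)        # release 184 is an assumed example
clusters = table_name("clusters", 184)
seqs = table_name("seqs", 184)
print(nodes, clusters, seqs)
```

Validating the base name before building the string matters because a table name has to be spliced into the SQL text rather than passed as a bound parameter.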
The 'nodes_xx' table is constructed in part from NCBI's taxonomy flatfiles and in part from calculations and summaries built by us. The 'seqs' table holds data taken directly from GenBank sequence flatfiles. The 'clusters_xx' table contains summary information obtained by the clustering pipeline, and information about individual clusters is stored in 'cigi_xx'. Summary statistics on the entire cluster set are calculated and stored in 'summary_stats'.
Source code for the scripts that implement the clustering pipeline is provided below. Note, however, that it relies on additional software, including BLAST (available from NCBI) and our PhyLoTA project programs blink and blast2blink, which can be downloaded from the PhyLoTA project software page.
The BLAST search scripts can be distributed across cluster architectures and are written in Perl as Sun Grid Engine compliant code. However, it is straightforward to modify these scripts to run on single workstations.
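One common way to make the same script run both as a Sun Grid Engine array job and on a single workstation is to branch on the `SGE_TASK_ID` environment variable, which the scheduler sets for array tasks. The sketch below shows that pattern in Python for illustration only. It is not how the PhyLoTA Perl scripts are necessarily organized, and the function name `select_chunk`, the item labels, and the assumption of a job submitted with task IDs 1..n_tasks are all hypothetical.

```python
import os


def select_chunk(items, n_tasks, env=None):
    """Pick the work items this process should handle.

    Under an SGE array job (assumed submitted with task IDs 1..n_tasks),
    each task takes every n_tasks-th item. On a workstation, where
    SGE_TASK_ID is absent or 'undefined', the single process takes all.
    """
    env = os.environ if env is None else env
    task_id = env.get("SGE_TASK_ID", "undefined")
    items = list(items)
    if task_id == "undefined":
        return items                 # single workstation: do everything
    index = int(task_id) - 1         # convert 1-based task ID to 0-based
    return items[index::n_tasks]


# Hypothetical work items standing in for units of BLAST work.
work = ["a", "b", "c", "d", "e"]
print(select_chunk(work, 2, {"SGE_TASK_ID": "1"}))
print(select_chunk(work, 2, {}))
```

Passing the environment as an argument keeps the function easy to exercise without a scheduler.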
The web interface seen by the user is implemented as a set of Perl CGI scripts. These write dynamic web pages in response to user queries of various kinds. They access the MySQL database via the Perl DBI module. A configuration file is included in the source code that can be modified for local MySQL login parameters.
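As a rough illustration of keeping local MySQL login parameters in a separate configuration file, the sketch below parses an INI-style file with Python's standard `configparser`. The format of the actual PhyLoTA configuration file is not described here, so the section name `mysql`, the key names, and all values are assumptions made for the example.

```python
import configparser

# Hypothetical configuration text; every name and value is a placeholder.
EXAMPLE_CONFIG = """
[mysql]
host = localhost
user = phylota_reader
password = change-me
database = phylota
"""


def load_db_params(text: str) -> dict:
    """Parse login parameters from INI-style text into a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["mysql"]
    return {
        "host": section.get("host", fallback="localhost"),
        "port": section.getint("port", fallback=3306),  # MySQL default port
        "user": section["user"],
        "password": section["password"],
        "database": section["database"],
    }


params = load_db_params(EXAMPLE_CONFIG)
print(params["host"], params["port"], params["database"])
```

The resulting dictionary could then be handed to whichever MySQL client library is installed locally, in the same way the CGI scripts pass their settings to the Perl DBI module.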
Perl scripts for the clustering pipeline (gzip compressed)
Perl CGI scripts for the browser interface (gzip compressed)