The GenBase System
GenBase is a highly customizable software package that integrates biological sequences, sequence metadata, sequence annotation and search results data in a single, highly efficient search and analytics engine. It brings protein and nucleic acid sequences in a variety of formats, together with their annotation information and search results data, into a unified data repository that provides an array of data management tools, analytics capabilities and flexible connectivity to other systems.
The GenBase platform is based on ElasticSearch, a widely used, enterprise-grade, open-source, scalable and distributed search engine. The ElasticSearch engine enables easy and fast querying of vast amounts of sequence, annotation and homology search results data within a unified environment. It also supports data management tasks and processes, such as sequence data de-duplication, in a controlled and verifiable manner, while keeping complete linkage and tracking between raw and processed data.
Seamless integration of the GenBase repository with GenCore-6, GenCore Grid and GenWeb provides a complete data management solution with flexible connectivity to other systems.
The GenBase system consists of the following main modules:
Biological Sequences Ingestion and Repository:
This module ingests protein and nucleic acid sequences into the ElasticSearch engine from a variety of formats, such as FASTA, GenBank, EMBL, UniProt, GENESEQ, ST.23, ST.25, ST.26 and more. Thanks to the flexibility of ElasticSearch and the availability of a variety of data connectors, the module can be configured to ingest data not only from flat files but also from databases, content management systems (CMS) and other systems. Sequences are ingested into the repository together with their annotation data, and a comprehensive set of indexes and hash keys is produced for fast and flexible analytics and search. As part of this process, a 512-bit hash (SHA-512) is calculated for each sequence; identical sequences share the same hash, which is what enables de-duplication of the searchable sequence databases.
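The hashing step described above can be illustrated with a minimal Python sketch. It is not the GenBase implementation: the FASTA parser, the record layout (`header`, `sequence`, `sha512`) and the normalization rule (whitespace removed, letters upper-cased before hashing) are assumptions made for illustration only. Only `hashlib.sha512` from the Python standard library is used.

```python
import hashlib
from typing import Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sequence_hash(sequence: str) -> str:
    """Return the SHA-512 hex digest of a sequence.

    Assumption: the sequence is normalized (whitespace removed,
    upper-cased) so that formatting differences do not change the key.
    """
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha512(normalized.encode("ascii")).hexdigest()


FASTA_TEXT = """>seq1 example protein
MKTAYIAKQR
QISFVKSHFS
>seq2 same residues, different case
mktayiakqrqisfvkshfs
>seq3 different protein
MKTAYIAKQRQISFVKSHFA
"""

# Hypothetical document layout; a real repository would also carry annotation.
records = [
    {"header": header, "sequence": seq, "sha512": sequence_hash(seq)}
    for header, seq in parse_fasta(FASTA_TEXT)
]
```

In this sketch `seq1` and `seq2` receive the same key while `seq3` receives a different one, which is the property a de-duplication step relies on.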
De-duplication of Searchable Databases:
This module produces the de-duplicated, searchable flat-file databases for the GenCore-6 and GenCore Grid systems. It does so by exporting only the unique sequences (identified by the pre-calculated hash key), without annotation, into searchable data sets in FASTA format. For in-house and version-controlled sequence databases, this significantly reduces the size of the searchable databases and improves the clarity and structure of the homology search results obtained from these data.
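As a rough illustration of this export step, the Python sketch below writes one FASTA entry per unique sequence and keeps a map from each hash key to the identifiers that share it. The function name `export_unique_fasta`, the use of the hash as the FASTA header, and the 60-character line width are assumptions for illustration; the hash is recomputed here only to keep the example self-contained, whereas the text above describes it as pre-calculated at ingestion.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def export_unique_fasta(
    records: Iterable[Tuple[str, str]], width: int = 60
) -> Tuple[str, Dict[str, List[str]]]:
    """Return (fasta_text, members) for the unique sequences in `records`.

    `records` holds (sequence_id, sequence) pairs. `members` maps each
    hash key to every sequence_id that carries that exact sequence.
    """
    unique: Dict[str, str] = {}          # hash key -> normalized sequence
    members: Dict[str, List[str]] = {}   # hash key -> ids sharing it
    for seq_id, sequence in records:
        normalized = "".join(sequence.split()).upper()
        key = hashlib.sha512(normalized.encode("ascii")).hexdigest()
        unique.setdefault(key, normalized)
        members.setdefault(key, []).append(seq_id)

    lines: List[str] = []
    for key, sequence in unique.items():
        lines.append(f">{key}")  # assumption: header is the hash key only
        lines.extend(
            sequence[i:i + width] for i in range(0, len(sequence), width)
        )
    return "\n".join(lines) + "\n", members


example_records = [
    ("A1", "MKTAYIAKQRQISFVKSHFS"),
    ("A2", "MKTAYIAKQRQISFVKSHFS"),  # duplicate of A1
    ("B1", "MKTAYIAKQRQISFVKSHFA"),
]
fasta_text, members = export_unique_fasta(example_records)
```

Keeping the `members` map alongside the export is one way to preserve the linkage between the de-duplicated data set and the original entries.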
Homology Search Results API, Ingestion and Repository:
This module supports the different aspects of producing and analyzing homology search results. First, a flexible GenBase API enables GenCore-6 to produce duplication-aware homology search results by pulling the sequence annotation and duplication information from the GenBase repository. In addition, all search results are ingested back into GenBase's ElasticSearch engine, again with comprehensive indexing. Together, these steps support fast and flexible analysis and querying of sequence information combined with search results data in a unified manner.
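One way to picture a duplication-aware result is as a join between hits on the de-duplicated database and the repository entries that share each hit's sequence. The Python sketch below is a hypothetical illustration and does not reflect the actual GenBase API: the function `expand_hits`, the field names (`hash`, `score`, `id`, `annotation`) and the in-memory `members` lookup are all assumed.

```python
from typing import Dict, List


def expand_hits(hits: List[dict], members: Dict[str, List[dict]]) -> List[dict]:
    """Expand each hit on a unique sequence to all entries sharing it.

    `hits` are search hits keyed by the unique-sequence hash.
    `members` maps a hash to the annotated entries with that sequence.
    """
    expanded = []
    for hit in hits:
        for entry in members.get(hit["hash"], []):
            expanded.append(
                {
                    "hash": hit["hash"],
                    "score": hit["score"],
                    "id": entry["id"],
                    "annotation": entry["annotation"],
                }
            )
    return expanded


# Hypothetical data: one unique sequence stored under two identifiers.
members = {
    "h1": [
        {"id": "A1", "annotation": "kinase, version 1"},
        {"id": "A2", "annotation": "kinase, version 2"},
    ],
    "h2": [{"id": "B1", "annotation": "phosphatase"}],
}
hits = [{"hash": "h1", "score": 310.0}, {"hash": "h2", "score": 95.5}]
expanded = expand_hits(hits, members)
```

Here two hits on unique sequences become three annotated result rows, with both duplicates of `h1` reported under the same score.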
Inherent and Robust Data Integrity Support:
The GenBase platform, backed by the ElasticSearch engine, supports multiple data integrity checks and validations. The ingestion of each sequence and each search result is verified for completeness. This is combined with frequent (e.g. daily) validation of random representative samples of ingested files. To complement these processes, the complete data sets are validated periodically, on a schedule set by the system manager.
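The random-sample validation can be sketched in Python as follows. This is an assumption-laden illustration rather than the GenBase procedure: the function `validate_sample`, the use of SHA-512 digests as the comparison value, and the dictionary standing in for the repository are all hypothetical.

```python
import hashlib
import random
from typing import Dict, List, Optional


def sha512_hex(sequence: str) -> str:
    """Return the SHA-512 hex digest of a sequence string."""
    return hashlib.sha512(sequence.encode("ascii")).hexdigest()


def validate_sample(
    source: Dict[str, str],
    repository: Dict[str, str],
    sample_size: int,
    seed: Optional[int] = None,
) -> List[str]:
    """Check a random sample of source sequences against stored hashes.

    `source` maps sequence id -> sequence from the original file.
    `repository` maps sequence id -> stored SHA-512 hex digest.
    Returns the ids that are missing or whose digest does not match.
    """
    rng = random.Random(seed)
    ids = sorted(source)
    chosen = rng.sample(ids, min(sample_size, len(ids)))
    failures = []
    for seq_id in chosen:
        stored = repository.get(seq_id)
        if stored is None or stored != sha512_hex(source[seq_id]):
            failures.append(seq_id)
    return failures


source = {f"S{i}": "ACGT" * (i + 1) for i in range(5)}
repository = {seq_id: sha512_hex(seq) for seq_id, seq in source.items()}
clean_failures = validate_sample(source, repository, sample_size=3, seed=1)

# Simulate a corrupted entry and validate the complete data set.
repository["S2"] = "0" * 128
full_failures = validate_sample(source, repository, sample_size=len(source))
```

Setting `sample_size` to the size of the data set turns the same check into the full periodic validation.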
The combination of the comprehensive and flexible GenBase system with the GenCore Grid Server, the GenCore-6 search package and the GenWeb interface provides state-of-the-art management of sequences and homology search results, with fast and flexible analytics and querying capabilities. These are combined with inherent, enterprise-grade data integrity validation and failover capabilities. The integrated solution thus transforms vast biological sequence datasets and homology search results into easily accessible, clearly presented information.