e-Bioscience: problem-solving environments for omics research
Omics technologies have changed the arena of life sciences research forever. They allow data to be generated at a large scale, starting with whole-genome DNA sequencing and followed by micro-array gene-expression analysis and mass spectrometry of proteins and metabolites. Data are produced at a startling rate by a constantly growing number of new high-throughput and/or genome-wide biotechniques at each cellular level: genomics (DNA), transcriptomics (RNA), proteomics (protein), metabolomics (metabolite) and phenomics (phenotype). With ongoing innovations in nanotechnology and lab-on-a-chip applications, the end of this development is not yet in sight.
However, in contrast to their longstanding tradition in the development of biotechnology, life scientists have little experience in dealing with large, information-rich data sets. Integrating these data sets is especially valuable: a system can then be studied as a whole (systems approach), patterns recognized, models tested, networks of relationships built, biomarkers for diagnostics identified and leads for drug development discovered. Another important aspect of the increasing complexity of science is that it becomes economically impossible to build up all necessary data, skills and competence at one location. This holds even for the research laboratories of multinational companies that operate on a global scale. It calls for an experimental environment that enables reuse and sharing of data, methods, experimental designs and knowledge within a setup of (ad hoc) collaborations, in which the whereabouts of scientists and data are not important.
Thus, the bottlenecks for life sciences have shifted from data generation to data storage, pre-processing, analysis, interpretation, reuse and (remote) collaboration. The current challenge is to remove these bottlenecks by combining life sciences and information technology (IT). The area that deals with this challenge is called enhanced-Bioscience (e-Bioscience) and is characterized by multidisciplinary collaborations. Integration of the (Grid) methodologies and infrastructure needed for e-Bioscience experimentation forms the basis for the concept of e-Bioscience Problem-Solving Environments (PSEs). Some key aspects here are: virtual organizations, remote collaboration, sharing and reuse of data and resources, data integration, information management, and knowledge handling. Hence an e-Bioscience PSE supports the scientist in collaborating with distant partners, designing experiments, (re)using data, building models and assessing the tenability of a hypothesis, and it notifies the scientist when distant data or methods become available. Ideally, the life scientist will work in a Grid-based e-Bioscience PSE in which he or she can perform tasks while the data, the processing of data and information, and the knowledge associated with an experiment are silently taken care of. e-Bioscience PSEs are currently being developed in the context of the BSIK programs Virtual Lab e-Science (VL-e) and BioRange.
There are three application areas in Grid-based e-Bioscience PSEs that are currently expected to benefit the life scientist most:
- Grid computing for:
  - non-parametric statistical methods, micro-array analyses, protein structure analyses, and image analyses.
  - mass-spectrometric data analysis.
  - integrative computational experiments.
  - generation of biological networks.
  - simulation modeling in systems biology.
- Provide access to distributed resources such as:
  - data sources related to technologies like micro-array, mass spectrometry, microscopy, fMRI, etc.
  - data sources related to research like literature, domain models, etc.
  - method sources for omics data analyses like Grid/Web services.
- Support application interaction with:
  - real-time visualization and communication.
  - interactive and creative environments for data interpretation and hypothesis generation.
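To illustrate why non-parametric statistical methods, the first Grid computing item above, suit distributed execution, the following is a minimal sketch of a two-sample permutation test split into independent chunks. It is not taken from VL-e or BioRange: the function names, the chunk sizes and the use of a local thread pool as a stand-in for Grid nodes are all assumptions made for illustration only.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def mean_diff(a, b):
    """Difference between the means of two samples."""
    return sum(a) / len(a) - sum(b) / len(b)


def permutation_chunk(a, b, n_perm, seed):
    """Run n_perm random relabelings and count those whose absolute
    mean difference is at least as large as the observed one.
    Each chunk has its own seed, so chunks are fully independent."""
    rng = random.Random(seed)
    observed = abs(mean_diff(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean_diff(pooled[:len(a)], pooled[len(a):])) >= observed:
            hits += 1
    return hits


def permutation_p_value(a, b, n_chunks=4, perms_per_chunk=500):
    """Two-sided permutation p-value assembled from independent chunks.
    The thread pool only stands in for remote workers (assumption);
    on a Grid each chunk would be submitted as a separate job."""
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        futures = [
            pool.submit(permutation_chunk, a, b, perms_per_chunk, seed)
            for seed in range(n_chunks)
        ]
        hits = sum(f.result() for f in futures)
    total = n_chunks * perms_per_chunk
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (total + 1)


if __name__ == "__main__":
    treated = [12, 14, 15, 13, 16]   # hypothetical expression values
    control = [3, 5, 4, 6, 2]        # hypothetical expression values
    print(permutation_p_value(treated, control))
```

Because each chunk needs only the two samples and a seed, and returns a single count, the work can be spread over as many workers as are available with almost no communication. This is the property that makes such methods a natural fit for Grid computing.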
Since these elements are all researched and implemented in the context of VL-e, the hardware and Grid requirements are rather generic. As with other data-intensive science applications, e-Bioscience deals with globally distributed databases, whose use requires high-performance networking, huge temporary storage capacity and extensive computational power.