The goal of our experiment is to integrate histone modification data with transcription factor binding site data using existing semantic web tools. The histone modification data comes from the ENCODE project and records the presence or absence of a particular histone modification across the ENCODE regions. We retrieved both types of data from the UCSC Genome Bioinformatics Site, where they are available from an FTP server in gzipped format (since we started, UCSC has also made the data directly available from a MySQL database). Using a modified version of the Mapper program, we converted the tab-delimited data files into an RDF format that preserves the table structure, column names, and data types. We labelled the RDF data with explicit XML Schema Definition data types such as "&xsd;integer" (see also the "Gotchas" section below). The details of the translation into RDF by Mapper are determined by an XML file that discloses the mappings*. Once the data had been converted into RDF, we used RDFS statements to link names in the data into the namespace of our own OWL models (OWLDoc Overview of Ontologies). This allowed us to perform a model-based query, i.e. a query expressed in terms of our own OWL models.
* Note: Our current version of Mapper has custom modifications to produce RDF output, so some details of the translation are not visible in the mapping file. We plan to reengineer Mapper so that the RDF format is fully specified in the XML mapping file.
Figure 1: Transcription factor binding site (TFBS) data is converted from a tab-delimited format to RDF by using the column headers from the database schema. Similarly, ChIP-chip data for histone 3 lysine 4 tri-methylation (H3K4Me3) is converted to RDF. Both data sets are obtained from the UCSC site. The H3K4Me3 data is part of the ENCODE project.
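The conversion step can be illustrated with a minimal sketch. This is not the Mapper program or its actual output format: it assumes a hypothetical namespace (http://example.org/tfbs#), a hypothetical row naming scheme (row1, row2, ...), made-up sample values, and N-Triples output instead of RDF/XML. It only shows the general idea of turning each table cell into a triple whose predicate comes from the column header and whose literal carries an explicit XML Schema data type.

```python
import csv
import io

XSD = "http://www.w3.org/2001/XMLSchema#"
# Hypothetical namespace for rows and columns; not the one used by Mapper.
BASE = "http://example.org/tfbs#"


def typed_literal(value, xsd_type):
    """Format a value as an N-Triples literal with an explicit XSD data type."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"^^<{XSD}{xsd_type}>'


def rows_to_ntriples(tsv_text, column_types, row_prefix="row"):
    """Turn tab-delimited text with a header line into N-Triples lines.

    Each data row becomes one subject; each column header becomes a
    predicate; each cell becomes a typed literal.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    triples = []
    for number, row in enumerate(reader, start=1):
        subject = f"<{BASE}{row_prefix}{number}>"
        for column, value in row.items():
            predicate = f"<{BASE}{column}>"
            obj = typed_literal(value, column_types[column])
            triples.append(f"{subject} {predicate} {obj} .")
    return triples


# Made-up sample data, for illustration only.
SAMPLE = "chrom\tchromStart\tname\nchr1\t1000\tsiteA\n"
SAMPLE_TYPES = {"chrom": "string", "chromStart": "integer", "name": "string"}

for line in rows_to_ntriples(SAMPLE, SAMPLE_TYPES):
    print(line)
```

Writing the data type as a full URI, as this sketch does, produces no entity references at all, whereas the "&xsd;integer" form relies on an entity that the XML parser has to expand for every typed value.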
Listing 1. SeRQL model-based query
Here is the SPARQL equivalent. We have also worked with Jena.
Our xsd data type tags apparently created a problem. The cause of the problem wasn't immediately apparent from the error message:

[ERROR ] : error while adding new triples: org.openrdf.rio.ParseException: org.xml.sax.SAXParseException: Parser has reached the entity expansion limit "64,000" set by the Application.

To solve this problem, set the entityExpansionLimit JVM option as follows:

-DentityExpansionLimit=500000

Your limit depends on how many "entities" are being expanded. In our case:

[prompt]$ grep '&' TFBSConsSites.xml | wc -l
5387285

We had to increase the limit to about 6M because of the ampersands in xml schema declarations with strings like "&xsd;integer", resulting in:

java -Xmx2600M -DentityExpansionLimit=6000000 -jar SesameLoadnQuery.jar > messagesDate.txt
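The grep command above counts lines that contain an ampersand, which is only an approximation when a line holds more than one entity reference. The following sketch counts the references themselves and derives a candidate value for entityExpansionLimit. The function names and the 10% headroom are assumptions made for illustration. We have not verified whether the parser also counts predefined references such as "&amp;" against the limit, so the sketch simply counts every named reference.

```python
import re

# Matches named entity references such as "&xsd;". Numeric character
# references such as "&#38;" and bare ampersands are not matched.
ENTITY_REF = re.compile(r"&[A-Za-z_][\w.-]*;")


def count_entity_references(lines):
    """Count named entity references in an iterable of text lines."""
    return sum(len(ENTITY_REF.findall(line)) for line in lines)


def suggested_limit(count):
    """Return the count plus roughly 10% headroom (an arbitrary margin)."""
    return count + count // 10


# Made-up sample lines, for illustration only.
sample_lines = [
    '<chromStart rdf:datatype="&xsd;integer">1000</chromStart>',
    '<chromEnd rdf:datatype="&xsd;integer">1012</chromEnd>',
]
found = count_entity_references(sample_lines)
print(f"-DentityExpansionLimit={suggested_limit(found)}")
```

To apply this to a real file, pass an open file object, for example count_entity_references(open("TFBSConsSites.xml", encoding="utf-8")), which reads the file line by line without loading it into memory.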
The data is available by FTP in zip format from ENCODE ChIP-on-chip Data for H3K4me3 and TFBS conserved sites Data.
The corresponding datatype information (i.e. Syntax Level) for the data is available as SQL statements (a MySQL dump) at theirDataModel for ENCODE Data and theirDataModel for TFBS Data.
This work was first described in a poster that can serve as an overview of several of the relevant concepts.
See http://mad-db.science.uva.nl:10080/MADfiles/ECCBPoster.pdf
See also: http://www.eccb05.org/PostersDetail.php?postersPage=12&poster_id=132