
A semantic web approach to data integration for the Histone Use Case

Overview

The goal of our experiment is to integrate histone modification data with transcription factor binding site (TFBS) data using existing semantic web tools. The histone modification data comes from the ENCODE project and records the presence or absence of a particular histone modification across the ENCODE regions. We retrieved both types of data from the UCSC Genome Bioinformatics Site, where they are available from an FTP server in gzipped format (since we started, UCSC has also made the data directly available from a MySQL database). Using a modified version of the Mapper program, we converted the tab-delimited data files into an RDF format that preserves the table structure, column names, and data types. We labelled the RDF data with explicit XML Schema Definition data types such as "&xsd;integer" (see also the "Gotchas" section below). The details of the translation into RDF by Mapper are determined by an XML file that specifies the mappings*. Once the data was converted into RDF, we linked the names in the data into the namespace of our own OWL models (OWLDoc Overview of Ontologies) using RDFS statements. This allowed us to perform a model-based query, i.e. a query expressed in terms of our own OWL models.

* Note: Our current version of Mapper has custom modifications to produce RDF output, so some details of the translation are not immediately accessible. We plan to reengineer Mapper so that the RDF format is fully specified in the XML mapping file.
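The conversion step described above can be illustrated with a minimal Python sketch. It is not the Mapper program and does not reproduce its output format. The `DATA` namespace, the column names (`chrom`, `chromStart`, `chromEnd`, `score`), the `#rowN` URI scheme, and the example file URI are all assumptions made for illustration. The sketch also writes full XSD datatype URIs instead of entity references such as "&xsd;integer".

```python
import csv
import io
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XSD = "http://www.w3.org/2001/XMLSchema#"
# Hypothetical namespace for the column properties; not the one Mapper uses.
DATA = "http://example.org/theirDataModel#"

ET.register_namespace("rdf", RDF)
ET.register_namespace("data", DATA)


def rows_to_rdf(tsv_text, column_types, file_uri):
    """Convert tab-delimited text with a header row into an RDF/XML string.

    column_types maps a column name to an XSD type name such as "integer".
    Columns without an entry are written as plain literals.
    """
    root = ET.Element(f"{{{RDF}}}RDF")
    # The data file itself is a resource that "contains" its rows.
    file_node = ET.SubElement(root, f"{{{RDF}}}Description")
    file_node.set(f"{{{RDF}}}about", file_uri)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for index, row in enumerate(reader):
        row_uri = f"{file_uri}#row{index}"
        link = ET.SubElement(file_node, f"{{{DATA}}}contains")
        link.set(f"{{{RDF}}}resource", row_uri)
        # One resource per table row, so the table structure is preserved.
        node = ET.SubElement(root, f"{{{RDF}}}Description")
        node.set(f"{{{RDF}}}about", row_uri)
        for column, value in row.items():
            # Column names become property names and must be valid XML names.
            prop = ET.SubElement(node, f"{{{DATA}}}{column}")
            if column in column_types:
                prop.set(f"{{{RDF}}}datatype", XSD + column_types[column])
            prop.text = value
    return ET.tostring(root, encoding="unicode")


# Made-up sample row, for illustration only.
sample = "chrom\tchromStart\tchromEnd\tscore\nchr7\t1000\t1020\t2.5\n"
types = {"chromStart": "integer", "chromEnd": "integer", "score": "double"}
rdf_xml = rows_to_rdf(sample, types, "http://example.org/TFBSConsSites.txt")
print(rdf_xml)
```

Keeping the column-to-datatype table outside the conversion function mirrors the idea of a separate mapping file, although the real mapping file is XML and its structure is not shown here.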

The conversion of TFBS data to RDF:

[Figure: TFBS Data plus ENCODE Data Schema -> TFBS in RDF format]

The conversion of ENCODE data to RDF:

[Figure: ENCODE Data plus TFBS Data Schema -> ENCODE in RDF format]

Figure 1: Transcription factor binding site (TFBS) data is converted from a tab-delimited format to RDF by using the column headers from the database schema. Similarly, ChIP-chip data for histone 3 lysine 4 tri-methylation (H3K4Me3) is converted to RDF. Both data sets are obtained from the UCSC site. The H3K4Me3 data is part of the ENCODE project.

Listing 1. SeRQL model-based query

SELECT *
FROM  {datafile1} theirDataModelTFBS:contains {rowX} myModel:Chromosome_identifier {chrom1};
                                    myModel:hasStartLocation {tStart1};
                                    myModel:hasEndLocation {tEnd1};
                                    myExperimentModel:hasMeasurementValue {score1},
      {datafile2} theirDataModelH3K4Me3:contains {rowY} myModel:Chromosome_identifier {chrom2};
                                    myModel:hasStartLocation {tStart2};
                                    myModel:hasEndLocation {tEnd2};
                                    myExperimentModel:hasMeasurementValue {score2}

WHERE chrom1 = chrom2 AND 
datafile1 = <http://staff.science.uva.nl/~lpost/Data/TFBSConsSites.txt> AND 
datafile2 = <http://staff.science.uva.nl/~lpost/Data/encodeSangerChipH3K4me3.txt> AND 
(tStart1 <= tEnd2 AND tEnd1 >= tStart2)

USING NAMESPACE
   myModel = <http://staff.science.uva.nl/~lpost/SemanticModels/EpigeneticsFoundation.owl#>,
   myExperimentModel = <http://staff.science.uva.nl/~lpost/SemanticModels/Experiment.owl#>,
   theirDataModelH3K4Me3 = <http://staff.science.uva.nl/~lpost/DataModels/TheirENCODEChIPchipDataModel.owl#>,
   theirDataModelTFBS = <http://staff.science.uva.nl/~lpost/DataModels/TheirTFBSdataModel.owl#>

The same query has also been written in SPARQL, and we have also worked with Jena.
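The join that Listing 1 expresses can be illustrated outside a triple store. The following Python sketch applies the same two conditions, equal chromosome and overlapping intervals, to plain tuples. The function name, the tuple layout `(chrom, start, end, score)`, and the sample rows are assumptions for illustration and are not taken from the data files.

```python
from collections import defaultdict


def overlapping_pairs(tfbs_rows, histone_rows):
    """Yield (tfbs, histone) pairs on the same chromosome whose intervals overlap.

    Each row is a (chrom, start, end, score) tuple.
    """
    # Group the second data set by chromosome so that only rows on the
    # same chromosome are compared (chrom1 = chrom2 in the query).
    by_chrom = defaultdict(list)
    for row in histone_rows:
        by_chrom[row[0]].append(row)
    for tfbs in tfbs_rows:
        chrom1, start1, end1, _score1 = tfbs
        for histone in by_chrom.get(chrom1, []):
            _chrom2, start2, end2, _score2 = histone
            # Same test as the WHERE clause:
            # tStart1 <= tEnd2 AND tEnd1 >= tStart2
            if start1 <= end2 and end1 >= start2:
                yield tfbs, histone


# Made-up rows, for illustration only.
tfbs = [("chr7", 100, 120, 2.1), ("chr7", 500, 520, 1.7), ("chr11", 100, 120, 3.0)]
histone = [("chr7", 110, 400, 1.0), ("chr11", 300, 600, 1.0)]
pairs = list(overlapping_pairs(tfbs, histone))
for tfbs_row, histone_row in pairs:
    print(tfbs_row, histone_row)
```

Note that the condition treats both interval ends as inclusive, so two regions that share only a single boundary coordinate are reported as overlapping.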

Gotchas

Our XSD data type tags caused a problem when loading the RDF, and the cause was not immediately apparent from the error message:

[ERROR  ] : error while adding new triples: org.openrdf.rio.ParseException: org.xml.sax.SAXParseException: Parser has reached the entity expansion limit "64,000" set by the Application.

To solve this problem, set the entityExpansionLimit JVM option as follows:
-DentityExpansionLimit=500000

The limit you need depends on how many entity references are expanded. In our case, counting the lines that contain an ampersand gives:
[prompt]$ grep '&' TFBSConsSites.xml | wc -l
5387285

We had to increase the limit to about 6 million because every XML Schema datatype declaration such as "&xsd;integer" contains an entity reference (the ampersand), resulting in:

java -Xmx2600M -DentityExpansionLimit=6000000 -jar SesameLoadnQuery.jar > messagesDate.txt
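The estimate above can also be scripted. The following Python sketch counts named entity references in an XML file and rounds the result up to a convenient limit. It counts occurrences rather than lines, on the assumption that each reference counts toward the limit. The function names, the 10% headroom, and the rounding step are illustrative choices, not values prescribed by the parser.

```python
import re

# Matches a named entity reference such as &xsd; or &amp;
# (numeric character references such as &#38; are not matched).
ENTITY_REF = re.compile(r"&[A-Za-z_][\w.\-]*;")


def count_entity_references(lines):
    """Count named entity references in an iterable of text lines."""
    return sum(len(ENTITY_REF.findall(line)) for line in lines)


def suggested_limit(count, headroom=1.1, step=100000):
    """Return the count plus some headroom, rounded up to a multiple of step."""
    padded = int(count * headroom)
    return -(-padded // step) * step


# Made-up lines, for illustration only.
example_lines = [
    '<data:chromStart rdf:datatype="&xsd;integer">1000</data:chromStart>',
    '<data:score rdf:datatype="&xsd;double">2.5</data:score>',
    "<data:name>V$AP1_Q2</data:name>",
]
print(count_entity_references(example_lines))
```

For a real file, the lines can be streamed so that the whole file never has to fit in memory, for example with `with open("TFBSConsSites.xml", encoding="utf-8") as handle: count_entity_references(handle)`.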

Original Data Sources (UCSC = University of California at Santa Cruz)

The data is available via FTP in gzipped format from ENCODE ChIP-on-chip Data for H3K4me3 and TFBS conserved sites Data.

The corresponding datatype information (i.e. the syntax level) for the data is provided as SQL statements (MySQL dump) at theirDataModel for ENCODE Data and theirDataModel for TFBS Data.

References

This work was first described in a poster that can serve as an overview of several of the relevant concepts.

See http://mad-db.science.uva.nl:10080/MADfiles/ECCBPoster.pdf

See also: http://www.eccb05.org/PostersDetail.php?postersPage=12&poster_id=132