TCMGeneDIT database
The RDF-TCM dataset was produced by transforming the relational TCMGeneDIT database [25] into RDF. TCMGeneDIT not only provides information in English but also collects associations among herbs, genes, diseases, CM effects and CM ingredients from public databases and the literature. It reuses existing knowledge and extracts further association information through text-mining techniques, as follows:
- Herb names, such as Ginkgo biloba, were collected from the HULU TCM professional web site [26] and from TCM-ID [27], a database of CM herbs and herbal ingredients;
- Ingredient data were collected from the above two resources as well as from the Chinese medicine resource web [28];
- Human genes and their information were retrieved from NCBI Entrez [29];
- Disease names were extracted from the heading and entry term fields in the disease (C) section of the Medical Subject Headings vocabulary (MeSH) [30];
- Relationships between genes and diseases were collected from PharmGKB [20];
- Other associations between herbs and genes, diseases and effects were mined and extracted from a corpus of MEDLINE abstracts collected through PubMed.
Create RDF-TCM
The TCMGeneDIT database is available as a database dump under the Creative Commons Attribution License [31]. To publish TCMGeneDIT as Linked Data, we followed the Linked Data publication methodology we proposed previously [32], which consists of the following steps:
1. Choose a transformation strategy, either RDF caching or virtualization;
2. Design a URI scheme according to the Linked Data principles and the Cool URIs style [33], providing simple and stable URIs;
3. Construct schemas or ontologies based on the source data schemas, imposing as little interpretation as possible and reusing existing ontologies where possible;
4. Construct transformation scripts and mapping files, starting by transforming a small portion of the records, together with a test framework that is useful not only for validating the sanity of the RDF dataset but also for revalidating it whenever the transformation process is repeated;
5. Create mappings to other data sources where immediate value is foreseen, using either customized scripts or existing software tools such as Silk [34];
6. Finally, and preferably, provide metadata descriptions of the dataset, including its provenance information, and make all the scripts, configuration files and ontologies accessible.
A skeleton of the methodology was proposed previously [32]; the following sections provide the details. Steps 2-5 should be applied iteratively, and some design decisions must be made in accordance with fundamental principles.
Choose a transformation strategy
Linked datasets can be published either by creating an RDF cache or by providing virtualized access to the source data. RDF caching means that developers convert a snapshot of the source database into RDF, load the cached data into an RDF store and publish it as Linked Data. The virtualization approach rewrites an HTTP dereference request for a data URI into a query expressed in a language native to the source database (e.g. SQL), which is evaluated against the data in its native form without transformation into RDF. Virtualization is more desirable if the source data have a high churn rate, but the current tools supporting it (such as Triplify [11]) struggle to cope with large relational databases and complex rewriting rules. If the update rate of the source data is sufficiently low, the caching approach is more feasible. Because TCMGeneDIT is no longer updated, we chose the RDF caching approach to build RDF-TCM.
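To make the caching strategy concrete, the following is a minimal sketch in Python using the rdflib library. The SQLite access, the table and column names, and the emitted tcm:gene property are illustrative assumptions for the sketch, not the actual RDF-TCM transformation scripts (which are linked at the end of this section).

import sqlite3

from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

TCM = Namespace("http://purl.org/net/tcm-onto/")
BASE = "http://purl.org/net/tcm/tcm.lifescience.ntu.edu.tw/id/"

def cache_as_rdf(db_path):
    # Convert a snapshot of the relational source into RDF triples that
    # can then be loaded into an RDF store and published as Linked Data.
    g = Graph()
    g.bind("tcm", TCM)
    conn = sqlite3.connect(db_path)
    # Hypothetical table listing medicine-gene associations.
    for medicine, gene in conn.execute(
            "SELECT medicine, gene FROM medicine_gene_association"):
        med_uri = URIRef(BASE + "medicine/" + medicine.replace(" ", "_"))
        g.add((med_uri, RDF.type, TCM.Medicine))
        g.add((med_uri, TCM.gene, URIRef(BASE + "gene/" + gene)))
    conn.close()
    return g

graph = cache_as_rdf("tcmgenedit.db")
graph.serialize(destination="rdf-tcm.nt", format="nt")  # the cached snapshot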
Design the URIs
URIs are required in Linked Data in order to identify entities (instances), types of entities (classes) and types of their relationships (properties). The 'Linked Data Principles' outlined by Berners-Lee [35] clarify the role of URIs in Linked Data and the set of best practices for publishing them:
"1. Use URIs as names for things; 2. Use HTTP URIs so that people can look up these names; 3. When someone looks up a URI, provide useful information using the standards (e.g. RDF, SPARQL); 4. Include links to other URIs, so that they can discover more things."
In addition, we recommend that new URIs be coined only if no existing URIs can be found, and that they be persistent. Reusing existing URIs improves the connectivity of a dataset with others and helps establish shared names within the community. Consortia such as SharedNames [36] and the Concept Web Alliance [37] are active ongoing efforts to create unique, shared names for biological entities. A data publisher should have control over the namespace under which new URIs are created, which not only allows useful information about these resources to be provided but also improves the stability of the URIs. Creating links to URIs published by others is highly recommended for bridging the gap between a local namespace and the Linked Data cloud.
The URIs used for RDF-TCM followed the pattern of:
http://purl.org/net/tcm/tcm.lifescience.ntu.edu.tw/id/{type}/{id}
where {type} corresponds to the type of an entity (such as Gene) and {id} is an identifier derived from the source data (e.g. the gene name or the herb name) or a sequential number assigned by the transformation program. We used PURL [38] URIs to ensure the persistence of these URIs, and we used the namespace of the TCMGeneDIT website as part of the URI to preserve information about the owner and origin of the dataset. For example, the URI
http://purl.org/net/tcm/tcm.lifescience.ntu.edu.tw/id/medicine/Ginkgo_biloba
identifies the herb Ginkgo biloba.
And the URI
http://purl.org/net/tcm/tcm.lifescience.ntu.edu.tw/id/statistics/9199
denotes a Statistics entity that describes the confidence in an association between entities.
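As an illustration, URIs following this pattern can be minted with a small helper like the one below; the function name and the exact character-escaping policy are assumptions for the sketch, not a description of the published scripts.

from urllib.parse import quote

BASE = "http://purl.org/net/tcm/tcm.lifescience.ntu.edu.tw/id/"

def mint_uri(entity_type, identifier):
    # Replace spaces so that names such as "Ginkgo biloba" yield stable,
    # readable URIs, and percent-encode any remaining unsafe characters.
    slug = quote(identifier.strip().replace(" ", "_"))
    return BASE + entity_type + "/" + slug

assert mint_uri("medicine", "Ginkgo biloba") == \
    "http://purl.org/net/tcm/tcm.lifescience.ntu.edu.tw/id/medicine/Ginkgo_biloba"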
Design ontologies
Ontologies can be used as a controlled vocabulary to define the types of entities in a dataset and the types of relationships between them, and to achieve a consistent interpretation across different datasets. A rich body of biological ontologies has been created and accumulated over the years [39]. When designing ontologies for describing linked datasets, we should reuse existing ontologies as much as possible. When a new ontology must be created, a conservative and incremental approach is recommended. Many linked datasets are published by a third party rather than by the data provider, and documentation about these datasets is not always available. Imposing personal interpretations on the semantics of the data and its schema could introduce errors and should be avoided.
As the data structure of TCMGeneDIT is very simple and no TCM ontology was known to us at the time the dataset was created, we created a simple CM ontology in OWL (http://purl.org/net/tcm-onto/). The ontology contains seven classes: Gene, Medicine, Disease, Ingredient, Effect, Association and Statistics. Each entity of type Statistics describes the statistical confidence in the associations between entities. Each entity of type Association represents an association between a Medicine, a Gene and a Disease. There are six object properties in total: five of them relate a Medicine to a Gene, a Disease, its Ingredient or its Effect, and the last one, tcm:source, points to the entities whose association relationship is described by a Statistics entity. There are five data properties, whose domain is Statistics and whose values represent the statistical confidence in an association. For example, the value of tcm:medicine_effect_association_tvalue represents the confidence in the association between a Medicine and its Effect. A diagram capturing the structure of the ontology is shown in Figure 1; note that the data properties associated with the Statistics class are not shown in the figure.
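For illustration, a fragment of this ontology can be reconstructed with rdflib as in the sketch below. Only a few of the terms described above are declared, and the domain axioms shown are assumptions for the sketch rather than a rendering of the published OWL file.

from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

TCM = Namespace("http://purl.org/net/tcm-onto/")
g = Graph()
g.bind("tcm", TCM)

# The seven classes of the CM ontology.
for cls in ("Gene", "Medicine", "Disease", "Ingredient",
            "Effect", "Association", "Statistics"):
    g.add((TCM[cls], RDF.type, OWL.Class))

# tcm:source points from a Statistics entity to the entities whose
# association relationship it describes.
g.add((TCM.source, RDF.type, OWL.ObjectProperty))
g.add((TCM.source, RDFS.domain, TCM.Statistics))

# A data property carrying the statistical confidence of a
# medicine-effect association.
g.add((TCM.medicine_effect_association_tvalue, RDF.type, OWL.DatatypeProperty))
g.add((TCM.medicine_effect_association_tvalue, RDFS.domain, TCM.Statistics))

print(g.serialize(format="turtle"))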
A Statistics entity was used to describe the statistical value of an association. Associations relating more than two entities, such as the medicine-gene-disease association, cannot be expressed as a single RDF triple. To capture such n-ary relationships, we created Statistics entities that link together every entity involved in an association (see the example below) and express the statistical value of the association through data properties such as tcm:medicine_effect_association_tvalue. Different types of data properties were created for different types of associations.
<http://purl.org/net/tcm/tcm.lifescience.ntu.edu.tw/id/statistics/19087>
    a tcm:Statistics ;
    tcm:source <http://purl.org/net/tcm/tcm.lifescience.ntu.edu.tw/id/medicine/Acanthopanax_gracilistylus> ;
    tcm:source <http://purl.org/net/tcm/tcm.lifescience.ntu.edu.tw/id/disease/Retinoblastoma> ;
    tcm:source <http://purl.org/net/tcm/tcm.lifescience.ntu.edu.tw/id/gene/CDK2> ;
    tcm:medicine_gene_disease_association_tvalue "1.414"^^xsd:float .
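A transformation script can emit this n-ary pattern as sketched below with rdflib; the helper name, the row values and the sequential-id counter are illustrative assumptions.

from itertools import count

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

TCM = Namespace("http://purl.org/net/tcm-onto/")
BASE = "http://purl.org/net/tcm/tcm.lifescience.ntu.edu.tw/id/"
next_id = count(19087)  # sequential ids assigned by the transformation program

def emit_association(g, medicine, gene, disease, tvalue):
    stat = URIRef(f"{BASE}statistics/{next(next_id)}")
    g.add((stat, RDF.type, TCM.Statistics))
    # One tcm:source link per entity involved in the n-ary association.
    g.add((stat, TCM.source, URIRef(BASE + "medicine/" + medicine)))
    g.add((stat, TCM.source, URIRef(BASE + "gene/" + gene)))
    g.add((stat, TCM.source, URIRef(BASE + "disease/" + disease)))
    g.add((stat, TCM.medicine_gene_disease_association_tvalue,
           Literal(tvalue, datatype=XSD.float)))

g = Graph()
g.bind("tcm", TCM)
emit_association(g, "Acanthopanax_gracilistylus", "CDK2", "Retinoblastoma", 1.414)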
Data transformation
Data transformation should be incremental and test-driven. When transforming a new dataset into RDF or writing the configuration files for virtualization, developers should start with a small subset rather than transforming the complete dataset. Loading a large number of RDF triples into an RDF store, or retrieving very complex RDF descriptions of data entities by query rewriting, can be very time-consuming and can block the execution of follow-on tests. A test framework should be designed up front to spot problems with the test data and to ensure the sanity of the dataset: no blank nodes, no URIs containing invalid characters (e.g. spaces), no incorrect property cardinalities and no missing property values. These principles were applied when the relational TCMGeneDIT database was transformed into RDF.
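The following is a minimal sketch of such sanity checks over an rdflib graph; the specific cardinality rule (every Statistics entity must carry at least one tcm:source link) is an assumption chosen for illustration.

from rdflib import BNode, Graph, Namespace, URIRef
from rdflib.namespace import RDF

TCM = Namespace("http://purl.org/net/tcm-onto/")

def check_graph(g):
    problems = []
    for s, p, o in g:
        # No blank nodes anywhere in the dataset.
        if isinstance(s, BNode) or isinstance(o, BNode):
            problems.append(f"blank node in triple: {s} {p} {o}")
        # No URIs containing invalid characters such as spaces.
        for term in (s, p, o):
            if isinstance(term, URIRef) and " " in str(term):
                problems.append(f"URI contains a space: {term}")
    # Assumed cardinality rule: every Statistics entity has a tcm:source.
    for stat in g.subjects(RDF.type, TCM.Statistics):
        if (stat, TCM.source, None) not in g:
            problems.append(f"Statistics entity without tcm:source: {stat}")
    return problems

g = Graph()
g.parse("rdf-tcm-sample.nt", format="nt")  # test a small subset first
for problem in check_graph(g):
    print(problem)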
Data linking
Links between datasets can be expressed in RDF. These links either reflect a type of relationship between entities or state a reconciliation between URIs published by different authorities. An example of the first kind is associating drugs from dataset D1 with genes from dataset D2 through a property such as ex:targets. Properties such as owl:sameAs or rdfs:seeAlso can be used to state identity reconciliation. These RDF links allow users and Linked Data applications to start from one dataset and follow the links to move through a potentially endless web of data.
These data links can be created either during or after the creation of a linked dataset. Commonly, relating entities to another dataset (e.g. through ex:targets) is achieved as part of the transformation script, while mapping URIs between different datasets takes place after a dataset is published and may be carried out either by the publishers or by third parties.
The links may be created manually or automatically with open-source tools such as Silk [34]. However, identity reconciliation between biological entities is known to be difficult, and string matching is not always sufficient or reliable [40]. Developers should look for existing authoritative name mappings curated by data providers. Identifying the reference databases used by the source databases can also improve the precision of the mapping. For example, knowing that the gene names used by TCMGeneDIT come from NCBI Entrez Gene for human, we can reduce the ambiguity of the mapping to the Entrez Gene datasets previously published by Neurocommons and Bio2RDF.
Extra attention should be given to any many-to-many mappings between URIs in the results. Manual cleaning of these mappings is highly recommended, requiring either the participation of domain experts or contextual knowledge that is difficult to express in computer programs.
The gene entities in the RDF-TCM dataset were linked to those in the NCBI Entrez Gene linked dataset [41] published by Neurocommons and to those in the STITCH linked dataset [42] published by the Freie Universität Berlin. The gene mapping was constructed with customized Python scripts based on the labels of the genes. The mapping to Entrez Gene showed that 849 of the 945 RDF-TCM genes had a one-to-one mapping to an Entrez gene, 95 had a many-to-many mapping, and one was not mapped. The mapping to STITCH showed that 539 of 943 mapped genes had a one-to-one mapping to a STITCH gene, 404 had a many-to-many mapping, and two were not mapped. The many-to-many mappings were corrected manually so that only one-to-one mappings remained in the results. We selected some sample data to manually confirm the correctness of the automatically generated one-to-one mappings; however, these automatic gene mappings were not thoroughly evaluated, which is a limitation of this work.
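The sketch below captures the spirit of such a label-based mapping: one-to-one label matches are emitted as owl:sameAs links, while many-to-many matches are set aside for manual review. The input format (a dictionary from normalized gene label to the URIs bearing that label) and the function name are assumptions, not the published scripts.

from rdflib import Graph, URIRef
from rdflib.namespace import OWL

def link_genes(tcm_genes, entrez_genes):
    # Both arguments map a normalized gene label to the list of URIs
    # carrying that label in the respective dataset.
    links, ambiguous = Graph(), []
    for label, tcm_uris in tcm_genes.items():
        entrez_uris = entrez_genes.get(label, [])
        if len(tcm_uris) == 1 and len(entrez_uris) == 1:
            links.add((URIRef(tcm_uris[0]), OWL.sameAs, URIRef(entrez_uris[0])))
        elif entrez_uris:
            # Many-to-many matches are flagged for manual correction.
            ambiguous.append((label, tcm_uris, entrez_uris))
    return links, ambiguous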
To link RDF-TCM with various other linked datasets from LODD, we used Silk as part of the LODD project [23]. The mapping results produced by Silk have not been formally evaluated, but the correctness and completeness of Silk's approach were evaluated on other test datasets [34].
Data documentation
To improve the visibility of a dataset to Linked Data search engines such as Sindice, we recommend that data publishers describe their datasets using vocabularies such as the Vocabulary of Interlinked Datasets (voiD) [43] or the Provenance Vocabulary [44]. voiD is an RDF vocabulary for describing linked datasets on the Web, facilitating both the discovery of these datasets and query federation applications. The Provenance Vocabulary is the first vocabulary to describe both the data creation and the data access processes related to a dataset on the Web.
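For illustration, a minimal voiD description along these lines can be produced with rdflib as sketched below; the dataset identifier, title, homepage and triple count are placeholders, and the published file referenced below is the authoritative description.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, FOAF, RDF, XSD

VOID = Namespace("http://rdfs.org/ns/void#")

ds = URIRef("http://www.open-biomed.org.uk/void/rdf-tcm.ttl#rdf-tcm")  # placeholder id
g = Graph()
g.bind("void", VOID)
g.bind("dcterms", DCTERMS)

g.add((ds, RDF.type, VOID.Dataset))
g.add((ds, DCTERMS.title, Literal("RDF-TCM")))
g.add((ds, FOAF.homepage, URIRef("http://www.open-biomed.org.uk/")))  # placeholder
g.add((ds, VOID.triples, Literal(100000, datatype=XSD.integer)))      # placeholder count

print(g.serialize(format="turtle"))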
A voiD file was published for RDF-TCM at http://www.open-biomed.org.uk/void/rdf-tcm.ttl, and the provenance of each RDF-TCM entity was described with the Provenance Vocabulary and published with Pubby [45], a Linked Data publication tool extended with a provenance component. We also published all our Python scripts for transforming the database dump into RDF and for linking RDF-TCM to other datasets. All the scripts can be found at http://code.google.com/p/junsbriefcase/source/browse/#svn/trunk/biordf2009_query_federation_case/tcm-data.