Tuesday, March 24, 2009

A Publisher's Introduction to the Semantic Web

by Ed Stevenson
May 2005
Although the concept of the Semantic Web began to surface in the early part of this century (see this early Scientific American article from Tim Berners-Lee, James Hendler, and Ora Lassila), it is now reaching buzzword status. Some publishers have been using technologies considered to be "semantic" for a few years, but it is a new topic for many others.
What do the concepts behind the Semantic Web mean for publishers? Well, first let's ask: what is the Semantic Web? The point of the Semantic Web is a richer interconnectedness among all objects (or content), allowing us to pull data from various sources to discover new meaning and present it in different formats. A simpler view is that the Semantic Web makes better use of metadata. That is, all objects on the web are assigned rich data to describe themselves (in a universal and standardized format), and tools are better able to make use of that data.
Creating richer metadata
Almost all publishers use metadata in some capacity. Most also use taxonomies (a hierarchy of terms used to categorize content), although they might not call them by that name. The next step beyond that is the use of ontologies. Just as taxonomies make metadata or controlled vocabularies look "flat," ontologies do the same to taxonomies. Ontologies describe more detailed relationships among concepts and provide a higher level of richness in the metadata.
Taxonomies are just like the animal and plant kingdom taxonomies, in which every species lives in a particular branch. However, other, more conceptual objects don't always have that clear lineage. If we created a taxonomy based on colors with the three primary colors (red, yellow, and blue) as the top nodes, orange would need to be related to both red and yellow. In a simple taxonomy, we would probably repeat the term "orange" under both, but in a technical sense they would really be two distinct nodes that have the same name.
In an ontology, orange can be represented as the exact same concept appearing in multiple nodes on the tree. In fact, an ontology is not a tree at all. It is a complex mapping of concepts with defined relationships between those concepts (such as "part of" or "subclass of").
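The difference between the two structures can be sketched in a few lines of Python. This is only a toy illustration: the relationship name "mixture_of" and the node labels are made up for the example and do not come from any standard vocabulary.

```python
# Toy comparison of a strict tree and an ontology-style graph.
# All labels and relationship names are illustrative assumptions.

# In a strict tree each node has exactly one parent, so "orange"
# has to be stored twice, as two unrelated nodes with similar names.
tree_parent = {
    "orange (under red)": "red",
    "orange (under yellow)": "yellow",
}

# In a graph, a single "orange" concept carries several typed relationships.
ontology = [
    ("orange", "mixture_of", "red"),
    ("orange", "mixture_of", "yellow"),
    ("red", "subclass_of", "primary color"),
    ("yellow", "subclass_of", "primary color"),
    ("blue", "subclass_of", "primary color"),
]


def related(concept, relation, edges):
    """Return every target linked from `concept` by `relation`, sorted."""
    return sorted(o for s, p, o in edges if s == concept and p == relation)


print(related("orange", "mixture_of", ontology))  # ['red', 'yellow']
```

Because the graph stores one "orange" node, anything tagged with it is reachable from both red and yellow without duplicating the term.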
In their most expanded use, ontologies can in themselves be valuable collections of information and almost become database-like in nature. Imagine an ontology that captures court "metadata" for a legal publisher. That publisher may currently have a taxonomy with branches for federal courts, district courts, state courts, etc. But in this "flat" taxonomy, there is probably no relationship captured among the local, district, and state courts, or between the courts and geographical boundaries like state or congressional district lines. In an ontology those relationships can be established. Of course, documents are still tagged to nodes in the ontology, but even without the documents, the ontology becomes a very valuable piece of content.
RDF and OWL: expressing richer metadata with W3C standards
The W3C standard framework for expressing metadata (including taxonomies and ontologies) is RDF (Resource Description Framework).
RDF provides a standard framework for expressing information about resources (metadata) that allows for complex definition of relationships, polyhierarchical taxonomies (giving a node multiple parents), and the ability to combine taxonomies (by connecting a detailed taxonomy to a broader taxonomy through a common node). The purpose of RDF is to create a syntax to capture rich metadata and relationships and allow the processing of this data by applications.
The RDF data model expresses relationships among resources in what is called "triples." These triples define two things and the relationship between them. Each triple consists of a subject, a predicate and an object (sometimes called the resource, property, and value). The subject (or resource) is the "thing" the statement is about, the predicate (or property) specifies a characteristic or property of the subject, and the object (or value) is the value of that characteristic or property.
The following illustration is an RDF graph representing a triple that illustrates the simple metadata value of the author of this newsletter article:
Where RDF gets interesting is when you start to combine triples, such as making the author the subject of another triple describing his email address or company affiliation.
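As a rough sketch of how combined triples work, the Python below stores each statement as a (subject, predicate, object) tuple and then follows a chain of predicates from the article to its author's details. The example.org URIs, the "name" and "email" properties, and the email address are placeholders invented for illustration; only the Dublin Core "creator" URI is a real published term.

```python
# Each triple is (subject, predicate, object).
# The example.org identifiers and the email value are hypothetical placeholders.
ARTICLE = "http://example.org/newsletter/semantic-web-intro"
AUTHOR = "http://example.org/people/ed-stevenson"

DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"  # real Dublin Core term
EX_NAME = "http://example.org/terms/name"               # assumed property
EX_EMAIL = "http://example.org/terms/email"             # assumed property

triples = [
    (ARTICLE, DC_CREATOR, AUTHOR),          # the article has this author
    (AUTHOR, EX_NAME, "Ed Stevenson"),      # the author is now a subject
    (AUTHOR, EX_EMAIL, "author@example.org"),
]


def objects(graph, subject, predicate):
    """Return all objects of triples matching the subject and predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]


def follow(graph, start, path):
    """Walk a chain of predicates from `start` and return the end values."""
    current = [start]
    for predicate in path:
        current = [o for node in current for o in objects(graph, node, predicate)]
    return current


print(follow(triples, ARTICLE, [DC_CREATOR, EX_NAME]))  # ['Ed Stevenson']
```

The object of the first triple is the subject of the next two, which is what lets an application hop from a document to facts about its author.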
The PRISM metadata standard can often be expressed in RDF, many RSS feeds use RDF syntax, and Adobe's XMP (eXtensible Metadata Platform) for embedding metadata within media objects makes use of RDF.
But being a structured framework, RDF is more syntax (structure) than semantics (meaning). OWL (Web Ontology Language) is the W3C effort to provide a standard for the types of relationships that can be expressed in RDF. OWL provides an XML vocabulary to express hierarchies and relationships. OWL introduces specific property vocabularies, such as "sameAs" and "intersectionOf." OWL provides a shared meaning in the RDF syntax.
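To give a feel for what a shared meaning buys you, here is a small Python sketch of one thing an application might do with "sameAs": treat two identifiers as the same individual and pool the statements made about each. It is a toy, not an OWL reasoner. The example.org and example.com identifiers and the "name" and "wrote" properties are invented for illustration; the owl:sameAs URI is the real one.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Two sources use different identifiers for the same person (hypothetical URIs).
triples = [
    ("http://example.org/people/ed-stevenson",
     "http://example.org/terms/name", "Ed Stevenson"),
    ("http://example.com/authors/42",
     "http://example.org/terms/wrote",
     "http://example.org/newsletter/semantic-web-intro"),
    ("http://example.org/people/ed-stevenson",
     OWL_SAME_AS, "http://example.com/authors/42"),
]


def canonical_map(graph):
    """Map each identifier in a sameAs statement to one group representative."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            x = parent[x]
        return x

    for s, p, o in graph:
        if p == OWL_SAME_AS:
            root_s, root_o = find(s), find(o)
            if root_s != root_o:
                # Pick the alphabetically smaller root as the representative.
                parent[max(root_s, root_o)] = min(root_s, root_o)
    return {x: find(x) for x in list(parent)}


def merge_same_as(graph):
    """Rewrite triples so equivalent identifiers share one representative."""
    canon = canonical_map(graph)
    return {
        (canon.get(s, s), p, canon.get(o, o))
        for s, p, o in graph
        if p != OWL_SAME_AS
    }


for triple in sorted(merge_same_as(triples)):
    print(triple)
```

After merging, the name and the authorship statement hang off a single identifier, even though they came from sources that never agreed on one.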
Topic Maps: expressing richer metadata with ISO standards
In semantic circles, there is often discussion about RDF vs. topic maps. In most conceptual ways topic maps are very similar to RDF, with some slight and subtle distinctions. The two have different origins. Whereas RDF came through the W3C, topic maps are an ISO standard and arose to address the need to create indexes (like back-of-the-book indexes). Topic maps' prime focus is on the topics (or subjects); RDF focuses on the resources. Although they were created for somewhat different purposes, both do very similar things.
Topic maps describe topic structures and associate them with resources. Like RDF, topic maps break from the traditional hierarchical taxonomy and offer much more robust classification, indexing, and relationship descriptions. Topic maps allow for the creation of complex topical descriptions which then point out to resources. There is a separation between the topical information (the index) and the content which is associated with specific topics within it.
The topic maps "language" uses topics, occurrences, and associations in its model, where the topic is the resource (the thing or the subject), the occurrence is the resource that has some association with the topic, and the association is a type of relationship. You can see from a very high level the similarities with the RDF model. Note that in topic maps the association is two-way. That is, if my topic is this article, the association is "is authored by," and the occurrence is "Ed Stevenson," then the inverse is also true: Ed Stevenson (topic) authored (association) this article (occurrence).
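The two-way nature of an association can be sketched as follows. This follows the simplified description in this article rather than the full ISO data model, and the class and method names are invented for the example.

```python
from collections import defaultdict


class TinyTopicMap:
    """A toy store in which every association can be read from both ends."""

    def __init__(self):
        self._links = defaultdict(set)

    def associate(self, topic, role, other, inverse_role):
        """Record one association and its inverse in a single step."""
        self._links[topic].add((role, other))
        self._links[other].add((inverse_role, topic))

    def related(self, topic):
        """Return the (association, other end) pairs for a topic, sorted."""
        return sorted(self._links.get(topic, set()))


tm = TinyTopicMap()
tm.associate("this article", "is authored by", "Ed Stevenson", "authored")

print(tm.related("this article"))  # [('is authored by', 'Ed Stevenson')]
print(tm.related("Ed Stevenson"))  # [('authored', 'this article')]
```

A single call to associate() makes the link navigable from either side, whereas a plain triple is stated in one direction only.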
It is beyond the scope of this introductory article to fully explore the differences between the two, and much work has been done in that area. Additionally, the W3C has started an RDF/Topic Maps Interoperability Task Force to look for interoperability between the two. See http://www.w3.org/2001/sw/BestPractices/RDFTM/ for more information.
First steps for publishers
So if you never knew about the Semantic Web but now have the overview, what should you do next? It can be difficult to take the intellectual concepts behind the Semantic Web and apply them to practical day-to-day use in a publishing process. But it is important to be aware of the issues and the potential they have. The following are a few suggestions on preparing your publishing organization for the Semantic Web.
Consider using RDF if you are implementing or re-engineering or enhancing any process capturing metadata. Even if the full power of RDF is not harnessed initially (which it most likely will not be), starting down the path is the first step. There will be more tools and reusable code for translating RDF.
If you are defining or building metadata (or taxonomies or ontologies) look to industry standards first. Consider Dublin Core and PRISM for basic metadata and look to industry specific standards as well (like IEEE and NISO). Borrow where you can.
Value librarians on your staff. They are important now, but are generally undervalued. They will be critical in expanding taxonomies into ontologies and managing the complex relationships.
See the links mentioned in the interview with David Wood in this issue to get some hands-on experience with Semantic Web tools.
Keep an eye on Semantic Web topics through news groups, publications, web sites, etc. Some Semantic Web experts predict the health sciences industry will be among the first adopters of semantic technologies. See the next section for good resources to learn more of the basics as well as stay on top of the latest news and trends.
Monitor the actions of publishers who are using Semantic Web technologies and how they make them worthwhile.
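As one small, hedged example of the first two suggestions above, the Python below takes basic metadata for this article and writes it out as RDF statements in the line-based N-Triples syntax, using Dublin Core element names. The subject URI is a made-up placeholder and the helper functions are illustrative only; a production process would more likely rely on an established RDF library.

```python
DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace


def escape_literal(text):
    """Escape the characters that N-Triples does not allow raw in a literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_ntriples(subject_uri, metadata):
    """Turn a dict of Dublin Core element names and values into N-Triples."""
    lines = []
    for element, value in metadata.items():
        lines.append(f'<{subject_uri}> <{DC}{element}> "{escape_literal(value)}" .')
    return "\n".join(lines)


# The subject URI is a hypothetical placeholder for wherever the article lives.
record = {
    "title": "A Publisher's Introduction to the Semantic Web",
    "creator": "Ed Stevenson",
    "date": "2005-05",
}
print(to_ntriples("http://example.org/newsletter/semantic-web-intro", record))
```

Even a flat record like this is already a set of triples, so it can later be combined with richer statements without changing format.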
Further reading
In addition to the links found within the document, the following offer good information on the Semantic Web (and were consulted for this article):
W3C RDF Primer
TAO of Topic Maps
O'Reilly's XML.com offers plenty of informative articles all grouped together under a link from the main left navigation: http://www.xml.com/semweb/
