hobbs's blog

Approaches to Exposing Institutional Data and Other Content

I've been thinking about and researching how an institution can share its data, documents, and other content.  Obviously your data and content are already exposed via the web, but providing the data in a more structured way allows more users (both internal and external) to manipulate the data in interesting ways, for example in mashups.  There seem to be a few ways to share data from an enterprise with a lot of content:

  • Straight RSS/Atom.  Although straight RSS/Atom (with no custom extensions / namespaces) may not be that interesting, it's obviously a useful way to get your content out there.  Typically straight RSS/Atom is fairly time-based and might in effect show some history (news items like "John goes to work" and then "John goes home") rather than some state (like "John is now home"). 
  • Common repositories / services such as Swivel and StrikeIron.  Rather than exposing your data/content directly to the outside world from your site/servers, you can use an intermediary.  Swivel allows users to create their own graphs on data from either official sources or any user-supplied data.  StrikeIron is built into mashup editors like QEDwiki, and has also built an extension to Excel to call their services.  You probably would want to provide data to these services through an API of your own, but you could get started with Swivel, for example, by directly uploading the data. 
  • Specialized XML formats for particular types of content.  Examples include OpenSearch for search results and SDMX for statistical data.  These specific XML formats both allow a level of sophistication for people specializing in your type of content and allow tools built for this type of data to consume it.  This fits in with the following item, which, for historical reasons, may or may not be XML-based. 
  • Institution-to-institution services.  Sometimes you need to provide a point-to-point interface with another institution.  In that case, you may need to support all sorts of unusual formats and delivery mechanisms.  Hopefully you could leverage your various systems' web services to just transform the data into the formats you need. 
  • A common API that your institution follows across all types of content.   This one is the most interesting to me and one that I alluded to in my previous post on interaction publishing.  Especially if your institution has various repositories, one possible approach would be to slap up a page that has links to the different instructions for referencing each.  But, to make access as easy as possible, a common API with consistent parameters that can be queried against all systems would be preferable (for instance, queries such as "give me all your documents and data on Chad" via url requests like http://xml.example-domain.com/apis/type=docs,data&country=td).  Potentially the returned XML could be in a simple format such as RSS extended with a custom namespace (so that other tools such as Yahoo Pipes, and even feedreaders, could easily consume the data).
  • Microformats.  Probably most useful to future browsers or other tools like the Firefox Operator extension (or for services that crawl sites such as Google), microformats allow you to just change your existing HTML a bit to expose very common types of data like address and calendar events.  For example, instead of your HTML having "100 Main Street, Anytown, USA" it would be marked up as "<div class="adr"> <div class="street-address">100 Main Street</div>, <span class="locality">Anytown</span>, <div class="country-name">USA</div></div>" and then define the CSS to show it as you wish.  For example (with sloppy CSS):
    100 Main Street

    Anytown,

    U.S.A.

    See how this page appears in Firefox Operator (also notice the tagspaces):

 Screenshot of how example adr microformat works in Firefox Operator
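
To give a feel for how a crawler or browser extension might pick up the adr markup above, here is a minimal sketch using only Python's standard library. The class name AdrParser and the ADR_FIELDS set are assumptions made for this illustration (they are not part of any microformats tool), and the parser is deliberately naive: it assumes every opened element is also closed.

```python
from html.parser import HTMLParser

# Property class names from the adr microformat that this sketch looks for.
ADR_FIELDS = {"street-address", "locality", "region", "postal-code", "country-name"}


class AdrParser(HTMLParser):
    """Collects the text of adr properties found in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        # One entry per open element: the adr field it carries, or None.
        # Assumes well-formed markup with no unclosed elements such as <br>.
        self._stack = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        field = next((c for c in classes if c in ADR_FIELDS), None)
        self._stack.append(field)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Text counts only if the innermost open element is an adr field.
        if self._stack and self._stack[-1]:
            field = self._stack[-1]
            self.fields[field] = self.fields.get(field, "") + data.strip()


html = (
    '<div class="adr"> <div class="street-address">100 Main Street</div>, '
    '<span class="locality">Anytown</span>, '
    '<div class="country-name">USA</div></div>'
)

parser = AdrParser()
parser.feed(html)
print(parser.fields)
```

The point is only that the address parts become individually addressable once the class names are in place; a real consumer would use a proper microformats parser.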

 

Taxonomy Mappings: Be Careful When Integrating

Sometimes you need to pull content from multiple systems into a single page, and you want to pull from both systems based on some metadata, perhaps by topic.  For instance, let's say you have a site that needs to pull data from a document repository and a news archive, and you want the user to use a pulldown to select the topic they want to filter the content by (for example, "Politics", "Entertainment", "Travel", "Europe", and other topics).  Sometimes the two systems will share the same list of topics out of the box, but more often than not they will not.

One deceptively simple approach when systems do not share the same list of topics is to have some sort of mapping between the taxonomies of the two systems (for instance, "Travel" = "Vacations", "Politics" = "Domestic Politics", "Europe" = "EU", etc).  I fairly frequently hear something like this when discussing integration between different systems: "We have a mapping between these topics, so there shouldn't be any problem."  But just because you have a mapping doesn't mean that it will be satisfactory for combining information from multiple systems.  I thought it would be helpful to think through the issues some and write out some examples.

One taxonomy's controlled vocabulary being more specific than another

Let's say you've got some content in two systems that you want to pull into one page.  Perhaps you want to find out all the fathers in both systems.  If the taxonomies available were the following (and you didn't have other metadata on gender, for example), then you could not do this:

"Relationship" values on site one | "Relationship" values on site two
Father                            | Parent
Mother                            |
Sister                            | Sibling
Brother                           |

A simple and meaningful mapping between the two would be something like this (allowing you to find all the people across systems that are a parent, for example):

Father or Mother -> Parent
Sister or Brother -> Sibling

Note that the other direction makes no sense (it's tough to be both a sister and brother, and you wouldn't know which to pick when translating between systems).  So, although you may have a mapping between the systems, it does NOT necessarily enable the types of queries you want to do.
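
A minimal sketch of why the direction matters, using the values from the table above. The record names and the two helper functions are hypothetical and exist only for this illustration.

```python
# Specific-to-general mapping from site one's vocabulary to site two's.
SITE_ONE_TO_SITE_TWO = {
    "Father": "Parent",
    "Mother": "Parent",
    "Sister": "Sibling",
    "Brother": "Sibling",
}

# Hypothetical tagged records: (name, relationship value in that system).
site_one = [("Alice", "Mother"), ("Bob", "Father"), ("Carol", "Sister")]
site_two = [("Dave", "Parent"), ("Erin", "Sibling")]


def find_across_systems(general_value):
    """Return names from both systems tagged with a site-two (general) value."""
    from_one = [n for n, rel in site_one if SITE_ONE_TO_SITE_TWO[rel] == general_value]
    from_two = [n for n, rel in site_two if rel == general_value]
    return from_one + from_two


def candidates_in_site_one(general_value):
    """Invert the mapping: one general value fans out to several specific ones."""
    return sorted(s for s, g in SITE_ONE_TO_SITE_TWO.items() if g == general_value)


print(find_across_systems("Parent"))     # ['Alice', 'Bob', 'Dave']
print(candidates_in_site_one("Parent"))  # ['Father', 'Mother'], so no single answer
```

The query for parents works across both systems, while translating "Parent" back to site one gives two candidates, which is why a query for fathers cannot be answered from site two.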

A slightly more realistic example

Of course that was a simplified example to illustrate the point, and you usually have overlapping taxonomies, something like the following (still a forced example though):

"Location" in system one          | "Location" in system two
SF Bay Area                       |
Palo Alto                         | Silicon Valley
(other cities in Silicon Valley)  |
Richmond                          | East Bay
(other cities in East Bay)        |
Sausalito                         | Marin County
(other cities in Marin County)    |
San Francisco                     | San Francisco
South San Francisco               |

Let's say these two systems had a selection of companies tagged to these controlled vocabularies.  What kinds of queries would probably be meaningful?

  • All companies in Silicon Valley
  • All companies in East Bay
  • All companies in Marin County
  • All companies in the San Francisco Bay area

Obviously, you couldn't query on cities since system two has virtually no cities.  But what about San Francisco?  Isn't that on both lists?  Although at first blush it may seem that you could find all companies in San Francisco across both systems, looking at the list more carefully it becomes apparent that the two values almost certainly have different meanings: the first taxonomy only has the broad San Francisco Bay Area and then cities, and the second taxonomy is just listing areas within the San Francisco Bay Area.  So San Francisco in system two probably includes San Francisco proper as well as, for example, South San Francisco.  So you can do this query (but *not* query on all companies in San Francisco):

  • All companies in San Francisco and the immediate area (including South San Francisco)
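
One way to picture this is a sketch in which system one's cities are rolled up to system two's areas, and the ambiguous value gets an explicit label. The company names, the CITY_TO_AREA and AREA_LABELS tables, and the helper functions are all made up for this illustration.

```python
# Hypothetical roll-up from system one's cities to system two's areas.
CITY_TO_AREA = {
    "Palo Alto": "Silicon Valley",
    "Richmond": "East Bay",
    "Sausalito": "Marin County",
    "San Francisco": "San Francisco",
    "South San Francisco": "San Francisco",
}

# Labels shown to the user, so "San Francisco" is not mistaken for the city proper.
AREA_LABELS = {"San Francisco": "San Francisco and the immediate area"}

# Hypothetical companies tagged in each system: (company, location value).
system_one = [
    ("Acme", "Palo Alto"),
    ("Bolt", "San Francisco"),
    ("Crate", "South San Francisco"),
    ("Fern", "SF Bay Area"),  # too broad to place in any single area
]
system_two = [("Dyno", "San Francisco"), ("Echo", "East Bay")]


def companies_in_area(area):
    """Companies from both systems that fall in one of system two's areas."""
    from_one = [name for name, loc in system_one if CITY_TO_AREA.get(loc) == area]
    from_two = [name for name, loc in system_two if loc == area]
    return sorted(from_one + from_two)


def label_for(area):
    """Label to show in a pulldown; falls back to the raw area name."""
    return AREA_LABELS.get(area, area)


print(label_for("San Francisco"), "->", companies_in_area("San Francisco"))
```

Note that the company tagged only with "SF Bay Area" drops out of every area-level query; it can only appear in a query for the whole Bay Area.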

Part of the issue is that often you have much larger taxonomies that are more difficult to analyze (for example, for a taxonomy that includes all cities in California, or the US).  It would be very difficult to go through and determine the meaning of the different values of the taxonomy.

What to do?

In practice, you probably won't be able to deeply analyze all the mappings between your systems, so you'll have a mapping but might only have a feel for how good it is (and in what direction).  Perhaps the most dangerous mapping, and one that is hopefully fairly easy to identify, is from a more general taxonomy to a more specific one (the first example above); it should be avoided entirely.  Of course, if the systems do not even have taxonomies that are close, then this will be obvious and will require changes to at least one of the source systems.  The second example above (overlapping but not quite lining up) might be a type of taxonomy matching that's not that bad but just requires documentation/labeling (just use "San Francisco and South San Francisco" in a pulldown to select areas of the Bay Area) and careful design (obviously don't allow the user to select a city if you need data from both systems, or clearly show in the results that the information is just from System 1).  But figuring out the relationships between the taxonomies might take a lot of work.  Some potential general approaches: be careful and a) where possible, try to use globally well-understood and clear values (like zip codes, lat/long, ISO country codes, etc.) rather than fall into the trap of trying to combine two taxonomies just because they're called the same thing (this is probably easier for something like a location than a topic), b) force all systems to tag to a neutral reference source (this could, with a lot of work in defining rules, be automated with something like Teragram, for example), or c) seek out a metadata expert, since they have some best practices for mapping between flat or networked taxonomies.  Also, when you are designing a system in the first place (even before being faced with a new integration), try if possible to tag your content's metadata to well-understood values (especially easy for geographic tagging).
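
As a small sketch of approach (a), here both systems are tagged to ISO 3166-1 alpha-2 country codes rather than to each other. The system labels, document titles, and the by_country helper are hypothetical.

```python
# Each system's own country labels mapped to ISO 3166-1 alpha-2 codes.
# The labels and content titles below are made up for this sketch.
DOCS_TO_ISO = {"Chad": "TD", "France": "FR"}
NEWS_TO_ISO = {"Republic of Chad": "TD", "French Republic": "FR"}

documents = [("Water report", "Chad"), ("Trade brief", "France")]
news_items = [("Drought update", "Republic of Chad")]


def by_country(iso_code):
    """Pull content from both systems using the shared ISO code as the key."""
    docs = [title for title, label in documents if DOCS_TO_ISO.get(label) == iso_code]
    news = [title for title, label in news_items if NEWS_TO_ISO.get(label) == iso_code]
    return {"documents": docs, "news": news}


print(by_country("TD"))
```

Neither system has to know the other's vocabulary; each only has to know how its own labels relate to the neutral reference.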

Interesting 2007 and Opportunities for 2008

The following seemed noteworthy and interesting enough in 2007 to highlight (although most were not new in 2007):

I also see a lot of opportunities for improvement in 2008:

  • Continued performance improvements. Ajax and more widespread javascript (such as pulldown/popup menus rather than having whole new pages load for many menus) have helped many sites speed up, but let's face it, in general using web sites is fairly slow. One example site that's particularly slow in the UI is Rhapsody, which I use every day, but the performance has just got to improve (an aside: basic cellphone call quality still isn't at acceptable performance levels in my opinion, but they keep improving year by year). Hopefully Adobe AIR (previously Apollo) or some other runtime environment like it will help deploy applications that interact directly with local files and with fewer server trips for a dramatically faster experience for our users. Also see Yahoo's Exceptional Performance resources for ways to speed up existing pages.
  • More sophisticated offshoring models. The naive view of offshoring goes something like this: if someone costs $X per hour in your country and $X/3 per hour in another country, then it would seem obvious to give the work to the offshore resource. Sometimes this works. Highly repeatable tasks are the most obvious (for example call centers). Also, it often works when you can hand off a specifications document and then wait for the implementation, although this Wall Street Journal article (subscription required) on the outsourcing problems of the 787 points out interesting issues there too: your outsourced suppliers outsourcing to their own suppliers, quality control/process issues, and taking for granted expertise/background built inside Boeing when handing off to suppliers. If the task isn't highly repeatable or very tightly specified, then the overhead of communications/management is very high. I would also expect that places that are currently considered "offshore" will be developing innovative products themselves (see this blog post: State of Innovation in India).
  • Improvements in single sign on and passwords. If I go to Amazon and then B&H now, I have to log on twice. Worse, if I go to very small sites I have to create a separate username/password (it's one thing to trust Amazon with my password, but why should I trust a very small site with that information?). I plan on adding OpenID for accounts on this site, and I would encourage others to add it to theirs (many platforms such as Drupal now support this). OpenID allows the user to decide who they trust to keep/authorize their account information (notably password), and you choose what information to give to different sites. Once you log in once, you don't need to provide your password again when you go to a site using OpenID. Hopefully at least smaller sites will start adopting OpenID, but it would be great if this was adopted by larger players as well. I'm still hoping for a replacement of passwords entirely, perhaps by graphical methods (how archaic is remembering a bunch of passwords, or, worse, if you force users to use "strong" passwords and change them a lot, then they'll just write them down?), but at least reducing the number of accounts you have would help.
  • Mashup building for the masses. Although APIs and mashups have taken a big stride forward, I hope to see some standardization in APIs and enhanced mashup editors that allow less technical people to create their own interesting (not only with maps!) mashups. See my Enabling the Interaction Publisher post.

 

Link Repository: Structured Link Checking

Especially as a content management system grows to have a large amount of content, it would be nice if you could do structured link checking. One of the problems with link checking in general is what to do with the reports once you get them. Of course, for a very small site you can easily scan an entire site with tools like LinkScan ($) and Xenu Linksleuth (free, but ads are put in the reports) or even monitor 404 requests and use single page tools like the LinkChecker Firefox extension. But with large sites you can end up with reports where it is hard to know where to even start fixing links. This is especially true for CMS-driven sites: the same bad link may appear in only one piece of content that is displayed throughout the site. Or you could wind up linking from lots of content items to a url (possibly outside your control) that changes. I envision getting a report with a list of the bad links, where a user (with appropriate global rights) could indicate the correct new link, which would get reflected in all content items (or left menus or other components surrounding the content) that used that link. This list could be prioritized by the cumulative page views of the pages that contained that bad link, or by the number of pages that contained that link. Another approach might be to provide a prioritized list of content items that have bad links (preferably directly linkable to edit mode of that content item). At any rate, note that we're not talking about pages here but content items or links -- the user can quickly take action that will correct links on multiple pages. A long list of pages (specific urls) with bad links is confusing and, more importantly, isn't as quickly actionable. Here is how normal link checking reports look and how more useful reports might look:

Before / Existing Reports (where do you start with a report like this, where content items may drive multiple pages?):

  • http://badlinkone.com is referenced on http://example-site.com/page1, http://example-site.com/page35, and http://example-site.com/page102
  • http://badlinktwo.com is referenced on http://example-site.com/page1, http://example-site.com/page1023, http://example-site.com/page2439, http://example-site.com/page5192

Etc.

Report indicating bad links where the user can immediately correct them (and apply the correction everywhere):

Etc.

Report indicating which content items have the bad links (content items linkable to edit them directly):

Etc.
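
To make the prioritization idea concrete, here is a rough sketch that groups bad links by content item and ranks them by the cumulative page views of the pages they show up on. All of the data structures, item ids, and view counts are invented for the example; a real CMS would pull them from its own content store and analytics.

```python
from collections import defaultdict

# Hypothetical inputs: which content item contains which bad link, which pages
# each content item is displayed on, and page views per page.
bad_links_in_items = {
    "item-7": ["http://badlinkone.com"],
    "item-9": ["http://badlinkone.com", "http://badlinktwo.com"],
}
pages_for_item = {
    "item-7": ["/page1", "/page35"],
    "item-9": ["/page1", "/page102"],
}
page_views = {"/page1": 500, "/page35": 40, "/page102": 10}


def prioritized_bad_links():
    """Return (link, cumulative page views, content items), most viewed first."""
    pages_by_link = defaultdict(set)
    items_by_link = defaultdict(set)
    for item, links in bad_links_in_items.items():
        for link in links:
            items_by_link[link].add(item)
            # A set keeps a page from being counted twice for the same link.
            pages_by_link[link].update(pages_for_item[item])
    report = [
        (link, sum(page_views[p] for p in pages), sorted(items_by_link[link]))
        for link, pages in pages_by_link.items()
    ]
    return sorted(report, key=lambda row: row[1], reverse=True)


for link, views, items in prioritized_bad_links():
    print(link, views, items)
```

Each row names the content items to edit rather than the pages affected, which is what makes the report actionable.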

One possible way to implement this is to change all the urls into some logical link in your CMS. Assuming your CMS stores straight HTML rather than a more structured format, then any url the user enters could be changed to a macro (if the user could put a hard link directly into the HTML without the system changing it, even if there was an option for creating a logical link, most users would probably just skip the logical linking). For example, if the user put in this HTML:

<a href="http://hobbsontech.com">Hobbs On Tech</a>

then the system would replace it with !link(123,"Hobbs On Tech") and put in its link repository that link 123 was http://hobbsontech.com. When the page was generated then the correct link could be replaced in the HTML (so of course the end user's browser should never see the "123" in the HTML). If the page linked to was in your CMS, then the macro could be different and just indicate the unique key for the content item being pointed to (this would depend on whether the context that the content appeared in was relevant). For example: !cms_item(123,"Hobbs On Tech")

Related items that a link repository might help with:

  • Reporting on content use. A link repository would allow other interesting reporting, such as the most linked-to content items in your repository.
  • Easily move content. In some cases, it may be easier to move content if you had a link repository. For instance, you may sometimes need to restructure your site resulting in the links changing. With a link repository, you could automatically change all the links so that the move did not result in broken links (of course this would work best for intranet sites where there were limited links outside your control to your content).
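
The !link macro round trip described above could be sketched roughly as follows. This is only an illustration under simplifying assumptions: anchors have exactly the form <a href="...">text</a>, link text contains no double quotes, and ids are handed out sequentially from 1 rather than 123. The function names and the two dictionaries are hypothetical.

```python
import re

# The "link repository": id -> url, plus a reverse index so a url is stored once.
urls_by_id = {}
ids_by_url = {}

ANCHOR = re.compile(r'<a href="([^"]+)">(.*?)</a>')
MACRO = re.compile(r'!link\((\d+),"(.*?)"\)')


def link_id_for(url):
    """Return the repository id for a url, registering it on first sight."""
    if url not in ids_by_url:
        new_id = len(urls_by_id) + 1
        ids_by_url[url] = new_id
        urls_by_id[new_id] = url
    return ids_by_url[url]


def store(html):
    """On save: replace hard-coded anchors with !link macros."""
    def to_macro(match):
        return f'!link({link_id_for(match.group(1))},"{match.group(2)}")'
    return ANCHOR.sub(to_macro, html)


def render(stored):
    """On page generation: expand macros using the current repository urls."""
    def to_anchor(match):
        url = urls_by_id[int(match.group(1))]
        return f'<a href="{url}">{match.group(2)}</a>'
    return MACRO.sub(to_anchor, stored)


stored = store('<a href="http://hobbsontech.com">Hobbs On Tech</a>')
print(stored)
print(render(stored))
```

Correcting a bad link then means changing one entry in urls_by_id, after which every content item that uses that id renders with the new url.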

Of course, this would add complexity (and possible failure points) to a CMS. Do you think it would be worth it?

Enabling the Interaction Publisher

New sites with dynamic, interactive functionality using data from different sources and allowing the user to interact with the data are exciting to see (examples: geo.worldbank.org and carma.org). But how do we unleash this functionality so that non-programmers can create interaction like this? We have content management systems that allow more people to easily add content to sites. But I think we should be driving toward an environment where users can a) take data from a variety of sources and b) create interactive sites based on this data. Maps are the most prominent example, but interactive tables are also important. Let's review where we are now:

  • We have sites already applying Google maps and other interactive functionality to various data sources (examples above).
  • Programmers have resources/examples/documentation for creating these types of sites (see Programmable Web for example).
  • Various APIs have been exposed for interacting and using data (examples).
  • We have tools like Yahoo Pipes that allow advanced users (probably not needing full-blown programmer skills) to create mashups. That said, Yahoo Pipes is now focused on consuming/dealing with RSS feeds (the Fetch Data Module is supposed to handle more general XML, but I had problems getting it to do so -- if you look at examples using DC crime data, you see it's RSS with some customization). In addition, this is a hosted solution, so you're at the mercy of Yahoo if you host a mashup with them (I noted Yahoo Pipes having problems accessing feeds intermittently even in my brief testing).
  • There are probably other similar examples of specialized tools, but I know of Swivel, which allows you to create your own graphs of data.

Here are the types of interactive functionality that I think we should be allowing non-programmers (let's call these folks "Interaction Publishers", riffing off the role of "Content Publisher") to create:

  • Interactive data tables. The Interaction Publisher should be able to point at one (or multiple) data sources, and indicate which columns/attributes to display in a table. The Interaction Publisher should also indicate which attributes should be selectable (in pulldowns for example) by the end user. Of course some theming / design and annotation should be possible.
  • Interactive maps. Interaction Publisher should be able to point at a data source, the attributes containing the locations, and what data to show for each location (along with the extent of the default map and formatting). Also, please can we get rid of the points / waypoints / circles that indicate arbitrary points that are used to indicate data for a large area (for example, a pointer to the capital for a country), and instead highlight the whole area (for example, the whole country). Ideally the Interaction Publisher will be able to indicate further interaction with the map (for example, displaying different layers of a map -- if not full-blown layers, then at least indicating different sets of waypoints to display).
  • Custom data. The Interaction Publisher should also be able to easily publish their own data/content, and pull their data into an interactive feature (for instance, this could even be a simple search on a little database / resource center the user has). An extension of this would be including some mechanism for overriding other data sources data points (of course this should somehow be indicated on the map/table so it isn't misleading).
  • Wizard-like functionality. The Interaction Publisher should not have to resort to XPATH, XSL, or programming in PHP / Perl / whatever.

Sounds nice -- but how would this be possible? One possible step is for institutions to expose their data in a consistent manner (at least each institution exposing its own data consistently). This would involve something of a meta-API, where you are consistent about:

  • Attributes that can be queried. Perhaps the list would be just topics and countries, for example. The topics lists should be something that the outside world will understand rather than an organization-centric list. If you have multiple topics lists, then it would be preferable if all systems were moved to a single topics list (even if that meant two topics lists per system).
  • Simplicity and consistency in APIs. Perhaps all your XML APIs are at http://xml.example-domain.com/apis/ (with an html page just listing all the APIs there) and then APIs to different systems like http://xml.example-domain.com/api/documents and http://xml.example-domain.com/api/web with example calls like http://xml.example-domain.com/api/web/api-version=1&topic=agriculture.
  • Consistent exposure of non-standard attributes. The issue of consistent query parameters was covered above -- this means that all systems are queried on the same parameters. But of course some systems will need to provide other attributes (such as, say, "Population"). This could be done in a custom namespace in RSS as the DC crime data (see xml) does in its Atom feed (which Yahoo Pipes, for example, can consume). This could be documented, and the consumer of the data could handle this.
  • Custom databases would also preferably comply. Perhaps there could be an http://xml.example-domain.com/api/core/ for institutionally, centrally supported repositories and http://xml.example-domain.com/api/special/ for one-off databases. This would still allow easy access of data by Interaction Publishers.
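
A small sketch of what "consistent parameters across systems" might look like from the consumer's side. The api_url helper is hypothetical, and it assumes a conventional "?" query string, whereas the example urls above are written without one; the base url and parameter names are taken from the examples in the list.

```python
from urllib.parse import urlencode

BASE = "http://xml.example-domain.com/api"


def api_url(system, version=1, **filters):
    """Build a query url that has the same shape for every system."""
    params = {"api-version": version}
    # Sort the filters so the same query always yields the same url.
    params.update(sorted(filters.items()))
    return f"{BASE}/{system}?{urlencode(params)}"


print(api_url("web", topic="agriculture"))
print(api_url("documents", topic="agriculture", country="td"))
```

Because only the system name changes between calls, a mashup tool (or an Interaction Publisher) needs to learn the parameter names once.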

Some potential ways of inching toward the goal of the non-developer Interaction Publisher easily being able to publish dynamic, interactive features would be:

  • Start by using javascript libraries. There are several javascript libraries out there (examples: Dojo, mootools, Prototype / script.aculo.us), but most seem to be too low-level (concentrating on opening/closing panels, transitions, and the like) to be useful for interactive data features. Possibly a library that has higher-level features, including interactive tables, such as Ext JS could be used as a first step. It would require touching some code, but perhaps a CMS, for example, could include code snippets in its documentation indicating what needs to be replaced (for example, where to put in the url to the source XML).
  • Create some simple wizards in CMSes. So that we aren't relying on, for example, Yahoo Pipes for hosting our interaction, we may wish to start including simple wizards in our CMSes. For example, one could be for interactive tables that just had one data source and three columns.
  • Push for stronger hosted interactive feature builders. For example, Yahoo Pipes perhaps could include some of the features mentioned in this post (for example, a tool for creating interactive maps, or a tool for creating a pulldown of options to drive a Google map).
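
As a sketch of the "one data source and three columns" wizard idea, the following reads an RSS feed extended with a custom namespace and turns it into table rows. The feed content, the inst namespace uri, the population values, and the COLUMNS configuration are all invented for the illustration; a wizard would collect the column choices from the Interaction Publisher instead of hard-coding them.

```python
import xml.etree.ElementTree as ET

# A made-up RSS feed extended with a custom namespace (placeholder values).
FEED = """<rss version="2.0" xmlns:inst="http://xml.example-domain.com/ns/1.0">
  <channel>
    <title>Country data</title>
    <item>
      <title>Country A</title>
      <link>http://xml.example-domain.com/countries/a</link>
      <inst:population>1000</inst:population>
    </item>
    <item>
      <title>Country B</title>
      <link>http://xml.example-domain.com/countries/b</link>
      <inst:population>2500</inst:population>
    </item>
  </channel>
</rss>"""

NS = {"inst": "http://xml.example-domain.com/ns/1.0"}

# What a wizard would capture: a column heading and the element to read for it.
COLUMNS = [("Country", "title"), ("Link", "link"), ("Population", "inst:population")]


def table_rows(feed_xml, columns=COLUMNS):
    """Return one row per feed item, with one cell per configured column."""
    root = ET.fromstring(feed_xml)
    return [
        [item.findtext(path, default="", namespaces=NS) for _, path in columns]
        for item in root.iter("item")
    ]


print([heading for heading, _ in COLUMNS])
for row in table_rows(FEED):
    print(row)
```

A plain feedreader would still see the title and link of each item, while a table builder can also pick up the namespaced attribute.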

Here's a little chart displaying some of the ideas in this post (also see pdf version):

I'd really like your comments on this post. Specifically:

  • Is the role of Interaction Publisher important?
  • How could we enable this role?
  • What ideas above do you think would work and which would not work?
  • Is there a need for a generic standard XML format separate from RSS feeds, or should an institution's RSS just be extended to include custom portions?