There is a lot of talk around the "convergence" of structured and unstructured information, but what exactly does this mean? There are various ways that structured and unstructured information could be combined, each providing different value. I will try to identify the different approaches in this posting.
While there have been earlier approaches to combining structured and unstructured information, search has been one of the biggest drivers. Interestingly enough, this is the one driver that comes from the unstructured side. As more users start their hunt for information with a search, the need has surfaced to use the simple search box to get to everything. And while this has been popularized by the Googles and Yahoos of the world as a way to get to the unstructured information out on the web, companies are now starting to enable search across their enterprise systems, so that a single query can pull back structured information from application databases and unstructured information from their intranets, portals, file servers and content management systems.
However, the simplest and most traditional way has been to use applications as a common interface for accessing both structured and unstructured information. Many applications just need the ability to display structured and unstructured information side by side: for example, customer data from a CRM and/or order processing system, along with customer documents (contracts, invoices, correspondence, etc.) from content management systems and collaborative applications. Many companies accomplish this through "on-the-glass" integration, which involves custom coding to repository-specific APIs directly within the application. A newer and more effective approach is to leverage information integration and content integration middleware products that can provide a common interface for accessing the different underlying systems where the data and content are stored. Most offerings still require you to use different APIs for structured vs. unstructured systems, but IBM WebSphere Information Integrator now provides a hook into the IBM content integration product (WebSphere Information Integrator Content Edition) so that you can at least query and retrieve both structured and unstructured information through a single interface.
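To make the "common interface" idea a bit more concrete, here is a minimal sketch in Python. It is not how WebSphere Information Integrator or any other product works; the class and function names (InformationSource, CrmSource, ContentSource, customer_view) are hypothetical, and both back ends are just in-memory lists standing in for a real CRM and a real content repository.

```python
from __future__ import annotations

from typing import Protocol


class InformationSource(Protocol):
    """Hypothetical common interface that every back end agrees to implement."""

    def lookup(self, customer_id: int) -> list[dict]:
        ...


class CrmSource:
    """Stand-in for a structured system such as a CRM database."""

    def __init__(self, records: list[dict]) -> None:
        self._records = records

    def lookup(self, customer_id: int) -> list[dict]:
        return [
            {"kind": "record", "source": "crm", "summary": r["name"]}
            for r in self._records
            if r["customer_id"] == customer_id
        ]


class ContentSource:
    """Stand-in for an unstructured system such as a content repository."""

    def __init__(self, documents: list[dict]) -> None:
        self._documents = documents

    def lookup(self, customer_id: int) -> list[dict]:
        return [
            {"kind": "document", "source": "content", "summary": d["title"]}
            for d in self._documents
            if d["customer_id"] == customer_id
        ]


def customer_view(customer_id: int, sources: list[InformationSource]) -> list[dict]:
    """Ask every source the same question and return one combined result list."""
    results: list[dict] = []
    for source in sources:
        results.extend(source.lookup(customer_id))
    return results
```

The application only ever calls customer_view; adding another repository means writing one more class with a lookup method rather than touching the application code.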
You can take this one step further and create direct associations between related structured and unstructured information. This basically means embedding linkages between data and documents at the data storage level, as opposed to at the application level. The most common approach to doing this is to create links in the business object record (e.g. the customer record, the product record, the order record, etc.) pointing back to any related documents, which are often stored in other systems.
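As a rough illustration of linking at the storage level, the sketch below keeps the links in a table next to the customer record, using Python's built-in sqlite3 module. The table and column names (customer, customer_document, repository, doc_ref) are assumptions made up for this example; the documents themselves stay in whatever system already holds them.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a customer table plus a link table pointing at external documents."""
    conn.executescript(
        """
        CREATE TABLE customer (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE customer_document (
            customer_id INTEGER NOT NULL REFERENCES customer(id),
            repository  TEXT NOT NULL,  -- which external system holds the document
            doc_ref     TEXT NOT NULL   -- identifier or path inside that system
        );
        """
    )


def link_document(conn: sqlite3.Connection, customer_id: int,
                  repository: str, doc_ref: str) -> None:
    """Store a pointer to a document; the document itself is not copied."""
    conn.execute(
        "INSERT INTO customer_document (customer_id, repository, doc_ref) "
        "VALUES (?, ?, ?)",
        (customer_id, repository, doc_ref),
    )


def linked_documents(conn: sqlite3.Connection, customer_id: int) -> list:
    """Return (repository, doc_ref) pairs linked to one customer record."""
    return conn.execute(
        "SELECT repository, doc_ref FROM customer_document "
        "WHERE customer_id = ? ORDER BY repository, doc_ref",
        (customer_id,),
    ).fetchall()
```

Because the links live with the data, any application that can read the customer record can also find the related documents without its own integration code.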
Another, more progressive and recent approach is to turn your data warehouse into an "information" warehouse by incorporating unstructured information. This doesn't necessarily mean copying all of your documents into the warehouse - that would be fairly costly and could put a drain on your system resources given the typical size of unstructured content. However, as you create that single view of the customer, or of a product, or anything else, you can create links to all of the documents and other types of unstructured information that may be relevant. Those links could point to multiple external systems and be replicated across different objects in your warehouse, but they can provide a common way to get to all information, structured and unstructured, through a single interface.
So, the approaches so far have all focused on accessing related structured and unstructured information, but there is another aspect of this convergence that is equally important: creating or applying "structure" to the unstructured. People want to query unstructured information with the ease and effectiveness of querying structured information. They want to report on the knowledge buried in unstructured information the way they report on standard data in a warehouse. And they want to analyze and generate insights from unstructured information the way they can by applying statistical algorithms and predictive analysis to historical and real-time structured data.
Thus, the final approaches involve multiple techniques to create a structured representation of unstructured information that enables all of these capabilities.
The simplest method is to create more structured content in the first place. This involves better tagging of websites and content, and possibly using XML instead of pure text or other document formats. Of course, this places quite a burden on the content creator, and will only get you so far. The next generation of Word will add some better capabilities here, but won't get you to the point of being able to extract any deep knowledge of the underlying content.
Another recent phenomenon, and an approach that may become viable for widely accessible content, is social bookmarking, or tagging. Pioneered by del.icio.us, this entails users "tagging" content they review with additional metadata that others can then use to find that content and understand what it is about. This may sound a little convoluted, but it is becoming a fairly popular practice and has real benefits. After all, wouldn't you trust the opinion of a real person who has looked at the content more than some automated crawler and indexer? And yes, people are actually doing this! Just check out the success of del.icio.us (purchased by Yahoo for a handsome sum) and Wikipedia (the largest resource of information gathered solely through public collaboration and contribution).
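For the curious, the mechanics of tagging can be sketched in a few lines of Python. This is only an illustration of the idea, not how del.icio.us is implemented; the TagIndex class and its method names are hypothetical.

```python
from collections import Counter, defaultdict


class TagIndex:
    """Hypothetical in-memory index of user-supplied tags for URLs."""

    def __init__(self) -> None:
        # tag -> set of (url, user) pairs; a set ignores repeat tagging by one user
        self._by_tag = defaultdict(set)

    def tag(self, user: str, url: str, *tags: str) -> None:
        """Record that a user applied one or more tags to a URL."""
        for t in tags:
            self._by_tag[t.lower()].add((url, user))

    def find(self, tag: str) -> list:
        """Return URLs carrying the tag, most frequently tagged first."""
        counts = Counter(url for url, _ in self._by_tag.get(tag.lower(), ()))
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        return [url for url, _ in ranked]
```

Ranking by the number of distinct users who applied a tag is one simple way to let the opinions of real people decide what surfaces first.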
The final approach, and the most effective if done correctly, is to apply text analytics to extract concepts, entities, facts and other types of knowledge from unstructured content to create a more structured representation of the content. This is something I've blogged on before and is a major focus of the UIMA framework. Since the knowledge is extracted into a structured format, it can be sent to a database, a search index, or a business process, where it can be more easily queried, included in reports, and analyzed. And isn't that what we want to be able to do with all information? Find it, report on it, and understand it.
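Here is a deliberately tiny sketch of what "extracting into a structured format" can look like. It uses plain regular expressions in Python, not UIMA or any real text analytics engine, and the entity types and patterns (an "ORD-" order number, a dollar amount, an ISO-style date) are assumptions chosen just for the example.

```python
import re

# Hypothetical entity patterns; real text analytics goes far beyond regexes.
PATTERNS = {
    "order_id": re.compile(r"\bORD-\d+\b"),
    "amount": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def extract_entities(doc_id: str, text: str) -> list:
    """Turn free text into rows that could be loaded into a table or index."""
    rows = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            rows.append(
                {
                    "doc_id": doc_id,
                    "type": entity_type,
                    "value": match.group(0),
                    "start": match.start(),
                }
            )
    # Order rows by where each entity appears in the document
    rows.sort(key=lambda row: row["start"])
    return rows
```

Each row carries the document identifier, so once the rows are in a database or search index they can be queried and reported on like any other data while still pointing back to the original content.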
In a future post, I will talk about actual use cases where some of the above capabilities can be of extreme value.