The field of data and information integration currently consists of a variety of technologies that serve a similar purpose: transforming and moving data from sources into targets. In the ETL (Extract-Transform-Load) context, data is retrieved from and saved to transactional data stores such as databases or files (XML, COBOL). In EAI (Enterprise Application Integration), data sources and targets are typically messaging queues used to interconnect applications inside an organization. Business Intelligence is just one of many sectors where this integration technology is used to improve business processes. EII (Enterprise Information Integration) operates in almost the same domain but focuses on providing a consolidated and consistent view of a company's (most likely distributed and heterogeneous) data landscape. Online data streaming is an emerging field that focuses on real-time data analysis. It is used especially in online trading applications.
The goal of Ohua is to unify ETL, EAI and online data streaming into one system. This is a very challenging task considering the various requirements of each technology: ETL requires exactly-once delivery semantics even in the event of system failures, EAI needs transactional semantics for data processing, and for online data streaming near real-time data delivery is essential. While fault tolerance is sufficient for ETL, high availability is indispensable for online trading applications. In addition, each of these systems needs to be highly extensible in order to adapt quickly, scale to very large volumes of data, and process data very efficiently.
The core of the Ohua system is a data streaming engine designed specifically for MPP (massively parallel processing) deployment scenarios. The user specifies directed acyclic graphs in which data travels along the arcs in FIFO order and each node (operator) represents a unit of computation to be performed on the data. Data is retrieved from and written to a variety of sources and targets such as databases, files, messaging queues and web services.
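The graph model described above can be illustrated with a minimal sketch. It is not Ohua's API: the names `Arc`, `Operator`, `run` and `build_and_run` are hypothetical, and the single-threaded scheduler is a simplifying assumption that ignores the parallel deployment the engine targets. It only shows operators as units of computation connected by FIFO arcs.

```python
from collections import deque


class Arc:
    """A FIFO channel connecting two operators (hypothetical name)."""

    def __init__(self):
        self.items = deque()

    def push(self, item):
        self.items.append(item)

    def pop(self):
        return self.items.popleft()

    def __bool__(self):
        return bool(self.items)


class Operator:
    """A node: applies fn to each item from inp and forwards results to out."""

    def __init__(self, name, fn, inp, out=None):
        self.name, self.fn, self.inp, self.out = name, fn, inp, out

    def run_once(self):
        """Process one pending item; return True if work was done."""
        if not self.inp:
            return False
        for result in self.fn(self.inp.pop()):
            if self.out is not None:
                self.out.push(result)
        return True


def run(operators):
    """Naive scheduler: keep looping until no operator has pending input."""
    progressed = True
    while progressed:
        progressed = False
        for op in operators:
            if op.run_once():
                progressed = True


def build_and_run(records):
    """Two-operator pipeline: filter even numbers, then 'load' them."""
    source_arc, filtered_arc = Arc(), Arc()
    loaded = []  # stands in for a target such as a database table

    def keep_even(x):
        return [x] if x % 2 == 0 else []

    def load(x):
        loaded.append(x * 10)
        return []

    for record in records:
        source_arc.push(record)
    run([Operator("filter", keep_even, source_arc, filtered_arc),
         Operator("load", load, filtered_arc)])
    return loaded
```

Calling `build_and_run([1, 2, 3, 4])` pushes the records through both operators in arrival order, so the target receives the transformed even values in the same order in which they entered the graph.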
Ohua fulfills the requirements stated above by defining a fault tolerance framework that takes checkpoints transparently and online, without stopping the processing, and recovers from them once a failure has occurred. Additionally, Ohua provides exactly-once delivery semantics to operators with side effects on (transactional) resources (databases, XML files, etc.) at low overhead, even in the presence of system crashes. High extensibility is ensured by imposing only minor effort on operator authors.
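One generic way to obtain exactly-once side effects on a transactional resource is sketched below. This is an assumption for illustration, not a description of Ohua's actual mechanism: each item carries a sequence number, the operator commits the item and its progress marker in the same transaction, and after a restart the replayed items that are already committed are skipped. The function names, the table layout and the `crash_after` parameter are all hypothetical; only Python's standard `sqlite3` module is used.

```python
import sqlite3


def open_target(conn):
    """Create the data table and a single-row progress marker."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rows (seq INTEGER PRIMARY KEY, value TEXT)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS progress "
        "(id INTEGER PRIMARY KEY CHECK (id = 0), last_seq INTEGER)")
    conn.execute("INSERT OR IGNORE INTO progress VALUES (0, -1)")
    conn.commit()


def last_committed(conn):
    return conn.execute(
        "SELECT last_seq FROM progress WHERE id = 0").fetchone()[0]


def write_exactly_once(conn, stream, crash_after=None):
    """Write (seq, value) pairs, skipping those already committed."""
    done = last_committed(conn)
    for n, (seq, value) in enumerate(stream):
        if crash_after is not None and n == crash_after:
            raise RuntimeError("simulated crash")  # hypothetical failure
        if seq <= done:
            continue  # replayed item: its side effect is already applied
        with conn:  # data and progress marker commit atomically
            conn.execute("INSERT INTO rows VALUES (?, ?)", (seq, value))
            conn.execute(
                "UPDATE progress SET last_seq = ? WHERE id = 0", (seq,))
        done = seq


def demo():
    conn = sqlite3.connect(":memory:")
    open_target(conn)
    stream = [(0, "a"), (1, "b"), (2, "c")]
    try:
        write_exactly_once(conn, stream, crash_after=2)
    except RuntimeError:
        pass  # the "process" died before the third item was written
    write_exactly_once(conn, stream)  # restart replays the whole stream
    return conn.execute("SELECT seq, value FROM rows ORDER BY seq").fetchall()
```

In `demo`, the first run commits two items before the simulated crash; the second run replays the full stream, yet each row appears in the target only once because the progress marker is read back from the same transactional store.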