S. Yoakum-Stover, Ph.D.
Potomac Institute for Policy Studies
US Army CERDEC I2WD Information Exploitation Futures Lab
T. Malyuta, Ph.D.
New York City College of Technology
Computer Systems Technology Department
Abstract
We propose a new solution for data integration and semantic enrichment in support of Situation Management (SIMA). Our solution applies to any modality (e.g., text, images, audio, signals) and embraces the diversity of data sources, types, and models, placing no restrictions on processes, applications, or users. It is database centric and proceeds in stages to address the unified storage of structured data and its semantic enrichment in a way that remains viable in an Ultra-Large-Scale systems environment. The result is a layered data integration architecture that can accommodate any kind of data to coherently support the multiplicity of processing required for SIMA.
Challenge of Data Integration in Situation Management
Though generally scoped around a particular set of circumstances, or state of affairs, Situation Management (SIMA) is a mega-process occurring in a heterogeneous and volatile data space resulting from a cacophony of human and automated systems. To understand a situation and engineer the means for managing it, we must organize its data space. In particular, the heavy load of sophisticated processing for the anticipation, recognition, and influence of a situation must be girded with an architecture that enables data sourced from wildly disparate systems, having different modalities, structures, and semantics, to be integrated into one coherent body of situational knowledge.
In most business intelligence applications, data is integrated across information systems to support a choreographed interplay of services comprising an established set of business processes. In contrast, the constituent events in SIMA typically entail information systems that are far more diverse and whose dynamic interplay is less scripted, less repeatable, and therefore less predictable. Since many of these information systems capture data for completely different and unrelated purposes, and were never intended as participants in a coherent process, SIMA requires a data architecture that enables them to be dynamically re-used or re-purposed. Because every situation is unique and we cannot anticipate all the right “business processes,” we need the capability to fuse data quickly and meaningfully, on the fly, from an ad-hoc set of systems, often in high volumes and sometimes together with knowledge asserted by analysts.
Traditional approaches to data integration, both physical and virtual [Batini 1986, Parent 1998, Halevy 2005, Bernstein 2007], cannot accommodate the complexity, heterogeneity, and volatility of the SIMA data space. In actual practice, the canonical data-models that underlie such approaches, including federation, are simply too rigid. They cannot adapt their structure to handle new data sources, associations, processes, or applications without heavy manual intervention. Moreover, such approaches generally result in the loss and/or distortion of data, semantics, and context, all of which may be useful or even critical in SIMA. Even if such systems are initially successful, the IT costs of sustaining them, as well as the human costs resulting from their deficiencies, can be devastatingly high.
The scale and complexity of SIMA place it squarely in the domain of Ultra-Large-Scale (ULS) systems, which are characterized by decentralization; inherently conflicting, diverse, and unknowable requirements; heterogeneous, changing, and inconsistent elements; normal failures; continuous operation, evolution, and deployment; and immense scale along many dimensions [Northrop 2006]. As such, SIMA demands a supporting data architecture that remains viable in a freely evolving, interdependent collective of systems, people, policies, cultures, and economics, very little of which will ever be under our control. Our objective is to define such a solution.
Data Description Framework
To organize the SIMA data space in a ULS systems environment, we enable semantic data integration by providing for the unified storage of structured data. We embrace the diversity of domain-specific data-models by taking a data-model agnostic approach wherein the integration model makes the least possible commitment to any particular data-model. We achieve this by identifying the universal aspects inherent in all structured data and basing the integration model on them. A key aspect of our approach is that the character and meaning of the source data-model are preserved and made accessible by the data store. The result is a data architecture that can accommodate any kind of data without placing restrictions on vocabulary, structure, semantics, or constraints, in a way that addresses the needs of the SIMA Community today while providing a seamless transition path toward a future of ULS systems imbued with semantic technologies.
The key to devising a domain-neutral storage model for structured data is to decouple that which varies, namely vocabularies and, more generally, data-models, from that which remains constant, namely the source artifact and, ideally, the storage structure. To achieve this, we consider structure, vocabulary, semantics, and constraints from a higher level of abstraction, from which we then distill a minimal set of elements sufficient to capture any data-model. These are illustrated in Fig. 1 and defined as follows:
Sign: A sign is a chunk of data, either physically located within a tangible artifact or contained within an analyst’s mind. Examples of the former include a string of text in a document; an object within an image; a segment of audio in an audio stream; a spike in a signal. As illustrated in Fig. 1, regardless of the type of medium, tangible signs are always associated with a physical extent (i.e., a quantifiable span, which we call a mention) within the artifact. In contrast, signs that reside in an analyst’s mind become tangible when she writes down her thoughts.
Concept: A concept is an abstract idea, defined explicitly or implicitly by a source data-model. For example, the nodes of an ontology, the tag set in an XML Schema Document (XSD), and the attribute / table names in a relational database all represent concepts. Concept is an abstraction of such representations, which in the example of Fig. 1 includes Message, Person, and Body_text.


Predicate: A predicate is an abstract idea used to express a relationship between “things.” Predicates are used in the formation of statements (described below) and may be defined either explicitly or implicitly by a source data-model. For example, the arcs of an ontology and the attributes of an XML or database schema represent predicates. In Fig. 1, To, From, and Body represent predicates.
Term: A term is a disambiguated mention abstracted from the source artifact or asserting analyst. The process of disambiguation associates a mention with a concept, implicitly using the IsInstanceOf predicate. However, not every such pairing results in a distinct term. All signs that are identical, and that are identified as having the same meaning, are represented by a single term. In the example of Fig. 1, Suzi IsInstanceOf Person represents a term.
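The many-mentions-to-one-term rule can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names, the artifact identifiers, and the character offsets are hypothetical, and "same meaning" is approximated here as "same concept," which a real disambiguation process would likely refine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """A sign together with its physical extent in a source artifact."""
    artifact_id: str  # hypothetical identifier of the source artifact
    start: int        # extent start (e.g., a character offset); assumed unit
    end: int          # extent end
    sign: str         # the chunk of data itself


@dataclass(frozen=True)
class Term:
    """A sign paired with a concept via the implicit IsInstanceOf predicate."""
    sign: str
    concept: str


class TermRegistry:
    """Maps identical signs with the same meaning to a single shared term."""

    def __init__(self):
        self._terms = {}     # (sign, concept) -> Term
        self._mentions = {}  # Term -> list of Mention

    def disambiguate(self, mention, concept):
        # Assumption: "same meaning" is approximated by "same concept".
        key = (mention.sign, concept)
        term = self._terms.setdefault(key, Term(mention.sign, concept))
        self._mentions.setdefault(term, []).append(mention)
        return term

    def mentions_of(self, term):
        return list(self._mentions.get(term, []))


registry = TermRegistry()
first = Mention("artifact-1", 0, 4, "Suzi")     # hypothetical extents
second = Mention("artifact-2", 10, 14, "Suzi")
term_a = registry.disambiguate(first, "Person")
term_b = registry.disambiguate(second, "Person")
other = registry.disambiguate(first, "Message")
```

In this sketch the two mentions of Suzi resolve to one Person term that remembers both extents, while pairing the same sign with a different concept yields a distinct term.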
Statement: A statement encodes a binary relationship between a subject and an object mediated by a predicate. In our design, subject and object may each be either a term or a statement. The simplest kind of statement is one in which both subject and object are terms. Statements in which the object is itself another statement represent reifications. Finally, a statement in which both subject and object are other statements represents a relationship between statements. In Fig. 1, we see three statements, all with the same subject, which is the term corresponding to the message itself.
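The recursive nature of statements can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the DDF storage design: the class layout, the sign values other than Suzi, and the predicates Asserts and Corroborates are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Term:
    sign: str
    concept: str


@dataclass(frozen=True)
class Statement:
    """A binary relationship; subject and object may be terms or statements."""
    subject: Union[Term, "Statement"]
    predicate: str
    obj: Union[Term, "Statement"]

    def kind(self) -> str:
        subject_is_statement = isinstance(self.subject, Statement)
        object_is_statement = isinstance(self.obj, Statement)
        if subject_is_statement and object_is_statement:
            return "relationship between statements"
        if object_is_statement:
            return "reification"
        if subject_is_statement:
            return "other"  # this combination is not named in the text
        return "simple"


# Hypothetical values in the spirit of Fig. 1.
message = Term("msg-001", "Message")
suzi = Term("Suzi", "Person")
analyst = Term("analyst-7", "Person")

sent_to = Statement(message, "To", suzi)              # term -> term
claim = Statement(analyst, "Asserts", sent_to)        # term -> statement
link = Statement(claim, "Corroborates", sent_to)      # statement -> statement
```

Because the dataclasses are frozen and therefore hashable, a statement can be nested inside another statement or used as a dictionary key, which is what makes reification straightforward in this sketch.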
The organization of these elementary constructs (sign, concept, predicate, term, and statement) defines a data reference model, which we call the Data Description Framework (DDF) [Yoakum 2008 DAMA]. Because it effectively decouples data from data-models and structured data from data-structures, it can encapsulate any sort of data-model and support any data-structure. Because it binds knowledge to data, it enables deep data integration and semantic enrichment. Because it provides a foundation for implementing a stable database, it serves as a practical data integration platform.


