S. Yoakum-Stover, Ph.D.
Potomac Institute for Policy Studies
US Army CERDEC I2WD Information Exploitation Futures Lab
T. Malyuta, Ph.D.
New York City College of Technology
Computer Systems Technology Department
Abstract
We propose a new solution for data integration and semantic enrichment in support of Situation Management (SIMA). Our solution applies to any modality (e.g. text, images, audio, signals) and embraces the diversity of data sources, types, and models, placing no restrictions on processes, applications, or users. It is database centric and proceeds in stages to address the unified storage of structured data and its semantic enrichment in a way that remains viable in an Ultra-Large-Scale (ULS) systems environment. The result is a layered data integration architecture that can accommodate any kind of data to coherently support the multiplicity of processing required for SIMA.
Challenge of Data Integration in Situation Management
Though generally scoped around a particular set of circumstances, or state of affairs, Situation Management (SIMA) is a mega-process occurring in a heterogeneous and volatile data space resulting from a cacophony of human and automated systems. To understand a situation and engineer the means for managing it, we must organize its data space. In particular, the heavy load of sophisticated processing for the anticipation, recognition, and influence of a situation must be girded with an architecture that enables data sourced from wildly disparate systems, having different modalities, structures, and semantics, to be integrated into one coherent body of situational knowledge.
In most business intelligence applications, data is integrated across information systems to support a choreographed interplay of services comprising an established set of business processes. In contrast, the constituent events in SIMA typically entail information systems that are far more diverse and whose dynamic interplay is less scripted, less repeatable, and therefore less predictable. Since many of these information systems capture data for completely different and unrelated purposes, and were never intended as participants in a coherent process, for SIMA we require a data architecture that enables them to be dynamically re-used or re-purposed. Because every situation is unique and we cannot anticipate all the right “business processes,” we need the capability to fuse data quickly and meaningfully, on the fly, often in high volumes, from an ad-hoc set of systems, and sometimes together with knowledge asserted by analysts.
Traditional approaches to data integration, both physical and virtual [Batini 1986, Parent 1998, Halevy 2005, Bernstein 2007], cannot accommodate the complexity, heterogeneity, and volatility of the SIMA data space. In actual practice, the canonical data-models that underlie such approaches, including federation, are simply too rigid. They cannot adapt their structure to handle new data sources, associations, processes, or applications without heavy manual intervention. Moreover, such approaches generally result in the loss and/or distortion of data, semantics, and context, all of which may be useful or even critical in SIMA. Even if such systems are initially successful, the IT costs of sustaining them, as well as the human costs resulting from their deficiencies, can be devastatingly high.
The scale and complexity of SIMA place it squarely in the domain of Ultra-Large-Scale (ULS) systems, which are characterized by decentralization; inherently conflicting, diverse, and unknowable requirements; heterogeneous, changing, and inconsistent elements; normal failures; continuous operation, evolution, and deployment; and immense scale along many dimensions [Northrop 2006]. As such, SIMA demands a supporting data architecture that remains viable in a freely evolving, interdependent collective of systems, people, policies, cultures, and economics, very little of which will ever be under our control. Our objective is to define such a solution.
Data Description Framework
To organize the SIMA data space in a ULS systems environment, we enable semantic data integration by providing for the unified storage of structured data. We embrace the diversity of domain-specific data-models by taking a data-model agnostic approach wherein the integration model makes the least possible commitment to any particular data-model. We achieve this by identifying the universal aspects inherent in all structured data and creating an integration model based on that. A key aspect of our approach is that the character and meaning of the source data-model is preserved and made accessible by the data store. The result is a data architecture that can accommodate any kind of data without placing restrictions on vocabulary, structure, semantics, or constraints, in a way that addresses the needs of the SIMA Community today while providing a seamless transition path toward a future of ULS systems imbued with semantic technologies.
The key to devising a domain-neutral storage model for structured data is to decouple that which varies, namely vocabularies and, more generally, the data-models, from that which remains constant, namely the source artifact and, ideally, the storage structure. To achieve this, we consider structure, vocabulary, semantics, and constraints from a higher level of abstraction, from which we then distill a minimal set of elements sufficient to capture any data-model. These are illustrated in Fig. 1 and defined as follows:
Sign: A sign is a chunk of data, either physically located within a tangible artifact, or contained within an analyst’s mind. Examples of the former include a string of text in a document; an object within an image; a segment of audio in an audio stream; a spike in a signal. As illustrated in Fig. 1, regardless of the type of medium, tangible signs are always associated with a physical extent (i.e. quantifiable span which we call a mention) within the artifact. In contrast, signs that reside in an analyst’s mind become tangible when she writes down her thoughts.
Concept: A concept is an abstract idea, defined explicitly or implicitly by a source data-model. For example, the nodes of an ontology, the tag set in an XML Schema Document (XSD), and the attribute / table names in a relational database all represent concepts. Concept is an abstraction of such representations, which in the example of Fig. 1 includes Message, Person, and Body_text.


Predicate: A predicate is an abstract idea used to express a relationship between “things.” Predicates are used in the formation of statements (described below) and may be defined either explicitly or implicitly by a source data-model. For example, the arcs of an ontology and the attributes of an XML or database schema represent predicates. In Fig. 1, To, From, and Body represent predicates.
Term: A term is a disambiguated mention abstracted from the source artifact or asserting analyst. The process of disambiguation associates a mention with a concept, implicitly using the IsInstanceOf predicate. However, not every such pairing results in a distinct term. All signs that are identical, and that are identified as having the same meaning, are represented by a single term. In the example of Fig. 1, Suzi IsInstanceOf Person represents a term.
Statement: A statement encodes a binary relationship between a subject and an object mediated by a predicate. In our design, subject and object may be either a term or statement. The simplest kind of statement is one in which subject and object are terms. Statements in which the object is itself another statement represent reifications. Finally, a statement in which both subject and object are other statements represents a relationship between statements. In Fig. 1, we see three statements, all with the same subject, which is the term corresponding to the message itself.
This organization of these elementary constructs (sign, concept, predicate, term, and statement) defines a data reference model, which we call the Data Description Framework (DDF) [Yoakum 2008 DAMA]. Because it effectively decouples data from data-models and structured data from data-structures, it can encapsulate any sort of data-model and support any data-structure. Because it binds knowledge to data, it enables deep data integration and semantic enrichment. Because it provides a foundation for implementing a stable database, it serves as a practical data integration platform.
In the subsequent text, we represent mentions, concepts, and predicates using Arial font. Terms are denoted as [mention, concept] (e.g. [Adam, Chemist]) and statements are denoted using an intuitive triple representation, e.g. [Adam, Chemist] hasInventoryID [1001,InventoryID].
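The notation above can be mirrored in a few lines of code. The following is a minimal sketch, not the authors' implementation: the class names Term and Statement, their field names, and the asserts predicate with its Analyst term are assumptions introduced purely for illustration. It shows that a term pairs a mention with a concept, that identical signs with the same meaning collapse to a single term, and that a statement may take another statement as its object (reification).

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Term:
    """A disambiguated mention: a sign paired with a concept."""
    mention: str
    concept: str

    def __str__(self) -> str:
        return f"[{self.mention}, {self.concept}]"


@dataclass(frozen=True)
class Statement:
    """A binary relationship; subject and object may be terms or statements."""
    subject: Union[Term, "Statement"]
    predicate: str
    obj: Union[Term, "Statement"]

    def __str__(self) -> str:
        return f"{self.subject} {self.predicate} {self.obj}"


# The statement used as an example in the text.
adam = Term("Adam", "Chemist")
inventory = Term("1001", "InventoryID")
simple = Statement(adam, "hasInventoryID", inventory)

# A reification: a statement whose object is itself a statement.
# The 'asserts' predicate and the Analyst term are hypothetical.
reified = Statement(Term("Analyst7", "Analyst"), "asserts", simple)

# Frozen dataclasses compare by value, so identical signs with the
# same concept collapse to one term when stored in a set.
unique_terms = {adam, Term("Adam", "Chemist")}

print(simple)
print(reified)
```

Because both classes are immutable and hashable, terms and statements can be stored in sets or used as dictionary keys, which is convenient for the de-duplication described in the definition of a term.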
The Unified Data Space
As illustrated in Fig. 2, the DDF forms a layer of data and semantics (Layer 2) lying between the indigenous source systems (Layer 1) and their knowledge models (Layer 3). Layer 1 feeds the layers above, and Layers 2 and 3 interact: Layer 3 provides semantic context for Layer 2 and Layer 2 participates in the formation of an overarching knowledge model in Layer 3. Together Layers 2 and 3 form what we call the unified DDF data space.
Illustrative Example
To convey a more tangible understanding of the DDF, in this section we present a simplified example that illustrates:
- Loading three disparate data sources into the DDF
- Surveying the resulting integrated data space
- Enhancing the data space with additional semantic associations
- Exploring the enriched data space


Loading the DDF
Loading structured data into a DDF store is a straightforward, mechanical Extract-Transform-Load (ETL) process. This process maps the original data and semantics into the DDF using a pattern that depends primarily on the type of data source, because it needs only to capture the structure and semantics of that source type's metamodel (e.g. the relational metamodel), not the structure and semantics of a specific instance. For example, our prototype loader works out-of-the-box for most relational databases, extracting data structure and data from the source’s data dictionary and relations as follows:
- Data instances → signs
- Table attributes → concepts
- Signs are bound to their respective concepts to form terms
- Predicates are derived from non-key attributes (i.e. concepts) using ‘has’ semantics. For example, the predicate derived from the concept Project is hasProject.
- Within a record, terms associated with primary key columns are semantically linked via derived predicates to terms associated with non-primary key columns to form statements. For example, [Adam, ChemistName] hasProject [P1, Project].
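The mapping rules above can be sketched in code. This is an illustrative sketch only and not the prototype loader itself: the function name load_table, the representation of records as dictionaries, the single-column primary key, and the sample rows (other than the Adam / P1 pairing taken from the text) are all assumptions.

```python
def load_table(rows, key):
    """Map relational records to DDF terms and statements.

    rows: list of dicts mapping column name -> value (one dict per record)
    key:  name of the primary key column (a single-column key is assumed)
    Terms are (sign, concept) tuples; statements are
    (subject_term, predicate, object_term) tuples.
    """
    terms, statements = set(), set()
    for row in rows:
        # The primary key value, bound to its column's concept, is the subject.
        subject = (row[key], key)
        terms.add(subject)
        for column, value in row.items():
            if column == key or value is None:
                continue
            # Each non-key value is bound to its column's concept to form a term.
            obj = (value, column)
            terms.add(obj)
            # The predicate is derived from the non-key concept with 'has' semantics.
            statements.add((subject, "has" + column, obj))
    return terms, statements


# Hypothetical records; only the Adam / P1 pairing appears in the text.
rows = [
    {"ChemistName": "Adam", "Project": "P1"},
    {"ChemistName": "Ben", "Project": "P2"},
]
terms, statements = load_table(rows, key="ChemistName")

for subject, predicate, obj in sorted(statements):
    print(subject, predicate, obj)
```

Note that the function never inspects the names of the columns it is given; it depends only on the relational pattern of keys and non-key attributes, which is why the same loader can serve many relational sources.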

Figures 4 and 5 illustrate the result of the mechanical ETL for the three data sources shown in Fig. 3. For the purpose of our illustration, we assume that everything from the sources presented in Fig. 3 is loaded, but this need not be the case. We may freely choose which parts of a data source to load and when to load them. For example, we may choose to load specific views of the source data, or perhaps only the structure of a data source, lazily loading instances only when requested. Finally, the DDF can (and should) capture any desired metadata associated with the source artifacts, the ETL process itself, the quality / strength of semantic and association facts, or any other aspects of the data space elements. For simplicity we do not illustrate this.
Surveying the Unified Data Space Floor
We refer to the integrated data space that results simply from loading data into the DDF as the Unified Data Space Floor. We may explore this space through querying. For example, we may observe the spectrum of semantics of the sign Adam by issuing a query that asks, ‘What is Adam?’ The result set will include all the concepts associated with the sign Adam across all sources (i.e. ChemistName and Chemist). Note that this simple yet penetrating question cannot be answered by any traditional data integration solution.
Another simple but useful question that traditional data integration solutions cannot answer is: ‘Which data elements (i.e. signs) in source B also appear in source C?’ The result is: E1001, E2119, and E3327. By looking at the range of concepts associated with this result set, one may glean useful insight for data-model harmonization. For example, we find that E1001 is associated with the concept InventoryID in source B and EquipCode in source C. An analyst might therefore suspect that the two concepts are the same and, if this is confirmed, assert the equivalence at the data-model level. Thus insight obtained by the analysis of data instances may be applied more broadly as knowledge at the data-model level. This is but one example of how Layer 2 can inform Layer 3.
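The two floor queries discussed above can be sketched as follows. The sketch assumes the floor is held as a flat list of (sign, concept, source) records; the label "A" for the source that defines ChemistName, the concept name Lab, and the exact rows are hypothetical stand-ins for the sources of Fig. 3, chosen only to be consistent with the results quoted in the text.

```python
# Hypothetical contents of the unified data space floor:
# one (sign, concept, source) record per loaded term.
FLOOR = [
    ("Adam", "ChemistName", "A"),
    ("P1", "Project", "A"),
    ("Adam", "Chemist", "B"),
    ("Mary", "Chemist", "B"),
    ("E1001", "InventoryID", "B"),
    ("E2119", "InventoryID", "B"),
    ("E3327", "InventoryID", "B"),
    ("E1001", "EquipCode", "C"),
    ("E2119", "EquipCode", "C"),
    ("E3327", "EquipCode", "C"),
    ("L1", "Lab", "C"),
]


def concepts_of(sign):
    """'What is <sign>?': every concept bound to the sign, across all sources."""
    return sorted({concept for s, concept, _ in FLOOR if s == sign})


def shared_signs(source_a, source_b):
    """Signs that appear in both sources, regardless of concept."""
    in_a = {s for s, _, src in FLOOR if src == source_a}
    in_b = {s for s, _, src in FLOOR if src == source_b}
    return sorted(in_a & in_b)


print(concepts_of("Adam"))       # concepts bound to the sign Adam
print(shared_signs("B", "C"))    # signs common to sources B and C
print(concepts_of("E1001"))      # hints that InventoryID and EquipCode may coincide
```

Neither function needs to know anything about the schemas of the individual sources, which is the property the text attributes to the floor.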


By chaining such queries we can explore semantic associations and traverse the unified data space floor. For example, we may ask:
- Query: What terms are associated with the sign L1?
Result: [E1001, EquipCode], [E3327, EquipCode]
Analyst thinks: ‘This stuff is located in the same lab.’
- Query: What other concepts are associated with signs E1001 and E3327?
Result: InventoryID (from source B)
Analyst thinks: ‘I wonder if EquipCode is the same thing as InventoryID.’
- Query: Which signs of EquipCode match signs of InventoryID?
Result: E1001, E2119, E3327
Analyst thinks: ‘The concepts EquipCode and InventoryID probably do mean the same thing.’
- Query: What other concepts are associated with InventoryID?
Result: Chemist
- Query: Which Chemists are associated with [E1001,InventoryID] and [E3327,InventoryID]?
Result: [Adam, Chemist], [Mary, Chemist]
Analyst thinks: ‘Adam and Mary have equipment in the same lab, so they probably know each other.’
These queries illustrate the ability to perform “semantic drilling” into the DDF data space. We can ask series of questions that “surf” across the entire DDF data space unimpeded by barriers between source systems. One need not have specific semantic knowledge of the source systems in order to explore the data space this way and to extract useful insight. In the next section we will illustrate how this insight may be subsequently inserted back into the data space, as additional information and knowledge, to produce further semantic enrichment and fusion.
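The chain of queries above can be approximated in code. This is a sketch under stated assumptions rather than the query interface of the DDF store: statements are held as (subject, predicate, object) tuples of (sign, concept) terms, the predicate hasLab and the concept Lab are invented names, and the rows for Benjamin, E2119, and L2 are hypothetical filler. Chaining is done by matching signs across sources, which is all the floor permits before any equivalences are asserted.

```python
# Hypothetical statements produced by the mechanical load of sources B and C.
STATEMENTS = [
    (("Adam", "Chemist"), "hasInventoryID", ("E1001", "InventoryID")),
    (("Benjamin", "Chemist"), "hasInventoryID", ("E2119", "InventoryID")),
    (("Mary", "Chemist"), "hasInventoryID", ("E3327", "InventoryID")),
    (("E1001", "EquipCode"), "hasLab", ("L1", "Lab")),
    (("E2119", "EquipCode"), "hasLab", ("L2", "Lab")),
    (("E3327", "EquipCode"), "hasLab", ("L1", "Lab")),
]


def terms_linked_to_sign(sign):
    """Terms at the other end of any statement that involves the given sign."""
    linked = set()
    for subject, _predicate, obj in STATEMENTS:
        if subject[0] == sign:
            linked.add(obj)
        if obj[0] == sign:
            linked.add(subject)
    return sorted(linked)


# Step 1: what terms are associated with the sign L1?
equipment = terms_linked_to_sign("L1")

# Step 2: follow the equipment signs into the other source and keep
# only the terms whose concept is Chemist.
chemists = sorted(
    {
        term
        for sign, _concept in equipment
        for term in terms_linked_to_sign(sign)
        if term[1] == "Chemist"
    }
)

print(equipment)
print(chemists)
```

The second step crosses from source C to source B purely because the signs E1001 and E3327 occur in both, mirroring the analyst's reasoning in the list above.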

Enhancing the Data Space
Up to this point, we have discussed the data integration and analytic power of the unified data space floor that results simply from the mechanical loading of data into Layer 2. The breadth of integration, depth of semantic enrichment, and analytic power can all be dramatically improved when an analyst or an automated process builds upon this floor. This can be performed at the data instance level (Layer 2), the data-model level (Layer 3), or a combination of the two. The first regards the assertion of new instances of DDF elements (i.e. signs, terms, concepts, predicates, and statements). The second regards the enhancement and/or harmonization of source-specific data-models. The third regards the association of concepts and predicates asserted in Layer 2 with existing knowledge models in Layer 3.
For example, as illustrated in Fig. 5, we may introduce the predicate isEquivalent and use it to assert the statement [Ben, ChemistName] isEquivalent [Benjamin, Chemist]. Such statements, created at the data instance level, represent data integration. In addition, we may assert new associations at the data-model level to achieve global data-model integration (e.g. harmonization). This is illustrated in Fig. 6, wherein the concept ChemistName is asserted to be the same as the concept Chemist. The result of this assertion is that the meaning of all ChemistName terms becomes sameAs the meaning of all Chemist terms.


Exploring the Enriched Data Space
As we explore the enriched data space, surfing semantics and drilling associations, we find that previously disjoint regions of the space become reachable via the newly asserted data and associations. For example, having equated the concept ChemistName with Chemist, and InventoryID with EquipCode, an analyst can retrieve the projects that are located in a particular lab with essentially a single query.
Query: Which terms are associated with [L1, lab]?
Result: [E1001, EquipCode], [Adam, Chemist], [P1, Project]
Fig. 6 shows how the asserted associations (dashed) at the data-model level enable additional associations (dotted) to be inferred. This interplay of data and data-model integration is what ultimately allows us to “connect the dots.”
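One way to picture how concept-level assertions open up new paths is to treat equivalent concepts as a single node label and then walk the statement graph. The sketch below is an assumption-laden illustration, not the inference mechanism of the DDF: the hasLab predicate, the Lab concept, the three sample statements, the dictionary encoding of equivalences, and the use of breadth-first reachability as a stand-in for "associated with" are all introduced here for the example.

```python
from collections import deque

# Hypothetical statements from three sources, as loaded onto the floor.
STATEMENTS = [
    (("Adam", "ChemistName"), "hasProject", ("P1", "Project")),
    (("Adam", "Chemist"), "hasInventoryID", ("E1001", "InventoryID")),
    (("E1001", "EquipCode"), "hasLab", ("L1", "Lab")),
]

# Data-model level assertions: each concept maps to the concept it is equated with.
EQUIVALENCES = {"ChemistName": "Chemist", "InventoryID": "EquipCode"}


def canonical(term, equivalences):
    """Rewrite a term so that equivalent concepts share one label."""
    sign, concept = term
    return (sign, equivalences.get(concept, concept))


def reachable(start, equivalences):
    """All terms connected to start through any chain of statements."""
    graph = {}
    for subject, _predicate, obj in STATEMENTS:
        s, o = canonical(subject, equivalences), canonical(obj, equivalences)
        graph.setdefault(s, set()).add(o)
        graph.setdefault(o, set()).add(s)
    origin = canonical(start, equivalences)
    seen, queue = {origin}, deque([origin])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(origin)
    return sorted(seen)


before = reachable(("L1", "Lab"), {})            # floor only
after = reachable(("L1", "Lab"), EQUIVALENCES)   # with asserted equivalences

print(before)
print(after)
```

Without the equivalences the lab is connected only to its equipment term; with them, the chemist and the project become reachable as well, which is the effect the dotted associations in Fig. 6 are meant to convey.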
Application to SIMA
To enable the rapid, ad-hoc assimilation of diverse data into situational views useful for SIMA, we must overcome system, structural, and semantic barriers between data sourced from different systems. As illustrated in Fig. 7, traditional data integration approaches attempt to achieve this by imposing a tight commitment to a particular data-model or integration schema (i.e. canonical data-model). Unfortunately, choosing which of the source data elements to expose and mapping them to the canonical model inevitably leads to information loss and/or distortion, and the integration schema itself creates yet another semantic barrier.
In contrast, the DDF breaks the barriers between data sources to accommodate all within a single coherent data space. Simply loading data into the DDF in a largely automated fashion produces a fundamental level of data unity - the Unified Data Space Floor. No data-model harmonization need be made and yet non-trivial data integration results. Upon this floor, the DDF supports the construction of deeper integration and semantic enrichment at both the data instance and data-model levels without prescribing or constraining the processing by which such enrichment may be achieved. Any fusion or data integration method can be applied alone or in combination. Moreover, unlike other integration approaches, new data and associations, regardless of their origin, join seamlessly into the unified data space.
The DDF data space also supports the complete spectrum of applications and clients, from generic (i.e. those operating at the level of the DDF structure) to specific (i.e. those that have knowledge of a particular source data-model). Generic clients seamlessly span across the entire data space regardless of data source or associated data-model to perform analysis. Such clients require no modification as new data or semantics are introduced. Specific clients are able to operate with the same semantic depth in the DDF data space as they would on the source system itself since the DDF data space preserves the data and semantics of the source systems. In other words, the expressiveness and search capability native to those systems are retained [Yoakum 2008 JDIQ]. As the data space is increasingly enriched with semantics that bridge data-models, the depth of specific clients is retained while their breadth increasingly widens toward that of a generic client.



Conclusion
Successfully executing the constellation of activities that comprise SIMA, particularly in support of decision-making, requires exploiting information within a dynamic, heterogeneous, and distributed data environment that is largely beyond our control. The challenge, therefore, is to dynamically integrate data, information, and knowledge into one coherent intelligence repository to serve as a foundation for SIMA processes and operations. Current practice is insufficient in the face of this scale and complexity.
The approach presented in this paper overcomes the shortcomings of traditional data integration approaches using a framework, called the Data Description Framework, which enables the seamless integration of any structured data within and across data sources and models without the loss or distortion of data and semantics. Moreover, the framework supports a practical, stable implementation using any standard database system.
The simple, mechanical loading of source data and semantics into the DDF creates a unified data space floor that exhibits a primary level of data integration unmatched by traditional integration approaches. No up-front, heavy investment in data-model harmonization is required – one simply pours data on the floor. Deeper integration and semantic enrichment may then be pursued with any manual or automated processing operating either at the data instance or data-model levels.
The ultimate analytic power that is enabled by the DDF data space is essentially unlimited and exceeds that of any particular source system or traditional data integration solution at any level. Having the power and flexibility required to organize the transient and complex SIMA data space, it provides the ideal foundation on which to pursue SIMA.
Acknowledgements
The authors would like to thank the following US Army CERDEC I2WD personnel for their continued support: Mr. Anthony Lisuzzo, Director, Mr. Kesny Parent, DCGS-A Branch Chief, Ms. Virginia Goon IXFL Manager, and Mr. Norbert Antunes IXFL Computer Engineer. This work was funded by US Army CERDEC I2WD under contract number W15P7T-06-D-A401/009.
References
[Batini 1986] Batini, C. et al. A comparative analysis of methodologies for database schema integration, ACM Computing Surveys, (18) 4, 1986.
[Bernstein 2007] Bernstein, P. and Ho, H. Model Management and Schema Mappings: Theory and Practice, Proceedings of VLDB Conference, 2007.
[Halevy 2005] Halevy, A. et al. Enterprise information integration: successes, challenges and controversies, Proceedings of 24th International Conference on Management of Data, Baltimore, 2005.
[Northrop 2006] Northrop, L. et al. Ultra-Large-Scale Systems: The Software Challenge of the Future, Pittsburgh: Carnegie Mellon University, 2006. http://www.sei.cmu.edu/publications/books/engineering/uls.html
[Parent 1998] Parent, C. and Spaccapietra, S. Issues and approaches of database integration, Communications of the ACM, 41(5), 1998.
[Yoakum 2008 DAMA] Yoakum-Stover, S. and Malyuta, T. Unified Integration Architecture for Intelligence Data, DAMA International Europe Conference 2008, November 2008, London, UK.
[Yoakum 2008 JDIQ] Yoakum-Stover, S. and Malyuta, T. Unified Architecture for Integrating Intelligence Data, ACM Journal of Data and Information Quality. September 2008. Pending decision.