systover.net

January 1, 2009

Unified Data Integration for Situation Management

Filed under: data modeling, database, dataspaces, publications — Suzanne Yoakum-Stover @ 9:21 pm


S. Yoakum-Stover, Ph.D.

Potomac Institute for Policy Studies

US Army CERDEC I2WD Information Exploitation Futures Lab

T. Malyuta, Ph.D.

New York City College of Technology

Computer Systems Technology Department

Abstract

We propose a new solution for data integration and semantic enrichment in support of Situation Management (SIMA). Our solution applies to any modality (e.g., text, images, audio, signals) and embraces the diversity of data sources, types, and models, placing no restrictions on processes, applications, or users. It is database-centric and proceeds in stages to address the unified storage of structured data and its semantic enrichment in a way that remains viable in an Ultra-Large-Scale systems environment. The result is a layered data integration architecture that can accommodate any kind of data to coherently support the multiplicity of processing required for SIMA.

Challenge of Data Integration in Situation Management

Though generally scoped around a particular set of circumstances, or state of affairs, Situation Management (SIMA) is a mega-process occurring in a heterogeneous and volatile data space resulting from a cacophony of human and automated systems. To understand a situation and engineer the means for managing it, we must organize its data space. In particular, the heavy load of sophisticated processing for the anticipation, recognition, and influence of a situation must be girded with an architecture that enables data sourced from wildly disparate systems, having different modalities, structures, and semantics, to be integrated into one coherent body of situational knowledge.

In most business intelligence applications, data is integrated across information systems to support a choreographed interplay of services comprising an established set of business processes. In contrast, the constituent events in SIMA typically entail information systems that are far more diverse and whose dynamic interplay is less scripted, less repeatable, and therefore less predictable. Since many of these information systems capture data for completely different and unrelated purposes, and were never intended as participants in a coherent process, for SIMA we require a data architecture that enables them to be dynamically re-used or re-purposed. Because every situation is unique and we cannot anticipate all the right “business processes,” we need the capability to fuse data quickly and meaningfully on the fly, often in high volumes, from an ad hoc set of systems, and sometimes together with knowledge asserted by analysts.

Traditional approaches to data integration, both physical and virtual [Batini 1986, Parent 1998, Halevy 2005, Bernstein 2007], cannot accommodate the complexity, heterogeneity, and volatility of the SIMA data space. In actual practice, the canonical data-models that underlie such approaches, including federation, are simply too rigid. They cannot adapt their structure to handle new data sources, associations, processes, or applications without heavy manual intervention. Moreover, such approaches generally result in the loss and/or distortion of data, semantics, and context, all of which may be useful or even critical in SIMA. Even if such systems are initially successful, the IT costs of sustaining them, as well as the human costs resulting from their deficiencies, can be devastatingly high.

The scale and complexity of SIMA place it squarely in the domain of Ultra-Large-Scale (ULS) systems, which are characterized by decentralization; inherently conflicting, diverse, and unknowable requirements; heterogeneous, changing, and inconsistent elements; normal failures; continuous operation, evolution, and deployment; and immense scale along many dimensions [Northrop 2006]. As such, SIMA demands a supporting data architecture that remains viable in a freely evolving, interdependent collective of systems, people, policies, cultures, and economics, very little of which will ever be under our control. Our objective is to define such a solution.

Data Description Framework

To organize the SIMA data space in a ULS systems environment, we enable semantic data integration by providing for the unified storage of structured data. We embrace the diversity of domain-specific data-models by taking a data-model agnostic approach wherein the integration model makes the least possible commitment to any particular data-model. We achieve this by identifying the universal aspects inherent in all structured data and creating an integration model based on them. A key aspect of our approach is that the character and meaning of the source data-model are preserved and made accessible by the data store. The result is a data architecture that can accommodate any kind of data without placing restrictions on vocabulary, structure, semantics, or constraints, in a way that addresses the needs of the SIMA Community today while providing a seamless transition path toward a future of ULS systems imbued with semantic technologies.

The key to devising a domain-neutral storage model for structured data is to decouple that which varies, namely vocabularies and, more generally, data-models, from that which remains constant, namely the source artifact and, ideally, the storage structure. To achieve this, we consider structure, vocabulary, semantics, and constraints from a higher level of abstraction, from which we then distill a minimal set of elements sufficient to capture any data-model. These are illustrated in Fig. 1 and defined as follows:

Sign: A sign is a chunk of data, either physically located within a tangible artifact, or contained within an analyst’s mind. Examples of the former include a string of text in a document; an object within an image; a segment of audio in an audio stream; a spike in a signal. As illustrated in Fig. 1, regardless of the type of medium, tangible signs are always associated with a physical extent (i.e., a quantifiable span, which we call a mention) within the artifact. In contrast, signs that reside in an analyst’s mind become tangible when she writes down her thoughts.

Concept: A concept is an abstract idea, defined explicitly or implicitly by a source data-model. For example, the nodes of an ontology, the tag set in an XML Schema Document (XSD), and the attribute / table names in a relational database all represent concepts. Concept is an abstraction of such representations, which in the example of Fig. 1 includes Message, Person, and Body_text.

Predicate: A predicate is an abstract idea used to express a relationship between “things.” Predicates are used in the formation of statements (described below) and may be defined either explicitly or implicitly by a source data-model. For example, the arcs of an ontology and the attributes of an XML or database schema represent predicates. In Fig. 1, To, From, and Body represent predicates.

Term: A term is a disambiguated mention abstracted from the source artifact or asserting analyst. The process of disambiguation associates a mention with a concept, implicitly using the IsInstanceOf predicate. However, not every such pairing results in a distinct term. All signs that are identical, and that are identified as having the same meaning, are represented by a single term. In the example of Fig. 1, Suzi IsInstanceOf Person represents a term.

Statement: A statement encodes a binary relationship between a subject and an object mediated by a predicate. In our design, subject and object may be either a term or statement. The simplest kind of statement is one in which subject and object are terms. Statements in which the object is itself another statement represent reifications. Finally, a statement in which both subject and object are other statements represents a relationship between statements. In Fig. 1, we see three statements, all with the same subject, which is the term corresponding to the message itself.

This organization of these elementary constructs (sign, concept, predicate, term, and statement) defines a data reference model, which we call the Data Description Framework (DDF) [Yoakum 2008 DAMA]. Because it effectively decouples data from data-models and structured data from data-structures, it can encapsulate any sort of data-model and support any data-structure. Because it binds knowledge to data, it enables deep data integration and semantic enrichment. Because it provides a foundation for implementing a stable database, it serves as a practical data integration platform.
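The five constructs can be pictured as a handful of small record types. The following is only a minimal sketch under stated assumptions, not the DDF implementation: the class names mirror the definitions above, while TermStore, the Mention offsets, the artifact identifiers, the recipient "Tom", the body text, and the "Asserts" predicate are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Concept:
    name: str          # e.g. "Person", taken from a source data-model


@dataclass(frozen=True)
class Predicate:
    name: str          # e.g. "To", "From", "Body"


@dataclass(frozen=True)
class Mention:
    artifact_id: str   # hypothetical identifier of the source artifact
    start: int         # physical extent of the sign within the artifact
    end: int


@dataclass(frozen=True)
class Term:
    sign: str          # the chunk of data
    concept: Concept   # read as: sign IsInstanceOf concept


@dataclass(frozen=True)
class Statement:
    subject: Union[Term, "Statement"]
    predicate: Predicate
    obj: Union[Term, "Statement"]


class TermStore:
    """Maps each distinct term to every mention that resolves to it."""

    def __init__(self):
        self._mentions = {}

    def disambiguate(self, sign, concept, mention):
        # Identical signs with the same meaning collapse into one term.
        term = Term(sign, concept)
        self._mentions.setdefault(term, []).append(mention)
        return term

    def mentions_of(self, term):
        return list(self._mentions.get(term, []))


store = TermStore()
message, person, body_text = Concept("Message"), Concept("Person"), Concept("Body_text")

# All offsets and identifiers below are made up for the example.
msg = store.disambiguate("msg-001", message, Mention("msg-001", 0, 48))
suzi = store.disambiguate("Suzi", person, Mention("msg-001", 6, 10))
tom = store.disambiguate("Tom", person, Mention("msg-001", 15, 18))
body = store.disambiguate("Meeting moved to noon.", body_text, Mention("msg-001", 20, 42))
# A second mention of the same sign with the same meaning yields the same term.
suzi_again = store.disambiguate("Suzi", person, Mention("msg-002", 4, 8))

# Three statements sharing one subject: the term for the message itself.
statements = [
    Statement(msg, Predicate("From"), suzi),
    Statement(msg, Predicate("To"), tom),
    Statement(msg, Predicate("Body"), body),
]

# Reification: a statement whose object is itself a statement.
analyst = store.disambiguate("Analyst A", person, Mention("note-007", 0, 9))
reified = Statement(analyst, Predicate("Asserts"), statements[0])
```

Because the records carry only names and references, nothing in this sketch depends on the vocabulary of a particular source data-model; a new model would contribute new Concept and Predicate values rather than new record types.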


Unified Architecture for Integrating Intelligence Data (full paper)

Filed under: data modeling, database, dataspaces, publications — Suzanne Yoakum-Stover @ 9:18 pm

S. Yoakum-Stover, Ph.D.

Potomac Institute for Policy Studies

US Army CERDEC I2WD Information Exploitation Futures Lab

T. Malyuta, Ph.D.

New York City College of Technology

Computer Systems Technology Department

August 24, 2008

Abstract

The principal problem spanning the Intelligence Community today is how to integrate the great variety of disparate data into one single coherent repository of knowledge. Current practice, whereby all data-models would be merged into a single “Uber-model,” simply does not work. We require a solution that remains viable in a freely evolving, interdependent collective of human and computational systems, very little of which will ever be under our control. Our approach is database-centric and proceeds in stages. The first addresses the unified storage of the broad spectrum of artifacts existing within the Intelligence Enterprise today regardless of modality or representation. The second builds upon the foundation provided by the first to address the unified storage of structured data and semantic data integration. In both we embrace the diversity of data-models employed throughout the Intelligence Community. The result is a layered data architecture that can accommodate any kind of data without placing restrictions on vocabulary, structure, semantics, or constraints in a way that addresses today’s Intel needs while providing a seamless transition path toward a future of ULS systems imbued with semantic technologies.

Introduction

The principal problem spanning the Intelligence Community today is how to integrate the great variety of disparate data stores and streams, both legacy and bleeding-edge, into one single coherent repository of knowledge. Pieces of the Intel puzzle lie scattered in data silos sequestered by the very systems that served to create them. Each of these systems, including most of today’s Army Programs of Record, was built as an end-to-end solution with its own sensors, processors, and data stores, implemented and operated to achieve a specific intelligence objective. They were never meant to interoperate, share data, or even expose data beyond a narrow mission-focused enclave. The advent of network technologies and protocols, which have effectively eliminated the physical barriers between systems, has done little to bridge the chasm between these data silos. Although we can now transfer data over the wire, disparate and utterly incompatible data-models characterized by straightforward and subtle differences in vocabulary, structure, semantics, and constraints continue to stymie data integration efforts.

Data quality professionals widely recognize the importance of data integration and the need for efficient data integration approaches to redress a panoply of data quality problems [Lee 2006]. Unfortunately, current practice in data integration, whereby all data-models would be merged or harmonized, either physically or virtually [Batini 1986, Parent 1998, Halevy 2005, Bernstein 2007], fails to accommodate the demands of our fluid and rapidly growing Intelligence Enterprise. The physical mapping of disparate models into a single canonical data-model [Omelayenko 2001] is simply untenable as the scale and complexity of their subjects quickly overwhelm our tools and methods. Federation approaches share this defect and introduce new ones [Izydor 2007, Yero 2008]. In practice, these approaches provide only the illusion of data integration as they mainly integrate data-models, not the data itself, and in so doing confine all data to a model that is incapable of adapting itself or its contents as our knowledge about the domain evolves.

In all but the most constrained situations, what begins as a perfectly neat solution for a handful of systems quickly becomes intractable with scale, exposing not only the limitations of traditional implementations, but also of our grasp of the foundations of knowledge representation itself. This phenomenon is but one early symptom of our evolution toward Ultra-Large-Scale (ULS) systems [Northrop 2006] and, as such, invites a completely different approach, one that remains viable in a freely evolving, interdependent collective of systems, people, policies, cultures, and economics, very little of which will ever be under our control. Our objective is to define such a solution.

Conceptual Approach

Our approach to integrating intelligence data in a ULS systems environment is data-centric (as opposed to data-model centric) and proceeds in stages. The first addresses the unified storage of the entire spectrum of intelligence artifacts regardless of modality or representation. The second stage builds upon the foundation provided by the first to address the unified storage of structured data to enable semantic data integration. A third stage (beyond the scope of this paper) addresses unified storage of knowledge models. In all stages we embrace the diversity of domain-specific data-models employed throughout the Intelligence Community by taking a data-model agnostic approach wherein the integration model makes the least possible commitment to any particular data-model. In the case of “raw” artifacts, this means storing each according to its native representation without the application of structural or semantic transformations. In the case of structured artifacts, it means identifying the universal aspects inherent in all structured data and creating an integration model based on them. A key aspect of our approach is that the character and meaning of the source data-model are preserved and made accessible by the data store. The result is a layered Data Integration Framework that can accommodate any kind of data without placing restrictions on vocabulary, structure, semantics, or constraints, in a way that addresses the needs of the Intelligence Community today while providing a seamless transition path toward a future of ULS systems imbued with semantic technologies.

Scope

The types of intelligence collected by sensors and systems today span the electromagnetic spectrum to include all manner of signals, audio, video, and images, in addition to so-called human intelligence (e.g., text artifacts such as reports, messages, web pages). Our approach to data integration supports all of these simultaneously regardless of their underlying source data-model, or lack thereof. It does not, however, prescribe a solution for data-model harmonization. In particular, our approach imposes no relationship between the data-models to which the artifacts adhere. It does, however, allow such relationships, created by external processes of any sort, to be effectively represented and integrated together.

As the business of intelligence is to develop and communicate understanding (which entails the collection, exploitation, and provisioning of intelligence), intelligence business processing includes any automated activity that moves intelligence artifacts with respect to the cognitive hierarchy (see Fig. 1). This includes data collection, semantic enhancement and fusion from data to information to knowledge, and communication / collaboration to create understanding. In these terms, Layer 1 of our Data Integration Framework supports an aspect of collection and rudimentary exploitation. Layer 2 supports the processing by which data is enhanced with semantics to produce information, and the processing by which information is enhanced with richer associations to produce knowledge. Layer 3 supports the management and integration of knowledge models, and Layer 4 supports human-computer interfaces through which the analyst “sees” all of this intelligence. The scope of this paper is limited to Layers 1 and 2, which together support the provisioning of integrated intelligence at the level of data, information, and knowledge. Layers 3 and 4 will be the subject of subsequent papers.

Technical Approach

The broad and ever-changing spectrum of intelligence artifacts existing within the Intelligence Enterprise today reflects a nearly equally broad and ever-changing spectrum of intelligence collectors, producers, and consumers. The types of artifacts they generate vary tremendously in their modality (e.g., text, images, audio, video, signals) and representation (e.g., free text, XML, SQL, vector, raster). As this diversity is beyond our control, we term all such artifacts “indigenous.”

In Layer 1 of our Data Integration Framework, we seek to integrate the entire spectrum of indigenous artifacts by simply collecting them in one (possibly distributed) database using standard means for physical and/or virtual data integration. Crucial to our approach, however, is that we (a) avoid making any data or data-model transformations in the process of data ingestion and (b) make the least possible commitment to a data-model in the target storage schema. Consequently, the Layer 1 database schema is quite simple and flat, exposing a minimal set of essential meta-data fields whose main purpose is to support back-tracking to the original artifact and/or source. As illustrated in Fig. 2, the principal data element within a database record is the artifact itself, which is captured either physically or virtually (by way of a link or reference) in as close to its indigenous form as possible.
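A flat store of this kind can be sketched with a single table. What follows is an illustration under assumptions rather than the actual Layer 1 schema: the table name, the column names, the choice of SQLite, and the sample sources and link are all hypothetical. It stores each artifact either as unmodified bytes or as a link, alongside a few meta-data fields for back-tracking.

```python
import sqlite3

# Hypothetical flat schema: one record per artifact, minimal meta-data.
SCHEMA = """
CREATE TABLE artifact (
    artifact_id INTEGER PRIMARY KEY,
    source      TEXT NOT NULL,   -- originating system, for back-tracking
    collected   TEXT NOT NULL,   -- collection timestamp (ISO 8601 text)
    media_type  TEXT NOT NULL,   -- native representation, e.g. text/plain
    content     BLOB,            -- tangible artifact, stored unmodified
    link        TEXT,            -- virtual artifact, a reference to the source
    CHECK ((content IS NULL) != (link IS NULL))
)
"""


def open_store():
    """Create an in-memory store holding the single artifact table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def ingest(conn, source, collected, media_type, content=None, link=None):
    """Store one artifact as-is; exactly one of content or link must be given."""
    cur = conn.execute(
        "INSERT INTO artifact (source, collected, media_type, content, link) "
        "VALUES (?, ?, ?, ?, ?)",
        (source, collected, media_type, content, link),
    )
    return cur.lastrowid


def fetch(conn, artifact_id):
    """Return (source, media_type, content, link) for one artifact."""
    return conn.execute(
        "SELECT source, media_type, content, link FROM artifact WHERE artifact_id = ?",
        (artifact_id,),
    ).fetchone()


conn = open_store()
raw = b"From: Suzi\nTo: Tom\n\nMeeting moved to noon."
tangible = ingest(conn, "mail-gateway", "2008-08-24T21:18:00Z", "text/plain", content=raw)
virtual = ingest(conn, "image-archive", "2008-08-24T21:20:00Z", "image/tiff",
                 link="archive://hypothetical/scene-42")
```

The CHECK constraint is one way to express that a record captures its artifact either physically or virtually; no column reflects the artifact's internal structure, so ingestion requires no data or data-model transformation.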

Using a familiar analogy, if each indigenous artifact were to represent a single piece of a colossal Intel jigsaw puzzle, then Layer 1 of our Data Integration Framework is just the box in which we keep all the pieces. This most trivial form of integration has several important benefits: It provides a manageable yet powerful and standard interface to the source data. It gives us the option to either “lazily” load and cache data as “virtual artifacts” for performance’s sake, or persist and control data as “tangible artifacts” for the long term. It provides “one-stop shopping” access to the indigenous data for analysts who would otherwise need to navigate and obtain access to multiple disparate systems. And most importantly, this universal indigenous data store establishes a foundation upon which deep data integration can be more effectively pursued.

Structured Data

Every analyst engaged in intelligence processing either creates or uses structured data. Just as we do not control the sources or format of indigenous artifacts, we also do not control the various methods by which such artifacts might be structured or the data-models employed therein. Thus, just as the objective of Layer 1 is to accommodate the diversity of indigenous artifacts regardless of type or format, the objective of Layer 2 is to accommodate the diversity of all structured data regardless of vocabulary, organization, representation, or semantics.

Structured data necessarily adheres to some sort of model, which in general specifies vocabulary, structure, semantics, and constraints. Though not all data-models specify all of these, at minimum, every structured artifact entails a vocabulary reflecting a set of entity types (e.g. person, message) and an organization reflecting their relationships (e.g. message to person). These basic elements are illustrated in the simplified example of Fig. 3. Part (a) of the figure shows a short unstructured text message, and part (b) shows a data-model according to which a message might be structured. Part (c) then shows the original message structured according to the data-model and part (d) shows how that structured message is typically persisted in a database.
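The four parts described above can be mimicked in a few lines. This is only a sketch under assumptions, since the figure's actual contents are not reproduced here: the message text, the names Suzi and Tom, the MODEL dictionary, and the row layout are all hypothetical stand-ins.

```python
# (a) A hypothetical unstructured message.
RAW = "From: Suzi\nTo: Tom\n\nMeeting moved to noon."

# (b) A data-model: a vocabulary of entity types plus their relationships.
MODEL = {
    "entity_types": {"Message", "Person", "Body_text"},
    "relationships": {
        ("Message", "From", "Person"),
        ("Message", "To", "Person"),
        ("Message", "Body", "Body_text"),
    },
}


def structure(raw):
    """(c) Structure the raw message according to MODEL."""
    header, _, body = raw.partition("\n\n")
    fields = dict(line.split(": ", 1) for line in header.splitlines())
    fields["Body"] = body
    # Reject any field the data-model does not define.
    allowed = {predicate for _, predicate, _ in MODEL["relationships"]}
    unknown = set(fields) - allowed
    if unknown:
        raise ValueError(f"fields not in the data-model: {sorted(unknown)}")
    return fields


def to_row(message_id, structured):
    """(d) Flatten the structured message into one relational row."""
    return (message_id, structured["From"], structured["To"], structured["Body"])


structured = structure(RAW)
row = to_row(1, structured)
```

Note that the row layout in to_row hard-codes the model's vocabulary (From, To, Body), so a different data-model would need a different row shape.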

