S. Yoakum-Stover, Ph.D.
Potomac Institute for Policy Studies
US Army CERDEC I2WD Information Exploitation Futures Lab
T. Malyuta, Ph.D.
New York City College of Technology
Computer Systems Technology Department
August 24, 2008
Abstract
The principal problem spanning the Intelligence Community today is how to integrate the great variety of disparate data into one single coherent repository of knowledge. Current practice whereby all data-models would be merged into a single “Uber-model” simply does not work. We require a solution that remains viable in a freely evolving, interdependent collective of human and computational systems, very little of which will ever be under our control. Our approach is database-centric and proceeds in stages. The first addresses the unified storage of the broad spectrum of artifacts existing within the Intelligence Enterprise today regardless of modality or representation. The second builds upon the foundation provided by the first to address the unified storage of structured data and semantic data integration. In both we embrace the diversity of data-models employed throughout the Intelligence Community. The result is a layered data architecture that can accommodate any kind of data without placing restrictions on vocabulary, structure, semantics, or constraints in a way that addresses today’s Intel needs while providing a seamless transition path toward a future of ULS systems imbued with semantic technologies.
Introduction
The principal problem spanning the Intelligence Community today is how to integrate the great variety of disparate data stores and streams, both legacy and bleeding-edge, into one single coherent repository of knowledge. Pieces of the Intel puzzle lie scattered in data silos sequestered by the very systems that served to create them. Each of these systems, including most of today’s Army Programs of Record, was built as an end-to-end solution with its own sensors, processors, and data stores, implemented and operated to achieve a specific intelligence objective. They were never meant to interoperate, share data, or even expose data beyond a narrow mission-focused enclave. The advent of network technologies and protocols, which have effectively eliminated the physical barriers between systems, has done little to bridge the chasm between these data silos. Although we can now transfer data over the wire, disparate and utterly incompatible data-models characterized by straightforward and subtle differences in vocabulary, structure, semantics, and constraints continue to stymie data integration efforts.
Data quality professionals widely recognize the importance of data integration and the need for efficient data integration approaches to redress a panoply of data quality problems [Lee 2006]. Unfortunately, current practice in data integration, whereby all data-models would be merged or harmonized, either physically or virtually [Batini 1986, Parent 1998, Halevy 2005, Bernstein 2007], fails to accommodate the demands of our fluid and rapidly growing Intelligence Enterprise. The physical mapping of disparate models into a single canonical data-model [Omelayenko 2001] is simply untenable as the scale and complexity of their subjects quickly overwhelm our tools and methods. Federation approaches share this defect and introduce new ones [Izydor 2007, Yero 2008]. In practice, these approaches provide only the illusion of data integration as they mainly integrate data-models, not the data itself, and in so doing confine all data to a model that is incapable of adapting itself or its contents as our knowledge about the domain evolves.
In all but the most constrained situations, what begins as a perfectly neat solution for a handful of systems quickly becomes intractable with scale, exposing not only the limitations of traditional implementations, but also of our grasp at the foundations of knowledge representation itself. This phenomenon is but one early symptom of our evolution toward Ultra-Large Scale (ULS) systems [Northrop 2006] and as such, invites a completely different approach - one that remains viable in a freely evolving, interdependent collective of systems, people, policies, cultures, and economics, very little of which will ever be under our control. Our objective is to define such a solution.
Conceptual Approach
Our approach to integrating intelligence data in a ULS systems environment is data-centric (as opposed to data-model centric) and proceeds in stages. The first addresses the unified storage of the entire spectrum of intelligence artifacts regardless of modality or representation. The second stage builds upon the foundation provided by the first to address the unified storage of structured data to enable semantic data integration. A third stage (beyond the scope of this paper) addresses unified storage of knowledge models. In all stages we embrace the diversity of domain-specific data-models employed throughout the Intelligence Community by taking a data-model agnostic approach wherein the integration model makes the least possible commitment to any particular data-model. In the case of “raw” artifacts, this means storing each according to its native representation without the application of structural or semantic transformations. In the case of structured artifacts, it means identifying the universal aspects inherent in all structured data and creating an integration model based on that. A key aspect of our approach is that the character and meaning of the source data-model is preserved and made accessible by the data store. The result is a layered Data Integration Framework that can accommodate any kind of data without placing restrictions on vocabulary, structure, semantics, or constraints, in a way that addresses the needs of the Intelligence Community today while providing a seamless transition path toward a future of ULS systems imbued with semantic technologies.
Scope
The types of intelligence collected by sensors and systems today span the electro-magnetic spectrum to include all manner of signals, audio, video, and images, in addition to so-called human intelligence (e.g. text artifacts such as reports, messages, web pages). Our approach to data integration supports all of these simultaneously regardless of their underlying source data-model, or lack thereof. It does not, however, prescribe a solution for data-model harmonization. In particular, our approach imposes no relationship between the data-models to which the artifacts adhere. It does, however, allow such relationships, created by external processes of any sort, to be effectively represented and integrated together.

As the business of intelligence is to develop and communicate understanding (which entails the collection, exploitation, and provisioning of intelligence), intelligence business processing includes any automated activity that moves intelligence artifacts with respect to the cognitive hierarchy (see Fig.1). This includes data collection, semantic enhancement and fusion from data to information to knowledge, and communication / collaboration to create understanding. In these terms, Layer 1 of our Data Integration Framework supports an aspect of collection and rudimentary exploitation. Layer 2 supports the processing by which data is enhanced with semantics to produce information, and the processing by which information is enhanced with richer associations to produce knowledge. Layer 3 supports the management and integration of knowledge models, and Layer 4 supports human computer interfaces through which the analyst “sees” all of this intelligence. The scope of this paper is limited to Layers 1 and 2, which together support the provisioning of integrated intelligence at the level of data, information, and knowledge. Layers 3 and 4 will be the subject of subsequent papers.
Technical Approach
The broad and ever-changing spectrum of intelligence artifacts existing within the Intelligence Enterprise today reflects a nearly equally broad and ever-changing spectrum of intelligence collectors, producers, and consumers. The types of artifacts they generate vary tremendously in their modality (e.g. text, images, audio, video, signals) and representation (e.g. free text, XML, SQL, vector, raster). As this diversity is beyond our control, we term all such artifacts as “indigenous.”
In Layer 1 of our Data Integration Framework, we seek to integrate the entire spectrum of indigenous artifacts by simply collecting them in one (possibly distributed) database using standard means for physical and/or virtual data integration. Crucial to our approach, however, is that we (a) avoid making any data or data-model transformations in the process of data ingestion and (b) make the least possible commitment to a data-model in the target storage schema. Consequently, the Layer 1 database schema is quite simple and flat, exposing a minimal set of essential meta-data fields whose main purpose is to support back-tracking to the original artifact and/or source. As illustrated in Fig. 2, the principal data element within a database record is the artifact itself, which is captured either physically or virtually (by way of a link or reference) in as close to its indigenous form as possible.
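As a minimal sketch of such a flat Layer 1 record, the following uses an in-memory SQLite table. The table name, the metadata fields (`source`, `received_at`, `media_type`), the sample message bytes, and the archive URI are all illustrative assumptions and are not taken from Fig. 2; the only properties carried over from the text are that the artifact is stored untransformed, either physically or by reference, alongside a small set of back-tracking fields.

```python
import sqlite3

# Hypothetical Layer 1 schema: one flat table. Field names are assumptions.
LAYER1_SCHEMA = """
CREATE TABLE artifact (
    artifact_id INTEGER PRIMARY KEY,
    source      TEXT NOT NULL,  -- originating system, for back-tracking
    received_at TEXT NOT NULL,  -- ingestion timestamp (ISO 8601 text)
    media_type  TEXT NOT NULL,  -- modality / representation hint only
    content     BLOB,           -- tangible artifact, stored byte for byte
    uri         TEXT,           -- virtual artifact, stored as a reference
    CHECK (content IS NOT NULL OR uri IS NOT NULL)
)
"""


def open_store(path=":memory:"):
    """Create a fresh Layer 1 store (in memory by default)."""
    conn = sqlite3.connect(path)
    conn.execute(LAYER1_SCHEMA)
    return conn


def ingest(conn, source, received_at, media_type, content=None, uri=None):
    """Store an artifact without transforming it and return its record id."""
    cur = conn.execute(
        "INSERT INTO artifact (source, received_at, media_type, content, uri) "
        "VALUES (?, ?, ?, ?, ?)",
        (source, received_at, media_type, content, uri),
    )
    conn.commit()
    return cur.lastrowid


conn = open_store()

# A tangible artifact: the raw bytes are kept exactly as received.
raw = b"To: Suzi\nFrom: Tanya\nBring lunch"
tangible_id = ingest(conn, "msg-system-A", "2007-07-04T09:00:00Z",
                     "text/plain", content=raw)

# A virtual artifact: only a reference to the source is recorded.
virtual_id = ingest(conn, "video-archive-B", "2007-07-04T09:05:00Z",
                    "video/mpeg", uri="file:///archive/clip-17.mpg")

stored = conn.execute(
    "SELECT content FROM artifact WHERE artifact_id = ?", (tangible_id,)
).fetchone()[0]
```

The `CHECK` constraint is one way to express "physically or virtually": a record must carry the artifact bytes, a reference, or both (for example a cached copy of a remote artifact).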
Using a familiar analogy, if each indigenous artifact were to represent a single piece of a colossal Intel jigsaw puzzle, then Layer 1 of our Data Integration Framework is just the box in which we keep all the pieces. This most trivial form of integration has several important benefits: It provides a manageable yet powerful and standard interface to the source data. It gives us the option to either “lazily” load and cache data as “virtual artifacts” for performance sake, or persist and control data as “tangible artifacts” for the long term. It provides “one stop shopping” access to the indigenous data for analysts who would otherwise need to navigate and obtain access to multiple disparate systems. And most importantly, this universal indigenous data store establishes a foundation upon which deep data integration can be more effectively pursued.


Structured Data
Every analyst engaged in intelligence processing either creates or uses structured data. Just as we do not control the sources or format of indigenous artifacts, we also do not control the various methods by which such artifacts might be structured or the data-models employed therein. Thus as the objective of Layer 1 is to accommodate the diversity of indigenous artifacts regardless of type or format, the objective of Layer 2 is to accommodate the diversity of all structured data regardless of vocabulary, organization, representation, or semantics.
Structured data necessarily adheres to some sort of model, which in general specifies vocabulary, structure, semantics, and constraints. Though not all data-models specify all of these, at minimum, every structured artifact entails a vocabulary reflecting a set of entity types (e.g. person, message) and an organization reflecting their relationships (e.g. message to person). These basic elements are illustrated in the simplified example of Fig. 3. Part (a) of the figure shows a short unstructured text message, and part (b) shows a data-model according to which a message might be structured. Part (c) then shows the original message structured according to the data-model and part (d) shows how that structured message is typically persisted in a database.
Notice how the database schema is tightly coupled to the data-model that was used to structure the data, and how the raw message is bound to the data-model by the database. The data-model is imposed on the database, and the data itself is frozen into it such that no additional attributes or relationships are possible (without modifying the database schema). This is a severe shortcoming considering the tremendous variety of ways in which a given artifact might be structured or enhanced with additional features and associations. Even for the simple case shown in the figure, we can easily imagine data-models that use different entities (e.g. ‘Individual’ instead of ‘Person’), different relationships (e.g. ‘Sender’ instead of ‘From’), and different organizations (e.g. by including ‘MessageDate’), not to mention the wealth of other information external to the message itself (e.g. about ‘Tanya’) that might be brought to bear.
In a ULS systems environment, it is simply unreasonable to presume that the data-models or the various processes, either automated or manual, that structure data can be controlled or constrained. It is also unreasonable to presume that it is possible to anticipate the totality of their breadth or their application. To the contrary, the urgency and diversity driving our Intelligence Enterprise essentially guarantee that as many different methods for extracting entities, relationships, and events will be brought to bear as our imaginations and increasingly powerful technologies can support. Thus, although we might like to enhance Layer 1 of our Data Integration Framework by exposing all possible extracted elements along with their properties and attributes in order to support efficient querying, introducing an ever expanding array of fields and tables into a database is as impractical as attempting to accommodate every kind of data and purpose within a single canonical data-model.

The challenge, therefore, is to build the next layer of the Data Integration Framework to accommodate structured data in a way that exposes that structure for use, without imposing the structure on the data store itself. In other words, we must determine a method for storing and managing any kind of structured data, reflecting any data-model, so that it can be shared, efficiently exploited, and extended in unforeseen ways without requiring model-specific storage implementations. In short, we seek a universal storage model for structured data.
Data-Model Abstraction
The key to devising a domain-neutral storage model for structured data is to decouple that which varies, namely vocabularies and, more generally the data-models, from that which remains constant, namely the source artifact, and ideally the storage structure. To achieve this, we consider structure, vocabulary, semantics, and constraints from a higher level of abstraction from which we then distill a minimal set of elements sufficient to capture any data-model. These are defined as follows:
Sign: A sign, $g$, is a representation of a chunk of data, either physically located within a tangible artifact, or contained within an analyst’s mind. Examples of the former include a string of text in a document; an object within an image; a segment of audio in an audio stream; a spike in a signal. As illustrated in Fig. 4, regardless of the type of medium, a sign for tangible data is always associated with a physical extent within the artifact and has a quantifiable span, which we call a mention. In contrast, signs that reside in an analyst’s mind become tangible only when she writes down her thoughts. We explicitly include such intangible signs here to support the analyst’s ability to assert information directly into the data store without having to first represent it in a physical artifact. The set of all signs, $G = \{g_i\}$, spans across all data sources. In the set, each element is unique: $\forall i,j\ (i \neq j):\ g_i \neq g_j$. $G$ is the construct by which the DDF represents data. From the text data shown in Fig. 4, signs $G' = \{\text{‘Suzi’}, \text{‘Tanya’}, \text{‘July 4, 2007’}, \text{‘Bring lunch’}, \text{‘Message1’}\}$ contribute to $G$ (i.e. $G' \subseteq G$), though many more signs may be identified even from this simple example.

Concept: A concept, $c$, is a representation of an abstract idea, defined explicitly or implicitly by a source data-model. For example, the nodes of an ontology, the tag set in an XML Schema Document (XSD), and the attribute / table names in a relational database all represent concepts. In the set of all concepts $C = \{c_i\}$, each element is unique: $\forall i,j\ (i \neq j):\ c_i \neq c_j$. From the text data shown in Fig. 4, concepts $C' = \{\text{‘Message’}, \text{‘Person’}, \text{‘Body\_text’}\}$ contribute to the full set of concepts $C$ (i.e. $C' \subseteq C$).
Predicate: A predicate, $p$, is a representation of an abstract idea used to express a relationship between “things.” Predicates are used in the formation of statements (described below) and may be defined either explicitly or implicitly by a source data-model. For example, the arcs of an ontology, and the attributes of an XML or database schema represent predicates. In the set of all predicates $P = \{p_i\}$, each element is unique: $\forall i,j\ (i \neq j):\ p_i \neq p_j$. The text example of Fig. 4 contributes predicates $P' = \{\text{‘To’}, \text{‘From’}, \text{‘Body’}\}$ to the set of all predicates $P$ (i.e. $P' \subseteq P$). The only predicate that is “built into” (i.e. defined by) our storage model is the ‘IsInstanceOf’ predicate, which is used to disambiguate signs to form terms as described below. Concepts and predicates are the constructs by which we link to data-models and, thereby, explicitly expose data-semantics.
Term: A term, $t_{ij}$, is an ordered pair $\langle g_i, c_j \rangle$ where $g_i \in G$ and $c_j \in C$. Each term represents a disambiguated sign. The process of disambiguation associates a sign with a concept using the ‘IsInstanceOf’ predicate (though not every sign from $G$ is necessarily disambiguated, and not every concept from $C$ is necessarily used for disambiguation). In the set of all terms $T = \{t_{ij}\}$, each element is unique: $\forall i,j,k,l\ (i \neq k \text{ or } j \neq l):\ t_{ij} \neq t_{kl}$. The text example of Fig. 4 contributes terms $T' = \{t_1, t_2, t_3, t_4\}$, where $t_1 = \langle \text{‘Suzi’}, \text{‘Person’} \rangle$, $t_2 = \langle \text{‘Tanya’}, \text{‘Person’} \rangle$, $t_3 = \langle \text{‘Bring lunch’}, \text{‘Body\_text’} \rangle$, and $t_4 = \langle \text{‘Message1’}, \text{‘Message’} \rangle$, to the complete set of terms $T$ (i.e. $T' \subseteq T$).
Statement: A statement, $s$, encodes a binary relationship between a subject and an object mediated by a predicate. A statement is represented by an ordered triple $s_{ijh} = \langle \text{subject}_i, \text{predicate}_j, \text{object}_h \rangle$. Among the set of all statements, each element is unique: $\forall i,j,h,l,m,n\ (i \neq l \text{ or } j \neq m \text{ or } h \neq n):\ s_{ijh} \neq s_{lmn}$. In our model, subject and object may be either a term or a statement. The simplest kind of statement is one in which subject and object are terms: $s^0_{ijh} = \langle t_i, p_j, t_h \rangle$. Statements in which the object is itself another statement represent reifications: $s^1_{klm} = \langle t_k, p_l, s_m \rangle$. Finally, a statement in which both subject and object are other statements represents a relationship between statements: $s^2_{xyz} = \langle s_x, p_y, s_z \rangle$. The set of all statements is $S = \{s^0_{ijh}\} \cup \{s^1_{klm}\} \cup \{s^2_{xyz}\}$. The text example of Fig. 4 shows three statements, $S' = \{\langle t_4, \text{‘To’}, t_1 \rangle, \langle t_4, \text{‘From’}, t_2 \rangle, \langle t_4, \text{‘Body’}, t_3 \rangle\}$, all with the same subject, which is the term corresponding to the message itself. These statements contribute to the set of all statements, i.e. $S' \subseteq S$.
Note that the above definitions are formulated to be clear and unambiguous with respect to our particular approach and may not match those found in other literature. Throughout the paper, we will denote instances of signs, concepts, predicates, terms, and statements using Arial font within single quotes (e.g. ‘person’).
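The five constructs defined above can be sketched as immutable Python value types, using the message example from the text. The class layout, the `statement_kind` helper, and the ‘Asserts’ predicate and ‘Analyst1’ sign used for the reification are assumptions made for illustration; only the element names and the example values come from the definitions.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Sign:
    value: str


@dataclass(frozen=True)
class Concept:
    name: str


@dataclass(frozen=True)
class Predicate:
    name: str


@dataclass(frozen=True)
class Term:
    """A disambiguated sign: the ordered pair <sign, concept>."""
    sign: Sign
    concept: Concept


@dataclass(frozen=True)
class Statement:
    """An ordered triple <subject, predicate, object>."""
    subject: Union[Term, "Statement"]
    predicate: Predicate
    object: Union[Term, "Statement"]


def statement_kind(s):
    """Return 0, 1, or 2: how many of subject and object are statements."""
    return sum(isinstance(x, Statement) for x in (s.subject, s.object))


# Terms t1..t4 from the message example.
t1 = Term(Sign("Suzi"), Concept("Person"))
t2 = Term(Sign("Tanya"), Concept("Person"))
t3 = Term(Sign("Bring lunch"), Concept("Body_text"))
t4 = Term(Sign("Message1"), Concept("Message"))

# Frozen dataclasses are hashable, so a set enforces element uniqueness.
statements = {
    Statement(t4, Predicate("To"), t1),
    Statement(t4, Predicate("From"), t2),
    Statement(t4, Predicate("Body"), t3),
}
statements.add(Statement(t4, Predicate("To"), t1))  # duplicate, no effect

# A reification <term, predicate, statement>; predicate and analyst sign
# are hypothetical.
analyst = Term(Sign("Analyst1"), Concept("Person"))
reified = Statement(analyst, Predicate("Asserts"),
                    Statement(t4, Predicate("From"), t2))
```

Because equality is structural, two processes that independently produce the same term or statement yield a single element, which mirrors the uniqueness conditions stated for $T$ and $S$.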
DDF
Abstracted from the milieu of all possible data-models, these elementary constructs (concept, predicate, sign, term, and statement) provide the fixed-points of a data reference model that will ultimately form the basis of a practical data integration platform. We call it the Data Description Framework (DDF). Despite its simplicity, the DDF is an amazingly rich model that can be viewed from at least two different perspectives. From one perspective, the DDF encompasses a synergistic combination of two higher order models lying along different dimensions of abstraction – one that is outward-looking (“extrospective”), one inward-looking (“introspective”).
The extrospective portion of the model is a meta-model formed by (a) C and P, which look outward to domain knowledge (represented in data / knowledge models), and (b) G, which looks outward toward the data. Signs bring data into the DDF as first class entities which may then participate in various, unlimited conceptualizing relationships created by any sort of automated or manual process at any time. Signs provide a fundamental level of data integration (that traditional approaches lack) resulting from having eliminated data-model barriers. Concepts and predicates are to domain knowledge what signs are to data. They are the mechanism by which such knowledge (typically encoded in domain-specific data / knowledge models) is linked into the DDF and exposed by our Data Integration Framework for use and re-use.
The introspective portion of the model is a semantic model formed by T and S which abstract data-model internals to expose structure in a uniform way. Terms link instances to concepts, exposing the meaning of the data unambiguously with respect to the original source data-model. Statements represent semantic relationships about, within, and between disambiguated data elements.
Together the introspective and extrospective models that comprise the DDF enable both horizontal and vertical data integration. The extrospective abstraction bridges data and domain knowledge (vertical integration). The introspective abstraction bridges data structured by various disparate processes (horizontal integration) and binds the two outward looking faces of the extrospective model to provide a comprehensive data integration model.
From the second perspective, the DDF may be regarded as a synergistic combination of two interaction patterns – one that decouples, one that binds. DDF achieves decoupling in two ways. First, as a higher order data-model abstraction, DDF effectively decouples data from data-models. Thus, the DDF can encapsulate any sort of data regardless of the source data-model. Second, as a higher order data-structure, DDF effectively decouples structured data from data storage structures. Thus, the DDF can accommodate any data regardless of the source storage structure. As a result, the DDF provides a practical foundation for implementing a stable database that can accommodate any sort of structured data.
The ways in which DDF implements binding are illustrated in Fig. 5. Specifically, sign $g$ binds with concept $c$ to form term $t$, and predicate $p$ binds with term $t$ to form statement $s$. The diagram also indicates that a predicate may bind a term and a statement to form a reification, or may bind a statement with a statement to form a statement relationship. These bindings allow data to be integrated within and across data-models and continuously enriched into knowledge.
Together these interaction patterns make the DDF a powerful yet practical platform for data fusion. Decoupling gives DDF the character of a universal data store and successive bindings progressively move intelligence artifacts (or their constituent elements) upward through the cognitive hierarchy. The result is a universal data fusion platform that supports data structured by any means, unrestricted associations within and between them, and increasingly rich semantics.
Expressiveness
Although the expressiveness of the DDF is sufficient to capture the data and data-semantics of any structured data source, we illustrate this for the relational model since it is the most commonly used. Similar arguments can be made for other model types, such as hierarchical, object-oriented, and graph.
In accordance with common relational formalism [Date 2004], a relation $R$ is defined by the set of attributes $A = \{A_i\}$. The subset of attributes that comprise the primary key is denoted as $K = \{K_l\}$, $K \subseteq A$. The set of all data values in $R$ is $D = \{d_{ij}\}$, where $d_{ij}$ is a value on the intersection of attribute $A_i$ and row $W_j$. We can integrate data and its original semantics from $R$ into a DDF data space consisting of $G_0$, $C_0$, $P_0$, $T_0$, and $S_0$ according to the following procedure:
- All attributes of $R$ are added to the set of concepts: $C = C_0 \cup A$
- Non-key attributes are added to the set of predicates: $P = P_0 \cup (A \setminus K)$
- $D' = \{d'_i\}$ is the set of unique values of $D$: $\forall i,j\ (i \neq j):\ d'_i \neq d'_j$. The values in $D'$ that are not already present in $G_0$ are added to the set of signs: $G = G_0 \cup (D' \setminus G_0)$
- We build the set of terms $T_R = \{t_{ij}\}$ where $t_{ij} = \langle d_{ij}, A_i \rangle$ and $1 \le i \le n$, $1 \le j \le m$, with $n$ the number of attributes and $m$ the number of rows of $R$. $T'_R$ is the subset of unique terms of $T_R$. Terms of $T'_R$ are added to $T_0$: $T = T_0 \cup T'_R$
- We build the set of statements $S_R = \{s_{ij}\}$ where $s_{ij} = \langle \langle d_{kj}, K \rangle, A_i, \langle d_{ij}, A_i \rangle \rangle$ and $d_{kj}$ represents the combination of values of the key attributes for the row $W_j$. Statements of $S_R$ are added to $S_0$: $S = S_0 \cup S_R$
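The procedure above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reference implementation: the DDF space is modeled as a dictionary of sets, terms and statements as plain tuples, and the key subject $\langle d_{kj}, K \rangle$ as a pair of tuples (key values, key attribute names). The `Message` relation, its `MessageId` key, and the second row are hypothetical; the first row reuses the values of the message example.

```python
def empty_space():
    """An empty DDF data space: signs, concepts, predicates, terms, statements."""
    return {"G": set(), "C": set(), "P": set(), "T": set(), "S": set()}


def integrate_relation(space, attributes, key, rows):
    """Add relation R to the DDF space in place and return the space.

    attributes: list of attribute names (A)
    key:        list of key attribute names (K, a subset of A)
    rows:       list of dicts mapping attribute name -> hashable value
    """
    non_key = [a for a in attributes if a not in key]
    space["C"] |= set(attributes)   # C = C0 U A
    space["P"] |= set(non_key)      # P = P0 U (A \ K)
    for row in rows:
        # G = G0 U (D' \ G0): set union drops values that are already signs.
        space["G"] |= {row[a] for a in attributes}
        # T = T0 U T'_R: one term <d_ij, A_i> per cell, duplicates collapse.
        space["T"] |= {(row[a], a) for a in attributes}
        # <d_kj, K>: the combined key values paired with the key attributes.
        subject = (tuple(row[a] for a in key), tuple(key))
        for a in non_key:
            # S = S0 U S_R: statement <<d_kj, K>, A_i, <d_ij, A_i>>.
            space["S"].add((subject, a, (row[a], a)))
    return space


# Hypothetical relation Message(MessageId, To, From, Body), key MessageId.
attributes = ["MessageId", "To", "From", "Body"]
key = ["MessageId"]
rows = [
    {"MessageId": "Message1", "To": "Suzi", "From": "Tanya",
     "Body": "Bring lunch"},
    {"MessageId": "Message2", "To": "Tanya", "From": "Suzi",
     "Body": "Thanks"},
]

space = integrate_relation(empty_space(), attributes, key, rows)
```

Note that the function never inspects what the attributes mean, which is the sense in which integration is mechanical. The sketch omits the process metadata that would record which statements originated from $R$.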
Representation of R in DDF is lossless (no loss or distortion of data and semantics, even though semantics of R is not explicitly represented in DDF) because we can restore R from DDF:
- $R$ is contained in statements $S$; therefore, using processing metadata (described in the following section and shown in Fig. 6), extract from $S$ the statements that originated from $R$: $S_R = \{s_{ij}\}$ where $s_{ij} = \langle \langle d_{kj}, K \rangle, A_i, \langle d_{ij}, A_i \rangle \rangle$
- From $S_R$ restore the structure and rows of $R$ as follows:

| $K$ | $A_{k+1}$ | $A_{k+2}$ | … | $A_n$ |
| --- | --- | --- | --- | --- |
| $d_{k1}$ | $d_{k+1,1}$ | $d_{k+2,1}$ | … | $d_{n1}$ |
| $d_{k2}$ | $d_{k+1,2}$ | $d_{k+2,2}$ | … | $d_{n2}$ |
| … | … | … | … | … |
| $d_{km}$ | $d_{k+1,m}$ | $d_{k+2,m}$ | … | $d_{nm}$ |
The process that was used to build combinations of values of the key attributes can be reversed to get to the relation in its original form:
| $A_1$ | … | $A_k$ | $A_{k+1}$ | $A_{k+2}$ | … | $A_n$ |
| --- | --- | --- | --- | --- | --- | --- |
| $d_{11}$ | … | $d_{k1}$ | $d_{k+1,1}$ | $d_{k+2,1}$ | … | $d_{n1}$ |
| $d_{12}$ | … | $d_{k2}$ | $d_{k+1,2}$ | $d_{k+2,2}$ | … | $d_{n2}$ |
| … | … | … | … | … | … | … |
| $d_{1m}$ | … | $d_{km}$ | $d_{k+1,m}$ | $d_{k+2,m}$ | … | $d_{nm}$ |
Therefore, by the integration procedure described above, the data and data-semantics from R are faithfully represented with the DDF. The structure of R itself and its identity integrity are explicitly captured in Layer 3.
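The restoration step can be sketched as follows, assuming the statements that originated from $R$ have already been selected (in the framework this selection relies on process metadata, which the sketch does not model). Statements are encoded as tuples of the form ((key values, key attributes), predicate, (value, attribute)); this encoding, the function name, and the `Recipient` and `Status` attributes in the composite-key example are assumptions made for illustration.

```python
def restore_relation(statements_from_r):
    """Rebuild the rows of R from statements <<d_kj, K>, A_i, <d_ij, A_i>>."""
    rows = {}
    for (key_values, key_attrs), _predicate, (value, attribute) in statements_from_r:
        # Reverse the key combination: one column per key attribute.
        row = rows.setdefault(key_values, dict(zip(key_attrs, key_values)))
        # Each statement contributes one non-key cell of the row.
        row[attribute] = value
    return list(rows.values())


# Statements for one row of a hypothetical Message relation keyed by MessageId.
message_key = (("Message1",), ("MessageId",))
s_r = [
    (message_key, "To", ("Suzi", "To")),
    (message_key, "From", ("Tanya", "From")),
    (message_key, "Body", ("Bring lunch", "Body")),
]
restored = restore_relation(s_r)

# A composite key is split back into its separate columns.
composite_key = (("Message1", "Suzi"), ("MessageId", "Recipient"))
s_composite = [(composite_key, "Status", ("Read", "Status"))]
restored_composite = restore_relation(s_composite)
```

Here `restored` holds a single row with the columns MessageId, To, From, and Body, and `restored_composite` holds a single row in which the combined key has been expanded into MessageId and Recipient.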
This procedure further reveals two powerful and distinguishing features of the DDF:
- The DDF can accommodate data and data-semantics from structured sources without loss or distortion.
- Data sources may be integrated within the DDF in a mechanical fashion without requiring prior knowledge and/or analysis of their domain-specific data-models.
Towards Implementation
A universal storage model based on DDF can be implemented in a variety of ways (e.g. objects, relations, triples). We chose to use a relational Dimensional Data Modeling (DDM) approach [Kimball 2002] mainly because it handily accommodates the capture and use of the kinds of metadata that the Intelligence Community favors. In particular, we need to maintain not only metadata about the indigenous artifact itself (e.g. the who, what, when, and where of its creation and transmission), but also metadata regarding the processing by which signs, terms and statements are created. The former (i.e. contextual metadata) is captured in the Layer 1 storage structure as described previously. The latter, which we term “process metadata,” must be accommodated in Layer 2.
A high-level, conceptual database view of the DDF storage model based on the DDM design pattern is depicted in Fig. 6. Before discussing the diagram in detail, we begin with a very brief overview of the DDM. In general, the DDM is a business-process-centric database design pattern that aims to decouple rapidly changing business metrics (e.g. stock quantities) from slowly changing business objects (e.g. stock items). For each business process, it uses a star schema consisting of a central “fact-table” for storing quantitative metrics, linked to multiple “dimension-tables” for storing descriptive objects. The DDM as a pattern is most effective when dimensions are re-usable across business processes and a natural separation of time scales exists between the rate at which new facts are added (fast) and the rate at which dimensions change (slow).
As reflected in Fig. 6, the essential intelligence “business processes” that the DDF captures are semantic disambiguation and association formation. Thus, the DDF storage model consists of two main fact-tables, SemanticFact and AssociationFact. The SemanticFact table records metrics relating to the formation and disambiguation of signs, and references dimension tables that record signs, concepts, and process metadata. The signs themselves are represented using two tables, Sign and Mention. The value of a mention is identified by the region of the artifact in which it is localized. The boundary of such a region is recorded in the Mention table. The value of a sign may represent any number of source mentions that are exactly the same or are considered to be the same from the perspective of the process which extracts / identifies them. The Concept dimension records elements from the domain knowledge which includes the source artifacts’ data-models. Each record in the SemanticFact table binds a sign to a concept using ‘IsInstanceOf’ semantics.

The AssociationFact table records metrics relating to the formation of associations and references dimension tables that record statements, predicates, and process metadata. Recall that statements come in three types – an association between terms (i.e. statement), an association between a term and another statement (i.e. reification), and an association between two statements (statement relation). These are accommodated by the three subclasses of the Statement dimension which are Statement0, Statement1, Statement2 respectively. The Predicate dimension records predicates from the domain knowledge.
The ProcessMetadata package shown in Fig. 6 represents a collection of dimensional tables used to record operational and contextual metadata about the various external processes that create SemanticFact and AssociationFact records. The particular elements and formulation of this metadata would be designed to support the information assurance needs of the Intelligence Community. Typically these would include Date, Time, Creator, and SecurityClassification dimensions.
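To make the star-schema layout more concrete, the following is a rough SQLite sketch of the two fact-tables and their dimensions. Only the table names follow the text; every column, the single `kind` column standing in for the Statement0, Statement1, and Statement2 subclasses, the collapsed single-table ProcessMetadata dimension, and the helper functions are assumptions, and Fig. 6 may well differ in detail.

```python
import sqlite3

DDL = """
CREATE TABLE Sign (sign_id INTEGER PRIMARY KEY, value TEXT NOT NULL UNIQUE);
CREATE TABLE Mention (                  -- region of an artifact holding a sign
    mention_id INTEGER PRIMARY KEY,
    sign_id INTEGER NOT NULL REFERENCES Sign(sign_id),
    artifact_id INTEGER NOT NULL, start_pos INTEGER, end_pos INTEGER);
CREATE TABLE Concept (concept_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE Predicate (predicate_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE ProcessMetadata (
    process_id INTEGER PRIMARY KEY,
    creator TEXT, created_at TEXT, classification TEXT);
CREATE TABLE SemanticFact (             -- one row per term <sign, concept>
    term_id INTEGER PRIMARY KEY,
    sign_id INTEGER NOT NULL REFERENCES Sign(sign_id),
    concept_id INTEGER NOT NULL REFERENCES Concept(concept_id),
    process_id INTEGER NOT NULL REFERENCES ProcessMetadata(process_id));
CREATE TABLE Statement (                -- kind 0/1/2 ~ Statement0/1/2
    statement_id INTEGER PRIMARY KEY,
    kind INTEGER NOT NULL CHECK (kind IN (0, 1, 2)),
    subject_id INTEGER NOT NULL, object_id INTEGER NOT NULL);
CREATE TABLE AssociationFact (
    association_id INTEGER PRIMARY KEY,
    statement_id INTEGER NOT NULL REFERENCES Statement(statement_id),
    predicate_id INTEGER NOT NULL REFERENCES Predicate(predicate_id),
    process_id INTEGER NOT NULL REFERENCES ProcessMetadata(process_id));
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
process_id = conn.execute(
    "INSERT INTO ProcessMetadata (creator, created_at, classification) "
    "VALUES (?, ?, ?)", ("extractor-1", "2008-08-24", "UNCLASSIFIED")).lastrowid


def lookup(table, value):
    """Return the id of a Sign/Concept/Predicate row, inserting it if new."""
    column = "value" if table == "Sign" else "name"
    conn.execute(f"INSERT OR IGNORE INTO {table} ({column}) VALUES (?)", (value,))
    return conn.execute(
        f"SELECT rowid FROM {table} WHERE {column} = ?", (value,)).fetchone()[0]


def add_term(sign, concept):
    """Record a SemanticFact binding a sign to a concept; return the term id."""
    return conn.execute(
        "INSERT INTO SemanticFact (sign_id, concept_id, process_id) "
        "VALUES (?, ?, ?)",
        (lookup("Sign", sign), lookup("Concept", concept), process_id)).lastrowid


def add_statement0(subject_term, predicate, object_term):
    """Record a term-to-term statement and its AssociationFact."""
    statement_id = conn.execute(
        "INSERT INTO Statement (kind, subject_id, object_id) VALUES (0, ?, ?)",
        (subject_term, object_term)).lastrowid
    conn.execute(
        "INSERT INTO AssociationFact (statement_id, predicate_id, process_id) "
        "VALUES (?, ?, ?)",
        (statement_id, lookup("Predicate", predicate), process_id))
    return statement_id


t1 = add_term("Suzi", "Person")
t2 = add_term("Tanya", "Person")
t3 = add_term("Bring lunch", "Body_text")
t4 = add_term("Message1", "Message")
add_statement0(t4, "To", t1)
add_statement0(t4, "From", t2)
add_statement0(t4, "Body", t3)

# Read back all term-to-term statements as (subject, predicate, object) values.
triples = conn.execute("""
    SELECT ss.value, p.name, os.value
    FROM AssociationFact af
    JOIN Statement st     ON st.statement_id = af.statement_id
    JOIN Predicate p      ON p.predicate_id = af.predicate_id
    JOIN SemanticFact sf  ON sf.term_id = st.subject_id
    JOIN Sign ss          ON ss.sign_id = sf.sign_id
    JOIN SemanticFact obf ON obf.term_id = st.object_id
    JOIN Sign os          ON os.sign_id = obf.sign_id
    WHERE st.kind = 0
    ORDER BY p.name
""").fetchall()
```

The point of the sketch is that the table set stays fixed no matter which source data-model the signs, concepts, and predicates come from; new vocabularies add rows, not columns or tables.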
The DDF does not prescribe or constrain the processing by which the DDF storage model would be populated, and the nature of such processing depends both on the modality and structure (or lack thereof) of the indigenous artifacts. Nevertheless, to illustrate how DDF works, and provide more insight into the relationship between external processes and our Data Integration Framework, the interested reader may find a brief discussion of the processing by which Layers 1 and 2 would be populated in the Appendix.
Relation to Other Approaches
A large body of work exists on data integration approaches [Batini 1986, Parent 1998, Halevy 2005, Bernstein 2007], many of which have contributed to successful Enterprise Information Integration solutions. However, because they are all based on some kind of data-model harmonization (i.e. mapping), they fail to provide a practical solution for ULS intelligence data integration. In particular, data-model integration does not address data integration, which intelligence data processing requires. Physical data integration, typical of data warehouse applications, also requires heavy up-front data-model analysis and harmonization. This activity is not only resource intensive, it often results in the loss and/or distortion of data and its semantics which, in the context of intelligence, may reduce the richness and power of the data. DDF addresses the needs of the Intelligence Community by providing ad-hoc, lossless data integration without imposing a heavy pre-processing burden.
Because they are born from a similar abstraction, the elementary constructs at the foundation of our reference model “share DNA” with those of the Resource Description Framework (RDF) [RDF 2004]. In particular, DDF terms are cousin to RDF resources – both existing at the atomic level of data as so-called “first class citizens” which may participate in arbitrary associations. However, whereas RDF aims at exposing machine-processable semantics and supporting logical inference, DDF aims at data integration and breaking the barriers between data sources. Consequently, DDF reaches further down into data to explicitly capture the grounding of terms within artifacts (and analyst’s thought) through the use of signs, and reaches up more broadly into knowledge models to expose data-semantics regardless of their machine processability. The fundamental difference is that RDF is an instance of a language for expressing semantic relationships, while DDF is a framework for data integration that can accommodate data represented by any language. Thus, while DDF powerfully supports RDF, it neither requires nor replaces it.
The Object Management Group has defined four increasing levels of software program abstraction from implementation / platform to pure abstract model [MOF 2000]. Decoupling the program model from the implementation makes it possible to develop tooling that can automatically generate platform specific implementations by combining the program model with implementation specific configuration information. Essentially a program instance = abstract model + specific “configuration” data. In the case of DDF, we present increasing levels of abstraction of structured data from implementation / representation to pure conceptual model (i.e. from Layer 1 to Layer 3). Decoupling the conceptual model from the implementation makes it possible to store variously structured data within a single DB. Essentially, structured data = data + abstract conceptual model (i.e. DDF) + specific data-model.
The Information Model Interoperability Reference Model (IMI) [Melnik 2000; Omelayenko 2001], proposed for presenting information on the web, consists of three layers – syntax, object, and semantic. The syntax layer represents serialized data content, similar to our indigenous text artifacts. The semantic layer provides semantics through data-models and languages, and the object layer provides a bridge between the two. In contrast to the DDF, however, the IMI does not provide a practical model for implementation of the layers and their interfaces.
The Data Reference Model (DRM) of the Federal Enterprise Architecture (FEA) aims to provide standards for the description, categorization, and sharing of data [DRF 2005]. Like DDF, the DRM entails a data-model metamodel, but unlike DDF it does not resolve the issues of data integration and unfortunately exhibits the typical shortcomings of most physical and virtual data integration approaches.
Finally, the Common Warehouse Model (CWM) [CWM 2001] offers a standardized approach (and tools that support it) for representing and mediating the automated interchange of metadata in warehouse applications that involve multiple data sources and data processing applications. Being focused on metadata integration, as opposed to data integration, the CWM mainly addresses issues relating to Layer 3 of our Data Integration Framework.
Current & Future Work
Today there is a deployed system called the Joint Intelligence Operational Capability in Iraq (JIOC-I) that essentially implements Layer 1 of our Data Integration Framework, though only for text artifacts. Unfortunately, the JIOC-I by itself falls short of a complete integration solution because it does not address structured data in a way that exposes that structure to support further analytical processing and visualization. In other words, it lacks Layer 2. Consequently, there has been much criticism of the JIOC-I, along with various suggestions for “fixing” it (e.g. by extending the schema to accommodate structured data). In contrast, we recognize the JIOC-I as a foundational element (that got it mostly right) and a first step toward a ULS intelligence system that integrates data while embracing data diversity. Indeed, the JIOC-I was the inspiration that led us to develop the layers above, and the DDF in particular.
Implementations of Layers 1 and 2 of our Data Integration Framework are being developed and tested in the Army CERDEC I2WD Information Exploitation Futures Laboratory (IXFL). As there are many possible physical implementations of the logical model, the challenge is to find one that optimally satisfies the functional (e.g. usability) and non-functional requirements (e.g. performance, manageability, and maintainability) of the Intelligence Community. Beyond the physical schema development, we have implemented a data ingest system along with processes for structuring unstructured data in order to fully exercise the system.
Other key aspects of our Data Integration Framework are described elsewhere. [Yoakum 2008 IQIS] highlights the low barrier to entry for data integration by describing the process for lossless mechanical data ingestion which requires no costly pre-processing or data-model harmonization. Data surfing, drilling, and discovery on the DDF unified data space are described in [Yoakum 2008 IQIS]. Finally, [Yoakum 2008 SIMA] addresses the utility of DDF in Situation Management – another activity that requires rapid, ad-hoc data integration. Forthcoming papers will address Layer 3, insight and results from our DDF prototype work, and fundamental aspects relating to knowledge representation.
As they are developed, Layers 3 and 4 of our Data Integration Framework will provide fertile ground for entirely new work in knowledge interaction and perception. Layer 3 will become a universal substrate on which to explore, discover, and encode relationships between knowledge models that go well beyond harmonization and integration to include, for example, dissonant perspectives which cannot and should not be “harmonized.” Layer 4 provides the lenses through which the human user looks into this morass of knowledge, information, and data to explore and make sense of the object of his interest (e.g. a domain, situation, entity) according to a chosen perspective. Having all four layers present will close the loop between data and knowledge in both directions so that they may co-evolve to yield more complete and accurate understanding. Atop the immense foundation of integrated data provided by Layers 1 and 2, Layers 3 and 4 will fuel the engines of ULS systems research far into the future.
Conclusion
The Intelligence Enterprise is inexorably evolving into an Ultra Large Scale Systems world that cannot, and will not, be constrained in its processes or products. The data integration problem is but one early symptom of this burgeoning reality. Although this knowledge does not provide a recipe for good solutions, it makes it rather easy to spot bad ones. Unfortunately, current data integration approaches generally represent the latter.
In this paper, we have presented the first two layers of a multi-layer Data Integration Framework that enables deep semantic data integration in a ULS systems environment. The model on which it is founded, the DDF, supports both horizontal and vertical data integration (i.e. across disparate data-models and from data to knowledge) by embracing the diversity of data / knowledge models and processes by which data is structured. More importantly, the model admits a practical implementation (i.e. “hard running code”) that accommodates artifacts of any modality (e.g. text, audio, images, video, signals) in a single unified data store that enables true data fusion and the continuous enrichment of data into knowledge. Awash in a sea of fragmented data, and driven by a palpable sense of urgency, we aspire to drive both the theory and practice of data integration forward.
References
[Batini 1986] Batini, C. et al. A comparative analysis of methodologies for database schema integration, ACM Computing Surveys, (18) 4, 1986.
[Bernstein 2007] Bernstein P., Ho, H. Model Management and Schema Mappings: Theory and Practice, Proceedings of VLDB Conference, 2007.
[CWM 2001] Object Management Group “Common Warehouse Model (CWM) Specification”, OMG, 2001. http://www.omg.org/docs/ad/01-02-01.pdf
[Date 2004] Date, C. An Introduction to Database Systems, 8th edition, Addison Wesley, 2004.
[DRF 2005] Federal Enterprise Architecture Program “The Data Reference Model”, 2005. http://www.whitehouse.gov/omb/egov/documents/DRM_2_0_Final.pdf
[Halevy 2005] Halevy, A. et al. Enterprise information integration: successes, challenges and controversies, Proceedings of 24th International Conference on Management of Data, Baltimore, 2005.
[Izydor 2007] Izydor, C. and McCollum, P. BI, Process and Integration Trends. DM Review Magazine, August 2007. http://www.dmreview.com/issues/20070801/1089409-1.html?portal=data_integration
[Kimball 2002] Kimball, R. and Ross, M. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, Wiley, 2002.
[Lee 2006] Lee, Y., Pipino, L., Funk, J., Wang, R. Journey to Data Quality, The MIT Press, Cambridge, MA, 2006.
[Melnik 2000] Melnik, S. and Decker, S. A layered approach to Information Modeling and Interoperability on the Web. Proc. ECDL’00 Workshop on the Semantic Web, Lisbon, Portugal, Sept 2000. http://infolab.stanford.edu/~melnik/pub/sw00/.
[MOF 2000] Object Management Group “MetaObject Facility (MOF) Specification”, OMG, 2000. http://www.omg.org/docs/formal/00-04-03.pdf
[Northrop 2006] Northrop, L., et al., Ultra-Large-Scale Systems: The Software Challenge of the Future, Pittsburgh: Carnegie Mellon University, 2007. http://www.sei.cmu.edu/publications/books/engineering/uls.html
[Omelayenko 2001] Omelayenko, B. and Fensel, D. An Analysis of B2B Catalogue Integration Problems. Proceedings of the International Conference on Enterprise Information Systems (ICEIS-2001), July 7-10, 2001, p. 945-952.
[Parent 1998] Parent, C. and Spaccapietra, S. Issues and approaches of database integration, Communications of the ACM, 41(5), 1998.
[RDF 2004] RDF Core Working Group “Resource Description Framework (RDF)”, W3C, 2004. http://www.w3.org/RDF/.
[Steinberg 1998] Steinberg, N., Bowman, C. L. and White F. E. Revision to the JDL Data Fusion Model, Joint NATO/IRIS Conference, Quebec City, October 1998.
[Yero 2008] Yero, J. Logical vs. Physical Data Integration: A Practical Decision Guide, The DAMA International Symposium & Wilshire Meta-Data Conference, San Diego, CA, 2008.
[Yoakum 2008 IQIS] Yoakum-Stover, S. and Malyuta, T. Unified Architecture for Integrating Intelligence Data, Proceedings of MIT Information Quality Industry Symposium, MIT, Cambridge, MA, 2008.
[Yoakum 2008 DAMA] Yoakum-Stover, S. and Malyuta, T. Unified Integration Architecture for Intelligence Data, Proceedings of DAMA International Europe Conference, London, UK, 2008.
[Yoakum 2008 SIMA] Yoakum-Stover, S. and Malyuta, T. Unified Data Integration for Situation Management, Proceedings of the 4th IEEE Workshop on Situation Management (SIMA 2008) at MILCOM 2008, San Diego CA, 2008.
Appendix - Processing
Ingestion
Consider first, processes that load indigenous artifacts into Layer 1 either physically or virtually so that they may be unambiguously referenced within Layer 2. Typically these are called ingestion processes. Such processes insert either the entire indigenous artifact, or a reference to its location within the authoritative data source, into Layer 1. In addition, both artifact and process metadata are recorded in the appropriate metadata tables. The former essentially provides a card catalogue for the artifact and the latter provides information assurance.
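The ingestion step described above can be sketched as follows. This is a minimal illustration only: the names Artifact, ProcessRecord, and ingest, the use of a content hash as identifier, and the in-memory records standing in for the Layer 1 and metadata tables are all assumptions made for the example, not part of the DDF schema.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Artifact:
    """Hypothetical Layer 1 record: the artifact itself or a reference to it."""
    artifact_id: str            # content hash, assumed here as a stable identifier
    content: Optional[bytes]    # populated for physical ingestion
    location: Optional[str]     # populated for virtual ingestion (reference only)
    media_type: str             # artifact metadata (the "card catalogue" entry)


@dataclass
class ProcessRecord:
    """Hypothetical process metadata supporting information assurance."""
    artifact_id: str
    process_name: str
    ingested_at: datetime


def ingest(content: bytes, media_type: str, process_name: str,
           location: Optional[str] = None):
    """Store the artifact itself, or only a reference when a location is given."""
    artifact_id = hashlib.sha256(content).hexdigest()
    artifact = Artifact(
        artifact_id=artifact_id,
        content=None if location else content,
        location=location,
        media_type=media_type,
    )
    process = ProcessRecord(artifact_id, process_name, datetime.now(timezone.utc))
    return artifact, process
```

Calling ingest with a location yields a virtual entry that holds only the reference, while omitting it stores the full content; in both cases the artifact and process records share the same identifier.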
Unstructured Information
Processes that structure unstructured artifacts generate SemanticFact and AssociationFact records in Layer 2. Each such process necessarily entails a particular data-model. This data-model is persisted in Layer 3. Concepts and predicates from the data-model (or references to them) are also persisted in the Concept and Predicate dimension tables of Layer 2 along with sufficient metadata to identify and retrieve the data-model source artifact (i.e. schema, ontology, etc.).
Unstructured information processing typically identifies all instances of the concepts within its data-model or type system. For example, a given text extractor may identify all occurrences of ‘IBM’ and associate them with the concept ‘Company.’ Each such instance is represented as a DDF mention. The position of each mention within the source artifact is recorded in the Mention table (e.g. using beginChar, endChar) and a single record is added to the Sign table using, for example, the actual contents of the span (‘IBM’) as the sign value. Each disambiguation occurrence (i.e. the association made by the text extractor between a mention and a concept) is recorded in the SemanticFact table along with appropriate process metadata, and a term consisting of <sign, concept> is created in the Term table (if such a term does not already exist).
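As a rough illustration of this bookkeeping, the sketch below records every occurrence of a surface string as a mention. The Store class and record_mentions function are hypothetical stand-ins for the Mention, Sign, Term, and SemanticFact tables, the simple substring search stands in for a real text extractor, and treating endChar as exclusive is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """In-memory stand-ins for four Layer 2 tables (illustrative only)."""
    mentions: list = field(default_factory=list)        # Mention table
    signs: set = field(default_factory=set)             # Sign table
    terms: set = field(default_factory=set)             # Term table: (sign, concept)
    semantic_facts: list = field(default_factory=list)  # SemanticFact table


def record_mentions(store, artifact_id, text, surface, concept, process):
    """Record every occurrence of `surface` in `text` as a mention of `concept`."""
    start = text.find(surface)
    while start != -1:
        end = start + len(surface)  # endChar is exclusive in this sketch
        mention = {"artifact": artifact_id, "beginChar": start, "endChar": end}
        store.mentions.append(mention)
        store.signs.add(surface)             # one Sign record per distinct value
        store.terms.add((surface, concept))  # added only if not already present
        store.semantic_facts.append(
            {"mention": mention, "concept": concept, "process": process}
        )
        start = text.find(surface, end)
    return store
```

Running this over a text that contains ‘IBM’ twice produces two mentions and two semantic facts, but only one sign and one term, which mirrors the "if such a term does not already exist" rule.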
Further semantic processing may identify relationships between elements within the artifact. The elements themselves would have already been recorded as SemanticFacts. For each such relationship, an AssociationFact is recorded along with appropriate process metadata, and a Statement table entry is created.
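A relationship between two already-recorded elements might be captured along the following lines. The function name, the dictionary fields, and the representation of a statement as a (term, predicate, term) triple are assumptions made for illustration; the Statement and AssociationFact tables are modeled here as a plain set and list.

```python
def record_association(statements, association_facts,
                       subject_term, predicate, object_term,
                       subject_fact_id, object_fact_id, process):
    """Add a statement (if new) and one AssociationFact linking two recorded facts."""
    statement = (subject_term, predicate, object_term)
    statements.add(statement)  # a set keeps one entry per distinct statement
    association_facts.append({
        "subjectFact": subject_fact_id,   # identifier of an existing SemanticFact
        "objectFact": object_fact_id,     # identifier of an existing SemanticFact
        "statement": statement,
        "process": process,               # process metadata for this assertion
    })
    return statement
```

Under these assumptions, the same relationship found in two places yields two AssociationFact records that share a single statement entry.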
Unstructured information processing of artifacts other than text is similar. The main differences are that entries in the Mention table will have a different spanCoordinateType, and the method for assigning a sign value will be different. For example, consider object recognition software that extracts faces from within an image of a crowd. For each extracted face, the corresponding rectangular area of the image could be recorded in the Mention table with the help of pixelUpperLeft and pixelLowerRight, and a sign (e.g. ‘faceImage’) would be assigned to all extracted mentions.
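The face example can be sketched as below. The ImageMention class, the face_mentions function, the (x, y, width, height) box format, and the "pixel" value for spanCoordinateType are illustrative assumptions; only the field names pixelUpperLeft, pixelLowerRight, and spanCoordinateType come from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ImageMention:
    """Hypothetical Mention record for a rectangular region of an image."""
    artifact_id: str
    spanCoordinateType: str
    pixelUpperLeft: Tuple[int, int]
    pixelLowerRight: Tuple[int, int]
    sign: str


def face_mentions(artifact_id: str,
                  boxes: List[Tuple[int, int, int, int]],
                  sign: str = "faceImage") -> List[ImageMention]:
    """Turn detector boxes given as (x, y, width, height) into mention records."""
    return [
        ImageMention(artifact_id, "pixel", (x, y), (x + w, y + h), sign)
        for x, y, w, h in boxes
    ]
```

Every extracted region receives the same sign value, so the regions differ only in their coordinates and can later be disambiguated against concepts in the same way as text mentions.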
Extract-Transform-Load
Consider next, Extract-Transform-Load (ETL) processes that pull data from other structured data sources, typically databases, into Layer 2. The initial phase of the ETL loads the source data-model (e.g. database data dictionary) into Layer 3, and concepts and predicates (or their references) into the Concept and Predicate dimension tables of Layer 2. Metadata sufficient to identify and retrieve the data-model source artifact (i.e. schema) are also stored. Subsequent ETL processing, which entails a mapping to the DDF structure, inserts signs, terms, and statements into the SemanticFact and AssociationFact tables along with appropriate process metadata.
Because the ETL process needs only to capture the explicit semantics of the underlying model of the source (e.g. relational, hierarchical, graph…), one ETL can be developed for a whole class of data stores. For example, a discussion of ETL for relational stores may be found in [Yoakum 2008 IQIS].
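To illustrate how a single ETL can serve a whole class of stores, the sketch below processes any SQLite table using only the database catalog. The mapping chosen here (table name as concept, column names as predicates, cell values as signs) is a simplifying assumption for the example and is not the mapping described in [Yoakum 2008 IQIS]; the function etl_table is hypothetical.

```python
import sqlite3


def etl_table(conn, table):
    """Generic ETL for one relational table, driven only by the catalog.

    The table name is interpolated into SQL, so it is assumed to be trusted.
    """
    # Phase 1: read the data dictionary to obtain the concept and predicates.
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    concept, predicates = table, columns

    # Phase 2: turn each non-null cell into a sign and a statement.
    signs, statements = set(), []
    query = f'SELECT rowid, * FROM "{table}" ORDER BY rowid'
    for rowid, *values in conn.execute(query):
        for predicate, value in zip(columns, values):
            if value is None:
                continue
            signs.add(str(value))
            statements.append(((concept, rowid), predicate, str(value)))
    return concept, predicates, signs, statements
```

Nothing in the function depends on a particular schema, which is the point of the passage above: the code captures the explicit semantics of the relational model itself rather than those of any one database.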
Interactive
Finally, consider an interactive user interface that enables an analyst to assert semantic and association facts directly into the DDF. The analyst will have the option to use existing concepts, predicates, terms, and statements or to create new ones. In the case of the latter, recorded and asserted mentions will reference the source analyst. Metadata recorded for manual processes will also reference the source analyst.