Using the Collection pattern
DDI-Views introduces a generic Collection pattern that can be used to model different types of groupings, from simple unordered sets to all sorts of hierarchies, nesting and ordered sets/bags.
A collection is a container, which could be either a set (i.e. unique elements) or a bag (i.e. repeated elements), of Members. Collections can also be extended with richer semantics (e.g. generic, partitive, and instance, among others) to support a variety of DDI 3.x and GSIM structures, such as Node Sets, Schemes, Groups, sequences of Process Steps, etc. Collections together with their related classes provide an abstraction to capture commonalities among a variety of seemingly disparate structures.
A Collection consists of zero, one or more Members (Figure 1). A Member could potentially belong to multiple Collections. A Collection is also a Member, which allows for nesting of Collections in complex structures. Members have to belong to some Collection, except in the case of nested Collections where the top level Collection is a Member that doesn’t belong to any Collection.
Figure 1. Collections class

This pattern can be used via a special type of association called realizes. DDI-Views uses realizes to say that a class “behaves” like a Collection. For instance, consider a Set that consists of Elements, they implement the Collection pattern as follows: Set realizes Collection and Element realizes Member.
To realize this pattern all classes involved must be associated in a way that is compatible with the pattern.
As a rule of thumb, a more restrictive type of association than the one that appears in the pattern is compatible, a looser one is not. For instance, since the collection pattern has an aggregation association (denoted by the empty diamond), classes realizing the Collection pattern need to be related by either an aggregation or a composition, nothing else. In addition, source and target, when applicable, must also match, e.g. the diamond of the aggregation/composition needs to be on the class realizing Collection, not Member.
Similar compatibility rules apply to cardinality. Furthermore, all associations must be realized, with the exception of IsA associations, which are usually part of the pattern definition and do not apply to individual realizations in the same way. Renaming associations does not affect compatibility as long as the documentation clearly explains how they map to the association in the pattern.
For instance, consider Figure 2.. In this example, a Set class is defined as being composed of at least one Element, i.e. no empty Sets are allowed. In addition, an Element always belong to one and only one Set, which means that deleting the Set will also delete its Elements. Such an association is compatible with the contains association in the Collection pattern and thus Set and Element can realize Collection and Member, respectively. In contrast, Schema and XML Instance cannot realize the pattern: the association is neither an aggregation nor a composition, Schema is not a grouping of XML Instances, and the association points from XML Instance to Schema. None of this is compatible with the Collection pattern, in particular with the semantics of the contains association between Collection and Member.
Figure 2. Compatibility with Collections class

Collections can be structured with Binary Relations, (Figure 3) which are sets of pairs of Members in a Collection. Binary Relations can have different properties, e.g. totality, reflexivity, symmetry, and transitivity, all of which can be useful for reasoning.
Figure 3. Binary Relations

A Binary Relation is said to be symmetric if for any pair of Members a, b in the associated Collection, whenever a is related to b then also b is related to a. Based on this property we define two specializations of Binary Relation: Symmetric Binary Relation, when the property is true, and Asymmetric Binary Relation, when the property is false. Symmetric Binary Relations can be viewed as collections of Unordered Pairs themselves, whereas Asymmetric Binary Relations can be viewed as collections of Ordered Pairs. However, for simplicity, we do not model Relations themselves with the Collection pattern.
We can further classify Binary Relations based on additional properties. We say that a Binary Relation is total if all Members of the associated Collection are related to each other. We call it reflexive if all Members of the associated Collection are related to themselves. Finally, we say it is transitive if for any Members a, b, c in the associated Collection, whenever a is related to b and b is related to c then a is also related to c. [Refer to Dan’s document on Relations for more details.]
These properties can be combined to define subtypes of Binary Relations, e.g. Equivalence Relation, Order Relation, Strict Order Relation, Immediate Precedence Relation, and Acyclic Precedence Relation, among others. Equivalence Relations are useful to define partitions and equivalence classes (e.g. Levels in a Classification). Order Relations can be used to represent lattices (e.g. class hierarchies, partitive relationships), Immediate Precedence Relations can define sequences and trees (e.g. linear orderings, parent-child structures) and Acyclic Precedence Relation can represent directed acyclic graphs (e.g. molecular interactions, geospatial relationships between regions).
Figure 4. Binary Relations specialization

These subtypes can also have various semantics, e.g. Part-Of and Subtype-Of for Order Relations, to support a variety of use cases and structures, such as Node Sets, Schemes, Groups, sequences of Process Steps, etc. Note that some of them include temporal semantics, e.g. Strict Order Relation and Acyclic Precedence Relation.
A modeller can use the different semantics types as a guide when trying to decide what type of Binary Relation to realize. For instance, if the new class to be added to the model is a Node Set containing Nodes that will be organized in a parent-child hierarchy, the modeller can define a Node Hierarchy class with PARENT_OF semantics to structure the Node Set. The type of Binary Relation to realize then is Immediate Precedence Relation because it is the one that has the required semantics in its Semantics Type.
Alternatively, a modeller familiar with the definitions of the Binary Relation properties, i.e. symmetry, reflexivity and transitivity, could make the choice based on what combination represents the type they are looking for. For instance, a parent-child hierarchy requires the Binary Relation to be ANTI_SIMMETRIC (if a Node is the parent of another, the latter is not the parent of the former), ANTI_REFLEXIVE (a Node cannot be a parent of itself) and ANTI_TRANSITIVE (a Node is not the parent of its children’s children). It is easy to see that the only one that satisfies that criteria is the Immediate Precedence Relation.
Figure 5 shows an example of the realization of the pattern. We can model Node Hierarchy and Node Hierarchy Pair classes as realizations of Immediate Precedence Relation and Ordered Pair, respectively.
Figure 5. Node Set

Let us illustrate how this model works with a simple instance. Consider a geographical Statistical Classification with Classification Items (Figure 6) representing Canada, its provinces and cities.
Figure 6. Node Set example

Since Statistical Classifications are Node Sets and Classification Items are Nodes, we can view Classification Items such as Canada, Ontario, Quebec, Toronto, etc. as Members in a Collection structured by a Node Hierarchy Relation. Node Hierarchy Pairs represent the parent-child relationships in the Node Hierarchy Relation. For instance, (Canada, Ontario) is a Node Hierarchy Pair in which Canada is the parent and Ontario is the child. Other Node Hierarchy Pairs are (Canada, Quebec) and (Canada, Toronto).
Note that by maintaining the hierarchy in a separate structure, i.e. the Node Hierarchy, Items can be reused in multiple Classifications. For instance, in another geography Statistical Classification provinces grouped into regions, Ontario can be made the child of the Central Region instead of Canada without changing the definition of the Classification Items involved, i.e. Canada, Ontario and Central Region in this case.
Interestingly enough, Binary Relations might not be enough for some purposes. First of all, some structures cannot be reduced to binary representations, e.g. hypergraphs. In addition, a Binary Relation could be too verbose in some cases since the same Member in a Collection could appear multiple times in different pairs, e.g. one-to-many relationships like parent-child and ancestor-descendent. An n-ary Relation provides a more compact representation for such cases. Like Binary Relations they come in two flavours: Symmetric Relation and Asymmetric Relation. Let us consider the asymmetric case (Figure 7) first.
Figure 7. Asymmetric relations

Asymmetric Relations provide an equivalent, yet more compact, n-ary representation for multiple Ordered Pairs that share the same source and/or target Members. In addition, they can be used to model Ordered Correspondences to map Collections and their Members based on some criterion (e.g. similarity, provenance, etc.). Ordered Collection and Member Correspondences realize Asymmetric Relation and Ordered Tuple, respectively.
Consider now a geography classification tree like the Canadian example (Figure 6) above. All Node Hierarchy Pairs that have the same parent Member could be represented with a single Node Hierarchy Tuple that realizes the Ordered Tuple in the model. The realization will also rename source as parent and target as child.
Although only two cities are shown in the example, Ontario has hundreds of them. Using a Node Hierarchy realizing and Asymmetric Relation, all pairs that have Ontario as parent, e.g. (Ontario, Toronto), (Ontario, Ottawa), (Ontario, Kingston), etc., could be joined into a single n-ary Node Hierarchy Tuple with Ontario as parent and Toronto, Ottawa, Kingston, etc. as children. With this representation we replaced multiple pairs with a single tuple and avoid the repetition of Ontario hundreds of times for each individual pair.
Symmetric Relations are similarly structured as Asymmetric Relations (Figure 8). They provide an equivalent, yet more compact, n-ary representation for multiple Unordered Pairs that have some Members in common. In addition, they can be used to model (unordered) Correspondences between Collections and Members.
Figure 8. Symmetric relations

Back to our geography Statistical Classification example (Figure 6), we can have two variants with similar structure as shown in Figure 9.
Figure 9. Example showing both Asymmetric and Symmetric relations

Using the Process pattern
The process pattern consists of classes to describe business functions and workflows that can be mapped to process execution languages like BPEL or BPMN.
A Process is a Sequence of Process Steps that perform one or more business functions. Each Process Step can be performed by a Service. There are two types of Process Steps: Acts and Control Constructs. Acts represent actions and are atomic Process Steps, i.e. they cannot be composed of other Process Steps. An Act is similar to an instruction in a programming language and a terminal in the production rules of a formal grammar. A Control Construct describes logical flows between Process Steps.
Figure 1. Process Step class

Process Steps can be nested and thus describe processes at multiple levels of detail. The nesting of Process Steps always terminate in an Act. All nesting of Process Steps occur via Sequences, a specialization of Control Constructs.
Control Constructs include Sequence and Conditional Control Construct (Figure 2). The former models linear execution of Process Steps whereas the latter includes three types of iterative constructs: repeatWhile, repeatUntil and Loop. The Sequence at the end of the contains association represents the body of the Conditional Control Construct, which is executed depending on the result of the condition evaluation. The specialized sub-classes determine whether the Sequence is executed in each iteration before the condition is evaluated (RepeatUntil), or after (RepeatWhile, Loop). The Loop also provides a counter with initialValue and stepValue that can be used to specify in the condition how many times the Sequence in the body is executed.
Figure 2. Control Construct class

In addition to the iterative constructs, Conditional Control Constructs includes IfThenElse, which provides a means to specify branching control flows. It contains the condition inherited from the parent class and two associations: contains (also inherited from the parent class), to the Sequence of Process Steps that is executed when the condition is true, and containsElse, to an optional Sequence to be executed if the condition is evaluated to false. Optionally, IfThenElse can also have an associated ElseIf construct to model switch statements.
It is important to note that the model also covers parallel processing: Conditional Control Constructs can contain multiple Sequences that could be executed in parallel (more on this later).
A Sequence can be viewed as a Collection whose Members are Process Steps that can be ordered in three different ways: with a Sequence Order (traditional design-time total ordering), with one or more Temporal Interval Relations (design-time temporal constraint or “fuzzy” ordering), or with a Rule (constructor) to determine ordering at run-time.
We discuss Sequence Order first. A Sequence Order realizes Collection pattern as follows (Figure 3).
Figure 3. Sequence Order

Sequence Order and Sequence Order Pair realize Strict Order Relation and Ordered Pair, respectively. Let us remember from the Binary Relations Specialization diagram [hyperlink to Collections Figure ???] that Strict Order Relation is an Asymmetric Binary Relation that is ANTI_SYMMETRIC, ANTI_REFLEXIVE and TRANSITIVE. In addition, the Sequence Order realization has totality=TOTAL and semantics=SUCCESSOR_OF, which means that it can specify a total order of the Process Steps in the Sequence where the order semantics is given by SUCCESOR_OF.
We discuss next Temporal Interval Relations. They provide a mechanism for capturing Allen’s interval relations, one of the best established formalisms for temporal reasoning. Temporal Interval Relations can be used to define temporal constraints between pairs of Process Steps, e.g. whether the execution of two Process Steps can overlap in time or not, or one has to finish before the other one starts, etc. This also supports parallel processing.
There are thirteen Temporal Interval Relations: twelve asymmetric ones, i.e. precedes, meets, overlaps, finishes, contains, starts and their converses, plus equals, which is the only one that has no converse, or rather, it is the same as its converse. Together these relations are distinct (any pair of definite intervals are described by one and only one of the relations), exhaustive (any pair of definite intervals are described by one of the relations), and qualitative (no numeric time spans are considered).
Following Allen’s, Temporal Interval Relations are defined as follows.
Figure 4. Allen’s Temporal Interval Relations

In DDI-Views, each one of the asymmetric Allen’s interval relations is a Temporal Interval Relation that realizes different Binary Relations with specific temporal semantics. All asymmetric Temporal Interval Relation contains Ordered Interval Pairs whereas the only symmetric one, i.e. Equals, contains Unordered Interval Pairs (Figure 5). For instance, the Precedes Interval Relation realizes the pattern as follows.
Figure 5. Precedes Interval Relation

Precedes Interval Relation and Ordered Interval Pair realize Strict Order Relation and Ordered Pair, respectively. If we look back to the Binary Relations Specialization diagram in the previous section we notice that Strict Order Relation has the TEMPORAL_PRECEDES semantics, among others, which means that the Process Step at the end of the source association in the Ordered Interval Pair has to finish before the one at the end of target starts.
The Equals Interval Relation (Figure 6) is a slightly different case because it is an equivalence relation rather than an asymmetric one and therefore it contains Unordered Interval Pairs.
Figure 6. Equals Interval Relation

Equals Interval Relation and Unordered Interval Pair realize Equivalence Relation and Unordered Pair, respectively. Equivalence Relation is simply a Symmetric Binary Relation that is REFLEXIVE and TRANSITIVE with some additional semantics, among which we find TEMPORAL_EQUALS, the one required by the Equals Interval Relation. This means that the execution of the two Process Steps at the end of the maps association in Unordered Interval Pair begin and end at the same time.
Temporal Interval Relations and Sequence Orders can be combined in the same Sequence. For instance, consider Figure 7:
Figure 7. Combination of Interval Relations

Question Q2 requires the answer of question Q1 so it has to be executed after Q1. That is an example of the traditional sequence ordering given by the Sequence Order Relation. However, note that there is no dependency between Q2 and Q3 since both require only the answer of Q1. Therefore Q2 and Q3 could be executed at the same time, which can be expressed with the Equals Interval Relation.
The Variable Cascade
The DDI-Lifecycle standard is intended to address the metadata needs for the entire survey lifecycle. This particular document is dedicated to a description of variables as part of the DDI-Lifecycle. It contains a UML (Unified Modeling Language) class diagram of a model for describing variables, and the model is part of the larger development effort called DDI Moving Forward to express DDI-Lifecycle using UML.
Typical models for describing variables take advantage of much reuse, and the model provided here is no exception. It is reuse that makes metadata management such an effective approach. However, effectiveness is due to other factors as well, and an important one is to keep the number of objects described to a minimum. For finding relevant data is much more complicated as the number of objects rises.
For variables, reusable descriptions are brittle in the sense that if one of the related records to a variable changes, then a new variable needs to be defined. This is especially true when considering the allowed values (the Value Domain) for a variable. Many small variations in value domains exist in production databases, yet these differences are often gratuitous (e.g., simple differences in the way some category is described that do not alter the meaning), differences in representation (e.g., simple changes from letter codes to numeric ones), or differences in the way missing (or sentinel) values are represented.
Gratuitous differences in expressions of meaning are reduced or eliminated by encouraging the usage of URIs (Uniform Resource Identifiers) to point to definition entries in taxonomies of terms and concepts. The principle of “write once – use many”, very similar to the idea of reuse, is employed. Pointing to an entry rather than writing its value eliminates transcription errors and simple expression differences, promotes comparability, and ensures interoperability.
Differences in representations, including codes, are simplified by separating them from the underlying meaning. This is equivalent to the terminological issue of allowing for synonyms and homonyms of terms. Through reuse, all representations with the same meaning are linked to the same concept. This is achieved through the use of Value Domains and Conceptual Domains in the model presented here.
Sentinel values are important for any processing of statistical or experimental data, as there are multiple reasons some data are not obtainable. Typically, these values are added to the value domain for a variable. However, each time in the processing cascade the list of sentinel values changes, the value domain changes, which forces the variable to change as well. Given that each stage of the processing cascade make require a different set of sentinel values due to processing requirements, the number of variables mushrooms. And this variable proliferation is unmanageable and unsustainable.
The model developed here is based on two important standards for the statistical community, GSIM (Generic Statistical Information Model) and ISO/IEC 11179 (Metadata Registries). In fact, the model in the conceptual section of GSIM is based closely on the metamodel defined in ISO/IEC 11179. Both models help to perpetuate the problems described here, if each standard is followed in a conforming way. They reduce redundancy by separating value domains and conceptual domains. However, they do not directly support the use of URIs and they do not separate value domains from sentinel value lists. Also, they do not fully exploit the traceability that inheres in certain relationships between the value and conceptual domains to create a continuum of connected variables that further reduces redundancy.
The purpose of this document is to present a model that significantly reduces the overhead described above. In particular, we separate the sentinel values from the substantive ones. This separation allows us to greatly reduce the number of value domains, and thus variables, that need to be maintained. With this separation, now there are three classes of connected variables in which the represented variable specializes a conceptual variable by adding a value domain, and the instance variable specializes the represented variable by adding a sentinel value domain.
Figure 1. Variable Cascade

Example
Detergents
Imagine we are assessing environmental influences at the household level. One question we might ask is “What detergents are used in the home?” In connection with this question we prepared show cards. Each show card lists a series of detergents. There are multiple show cards because products vary by region. Each show card corresponds to a code list. There are overlaps among the code lists but there are differences too.
In our model there is a conceptual variable. It has an enumerated conceptual domain that takes its categories from a category set. Here the category set is an unduplicated list of detergents taken from all the show cards put together. The conceptual domain might be exposure to chemicals in household detergents. The unit type might be households.
In our survey we ask a question corresponding to multiple represented variables, one corresponding to each show card. Each represented variable is measured by an enumerated value domain that takes its values from the show card code list.
All the represented variables here are derived from the same conceptual variable. This is the main point of the example: a conceptual variable under the right conditions can connect multiple represented variables.
Sentinel Values
The represented variable code list in our model excludes sentinel values. Sentinel values were introduced into ISO/IEC 11404 in 2007. ISO/IEC 11404 describes a set of language independent datatypes and defines a sentinel value as follows: a sentinel value is a “signaling” value such as nil, NaN (not a number), +inf and –inf (infinities), and so on. Depending on the study, sentinel values may include missing values. ISO 21090 is a health datatypes standard based on ISO/IEC 11404. ISO 21090 identifies 15 “nullflavors” that correspond to the concept of sentinel values introduced in ISO 11404 .
In our model the instance variable adds a sentinel value domain to the represented variable from which it is derived. In the process it grows the code list it derived from the represented variable to include a set of sentinel values. These sentinel values reference a category set of sentinels called the sentinel conceptual domain. The sentinel values included in any instance variable need not cover all the members of the sentinel conceptual domain. Instead they may refer just to a subset.
During data acquisition, we ask a question that allows don’t know and refused, which an interviewer may ask or not, depending on the skip logic. We create an instance variable corresponding to this question based on a represented variable. The instance variable includes sentinel values. Note that the value domain of the represented variable need not be an enumerated value domain. Instead it can be a described value domain. We choose to include three sentinel values corresponding to three ISO 21090 nullflavors:
Nullflavor |
Description |
UNK |
A proper value is applicable, but not known. Corresponds to Refused. |
ASKU |
Information was sought but not found Corresponds to Don’t Know. |
NA |
No proper value is applicable in this context (e.g., last menstrual period for a male) |
Subsequently, we prepare the collected data for processing by SAS. SAS has its own set of sentinel values. For each sentinel category the data acquisition instance variable accounts for, SAS has its own set of sentinel values. As a consequence, the answers that correspond to sentinel values are different, and in the process of constructing a SAS file, a second instance variable is created for the purposes of data processing.
However, both the data acquisition instance variable and the data processing instance variable are derived from the same represented variable. That is the point of this example.