DIMS Project Final Report v 1.0.1

This document (version 1.0.1) is the Final Report for the Distributed Incident Management System (DIMS) project, DHS Contract HSHQDC-13-C-B0013.

Introduction

This document presents the Final Report for the Distributed Incident Management System (DIMS) Project.

Project Overview

The DIMS Project was intended to build upon a long history of DHS-sponsored efforts to support security information event sharing among State, Local, Territorial, and Tribal (SLTT) government entities in the Pacific Northwest, known as the Public Regional Information Security Event Management (PRISEM) project. The PI served as the Director of Research and Development for the PRISEM Project and managed its networking and hardware at the University of Washington from the project’s inception in 2008.

For more background on the relationship between the PRISEM and DIMS projects, see DIMS Operational Concept Description v 2.9.0, DIMS System Requirements v 2.9.0, and DIMS Commercialization and Open Source Licensing Plan v 1.7.0.

The period of performance for the DIMS Project ran from August 16, 2014 to August 15, 2017. Over the period of performance, the people listed in Table Project Participants were involved in the project:

Project Participants
Name | Organization(s) | Role(s)
David Dittrich | Applied Physics Laboratory, University of Washington; Center for Data Science, University of Washington Tacoma | Principal Investigator
Linda Parsons | Next Century | Subcontract programmer
Scott Warner | Next Century | Subcontract programmer
Mickey Ross | Ross Consulting Services | Subcontract program management, administrative services
Eliot Lim | Applied Physics Laboratory, University of Washington; Center for Data Science, University of Washington Tacoma | System administration, hardware support
Stuart Maclean | Applied Physics Laboratory, University of Washington | Programmer
Megan Boggess | University of Washington Tacoma; Center for Data Science, University of Washington Tacoma | Graduate Student RA; programming, system administration
Jeremy Parks | Center for Data Science, University of Washington Tacoma | Programming, system administration
Jeremy Johnson | Critical Informatics | Subcontract programming, system administration
Katherine Carpenter | Critical Informatics | Subcontract program management, administrative services

The Value Proposition

This section discusses the value proposition for the products of the DIMS project.

The Need

You can’t have good system security without good system administration. Organizations need strong system administration skills in order to have a secure foundation for their operations. The finding in Verizon’s DBIR that roughly one third of attacks are due to mistakes and misconfigurations reflects a painful reality. And 100% of those breaches occurred in companies that employ humans.

Of course, all humans make mistakes or miss things. Or they may not know better when just trying to figure out how to get their job done, blindly following someone else’s lead and opening themselves and their organization up to a major security hole (as seen in Fig. 1 from Don’t Pipe to your Shell).

Fig. 1: Piping insecure content directly into a privileged shell
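
The anti-pattern shown in Fig. 1, and a more cautious alternative, can be sketched as follows (the URL and file names are illustrative only):

 # Anti-pattern: pipe an unreviewed installer straight into a privileged shell
 curl -sSL https://example.com/install.sh | sudo bash

 # More cautious alternative: download, inspect, verify, then run
 curl -sSLo install.sh https://example.com/install.sh
 less install.sh                 # read what it actually does
 sha256sum install.sh            # compare against a published checksum, if one exists
 sudo bash install.sh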

Mistakes are easier to make in situations where it is difficult to see what is going on, or where someone is forced to deal with something new that they have never dealt with before and in which they have little expertise. Paul Vixie has described the pain (in terms of operations cost and impact on security posture) that results from complexity in today’s distributed systems and security products. [Vix16]

Increased complexity without corresponding increases in understanding would be a net loss to a buyer. […]

The TCO of new technology products and services, including security-related products and services, should be fudge-factored by at least 3X to account for the cost of reduced understanding. That extra 2X is a source of new spending: on training, on auditing, on staff growth and retention, on in-house integration.

As knowledge and experience increase, the quality of work output increases and the errors and omissions decrease. Finding and procuring the talent necessary to operate at the highest level, however, is neither easy, fast, nor cheap.

This all raises the question, “What can our organization do to bring the capabilities of multiple open source products into a functioning whole with the least amount of pain and the best operating security outcome?”

Our Approach

Our approach is to provide a reference model for establishing a secure and maintainable distributed open source platform that enables secure software development and secure system operations. The DIMS team (now implementing the third iteration of some of the core elements) has already experienced the pain of this process, which reduces the cost for those who adopt our methodology.

The DIMS project brings together multiple free/libre open source software (FOSS) tools in a reference model designed to be built securely from the ground up. The two primary outcomes of this effort are:

  1. An example platform for building a complex integrated open source system for computer security incident response released as open source software and documentation. These products provide a working and documented model platform (or DevOps infrastructure) that can facilitate the secure integration of open source components that (in and of themselves) are often hard to deploy, and often are so insecurely implemented that they are effectively wide open to the internet. This not only solves some of the infrastructure problems alluded to by the Linux Foundation, but also addresses Vixie’s example of supporting organizations wanting to use open source security tools in concert to address their trusted information sharing and security operations needs.
  2. Transitioning this platform into the public sector to support operational needs of SLTT government entities. DIMS project software products were included in a draft proposal for evaluation by the PISCES-NW not-for-profit organization for use in the Pacific Northwest. The last modification to the DIMS contract includes a pilot deployment for use by the United States Secret Service for their Electronic Crimes Task Force (ECTF) membership.

The DIMS System Requirements v 2.9.0 documents security practices and features that we have incorporated to the greatest extent possible, in a way that can be improved over time in a modular manner. The system automation and continuous integration/continuous deployment (CI/CD) features help in implementing and maintaining a secure system. (Red team application penetration testing will further improve the security of the system through feedback about weaknesses and deficiencies that crept in during development and deployment.)

Golden nugget

Over two decades of system administration and security operations experience underlies the architectural model that we have been researching, developing, implementing, and documenting. The barrier to entry is the amount of time and learning necessary to acquire this same expertise in order to be competitive.

Benefits per Cost

The value of the DIMS products and methodology comes from altering the cost equation described by Vixie, which can be expressed this way:

Figure: Vixie’s cost equation (TCO fudge-factored by at least 3X to account for the cost of reduced understanding)
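
The original figure is not reproduced here; expressed in the terms of Vixie’s quote above (an assumption about the figure’s exact form), the relationship is roughly:

 \text{TCO}_{\text{effective}} \approx 3 \times \text{TCO}_{\text{product}}
   = \text{TCO}_{\text{product}} + \underbrace{2 \times \text{TCO}_{\text{product}}}_{\text{training, auditing, staff growth and retention, in-house integration}}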

The benefit to customers is maximized by the ability to construct and operate a secure incident response monitoring platform and to expand it with additional open source tools as needed, saving a large part of the 2X multiplier in system administration and operations overhead cited by Vixie. We enable this by providing a less complex, more transparent, source-controlled, and easier to secure open source platform than would otherwise be produced by someone integrating multiple unfamiliar open source security tools from scratch. Standing up a new server and adding new services can thus be reduced from hours or days per system to just a few minutes of effort. If that task has to be repeated dozens (or possibly hundreds) of times, the cost savings can be significant.

The DIMS team created and used a CI/CD model using Git, Jenkins CI, and Ansible for taking software source code, system automation instructions, software configuration, and documentation, and building a prototype for an open source software integration project. The resulting product can be used by an internal security operations group (or a managed security service provider) to create an open source incident response capability. It also provides many of the elements called for in the CII Badge Program and in the GitHub Security and Heroku Security policies.

Note

To see more detail about the full set of tools, techniques, and tasks that DIMS team members were expected to know or learn, see DIMS Job Descriptions v 2.9.1.

The impact of the effort expended in this project goes beyond implementing one set of open source service components for a single group. This model can be replicated widely and improved upon by others faced with the same set of challenges in developing an affordable and scalable incident response capability.

Note

Over the course of the project, we have learned of several other efforts to address a similar set of goals and have reached out (as time permitted) to find common ground and try to develop collaborative relationships that will have broad impact over time. This is expanded upon in Section Commercialization Plan.

Competition and Alternatives

The common way that organizations go about implementing open source products is by following whatever installation instructions may be provided by the authors. Avoiding the security problems illustrated by Fig. 1 involves searching the Internet to (hopefully) find some thread like Alternatives to piping the install script into your shell. #90 (from GitHub fisherman/fisherman, a “plugin manager for Fish,” and no, we haven’t heard of it before either.)

When it comes to the more difficult task of integrating multiple open source products into a functional distributed system, the research required to debug and solve a seemingly endless series of installation, configuration, and tuning problems grows accordingly.

Open Source Security Toolsets

Some of the open source security tools that an incident response team would want to consider implementing are covered in the following subsections.

Each of these systems is composed of several existing open source tools, combined with new open source scaffolding, glue code, custom interfaces, and the additional functionality necessary to achieve the resulting distributed system.

At the same time, each of these distributed open source systems relies upon its own chosen base operating system, libraries and languages, and subordinate services (e.g., database, email transport agent, message bus, job scheduling). All too frequently, the choices made by each group are mutually exclusive, or are left to the customer to work out on their own.

Note

To underscore Vixie’s observation about complexity and the cost of implementation: Ubuntu 14.04 and Debian 7 differ in how common services are configured, requiring debugging and custom configuration steps that vary between distributions, while the use of systemd for managing service daemons in Ubuntu 16.04 and Debian 8 is a major impediment to migrating installation of all required components of these multi-service systems from Ubuntu 14.04 and Debian 7. Adding RedHat Enterprise Linux, CentOS, or Fedora (all part of the same RedHat family) to the mix adds further complexity, which is a major reason why containerization is gaining popularity as a mechanism for isolating these dependency differences in a more manageable (but arguably less secure) fashion.
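
As a concrete illustration (a minimal sketch only; the service name and paths are hypothetical), deployment scripting that must span these distributions ends up branching on the init system in use:

 # Restart a service across systemd, upstart, and SysV init hosts
 if command -v systemctl >/dev/null 2>&1 && [ -d /run/systemd/system ]; then
     sudo systemctl restart myservice       # systemd (Ubuntu 16.04, Debian 8)
 elif [ -f /etc/init/myservice.conf ]; then
     sudo service myservice restart         # upstart (Ubuntu 14.04)
 else
     sudo /etc/init.d/myservice restart     # SysV init scripts (Debian 7)
 fi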

The Trident portal

The Trident portal is written in Go. Only Debian 7 (wheezy) is supported at this time, though Ubuntu 14.04 is on the list of future operating systems. Trident relies on PostgreSQL for its database, NGINX for its web front end, and Postfix for email transport.

The Collective Intelligence Framework (CIF)

The Collective Intelligence Framework (CIF) is the primary offering from the CSIRT Gadgets Foundation. CIF is only supported on Ubuntu Linux. It is written in Perl and uses PostgreSQL, Apache2, BIND, Elasticsearch, and ZeroMQ, and can support Kibana as an alternative interface to the indexed data in Elasticsearch.

A monolithic EasyButton installation script is available in the PlatformUbuntu section of the CIF wiki to automate the installation steps.

The Mozilla Defense Platform (MozDef)

The Mozilla Defense Platform (MozDef) was developed by Mozilla to replace a commercial SIEM product with open source alternatives. They report processing over 300 Million records per day with their internal deployment.

MozDef uses Ubuntu 14.04 as the base operating system. Its front-end user interface components are written in JavaScript using Meteor, Node.js, and d3, and its back-end data processing scripts are written in Python using uWSGI and bottle.py, with MongoDB as the database, RabbitMQ as the message bus, and NGINX as the web application front end.

For installation, there is a demonstration Dockerfile for creating a monolithic Docker image with all of the MozDef components in it. (This is not the way Docker containers are intended to implement scalable microservices, but it does provide a very easy way to see a demonstration instance of MozDef). The manual instructions are more elaborate and must be followed carefully (including considering the admonitions related to security, e.g., “Configure your security group to open the ports you need. Keep in mind that it’s probably a bad idea to have a public facing elasticsearch.”)

GRR Rapid Response

Another example of a system made up of multiple components, packaged together into a single easy-to-install package, is the GRR Rapid Response system, a “forensic framework focused on scalability enabling powerful analysis.”

GRR runs on Ubuntu 16.04. To ease installation of the server components, the GRR team, like the CIF and MozDef teams, provides both a monolithic installation script for a VM installation and a Dockerfile for running in a container. They also have packages for installing the client components on Windows, OS X, and Linux.

Attention

The GRR team chose to move to systemd, rather than continue to support the older upstart, init.d, or supervisord service daemon systems that are used by other products described in this section. This means you must support three (or four) different service daemon management mechanisms in order to incorporate all of the tools described here into a single integrated deployment.

GRR’s documentation similarly includes admonitions about security and functionality that is left to the customer to implement. Take Fig. 2, a question from their FAQ, as an example:

Fig. 2: Question about the logout button from the GRR FAQ

Integrated Open Source Solutions

The DIMS project began in Q4 2013. In the second half of 2015 two very similar efforts were identified that use some of the same tools for the same reasons. Both validate the model being established by DIMS and the value proposition for adopters.

Summit Route Iterative Defense Architecture

An organization named Summit Route has described what they call the Iterative Defense Architecture (see Fig. 3) that is very similar in form and content to what the DIMS project has focused on producing.

Fig. 3: Summit Route Iterative Defense Architecture

OpenCredo

A consultancy in the United Kingdom named OpenCredo is also working on an architecture similar to the DIMS project (see Fig. 4). Some of the specific components differ, but conceptually it is the same and would meet the same foundation requirements (minus the dashboard, portal, etc.) specified in DIMS System Requirements v 2.9.0.

Fig. 4: OpenCredo core building blocks

The remainder of this report is divided into the following sections:

  • Section Referenced Documents summarizes referenced documents (with links to those available online for convenience).
  • Section Outcomes covers the value, expected outcomes, impacts, products, problems to be solved by, and benefits of this project.
  • Section Challenges Encountered covers some of the technical challenges that were encountered over the course of the project.
  • Section Needed Enhancements discusses needed enhancements and directions that follow-on projects could take, building from the state of released code, configuration, and documentation products.
  • Section Recommendations for Follow-on Projects includes recommendations by the PI for consideration in planning follow-on projects, whether they use DIMS products or not, intended to help reduce friction in the software development process.
  • Section License includes the open source software license under which DIMS products are to be released.

Note

Some of the content of this report comes from other previously delivered project documents, or references found in working documents and/or the PI’s (Dave Dittrich (@davedittrich)) home page (which served as a general project reference on a number of topics).

Referenced Documents

  1. HSHQDC-13-C-B0013, “From Local to Global Awareness: A Distributed Incident Management System,” Draft contract, Section C - Statement of Work
  2. DIMS Ansible playbooks v 2.13.1
  3. DIMS Operational Concept Description v 2.9.0
  4. DIMS System Requirements v 2.9.0
  5. DIMS Job Descriptions v 2.9.1
  6. DIMS Commercialization and Open Source Licensing Plan v 1.7.0
  7. https://staff.washington.edu/dittrich/home/

Outcomes

Summary of Project Outcomes

As described in DIMS Operational Concept Description v 2.9.0, the DIMS project started out with two primary expected outcomes: (1) an example platform for building a complex integrated open source system for computer security incident response, and (2) to transition this platform into the public sector to support operational needs of SLTT government entities. The latest modification to the contract includes a pilot deployment for use by the United States Secret Service in addition to the open source release of source code and documentation.

This project successfully implemented a prototype for a deployable open source distributed system. As described in this section and Section Introduction, other projects have contemporaneously pursued a similar goal of producing a generally usable system comprised of open source components. The DIMS project includes some features not found in these other projects (e.g., the integrated bats tests, the breadth and depth of documentation, the features to support managing multiple simultaneous deployments with private configuration).

The outcome of the DIMS Project is by no means a production-ready commercially marketable system, but the open source products are competitive in many aspects with other projects created by larger teams of software engineering professionals at commercial software companies. Further refinement within an entity staffed and focused on bringing a product or service to the market could quickly get there, but what is public now is ahead of many of the open source code examples that one can find by searching the internet, and the list of resources in Section Reference Links far exceeds any similar collection that could be identified by the DIMS team. Integration of DIMS features with those of the other projects described herein to produce a full-featured and production capable system would be the ideal, though the source of funding for such an effort is unclear.

What follows are sections identifying some of the key high level achievements.

Ansible Playbooks

The most significant achievement of this project was the production and refinement of a set of Ansible playbooks, roles, and task files, capable of instantiating a small-scale distributed system comprised of Ubuntu 14, Ubuntu 16, Debian 8, and Container Linux by CoreOS systems, implementing an etcd, consul, and Docker Swarm cluster. These playbooks share many similarities with those of some other publicly available projects that were developed contemporaneously with the DIMS Project, including the OpenCredo and Summit Route Iterative Defense Architecture projects mentioned in Section Integrated Open Source Solutions, as well as the Fedora Project’s Ansible playbooks, Intel Corporation’s Trusted Analytics Platform, and DebOps.

Note

The PI reached out to each of the groups listed (with the exception of the DebOps group), and to two other projects using Ansible for multi-system deployment, to see if there was any possibility of collaborating or of taking development of the DIMS Ansible playbooks forward together. The majority of these inquiries resulted in no response at all (despite multiple attempts in some cases). One outreach resulted in a conversation that was a dead end, and another response suggested the chances of potential funding were low. Only one person engaged in multiple follow-on conversations, but no funding opportunities to support a collaboration could be identified.

Two other projects were identified in the final days of the DIMS Project while investigating options for using hashicorp/terraform and Digital Ocean (Cisco Systems’s Mantl and Mesosphere’s DC/OS). These projects appear to have significantly larger and more well-resourced development and marketing teams. The fact that so many similar projects exist does confirm the viability of the direction taken in the DIMS Project.

The Ansible playbooks created by the DIMS project differ from each of these other projects in several ways. One primary difference is the separation of customization and configuration from the playbooks repository to facilitate continued development and integration of new tools capable of being managed independently of the public ansible-dims-playbooks repository.

These playbooks include the following features:

  • Support for Ubuntu 14.04, Ubuntu 16.04, Debian 8, Container Linux by CoreOS, and partial support for Mac OS X (Darwin) and RedHat Enterprise Linux.
  • Installation and pre-configuration of Trident trust group management portal.
  • Integrated multi-level testing using bats.
  • Support for official SSL certificates using Letsencrypt and self-signed SSL certificates using https://github.com/uw-dims/ansible-role-ca
  • Support for automated backup and restoration of the Trident PostgreSQL database and Letsencrypt certificates (see the sketch following this list).
  • Support for version pinning of core subsystems across development and production hosts for improved distributed system stability.
  • Support for automated system-wide checks for availability of package updates, application of updates, and detection of required reboot, with option for email notification.
  • Support for isolated Vagrant Virtualbox virtual machines (including local copies of Ansible playbooks for testing branches and improved distributed system stability). This includes automated VM suspension upon host shutdown and multi-VM resumption afterward, using the dims.shutdown script.
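
The backup and restoration support noted in the list above boils down to capturing the Trident PostgreSQL database and the Letsencrypt certificate store. A minimal sketch of the underlying idea (database name, user, and file names are assumptions; the actual playbooks wrap this in their own roles and paths) is:

 # Back up the Trident database and the Letsencrypt store
 pg_dump -U trident -Fc trident > /var/backups/trident-$(date +%F).pgdump
 sudo tar czf /var/backups/letsencrypt-$(date +%F).tar.gz /etc/letsencrypt

 # Restore on a rebuilt host
 pg_restore -U trident -d trident --clean /var/backups/trident-2017-08-01.pgdump
 sudo tar xzf /var/backups/letsencrypt-2017-08-01.tar.gz -C /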

Trident Portal

The project began using the original Perl-based ops-trust portal system using a hosted portal that was pre-configured. An initial Ansible playbook to deploy a local instance was produced, but the team continued to use the hosted server. In the final year of the project, Ansible playbook support for the new Trident portal (re-written in Go, with both a command line and web application graphical interface) was finally added and the ability to replicate the Trident portal was achieved. Features to support customization of the portal’s graphical elements (banner image, default icon image for users who have not loaded their own photo, logo image, and CSS style sheet settings for font and web page colors) were added to support custom branding.

As mentioned in the previous section, along with the playbook for installing Trident the ability to backup and restore both the Trident database and the Letsencrypt SSL/TLS certificates was added. This allows easier development, testing, and training with the Trident portal by simplifying deployment of two portal servers at once (one for dev/test/training and the other for “production” use.) Combined with the re-written Jenkins build scripts, an improved mechanism for debugging and development of new Trident features is now possible. (Testing of these features with volunteers associated with the Trident portal in use by the ops-trust community is being discussed and will continue as an independent project after this project’s end date.)

Pilot Deployment

A deployment of the https://github.com/uw-dims/ansible-dims-playbooks code on a stand-alone baremetal server hosting two virtual machines running instances of the Trident portal, customized and branded specifically for the U.S. Secret Service Electronic Crimes Task Force (ECTF) following Customizing a Private Deployment, was produced for use in a pilot project. Included are a Training Manual (https://trident-training-manual.readthedocs.io) and User Manual (https://trident-user-manual.readthedocs.io) focused on the Trident portal.

Continuous Integration/Continuous Deployment

Very early on, the project team established a set of Git source repositories that were focused on discrete component services or functionality. Splitting things up into discrete and focused repositories was done to establish a model of modularity (to help make it easier to add new open source tools over time) and to allow independent open source release of repositories. In all, over 40 discrete repositories were created (some now deprecated, but the majority providing functioning components addressing all of the requirements listed in the contract and detailed in the DIMS System Requirements v 2.9.0 document).

Next, a Jenkins CI server was set up and tied to the Git repositories using Git post-commit hooks that trigger build jobs for source code and documentation. Some build jobs then, in turn, trigger deploy jobs that push the built products onto the systems that use them (see Continuous Integration for more detail on this process).
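
A minimal sketch of such a hook (the Jenkins URL, job name, and token are placeholders; the project’s actual hooks differ in detail) might look like:

 #!/bin/sh
 # .git/hooks/post-commit -- ask Jenkins to rebuild this repository's artifacts
 curl -fsS "https://jenkins.example.com/job/dims-docs-build/build?token=BUILD_TOKEN" \
   || echo "warning: could not trigger Jenkins build" >&2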

Throughout this entire workflow, log entries are generated and published (using a program called logmon) on an AMQP channel, where they can be monitored from the DIMS Dashboard, monitored from a terminal session using the same logmon program, or collected from the logging channel for indexed storage.

Install and Build Automation

System administrators are familiar with the steps of setting up a computer system, be it a server or a desktop development workstation: starting with an operating system installation ISO image, creating a bootable CD-ROM or USB drive, creating accounts for the system administrator and some users, selecting additional packages to install, and finally installing third-party open source tools as needed.

This is a relatively simple process, and works well if the number of servers and workstations is small, if the number of project members is small (and turnover in staff is low and the team does not grow), if the software being developed is limited in size and scope, and if things don’t change very quickly. Developers can even set up their own workstations and manage them.

Integrated Tests

One of the requirements of the project was testing and validation of the system components. A great deal of effort was spent in writing comprehensive test plans and in performing two system-wide tests. After the experience of producing these test plans and tests, a decision was made to integrate the simplest possible set of tests into the normal operation of the system. The Bats: Bash Automated Testing System was chosen for its simplicity. A structured mechanism for embedding tests into Ansible playbook roles was developed, along with a script to facilitate running tests named (not surprisingly) test.runner. This testing methodology is described in Section Testing System Components of DIMS Ansible playbooks v 2.13.1.

Successful test run from command line
 $ test.runner --level system --match pycharm
 [+] Running test system/pycharm
  ✓ [S][EV] Pycharm is not an installed apt package.
  ✓ [S][EV] Pycharm Community edition is installed in /opt
  ✓ [S][EV] "pycharm" is /opt/dims/bin/pycharm
  ✓ [S][EV] /opt/dims/bin/pycharm is a symbolic link to installed pycharm
  ✓ [S][EV] Pycharm Community installed version number is 2016.2.3

 5 tests, 0 failures
Failed unit test in Ansible playbook
 $ run.playbook --tags python-virtualenv
 . . .
 TASK [python-virtualenv : Run unit test for Python virtualenv] ****************
 Tuesday 01 August 2017  19:02:16 -0700 (0:02:06.294)       0:03:19.605 ********
 fatal: [dimsdemo1.devops.develop]: FAILED! => {
     "changed": true,
     "cmd": [
         "/opt/dims/bin/test.runner",
         "--tap",
         "--level",
         "unit",
         "--match",
         "python-virtualenv"
     ],
     "delta": "0:00:00.562965",
     "end": "2017-08-01 19:02:18.579603",
     "failed": true,
     "rc": 1,
     "start": "2017-08-01 19:02:18.016638"
 }

 STDOUT:

 # [+] Running test unit/python-virtualenv
 1..17
 ok 1 [S][EV] Directory /opt/dims/envs/dimsenv exists
 ok 2 [U][EV] Directory /opt/dims/envs/dimsenv is not empty
 ok 3 [U][EV] Directories /opt/dims/envs/dimsenv/{bin,lib,share} exist
 ok 4 [U][EV] Program /opt/dims/envs/dimsenv/bin/python exists
 ok 5 [U][EV] Program /opt/dims/envs/dimsenv/bin/pip exists
 ok 6 [U][EV] Program /opt/dims/envs/dimsenv/bin/easy_install exists
 ok 7 [U][EV] Program /opt/dims/envs/dimsenv/bin/wheel exists
 ok 8 [U][EV] Program /opt/dims/envs/dimsenv/bin/python-config exists
 ok 9 [U][EV] Program /opt/dims/bin/virtualenvwrapper.sh exists
 ok 10 [U][EV] Program /opt/dims/envs/dimsenv/bin/activate exists
 ok 11 [U][EV] Program /opt/dims/envs/dimsenv/bin/logmon exists
 not ok 12 [U][EV] Program /opt/dims/envs/dimsenv/bin/blueprint exists
 # (in test file unit/python-virtualenv.bats, line 54)
 #   `[[ -x /opt/dims/envs/dimsenv/bin/blueprint ]]' failed
 not ok 13 [U][EV] Program /opt/dims/envs/dimsenv/bin/dimscli exists
 # (in test file unit/python-virtualenv.bats, line 58)
 #   `[[ -x /opt/dims/envs/dimsenv/bin/dimscli ]]' failed
 not ok 14 [U][EV] Program /opt/dims/envs/dimsenv/bin/sphinx-autobuild exists
 # (in test file unit/python-virtualenv.bats, line 62)
 #   `[[ -x /opt/dims/envs/dimsenv/bin/sphinx-autobuild ]]' failed
 not ok 15 [U][EV] Program /opt/dims/envs/dimsenv/bin/ansible exists
 # (in test file unit/python-virtualenv.bats, line 66)
 #   `[[ -x /opt/dims/envs/dimsenv/bin/ansible ]]' failed
 not ok 16 [U][EV] /opt/dims/envs/dimsenv/bin/dimscli version is 0.26.0
 # (from function `assert' in file unit/helpers.bash, line 13,
 #  in test file unit/python-virtualenv.bats, line 71)
 #   `assert "dimscli 0.26.0" bash -c "/opt/dims/envs/dimsenv/bin/dimscli --version 2>&1"' failed with status 127
 not ok 17 [U][EV] /opt/dims/envs/dimsenv/bin/ansible version is 2.3.1.0
 # (from function `assert' in file unit/helpers.bash, line 18,
 #  in test file unit/python-virtualenv.bats, line 76)
 #   `assert "ansible 2.3.1.0" bash -c "/opt/dims/envs/dimsenv/bin/ansible --version 2>&1 | head -n1"' failed
 # expected: "ansible 2.3.1.0"
 # actual:   "bash: /opt/dims/envs/dimsenv/bin/ansible: No such file or directory"
 #

 PLAY RECAP ********************************************************************
 dimsdemo1.devops.develop   : ok=49   changed=7    unreachable=0    failed=1
 . . .
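
For reference, Bats tests like those exercised above are plain Bash files. The following is an illustrative sketch (not a file from the DIMS repositories) of a test in this style, runnable directly with bats or via test.runner:

 #!/usr/bin/env bats
 # unit/example.bats -- illustrative only

 @test "[U][EV] Program /opt/dims/bin/test.runner exists" {
   [[ -x /opt/dims/bin/test.runner ]]
 }

 @test "[U][EV] Directory /opt/dims/envs/dimsenv exists" {
   [[ -d /opt/dims/envs/dimsenv ]]
 }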

Python Virtualenv Encapsulation

A frequently experienced point of friction within the team had to do with differences in the tools being used by developers. One team member has git version 2.1 and the other has version 1.8 and can’t access the repo the night before a deadline. One person has the hub-flow tools and the other does not, but they also don’t know how to merge and push branches so their code is not available to the team. Someone installs a broken version of an internal tool and doesn’t realize it when they try to test someone else’s commits, so their test fails when it should succeed and nobody knows why it is happening.

As a means of isolating and encapsulating a Python based shell environment to facilitate development, testing, working on branches, and generally experimenting in a non-destructive manner, the use of a standardized Python virtual environment called dimsenv was implemented. This is a little heavier-weight use of the Python virtualenv mechanism, encapsulating more than just Python interpreter and pip installed packages.

The python-virtualenv role builds a specific version of Python, installs a specific set of version-pinned pip packages, and also adds a series of programs to the bin/ directory so as to ensure the full set of commands that have been documented in the dimsdevguide are available and at the same revision level.

This not only saves time in setting up a development environment, but makes it more consistent across systems and between development team members. Things like testing new versions of Ansible become trivial: you clone the dimsenv environment (which has all the development tools in it already), use workon to enable the new virtual environment, and pip install ansible==$DESIRED_VERSION, then run the playbooks you want to test. It is easy to switch back and forth, making it easier to migrate playbook development and debugging to the latest version of Ansible while still being able to fall back to the standard environment for a stable build. While this is an unconventional use of Python virtualenv, it works well and saves a great deal of time.
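
For example (a sketch using virtualenvwrapper commands; whether cpvirtualenv is the exact mechanism used on DIMS hosts is an assumption, and the cloned environment name and Ansible version are arbitrary):

 cpvirtualenv dimsenv dimsenv-ansible24   # copy the standard environment
 workon dimsenv-ansible24                 # switch into the copy
 pip install ansible==2.4.0.0             # try a newer Ansible (illustrative version)
 run.playbook --tags base                 # exercise playbooks against it
 workon dimsenv                           # fall back to the stable environment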

DIMS Dashboard

A functional dashboard web application was developed using distributed system features provided by several VM compute servers over AMQP, with single sign-on tied to Google authentication. This dashboard supported user stories defined in the dimssr and had built-in test capabilities. It was the most production-ready and well-engineered component of the system.

Figure: DIMS Dashboard web application

Ingest of STIX Documents

Java bindings for STIX were produced to facilitate ingest of STIX version 1.1 documents into the DIMS system. (The current release of STIX is now version 2.0.)

Software Products and Documentation

The following table provides links to public source code repositories and documentation.

Software Products and Documentation
Source repository Documentation
https://github.com/uw-dims/ansible-dims-playbooks https://ansible-dims-playbooks.readthedocs.io
https://github.com/uw-dims/device-files (No additional documentation)
https://github.com/uw-dims/dims-ad https://dims-ad.readthedocs.io
https://github.com/uw-dims/dims-adminguide https://dims-adminguide.readthedocs.io
https://github.com/uw-dims/dims-dashboard https://dims-dashboard.readthedocs.io
https://github.com/uw-dims/dims-devguide https://dims-devguide.readthedocs.io
https://github.com/uw-dims/dims-jds https://dims-jds.readthedocs.io
https://github.com/uw-dims/dims-ocd/ https://dims-ocd.readthedocs.io
https://github.com/uw-dims/dims-sr/ https://dims-sr.readthedocs.io
https://github.com/uw-dims/dims-swplan https://dims-swplan.readthedocs.io
https://github.com/uw-dims/dims-training-manual https://dims-training-manual.readthedocs.io
https://github.com/uw-dims/dims-tp/ https://dims-tp.readthedocs.io
https://github.com/uw-dims/dims-user-manual https://dims-user-manual.readthedocs.io
https://github.com/uw-dims/fuse4j (No additional documentation)
https://github.com/uw-dims/java-native-loader (No additional documentation)
https://github.com/uw-dims/stix-java (No additional documentation)
https://github.com/uw-dims/trident-training-manual https://trident-training-manual.readthedocs.io
https://github.com/uw-dims/trident-user-manual https://trident-user-manual.readthedocs.io
https://github.com/uw-dims/tsk4j (No additional documentation)
https://github.com/uw-dims/tupelo (No additional documentation)
https://github.com/uw-dims/xsdwalker (No additional documentation)
[Vix16] Paul Vixie. Magical Thinking in Internet Security. https://www.farsightsecurity.com/Blog/20160428-vixie-magicalthinking/, April 2016.

Challenges Encountered

This section describes many of the challenges that were encountered during the project’s period of performance. Some were eventually overcome in the final months of the project, though others were not. Suggestions for dealing with some of these issues are found in Section Needed Enhancements and Section Recommendations for Follow-on Projects.

Understanding the Tools Being Used

It is a myth that humans only use 10% of their brain. [1] But it is common for programmers and system administrators to only learn a small portion of the features and capabilities of any given program or tool that they may need to use. The simple thing to do is to search StackExchange for system administration tasks, or StackOverflow for programming tasks, and simply copy/paste what someone posts there in order to quickly “solve” the problem at hand and move on to the next thing.

Taking such short-cuts to avoid the investment of time required to learn the capabilities of a given tool can result in problems later on. Scripts may be written that restrict or limit the utility of the underlying programs they call, or that perform extra work that could be done in a more straightforward or idiomatic way using advanced features of the underlying programs. At a minimum, everyone sharing a common tool must become familiar enough with the available documentation, at least by skimming it completely, to be able to quickly get past a blocker.

When someone is tasked with solving a particular problem, given a set of requirements or use cases that should be satisfied, it is important that they take responsibility for studying the tool(s) being used to understand how best to perform the task at hand and to share their knowledge with other team members using the same tool. It does not work for the project lead to have to become the expert in every tool and micro-manage how team members do their work.

This problem was exacerbated in the DIMS Project due to the large number of new and rapidly changing tools and technologies that are necessary to assemble a system with the complexity defined in the project scope of work. Certain project team members had experience in specific programming languages, limiting their ability to contribute in some situations. Most had some familiarity with Unix system administration on their own development workstations, but were not able to branch out to unfamiliar Linux distributions. During this project every team member was pushed to learn new programming languages, new data storage mechanisms, new inter-process communication protocols, new software development tools and processes, new types of security event data, and new concepts in network and host forensics required to process security event data, threat intelligence streams, and malware artifact metadata.

Staffing Challenges

The document DIMS Job Descriptions v 2.9.1 was produced to list the full set of skills and experience required by those working on a project of this nature. As mentioned in the previous section, every member of the team was pushed beyond their technical limits and had to constantly research and learn new technologies and rapidly acquire new software engineering, network engineering, or system administration skills. Not everyone is capable of, or happy with, being pushed beyond their limits on a daily basis and some turnover within the project was directly related to this pressure to deliver.

Over the course of the project, the team size was typically 3-5 people, with most of the team working at less than 100% FTE. One contractor was 100% FTE for the majority of the project and had to step in to perform tasks that were not being performed by other team members.

The team was also partially virtual, with one, two, and sometimes three staff members working on the East Coast, while the rest of the team was on the West Coast (mostly in Seattle, WA, but at one point split between Seattle, Tacoma, and Bremerton, WA and one person in California.) Regular “scrum” meetings were held using online tools (at various times Adobe Connect, Skype, and Google Hangout were all used, to varying degrees of effectiveness or frustration.) This made the problem of trying to bring team members up to speed on new concepts and skills difficult, due to lack of physical presence and availability.

Another difficulty resulted from political issues as opposed to strictly technical issues. In order to meet the objectives of the contract, the team was being pushed far beyond their capabilities. Some people respond to this by putting in the extra time it takes to improve their skill set (either inside working hours, or seeing it as an investment in their professional career, doing some extra-curricular learning.) Others respond by pushing back, focusing their efforts on only those tasks they are comfortable with and no more, or otherwise not following the established development path. Some documentation was not produced as requested (in some cases the PI was able to make up for the deficit, but this was not possible when the PI did not write the software.)

One risk to a project that is hard to avoid is a dependency on external software products or standards that are outside of the control of the project (e.g., see STIX Development Libraries). Such situations can cause larger organizational weaknesses and personnel issues to surface that simply cannot be solved without commitment and full-throated support from higher up in the organization. The best that can sometimes be achieved is to learn from the situation, find a way to move on, and carry the lessons forward to do better in the future.

DNS Challenges

Naming Computers

Naming computers is not easy. One of the tenets of secure design is separation of services, which historically has driven system administrators to limit services to one service per computer. You have a DNS server that does DNS, a database server for data storage, a web server that provides HTTP/HTTPS web application services for browsers, an FTP file server that only serves static files, etc.

In such simple deployments, naming a computer based on the service it provides seems to make sense and to be simple. Until you decide it makes more sense to combine some related services on that one computer, at which point one of two things happens:

  1. The computer’s name now only matches one of the two services, and it becomes harder to know what computer name to use when trying to connect to the second service. (“The Git repos are on git, and Jira is on jira. We put Jenkins on one of those two servers, but was it on git or on jira?”)
  2. The service is put on another computer (possibly a virtual machine) and the computer’s name now matches the service. But now there is also another computer host to manage, with iptables rules, accounts and passwords allowing administrator access, the need to copy in SSH keys, etc. As more computers are added, management and use gets harder and harder.

Part of this problem is handled by adopting a policy of not naming computers after services, but instead using more generic host names (colors such as red and orange, or generic names such as node01 through node09). Those host names are then mapped with DNS A records (and associated PTR records to properly reverse-map the IP to the name), and CNAME entries create aliases in the DNS name space, allowing URLs to be formed with the service name as part of the DNS name (e.g., trident.devops.local may map to yellow.devops.local via a CNAME).

The drawback to this is that the administration of A records, PTR records, and CNAMES is more difficult than simple /etc/hosts entries, and requires a deeper understanding of DNS internals by all involved. The final implementation of DIMS Ansible playbooks generates DNS host name mappings using Jinja templating to generalize creating DNS entries.

Another problem that must be dealt with when placing multiple services on the same system is TCP port mappings. You can only have one service listening to port 80/tcp, port 443/tcp, etc. That requires that services like Trident, a web application service, etc., all have their own unique high-numbered service ports (e.g., 8080/tcp for Trident, 8000/tcp for the web application service, 8500/tcp for Consul’s UI, etc.) But now how do you remember which port to use to get to which service on which host? Adopting a prefix with the service’s name and using a CNAME that aliases the host allows an easier to remember mechanism to reach services, though at the cost of complexity in NGINX reverse proxy configuration. You can now access Trident using https://trident.devops.local/trident and https://consul.devops.local/consul to get to the Consul UI. What is more, using multiple DNS records for each Consul node in a cluster allows for round-robin access to distribute the connections across cluster nodes:

$ dig consul.devops.local +short
192.168.56.23
192.168.56.21
192.168.56.22

Separating DNS Name Spaces

Adding to the complexity of DNS and host naming is the situation of multi-homed hosts. Most people are accustomed to one computer with one or two interfaces (like a laptop with either a wired Ethernet interface, or a WiFi interface, only one of which is active at any given time). That means the computer always has just one active IP address, and since laptops are usually used for connecting as a client to remote services, they don’t even need to have a DNS name!

Layered, segmented networks that involve external firewalling, Virtual Private Network (VPN) access to multi-segmented Virtual Local Area Network (VLAN) switched or virtual machine network environments cause problems when it comes to host naming and DNS naming.

The early implementation of DIMS DNS used a single DNS namespace, with multiple arbitrarily chosen names per host (some hosts having four or more names using A records), some in the prisem.washington.edu namespace even though they existed only in the internal DNS server and not in the external authoritative name servers.

For example, a DNS name like jira.prisem.washington.edu would exist in the internal server, mapping to an IP address in the 140.142.29.0/24 network block. Doing dig @128.95.120.1 jira.prisem.washington.edu (an official UW name server) or dig @8.8.8.8 jira.prisem.washington.edu (one of Google’s name servers) would fail to get an IP address, but making the request of the internal server would work. Since Jira was running behind a reverse proxy, however, the host that was actually running the Jira server was not the one using the address on the 140.142.29.0/24 network block, so a second DNS name jira-int.prisem.washington.edu (also non-existent externally) would map to the internal IP address, which was only accessible over a VPN. This resulted in a huge amount of confusion. Which host was actually running Jira? What port? What order for DNS servers has to exist to ensure the request goes to the internal DNS server first, not the external DNS servers that don’t know the answer?

The proper way to handle multi-homed network namespace management is through the use of split-horizon (or split-brain) DNS. This requires multiple DNS servers, multiple DNS zones, and careful mapping of the IP addresses and DNS names for each of the zones, as necessary to route packets properly through the correct interface. Again, this requires a much deeper understanding of DNS than is common.
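
A quick way to confirm which view a given resolver serves, using the example name above (the internal resolver address is illustrative):

 dig @8.8.8.8 +short jira.prisem.washington.edu        # external view: no answer
 dig @192.168.88.10 +short jira.prisem.washington.edu  # internal view only (example address)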

Handling Dynamic Addressing on Mobile Devices

Yet one more issue that complicates connectivity is the use of mobile devices like laptops, which must use a VPN to connect to access-controlled hosts behind firewalls. If split-horizon DNS is used, with one DNS server behind the VPN such that it is only accessible when the VPN is connected, the mobile device may experience significant delays in DNS requests that cannot be sent to the unavailable DNS server. This requires complicated dynamic DNS resolver configuration that is difficult to set up and to debug without expertise in advanced network configuration on the operating system being used (in this case, Mac OS X and Ubuntu Linux were the two predominant operating systems on laptops.)

One of the ramifications of mobile devices using Ubuntu Linux is the role of NetworkManager, a notoriously problematic service in terms of network configuration management. It is very difficult to take control of services like dnsmasq for split-horizon DNS, or use VPNs (especially multiple VPNs, as was implemented in this project from the start), without running into conflicts with NetworkManager.

The DIMS project started using the Consul service as a means of registering the IP address of a client using a VPN, such that the current address and accessibility status is available using Consul’s DNS service. As Consul was going to be used for service health monitoring as well, this seemed like a good choice. One downside is further complexity in DNS handling, however, since not all hosts in the deployment were configured to run Consul using Ansible playbooks.
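
For example, once a node or service is registered, its current address can be looked up through Consul’s DNS interface (which listens on port 8600 by default); the node and service names here are illustrative:

 dig @127.0.0.1 -p 8600 +short dimsdemo1.node.consul     # address of a registered node
 dig @127.0.0.1 -p 8600 +short consul.service.consul     # addresses of the consul service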

Distributed Systems Challenges

There are several challenges to building even a small-scale distributed system comprising multiple operating systems on multiple network segments with multiple layers of baremetal, virtual machine, and/or containerization.

Physical Distribution

One of the core challenges when building distributed systems results from using separate DNS host names and physically separate data centers and/or logically separated subnets.

At the start of the DIMS project, hardware was physically located in two server rooms in two separate buildings operated by the Applied Physics Laboratory, with staff being located on a separate floor in one of the buildings. What is more, some staff had computers using static IP addresses with direct access to the internet, while others used dynamic IP addresses behind a separate APL “logical firewall” device. This meant use of four separate IP address ranges on four subnets behind two different firewalls. Other hardware was located in the main UW Data Center in the UW Tower building on a fifth network. Add to this one hypervisor on a system in the APL server room and another in the UW Tower, each with a separate OpenVPN server, with the necessity to route traffic between virtual machines on the two hypervisors. (Both hypervisors, by the way, were different and ran on two different operating systems.)

On multiple occasions, hardware had to be moved from one location to another (which meant changing IP addresses on both bare-metal hosts and virtual machines, changing routes, and changing VPNs.) The last time hardware was moved, in order to consolidate it all into one data center, the entire system became unstable and non-functional.

One of the machines being moved served as the hypervisor for approximately a dozen virtual machines making up the core of the DIMS development environment. At least three previous attempts had been made to task team members with documenting the “as-built” configuration of all of these components, their IP addresses and routes, and the mechanisms for remote control, in order to plan for the configuration changes needed to perform the move. Each previous time a move had been planned it had to be put off, because higher priority tasks needed to be addressed and/or team members had left the project before completing the tasks necessary for migration.

When the hardware finally had to be moved on short notice due to the impending extended leave of a key participant, the hastily performed move left the entire DIMS network non-functional, and the PI and two team members spent the next five days working to get the system functional and stable again. This process revealed that the configuration of the DIMS systems was significantly below the quality level previously assumed. System configuration settings were not adequately documented and were almost entirely hand-crafted (as opposed to being under Ansible configuration control as specified); two different hypervisors (KVM and Virtualbox) were in use on two different operating systems (RedHat Enterprise Linux 6 and Debian); and the networking relied heavily on something known as Project 172 private address routing, combined with internal virtual networks that had been administered by just one former team member using remote desktop services and/or X11 forwarding from a workstation that was no longer available for use.

The instability and outages caused by this long-delayed (yet required) hardware move set the team back significantly and had ripple effects on other deadlines and events that could not be adjusted or canceled.

Stability

Due to the inherent inter-relationships between subcomponents in a distributed system, stability of the overall system is a constant challenge. Not only are relocations of hardware like those described in the previous section a contributor to instability, but so are software changes. The DIMS project uses open source operating systems and tools that may be updated as frequently as monthly, often resulting in parts of the system “breaking” when an update happens.

As the entire distributed system was not put under Ansible control from the start, and “as-built” documentation was lacking in several areas, some architectural changes resulted in critical system components breaking, with no clear way to fix them. This could lead to days of running tcpdump and strace, watching syslog log file output, and poking at servers (after clearing the browser cache frequently to eliminate problems due to erroneous cached content) in order to diagnose the problem, reverse engineer the solution, and meticulously put all of the related configuration files under Ansible control. This was complicated by the fact that the team members who set up some of these systems were no longer on the project and could not assist in the cleanup.

One of the solutions that was attempted was to use Docker containers for internal microservices. The hope was to avoid some of the complexities of out-of-date libraries, version incompatibilities in programs, and differences in operating systems. The project team looked at several ways to deploy Docker containers in a clusterized environment and chose to use CoreOS (now called “Container Linux by CoreOS”). While this allowed clusterization using etcd, consul, and eventually Docker Swarm mode, it also resulted in a trade-off between leaving the three servers running CoreOS for clustering stable (and thus drifting apart in versions from the regularly updated development hosts running Ubuntu 14 and Debian 8), or dealing with changes to configuration files that had to be ported to Vagrant Virtualbox “box” files and the bare-metal cluster at the same time. As these systems were not easily controlled with Ansible at first, this caused a lot of frustration that was never fully eliminated. As the baremetal servers were re-purposed for pilot deployment work, the central cluster services degraded and took some formerly working services with them.
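
For context, the Docker Swarm mode clustering mentioned above is itself simple to bootstrap (a generic sketch only; the addresses reuse the example subnet shown earlier, and the join token comes from the init output):

 # On the first cluster node
 docker swarm init --advertise-addr 192.168.56.21
 # On each additional node, run the join command printed by "swarm init"
 docker swarm join --token <WORKER-TOKEN> 192.168.56.21:2377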

Software Engineering Challenges

The software engineering skill levels and experience of the team members varied widely, as did their individual coding styles, language preferences, and debugging abilities. This resulted in several points of friction (both technically and politically) over time. It also made it difficult to rely on documented requirements and white board sessions to provide sufficient direction for programmers to independently produce “production” quality system components. A project of this scope requires more direct interaction between the PI (who knows the user requirements and what needs to be built to meet them) and individual team members (who are tasked with building those components). This requires a greater level of institutional support and commitment, or a more highly-skilled and experienced engineering team, than was available.

Using Agile

Problems with achieving and maintaining a cadence with Agile/Scrum and software releases were exacerbated by team member physical distribution, time zone differences, and work schedule differences. All team members were new to using Git, which has a steep learning curve to begin with. Differences in Git versions across workstations caused problems in sharing code. Getting everyone to adopt common processes and tools proved to be difficult. The most prevalent model for branching, described in Vincent Driessen’s “A successful Git branching model,” was chosen as the right model to follow. Getting all team members to learn it, and follow it, was not entirely achieved. (A diagram of the model is shown in Figure Vincent Driessen Git branching model.)

The dimsdevguide was produced, with sections laying out things like policy (Development and Core Tool Policy) and guidance on using Git (Source Code Management with Git).

Vincent Driessen Git branching model

Vincent Driessen Git branching model

What tended to happen over and over was that a large number of disruptive changes and bugfixes would accumulate on a single long-lived feature branch (sometimes for weeks at a time) before being merged back into the develop branch, let alone released to the master branch. Testing successfully (and sometimes just getting a stable build at all) required multiple repositories to all be on the same feature branch. In the worst case, one part of the system would only work on one feature branch and another part would only work on a different feature branch, creating an impasse in which a full build of multiple VMs would not work. This caused repeated states of instability and high stress leading up to demonstrations.

It wasn’t until Q2 2017 that stability was achieved on the master branch, regular merges between feature branches and develop kept both stable, and hotfix branches were used diligently enough to improve the master and develop branches without losing those fixes on long-lived feature branches. In retrospect, “release early, release often” and “build from master” to validate merged features should be the mantra. (This process was adopted leading up to the pilot deployment, which was built almost exclusively from the public master branch of https://github.com/uw-dims/ansible-dims-playbooks.)

Backward Compatibility

In Section Stability, the problem of version drift between like components in a distributed system was discussed. The right answer is to put everything under Ansible control from the very start and to handle subtle variations in how things are installed and configured by using the minimum necessary “glue scripting” so as to stay in sync with versions across all subsystems. This is a difficult task that takes expertise that was not commonly available across all team members.

Backward compatibility issues also arose with one of the core components the DIMS project was using: the Trident portal. Open source projects (DIMS included) move forward and change things at whatever cadence they can follow. Sometimes this means some fairly significant changes will happen quickly, requiring some effort to keep up. This results in a challenge: stay on the cutting edge by focusing effort as soon as changes are made, or try to maintain some stability by pinning to older versions that are working?

In order to keep stability in the development environment and make forward progress on a number of fronts, the Trident version was pinned to 1.3.8. The pilot deployment, however, needed to be done using a newer version (at the time 1.4.2, currently 1.4.5). There were at least two significant changes between the 1.3.8 and 1.4.2 versions: the CSS style sheets used by the Trident portal GUI went from two files to one, changing names at the same time, and there were incompatible changes to the command set of the tcli command line interface that Ansible used to install and configure Trident. Accommodating both versions required reverse engineering the changes by extracting and differencing the files in the two packages, then using conditional logic and dictionaries to switch quickly between versions 1.3.8 and 1.4.2, keeping a stable working demo while simultaneously preparing for the pilot deployment. (A variation of this technique is illustrated in the code block Excerpt of client.py showing version support). This diverted a significant amount of energy for a period of time and pushed other tasks to the background.
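
The general shape of that conditional-switching technique, expressed as an Ansible sketch rather than the actual ansible-dims-playbooks code, is shown below; the host group, CSS file names, and installation path are illustrative assumptions.

Hypothetical Ansible sketch of switching between Trident versions
 ---
 # All version-specific details live in one dictionary keyed by the Trident
 # version string, so moving from 1.3.8 to 1.4.2 means changing one variable.
 - hosts: trident
   vars:
     trident_version: '1.3.8'                 # flip to '1.4.2' for the pilot
     trident_css:
       '1.3.8': [trident.css, custom.css]     # hypothetical file names
       '1.4.2': [combined.css]                # hypothetical file name
   tasks:
     - name: Install the CSS style sheets used by the pinned Trident version
       copy:
         src: '{{ item }}'
         dest: '/usr/share/trident/webroot/css/{{ item }}'   # assumed path
       with_items: '{{ trident_css[trident_version] }}'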

External Dependencies and Pacing

One of the most laudable goals of this project was the use of open source tools integrated into an affordable distributed system capable of scaling to handle millions of security events per day. The flip side is that every one of those open source tools from outside entities is produced on someone else’s whim (including pace of release, quality of testing, rate of disruptive changes in code, time available to respond to interactions, etc.).

For example, keeping up with the pace and direction of change in STIX core development, combined with difficulties in maintaining development momentum within the project team, meant this avenue could not be sustained. (See STIX Development Libraries.) The other challenges listed in this section made our internal pace much slower than desired, which in turn made it harder to reach out to and interact with the developers of the Trident portal. The friction within the project slowed some of our internal development, forcing us to play “catch-up” late in the project and preventing us from providing as much input as we had hoped to their developers about features we needed.

Testing

The contract included requirements for adherence to a specific software design standard and for two iterations of producing a full-system test plan and test report. The prime contractor organization had no previous experience with these standards and no formal in-house training or resources to support production of the test plan or test report. The sub-contractor providing project management assistance procured a software testing subject matter expert with experience at a large aerospace company. The initial plan developed by this expert (while perhaps typical for a large project in a large organization with specialized staff dedicated to testing) went far beyond what the DIMS Project’s staffing and budget resources could support to manage the test planning, execution, and reporting, not to mention the cost of the commercial testing tools being recommended.

The PI identified MIL-STD-498, described at A forgotten military standard that saves weeks of work (by providing free project management templates). A simpler and more manageable plan was developed following the MIL-STD-498 Software Test Plan (STP.html), along with the Software Test Report (STR.html). Even with this simpler plan, the initial test consumed the majority of the project effort for several weeks leading up to the deliverable deadline.

Prior to the second system-wide test cycle, the PI spent time automating production of the Test Report from machine-parsable inputs. The second test took less effort than the first, but the amount of manual effort was still large, and one team member did not produce any input for the test report until the week after the report was delivered to the sponsor, despite numerous requests in the weeks leading up to the deadline.

[1]All You Need to Know About the 10 Percent Brain Myth, in 60 Seconds, by Christian Jarrett, July 24, 2014.

Needed Enhancements

Perfection is finally attained not when there is no longer anything to add, but when there is no longer anything to take away.

Antoine de Saint Exupéry, Terre des Hommes (1939)

There are several areas in the DIMS development architecture that are more complex than is desirable, where things are difficult to use and bugs often get in the way. As the quote above attests, it is a challenge to make things simple, elegant, and robust. Given limited resources, time deadlines, and pressures to get things working, some components of the DIMS project have attained the success of a prototype and are now in need of reimplementation as a cleaner, tighter, next major version. This section describes some of those components and thoughts on what to do next.

Packer and Vagrant Workflow Process

When the DIMS project began, and the team first began to work with multiple Linux distributions (to follow each open source tool producer’s specified supported platform requirements), the decision was made to use Packer for creating virtual machine images from distribution disk image files (colloquially known as ISOs, short for ISO 9660 format read-only boot images).

To facilitate applying generic configuration settings and package choices to multiple Linux distribution boot images, some helper Makefile rules files were created, allowing the dependency chain to be defined so that Unix make can skip steps whose products are already up to date. You don’t want to repeat every step of a lengthy process (one that involves downloading over a gigabyte of package files) every time you want to create a new Vagrant virtual machine.

This process pipeline eventually included a Jenkins server that would trigger ansible-playbook execution to implement a complete continuous integration/continuous deployment environment. This process looked like that depicted in Figure Packer/Vagrant Workflow.

Packer/Vagrant Workflow

Packer/Vagrant Workflow

The options at the time were tools like Chef, Puppet, Heat, or Terraform. The choice had already been made to use Ansible for system configuration automation, which the team did not see as compatible with Chef or Puppet, and programs like Heat and Terraform were designed for much larger and more complicated multi-region cloud service deployments. We wanted DIMS deployments to fit in a single server rack with a small external network footprint, since the PRISEM system that DIMS was to build upon was deployed that way.

In September of 2015, well into the DIMS project, Hashicorp came out with “otto” and “nomad”. [1] These looked promising, but were immature and looked costly to implement. In August 2016, Hashicorp announced they were decommissioning and abandoning “otto”. [2] There is still a need for a tool like this, but we continued to use the tools we had developed despite their limitations. Continued simplification of these tools and integration with Ansible through use of the inventory and templating scripts, Packer .json files, and Vagrantfile configuration files would help smooth things out.

In the long term, a solution is desired that falls in the gap between a single server rack with custom Makefiles and scripts and something as complex as OpenStack or AWS CloudFormation. This could be Packer and Terraform with custom provisioners. Experiments using Packer to create Amazon instances were performed successfully, and a prototype using Terraform to provision Digital Ocean droplets has been initiated; it is anticipated to be completed after the project ends, for use in follow-on projects using the DIMS software products.

Normalization of Kickstart

Along with a standardized and simplified virtual machine instance build process, a related simplified bare-metal boot capability is needed for more efficient deployment of servers. The Debian installer supports Kickstart-style pre-configuration of the steps needed to perform an unattended (“hands-off”) installation of the operating system at boot time. This mechanism is used in DIMS as part of the Packer workflow and as part of the customized USB thumb drive installer. It can also be made to work with Virtualbox (or other hypervisors, for that matter) directly.

  • The Packer workflow uses inline commands to perform some initial system setup steps necessary to then use Ansible for the remainder of the system configuration.
  • The Vagrant workflow for Ubuntu and Debian uses some inline commands in the Vagrantfile for pre-Ansible customization, and some external scripts.
  • The Virtualbox Box file preparation for CoreOS uses external scripts to prepare CoreOS for Ansible, and other Vagrantfile inline commands for boot-time customization.
  • The automated USB installer workflow uses the Kickstart preseed.cfg file for those preparatory steps the installer is capable of performing, plus a secondary pre-boot customization script that is downloaded and executed at install time for the remaining pre-Ansible customization.
  • Manual creation of virtual machine guests or baremetal systems using the default Ubuntu or Debian installer (without using Kickstart) requires manual steps be performed to prepare the system for Ansible control.

The problem is that each of these workflows was created by separate team members at different times, much of it without coordination or integration. Multiple attempts were made to task team members with identifying all of the above and reducing or refactoring the steps into a coherent and consistent set of commonly-used scripts. This resulted in some degree of simplification and integration, but much work remains to be done here.

Rather than having multiple incompatible inline shell mechanisms (the easiest to implement, but least compatible, means of accomplishing the same tasks), a cleaner way to handle this situation is to reduce the preseed steps to the bare minimum necessary to enable external Ansible control. These simpler preseed steps can then be included as needed by each tool during Kickstart, or the install-time tasks can be performed by Bash shell scripts that each tool can call. This makes all of the install-time steps consistent, configurable using Ansible, and shared across tools. The remaining steps can then be turned into Ansible playbooks that are applied post-boot, again in a completely consistent manner.
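
A minimal sketch of that split is shown below, assuming the preseed file does nothing beyond making SSH login possible: a single bootstrap playbook performs the few pre-Ansible steps identically for every install path. The group name, account name, and key location are hypothetical.

Sketch of a post-install bootstrap playbook
 ---
 # Everything beyond the bare-minimum preseed is done here, the same way for
 # Packer, Vagrant, USB-installed, and manually installed hosts.
 - hosts: newhosts
   become: yes
   gather_facts: no          # Python may not be installed yet
   tasks:
     - name: Ensure Python is present so Ansible modules can run
       raw: test -x /usr/bin/python || (apt-get -y update && apt-get -y install python)

     - name: Create the account that Ansible will use from now on
       user:
         name: ansible
         shell: /bin/bash

     - name: Authorize the control host's SSH key for that account
       authorized_key:
         user: ansible
         key: "{{ lookup('file', 'files/ansible.pub') }}"    # assumed key location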

Configuration Management Database

At the start of the project, configuration was handled with a combination of variables stored in files that could be exported through the shell’s environment into scripts, Makefile rules, and Ansible vars files. These mechanisms were not fully integrated, and it was difficult to switch between different sets of variables to support multiple simultaneous deployments. For this reason, the team clung to a single deployment for far too long.

In terms of Ansible, the simplistic and limited INI style inventory, with group and host variable files, was easy to learn but proved difficult to manage across multiple deployments; for this reason its use held the project back for a long time.

Having multiple deployments was always a project objective, but how to achieve it using free and open source tools was not obvious to the team. It was clear that a configuration management database was needed that supported a more object-oriented “inheritance” style mechanism of defining variables that would more easily accommodate managing multiple simultaneous deployments.

The need here is for a system that behaves something like the way OpenStack supports a CLI for getting and setting variables in concert with a “cloud” configuration file to control high-level storage locations that allow a single interface to operate across multiple configuration databases. Ideally, this database would serve as what is called a “single point of truth” or “single source of truth” about not only hardware in a data center (e.g., servers and network equipment, rack slot allocations, switch ports, VLANs), but also configuration specifics that would drive Ansible playbooks for configuration and templating of scripts that run on the systems. A lot of research was done, but nothing seemed to be a good fit. Commercial tools like Ansible Tower [3] may solve this problem, but that was neither in the project’s budget, nor did that conform with the objective of using only free and open source software tools. Other solutions were similarly focused on enterprise-level deployments and were not suitable for our use.

The tools that do exist are all focused on massively-scaled, multi-datacenter cloud deployments using a federated model. Trying to add them to the mix would be too costly and divert too much attention from other critical elements of system integration. What is needed by projects like this is a mechanism for many small-scale, single-datacenter deployments that are configured locally, but pull much of their code from the public repositories on GitHub.

The solution settled upon in the DIMS project was a combination of defaulting most variables in roles and keeping a separate “private” directory tree for each deployment. The private tree holds customization details in the form of Ansible YAML style inventory files and locally customized files and templates, which the playbooks in the public ansible-dims-playbooks repository use before falling back to the generic equivalents in the public repository. This made it possible to operate multiple deployments in parallel with the public repository with less hassle, though it is still not the ideal solution.
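
The shape of such a private tree can be sketched as a YAML inventory file per deployment; the path, host names, and variable names below are illustrative assumptions, not the actual DIMS inventory.

Sketch of a deployment-specific YAML inventory file
 ---
 # private-exampledeploy/inventory/all.yml (hypothetical path): values set
 # here override role defaults in the public ansible-dims-playbooks
 # repository; anything not set here falls back to those defaults.
 all:
   vars:
     dims_domain: exampledeploy.example.com
     trident_version: '1.4.2'
   children:
     trident:
       hosts:
         yellow.exampledeploy.example.com:
     manager:
       hosts:
         purple.exampledeploy.example.com: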

Continued Reimplementation and Integration of Services

Due to some of the issues listed in Section Challenges Encountered, several of the sub-systems in the original development deployment that were never fully under Ansible control and had been hand-configured became unstable and failed. The DIMS dashboard web application, the Tupelo server, and the Jenkins server were all built on older Ubuntu 12.04 LTS and on a Linux appliance virtual machine that was one of the first servers installed. Because these base operating systems were manually created and managed, and the person who had originally set them up was no longer working on the project, rebuilding them would have been difficult and took lower priority than completing other tasks. Some of the services, such as the dashboard web app, were also constructed using older Ansible playbooks that did not conform to the newer standards used for later playbooks and would similarly take extra time to bring fully up to current standards. These tasks were added to Jira tickets, along with rebuilding all of the other central components (e.g., the Jenkins build server that failed when accidentally upgraded to a version with non-backward compatible features).

In the final months of the project, effort was put into re-implementing as many of the original (version 1) deployment services as possible. The RabbitMQ service, Jenkins with Git+SSH and Nginx file service, and Trident portal were all reimplemented and replicated on a new server. The Tupelo, PRISEM RPC services, and Lemon LDAP (for single-signon service) server roles remain to be re-implemented and updated from their original Ansible roles and the hand-crafted Jira system implementations. The DIMS Dashboard, Redis server, and ELK stack Ansible roles (which were all working in prototype form in year 2, prior to moving the project to UW Tacoma) should be easy to port after that, but it is likely that the Javascript Dashboard and Java Tupelo code are now out of date and will require experienced Javascript and Java programmers to bring them up to current coding standards.

Secrets as a Service

In the first year of the project, many secrets (passwords, non-public sensitive sample data, private keys, and SSL/TLS certificates) were at worst committed to source code, or otherwise passed around manually. This is neither a secure way to deal with secrets, nor does it scale well. Ansible Vault and a separated private directory were prototyped as mechanisms for storing shared secrets, but passwords were not entirely eliminated in favor of a ubiquitous single-signon mechanism. (Single-signon was implemented for Jira, Jenkins, and the DIMS Dashboard server, but no further.) Trident uses a JSON Web Token (JWT, pronounced “jot”). The LDAP and JWT tokens could be extended; alternatively, a service like FreeIPA or HashiCorp Vault (both used in the system illustration in Figure OpenCredo core building blocks), or Docker’s built-in secrets management feature (see Introducing Docker Secret Management and Manage sensitive data with Docker secrets), could be used. There are many tradeoffs and service integration issues in this area that make this a non-trivial problem for all open source projects of this scope.
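
As one example of the “separated private directory plus Ansible Vault” approach, a playbook can load an encrypted vars file at run time so that no plaintext secret ever lands in a Git repository. The file path, variable name, and destination below are assumptions for illustration.

Sketch of loading a vaulted secret at playbook run time
 ---
 # The encrypted file is created once with:
 #   ansible-vault create private-exampledeploy/vault.yml
 # and contains, for example:
 #   trident_db_password: "not-stored-in-git"
 - hosts: trident
   vars_files:
     - private-exampledeploy/vault.yml
   tasks:
     - name: Render the Trident configuration using the vaulted password
       template:
         src: trident.conf.j2
         dest: /etc/trident/trident.conf       # assumed path
         mode: '0600'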

Testing and Test Automation

Section Testing describes effort put in by the PI to automate system-wide testing. This primarily centered on post-processing output of BATS tests and JSON files created using a simple user interface that collected information related to the tests as described in the DIMS Commercialization and Open Source Licensing Plan v 1.7.0. Another team member created scripts in Jira that produced these same JSON files describing the output of manual tests managed using Jira, reducing the amount of effort to perform and report on user interface tests. The final product was a structured set of RST files that could be processed with Sphinx to produce the test report in HTML, PDF, and epub formats. Such test automation decreased effort required to perform test cycles and supported automated production of reports with very little need for manual input.

The larger vision here was to scale test production and reporting by orchestrating the process using Ansible. For example, an Ansible playbook could invoke the test.runner script (see Running Bats Tests Using the DIMS test.runner) on every system, placing output into a known file, which can then be retrieved using the fetch module into a hierarchical directory structure based on the system names. The contents of this directory tree can then be turned into separate RST files and an index.rst file generated that is then rendered using Sphinx.
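
A sketch of that orchestration follows, assuming test.runner accepts a flag for machine-parsable output and writes to stdout; the script path, flag, and results directory layout are assumptions.

Sketch of collecting per-host test results with Ansible
 ---
 # Run the test suite on every host and pull each host's output back to the
 # control host under results/<hostname>/ for conversion to RST and Sphinx.
 - hosts: all
   tasks:
     - name: Run the BATS test suite on each host
       command: /opt/dims/bin/test.runner --tap     # assumed path and flag
       register: test_output

     - name: Save the output to a known file on the host
       copy:
         content: '{{ test_output.stdout }}'
         dest: /tmp/test.runner.out

     - name: Retrieve the results into a per-host directory
       fetch:
         src: /tmp/test.runner.out
         dest: 'results/{{ inventory_hostname }}/'
         flat: yes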

Further automation of the testing process along these lines would decrease the cost and disruption to regular development efforts, allowing more frequent testing and decreasing the effort spent on resolving failed tests. (This task was on the to-do list, but had to take a lower priority to other more important tasks.)

[1]https://www.hashicorp.com/blog/otto/
[2]https://www.hashicorp.com/blog/decommissioning-otto/
[3]A couple of months before the end of the DIMS project period of performance, RedHat released the Ansible Tower product in open source form as the AWX Project. There was no time to fully learn how to use and evaluate this product, though it appears it would be relatively easy to add it to the ansible-dims-playbooks as a role and deploy it along with other system components.

Recommendations for Follow-on Projects

This section includes recommendations for consideration in planning follow-on projects, whether they use the DIMS Project software products or not, intended to help reduce friction in the software development process. Any group wanting to form a regional information sharing collaboration building on the DIMS software base, or a small open source development project wanting to use the platform for secure software development, will want to consider these suggestions.

During the time of this project, we encountered all of the typical problems that a team would have in the lifecycle of designing, deploying, and maintaining a small-scale (on the order of dozens of server components) distributed system. Maintaining isolated development, test, and production systems raises the difficulty factor. Performing multiple production deployments and updating code over time raises it further. Eventually, the lack of automation becomes a limiting factor at best, or leads to an extremely unstable, fragile, and insecure final product at worst.

The benefit to those who choose to follow our lead will be a faster and smoother journey than we experienced during the DIMS project period of performance. Ready-made answers to the hurdles, mistakes, and struggles of distributed system engineering were not easily found in the open source community; the many successes and achievements described here had to be worked out along the way. The DIMS System Requirements v 2.9.0 documents security practices and features that we have attempted to incorporate to the greatest extent possible, in a way that can be improved over time in a modular manner. The system automation and continuous integration/continuous deployment features help in implementing and maintaining a secure system. (Red team application penetration testing will further improve the security of the system through feedback about weaknesses and deficiencies that crept in during development and deployment.)

Focus on System Build Automation

From the first days of the project, the PI constantly told the team not to build things by hand, since hand-built systems do not scale and cannot be replicated. We didn’t need one hand-built system; we needed multiple systems for development, testing, and production, and anyone wanting to use the DIMS system needed to be able to stand one up quickly and easily. This can only be accomplished using stable and well-documented build automation.

Ansible was chosen early on as what looked like the most promising system build automation tool, so in this sense saying “focus on system build automation” means “focus on mastering Ansible.” Anyone wanting to build on the project’s successes, and avoid some of its challenges, must ensure that all team members involved in development or system administration master using Ansible. That can’t be stressed enough, since any system that is not under Ansible control is at risk of instability and very costly effort to fix or replace should something happen to it. Any host that is fully under Ansible control can be quickly rebuilt, quickly reconfigured, and much more easily debugged and diagnosed.

Rather than use SSH to log into hosts, whenever possible use ansible ad-hoc mode. The ability to invoke modules directly using ansible not only helps learn how the module works, but it also allows very powerful manipulation of any or all hosts in the inventory at once. This makes debugging, configuring, cleaning up, or any other task you need to perform, much easier and more uniform. Avoiding logging into hosts and remotely using the command line shell also helps focus on controlling the configuration and using the automation rather than hand-crafting uncontrolled system changes.

Using ansible-playbook with custom playbooks for complex tasks, or by invoking a master playbook that includes all other playbooks, facilitates performing an action across any or all hosts to keep things better in sync. It documents the steps in a way that they can immediately be performed, rather than documenting in English prose with code blocks that need to be cut/pasted and edited manually to apply them when needed. The friction caused by manual configuration of hosts is one of the biggest impediments to building a complex and scalable system.
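One way to structure such a master playbook is a top-level file that simply imports the others, so “run everything” and “run one slice” use the same files (import_playbook on recent Ansible releases; older releases used include at the playbook level). The playbook names below are illustrative.

Sketch of a master playbook that imports the others
 ---
 # site.yml (hypothetical): "ansible-playbook site.yml" applies every
 # playbook across the inventory; each imported playbook can still be run
 # on its own for a single service.
 - import_playbook: bootstrap.yml
 - import_playbook: trident.yml
 - import_playbook: jenkins.yml
 - import_playbook: dashboard.yml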

Last, but not least, putting all configuration files under Ansible control from the start enables much easier change management and incremental configuration adjustment, with less disruption than manual system administration. It is far easier to search Git history to figure out what changed, or to search a few directory trees to locate where a particular variable is set or a configuration file was customized. The ansible_managed line in configuration files on the end systems tells you precisely which file Ansible used to create the current file, allowing you to make changes and commit them to the Git repository to maintain history and control. Editing a file on the end host and introducing an error, or accidentally deleting the file, makes recovery difficult, while reliably re-applying a role is simple and easy.
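
A minimal sketch of that pattern: the template records its own provenance via ansible_managed, and the task that installs it is the only sanctioned way the file changes. The file chosen here is an arbitrary example.

Sketch of an Ansible-managed configuration file
 ---
 # templates/ntp.conf.j2 (hypothetical) begins with a provenance comment:
 #
 #   # {{ ansible_managed }}
 #   server 0.pool.ntp.org iburst
 #
 # The task that renders it onto the end host:
 - hosts: all
   become: yes
   tasks:
     - name: Install ntp.conf from the controlled template
       template:
         src: ntp.conf.j2
         dest: /etc/ntp.conf
         owner: root
         mode: '0644'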

Standardize Operating Systems

As much as possible, standardize on a small and manageable number of base operating systems and versions, and strive to keep up with the most recent release (and perhaps one previous release) to avoid supporting too many disparate features and one-off workarounds. Every major or minor version difference (e.g., 12.04.4 vs. 14.04.4 for Ubuntu Linux) or distribution difference (e.g., Fedora vs. RedHat Enterprise Linux vs. CentOS) can have implications for compatibility of sub-components, be they programs, libraries, or add-ons and utilities.

While this recommendation sounds simple, it is not. This task is made difficult by the choices of supported base operating system(s) made by each of the open source security tools you want to integrate. Great care needs to be taken in making the decisions of which operating systems to support, balanced with available expertise in the team for dealing with required debugging and configuration management tasks.

Using a configuration management program like Ansible helps by expressing installation steps using different Ansible modules or plays, though it does require engineering discipline to deal with complexity (above and beyond what a Bash script would entail, for example) and to ensure the right plays work the right way on the right operating system. This could mean maintaining a large set of group variables (one for each alternative operating system), using variables in inclusion directives to select from those alternatives, and/or using “Ansible facts” derived at run time with logic (e.g., when: ansible_os_family == "Debian" as a conditional in a playbook). Developing Ansible playbooks in a modular way that can easily accommodate generalized support for multiple operating systems (e.g., using a “plug-in” style model) is a more sophisticated way of writing playbooks that requires a greater level of expertise from those writing them. Such expertise, or institutional support for employee training to achieve it, is not always available.
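
A small sketch of the per-OS-variable approach described above follows; the variable file layout and package list name are assumptions.

Sketch of selecting per-OS variables using Ansible facts
 ---
 # Package names differ between distributions; keeping them in per-OS
 # variable files and loading the right file at run time keeps the tasks
 # themselves free of scattered per-distribution logic.
 - hosts: all
   become: yes
   tasks:
     - name: Load variables for this OS family (vars/Debian.yml, vars/RedHat.yml, ...)
       include_vars: 'vars/{{ ansible_os_family }}.yml'

     - name: Install the base packages named in that variable file
       package:
         name: '{{ base_packages }}'
         state: present
       when: ansible_os_family in ['Debian', 'RedHat']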

Standardize on Virtual Machine Hypervisor

Attempting to use different hypervisors on different operating systems vastly increases the friction in moving virtual machine resources from host to host, from network to network, and from manual to automated creation. While it is often possible to export a VM image, convert it to another format, and import that VM back into another hypervisor, these steps require additional planning, time and effort, and data transfer and storage resources that add friction to the development process.

Start with a preferred hypervisor to support and take the time to migrate legacy virtual machines to that preferred hypervisor, rather than attempting to support part of the system with one hypervisor and the rest with another. If it becomes necessary to support additional hypervisors, require replication of the entire system of systems in a separate deployment (i.e., fully independent, not sharing any resources in a way that couples the heterogeneous systems) to ensure that tests can be performed to validate that all software works identically using the alternate hypervisor.

The vncserver role [1] was created to make it easier to remotely manage long-running virtual machines using a GUI hypervisor control program. Using CLI tools is also necessary, however, to more easily script operations so they can be parallelized using Ansible ad-hoc mode, or scheduled with cron or other background service managers.

[1]https://github.com/uw-dims/ansible-dims-playbooks/blob/master/roles/vncserver/tasks/main.yml

Manage Static Config Files Differently than User-controlled Files

Managing files in /etc is different from managing per-user files like ~/.gitconfig. Let users customize things, and add (merge) group content rather than wholesale replacing files based on templates. Blindly installing configuration files causes regression problems for users when an Ansible playbook or role wipes out changes a user has made and takes the configuration file back to its initial state.

There are several ways to do this, some more complicated than others. One of the easiest is to start with a generic file that has very little need for customization and will run on all systems, and that uses a drop-in inclusion mechanism to support the following (a minimal sketch follows the list):

  1. Adding operating-system specific additions that are selected by some variable, such as the output of uname -s as a component of the file name;
  2. Allowing users to control their own customizations by including a file with some string like local in its name;
  3. Supporting the ability for users to place their account configuration files in a personal Git repository that can be cloned and pulled to development systems, making the configurations consistent across hosts.
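
A minimal sketch of the approach, using Bash and Git dot-files as the example and a marked block for the group-managed portion; the paths, starter file, and marker text are illustrative assumptions.

Sketch of managing user dot-files without clobbering user changes
 ---
 # The starter file is installed only if the user has none, and the
 # group-managed content is confined to a marked block, so everything the
 # user adds outside that block survives repeated playbook runs.
 - hosts: all
   tasks:
     - name: Install a starter .gitconfig only when the user has none
       copy:
         src: gitconfig.default          # hypothetical starter file
         dest: '{{ ansible_env.HOME }}/.gitconfig'
         force: no

     - name: Maintain the group-managed lines in a marked block of .bashrc
       blockinfile:
         path: '{{ ansible_env.HOME }}/.bashrc'
         marker: '# {mark} ANSIBLE MANAGED: group additions'
         block: |
           [ -f ~/.bashrc.$(uname -s) ] && . ~/.bashrc.$(uname -s)
           [ -f ~/.bashrc.local ] && . ~/.bashrc.local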

Robust, Flexible, and Replicable Build Environment

Some of the DIMS tools were initially prototyped using the Unix make utility and Makefile rules files. The make utility is nice in that it supports dependency chaining. Things don’t need to be rebuilt if the constituent files used to build them have not changed. This works great for source code, since programs are all static files (e.g., .c and .h files for C programs) that can easily have timestamps checked to see if they require recompiling to create new libraries or executable files. It is a little more difficult when a script is produced from a template, which is produced from a complex set of inventory files, host variable files, group variable files, and command line variable definitions as is supported by Ansible. In that case, the Makefile model is harder to use, especially for those who are not experts in how make works and may not have the skills required to efficiently debug it with remake or other low-level process tracing tools.

Tools like Jenkins or Rundeck provide a similar kind of dependency chaining mechanism which may be preferable to make, provided that programmers carefully use variables and templating to produce the build jobs such that they can be deployed to development, testing, staging, and production environments without having to manually change hard-coded paths, etc. This level of generality may be difficult to set up, but is necessary to be able to scale and replicate the build environment. This may sound like a “nice to have” thing, but when cloning the system for deployment requires manually copying build artifacts out of the one-and-only development build server, manually setting up a mechanism allowing virtual machines to access the files, and manually keeping it up to date as things change, the “must have” nature makes itself painfully obvious.

Avoid Painting Yourself into a Corner with Versions

From the start, build everything to support at least two operating system release versions (the current release and one release back, or N and N-1) and try to move as quickly as possible to the current release to avoid getting locked in to older systems. This process is made easier if everyone writing scripts and configuration files follows a “no hard-coded values” rule for things like version numbers, hashes of distribution media for integrity checking, file names of ISO installation disk images, etc.

If all of the required attributes of an operating system release (e.g., version major and minor number, CPU architecture type, ISO download URL, SHA256 hash of ISO, etc.) were referenced with variables, and those variables were used consistently throughout the OS build and Ansible deployment and configuration process, switching between releases would be a simple matter of switching between two sets of variable definitions. This is where dictionaries (also known as “maps”) come in handy, allowing a single key (e.g., “ubuntu-14.04.5”) to serve as an index to obtain all of the constituent variables in a consistent way. If the Packer build process, the Kickstart install process, and the Ansible playbooks all define these attributes in different ways, it becomes very difficult to upgrade versions.
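
As a concrete (hypothetical) illustration, such a dictionary might look like the following, with the Packer templates, preseed generation, and Ansible playbooks all referencing the same keys; the URLs and hashes shown are placeholders, not verified values.

Sketch of a dictionary describing each supported OS release
 ---
 os_release: 'ubuntu-14.04.5'            # the single knob to change
 os_releases:
   'ubuntu-14.04.5':
     distro: ubuntu
     version: '14.04.5'
     arch: amd64
     iso_url: 'http://releases.ubuntu.com/14.04.5/ubuntu-14.04.5-server-amd64.iso'
     iso_sha256: 'replace-with-the-published-hash'
   'ubuntu-16.04.3':
     distro: ubuntu
     version: '16.04.3'
     arch: amd64
     iso_url: 'http://releases.ubuntu.com/16.04.3/ubuntu-16.04.3-server-amd64.iso'
     iso_sha256: 'replace-with-the-published-hash'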

Since operating systems are incrementally improving over time, the build environment must take this into consideration to keep you from getting painted into a metaphorical corner and finding it difficult to get out (without spending a lot of time that should otherwise be directed to more productive tasks). Requiring support for version N and N-1 simultaneously not only provides a mechanism for testing package and configuration updates across versions, but means that it will be much simpler when version N+1 is released to upgrade, test and plan a system-wide migration to the new OS release.

Similarly, source code and system configuration (e.g., Ansible playbooks) should also support versioning. An example of how to do this is found in the GitHub source repository for openstack/python-openstackclient. The source code for client.py (starting at line 24, and highlighted in the following excerpted code block) shows how this is done by defining DEFAULT_API_VERSION (which can be changed via the --os_identity_api_version command line option) and by mapping the option strings to directory names in the openstack/python-openstackclient repository and to module names.

Excerpt of client.py showing version support
 DEFAULT_API_VERSION = '3'
 API_VERSION_OPTION = 'os_identity_api_version'
 API_NAME = 'identity'
 API_VERSIONS = {
     '2.0': 'openstackclient.identity.client.IdentityClientv2',
     '2': 'openstackclient.identity.client.IdentityClientv2',
     '3': 'keystoneclient.v3.client.Client',
 }

 # Translate our API version to auth plugin version prefix
 AUTH_VERSIONS = {
     '2.0': 'v2',
     '2': 'v2',
     '3': 'v3',
 }

Of course this requires greater engineering discipline when programming, but had this technique been known and used from the start of the project it would have resulted in a much more organized and structured source directory tree that can support deprecation of old code, transition and migration to new versions, as well as clean deletion of obsolete code when the time comes. Using this mechanism of uniformly handling version support is much more modular than using conditional constructs within programs, or mixing old and new files in a single directory without any clear way to delineate or separate these files.

Budget for System Maintenance

To paraphrase a joke in the programming world: “You have a problem. You decide to solve your problem using free and open source software tools and operating systems. Now you have two problems.” Sure, it’s a joke, but that makes it no less true.

Trying to compose a system from open source parts that are constantly changing requires constantly dealing with testing upgrades, updating version numbers in Ansible playbook files, applying patches, debugging regression problems, debugging version inconsistencies between systems, and updating documentation. The more software subsystems and packages that are used, the greater the frequency of changes that must be dealt with. Assume that somewhere between 25% and 50% of the project working time will be spent dealing with these issues.

The automation provided by Ansible, and the integration of unit and system tests (see Testing System Components) helps immensely with identifying what may be misconfigured, broken, or missing. Be disciplined about adding new tests. Regularly running tests saves time in the long run. Make sure that all team members learn to use these tools, as well as spend time learning debugging techniques (see Debugging with Ansible and Vagrant).

Testing

To avoid the issues described in Section Testing, follow-on projects are strongly advised to use these same MIL-STD-498 documents (leveraging the Sphinx version of the templates used by the DIMS Project, listed in Section Software Products and Documentation) and the simpler BATS mechanism to write tests to produce machine-parsable output.

We found that when BATS tests were added to Ansible playbooks, and executed using the test.runner script after provisioning Vagrant virtual machines, it was very easy to identify bugs and problems in provisioning scripts. Friction in the development process was significantly reduced as a result. This same mechanism can be extended to support the system-wide test and reporting process. (See Section Testing and Test Automation).

License

Berkeley Three Clause License
=============================

Copyright (c) 2014-2017 University of Washington. All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors
may be used to endorse or promote products derived from this software without
specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Appendices

STIX Development Libraries

Subject: Re: [cti-users] Re: Announcing python-stix2

Sender: <cti-users@lists.oasis-open.org>
Delivered-To: mailing list cti-users@lists.oasis-open.org
Received: from lists.oasis-open.org (oasis-open.org [66.179.20.138])
    by lists.oasis-open.org (Postfix) with ESMTP id B1A965818580;
    Wed, 15 Mar 2017 09:48:17 -0700 (PDT)
Message-ID: <58C96F55.4030104@apl.washington.edu>
Subject: Re: [cti-users] Re: Announcing python-stix2
Date: Wed, 15 Mar 2017 09:44:05 -0700
From: Stuart Maclean <stuart@apl.washington.edu>
To: "Back, Greg" <gback@mitre.org>, Eyal Paz <eyalp@checkpoint.com>,
        "cti@lists.oasis-open.org" <cti@lists.oasis-open.org>,
        "cti-users@lists.oasis-open.org" <cti-users@lists.oasis-open.org>
References: <AAD76FB4-D5FE-41FA-9BDD-1EE6BBE4062D@mitre.org>
        <2956157C13955D458A922DDE028FEE7E0209A8F61D@IL-EX10.ad.checkpoint.com>
        <D3938F78-9B27-48B2-87BA-E57803DEBA81@mitre.org>
In-Reply-To: <D3938F78-9B27-48B2-87BA-E57803DEBA81@mitre.org>

On 03/15/2017 09:22 AM, Back, Greg wrote:
>
> Hi Eyal,
>
> MITRE isn't working on anything, and I'm not personally aware of
> anyone else who is either. But if someone is, hopefully they're on one
> of these lists and will respond.
>
> Greg

Greg, all,

I have followed the STIX development over the past couple of years, and
even developed a Java library for STIX manipulation, see
https://github.com/uw-dims/stix-java. That was when STIX 1.1 was current.

I watched with some dismay as the XML schema way of doing things in STIX
1.1 was dismantled in favor of JSON for STIX 2.  While XML schemas are
complicated, they are very precise, and a document can be checked
against a schema to assert its validity.  Further, the richness of tools
for XML schemas, notably the JAXB/xjc tools for Java, gives developers a
huge leg-up in implementing an API for STIX manipulation.  I just
cranked the JAXB handle over the .xsd file set, and hey presto I had a
set of Java classes with which to start my API.

Going out on a limb here, I also think that Java developers LIKE more
discipline that Python developers.  Static typing vs duck typing?  The
XML schema way of representing information is disciplined, and leads to
fewer 'You meant X? I thought you meant Y' gotchas at runtime.
Extrapolating, I think the JSON way of data representation is less
disciplined than the XML way, hence the natural inclination of Python
developers to prefer JSON.

Looking at the STIX 2 specs, I admire effort the authors must have put
in.  But from a library builder's point of view, there being no
machine-ingestable docs (ie the xsd files in STIX 1.1), I am back to
square one (if I have missed any machine-readable docs, I apologize).

So, in a nutshell, my own feeling is that there will be no Java-language
STIX 2 manipulation tools, a sad fact, and I sincerely hope I am wrong
in this respect.  I do know that I won't add to my STIX 1.1 Java effort.

Stuart

----

Section author: David Dittrich dittrich@u.washington.edu

Copyright © 2017 University of Washington. All rights reserved.