Failify

Introduction

Failify is a test framework for end-to-end testing of distributed systems. It can be used to deterministically inject failures during a normal test case execution. Currently, node failure, network partition, network delay, network packet loss, and clock drift are supported. For a few supported languages, it is possible to enforce a specific order between nodes in order to reproduce a specific time-sensitive scenario, and to inject failures before or after a specific method is called when a specific stack trace is present.

Prerequisites

To use Failify, you need to install the following dependencies on your machine:

  • Java 8+
  • Docker 1.13+ (Make sure the user running your test cases is able to run Docker commands. For example, in Linux, you need to add the user to the docker group)
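The prerequisites above can be checked, and the Docker group membership set up, roughly as follows on a typical Linux machine. This is a sketch: package names and group management commands vary by distribution, and the group change only takes effect after logging out and back in.

```shell
java -version                      # should report version 1.8 or newer
docker --version                   # should report version 1.13 or newer
sudo usermod -aG docker "$USER"    # let the current user run Docker commands
```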

It is also recommended to use a build system like Maven or Gradle to be able to include Failify’s dependency.

Quick Start

Failify is a Java-based end-to-end testing framework, so you will need to write your test cases in Java or in a language that can use Java libraries, i.e. one that runs on the JVM, such as Scala. Failify can be used alongside the popular testing frameworks of your programming language of choice, e.g. JUnit in Java. Here, we use Java and JUnit, with Maven as the build system.

Adding dependencies

First, create a simple Maven application and add Failify’s dependency to your pom file.

<dependency>
    <groupId>io.failify</groupId>
    <artifactId>failify</artifactId>
    <version>0.2.1</version>
</dependency>

Also, add the failsafe plugin to your pom file to be able to run integration tests.

<project>
  [...]
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-failsafe-plugin</artifactId>
        <version>3.0.0-M3</version>
        <executions>
          <execution>
            <goals>
              <goal>integration-test</goal>
              <goal>verify</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
  [...]
</project>

Creating a Dockerfile

Next, you need to create a Dockerfile for your application which adds any dependency your application may need. In case you want to use the network partition capability of Failify, you need to install the iptables package as well. Network delay and loss also need the iproute package to be installed. Here, we assume the application under test is written in Java, so we create a Dockerfile at docker/Dockerfile with the following content:

FROM java:8-jre
RUN apt update && apt install -y iptables iproute

Important

In case you are using Docker Toolbox (and consequently boot2docker) on Windows or Mac, be aware that your currently installed boot2docker image may be missing the sched_netem kernel module, which is included in most Linux distributions and is needed for the tc command in the iproute package to work. So, unless you upgrade your boot2docker image (normally through running docker-machine upgrade [machine_name]), you won’t be able to use the network operation capabilities of Failify.

Adding a Test Case

Now, create a JUnit integration test case (with a name ending in IT so failsafe picks it up) in the project’s test directory. Here, we assume the final distribution of the project is a zip file in Maven’s target directory. We also assume the zip file contains a project-[PROJECT_VERSION] directory, which itself contains a bin directory with a start.sh file to start the application.

 1  public class SampleTestIT {
 2      protected static FailifyRunner runner;
 3
 4      @BeforeClass
 5      public static void before() throws RuntimeEngineException {
 6          String projectVersion = "0.2.1";
 7          Deployment deployment = Deployment.builder("sampleTest")
 8              // Service Definition
 9              .withService("service1")
10                  .applicationPath("target/project.zip", "/project", PathAttr.COMPRESSED)
11                  .startCommand("/project/project-" + projectVersion +
12                       "/bin/start.sh -conf /config.cfg")
13                  .dockerImage("project/sampleTest:" + projectVersion)
14                  .dockerFileAddress("docker/Dockerfile", false)
15                  .tcpPort(8765)
16                  .serviceType(ServiceType.JAVA).and()
17              // Node Definitions
18              .withNode("n1", "service1")
19                  .applicationPath("config/n1.cfg", "/config.cfg").and()
20              .withNode("n2", "service1")
21                  .applicationPath("config/n2.cfg", "/config.cfg").and()
22              .withNode("n3", "service1")
23                  .applicationPath("config/n3.cfg", "/config.cfg").and()
24              .build();
25
26          runner = FailifyRunner.run(deployment);
27      }
28
29      @AfterClass
30      public static void after() {
31          if (runner != null) {
32              runner.stop();
33          }
34      }
35
36      @Test public void test1() throws RuntimeEngineException {
37          ProjectClient client = ProjectClient.from(runner.runtime().ip("n1"),
38              runner.runtime().portMapping("n1", 8765, PortType.TCP));
39          ..
40          runner.runtime().clockDrift("n1", 100);
41          ..
42          runner.runtime().networkPartition(NetPart.partitions("n1", "n2", "n3")
43              .connect(1,3));
44          ..
45          runner.runtime().networkOperation("n2", NetOp.delay(100).jitter(10),
46               NetOp.loss(10));
47          ..
48      }
49  }

Each Failify test case should start by defining a new Deployment object. A deployment definition consists of a set of service and node definitions. A service is a node template and defines the Docker image, the start bash command, required environment variables, common paths, etc. for a specific type of node. For additional info about the available options for a service, check ServiceBuilder’s JavaDoc.

Lines 9-16 define the service1 service. Line 10 adds the zip file to the service at the /project path and marks it as compressed so Failify decompresses it before adding it to the node (in Windows and Mac, make sure the local path you are using here is shared with the Docker VM). Line 11 defines the start command for the node; in this case, it runs the start.sh bash file, feeding it the -conf /config.cfg argument. This config file will be provided separately through the node definitions later. Line 15 marks TCP port 8765 to be exposed for the service. This is especially important when using Failify in Windows and Mac, as the only way to connect to the Docker containers on those platforms is through port forwarding. Line 16 concludes the service definition by marking it as a Java application. If the programming language in use is listed in the ServiceType enum, make sure to mark your application with the right ServiceType.

Important

If your program runs on JVM and your programming language in use is not listed in the ServiceType enum, just choose ServiceType.Java as the service type.

Lines 18-23 define three nodes named n1, n2 and n3 from the service1 service, adding a separate local config file to each of them, all located at the same target path /config.cfg. Most of the service configuration can be overridden by nodes. For more information about the available options for a node, check Node Builder’s JavaDoc.

Line 26 starts the defined deployment and line 32 stops the deployment after all tests are executed.

Lines 37-38 show how the runner object can be used to get the IP address and port mappings for each node, to be potentially used by a client. Line 40 shows a simple example of how Failify can manipulate the deployed environment with just a method call; in this case, a clock drift of 100ms will be applied to node n1. Line 42 shows how a network partition can be defined and imposed. Here, each of the nodes will be in a separate partition, and the first (n1) and third (n3) partitions will be connected together. Line 45 shows an example of imposing network delay and loss on all the interfaces of a specific node. Here, a network delay from a uniform distribution with mean=100ms and variance=10ms will be applied to n2, and 10% of the packets will be lost. For more information about the available runtime manipulation operations, check LimitedRuntimeEngine’s JavaDoc.

Logger Configuration

Failify uses SLF4J for logging. As such, you can configure your logging tool of choice. A sample configuration with Logback can be like this:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <appender name="Console" class="ch.qos.logback.core.ConsoleAppender">
        <layout class="ch.qos.logback.classic.PatternLayout">
            <Pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</Pattern>
        </layout>
    </appender>

    <logger name="io.failify" level="DEBUG"/>

    <root level="ERROR">
        <appender-ref ref="Console" />
    </root>
</configuration>
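For this configuration to be picked up, a Logback binding must be on the test classpath. A typical Maven dependency might look like the following (the version shown is just an example):

```xml
<dependency>
    <groupId>ch.qos.logback</groupId>
    <artifactId>logback-classic</artifactId>
    <version>1.2.3</version>
    <scope>test</scope>
</dependency>
```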

Running the Test Case

Finally, to run the test cases, run the following bash command:

$  mvn clean verify

Deterministic Failure Injection

Although injecting a failure by calling a method in the middle of a test case is suitable for many scenarios, there exist scenarios where failures need to be injected at a very specific moment. With Failify, for a few supported languages, it is possible to inject a failure right before or after a method call when a specific stack trace is present. This happens by defining a set of named internal and test case events, ordering those events in a run sequence string, and letting Failify’s runtime engine enforce the specified order between the nodes.

Internal Events

Internal events are the ones that happen inside a node. Realizing internal events requires binary instrumentation and, as such, is only supported for a few programming languages. You can find more information on the Run Sequence Instrumentation Engine page. The available internal events are:

  • Scheduling Event: This event can be of type BLOCKING or UNBLOCKING and can happen before or after a specific stack trace. The stack trace should come from a stack trace event definition. When defining this kind of event, the definition should be a pair of blocking and unblocking events; basically, make sure to finally unblock everything that has been blocked. This event is useful when it is needed to block all the threads with a specific stack trace, do some other work or let the other threads make progress, and then unblock the blocked threads.
.withNode("n1", "service1")
    .withSchedulingEvent("bast1")
        .after("st1") // The name of the stack trace event. An example comes later
        .operation(SchedulingOperation.BLOCK)
    .and()
    .withSchedulingEvent("ubast1")
        .after("st1")
        .operation(SchedulingOperation.UNBLOCK)
    .and()
    // The same events using shortcut methods
    .blockAfter("bast1", "st1")
    .unblockAfter("ubast1", "st1")
.and()
  • Stack Trace Event: This event is similar to a scheduling event, except that nothing happens between blocking and unblocking. All the threads with the defined stack trace will be blocked until the dependencies of the event are satisfied (based on the defined run sequence). The blocking can happen before or after a method. This event can act as an indicator that the program has reached a specific method with a specific stack trace. To specify the stack traces, the default is a list of method signatures of the form [package].[class].[method], where the last called method comes at the end. As some languages may not have the concept of class or package, you may want to check the Run Sequence Instrumentation Engine page as well for additional instructions for specific languages.

    It is important to note that the method signatures are not required to be present at exactly the given indices in the current stack trace; only the right order of appearance is sufficient.

.withNode("n1", "service1")
    .withStackTraceEvent("st1")
        .trace("io.failify.Hello.worldCaller")
        .trace("io.failify.Hello.world")
        .blockAfter().and()
    // The same event using a shortcut method
    .stackTrace("st1", "io.failify.Hello.worldCaller,io.failify.Hello.world", true)
.and()
  • Garbage Collection Event: This event is for invoking the garbage collector for supported languages e.g. Java.
.withNode("n1", "service1")
    .withGarbageCollectionEvent("gc1").and()
.and()
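The ordered, non-contiguous matching used for stack trace events can be illustrated with a small, self-contained sketch. Note this is not Failify’s actual matcher; the StackTraceMatch class, its method, and the frame names below are hypothetical and for illustration only.

```java
import java.util.Arrays;
import java.util.List;

// Illustration (not Failify's actual matcher): the wanted method signatures
// must appear in the stack trace in the given order, but not necessarily at
// exact indices or contiguously.
public class StackTraceMatch {
    static boolean matchesInOrder(List<String> stackTrace, List<String> wanted) {
        int i = 0;
        for (String frame : stackTrace) {
            if (i < wanted.size() && frame.equals(wanted.get(i))) {
                i++; // found the next wanted signature; look for the one after it
            }
        }
        return i == wanted.size();
    }

    public static void main(String[] args) {
        List<String> trace = Arrays.asList(
            "java.lang.Thread.run",
            "io.failify.Hello.worldCaller",
            "io.failify.Hello.helper",    // an extra frame in between is fine
            "io.failify.Hello.world");
        System.out.println(matchesInOrder(trace, Arrays.asList(
            "io.failify.Hello.worldCaller", "io.failify.Hello.world"))); // true
        System.out.println(matchesInOrder(trace, Arrays.asList(
            "io.failify.Hello.world", "io.failify.Hello.worldCaller"))); // false: wrong order
    }
}
```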

Test Case Events

Test case events are the connection point between the test case and Failify’s runtime engine. Internal events’ order is enforced by the runtime engine, but it is the test case’s responsibility to enforce the test case events if they are included in the run sequence.

new Deployment.Builder("sample")
    .testCaseEvents("tc1","tc2")

The Run Sequence

Finally, after defining all the necessary events, you should tie them together in the run sequence, using event names as the operands, * and | as operators, and parentheses for grouping. * and | indicate sequential and parallel execution, respectively.

new Deployment.Builder("sample")
    .runSequence("bast1 * tc1 * ubast1 * (gc1 | x1)")

This run sequence blocks all the threads in node n1 with the stack trace of event st1 (bast1), waits for the test case to enforce tc1, unblocks the blocked threads in node n1 (ubast1), and finally, in parallel, performs a garbage collection in n1 (gc1) and kills node n2 (x1).

At any point, a test can use the FailifyRunner object to enforce the order of a test case event. Enforcing a test case event in the test case is only needed if something must be done when the event’s dependencies are satisfied, e.g. injecting a failure.

runner.runtime().enforceOrder("tc1", 10, () -> runner.runtime().clockDrift("n1", -100));

Here, when the dependencies of event tc1 are satisfied, a clock drift of -100ms will be applied to node n1, and the tc1 event will be marked as satisfied. If the dependencies of tc1 are not satisfied after 10 seconds, a TimeoutException will be thrown. If the only thing the test case needs is to wait for an event or its dependencies to be satisfied, the waitFor method can be used.

runner.runtime().waitFor("st1", 10);

Here again, if the event dependencies are not satisfied within 10 seconds, a TimeoutException will be thrown.
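The timeout semantics of waitFor can be illustrated with a small latch-based sketch. This is a conceptual analogy, not Failify’s implementation; the EventWait class and its methods are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Conceptual sketch: waiting on a named event with a timeout, in the spirit of
// runner.runtime().waitFor("st1", 10). Not Failify's actual implementation.
public class EventWait {
    private final ConcurrentHashMap<String, CountDownLatch> events = new ConcurrentHashMap<>();

    private CountDownLatch latch(String name) {
        return events.computeIfAbsent(name, n -> new CountDownLatch(1));
    }

    // Called when the event's dependencies are satisfied.
    public void satisfy(String name) { latch(name).countDown(); }

    // Blocks until the event is satisfied, or throws after the timeout.
    public void waitFor(String name, int timeoutSecs)
            throws TimeoutException, InterruptedException {
        if (!latch(name).await(timeoutSecs, TimeUnit.SECONDS)) {
            throw new TimeoutException("event " + name + " not satisfied in time");
        }
    }

    public static void main(String[] args) throws Exception {
        EventWait w = new EventWait();
        w.satisfy("st1");
        w.waitFor("st1", 1); // already satisfied, returns immediately
        try {
            w.waitFor("neverHappens", 1); // no one satisfies this event
        } catch (TimeoutException e) {
            System.out.println("timed out as expected");
        }
    }
}
```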

Run Sequence Instrumentation Engine

Failify’s deterministic failure injection requires binary instrumentation. Different programming languages require different instrumentors, and thus, if you are going to use this feature, you need to specify the programming language for the involved services.

.withService("service1")
    .serviceType(ServiceType.JAVA)

Next, for each service, you may need to mark some paths as library or instrumentable paths. Check specific language instructions as this may differ based on the programming language in use.

Java

You need to choose ServiceType.JAVA as your service type. AspectJ is used for Java instrumentation; AspectJ 1.8+ should work perfectly with Failify. You need to install AspectJ on your machine and expose an ASPECTJ_HOME environment variable pointing to AspectJ’s home directory. Also, you need to add the AspectJ and Failify runtime dependencies to your project. Example dependencies to add to your pom file with AspectJ 1.8.12 are as follows:

<dependency>
    <groupId>io.failify</groupId>
    <artifactId>failifyrt</artifactId>
    <version>0.2.1</version>
</dependency>
<dependency>
    <groupId>org.aspectj</groupId>
    <artifactId>aspectjrt</artifactId>
    <version>1.8.12</version>
</dependency>

Finally, you need to mark:

  • all the jar files or class file directories required to run your application as library paths.
  • all the jar files or class file directories which contain a method included as the last method in one of the stack trace events as instrumentable paths.
.withService("service1")
    .applicationPath("./projectFiles", "/project")
    // It is possible to use wildcard paths for marking library paths
    .libraryPath("/project/libs/*.jar") // This is a target path in the node.
    .applicationPath("target/classes", "/project/libs/classes", PathAttr.LIBRARY)
    .applicationPath("./extraLib.jar", "/project/libs/extraLib.jar", PathAttr.LIBRARY)
    .instrumentablePath("/project/libs/main.jar") // This is a target path in the node
    .instrumentablePath("/project/libs/classes")
.and()

Scala

You need to choose ServiceType.SCALA as your service type. The requirements for Scala are the same as for Java, as AspectJ is again used for the instrumentation. There is only a subtle point when specifying stack traces with Scala: when it is intended to instrument a Scala object, you need to add a trailing $ to the name of the object. This is because, internally, when such code compiles to JVM bytecode, a new class with a trailing $ is created and the original class proxies calls to that class. However, if the internal methods of your Scala object call each other, the proxy class will be bypassed. As such, to be on the safe side, it is advisable to use a trailing $ when referring to a Scala object in your stack traces. Here is an example:

object Object1 {
    def method1(): Unit = {
        ..
    }
}

..

withNode("n1", "service1")
    .stackTrace("e1", "Object1$.method1")

As you can see, when defining the stack trace event e1, a $ is present after the name of the Object1 object.

Adding New Nodes Dynamically

It is possible to add new nodes dynamically after a defined deployment is started. New nodes can only be created out of pre-defined services, and they can’t include any internal events. In the following code, the service1 service is first created similar to the one in Quick Start. Then, at line 31, a new node named n2 is created out of the service1 service. The Node.limitedBuilder method returns an instance of Node.LimitedBuilder, which can be further customized by chaining the proper method calls. This builder doesn’t allow the definition of internal events for the node; however, all the other node configurations are available.

 1  public class SampleTestIT {
 2      protected static FailifyRunner runner;
 3
 4      @BeforeClass
 5      public static void before() throws RuntimeEngineException {
 6          String projectVersion = "0.2.1";
 7          Deployment deployment = Deployment.builder("sampleTest")
 8              // Service Definition
 9              .withService("service1")
10                  .applicationPath("target/project.zip", "/project", PathAttr.COMPRESSED)
11                  .startCommand("/project/bin/start.sh")
12                  .dockerImage("project/sampleTest:" + projectVersion)
13                  .dockerFileAddress("docker/Dockerfile", false)
14                  .serviceType(ServiceType.JAVA).and()
15              // Node Definitions
16              .withNode("n1", "service1").and()
17              .build();
18
19          runner = FailifyRunner.run(deployment);
20      }
21
22      @AfterClass
23      public static void after() {
24          if (runner != null) {
25              runner.stop();
26          }
27      }
28
29      @Test public void test1() throws RuntimeEngineException {
30          ..
31          runner.addNode(Node.limitedBuilder("n2", "service1"));
32          ..
33      }
34  }

The current limitation of this capability is that if there is a network partition applied to the current deployment, the new node won’t be included in that network partition. Newly introduced network partitions will include the new node when generating blocking rules for iptables. This limitation will be removed in future releases.

Creating a Service From JVM Classpath

In case your application uses a JVM-based programming language and is able to include Java libraries, you can use the current JVM classpath to create a service.

1  new Deployment.Builder("sample")
2      .withServiceFromJvmClasspath("s1", "target/classes", "**commons-io*.jar")
3          .startCommand("java -cp ${FAILIFY_JVM_CLASSPATH} my.package.Main")
4      .and()

This method creates a new service by adding all the paths included in the JVM classpath as library paths to your service. Also, any relative, absolute or wildcard path that comes after the service name will, if present in the classpath, be added to the service as an instrumentable path. This method returns a ServiceBuilder object, so all the regular service configurations are available.

As can be seen in line 3, the new classpath, based on the new target paths, is provided in the FAILIFY_JVM_CLASSPATH environment variable and can be used in the service’s start command.
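As a rough illustration of where such a classpath comes from, the current JVM’s classpath entries can be enumerated from the java.class.path system property, which is presumably what withServiceFromJvmClasspath inspects under the hood. The ClasspathDump helper below is hypothetical and only a sketch.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

// Illustration only: enumerate the entries of the current JVM classpath,
// the raw material a classpath-based service definition would work from.
public class ClasspathDump {
    public static List<String> classpathEntries() {
        String cp = System.getProperty("java.class.path");
        // Entries are separated by ':' on Unix and ';' on Windows.
        return Arrays.asList(cp.split(File.pathSeparator));
    }

    public static void main(String[] args) {
        for (String entry : classpathEntries()) {
            System.out.println(entry);
        }
    }
}
```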

Running Failify in Docker

Why might it be needed?

There are a few reasons you may want to run Failify test cases in a Docker container:

  • Your CI nodes are Docker containers and you don’t have any other options
  • You are developing in a non-Linux operating system (e.g. macOS or Windows) and the final binary is native to your build environment. As such, you are not able to run the built artifact in a Docker container, which is Linux-based. This requires doing the whole build for testing inside a container.
  • Your client needs to access the nodes using their hostname or on any port number (without exposing them). Either of these cases requires the client to be in the same network namespace as the nodes and that is only possible if you run Failify in a Docker container.

How to do this?

  1. Create a Docker image for running your test cases. That image should at least include Java 8+. You may want to install a build system like Maven as well. Also, install any other packages or libraries which are needed for your test cases to run and are already installed on your machine. In case you need instrumentation for your test cases, install the required packages for your specific instrumentor as well.
FROM maven:3.6.0-jdk-8
# Copy a local AspectJ distribution into the image (source path is an example)
ADD aspectj /path/to/aspectj
ENV ASPECTJ_HOME="/path/to/aspectj"

2. Change the current directory to your project’s root directory, then start a container from the created image with the following docker run arguments:

  • Share your project’s root directory with the container (-v $(pwd):/path/to/my/project)
  • Make the project’s root directory mapped path the working directory in the container (-w /path/to/my/project)
  • Share the docker socket with the container (-v /var/run/docker.sock:/var/run/docker.sock)

Your final command to start the container should be something like this:

$  docker run --rm -v /var/run/docker.sock:/var/run/docker.sock -v $(pwd):/path/to/my/project \
     -w /path/to/my/project myImage:1.0 mvn verify

Changelog

v 0.2.1 (04/17/2019)

  • Fixes NPE when service type is not specified
  • Changes the return value of the runtime’s ip method when using Docker Toolbox from localhost to VM’s ip address

v 0.2.0 (04/02/2019)

  • Ability to impose network delay and loss in a node
  • Scala support for Run Sequence Instrumentation Engine
  • The runner is now thread-safe and multiple test cases can run in parallel and in different threads
  • Stack matcher now matches against the right order of traces in a stack trace instead of the exact given indices
  • New nodes can be added based on a pre-defined service and using a limited node builder after a deployment is started