Monday, May 9, 2011

Testing in the Clouds

The Google Testing Blog is a good read and highly recommended for tester-wannabes like me. :) It offers great insights on the evolution of software testing. A recent post made me reflect on the testing of Java code of my main work project (the Voyeur Tools cloud-based text analytics platform).


In a recent gridblade.net blog post (Virtualizing TestNG unit tests storage), I suggested virtualizing the storage of Voyeur Tools as a prerequisite to virtualizing the storage used by its unit tests. In this blog post, I'll share a few more thoughts on that topic.


Voyeur Tools is at its core a suite of text analytics tools that can be applied to a corpus (a bunch of documents). These tools can be classified into four categories:

  • corpus indexing tools
  • corpus analytics tools
  • corpus reporting tools
  • corpus update tools

Corpus update tools notwithstanding, Voyeur Tools therefore does two things: (1) store and index incoming documents and (2) provide derived data about the indexed documents through a visually-rich web interface.

The visually-rich web interface is independent from the analytics back-end, save for a basic API consisting of a list of key-value parameters as input data and a JSON data structure as output data. The web interface is tested by Stéfan Sinclair, Voyeur Tools project lead.

Testing Voyeur Tools, from my perspective, thus consists of testing a large piece of software that interacts with its environment through a set of clearly identified I/O channels:

  • it receives input data as a list of key-value parameters and outputs text serialized into JSON
  • it downloads HTTP-addressable data from the web and also reads files from local or distributed filesystems
  • it writes and reads index data through a Java API (about a dozen interfaces with read/write methods for specific data types), which makes it straightforward to virtualize its storage
  • it also writes and reads files to and from /tmp (which will also be soon virtualized and put behind an interface so that an in-memory temporary storage space can be easily configured)

In order to allow their virtualization, the unit tests of Voyeur Tools involving outside I/O channels fetch files from JAR resources and download web data from a test web server running on localhost. Both of these concerns (reading local data files, downloading web data) could also be put behind an interface and thus virtualized. The default implementation could rely on the classic java.io and java.net APIs and the virtual implementation could rely on JAR resources and a locally-hosted test web server.
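As a rough sketch of that idea (not the actual Voyeur Tools code), both concerns could sit behind one small interface. The names DataSource, UrlDataSource and ClasspathDataSource are hypothetical and introduced here for illustration only; the sketch covers the JAR-resources variant of the virtual implementation, not the locally-hosted test web server.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

// Hypothetical interface: the application reads outside data only through it.
interface DataSource {
    InputStream open(String location) throws IOException;
}

// Default implementation: resolves the location as a URL (http:, file:, ...)
// using the classic java.net API.
final class UrlDataSource implements DataSource {
    @Override
    public InputStream open(String location) throws IOException {
        return new URL(location).openStream();
    }
}

// Virtual implementation for unit tests: serves data bundled as JAR resources.
final class ClasspathDataSource implements DataSource {
    @Override
    public InputStream open(String location) throws IOException {
        InputStream in = ClasspathDataSource.class.getClassLoader()
                .getResourceAsStream(location);
        if (in == null) {
            throw new FileNotFoundException("no such resource: " + location);
        }
        return in;
    }
}
```

Unit tests would then be handed a ClasspathDataSource, while production code keeps using a UrlDataSource, without either side knowing which implementation is behind the interface.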

Coming back to the Google Testing Blog post that started my reflections and led to this blog post, the next step is to investigate how Voyeur Tools could be fully tested in the cloud.

There are many benefits to running unit tests in the cloud, among them scaling out (and thus speeding up) unit tests even more than multithreading allows, as well as facilitating the use of a continuous integration server.

I'll restrict this discussion to Java-based PaaS clouds (e.g. Google App Engine or Appscale) in order to control the whole testing process from within the JVM. IaaS clouds (e.g. Amazon Web Services) are too broad in scope here, as they would require extra work to configure the testing process from outside of the JVM.

App Engine (Java flavor) offers a servlet container, but with a more restricted API than what is available from a typical JVM, as well as a completely different storage model and a completely different multithreading model. Assuming that unit tests are run using in-memory storage, and assuming that the outside I/O channels of Voyeur Tools are also virtualized in the near future, the only feature that I foresee requiring extra work is multithreading. Indeed, App Engine servlets can't launch new threads but must instead rely on the Task Queue API.
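One way to prepare for this, sketched below under my own assumptions, is to hide task execution behind an interface, just like storage. TaskRunner and ThreadPoolTaskRunner are hypothetical names; only the standard-JVM implementation (java.util.concurrent) is shown, since an App Engine implementation would depend on the App Engine SDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical interface: application code submits work without knowing
// whether it ends up on JVM threads or on a cloud task queue.
interface TaskRunner {
    // Runs all tasks and returns once every one of them has completed.
    void runAll(List<Runnable> tasks);
}

// Standard-JVM implementation backed by a fixed-size thread pool.
final class ThreadPoolTaskRunner implements TaskRunner {

    private final int threads;

    ThreadPoolTaskRunner(int threads) {
        this.threads = threads;
    }

    @Override
    public void runAll(List<Runnable> tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Object>> calls = new ArrayList<>();
            for (Runnable task : tasks) {
                calls.add(Executors.callable(task));
            }
            // invokeAll blocks until all tasks are done; get() surfaces failures.
            for (Future<Object> result : pool.invokeAll(calls)) {
                result.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A second implementation of TaskRunner would then be the only place where the Task Queue API has to appear.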

To wrap up, I'll just add that I'm very enthusiastic about investigating in the next few months how Voyeur Tools could be fully tested, from the bottom up, upside down, in the cloud. It opens up the way to architectural improvements as well as day-to-day productivity benefits, and to learning a lot more about software testing. :)


UPDATE (WEDNESDAY, MAY 11, 2011)
With the release of App Engine 1.5.0 yesterday, coincidentally one day after the publication of this blog post, the Pull Queues API brings the multithreaded task processing model of App Engine one step closer to the standard JVM API (java.util.concurrent). This is great news, as it definitely facilitates the virtualization of multithreaded task processing in Voyeur Tools.

Saturday, February 26, 2011

Virtualizing TestNG unit tests storage

Running unit tests typically becomes time-consuming and resource-intensive as Java applications grow in scale and complexity.

A first step is to use a unit testing framework, such as TestNG, that leverages multicore processing by running independent unit tests in parallel. Taking advantage of modern, powerful CPUs, this results in a decreased overall completion time of running a test suite, as described in a previous gridblade.net blog post (see also: configuring TestNG in the Maven Surefire plugin).

A second step is to virtualize the Internets, also as described in a previous gridblade.net blog post. This results in removing dependencies on Internet connectivity, on web connectivity as well as on DNS resolution.

However, there remains one major source of contention: disk access. Reading and writing large numbers of files is both time-consuming and resource-intensive.

Software engineering best practices suggest high cohesiveness and loose coupling at all levels of the software architecture. If the software you're working on is already structured so that all storage I/O is handled in a dedicated layer (e.g. by reading/writing through an interface rather than through implementations of storage I/O operations), there's a further step that can be taken towards improving the speed and resource consumption of unit tests: virtualizing unit tests storage.

A third step is thus to virtualize the storage of the application, so that unit tests can read and write data directly from/to memory, rather than from/to a hard disk. Once the application storage is virtualized (and this can take some time...), all that remains is to configure the build environment to automate the use of in-memory storage by the unit tests.

For the sake of example, let's assume that the storage layer is configured through an enumeration value passed to the storage driver. This value can be passed to a TestNG unit test using the @Parameters annotation.

For instance, a unit test can be configured as follows:
@Parameters("storageLabel")
@BeforeClass
public void setUp(String storageLabel) throws Exception {
  StorageType storageType = StorageType.valueOf(storageLabel);
  StorageDriver driver = new StorageDriver(storageType);
  // ...
}
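
The StorageType and StorageDriver types used above are specific to the application. As a minimal sketch of what could sit behind them (the Storage interface, both implementations and all method names are assumptions made for illustration, not the actual code), consider:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

enum StorageType { HARDDISK, INMEMORY }

// Hypothetical storage interface: the application only ever talks to this.
interface Storage {
    void write(String key, byte[] data);
    byte[] read(String key); // returns null when the key is unknown
}

// Keeps everything in a map: no disk access at all.
final class InMemoryStorage implements Storage {
    private final Map<String, byte[]> entries = new ConcurrentHashMap<>();

    public void write(String key, byte[] data) {
        entries.put(key, data.clone());
    }

    public byte[] read(String key) {
        byte[] data = entries.get(key);
        return data == null ? null : data.clone();
    }
}

// Stores each entry as a file under a root directory.
final class HardDiskStorage implements Storage {
    private final Path root;

    HardDiskStorage(Path root) {
        this.root = root;
    }

    public void write(String key, byte[] data) {
        try {
            Files.write(root.resolve(key), data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public byte[] read(String key) {
        Path file = root.resolve(key);
        try {
            return Files.exists(file) ? Files.readAllBytes(file) : null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Selects the storage implementation from the enumeration value.
final class StorageDriver {
    private final Storage storage;

    StorageDriver(StorageType type) {
        this.storage = (type == StorageType.INMEMORY)
                ? new InMemoryStorage()
                : newDiskStorage();
    }

    private static Storage newDiskStorage() {
        try {
            // Assumption: a fresh temporary directory serves as the disk root.
            return new HardDiskStorage(Files.createTempDirectory("storage-"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    Storage storage() {
        return storage;
    }
}
```
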

And Maven's pom.xml is configured like this:
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <version>2.5</version>
  <configuration>
    <systemPropertyVariables>
      <storageLabel>INMEMORY</storageLabel>
      <!-- licit values: HARDDISK, INMEMORY -->
    </systemPropertyVariables>
  </configuration>
</plugin>

To run one unit test from the comfort of Eclipse, the parameter can be passed as a JVM system property:
-DstorageLabel=INMEMORY

Bottom line: all unit tests write and read data (with the possible exception of .jar resources) to and from memory, reducing the overall completion time of running the test suite. In my case, this saved about 50% on my dev laptop that has a slow disk.

Enjoy :-)


I feel the need...
    - Maverick

...the need for speed! 
    - Maverick, Goose


Tuesday, December 14, 2010

Programmatic access to Hadoop embedded web server

This gridblade.net blog post describes a little thought experiment on making Hadoop nodes web-accessible in a RESTful way.

Hadoop web servers are currently intended for human use only

When recently rereading the excellent Apache Hadoop: Best Practices and Anti-Patterns blog post (http://developer.yahoo.com/blogs/hadoop/posts/2010/08/apache_hadoop_best_practices_a/), the following paragraph made me think:


<<Implementing automated processes to screen-scrape the web-ui is strictly prohibited. Some parts of the web-ui, such as browsing of job-history, are very resource-intensive on the JobTracker and could lead to severe performance problems when they are screen-scraped.>> Moreover, as of 0.21.2 (I have yet to check the current and trunk versions), the pages of Hadoop's web interface are yet to be (X)HTML-compliant (which would make screen-scraping easier).


Why is that?

I would guess that refreshing these stats requires the Hadoop daemon to enter into its main loop, enter a synchronized code chunk (effectively constituting a sync barrier for other threads), get the information, and only then let the loop proceed, thus blocking other more important operations. Given the high level of multithreading in Hadoop daemons, such data gathering can indeed become very costly. In practice, I've also observed this pattern numerous times in a similar context, when implementing the CanoPeer P2P Grid middleware (http://www.canopeer.org).

Towards web-accessible, RESTful Hadoop nodes

Given the numerous benefits (a loosely-coupled architecture, among others) that can be derived from automatically extracting middleware stats in a RESTful way, I can't help wondering whether it would be possible to extract stats in an eventually-consistent fashion, instead of blocking the operations of the Hadoop node.

Indeed, if one does not require (near-)real-time freshness when extracting these stats, I wonder what would prevent collecting them through an asynchronous message queue, and subsequently exposing them through a RESTful web interface running in a separate thread?
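
To make the thought experiment a bit more concrete, here is a small sketch in plain Java. It is not Hadoop code: StatsCollector and its methods are hypothetical, and the web-facing part is left out. Worker threads only enqueue updates; a separate thread folds them into an immutable snapshot that a web handler could read at any time without taking a lock shared with the workers.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class StatsCollector implements AutoCloseable {

    // Queue of {name, value} updates published by worker threads.
    private final BlockingQueue<String[]> updates = new LinkedBlockingQueue<>();

    // Latest immutable view of the stats; replaced as a whole by the consumer.
    private volatile Map<String, String> snapshot = Collections.emptyMap();

    private final Thread consumer;

    StatsCollector() {
        this.consumer = new Thread(this::drain, "stats-collector");
        this.consumer.setDaemon(true);
        this.consumer.start();
    }

    // Called by worker threads: enqueues and returns immediately.
    void publish(String name, String value) {
        updates.offer(new String[] { name, value });
    }

    // Called by the web-facing thread: possibly stale, never blocking.
    Map<String, String> latest() {
        return snapshot;
    }

    // Consumer loop: applies each update to a copy, then swaps the snapshot.
    private void drain() {
        try {
            while (true) {
                String[] update = updates.take();
                Map<String, String> next = new HashMap<>(snapshot);
                next.put(update[0], update[1]);
                snapshot = Collections.unmodifiableMap(next);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // stop on close()
        }
    }

    @Override
    public void close() {
        consumer.interrupt();
    }
}
```

The stats served this way are eventually consistent: a reader may see a snapshot that lags behind the node's actual state by however long the queue takes to drain.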

The path to Hadoop

This blog post offers two lists of web resources that helped me learn how to design MapReduce algorithms, implement MapReduce jobs and operate Hadoop clusters.

They're listed in the order I believe is the most relevant when discovering Hadoop, i.e. the order in which it would make the most sense to read.

Enjoy


Designing and implementing MapReduce applications




Operating MapReduce clusters

Wednesday, October 27, 2010

All-Pairs as a MapReduce application: beyond typical use cases, still adhering to sound MapReduce patterns

This week I'm presenting a poster paper at the Grid 2010 conference. This poster paper showcases ongoing work at McMaster University (Ontario, Canada) with my postdoc advisor, Prof. Stéfan Sinclair.

    C. Briquet and S. Sinclair.
    Structuring All-Pairs as a MapReduce Application.
    In Poster Proc. Grid conference, Bruxelles, Belgium, 2010. 

The showcased research project basically consists of implementing the All-Pairs computing abstraction using the MapReduce framework (relying in practice on the Apache Hadoop implementation), and making it available as a tool of the VoyeurTools cloud-based text analytics platform.

Our approach to implementing All-Pairs as a MapReduce application revolves around providing Hadoop map tasks with data designation information only, not the actual input data. Input data are transferred as so-called dictionary data, using an external (i.e. non-mapper-controlled) data transfer channel, e.g. a distributed filesystem / file sharing system that can be accessed by application-level code running on Hadoop compute nodes.
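
To illustrate what "data designation information only" can mean, here is a sketch under my own simplifying assumptions (the items of a single collection are compared with one another and are identified by integer indices; PairBlock and AllPairsDesignation are hypothetical names, not taken from the poster paper). Each map task would receive one block of index ranges, never the items themselves.

```java
import java.util.ArrayList;
import java.util.List;

// Designates a rectangular block of pairs: rows [rowStart, rowEnd) compared
// with columns [colStart, colEnd). Only indices, no actual input data.
final class PairBlock {
    final int rowStart, rowEnd, colStart, colEnd;

    PairBlock(int rowStart, int rowEnd, int colStart, int colEnd) {
        this.rowStart = rowStart;
        this.rowEnd = rowEnd;
        this.colStart = colStart;
        this.colEnd = colEnd;
    }

    long pairCount() {
        return (long) (rowEnd - rowStart) * (colEnd - colStart);
    }
}

final class AllPairsDesignation {

    // Splits the itemCount x itemCount comparison matrix into blocks of at
    // most blockSize x blockSize pairs, one block per map task.
    static List<PairBlock> blocks(int itemCount, int blockSize) {
        if (itemCount < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("illegal item count or block size");
        }
        List<PairBlock> blocks = new ArrayList<>();
        for (int row = 0; row < itemCount; row += blockSize) {
            for (int col = 0; col < itemCount; col += blockSize) {
                blocks.add(new PairBlock(
                        row, Math.min(row + blockSize, itemCount),
                        col, Math.min(col + blockSize, itemCount)));
            }
        }
        return blocks;
    }
}
```

In this sketch, blockSize is the knob that sets how many input data items are designated to each map task.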

This means that our implementation of All-Pairs does not rely on Hadoop's ability to handle massive amounts of input data. Nonetheless, input data is still transferred, as dictionary data. The total number of computations thus remains identical. Orthogonally, All-Pairs benefits from Hadoop's excellent out-of-core implementation of the shuffle algorithm, so that compute nodes can handle massive amounts of computations.

A recent blog post by Apache Hadoop committer Arun Murthy, Apache Hadoop: Best Practices and Anti-Patterns, introduces the concept of Hadoop design (anti-)patterns. A very compelling read. It covers every aspect of implementing and tuning a MapReduce application.

Our proposed implementation of All-Pairs as a MapReduce application, and running it on Hadoop, seems compatible with these design patterns. Within the design space of MapReduce applications, it turns out that All-Pairs would be classified as a << rare [MapReduce use] case >>, based on its data access patterns. Indeed, given the very structure of the All-Pairs problem, each computation in All-Pairs shares input data with many other computations, opening the possibility of massive data reuse. This is how the structure of the All-Pairs problem significantly differs from the use cases originally intended for MapReduce applications.

It is worth examining our proposed implementation of All-Pairs as a MapReduce application with respect to Arun Murthy's proposed Hadoop design patterns. Firstly, transferring large amounts of input data as dictionary data indeed does not scale with respect to Hadoop's distributed cache mechanism. This is why we suggest relying on a distributed file system. We argue that this is definitely worth trying, as tremendous progress has been made in these technologies in recent years. Secondly, getting map tasks to read large amounts of dictionary data indeed looks like a design anti-pattern. This is why each map task of our implementation of All-Pairs includes a hierarchical cache, in order to minimize accesses to the dictionary data.
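
The structure of the hierarchical cache is not detailed here. As a much simpler stand-in, the sketch below shows a single-level LRU cache placed in front of a dictionary lookup; DictionaryCache and the loader function are hypothetical, and the real cache has more than one level.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Single-level LRU cache: keeps the most recently used dictionary entries in
// memory and calls the (slow) loader only on a miss.
final class DictionaryCache<K, V> {

    private final Function<K, V> loader;
    private final Map<K, V> entries;

    DictionaryCache(final int capacity, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder = true makes iteration order least-recently-used first.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    // Note: a loader returning null is not cached and will be called again.
    synchronized V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key);
            entries.put(key, value);
        }
        return value;
    }
}
```

Because each computation in All-Pairs shares input data with many others, even a small cache of this kind can avoid a large share of the accesses to the dictionary data.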

Our proposed implementation of All-Pairs as a MapReduce application adheres to other design patterns as well. To obtain a total order on the output data, a custom partitioner assigns output data items to reducers based on a precomputed partition table. Load balancing and fast fault-recovery are achieved by calibrating the number of input data items designated to each map task.
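
The partition-table idea can be sketched independently of the Hadoop API. Assuming output keys are strings and the table is a sorted list of split keys (PartitionTable is a hypothetical name, not the class from our implementation), a binary search maps each key to a reducer such that every key of reducer i sorts before every key of reducer i+1:

```java
import java.util.Arrays;

final class PartitionTable {

    // Sorted ascending; n split keys define n + 1 partitions.
    private final String[] splitKeys;

    PartitionTable(String... splitKeys) {
        this.splitKeys = splitKeys.clone();
        Arrays.sort(this.splitKeys);
    }

    // Keys smaller than splitKeys[0] go to partition 0; keys greater than or
    // equal to splitKeys[i] (and smaller than splitKeys[i+1]) go to i + 1.
    int partitionFor(String key) {
        int index = Arrays.binarySearch(splitKeys, key);
        return index >= 0 ? index + 1 : -(index + 1);
    }

    int partitionCount() {
        return splitKeys.length + 1;
    }
}
```

Since each reducer sorts its own keys, concatenating the reducers' outputs in partition order then yields a total order.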

To summarize:
  • All-Pairs does not structurally fit typical (i.e. originally intended) MapReduce use cases
  • All-Pairs, while not requiring Hadoop's ability to handle massive amounts of input data, can certainly benefit from the out-of-core implementation of the shuffle algorithm; all the more so since this is a very important design objective of the VoyeurTools cloud-based text analytics platform, from which All-Pairs will soon be made publicly available
  • our implementation of All-Pairs as a MapReduce application, described in our Grid 2010 poster paper, seems compatible with the Hadoop design patterns proposed by Arun Murthy in his recent blog post

Thursday, October 21, 2010

Maverick Meerkat: I feel the need, the need for speed

Just updated the operating system of my laptop to the newest release of Ubuntu Linux.
Ubuntu 10.10, codename Maverick Meerkat.

Wow. A perfect 10/10.

Really, there's nothing more to add. Fully automated update process. Only one reboot required to complete the process. The update software asked me to confirm that a couple of files could be updated. I agreed, of course, and it all went smoothly.

After the update, everything continued to operate correctly. No Firefox extension mangled this time. No configuration file to troubleshoot. Several minor UI updates here and there as a bonus. Ubuntu 10.10 is available for download from http://www.ubuntu.com/



Oh, and the speed. Oh yes, the speed. Maverick Meerkat is really snappy. The user experience just flows through the screen. I can't describe it in words.

Go Maverick! I feel the need, the need for speed.

Saturday, October 2, 2010

Virtualizing the Internets

Testing distributed systems is notoriously difficult. Writing unit tests for components of a distributed system is sometimes more an art than a science. It is indeed not straightforward to determine the optimal level of granularity of what is being tested.

Until today, two unit tests of my main work project (Voyeur Tools cloud-based text analytics) exhibited dependencies on the global Internet. Specifically, one code chunk downloads HTTP-addressable resources to obtain test data.

On one hand, a dependency on the global Internet is too coarse-grained. Each time test data can't be downloaded (web servers and wifi links indeed may or may not work), both unit tests fail. Quite annoying. It makes the developer's work harder by preventing some unit tests from running properly.

On the other hand, downloading HTTP-addressable test data makes sense for these two unit tests. Totally removing the dependency on HTTP-addressable resources is not an option.

Looks like an opportunity to get creative.

Fortunately, the project relies on an embedded web server (Jetty) for some system-level tests. This means that the project's POM is already configured to make the Jetty libs available to the whole code base.

Virtualizing the Internets for unit tests is therefore straightforward:

STEP 1: copy the HTTP-addressable test data to the project's resources directory

STEP 2: write a wrapper for the embedded web server (cf. infra)

STEP 3: configure an embedded web server for each unit test (using a different TCP port each time), relying on the facilities provided by the unit testing framework (TestNG, cf. infra)

STEP 4: update each URL to virtualize:

    new URL("http://someserver.net/resource.html");
    =>
    new URL("http://localhost:"+webServer.getPort()+"/resource.html");

That's right, each unit test involving HTTP-addressable resources gets its own mini-copy of the Internets (as far as it's concerned), by running a dedicated embedded web server.

As far as unit tests involving HTTP-addressable resources are concerned, virtualizing the Internets is straightforward. After all, the Internets is only a series of tubes ;)



Configuring the web server in each unit test:

    private EmbeddedWebServer webServer; 

    @BeforeClass 
    public void setUp() throws Exception { 
        this.webServer = new EmbeddedWebServer(8080,
            getClass().getResource("/path/to/resources/").getPath()); 
        this.webServer.start(); 
    }

    @AfterClass 
    public void cleanUp() throws Exception { 
        this.webServer.stop();
    }
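
The setup above hard-codes port 8080, whereas each unit test is meant to use a different TCP port. One possible way to get one, sketched here with a hypothetical FreePorts helper, is to ask the operating system for a currently free port by binding a socket to port 0:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;

final class FreePorts {

    private FreePorts() {
    }

    // Binds to port 0 so that the OS picks a free port, then releases it.
    // Caveat: another process could grab the port before the web server does.
    static int find() {
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned value would then replace the 8080 literal in the setUp() method, e.g. new EmbeddedWebServer(FreePorts.find(), ...).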


Writing a wrapper for the embedded web server:

import java.io.File;

import org.mortbay.jetty.Connector;
import org.mortbay.jetty.Server;
import org.mortbay.jetty.nio.SelectChannelConnector;
import org.mortbay.jetty.webapp.WebAppContext;

/**
 * @author Cyril Briquet
 */
public class EmbeddedWebServer {

    private final Server server;
    private final int port;

    public EmbeddedWebServer(int port, String contentsDirectory) throws Exception {
        if (contentsDirectory == null) {
            throw new NullPointerException("illegal contents directory");
        }
        if (!new File(contentsDirectory).exists()) {
            throw new IllegalArgumentException("illegal contents directory "+contentsDirectory);
        }

        this.port = port;
        this.server = new Server();

        final Connector connector = new SelectChannelConnector();
        connector.setPort(port);
        this.server.setConnectors(new Connector[] { connector });

        final WebAppContext webapp = new WebAppContext();
        webapp.setContextPath("/");
        webapp.setWar(contentsDirectory);
        //webapp.setDefaultsDescriptor("web.xml");
        this.server.setHandler(webapp);
    }

    public int getPort() {
        return this.port;
    }

    public synchronized void start() throws Exception {
        this.server.start();
    }

    public synchronized void stop() throws Exception {
        this.server.stop();
    }

}