Dependable Distributed Systems

(Défago Lab.)

Main Research Interests


Our research interests are on highly resilient distributed systems and algorithms, in particular applied toward the support of coordination of sensor networks and groups of mobile robots.

This combines research on algorithms, networking, fault-tolerance, computational geometry, and software engineering (middleware design). We are especially interested at the interface between theory and practice.

Some of the research issues on which we have been working recently are as follows.

Recent projects

Byzantine Agreement

Without going into details, the goal of this research is to design systems that can survive in spite of the presence of some malicious nodes (traitors). This focuses on the design of distributed algorithms that can make consistent decisions even after some of the nodes have been compromised. Such nodes may act as traitors by lying to others, with the purpose of disrupting the entire system. These nodes are said to be subject to Byzantine faults and are often called Byzantine nodes.

Agreement is a core problem in distributed systems, as it is the key to ensure consistent operations and coordination in such systems. It is at the core of generic coordination services such as Chubby (Google) and ZooKeeper (Apache). Byzantine agreement is a variant that must be resistant to the presence of Byzantine faults.

Fault propagation

In large systems, it is essential to prevent the propagation of failures from process to process, or else a minor issue could escalate and eventually cause the entire system to collapse. To reduce this risk, it is important to design adequate mitigating strategies.

Fault-tolerance in multi-robots systems

We aim at building a multi-robot system based on formally proven algorithms. To do so, we are working on wireless ad hoc network protocols aimed specifically at robot systems. While many ad hoc protocols consider mobility as an input parameter, robot systems must often consider mobility as the main output parameter to control.

Low-energy network protocols designed for sensor networks, such as Xbee/Zigbee, can be optimized based on information provided for instance by a motion planning component. At the same time, motion planning can calculate routes such that the system is less likely to face partitions. Or, when partition do occur, a better estimation about their duration can be used to improve the efficiency of disruption-tolerant network protocols.

In our research aimed at multi-robot systems, we focus on improving the robustness of the system as a whole. While multi-robot systems are distributed systems, they have characteristics that set them apart from conventional distributed systems such as cloud systems. In particular, in addition to real-time constraints, the mobility of the robots becomes an important part of the problems to solve.

Robot cooperation algorithms

The problem of cooperative autonomous mobile systems (e.g., Internet-enabled robots) is to control a group of mobile entities in such a way that they cooperate to perform some collaborative actions. This general problem is subject to very active research but, so far, most research in the field has taken an ad-hoc or bottom-up approach; by studying how some complex global behavior can emerge from the interaction of many entities with just a very simple local behavior. In contrast, we try to follow a different (complementary) approach with a top-down perspective. The idea is to consider practical tasks involving large groups of autonomous mobile systems, and extract abstract problems common to many tasks. After formally specifying those problems and developing an adequate computational model, the goal is to find the minimal conditions under which some particular tasks can be achieved.

Performance and scalability of distributed algorithms

The performance and the scalability of distributed algorithms and protocols are often not evaluated as thoroughly as they should be. One important aspect of our research is concerned with improving this situation through the development of adequate tools. These tools include a set of performance metrics for distributed algorithms, a prototyping environment, as well as running adequate network experiments. This research activity has led us to develop the Neko framework, as well as a novel approach to network failure detectors, called accrual failure detectors. One of our proposed instance of such failure detectors are being used in the Apache Cassandra project (see also below).

Software for robots

Finally, we are studying middleware architecture and flexible protocol composition as part of our efforts to devise a fault-tolerant multi-robot middleware framework. Our efforts are based on the Neko protocol framework that was designed to support the prototyping and the evaluation of distributed protocols. The specific interaction models of robotic software components (sensor/actuators, filters, real-time constraints, etc.) are interesting challenges that make it difficult to consider protocols as black boxes.

As an extension, we aim to develop a new generation of fault-tolerance middleware for multi-robot systems, written in the Scala language and to be interfaced with ROS; an open-source middleware for robots.

Some earlier projects

Failure detection

In earlier research, we have worked on the implementation of failure detectors in large-scale distributed systems. We have designed the noton of accrual failure detectors and its first implementation called Phi. In short, an accrual failure detector continuously monitors network traffic and maintains a variable representing the probability that a given process has crashed. This model allows to cleanly separate the questions of monitoring network traffic (a network issue) from when and how to react in case of suspicion (an application issue).

Among others, accrual failure detectors have been implemented in the NoSQL database Apache Cassandra, the Akka actor toolkit, and the Appia communication framework.