Local project leader
Prof. Hans P. Reiser
Research team members
The goals of OptSCORE's second funding phase include completing earlier goals, in particular the adaptation of system parameters to the specific needs of arbitrary applications and system environments. In this way, we aim to achieve autonomous optimization of State Machine Replication (SMR) systems. In addition, we will extend our system with preventive as well as reactive fault-handling measures and add new optimization dimensions. We will evaluate these aspects through extensive testing with custom evaluation strategies in both artificial and realistic application scenarios. Overall, this research project aims to make SMR-based replication more practicable and to identify efficient ways of implementing it.
SMR is a promising approach for providing resilience guarantees to IT systems. Replicated state machines can mask Byzantine failures and guarantee a strongly consistent view of the replicated data. However, the transition from a simple service to a replicated service typically implies much higher resource utilization as well as performance penalties such as lower throughput. In the first funding phase, measures such as deterministic multithreading (DMT) and individually weighted replicas were studied to mitigate these performance penalties. In addition, a variety of configuration parameters that can be reconfigured at runtime were identified and analyzed.
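To make the resource cost of replication concrete, the following minimal sketch computes group and quorum sizes under the common assumption for Byzantine fault-tolerant protocols that n = 3f + 1 replicas are needed to tolerate f faulty ones. The function names are illustrative and are not part of the OptSCORE prototype; weighted replicas, as studied in the first phase, would change this arithmetic and are not modeled here.

```python
def min_replicas(f: int) -> int:
    """Smallest group size that tolerates f Byzantine replicas (n = 3f + 1)."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def quorum_size(f: int) -> int:
    """Quorum size for a group of n = 3f + 1 unweighted replicas (2f + 1)."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 2 * f + 1


if __name__ == "__main__":
    # Print the replica count and quorum size for a few fault thresholds.
    for f in range(1, 4):
        print(f"f={f}: replicas={min_replicas(f)}, quorum={quorum_size(f)}")
```

Even for a single tolerated fault, the service therefore runs on four machines instead of one, which illustrates why resource utilization rises with replication.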
In the second funding phase, an autonomous and automatic adaptation of these parameters will be realized, so that throughput and request latency are optimized for given applications, usage scenarios, and execution conditions such as network latencies and error rates. We further want to integrate machine learning to cope with the immense complexity introduced by these configurable parameters and to coordinate them effectively in order to optimize every aspect of system performance. The biggest challenges here are selecting a suitable machine learning approach and creating appropriate training data. Furthermore, the system components must be adapted in such a way that parameters can always be changed dynamically at runtime without affecting the consistency and availability guarantees.
An additional goal is the development of a security concept that addresses the as yet unsolved problem that today's systems are content with merely tolerating faults: they make no effort to detect such faults or to recover from them. Detecting and recovering from faults, however, should minimize their effect on performance and increase resilience compared to current SMR systems. Apart from that, our prototype system from the first funding phase will be extended with further optimization strategies. Here, we want to find out whether replacing the total order of requests with a partial order always results in better performance in both group communication and deterministic scheduling. Moreover, we want to minimize the overhead introduced by periodic state checkpointing.
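The idea of relaxing a total order into a partial order can be illustrated with a small sketch. It assumes, purely for illustration, that each request declares the keys it reads and writes; two requests conflict if one writes a key the other accesses, and only conflicting requests must keep their relative order. The `Request` type and the `schedule_levels` function are hypothetical and do not describe the OptSCORE prototype.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    rid: int
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


def conflicts(a: Request, b: Request) -> bool:
    """Two requests conflict if either one writes a key the other accesses."""
    return bool(a.writes & (b.reads | b.writes)) or bool(b.writes & a.reads)


def schedule_levels(requests: list) -> list:
    """Group totally ordered requests into batches of mutually independent ones.

    A request is placed one level after the latest earlier request it
    conflicts with. Requests within one batch do not conflict and could be
    executed concurrently; the batches themselves run in sequence. The result
    depends only on the input order, so every replica derives the same batches.
    """
    levels = []
    for i, req in enumerate(requests):
        level = 0
        for j in range(i):
            if conflicts(requests[j], req):
                level = max(level, levels[j] + 1)
        levels.append(level)

    batches = [[] for _ in range(max(levels, default=-1) + 1)]
    for req, level in zip(requests, levels):
        batches[level].append(req.rid)
    return batches


if __name__ == "__main__":
    reqs = [
        Request(1, writes=frozenset({"x"})),
        Request(2, writes=frozenset({"y"})),
        Request(3, reads=frozenset({"x"}), writes=frozenset({"y"})),
    ]
    print(schedule_levels(reqs))
```

In this example, requests 1 and 2 touch different keys and land in the same batch, while request 3 depends on both and is placed in the next one. Whether such a relaxation pays off in practice depends on the cost of conflict detection and on the workload, which is the question the project sets out to investigate.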