Web application performance and a smooth user experience are key for today’s online businesses. Even small increases in response time degrade a user’s experience on a web page, which leads to lower conversion rates. Anomalous behavior of a company’s web applications can therefore reduce its revenue. At the same time, more and more web applications are provided through a large number of interacting services running on different machines. For this reason, companies employ distributed tracing to track the path that requests take through the different services while they are processed.
In this thesis, a prototype is implemented that detects anomalies based on distributed tracing data. The targeted anomalies are application errors, violations of defined thresholds, and response times that are increased compared to the normal behavior of a service. This is achieved by running three different anomaly detection algorithms, implemented on Apache Spark, in parallel on the incoming distributed tracing data.
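The three anomaly classes can be illustrated with a minimal sketch in plain Python. It is not the Spark implementation of the prototype: the `Span` type, the function names, and the mean-plus-k-standard-deviations rule used for "increased response time" are assumptions made here for illustration only.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified view of one traced operation."""
    service: str
    duration_ms: float
    error: bool = False


def detect_errors(spans):
    """Report spans that were flagged as application errors."""
    return [s for s in spans if s.error]


def detect_threshold_violations(spans, thresholds):
    """Report spans exceeding a fixed per-service limit (in ms)."""
    return [
        s for s in spans
        if s.service in thresholds and s.duration_ms > thresholds[s.service]
    ]


def detect_slow_responses(spans, history, k=3.0):
    """Report spans slower than mean + k * stdev of past durations.

    `history` maps a service name to earlier durations that are assumed
    to represent normal behavior; this rule is an illustrative choice.
    """
    anomalies = []
    for s in spans:
        past = history.get(s.service, [])
        if len(past) < 2:  # stdev needs at least two samples
            continue
        if s.duration_ms > mean(past) + k * stdev(past):
            anomalies.append(s)
    return anomalies


spans = [
    Span("cart", 103.0),
    Span("cart", 150.0),
    Span("cart", 250.0),
    Span("catalog", 20.0, error=True),
]
history = {"cart": [100.0, 102.0, 98.0, 101.0, 99.0]}

errors = detect_errors(spans)
violations = detect_threshold_violations(spans, {"cart": 200.0})
slow = detect_slow_responses(spans, history)
```

In this sketch the three detectors are independent functions over the same input, which mirrors the idea of running them in parallel on the incoming tracing data.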
The reported anomalies are then processed by a second module, which is also based on Apache Spark. It places the anomalies into a context that represents the dependencies among the services that reported them. This context is used to prioritize those reported anomalies that are considered to be the root cause of the set of anomalies.
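One possible way to use such a dependency context is sketched below in plain Python. The heuristic is an assumption for illustration, not necessarily the rule used by the prototype: an anomalous service is treated as a root cause candidate if none of the services it depends on, directly or transitively, reported an anomaly. The graph layout and the service names are hypothetical.

```python
def reachable(graph, start):
    """Return all services that `start` depends on, directly or transitively."""
    seen = set()
    stack = list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return seen


def root_cause_candidates(graph, anomalous):
    """Anomalous services with no anomalous downstream dependency."""
    anomalous = set(anomalous)
    return sorted(
        s for s in anomalous
        if not (reachable(graph, s) - {s}) & anomalous
    )


# Hypothetical call graph: caller -> services it calls.
graph = {
    "frontend": ["cart", "catalog"],
    "cart": ["db"],
    "catalog": [],
    "db": [],
}

candidates = root_cause_candidates(graph, {"frontend", "cart", "db"})
# candidates == ["db"]
```

Here `frontend` and `cart` are deprioritized because their anomalies can be explained by the anomaly in `db`, which they both depend on.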
The evaluation on a small-scale demo application shows that the prototype can detect the targeted anomalies. This indicates that it is possible to perform anomaly detection and root cause analysis based on distributed tracing data.