Last week I travelled down to London for ObservabilityCON on the Road London. The event was organised by Grafana Labs and covered all their latest developments. I was one of over 200 attendees watching the single track of talks, which were mainly delivered by Grafana Labs employees, with a couple given by Grafana Cloud Enterprise customers. In this post I’ll highlight the talks I thought were the most interesting.
The talks showcased a mixture of improvements to existing features and brand-new features. Some of what was shown is not yet publicly available. For any preview features, I’ve included when Grafana intend to roll them out to OSS/Enterprise/Cloud.
Keynote
First up was the keynote. This was delivered by Grafana Labs’ CTO and outlined the company’s strategy for the coming year. Their plan is to make it easier for developers to introduce observability, and to simplify existing solutions. This was broken down into three areas:
- Explore: Easier use of existing logs, metrics, traces and profiles through their no-code querying
- Asserts: Easier incident investigation with enhanced Asserts tooling
- AI/ML: Making it easier for non-observability experts to analyse telemetry and improve existing systems
These topics were covered in greater detail in later talks, so I’ll keep this section short and go into detail in the talk write-ups below.
Talk 1: Grafana Alloy, Grafana Beyla, and OpenTelemetry
The first talk covered three topics: Grafana Alloy, Grafana Fleet Management, and Grafana Beyla. These were all new concepts to me, so I found this talk interesting.
First up was Grafana Alloy. This is Grafana Labs’ equivalent of the OpenTelemetry Collector: a sink for all your telemetry. Your applications send their data to Alloy, which then forwards it to telemetry backends such as Prometheus, Mimir, and Loki. They claim it provides a more efficient interface when forwarding telemetry to the Grafana Labs backend stack, though they didn’t seem too confident in this. They also said it provides a better experience than the OTEL Collector for building transformation pipelines, debugging those pipelines, and GitOps. I hadn’t heard of Alloy before, but its positioning was interesting: Grafana are building their own tool while also embracing OpenTelemetry, which I think is a good sign for the direction of their tooling.
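In practice this means your applications point at Alloy rather than at each backend individually. As a minimal sketch (the service name is hypothetical, and I’m assuming Alloy’s OTLP/HTTP receiver is listening on the standard port), a Node.js service using the standard OpenTelemetry SDK would just direct its exporter at Alloy:

```ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

// Export traces to the local Alloy instance; Alloy takes care of batching,
// transformation, and forwarding to backends such as Tempo, Mimir, and Loki.
const sdk = new NodeSDK({
  serviceName: 'checkout-service', // hypothetical service name
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces', // assumes Alloy's default OTLP/HTTP port
  }),
});

sdk.start();
```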
The presenters then showcased Grafana Fleet Management. This is Grafana’s equivalent of OTEL’s OpAMP and is used to manage the configurations of deployed Alloy instances. You can tag Alloy instances to group them, then apply configuration updates to the tag groups through a portal or via webhooks. This was interesting, but as it’s not yet in public preview I won’t say much more.
Finally they introduced Grafana Beyla. This was really exciting. Beyla instruments your application with no source code changes required. It works through the eBPF feature of the Linux kernel (hence Beyla being Linux only). From my previous experience of eBPF I knew that it could be used to monitor simple kernel metrics such as the number of open file descriptors and resource utilisation. Beyla goes beyond this. For example, by monitoring networking system calls, it’s able to understand the protocol, parse it and export the data as a trace. This provides observability into network requests for HTTP(S), gRPC and SQL with no code changes.
After the talk I spoke with Mario (one of the presenters / Beyla contributors) and asked him if Microsoft’s efforts to bring eBPF to the Windows kernel meant that Beyla would soon work on Windows. He said that the only part of eBPF currently supported on Windows is XDP, which is limited to packet processing, so Beyla’s kprobe-based monitoring approach isn’t possible there. I hope Microsoft expands the Windows eBPF features to support what Beyla requires.
This talk was one of my highlights of the event. Here’s a link to a similar talk from the main ObservabilityCON event covering the same topics as the one I saw in London; I’d recommend watching it if you want to learn more about Beyla.
Talk 3: Grafana Asserts
The third talk was a showcase of Grafana Asserts. It was demo-based, so I’ll summarise the main takeaways.
Grafana Labs acquired Asserts a year ago. At the time of the acquisition, Asserts could automatically generate a graph of your system’s components from existing telemetry, and provided a debugging view called the Root-Cause-Analysis (RCA) Workbench which identified anomalous telemetry and showed the relationships between parts of your system.
I think this is an interesting concept, as it provides what I consider the gold standard of generated documentation: low upkeep and never out of date. They didn’t stress this point much in the talk, but I think it deserves further exploration: using your observability data to document your architecture, in contrast to static diagrams which compartmentalise your systems and go stale quickly. What Grafana have done is link the LGTM stack into Asserts, so once the map is built you can view your logs, metrics and traces in a contextualised analysis. This seems like a really smart way to automatically structure your data.
The RCA Workbench uses machine learning to identify anomalous telemetry and provide quicker insight into issues. Asserts also works with Explore telemetry, so you can generate no-code queries, which seems very powerful when combined with the system map. The demo of the RCA Workbench was very slick, but I’ll have to see how it performs with real-world, messy telemetry before accepting Grafana’s panacea claims.
Yet Asserts comes with one big caveat: it’s currently only available in Cloud Advanced ($$$), with a Q1 2025 rollout targeted for the lower tiers of Grafana Cloud.
Talk 5: AI/ML
This talk was about Grafana’s strategy of applying AI and ML to observability. The keynote had opened with a strong statement that they want to empower developers rather than unemploy them; this talk showed the three focus areas for achieving that.
The first was adding dynamic alerting from time series data. This takes the existing threshold-based alerting and adds a forecasting algorithm to generate expected ranges. The forecasting itself is fairly standard and can factor in seasonality, but there’s a slick GUI for tuning the model’s parameters and seeing which points in the historical data would have generated alerts.
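To make that concrete, here’s a toy sketch of the general idea (my illustration, not Grafana’s actual model): learn an expected band of mean ± k standard deviations for each hour-of-week bucket from historical data, then flag points that escape their bucket’s band.

```ts
interface Point { timestamp: Date; value: number; }

// Bucket a timestamp by hour of week (0..167) to capture weekly seasonality.
const hourOfWeek = (d: Date): number => d.getUTCDay() * 24 + d.getUTCHours();

function seasonalBounds(history: Point[], k = 3): Map<number, [number, number]> {
  const buckets = new Map<number, number[]>();
  for (const p of history) {
    const b = hourOfWeek(p.timestamp);
    const values = buckets.get(b) ?? [];
    values.push(p.value);
    buckets.set(b, values);
  }
  const bounds = new Map<number, [number, number]>();
  for (const [b, values] of buckets) {
    const mean = values.reduce((s, v) => s + v, 0) / values.length;
    const sd = Math.sqrt(values.reduce((s, v) => s + (v - mean) ** 2, 0) / values.length);
    bounds.set(b, [mean - k * sd, mean + k * sd]);
  }
  return bounds;
}

// A point is anomalous if it falls outside the expected band for its bucket.
function isAnomalous(p: Point, bounds: Map<number, [number, number]>): boolean {
  const band = bounds.get(hourOfWeek(p.timestamp));
  return band !== undefined && (p.value < band[0] || p.value > band[1]);
}
```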
To optimise telemetry, they are adding “Adaptive Telemetry”. The system recommends logs, metrics, and traces which are under-queried and can be safely deleted. This already existed for metrics, but Grafana have now expanded it to cover all forms of telemetry. They determine this by calculating the ratio between the number of entries in the database and the number of times those entries are queried. It also works on labels and fields, so it looks like a good way to reduce the cardinality of your telemetry data. However, it only works by dropping values at the backends (Loki, Tempo…); for maximum effectiveness you’d want to shift the removal closer to the applications, such as into the collector or the source code. An audience member asked if there was an easy way to instruct Grafana Alloy not to forward the telemetry, and the answer was that this isn’t possible. That, combined with a basic UI, showed this is still a tech demo rather than production-ready. When it reaches GA I’d be interested to see how it performs.
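As a back-of-the-envelope sketch of that usage-ratio heuristic (the type, names, and threshold here are mine, not Grafana’s):

```ts
// Series whose query count is tiny relative to their ingest volume become
// candidates for dropping; the cutoff ratio is an arbitrary illustrative value.
interface SeriesStats {
  name: string;
  ingestedEntries: number; // entries written to the backend
  queryCount: number;      // times those entries were read back
}

function dropCandidates(stats: SeriesStats[], maxRatio = 0.001): string[] {
  return stats
    .filter(s => s.ingestedEntries > 0)
    .filter(s => s.queryCount / s.ingestedEntries < maxRatio)
    .map(s => s.name);
}
```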
They also showed more standard applications of LLMs. The most interesting was using an LLM to explain complex flamegraphs but most of the LLM integrations felt like hype.
Talk 6: End-to-end IRM
Talk 6 was a demonstration of Grafana’s Incident Response & Management (IRM) tools, their attempt to create a single source of truth for investigations. It looked really good. You can create SLOs and automatically visualise compliance against them in Grafana. They also showed Grafana OnCall, which is similar to PagerDuty and ServiceNow: it manages who is on call and links alerting to your SLOs. Finally, they showed the Grafana Incident app. Declaring an incident creates something like a workbook: you can attach dashboards and queries to it and manage tasks (which can be automatically populated from a runbook). They have integrations with MS Teams, Slack and Jira, so actions taken in the Incident app are propagated back to your existing communication channels.
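The SLO arithmetic behind those compliance views is simple enough to sketch out (my illustration; Grafana computes this for you from your metrics):

```ts
// With a 99.9% SLO over some window, the error budget is 0.1% of requests.
// The remaining budget is how much of that allowance is still unspent.
function errorBudgetRemaining(slo: number, totalRequests: number, errors: number): number {
  const budget = (1 - slo) * totalRequests; // failures the SLO allows
  return (budget - errors) / budget;        // fraction of the budget left
}

// e.g. a 99.9% SLO over 1,000,000 requests allows 1,000 failures;
// 400 errors so far leaves about 60% of the budget.
console.log(errorBudgetRemaining(0.999, 1_000_000, 400)); // ≈ 0.6
```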
The IRM suite looked great; however, it’s currently only available in Grafana Cloud, and the speaker didn’t know when it’s coming to Enterprise or OSS.
Talk 8: Explore Telemetry
Next up was a talk on the upcoming no-code querying in Grafana. This comes from a collection of query tools called the Explore apps suite.
Currently Grafana supports two ways of querying data: you can write a query by hand in the query language (e.g. PromQL, LogQL, …), or you can use the low-code builders to produce queries from a GUI. This is going to expand with a third way of creating queries, at a higher level of abstraction than low-code: no-code.
This is going to let you start out with a suggested query from automatic data analysis; Grafana will then suggest ways to “drill down” the query, and you can click on further filters to apply them to the current view. For example, in Explore Traces, Grafana will identify traces that are anomalous compared to the detected baseline and try to cluster the labels and fields occurring in the bad traces. They showed another example where Grafana identifies features of slow traces, and again correlates those slow traces with labels and fields to provide a starting point for an investigation.
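Here’s a toy sketch of that drill-down suggestion idea (my illustration of the general approach, not Grafana’s implementation): rank label/value pairs by how over-represented they are in slow traces relative to the baseline.

```ts
interface Trace {
  durationMs: number;
  labels: Record<string, string>; // e.g. { service: "checkout", region: "eu-west" }
}

// Fraction of traces in `set` that carry each label=value pair.
function labelFrequencies(set: Trace[]): Map<string, number> {
  const freq = new Map<string, number>();
  for (const t of set) {
    for (const [k, v] of Object.entries(t.labels)) {
      const key = `${k}=${v}`;
      freq.set(key, (freq.get(key) ?? 0) + 1 / set.length);
    }
  }
  return freq;
}

// Rank label=value pairs by how much more often they appear in slow traces
// than in the full set; the top entries are starting points for investigation.
function suspectLabels(traces: Trace[], slowThresholdMs: number): [string, number][] {
  const slow = traces.filter(t => t.durationMs > slowThresholdMs);
  const baseline = labelFrequencies(traces);
  const inSlow = labelFrequencies(slow);
  return [...inSlow.entries()]
    .map(([key, p]): [string, number] => [key, p - (baseline.get(key) ?? 0)])
    .sort((a, b) => b[1] - a[1]);
}
```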
This talk was mostly a demonstration working through the new Explore interface. For their demo investigation it worked very well, though real production use cases will provide a better evaluation of the technology. I will definitely try this when it’s released.
Talk 9: Synthetic Monitoring and K6
The final talk of the day was about Grafana’s frontend monitoring.
This was a really good talk: the speaker introduced the Core Web Vitals and explained how you can monitor your application’s performance against those benchmarks.
First, the presenter introduced Grafana Faro, Grafana’s frontend monitoring solution, similar to Sentry. It tracks user sessions and lets you plot a user’s journey through your application. To help debug issues, you can link source maps to Grafana Cloud, which lets you view errors in the context of the original JavaScript source rather than the minified bundle generated by the toolchain. That seemed to me like a handy way to simplify exception tracebacks.
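Getting started with Faro looked straightforward. A minimal setup sketch using the official web SDK (the collector URL and app details are placeholders):

```ts
import { initializeFaro, getWebInstrumentations } from '@grafana/faro-web-sdk';

initializeFaro({
  // Placeholder collector endpoint from your Grafana Cloud frontend app.
  url: 'https://faro-collector-<region>.grafana.net/collect/<app-key>',
  app: {
    name: 'my-frontend', // placeholder app details
    version: '1.0.0',
    environment: 'production',
  },
  // The default web instrumentations cover errors, console logs, sessions,
  // and Core Web Vitals.
  instrumentations: [...getWebInstrumentations()],
});
```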
The second part of the talk showed updates to the k6 integration with Grafana. I was aware of this technology before the event, but I was impressed by the improvements shown. k6 tests are written in JavaScript, but to make getting started simpler Grafana have created a dev-tools recorder: after clicking the “record” button in the dev-tools, the plugin records the actions you take in the browser and generates a k6 script from them (see the sketch below). This puts k6 in the same league as Playwright and Cypress, so it will be interesting to see how it develops. After the talk someone asked about setting this up in CI; the speaker’s answer of “it’s complicated” suggests it’s not yet ready to compete with the existing tools in the browser-testing space.

To me, the killer feature from the talk is that you can now launch k6 tests from Grafana itself. This spins up a headless browser and runs the given tests. Once a run completes, you can see the results from k6, but also any telemetry generated by the tests and their actions. This looked like a really good way to pinpoint bottlenecks surfaced by the tests.
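For reference, here’s a hand-written sketch of the kind of browser-level script the recorder produces (the URL and selectors are placeholders), using the browser module that ships with recent k6 versions:

```ts
import { browser } from 'k6/browser';

export const options = {
  scenarios: {
    ui: {
      executor: 'shared-iterations',
      options: { browser: { type: 'chromium' } }, // k6 drives headless Chromium
    },
  },
};

export default async function () {
  const page = await browser.newPage();
  try {
    // Replay the recorded user journey: load the page, log in, submit.
    await page.goto('https://example.com/login'); // placeholder URL
    await page.locator('input[name="user"]').type('demo');
    await page.locator('button[type="submit"]').click();
  } finally {
    await page.close();
  }
}
```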
Conclusion
Overall, ObservabilityCON gave me a better understanding of Grafana’s observability tooling vision, and showed me parts of application observability I was unaware of. I think the most promising technology shown was Beyla, and I’m looking forward to taking a deeper dive into it to learn more.