GURO BERGAN - VP EMEA & COLIN BURKE - Global Head of Customer Success

Craig Godfrey
1 day ago
17 min read

You Can't Govern What You Can't See

As AI rewires how engineering teams build and operate software, the organisations that win won't be the ones that deployed the most models; they will be the ones that understood what those models were doing in production. Guro Bergan, VP EMEA, and Colin Burke, Global Head of Customer Success at Honeycomb, explain why observability has become the defining infrastructure investment of the AI era.

Setting the Scene: Honeycomb & Observability

Honeycomb was built to solve what you have described as the “unknown unknowns” problem. What does that actually look like inside a modern engineering environment today?

(CB) The unknown unknowns problem is most visible in the conversations we have. The consistent pattern I hear is: we thought we understood our system until something went wrong, and we had no framework to explain it. An engineer discovers a latency spike affecting a specific cohort of users, and every tool they had before would have told them nothing was wrong, because they were measuring averages, not individual events.

That failure isn't just an engineering problem. By the time it surfaces, a product team is fielding complaints, a customer has already had a bad experience, and the cost is real across the organisation. In a world where AI is generating more of the code shipping to production, this problem compounds across every function that depends on those systems working. The unknown unknowns aren't just questions no one thought to ask. They are increasingly questions no human was ever in a position to ask, because no human wrote the code that created them.
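To make the point about averages concrete, here is a minimal Python sketch. The event data, cohort names, and field names are invented for illustration and are not taken from any Honeycomb customer or API. It shows how a fleet-wide average can look unremarkable while one cohort is having a very bad time.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical request events: 95 fast requests from one cohort
# and 5 very slow requests from another (illustrative values only).
events = (
    [{"cohort": "web", "duration_ms": 100} for _ in range(95)]
    + [{"cohort": "mobile-v2", "duration_ms": 2000} for _ in range(5)]
)

# The pre-aggregated view: a single average across all traffic.
overall_avg = mean(e["duration_ms"] for e in events)

# The event-level view: keep each event and group by cohort at query time.
durations_by_cohort = defaultdict(list)
for e in events:
    durations_by_cohort[e["cohort"]].append(e["duration_ms"])
cohort_avg = {cohort: mean(d) for cohort, d in durations_by_cohort.items()}

print(overall_avg)  # 195
print(cohort_avg)   # mobile-v2 averages 2000 ms, web averages 100 ms
```

The overall figure of 195 ms would pass most thresholds, yet every request in the smaller cohort took two seconds. That difference is only recoverable because the individual events were kept.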

Observability is often confused with monitoring. Where do traditional approaches fall short, and what fundamentally changes when teams adopt true observability?

(CB) The core failure of traditional monitoring is that it asks you to decide in advance what matters. You define your metrics, set your thresholds, and build your dashboards. That works reasonably well until your system does something you didn't predict. In an AI-accelerated world, that happens constantly. The moment something new goes wrong, your monitoring tells you something is on fire, but gives you no way to understand why.

Take Intercom: when customers started reporting that Fin, their AI agent, felt slow, the problem was not visible in any conventional dashboard. The answer was buried across more than half a dozen LLMs running simultaneously. By asking questions across the full event stream in Honeycomb, they traced the issue, reduced time to first token by 60%, and cut costs in the same investigation. That is the shift I see among our customers: engineers stop asking "Is something wrong?" and start asking "What is the customer actually experiencing?" That is a fundamentally more powerful relationship with your production systems, especially in the world of AI.

You have been with Honeycomb through a key growth phase. From your perspective, what has stayed consistent about the core problem you are solving, and what has changed?

(CB) The core problem has never changed. Engineers have always needed to ask unexpected questions about production systems and get fast, accurate answers. That tension existed before microservices, before AI, before distributed architectures. It is what observability was built to resolve. What has changed is the magnitude of the problem. When systems were simpler, a missed signal meant a slow incident. Today, with AI agents executing multi-step workflows in production, a missed signal means you have lost visibility into behaviour that no dashboard was ever configured to catch. That is why established enterprise organisations are now making deliberate shifts, often from a log-heavy Splunk environment to a trace-first approach. The migration is not simple. What I tell those customers is that they are rarely afraid of the destination. They are navigating the journey. But here is what makes this moment different: AI amplifies whatever observability practice you already have. Excellence gets faster. Dysfunction gets more expensive. The organisations investing in that foundation now are determining which side of that equation they will be on.

You work with organisations like Booking.com, Frasers Group, Zurich, and Pfizer — enterprises with very different technical environments but presumably facing a common challenge. What does observability mean at that scale?

(GB) At that scale, observability becomes an organisational capability: the speed at which engineering teams learn from production and act on it.

The verticals look different, but the underlying problem is structurally identical: system complexity has outpaced what traditional monitoring was designed to handle. Pre-aggregated dashboards fail when the incident is one you never anticipated. And with AI-powered systems introducing non-deterministic behaviour at runtime, the potential for failure has become fundamentally harder to predict.

What observability means at enterprise scale is the ability to ask any question about any system behaviour and get a fast, accurate, explorable answer. Not a pre-approved view. The actual signal, with full context, queryable in seconds.

The organisations that operate best at this level do not have the most alerts. They have the fastest feedback loops.

When you introduce Honeycomb to a new enterprise customer, what is the moment when it “clicks” for them?

(GB) It is almost always the same moment: they run a query they couldn't run in their existing tool. Not always because the tool lacked the data, but because the data was pre-aggregated, indexed, or siloed in a way that made the question difficult and time-consuming to ask. With Honeycomb, an engineer can query any span attribute across the full event stream in under ten seconds, no pre-indexing required. You can see it on a senior engineer’s face when they realise they can slice by customer ID, feature flag, and region simultaneously, across millions of events, and still get an answer that fast. That is when the conversation changes from curiosity to commitment.
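As a rough illustration of what slicing by several attributes at once means, the Python sketch below groups raw events by whatever dimensions are named at query time. It is a toy in-memory example, not Honeycomb's query engine; the function `slice_events` and all field names and values are hypothetical.

```python
from collections import defaultdict

def slice_events(events, dimensions):
    """Group raw events by any combination of attributes chosen at query time."""
    groups = defaultdict(list)
    for event in events:
        key = tuple(event.get(d) for d in dimensions)
        groups[key].append(event["duration_ms"])
    return {k: {"count": len(v), "max_ms": max(v)} for k, v in groups.items()}

# Hypothetical wide events; every attribute is kept on every event.
events = [
    {"customer_id": "c-17", "feature_flag": "new-checkout", "region": "eu-west", "duration_ms": 1800},
    {"customer_id": "c-17", "feature_flag": "new-checkout", "region": "eu-west", "duration_ms": 2100},
    {"customer_id": "c-17", "feature_flag": "old-checkout", "region": "eu-west", "duration_ms": 120},
    {"customer_id": "c-42", "feature_flag": "new-checkout", "region": "us-east", "duration_ms": 140},
]

result = slice_events(events, ["customer_id", "feature_flag", "region"])
for key, stats in result.items():
    print(key, stats)
```

Because nothing was aggregated ahead of time, the same events could be re-sliced by a different set of dimensions without re-instrumenting anything. Doing this in seconds over millions of events is the hard part that a purpose-built store has to solve.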

The AI Shift: Velocity, Risk, and Reality

The narrative around AI has been almost entirely about generation, building faster, shipping more. But you have argued that the real challenge lies in what happens after code ships. Where are enterprise leaders underestimating that?

(GB) This is the question that actually matters right now, and most enterprise leaders are getting it wrong in a specific way.

They have measured AI success almost entirely on the input side. How much faster code gets written, how many tickets get closed, how many sprints get compressed. Those numbers look great. Velocity metrics are green. The board is happy. But production does not care about velocity metrics.

What is being underestimated is the brittleness of what is shipping. When an agent writes code, the context lives nowhere. The code works until it doesn't, and when it doesn't, there is no author you can ask, "What were you thinking here?"

And the feedback loop makes this worse. The promise of AI development is that you ship faster. But if your ability to understand production behaviour does not scale with your shipping velocity, you have not accelerated. You have moved the risk downstream. The leaders getting this right have realised that observability is not a tax on AI development. It's what makes it sustainable. Right now, most enterprises can see the code that shipped. They cannot see what it is doing.

There is a narrative that AI is driving massive productivity gains. From what you are seeing on the ground, how close is that to reality?

(GB) The productivity gains are real, but they are unevenly distributed. Teams that are already excellent at observability are accelerating. Teams that are not are accumulating invisible debt.

We and others in the space agree: AI amplifies existing practices, both good and bad. If your feedback loops are strong, which means that your teams can deploy, observe, and learn quickly, then AI supercharges that. If your feedback loops are weak, AI accelerates the rate at which you ship things you can't explain and can't debug.

The organisations seeing genuine, sustained productivity gains from AI are the ones that have invested in observability as a system of organisational learning speed, not just a monitoring checkbox.

These are organisations like our long-term customer Intercom and a more recent customer, Scribe. They both transitioned to modern observability to manage complex AI systems, resulting in faster debugging, reduced costs, and improved performance. Intercom reduced its AI agent's time to first token by 60%, while Scribe slashed debugging time from an hour to five minutes and cut costs by 75%.

You have talked about AI increasing velocity but also introducing complexity. Where are organisations underestimating that trade-off?

(CB) The trade-off that consistently gets underestimated is the knowledge gap, and we are seeing it in real time with our customers.

The 2025 DORA report found that 90% of developers now use AI tools and report productivity gains, but that same AI adoption increases software delivery instability. A Thoughtworks retreat of senior engineering practitioners identified why: AI is accelerating the inner loop, the personal cycle of writing, testing and debugging, while a new middle loop of supervisory work is forming that most organisations have not staffed or structured for. That is the gap where things break.

What it looks like in practice is not just an engineering problem. When AI-written code behaves unexpectedly in production, the debugging engineer has less context than ever, but so do the product manager explaining it to customers, the security team trying to audit an AI decision, and the finance leader watching support costs rise. The organisations navigating this well are not slowing down AI adoption. They are investing in the observability infrastructure that gives every function depending on these systems the visibility to act. The ones that have not are accumulating a risk they can feel but can’t yet see. That gap closes badly when it closes.

If AI is writing more code, who, or what, is responsible for understanding what is happening in production?

(CB) This is one of the most important questions across every function in a modern organisation, not just engineering. The honest answer is that responsibility is fragmenting, and most organisations have not caught up. Historically, the engineer who wrote the code was best positioned to understand its production behaviour. That assumption is breaking down fast. If AI wrote the code, the engineer's relationship to it is more supervisory and context-oriented.

But the product leader who owns the customer experience, the security team responsible for AI decisions, and the finance team accountable for AI-driven costs all have a stake in the answer, too. What fills that gap has to be tooling: observability that gives every function that depends on production systems the ability to ask questions and get answers. Honeycomb's MCP integration is part of that: AI agents can now query production context directly, distributing understanding across humans and machines in a way that is genuinely new and genuinely powerful.

Are we moving towards a world where observability isn’t just for engineers, but for AI agents themselves?

(CB) Something I keep returning to in the LeadDev workshops I have run is this: AI does not change what good engineering looks like; it raises the cost of bad engineering.

The teams thriving with AI already had the fundamentals: CI/CD, peer reviews, strong testing culture, and clear ways of working. Look at the last decade: app modernisation, containerisation, Kubernetes, serverless. None of them fixed a weak foundation. They exposed it faster. AI is the same, only much faster still.

So yes, we are moving towards a world where observability serves agents as well as engineers, but the teams who will get the most from that are the ones who have already built the right foundations.

What excites me is that Honeycomb is one of the platforms making it real right now. The MCP server means AI agents can query production observability context directly, without a human in the middle. And what comes next is even more interesting: platforms where observability does not just respond to investigations but initiates them, where Canvas surfaces what matters before you know to ask. Observability as an active participant in production operations, not just a passive data store. That is the direction we are moving in.

Bridging the Gap: Where Honeycomb Fits

Many enterprise leaders talk about a disconnect between data strategy and AI adoption. Where does observability sit in closing that gap?

(CB) The disconnect I see most often is that data strategy conversations happen at a layer removed from what is actually happening in production. Organisations know what data they have, where it lives, and how they want to use it strategically. What observability adds is the operational layer: what are your systems actually doing with that data in real time, and when something goes wrong, can you understand why?

In an AI context, this matters well beyond engineering. The CFO approving AI spend needs to know whether that investment is producing reliable outcomes. The product leader shipping AI features needs to know how they're actually performing for real users, not just in testing. Your data strategy specifies the inputs your models use. Your observability practice tells you how those models are actually behaving when real customers hit them, and when it’s not real customers but agents who hit them. Closing that gap is what turns AI into something the whole organisation can trust and build on.

What role does high-fidelity telemetry play in building trust in AI-driven systems?

(CB) High-fidelity telemetry is the trust mechanism, not abstractly but operationally.

Trust breaks down in AI systems not because the model is wrong, but because no one can explain why it did what it did. When an agent makes a decision that costs money or triggers an incident, the question isn't "what happened" but "can I trust this system enough to keep running it?"

That question has no honest answer without telemetry capturing the full reasoning chain: every LLM call, every tool invocation, every agent handoff. Trust breaks down in three specific places: with engineers who can't diagnose what they can't see, with business leaders when one unexplainable failure destroys customer confidence, and with regulators who require a full audit trail. Accuracy metrics do not build trust. The ability to answer hard questions under pressure does. We frame what good telemetry requires as complete, usable, and trustworthy. The wide event, up to 2,000 fields per event, is what makes that real.

You framed Honeycomb as helping teams “level up” their understanding of what is happening in production. What does that look like in practice for a CTO organisation?

(CB) In practice, it looks like Canvas, our agent that helps with investigations. An engineer opens an investigation, asks a natural-language question about which LLM tool call is performing the worst, and Honeycomb's Canvas AI co-pilot doesn't just answer; it explores the full event data, surfaces correlated patterns, and guides the investigation with clarifying questions. The CTO of a Fortune 500 retailer used the MCP server to get real-time Black Friday performance insights.

It's not a chatbot that retrieves a result. It's an AI-native investigator that works alongside the engineer. The MCP server extends this further: agents like Claude and Cursor can now query Honeycomb's full observability context, including traces, logs, metrics, SLOs, and triggers, directly from within the integrated development environment. Production context arrives where the engineer already is, not in a separate tab that opens 12 minutes into an incident.

How does Honeycomb enable teams to maintain speed without losing control?

(CB) The answer is high-cardinality, event-based telemetry with no pre-aggregation. Every event, every LLM invocation, every tool call, every agent step, is stored as a queryable span with full attribute context.

You don't decide in advance what is important. You can ask any question, at any time, against any dimension of your data.

Nubank had a classic observability problem. Traditional logs and metrics were fine for debugging individual services, but a heavily microservices-based architecture means they now have thousands of services to manage. Furthermore, new banking regulations in Brazil require them to meet standards for settling transactions in seconds.

To track and tackle latency across systems, Nubank invested in high-resolution, high-quality tracing, which quickly proved essential to meeting their SLAs. In 2026, Nubank will process 8 trillion events through Honeycomb.

Customer Reality: From Theory to Practice

Intercom has been on a multi-year journey with Honeycomb, evolving into an AI-driven platform. What does that journey tell us about how observability needs to evolve alongside AI?

(CB) The Intercom journey shows exactly what I think the next five years of observability looks like in practice. They came to Honeycomb when the questions were relatively conventional: latency, error rates, throughput. As they built Fin into what is now the number one AI agent for customer service, those questions changed completely.

In early 2025, Fin was running on more than half a dozen LLMs simultaneously, and customers were saying it was slow. By connecting every service through Honeycomb and tracking LLM tokens, costs, and performance for each model, they improved speed and cut costs simultaneously. That outcome was not just a win for the engineering team. Product could demonstrate a better customer experience. Finance could see cost efficiency and increased revenue. What that journey tells me is that the teams that will win with AI are not the ones that deployed the most models. They are the ones who built the observability practice that let them understand what those models were doing and improve them continuously.

At the scale of customers like Booking.com and HelloFresh, AI agents are now building software and serving customers simultaneously. What does observability look like when agents are on both sides?

(CB) Organisations like Booking.com and HelloFresh were already operating at a level of maturity most companies aspire to. The challenge is not a traditional visibility gap. It is that AI agents have fundamentally changed the question you need to answer.

When a customer-facing agent behaves unexpectedly, the investigation does not start at the LLM call. It starts somewhere in the infrastructure, runs through the application, through the agent, through every tool invocation and model decision, and ends with a customer experience that was either good or not. Most tools see a slice of that journey and make you stitch the rest together under pressure, across multiple platforms, by people who were not in the room when it was built. What matters is being able to see the full picture in one place. And what becomes incredibly powerful is the ability to replay the complete timeline of what an agent did, decision by decision, in a way that answers the question anyone in the business needs to ask, not just the engineer who built it.

Transaction volumes are increasing dramatically with AI agents on both sides, building systems and using them. How does Honeycomb handle that level of scale that others struggle with?

(GB) Most observability tools handle scale through sampling and pre-aggregation, trading fidelity for cost by deciding ahead of time which questions you might ask and discarding everything else. That worked when transaction volumes were predictable and failure modes were familiar.

AI agents break both assumptions. An agent does not generate one span per request. It generates hundreds, every LLM call, every tool invocation, every retry, every handoff. And because agent behaviour is non-deterministic, you cannot pre-define what you will need to investigate.

Honeycomb was built on a different premise. We store wide, high-cardinality events without pre-aggregation and query raw data at speed. When something degrades across thousands of concurrent agent sessions, you can isolate whether it's a model version, a prompt variant, a specific tool call pattern, or a downstream API, in seconds.

Pre-aggregated systems already threw away the data you need to answer that question.

Scale without fidelity just means you are confidently wrong faster.

What’s a real example where observability has directly changed an outcome for a customer?

(GB) Scribe, the AI documentation platform used by 94% of the Fortune 500, put Honeycomb’s event-based, high-cardinality architecture to the test. After bringing in Honeycomb, they cut the average incident root cause time from 1 hour to 5 minutes and reduced observability costs by 75% without sacrificing visibility as their engineering team scaled.

They're now integrating Honeycomb's MCP server with Claude Code, enabling engineers to investigate production alerts in a conversational way. That's the direction the industry is heading: AI agents that need full production context to do their jobs, served by an observability platform built to provide it.

The Edge: Governance, Not Just Adoption

There is a growing sense that the challenge is not adopting AI, it’s controlling it. Do you agree?

(CB) The challenge is understanding it, not just controlling it. Control implies you can define the boundaries in advance. With AI systems in production, that is not how it works. What you can do is build the infrastructure to observe behaviour, detect deviation, and respond quickly, and that capability needs to be accessible across the organisation, not locked inside the engineering team. The 2025 DORA report found that AI adoption increases software delivery instability even as it increases throughput. That is not an argument against AI. It's an argument for building the right infrastructure alongside it.

The teams navigating this well are not trying to control every output. They are using observability to understand what is happening in real time and course-correct with speed. That is actually a more exciting way to operate. You are not trying to anticipate everything. You are building the capability to respond to anything.

What does “governing AI through observability” actually look like in practice?

(CB) In practice, it looks like applying the same engineering rigour to AI behaviour that teams already apply to conventional systems, and then making that visibility available across the organisation. Every AI action is a traceable span. Every LLM invocation has a queryable record: model, latency, token count, input context, output, and outcome.

A product manager can see how an AI feature performs across different user cohorts. A security team can audit what data an AI agent accessed. A finance team can track what a workflow actually costs to run. But this only works if the foundations are in place. If your telemetry data is incomplete, if your pipelines have gaps, if your governance is ad-hoc, then governing AI through observability is just a good idea on paper. The organisations doing this well invested in the substrate first: mature pipelines, intelligent sampling, consistent context fields. When that foundation is solid, what becomes possible is genuinely remarkable.
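A small Python sketch can show how one per-invocation record serves several audiences. It assumes a simplified record with a subset of the fields listed above; the `LLMInvocation` type, the workflow and model names, and the per-token prices are all hypothetical and do not reflect any real pricing or Honeycomb schema.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class LLMInvocation:
    """One queryable record per LLM call (simplified, hypothetical schema)."""
    workflow: str
    model: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    outcome: str

# Hypothetical prices per 1,000 tokens, for illustration only.
PRICE_PER_1K = {"model-a": 0.5, "model-b": 2.0}

def cost_by_workflow(invocations):
    """Answer the finance question: what does each workflow cost to run?"""
    totals = defaultdict(float)
    for inv in invocations:
        tokens = inv.input_tokens + inv.output_tokens
        totals[inv.workflow] += tokens / 1000 * PRICE_PER_1K[inv.model]
    return dict(totals)

records = [
    LLMInvocation("refund", "model-a", 420.0, 1500, 500, "ok"),
    LLMInvocation("refund", "model-b", 910.0, 800, 200, "ok"),
    LLMInvocation("triage", "model-a", 300.0, 1000, 0, "error"),
]

costs = cost_by_workflow(records)
print(costs)  # {'refund': 3.0, 'triage': 0.5}
```

The same records could be grouped by `outcome` for a reliability view or by `model` for a latency view, which is why consistent context fields on every record matter more than any single report built on top of them.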

If boards expect exponential returns from AI but organisations are only seeing incremental gains, where is the disconnect?

(GB) The gains are definitely exponential. However, so is the cost.

And when I say cost, I don’t just refer to token consumption; I refer to the risk that something goes wrong. If an organisation increases its production code output tenfold, that’s fantastic! But if it also quadruples its production incidents, the cost can be high in terms of customer perception, brand, revenue and even financial penalties.

AI increases change velocity. Change velocity increases the demand for fast, high-fidelity feedback. If your observability practice can't keep up, you are flying faster with a worse instrument panel.

The CTOs who are closing the gap between AI investment and AI ROI are the ones who have invested in observability as the operating system for how their engineering organisation gets smarter over time.

How should CTOs reset expectations internally while still moving forward?

(GB) Frame it as a learning speed problem, not a technology problem.

The question is not "how fast can we ship AI features?"; it's "how fast can we learn from what we ship?"

That reframe is actionable. It creates a clear investment thesis: instrument richly, detect unknown anomalies, reduce mean time to resolution, and build the feedback loops that compound over time.

It also creates a more honest conversation with the board: the organisations that will see exponential AI returns are not the ones that deployed the most models first. They are the ones who built the infrastructure to understand what those models are doing in production.

Looking Ahead: The Future of Observability

Observability was built for a world where humans wrote code and read dashboards. What does it look like in a world of autonomous systems and AI agents?

(CB) The shift is already underway, and it is significant. Dashboards designed for human consumption at a human pace do not work in a world where agents make decisions faster than any dashboard can refresh.

Observability in an agentic world must be legible to both humans and machines and serve more than just the engineering team. Product leaders need to understand how AI features perform across user segments. Security teams need to audit the decisions that AI agents make. Finance teams need to track what those agent workflows actually cost.

What Honeycomb is building is the architecture for all of this: a production environment where every stakeholder and every agent works from the same real-time observability context. The direction with Canvas is from a co-pilot that responds to investigations to a system that initiates them, surfacing what is wrong before you know to look. The organisations building on that architecture now will have a compounding structural advantage.

What capabilities will define the next generation of observability platforms?

(CB) The direction is clear, and the pace of change is genuinely exciting. Natural language investigation, where anyone across the business can describe what they are seeing, and AI navigates toward the cause, is moving from a premium feature to a baseline expectation fast.

Extending observability context to AI agents directly through MCP is already happening and will become standard. Setting meaningful SLOs on non-deterministic systems, defining what "good" looks like for an LLM-powered workflow and alerting when it degrades, is a problem we are actively solving right now. But the area I'm most excited about is agent observability. It is the only genuinely greenfield space in observability today.

Every other area has established patterns and tooling. Observing autonomous agents, their decisions, their tool calls, and their full timeline of reasoning across a production environment, at scale and in real time, is largely unsolved. The platforms that get there first won't just define the next generation of observability. They will define how AI systems are trusted and operated at scale.

As Honeycomb continues to scale across EMEA, what are your priorities for the next 12–24 months?

(GB) Three things. First, ensuring that every AI agent an enterprise deploys, whether built on Claude, Cursor, or otherwise, has full, real-time access to production observability context through the Honeycomb MCP server. The agent era isn't coming; it's here.

Second, continuing to expand Private Cloud capabilities for organisations navigating data residency and compliance constraints, particularly in financial services and insurance, where hybrid deployment isn't optional.

Third, deepening the Canvas investigative experience so that the gap between "something is wrong" and "here is exactly what failed and why" continues to narrow. The mission is unchanged: no engineer and no agent should ever hit a dead end again.

If you had to write the headline for Honeycomb’s next chapter, what would it be?

"The enterprises that win the AI era won't be the ones who shipped the most; they will be the ones who created the best IRL customer and agentic experience."

Honeycomb is the platform that makes that possible.

In one sentence: What’s the biggest misconception enterprise leaders still have about AI today?

The biggest misconception is that AI is a code-generation problem, when it's actually a production-understanding problem: the moment AI-generated code ships, you need systems that can observe, interrogate, and learn from it at the same velocity at which it was written.
