
Challenges and considerations when implementing distributed tracing

Organizations these days are focused on building microservices at scale. In the race to build agile microservices quickly, speed and scalability are often the only two aspects being prioritized. But it’s important to remember that the architecture of modern applications is complex, so complex that it has fundamentally broken classic methods of debugging and troubleshooting.

Even the simplest request in a modern application is processed across various systems before it is serviced. To make matters worse, the number of simultaneous users has also jumped significantly.

This is where distributed tracing comes into play and helps mend the holes in our tools. It brings to light everything that happens across a microservice architecture. Distributed tracing has become a baseline necessity for building distributed applications. But as is often the case, distributed tracing is not the be-all and end-all solution to our microservice-related woes. It also comes with its own challenges, which we aim to explore in this article.

This article touches on the concept of distributed tracing and briefly explains the anatomy of a trace and how distributed tracing works. We wrap it up by explaining some of the key challenges of distributed tracing.

A brief introduction to distributed tracing

Distributed tracing monitors microservices-based applications from frontend devices to backend services to observe requests propagating through distributed systems. It enables developers to identify performance issues as they can follow a single request in motion through an entire system.

Distributed tracing helps you understand how services coordinate to handle individual user requests. Each user request is tracked end-to-end and assigned a unique trace ID and associated trace data. This lets you see how each service behaves while processing a request.

Distributed tracing helps you gain crucial insights into the status and performance of individual services in a chain of requests. It also gives you additional context that wouldn’t be possible with traditional metrics and logging. 

Anatomy of a trace

A single trace contains multiple units of work, recorded as a series of tagged time intervals called spans. A span is characterized by the API that was called, the date and time at which it started, the time taken, and the time at which the API execution ended. This metadata is recorded as tags, which help contextualize a span. Spans are structured like a nesting doll: a parent span contains the child spans for the work it triggers. The overall flow of spans for one request is called a trace.
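To make the nesting concrete, here is a minimal Python sketch of a span as a data structure. The field names, the `depth` helper, and the sample operations are illustrative assumptions, not the schema of any particular tracing product.

```python
from dataclasses import dataclass, field
from typing import Optional
import uuid


@dataclass
class Span:
    """One timed unit of work inside a trace (illustrative field names)."""
    trace_id: str                 # shared by every span in the same trace
    name: str                     # the API or operation that was called
    start_time: float             # start timestamp, in seconds
    duration: float               # time taken, in seconds
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None           # None marks the outermost span
    tags: dict = field(default_factory=dict)  # metadata that gives context

    @property
    def end_time(self) -> float:
        return self.start_time + self.duration


def depth(span: Span, by_id: dict) -> int:
    """Count a span's ancestors, i.e. how deeply it is nested."""
    d = 0
    while span.parent_id is not None:
        span = by_id[span.parent_id]
        d += 1
    return d


trace_id = uuid.uuid4().hex
root = Span(trace_id, "GET /checkout", start_time=0.0, duration=0.120)
auth = Span(trace_id, "auth.verify", 0.005, 0.020, parent_id=root.span_id)
db = Span(trace_id, "db.query", 0.010, 0.008, parent_id=auth.span_id,
          tags={"db.table": "sessions"})

# Print the trace as an indented tree, one level per nesting depth.
by_id = {s.span_id: s for s in (root, auth, db)}
for s in (root, auth, db):
    print("  " * depth(s, by_id) + f"{s.name} ({s.duration * 1000:.0f} ms)")
```

All three spans share one trace ID, and each child points at its parent through `parent_id`, which is what lets a tracing tool rebuild the nesting-doll structure.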

How does distributed tracing work?

End-to-end distributed tracing systems start collecting data when a user request is launched. At that point, a unique trace ID and the parent span are generated. The request’s entire execution path is displayed in a trace. Each time the request enters a service, a top-level child span is produced for that service.

The distributed tracing platform then tags each span with the initial trace ID, a unique span ID, essential metadata, duration, and error data. Finally, all the spans are represented in a flame graph.

Challenges when implementing distributed tracing

1. Implementation

The first obstacle is to be able to generate trace data. Your software must be structured to accept the instrumentation code needed to emit tracing data. When it comes to instrumentation, there are two options: manual instrumentation and auto instrumentation. 

Manual instrumentation requires you to write code by hand to record events, which increases coding and testing time. And because someone has to decide which parts of the code to instrument, any part left out of that standard produces missing traces. Auto instrumentation, on the other hand, tends to either capture far more detail than you need or provide too little information.

Additionally, instrumenting an existing codebase is a big challenge. Adding distributed tracing to an existing application requires an understanding of the complete app stack. This is a tedious task that demands a lot of effort before it pays off in troubleshooting. Creating too many tags can also hurt application performance.
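The trade-off between the two approaches can be illustrated with a small Python sketch. The `traced` decorator below only approximates what auto instrumentation does (real agents typically hook into frameworks and libraries for you), and `recorded`, `charge_card`, and `send_receipt` are hypothetical names used for illustration.

```python
import functools
import time

recorded = []  # stand-in for an exporter; a real tracer would send these out


def traced(func):
    """Wrap a function so every call records a span, without editing its body."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            recorded.append({
                "name": func.__qualname__,
                "duration": time.perf_counter() - start,
            })
    return wrapper


# Manual instrumentation: timing code written by hand inside the function.
def charge_card(amount):
    start = time.perf_counter()
    result = f"charged {amount}"
    recorded.append({
        "name": "charge_card",
        "duration": time.perf_counter() - start,
        "tags": {"amount": amount},   # business context only the author knows
    })
    return result


# Decorator-based instrumentation: uniform, but with generic details only.
@traced
def send_receipt(email):
    return f"sent to {email}"


charge_card(42)
send_receipt("user@example.com")
print([r["name"] for r in recorded])
```

The hand-written version carries a business-specific tag but adds code to every function that needs it. The wrapped version costs one line per function, yet it records only what a generic wrapper can see.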

2. Size and frequency

Now that you can generate trace data, how will you collect and store it? Microservices handle high volumes of transactions, and attaching a trace ID and span data to every one of them sharply increases the amount of data collected. In a large system, trace data may arrive at rates on the order of a million events per second, and storage costs rise along with it. You will need to decide two things: what to keep and for how long, and how to scale data collection to keep up with the requests your services receive.
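One common way to answer the "what to keep" question is sampling: keep only a fraction of traces. The sketch below is one possible approach, not something the challenges above prescribe; the `keep_trace` function and the 1% rate are assumptions chosen for illustration.

```python
import hashlib
import uuid


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Decide from the trace ID alone whether to keep a trace.

    Hashing the ID maps it to a 64-bit bucket. Every service that sees the
    same trace ID reaches the same decision, so traces are kept or dropped
    whole rather than span by span.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")   # 0 <= bucket < 2**64
    return bucket < sample_rate * 2**64


trace_ids = [uuid.uuid4().hex for _ in range(100_000)]
kept = sum(keep_trace(t, 0.01) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Because the decision depends only on the trace ID, no coordination between services is needed. The cost is that rare failures can be sampled away, which is why some teams instead decide after a trace completes and always keep the ones that contain errors.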

3. Value

Now that you have generated trace data and figured out how to collect and store it, how do you obtain value from it? You need a way to convert the raw data into something actionable, such as a view of which services are slow or failing.
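As a rough illustration of turning raw spans into something actionable, the Python sketch below rolls spans up into per-service latency and error figures. The span tuples and service names are made-up sample data, and the chosen statistics are only one possible summary.

```python
from collections import defaultdict

# Hypothetical raw spans: (service, duration in ms, had_error)
spans = [
    ("checkout", 120, False), ("checkout", 95, False), ("checkout", 480, True),
    ("inventory", 12, False), ("inventory", 15, False), ("inventory", 11, False),
    ("payments", 210, False), ("payments", 650, True), ("payments", 190, False),
]

by_service = defaultdict(list)
for service, duration_ms, had_error in spans:
    by_service[service].append((duration_ms, had_error))

summary = {}
for service, rows in by_service.items():
    durations = [d for d, _ in rows]
    summary[service] = {
        "count": len(rows),
        "max_ms": max(durations),
        "mean_ms": sum(durations) / len(durations),
        "error_rate": sum(1 for _, e in rows if e) / len(rows),
    }

# Rank services so the slowest ones surface first.
for service, stats in sorted(summary.items(),
                             key=lambda kv: kv[1]["max_ms"], reverse=True):
    print(service, stats)
```

A summary like this points at where to look. The individual traces behind the worst figures then show why a given request was slow or failed.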

4. Backend coverage 

Unless you use an end-to-end distributed tracing platform, a trace ID is generated only when a request reaches a backend service. Anything that happened earlier, on the frontend, is not part of the trace, which makes it difficult to determine the root cause of a bad request.
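One way to close this gap is to mint the trace ID at the edge and hand it to the backend in a request header. The sketch below uses the header layout from the W3C Trace Context specification (`traceparent`, version `00`). The helper names are hypothetical, and the header lookup is simplified to a plain lowercase dictionary key.

```python
import re
import secrets

# Layout of the W3C Trace Context "traceparent" header, version 00:
# version - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Mint a trace ID at the edge (e.g. in the browser or mobile client)."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def continue_trace(headers: dict) -> str:
    """Reuse the caller's trace ID if present; otherwise start a new trace."""
    match = TRACEPARENT.match(headers.get("traceparent", ""))
    # All-zero trace or parent IDs are invalid, so treat them as missing.
    if match is None or set(match.group(1)) == {"0"} or set(match.group(2)) == {"0"}:
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    # Same trace ID, fresh span ID for the work done in this service.
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


frontend_header = new_traceparent()
backend_header = continue_trace({"traceparent": frontend_header})
print(frontend_header)
print(backend_header)
```

Both headers carry the same trace ID, so frontend and backend spans land in one trace. A backend that receives no header, or an invalid one, falls back to starting a new trace, which is the situation described above.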

Final thoughts 

With modern applications getting more complicated with each passing day, distributed tracing is a crucial tool as it grants you complete visibility into the operation of your microservice architecture. But that being said, you cannot turn a blind eye to the challenges it brings.

To fully enjoy the benefits of distributed tracing, adopt it on a need-to-have basis. Once your microservice infrastructure has reached a certain level of observability maturity, you have a better chance of reaping those benefits.

