Millions of Rides, Minimal Lag: How Safeboda Manages Real-Time Data with Apache Kafka

Scott Noel

April 11, 2024

Imagine this: thousands of rides happening simultaneously across a bustling city. Riders requesting pickups, drivers accepting pings, fares being calculated — it’s a constant stream of data. At Safeboda, we handle this massive data flow with the help of a powerful technology called Apache Kafka. Let’s dive into how Kafka keeps our ride-hailing operation running smoothly and efficiently.

First up, we track our drivers all over the city, typically grabbing their location every five seconds. This driver location data gets used in a few ways. We use it to see where our drivers are physically located and strategically move them towards areas with high demand for rides. This helps us visualize driver availability through heatmaps. But the most important reason is to match riders with the closest driver.

For driver-rider matching based on GPS location, we mainly rely on in-memory data. That’s because we really only care about a driver’s and rider’s most recent GPS coordinates. We don’t need to store historical location data in memory, so a lean Redis cache is perfect for the job. This in-memory matching approach is similar to what Lyft does. We also use Envoy proxy to efficiently distribute requests and build a fault-tolerant system with high uptime.

Imagine you’re us, juggling hundreds of drivers and riders across the city. To make sure everyone gets matched quickly, we need a super-fast way to keep track of where everyone is. That’s where Redis comes in. Think of it as our in-memory cheat sheet. It stores the latest location data for both drivers and riders, so when a ride request pops up, the system can instantly check who’s closest. It’s like having a constantly updated map in our head, but way more precise.
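To make the in-memory matching concrete, here’s a minimal sketch in Python. In production the lookup would live in Redis (for example via its GEOADD/GEOSEARCH commands); here a plain dictionary stands in for the cache, and the driver IDs and coordinates are invented for illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two GPS points, in kilometres."""
    r = 6371.0  # Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Latest known driver positions: a Redis-like in-memory map.
# Driver IDs and coordinates are made up for illustration.
driver_positions = {
    "driver_17": (0.3136, 32.5811),
    "driver_42": (0.3476, 32.5825),
    "driver_88": (0.2806, 32.6158),
}

def closest_driver(rider_lat, rider_lon):
    """Return (driver_id, distance_km) for the nearest available driver."""
    return min(
        ((d, haversine_km(rider_lat, rider_lon, lat, lon))
         for d, (lat, lon) in driver_positions.items()),
        key=lambda pair: pair[1],
    )
```

Because only the latest coordinates matter, every ride request reduces to one cheap scan (or, with Redis geo indexes, a single radius query) over current positions.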

But here’s the thing: just like a traditional, human-staffed dispatch center, we need more than one dispatcher. That’s where Envoy proxy comes in as the traffic cop. It efficiently routes all the incoming ride requests across our dispatcher service instances, making sure everything runs smoothly. No bottlenecks, no crashes, just riders and drivers getting connected fast.

So, to sum it up, Redis acts like our super-fast memory, keeping track of driver and rider locations in real-time. Envoy proxy plays the role of the ever-reliable traffic cop, efficiently shuttling ride requests to the dispatcher. Together, they keep the dispatching system running like a well-oiled machine.
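As a rough illustration, an Envoy setup for this role might look like the following minimal config: a listener that routes all incoming HTTP ride requests to a dispatcher cluster, load-balanced round-robin across several dispatcher instances. The hostnames, ports, and cluster names are placeholders, not our actual deployment.

```yaml
static_resources:
  listeners:
  - name: ride_request_listener
    address:
      socket_address: { address: 0.0.0.0, port_value: 8080 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          route_config:
            virtual_hosts:
            - name: dispatch
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route: { cluster: dispatcher }
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: dispatcher
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN   # spread requests evenly across dispatchers
    load_assignment:
      cluster_name: dispatcher
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: dispatcher-1.internal, port_value: 9000 }
        - endpoint:
            address:
              socket_address: { address: dispatcher-2.internal, port_value: 9000 }
```

Adding dispatcher capacity then becomes a matter of adding endpoints to the cluster, with Envoy spreading the load automatically.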

Here’s where Kafka comes in: eventual consistency is key when we need to send this GPS data, along with other ride-related events, to downstream services. For instance, when a driver accepts a ride ping on their app, we need to initiate a sequence of actions. We record the acceptance time and location, and this information needs to be sent downstream to an ETA (estimated time of arrival) service. A few seconds of delay in the ETA showing up on the passenger’s app is likely acceptable from a user experience standpoint. So, we push this trip-accepted event and its details through Kafka, which acts as a messaging queue.

Essentially, Kafka is built to ingest high volumes of event streams and allows various downstream systems to consume these events. Our ETA service, for example, can pull the trip-accepted data out of the Kafka queue. This event-driven architecture also lets other systems, like our BI and analytics platform, consume the same data to analyze trip acceptance patterns. This is just one example, but you get the idea. With an event-driven architecture, we have dozens of trip-related actions and events (trip start, trip end, trip billing, etc.) that need to be communicated, and Kafka is our go-to tool for these high-throughput scenarios.
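A toy sketch of that flow: the kind of payload a trip-accepted event might carry, and the sort of ETA calculation a downstream consumer could run on it after pulling it off the queue. The field names and the average-speed figure are illustrative, not our production schema.

```python
import json
import time

def make_trip_accepted_event(trip_id, driver_id, lat, lon):
    """Build the payload we'd publish when a driver accepts a ping.
    Field names here are illustrative, not an exact production schema."""
    return {
        "event_type": "trip_accepted",
        "trip_id": trip_id,
        "driver_id": driver_id,
        "accepted_at": time.time(),
        "location": {"lat": lat, "lon": lon},
    }

def eta_minutes(event, distance_to_rider_km, avg_speed_kmh=25.0):
    """Toy ETA consumer: a downstream service would pull the event off a
    trip-events topic and compute an ETA roughly like this."""
    assert event["event_type"] == "trip_accepted"
    return 60.0 * distance_to_rider_km / avg_speed_kmh

event = make_trip_accepted_event("trip_901", "driver_17", 0.3136, 32.5811)
payload = json.dumps(event)  # serialized form that would be sent to Kafka
```

The key point is that the producer only builds and publishes the event; the ETA math, billing, and analytics all happen independently on the consumer side.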

Here’s a summary of why we use Kafka:

Real-time updates with high throughput — This is our daily routine — rush hour hits, and we’re flooded with ride requests. Kafka can handle that surge in data volume without breaking a sweat. This smooths out our driver and rider location updates so both apps stay updated in real-time, minimizing delays between a driver accepting a ping and the trip details showing up in the passenger’s app.

Scalable architecture for future growth — Safeboda’s ridership is on an upward trajectory! Kafka’s a perfect fit because it can easily scale up its operations to accommodate an influx of rides and riders on our platform. We won’t have to sweat about system crashes due to unexpected spikes in demand.

Event streaming versatility — Kafka isn’t a one-use-case technology. It excels at processing all sorts of event streams, not just location updates. Need to send trip confirmation details, price estimates, or cancellation notifications to various microservices within Safeboda’s architecture? Kafka handles it all with ease.

Rock-solid fault tolerance — Even the most robust systems experience hiccups sometimes. But with Kafka, even if a server has a meltdown, our data isn’t toast. Kafka persists messages to disk and replicates them across brokers, so consumers simply pick up where they left off once things are back up and running. This ensures minimal downtime, no lost ride data, and a consistently smooth experience for our passengers and drivers.

Data-driven decision making — Want to identify high-demand zones or optimize trip routing based on real-time data? Kafka retains all this event data, making it a goldmine for analysis. We can leverage these insights to optimize our services and keep Safeboda at the forefront of the ride-hailing game.

Here are some other considerations that will help you understand how, specifically, we use Apache Kafka to meet our needs.

Offloading Infrastructure Management with Confluent Cloud

Managing our own Kafka cluster can be a resource-intensive task. That’s why Safeboda utilizes Confluent Cloud, a hosted service by the creators of Kafka itself. Confluent Cloud handles the heavy lifting of cluster provisioning, maintenance, and scaling, freeing up our engineering team to focus on core application development. This allows us to scale our Kafka usage seamlessly as our ride-hailing needs grow. At some point we may need to manage our own clusters internally, but we’re not at that scale just yet.

Event Stream Management with Topics

Kafka organizes data into categories called topics. Think of topics as channels specifically designed for different ride-related events. For instance, we have a topic for “driver_location_updates,” another for “trip_events,” and so on. Producers, like our driver and passenger apps, publish events to these topics. Downstream services, such as our ETA calculator or billing service, act as consumers, subscribing to relevant topics to receive the event data they need. This topic-based architecture keeps data streams organized and ensures each service receives only the data pertinent to its function.
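The topic model can be sketched with a tiny in-memory broker: consumers subscribe to named topics and only ever see events published to those topics. This is a simulation of the routing behaviour for illustration, not Kafka itself, and the service names are made up.

```python
from collections import defaultdict

class MiniBroker:
    """Tiny in-memory stand-in for Kafka's topic model: producers publish
    to named topics; each consumer sees only topics it subscribed to."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.subscribers[topic]:
            handler(event)

broker = MiniBroker()
eta_inbox, billing_inbox = [], []
broker.subscribe("trip_events", eta_inbox.append)      # ETA service
broker.subscribe("trip_events", billing_inbox.append)  # billing service

# The location update goes to a topic neither service subscribed to.
broker.publish("driver_location_updates", {"driver_id": "driver_17"})
broker.publish("trip_events", {"event_type": "trip_accepted"})
```

Both the ETA and billing inboxes receive the trip event, while the location update passes them by entirely, which is exactly the isolation topics buy us.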

Ensuring Delivery with Acknowledgement Strategies

Data reliability is paramount at Safeboda. Here’s where Kafka’s acknowledgement strategies come in. When a producer publishes an event, Kafka offers various acknowledgement levels to confirm delivery. We can choose a setting like acks=1, which guarantees the event is written to the leader replica of the partition for a specific topic. Alternatively, for stricter consistency, we can use acks=all to ensure all in-sync replicas within the partition receive the event before the producer considers it delivered. Selecting the appropriate acknowledgement strategy depends on the criticality of the data and our desired balance between speed and reliability.
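In the standard Kafka clients this choice maps to the `acks` producer setting (“1” for leader-only, “all” for every in-sync replica). A small helper sketching that decision might look like the following; the broker address is a placeholder, not a real endpoint.

```python
def producer_config(critical: bool) -> dict:
    """Pick a Kafka producer 'acks' level by data criticality.
    'acks' is the real client setting; the broker address is a placeholder."""
    return {
        "bootstrap.servers": "broker.example.internal:9092",
        # "1": only the partition leader must confirm (faster).
        # "all": every in-sync replica must confirm (safer, higher latency).
        "acks": "all" if critical else "1",
    }
```

Trip-billing events would get the stricter setting; high-frequency location pings can tolerate the faster one.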

Guaranteeing Data Durability with Redundancy

Data loss is unacceptable. To safeguard against failures, Safeboda leverages Kafka’s redundancy patterns. By replicating data across multiple brokers (servers) within a Kafka cluster, we ensure that even if one node goes down, the data remains available on other replicas. Additionally, Kafka allows us to configure the number of replicas for each partition within a topic. This lets us fine-tune redundancy based on the importance of the event stream.
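The effect of replication can be sketched in a few lines: write each event to N broker logs, lose one broker, and the event still survives on the others. This is an illustration of the idea only, not Kafka’s actual replication protocol.

```python
def replicate(event, brokers, replication_factor=3):
    """Append the event to `replication_factor` broker logs, mimicking
    how a Kafka partition is replicated across nodes."""
    for log in brokers[:replication_factor]:
        log.append(event)

brokers = [[], [], [], []]             # four broker logs in the cluster
replicate({"event": "trip_start"}, brokers)

brokers[0].clear()                     # simulate one broker going down
survivors = [log for log in brokers if log]
```

With a replication factor of three, a single node failure still leaves two live copies of every trip event.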

You might be wondering why we choose Apache Kafka for ride-related events in our message broker pipeline, especially when other technologies exist. The answer lies in the specific needs of each system.

Take our financial systems, for example. Here, every cent counts. That’s where a messaging queue technology such as RabbitMQ reigns supreme for sending financial messages. It offers features like guaranteed delivery and transactional messaging, ensuring critical financial data reaches its destination with the utmost reliability. Here’s a deeper dive into when we choose each technology:

Guaranteed Delivery — With RabbitMQ, messages aren’t delivered “maybe” or “most of the time.” RabbitMQ offers rock-solid message persistence and confirmations. This means a message isn’t considered sent until it’s written to disk and acknowledged by the receiving queue. No ambiguity, just guaranteed delivery, perfect for critical financial transactions such as payments with Mpesa. Kafka, while reliable, offers configurable levels of acknowledgement, which might not be ideal for financial data where every message is essential.

Transactions on the Queue Level — Think of a financial transaction as a series of steps, like debiting one account and crediting another. RabbitMQ supports message transactions. This ensures that all messages within a transaction are processed successfully or none at all. This all-or-nothing approach provides a strong transactional safety net for our financial operations. Kafka, while capable of handling some transactional semantics, requires more complex configurations for achieving the same level of guaranteed message ordering and completion that RabbitMQ offers by default.

Prioritization for Urgent Messages — Financial systems sometimes deal with time-sensitive situations. RabbitMQ lets us prioritize messages. Urgent transactions, like fraud alerts, can jump the queue ahead of non-critical messages. This ensures that the most important financial actions are processed first, minimizing delays and potential issues. Kafka, by contrast, focuses on high throughput and offers no built-in, granular message prioritization the way RabbitMQ does.
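The priority semantics above can be simulated with a plain heap, where lower numbers drain first. This mimics RabbitMQ-style priority queues conceptually; it is not RabbitMQ client code, and the message categories are invented for illustration.

```python
import heapq
import itertools

# Lower number = higher priority, mirroring priority-queue semantics.
FRAUD_ALERT, PAYMENT, RECEIPT_EMAIL = 0, 1, 2

_counter = itertools.count()  # tie-breaker keeps FIFO order per priority
queue = []

def enqueue(priority, message):
    heapq.heappush(queue, (priority, next(_counter), message))

def drain():
    """Pop all messages in priority order (FIFO within a priority)."""
    order = []
    while queue:
        _, _, message = heapq.heappop(queue)
        order.append(message)
    return order

enqueue(RECEIPT_EMAIL, "send receipt")
enqueue(PAYMENT, "settle payment")
enqueue(FRAUD_ALERT, "freeze suspicious account")
```

Even though the receipt was enqueued first, the fraud alert drains ahead of everything else, which is the behaviour a financial pipeline needs.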

Kafka for the Rides — Built for Speed and Volume

On the other hand, our ride-hailing system thrives on Kafka’s strength in handling massive data volumes. While Kafka offers reliable delivery options, the focus is on high throughput — perfectly suited for the constant stream of ride requests, driver locations, and other event data we juggle. RabbitMQ, while reliable, might introduce some latency for our fast-paced ride-hailing needs.

So, to recap: RabbitMQ’s guaranteed delivery, transactional messaging, and built-in prioritization make it the perfect fit for our secure and reliable financial message exchange. Kafka, on the other hand, excels at the high-speed, high-volume event streams that keep our ride-hailing operations running smoothly. It’s all about choosing the right tool for the job!

At Safeboda, we’re constantly innovating to build the smoothest and most efficient ride-hailing experience possible. By leveraging a combination of technologies like Redis, Envoy Proxy, and Apache Kafka, we ensure real-time data management, fault tolerance, and high throughput. This allows us to seamlessly match riders with drivers, optimize routes, and gather valuable insights for future growth. Riders who use us know that when we have many drivers online, we’ll find them a driver very quickly. And as our reach expands, we’re committed to staying at the forefront of technological advancements to keep Safeboda the leading ride-hailing platform in Africa.