On the importance of having redundant data pipelines

Vojtech Tuma
6 min read · Mar 17, 2019


A typical question for any enterprise today is how to get data from the endpoints (such as the software running on customers’ devices) to the cloud or data lake and turn it into something useful: reports, insights, or automated decisions. Maybe you are lucky and have just a small piece of code on the endpoint that sends what is basically a database insert. Or you actually live in this century and your situation looks more like this:

  1. the endpoint software consists of multiple pieces, such as the code that detects that something has happened, the code that decides the thing is worth reporting, and the code that does the sending itself, all only loosely connected;
  2. there is a reception service in the cloud, writing all incoming data to some event queue like Kafka;
  3. there is some queue processor which either shows some realtime results or makes decisions and publishes them back to the queue to reach the clients somehow;
  4. some service shovels the data into an analytics-friendly realm such as a database, on top of which analysts and regular reports do further work.

And, as time flows, you will be supporting multiple versions of all the components, each having to cope with the others in a backward/forward compatible mode. A minimal sketch of the reception service from step 2 appears below.
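
The sketch uses Python and Flask, with an in-memory list standing in for Kafka; the endpoint path, the required field names, and the validation rule are illustrative assumptions rather than a prescription.

```python
# A minimal sketch of the reception service from step 2: it accepts endpoint
# events over HTTP, checks the common fields, and appends accepted events to
# a queue (here an in-memory list standing in for a Kafka topic).
from flask import Flask, request, jsonify

app = Flask(__name__)
event_queue = []  # stand-in for a Kafka topic

REQUIRED_FIELDS = {"eventType", "reportingComponent", "version"}

@app.route("/events", methods=["POST"])  # hypothetical path
def receive_event():
    event = request.get_json(silent=True) or {}
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        # Rejecting (and counting) malformed events makes later debugging easier.
        return jsonify({"error": f"missing fields: {sorted(missing)}"}), 400
    event_queue.append(event)  # in reality: produce to the event queue
    return jsonify({"status": "accepted"}), 202

if __name__ == "__main__":
    app.run(port=8080)
```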

What we are going to address in this article is that there is only a slim chance that the data delivery & reporting pipeline stays free of errors over time; and we will argue that multiple independent pipelines help detect those errors as soon as possible. Let us start with some examples of potential issues.

What can go wrong in Data Delivery

A sizeable category is that of purely client-side errors; to name just a few:

  • authors of a new event type’s reporting are not aware of the common reporting library, implement it anew, and forget about, e.g., retrying on network failure (a sketch of such a shared helper follows this list);
  • authors of a new event type’s reporting use the common reporting library but fill the common fields (such as eventType, reportingComponent, version) incorrectly;
  • an upgrade of the reporting library stops being backwards compatible with the reporting of a few event types because it starts enforcing that some fields be filled;
  • a change of global configuration unexpectedly stops the reporting of some event types, because that reporting relied on the wrong configuration element, which had previously always been set to True.
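
As a hedged illustration of the first two bullets, here is roughly what a shared reporting helper could look like: it fills the common fields in one place and retries on network failure, so individual event authors can forget neither. The endpoint URL, field names, and retry policy are assumptions made up for this sketch, not any real library’s API.

```python
# A sketch of a common client-side reporting helper.
import json
import time
import urllib.error
import urllib.request

REPORTING_URL = "https://example.invalid/events"  # hypothetical endpoint

def report_event(event_type: str, payload: dict, component: str, version: str,
                 retries: int = 3, backoff_s: float = 2.0) -> bool:
    # The common fields are filled here, in exactly one place.
    event = {
        "eventType": event_type,
        "reportingComponent": component,
        "version": version,
        **payload,
    }
    data = json.dumps(event).encode("utf-8")
    for attempt in range(retries):
        try:
            req = urllib.request.Request(
                REPORTING_URL, data=data,
                headers={"Content-Type": "application/json"})
            with urllib.request.urlopen(req, timeout=10):
                return True
        except urllib.error.URLError:
            time.sleep(backoff_s * (attempt + 1))  # simple linear backoff
    return False  # the caller may persist the event and retry later
```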

Similarly, purely server-side errors are not to be left behind:

  • in a few instances of the services the update to the new version silently failed or was not done at all;
  • the testing/stage instances have been misconfigured to send their data into production instances of the next service in the pipeline;
  • when adding support for forwarding a new event type in a service which converts data from one format to another, the developer copy-pastes existing code and not all fields end up converted (a mapping-driven alternative is sketched after this list);
  • a refactoring of the data reception code causes a particular event type to be silently dropped because some field is not filled.
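
One way to make the copy-paste failure mode visible, sketched below with invented event types and field names, is to drive the conversion from an explicit per-event-type field mapping and to report any field the mapping does not cover instead of silently dropping it.

```python
# A sketch of mapping-driven format conversion; event types and field names
# are made up for illustration.
FIELD_MAP = {
    "heartbeat": {"eventType": "type", "deviceId": "device_id", "ts": "timestamp"},
    "crash":     {"eventType": "type", "deviceId": "device_id", "stack": "stacktrace"},
}

def convert(event: dict) -> dict:
    mapping = FIELD_MAP[event["eventType"]]
    converted = {target: event[source]
                 for source, target in mapping.items() if source in event}
    leftover = set(event) - set(mapping)
    if leftover:
        # Surfacing unmapped fields is what catches the "forgot a field" bug.
        print(f"warning: {event['eventType']} has unmapped fields: {sorted(leftover)}")
    return converted

print(convert({"eventType": "heartbeat", "deviceId": "abc",
               "ts": 1552780800, "battery": 0.7}))  # 'battery' triggers the warning
```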

To make matters worse (at least in terms of the “Who to blame?” question), mere incompatibility between server and client can lead to errors:

  • uncoordinated schema update — client started sending fields that some cloud service is not aware of;
  • encryption problem — client and server can’t agree on a cipher suite;
  • undefined contract — client developers view particular service as general-purpose for reporting while the service developers view it as specialised, and thus each party has a different contract, and undefined behaviour results;
  • semantics misunderstanding — a field named “MAC address” may hold anything from raw bytes, to their hex representation (upper case vs lower case, with or without separating colons), to their base64 representation, and if client and server interpret it differently, undefined behaviour results (see the normalization sketch after this list).
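
To illustrate the last bullet, the sketch below normalizes the various plausible representations of a MAC address to a single canonical form at the service boundary. The accepted formats and the canonical form chosen here are assumptions; the point is only that the contract has to be pinned down somewhere.

```python
# A sketch of normalizing "MAC address" values that may arrive as raw bytes,
# hex (any case, with or without separators), or base64.
import base64
import binascii

def normalize_mac(value) -> str:
    """Return the MAC as lower-case colon-separated hex, e.g. '00:1a:2b:3c:4d:5e'."""
    if isinstance(value, (bytes, bytearray)) and len(value) == 6:
        raw = bytes(value)
    else:
        text = str(value).strip()
        try:
            raw = binascii.unhexlify(text.replace(":", "").replace("-", ""))
        except binascii.Error:
            raw = base64.b64decode(text)  # fall back to base64
    if len(raw) != 6:
        raise ValueError(f"not a MAC address: {value!r}")
    return ":".join(f"{b:02x}" for b in raw)

# All of these denote the same address, yet compare unequal as raw strings:
print(normalize_mac("00:1A:2B:3C:4D:5E"))
print(normalize_mac("001a2b3c4d5e"))
print(normalize_mac(base64.b64encode(bytes.fromhex("001a2b3c4d5e")).decode()))
```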

And finally, the ultimate data consumer (e.g., the analyst or the decision-making component) suffers from any change that is not properly communicated:

  • deprecation of a field and its replacement with a new one causes a seeming data loss;
  • addition of a new event type without communicating how to distinguish it causes a seeming data increase;
  • a change of service semantics from exactly-once to at-least-once without introducing de-duplication further down the pipeline causes a seeming data increase (a de-duplication sketch follows this list);
  • version-dependent semantics;
  • a marketing campaign changes users’ behaviour, with an effect on the data that freaks out both operators and analysts, who mistake it for a genuine anomaly (assuming some monitoring is in place).
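
For the at-least-once bullet above, a de-duplication step further down the pipeline can be as simple as remembering recently seen event ids; the id field name and the bounded window below are assumptions of this sketch.

```python
# A minimal sketch of de-duplication after a switch to at-least-once delivery.
from collections import OrderedDict

class Deduplicator:
    def __init__(self, max_remembered: int = 100_000):
        self._seen = OrderedDict()  # event id -> None, ordered by arrival
        self._max = max_remembered

    def is_new(self, event_id: str) -> bool:
        if event_id in self._seen:
            return False
        self._seen[event_id] = None
        if len(self._seen) > self._max:  # forget the oldest ids
            self._seen.popitem(last=False)
        return True

dedup = Deduplicator()
events = [{"eventId": "e1"}, {"eventId": "e2"}, {"eventId": "e1"}]  # e1 redelivered
unique = [e for e in events if dedup.is_new(e["eventId"])]
print(len(unique))  # 2, not 3
```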

All these examples are based on my own experience, and they should persuade you that anything can fail. Also, do not forget that you may have different pieces of software on the endpoints (possibly even on different platforms) reporting to the same services, which makes coordination much harder. Broadly speaking, you can have data loss, data increase, data corruption, or a semantics change; each can be sudden or gradual, merely seeming or real, and can affect either just a fraction of the data or everything. Typically, a seeming, fractional, gradual semantics change is the worst in terms of detection; but in terms of damage, all of them can wreak havoc unbound.

Remedy

So, after all the fearmongering, which should not really surprise anyone, what can one do to retain sanity? None of the examples is caused by a particular technology or language being used — as long as there are multiple human parties involved, there is a huge opportunity for erring.

Let us focus, in this world of constant change, on the things that remain the same: in particular, the goals of the data collection and processing in the first place. For instance, you may be interested in counting the number of your active endpoints via heartbeating; and this will be your goal regardless of any changes to any of the components in the pipeline. Now, if you can design two completely independent, redundant pipelines that both serve this goal, any error in either of them will be revealed by comparison with the other, unless the other suffers an error with the same effect at a similar time (unlikely). In particular, for the heartbeat example, if you can have two client libraries reporting heartbeats and two different sets of cloud services turning that data into numbers, you have a solution robust to all the aforementioned errors.
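
The cross-check itself can stay very simple. The sketch below compares the daily active-endpoint count produced by each pipeline and complains when the relative divergence exceeds a tolerance; where the two counts come from, and what the right tolerance is, are left as assumptions.

```python
# A sketch of comparing the same metric computed by two independent pipelines.
def check_divergence(rich_count: int, simple_count: int,
                     tolerance: float = 0.02) -> None:
    baseline = max(rich_count, simple_count, 1)
    divergence = abs(rich_count - simple_count) / baseline
    if divergence > tolerance:
        # In practice this would page someone or open a ticket.
        raise RuntimeError(f"pipelines disagree by {divergence:.1%}: "
                           f"rich={rich_count}, simple={simple_count}")

check_divergence(rich_count=1_002_113, simple_count=998_731)    # ~0.3%, fine
# check_divergence(rich_count=1_002_113, simple_count=640_000)  # would raise
```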

Easy to say, difficult to achieve: setting up two completely independent redundant pipelines may be difficult, time-consuming, bandwidth-hungry or otherwise resource-intensive, and hard to maintain. Fairly often, their results will diverge not because of any meaningful change but because of their intrinsic properties, only to calmly drift back together once everyone has been alerted. That is, however, a fundamental tradeoff: the more important your data processing goal, the more you can invest.

Of course, there is a plethora of degrees of independence/redundancy; you should focus on the parts of the pipeline that lie at cross-team boundaries. For instance, it does not make much sense to have two consumers of Kafka, one in Java, one in Python, just to achieve redundancy, as both of them will likely fail in the same way in case of a contract breach.

A particular applied case of this framework, which I believe is the most useful one, is to have two client reporting flows: a very basic one which just sends a daily heartbeat with almost no additional data to a very simple cloud service (“just a DB insert”) and hardly evolves at all; and a more complicated one with a rich and modular client library supporting flexible schema evolution (i.e., the “standard” one). This way, you can keep contrasting basic metrics derived from the rich reporting flow with those from the simple flow. And of course, if a particular field is of extreme importance, that justifies adding it to the basic reporting flow too (at a different time than it was added to the rich one, of course).
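
On the cloud side, the basic flow really can be “just a DB insert”. The sketch below stores one heartbeat per device per day (SQLite is used purely for illustration), so counting active endpoints is a single query that shares nothing with the rich pipeline.

```python
# A sketch of the minimal heartbeat store behind the basic reporting flow.
import sqlite3

conn = sqlite3.connect(":memory:")  # any boring, reliable database will do
conn.execute("CREATE TABLE heartbeats (device_id TEXT, day TEXT, "
             "PRIMARY KEY (device_id, day))")

def record_heartbeat(device_id: str, day: str) -> None:
    # Idempotent insert: a device heartbeating twice in a day counts once.
    conn.execute("INSERT OR IGNORE INTO heartbeats VALUES (?, ?)",
                 (device_id, day))

for dev in ("a", "b", "b", "c"):
    record_heartbeat(dev, "2019-03-17")

active = conn.execute("SELECT COUNT(*) FROM heartbeats WHERE day = ?",
                      ("2019-03-17",)).fetchone()[0]
print(active)  # 3 active endpoints on that day
```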

Note that sometimes, distinct pipelines arise spontaneously:

  • maybe a developer was not aware of a standard reporting solution and exploited Google Analytics for this purpose,
  • maybe an analyst was not aware of an existing autogenerated report and wrote a new one, extrapolating data from other sources,
  • maybe the cloud team wants to experiment with a new data processing technology, sets it up in parallel, and later everyone forgets it was just an experiment and it becomes a production thing.

Such cases are sometimes met with less than enthusiasm from management, but they serve the purpose of cross-verification very well.

Recapitulation

  1. Many things can go wrong.
  2. You won’t prevent them just by improving individual components (with better technology, unit tests, etc.).
  3. It is not just a technological problem, it is more an organizational-communication-complexity problem.
  4. Redundancy through independent pipelines rules.
