Best Practices for Personal Data Processing

Vojtech Tuma
12 min read · Oct 5, 2020


Disclaimer & Context

I am not a lawyer, and I am not a privacy expert. So take everything written here with a grain of salt. I am just an engineer responsible for data management, who actually does not mind talking to lawyers and tends to read some law texts as a hobby. Also, I’m presenting a minimalistic approach here — what we have in fact implemented in our company is way stronger in terms of privacy guarantees, and tailored to our context. And of course, all written here is my personal opinion only.

I have been in the field of data platform management and data engineering for about 5 years, from pre-GDPR times through its adoption and to today’s proliferation of similar standards in other countries. I have reviewed many attempts to cope with those regulations, some of them naive, some overcomplicated, and a few of them reasonable. I was lucky to be a part of the team that implemented GDPR compliance, or, more generally, a proper personal data processing policy, at a software company with hundreds of millions of users.

Data Processing Regulations pertain to all sorts of societal activities, including hospitals, schools, insurance companies, banks, social media, … I am writing this essay solely from the point of view of an ecommerce company — not everything written here transfers well to other domains.

Purpose

When it comes to attitudes towards GDPR and the like, people around me exhibit mostly misunderstanding, but also fear or contempt. I guess people often feel intimidated by large texts written in lawyerish, and thus resort to a cognitive shortcut. I would like to provide one such shortcut here, hoping that it will not compromise on understanding or increase confusion. Also, a selection of best practices and the results of my own ponderings are given at the reader’s disposal.

What is Personal Data

Firstly, I consider Personal Data a misnomer. It leads to confusion due to the similarly sounding Personal Identifiers. Personal Identifiers are things like a social security number, a first name + family name assuming it is unique (very rare), and the like — everything that is in a 1–1 relation with a human being. Is an email, a phone number, an address, a John-Smith-like name, etc., a personal identifier? Yes! The catch is that personal identifiers must be considered in context — if I know your name is John Smith, that does not allow me to get to a single person. If I only know you live in a certain big residential apartment block with 100 residents, that alone is also not sufficient. However, if I know both name and address, I will likely be able to get to you.

This attempt at a definition, taken ad absurdum, would turn virtually everything into a personal identifier. If I knew the size of your shoes, the approximate latitude you were at during last Friday’s noon, and your favourite kind of sushi — that would probably uniquely identify you; but none of those sounds like a “real” Personal Identifier.

This apparent issue, which people with software engineering or mathematical backgrounds often struggle with, comes from the fact that law does not always allow algorithmic interpretation. If it did, we would probably not need judges. So subjective terms like reasonable or practical often enter the field to resolve, seemingly easily, any ambiguities.

So what, then, is Personal Data? Well, virtually everything that appears in your DBs and systems thanks to your customers. The number of heartbeats sent from the client’s application? Yes. The search terms the customer entered into your application’s search bar? Yes. The CPU load during your application’s run on the client’s device? Yes.

Which is why I don’t like calling it Personal Data, and prefer what is, in my eyes, the much clearer term: Customer Data.

Know Thyself

Now that we know what is in scope — Customer Data, virtually everything in your systems — there is one must for you to be able to comply with any regulation, and to be able to look your Grandma in the face and tell her “you can safely use our products, no worries”. And that is understanding exactly what data you collect, how you process them, where you store them, and for how long.

This is usually called a Data Catalog or Metastore, sometimes going hand in hand with Schema Registries or MDMs or other data management/governance tools. Implementation details are not important here; the point is that you must be able to, for every collection of data that happens at your company:

  • list the products/systems/websites/etc. that are the source of this data, and also be able to provide the transposed view, that is, list all collections that happen at a given product;
  • provide the exact schema of the collected data, the semantics of each field, and the triggers on which the collection happens;
  • explain the basic purposes for which it was started (research, business intelligence, marketing, product feature, …);
  • state in which systems it is stored, what the conditions for deletion are (time-based or event-based), and who has access;
  • provide data lineage in greater detail: “a scheduled job J is computing table T from this raw data, which is then used by model training pipeline M to feed the feature F on website W”.

Praising Automation

A particular example could be simply keeping a structured file (JSON, XML, YAML, …) for every collection, looking as follows:

collectionId: heartbeat_analytics_androidApp_347
collectionDescription: Sending heartbeats from our Android app
productIds: ourGreatAndroidApp_v10.1+
purposeCategory: core_functionality
schemaLink: <path>/heartbeat_347_schema.gpb
triggersDescription: once every day at 12:00 of device local time
storageLink: <path to S3 bucket>
deletion: data auto-deleted after 1 year
accessManagement: [mobile_business_intelligence_team r, mobile_qa_team r]
scheduledJobsId: [mobile_analytics_funnel_01, mobile_analytics_funnel_01_final, mobile_analytics_funnel_01_hotfix, mobile_qa_pipeline_judith, mobile_qa_pipeline_john]

You can see that some of the fields appear to be designated for machine processing (e.g., the Ids), while some are human-level only (e.g., the triggers description). The ideal is to have as much machine processing in place as possible — such as automatic creation of the sending trigger for the Mobile Client Configuration, or automatic configuration of data deletion jobs and ACL management for your data lake platform. The extreme anti-ideal is a catalog that is an excel sheet maintained by hand and write-only — that is bound to diverge from reality and to require syncing efforts which usually feel like badly spent time. On the other hand, if your system has built-in mechanisms to keep it in sync, such as “scheduling of jobs happens via listing in this catalog”, then you are guaranteed to have your Catalog up to date.
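To make the “built-in sync” idea concrete, here is a minimal sketch of catalog-driven automation, assuming the catalog is a directory of YAML files shaped like the example above. The paths, field names, and the emit_deletion_job helper are all illustrative, not a real API:

import pathlib
import yaml  # pip install pyyaml

def load_catalog(catalog_dir):
    # Parse every catalog entry, so all downstream tooling
    # works from a single source of truth.
    return [yaml.safe_load(p.read_text())
            for p in pathlib.Path(catalog_dir).glob("*.yaml")]

def emit_deletion_job(entry):
    # Turn the catalog's deletion field into a machine-actionable
    # job config; a real version would parse the free-text policy.
    return {
        "job_id": "delete_" + entry["collectionId"],
        "target": entry["storageLink"],
        "policy": entry["deletion"],  # e.g. "data auto-deleted after 1 year"
    }

for entry in load_catalog("catalog/"):
    print(emit_deletion_job(entry))

Because the deletion jobs are generated from the catalog rather than configured by hand, a catalog entry that diverges from reality becomes visible immediately.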

I presume the reader will find it obvious that such a well-updated central catalog confers a plethora of other joys.

Avoid Shame

The central point of importance for a Catalog in the realm of Data Processing is the avoidance of basic shame. I find it very shameful when various parts of the company give out contradictory statements, as in “we never process email addresses in this company” versus “oh yes, they are must-haves for email campaigns”. Legal authorities and auditors also do not find it particularly amusing if you keep changing your statements with “apologies, we are still researching what we are actually doing”.

It is also particularly shameful when, because of a new regulation, the data lake team is told “we need to shorten retention periods of all datasets containing IP addresses to 7 days”, and they respond “ok, 1 man-month to find where IP addresses are, and then 2 weeks times the number of systems to implement the change there”. The reply should have been “ok, according to our 5-minute query to the catalog, these and these systems will be affected, and we will need 3 weeks to implement a new generic catalog option, which will of course enable such retention shortenings for other fields in the future”.
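For illustration, a sketch of what such a “five-minute query” could look like, given the catalog entries loaded as above. The load_schema_fields helper is a stand-in for whatever parses your schema files, and the demo field set is invented:

def load_schema_fields(schema_link):
    # Stand-in: a real version would parse the .gpb/.avsc/... file
    # at schema_link; the demo mapping is purely illustrative.
    demo = {"<path>/heartbeat_347_schema.gpb":
            {"device_id", "ip_address", "timestamp"}}
    return demo.get(schema_link, set())

def affected_collections(catalog, field):
    # Every collection whose schema contains the field, plus where
    # it lives and which scheduled jobs consume it.
    return [
        {"collection": e["collectionId"],
         "storage": e["storageLink"],
         "jobs": e["scheduledJobsId"]}
        for e in catalog
        if field in load_schema_fields(e["schemaLink"])
    ]

A call like affected_collections(catalog, "ip_address") then directly enumerates the systems and pipelines the retention change will touch.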

A piece of software is well designed with respect to its domain if it can cope well (read: refactor easily) with new requirements that were not part of the original design but could have been expected based on domain knowledge. And the domain knowledge here says things like: make it easy to change how long you keep which data, based on customer preferences, countries, purposes, products; make it easy to change who can access what, for which purposes; make it easy to throw in replacements of data with their pseudonymised versions.
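One way to make that “easy to change” property concrete is to keep the retention rules as data rather than as code scattered across systems. A minimal sketch, with rule shapes and defaults invented for illustration:

from dataclasses import dataclass

@dataclass(frozen=True)
class RetentionRule:
    field: str
    country: str | None   # None = applies everywhere
    purpose: str | None   # None = applies to all purposes
    days: int

RULES = [
    RetentionRule("ip_address", None, None, 7),           # the 7-day example above
    RetentionRule("search_terms", "DE", "research", 30),  # country- and purpose-specific
]

def retention_days(field, country, purpose, default=365):
    # The most specific matching rule wins; otherwise fall back to the default.
    matches = [r for r in RULES
               if r.field == field
               and r.country in (None, country)
               and r.purpose in (None, purpose)]
    if not matches:
        return default
    specificity = lambda r: (r.country is not None) + (r.purpose is not None)
    return max(matches, key=specificity).days

Shortening retention for a field, a country, or a purpose then becomes a one-line data change instead of a refactoring project.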

Finally, another shameful aspect is when the upper management needs to craft a new data privacy strategy, following a new regulation approval or a change in PR strategy, and cannot base it on accurate or complete data.

Be Transparent with your Customer

As said before, the waters of legal interpretation may be murky due to the subjective elements. A company may try to fulfil its well-intended interpretation of a regulation, and still be told “nice try, but here’s your fine, do better next time”. That may even lead to defeatist moods of “no matter what we do, if the authorities are not having a good day, we are screwed — so let’s just brace ourselves and ignore all this”.

That completely ignores the original motivation — we do not have legal regulations as an end in themselves, we have them to ensure that the customer is treated fairly. And whatever suboptimalities legal regulations may or may not have, you are still bound by honour to keep in mind the fair treatment of customers and their data. You do not own the data in the ultimate, creative sense — if it were not for the customer, you would not possess it at all. And no matter how strict or benign a legal regulation is, the customers must be ok with how you treat their data, because they are the final arbiters when it comes to good manners. If your customers trust you, they’ll stay with you despite fines imposed on you under “obscure” regulations. And if you are painted as a “ruthless data exploiter”, no official stamp of legality is going to save you. This paragraph is not intended to downplay the role of legal authorities or paint them in a negative light, quite the contrary — if you keep the customer as your focus, you may find that you have a better conversation with the authorities.

And — you may have guessed it by now — you cannot be transparent to customers externally unless you are transparent internally, which you can’t be without a good Data Catalog.

Describe Your Processes

The first step to a good business relationship with a customer is transparency. In today’s world, where most apps are seemingly free, data collections are somehow understood to exist. The more you can do here — explain what you collect, what you do with it, and why it makes you able to provide more customer value and thus fare better economically — the more customer trust you will have. The shady tactic of pretending to be a “data free company” until someone catches you red-handed by inspecting the behaviour of your applications is the road to hell. Do not be afraid of being technical and detailed — there is already a plethora of texts on how today’s marketing and advertisement optimisation works, and everyone assumes you do it as well. Sometimes I feel this topic gets the treatment sex got in Victorian times — it is clear that everybody must be doing it, yet only few admit it or talk about it.

This can even provide a differentiator for you — if you can employ advanced techniques such as differential privacy or sophisticated pseudonymisation, you can put yourself ahead of your competitors and gain customer appreciation.

DSAR

Transparency takes the form not only of generic texts and descriptions of your processes, but also of the so-called Data Subject Access Rights. GDPR and others require companies to comply with customer requests of the “show me all the data you have about me” kind. This actually came as a shock to some, as their data lakes were not designed with single-id-based extractions in mind. That, in my eyes, is unfortunate — though I do not find it valuable to provide to the customer every byte that was ever generated at his device. Of course, every byte that played a role in deciding the price, or, e.g., in the area of health, deciding when the patient will undergo surgery, is important. But showing me the logs of heartbeats of my device for the past year? Why, when they all look the same?
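With a catalog in place, a single-id extraction is conceptually just a loop. A sketch, where read_rows_for_customer is hypothetical and stands in for your lake’s per-dataset lookup mechanism:

def read_rows_for_customer(storage_link, customer_id):
    # Hypothetical stand-in for a per-dataset lookup
    # (query engine, per-id index, ...).
    return []

def dsar_extract(catalog, customer_id):
    # Everything stored about one customer, keyed by collection id.
    report = {}
    for entry in catalog:
        rows = read_rows_for_customer(entry["storageLink"], customer_id)
        if rows:
            report[entry["collectionId"]] = rows
    return report

Without a catalog, the same request is an archaeology project; with one, it is a loop plus some presentation work.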

Again, this gives you an opportunity to positively transcend the legal requirements. Do user research — what will your customers find valuable and understandable? What, on the other hand, do they not care about? Can you perhaps turn parts of DSARs into product features, such as dashboards and statistics about customer usage?

Also, do not forget that you can use this as another explanation of how your processing works. In particular, the DSAR can be served via a combination of data stored locally and data stored in the cloud — where the customer can see “before the data leave your device they look like this, but we store them in such and such pseudonymised form”. For instance, if you have a healthy-lifestyle app, you may store locally the GPS coordinates the device was at, but keep in the cloud only summaries, aggregates, or noisy versions of the coordinates, so as not to allow tracking if the data ever leak.
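As a sketch of that local-vs-cloud split for such a hypothetical app: the device keeps the raw GPS track, and the cloud only ever receives a coarse daily summary. The rounding resolution is an arbitrary choice for illustration:

import math

def haversine_km(a, b):
    # Great-circle distance between two (lat, lon) points, in kilometres.
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def daily_summary(points):
    # What leaves the device: total distance plus a rough region
    # (rounding to 0.1 degree is roughly 11 km resolution),
    # instead of the full track. Assumes a non-empty track.
    distance = sum(haversine_km(points[i], points[i + 1])
                   for i in range(len(points) - 1))
    rough_lat = round(sum(p[0] for p in points) / len(points), 1)
    rough_lon = round(sum(p[1] for p in points) / len(points), 1)
    return {"distance_km": round(distance, 2), "region": (rough_lat, rough_lon)}

If such a summary ever leaks, it reveals far less than the raw track would.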

DSARs are often mentioned in conjunction with Data Deletion Rights. My own understanding (and, I repeat, an unqualified one) is that this is a remedial mechanism. Assuming you collect only what you are principally entitled to, and keep data only as long as you should (e.g., not emailing your customer after he has uninstalled your application), Data Deletion Rights should not require any action on your side, ever. Data Deletion Rights have attracted a lot of media attention, possibly because they matter in situations which are obvious breaches of good behaviour, but I view the other points listed in this article as much more important and worthy of your focus. It is always better not to breach agreements in the first place than to think about how to fix it if someone finds out about the breach.

Do Only What Benefits Your Customer

In the case of security incidents, the first assumption is that the attacker has gotten everywhere, and over time the assumed damage is limited as knowledge of the situation increases. You can adopt a similar maximum-risk approach. Treat your data processing as if every piece of your code were available for customer inspection. And if it would not withstand qualified scrutiny in terms of morality, don’t do it. That is not to say a journalist with a leaker from your company cannot damage you by misinterpreting what you do — they totally can. But even then you are in a good position, because you can defend yourself merely by telling the truth, which is much easier than crafting lies.

Of course, in the light of the Transparency section, you have no other option. Which is another reason why Transparency is good — it forces you to behave well.

Minimalism

Yet another requirement frequently listed in legal regulations is minimalism in data collection. And again, it is a subjective term, offering guidance rather than an algorithm. However, the good thing here is that multiple actors share this goal with the customers and the legal authorities:

  • admins of data lakes wish for easier operations,
  • budget holders look to cut costs,
  • PR folks want a minimalistic picture to paint,
  • client developers don’t want to occupy bandwidth,
  • your security department does not sleep well when there are data that can leak,
  • the laws of physics and mathematics refuse to scale linearly.

Basically the only party with contradictory goals, the researchers and data analysts who want as much data at their disposal as possible, has whole fields of knowledge at its disposal to identify what it actually needs and discard the rest: feature importance, aggregates and sketches, pseudonymisations, et cetera.

Note that minimalism does not have to happen at the earliest stage, on the customer’s device, even though that is the best place for it. Minimalism as a mindset should be applied everywhere — you need to collect some data first to be able to decide what is and what is not important. But once you know it, you can limit your regular model training pipelines to the required features only, and collect the other ones only from a fraction of users, to allow, e.g., future verification that the assumption remains true. Or you may realise that models can perform well with some noise added on the server side — so you keep the original copy of the data in a lake with tight security restrictions and no researcher access, and export a noisy copy of the dataset for research/training purposes.
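A sketch of that last idea: the restricted original plus a noisy research copy. The Laplace noise here is illustrative, not calibrated differential privacy, and the scale is a free parameter:

import random

def laplace(scale):
    # The difference of two Exp(1) variables is Laplace-distributed.
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def noisy_copy(rows, numeric_fields, scale=1.0):
    # The original rows stay in the tightly restricted lake;
    # only this perturbed copy is exported for research and training.
    return [
        {k: (v + laplace(scale) if k in numeric_fields else v)
         for k, v in row.items()}
        for row in rows
    ]

Whether models still perform well on such a copy is an empirical question — but it is exactly the kind of question the minimalism mindset tells you to ask.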

Closing Thoughts

All in all, data regulations, despite being somewhat complex and perhaps hard to comprehend, are, in my view, a desirable thing, because they put a spotlight on the rights of customers when it comes to data processing and mandate a standardised basis. They bring changes and challenges to our daily work, and many of those challenges I actually like and find worth tackling. However, we should not treat them as religious objects, and we should not forget that the ultimate goal is fair treatment of the customers.
