Data quality is undoubtedly a key part of any successful data-driven decision. Shit in. Shit out. And because it’s a success enabler, there is absolutely no reason not to invest heavily in it. Except, there is!
As a foreword, I should mention that I’m not trying to say there’s no need to aspire to accurate, precise, well-labelled and well-documented web analytics data. This was actually my very first sentence. Rather, I’d like to stress that treating data quality beyond reason as a primary goal can lead to ridiculous, ROI-negative investments.
As Digital Business Intelligence Lead at Semetis, I’m regularly asked to miraculously bridge the gap between the transactions, clicks, sessions, impressions (you name it) of one platform and another database. And because the data does not always match, because there’s maybe a 5% difference, the whole dataset is considered wrong and ends up rejected. As a consequence, the obvious next step is often thought to be annihilating the 5% difference with tons of effort. Let’s step back a bit:
- A difference between two data sets does not mean one or both of them are inaccurate
- A difference on 5% of the data is not a sufficient reason to ignore the other 95% of the data
- A 5% variation should not suck 95% of your energy or attention
The metrics we use are often only proxy indicators
Let’s say I want to know the number of coffees drunk every day at Semetis. For this purpose, you could imagine different (and creative) ways of measuring it:
- You implement a sensor on the coffee machine button
- You ask your colleagues to swallow a high-tech stomach probe that detects caffeine
- You ask your colleagues to fill in a paper sheet on a weekly basis
Which one is the best indicator? Is one of them more accurate than the others? The truth is, none of them measures the number of coffees drunk; they respectively measure (1) the number of times the button is pushed, (2) the number of times caffeine is detected in your stomach and (3) the number of coffees your colleagues remember drinking.
Not only do they fail to measure exactly what you want, they also have very different biases: the sensor could malfunction, the probe could detect a tiramisu (note: tiramisu contains coffee), or your colleagues could drink too many beers and forget their whole week. Would you expect the exact same number from these three measures?
The same goes for web analytics data. Conversions in AdWords are in fact just the number of times the AdWords conversion tag was triggered, not the number of headphones that were shipped to your clients. The number of users in Google Analytics is the number of distinct Client IDs (or User IDs), not the number of persons who visited your website. And these indicators can be biased as well: ad blockers could prevent your tags from firing, users could clear their cookies or have two accounts, your Tag Management System could fail to load properly on a visitor’s machine. Why would you expect the same numbers from Google Analytics and your ERP?
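To make this concrete, here’s a toy simulation (all numbers are invented) of how two honest measurement systems can report different counts of the same reality:

```python
import random

random.seed(0)

# Hypothetical ground truth: 1,000 real purchases this month.
true_purchases = 1_000

# Tag-based count: the conversion tag fires unless an ad blocker or a
# tag-loading failure gets in the way -- assume that happens 5% of the time.
tag_fires = sum(1 for _ in range(true_purchases) if random.random() > 0.05)

# ERP-based count: every order is recorded, but a handful are later
# cancelled and removed from the report.
erp_orders = true_purchases - 12  # e.g. 12 cancellations

print(f"conversion tag fires: {tag_fires}")
print(f"ERP orders:           {erp_orders}")
# Two reasonable proxies, two different numbers -- and neither is 'wrong'.
```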
Missing data is fine. Small bias is fine. Deal with it.
Not capturing some data points does not mean there’s no information to get out of your data. Generally in web analytics and digital advertising, the data you get is very accurate, unless you completely messed up the implementation. If 5% of your data is not collected for whatever reason (ad blockers, for instance), that’s fine. The data you have is still very valuable. Web analytics tools are meant to measure trends and proportions, and to connect networks and platforms. They are not cancer scanners, and approximate data is OK.
Yet, we tend to focus on the 5% of missing data and spend our time taking actions to reduce the gap, instead of focusing on the 95% of data we have and taking actions that improve sales, ROI and UX. In many cases the 5% gap makes us reluctant to value the rest of the data, making the ROI of collecting 95% of the data equal to that of collecting no data at all.
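A quick sketch of why a random 5% gap is benign: if sessions go untracked independently of how they behave, the proportions you care about barely move. The conversion rate and sample size below are made up for illustration:

```python
import random

random.seed(1)

# Hypothetical full population: 10,000 sessions, 3% of which convert.
sessions = [random.random() < 0.03 for _ in range(10_000)]

# Drop 5% of sessions at random, as an ad blocker would, regardless
# of whether the session converted.
tracked = [converted for converted in sessions if random.random() > 0.05]

print(f"true conversion rate:      {sum(sessions) / len(sessions):.4f}")
print(f"rate on the 95% you kept:  {sum(tracked) / len(tracked):.4f}")
# The estimate is essentially unchanged: a random gap shrinks your
# sample, not the information in it.
```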
The same reasoning even applies to biased (less accurate) data. We tend to take binary decisions: only unbiased data is acceptable, and biased data is totally unacceptable. We act as if the decision we would take if 49% of our visitors were male were totally different from the one we would take if only 48% were.
NB: bias is fine if you can reasonably assume it is evenly spread. If it appears to be specific to some devices or browsers, for instance, try to fix it and adapt your conclusions.
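One way to sanity-check that assumption is to compare the discrepancy segment by segment. Here’s a minimal sketch with pandas; the exports, figures and column names are made up:

```python
import pandas as pd

# Hypothetical per-device transaction counts from two sources.
analytics = pd.DataFrame({
    "device": ["desktop", "mobile", "tablet"],
    "transactions": [4750, 2400, 380],
})
erp = pd.DataFrame({
    "device": ["desktop", "mobile", "tablet"],
    "transactions": [5000, 3000, 400],
})

gap = analytics.merge(erp, on="device", suffixes=("_ga", "_erp"))
gap["missing_pct"] = 100 * (1 - gap["transactions_ga"] / gap["transactions_erp"])
print(gap[["device", "missing_pct"]])
# Roughly equal percentages would suggest an evenly spread bias you can
# live with. Here mobile loses 20% while the others lose 5%: a
# device-specific tracking issue worth fixing before drawing
# per-device conclusions.
```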
Having an 80/20 or 90/10-ish focus is a rational approach, not a disregard for data quality
As a data-focused company, data architecture and data quality have always been among our core focuses. But data-driven decisions do not only imply data; they also imply decisions. And decisions imply we must stop focusing solely on the infinite loop of always-better data, go ahead, and focus instead on the customers, the services, the products. Clean data is a means to an end, not the goal itself.