Should we think about data quality like we do water treatment?

I recently wrote a blog post looking at the data world’s obsession with talking about data as if it were water. The analogy is worth exploring further: can we think about data quality in the same way we think about water treatment?

Water has to be treated before people across the country can use and drink it, so that it is safe. In the same way, as we move data from where it is produced to where it’s consumed, it should be treated along the way so that the consuming systems remain healthy and produce reliable outputs.

We can think about data ingestion and storage like a river and a lake. Data ingestion is the river that carries your data downstream into a lake, where it is stored. Just as with water, it is much easier to treat problems before they reach your data lake, which is why data quality up front is crucial. In practice this means carefully controlled rules around data entry (free text fields should be used sparingly!) and automated rejection of data received from other organisations if it does not meet the required standards.
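To make the "treat it in the river" idea a little more concrete, here is a minimal Python sketch of rule-based checks at the point of ingestion. Everything in it is an assumption for illustration: the field names (`customer_id`, `postcode`, `status`), the list of allowed statuses and the simplified postcode pattern are all hypothetical, and real rules would come from your own data standards.

```python
import re

# Simplified UK-style postcode pattern, for illustration only.
# Real postcode validation is stricter than this.
POSTCODE_RE = re.compile(r"^[A-Z]{1,2}[0-9][0-9A-Z]? ?[0-9][A-Z]{2}$")

# Hypothetical controlled list, used instead of a free text field.
ALLOWED_STATUSES = {"active", "closed", "pending"}


def validate_record(record: dict) -> list:
    """Return a list of rule violations; an empty list means the record passes."""
    problems = []
    if not record.get("customer_id"):
        problems.append("customer_id is missing")
    postcode = str(record.get("postcode", "")).strip().upper()
    if not POSTCODE_RE.match(postcode):
        problems.append(f"postcode {postcode!r} is not in a recognised format")
    if record.get("status") not in ALLOWED_STATUSES:
        problems.append(f"status {record.get('status')!r} is not an allowed value")
    return problems


def ingest(records):
    """Split a batch into accepted records and rejected records with reasons."""
    accepted, rejected = [], []
    for record in records:
        problems = validate_record(record)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
    return accepted, rejected


batch = [
    {"customer_id": "C1", "postcode": "SW1A 1AA", "status": "active"},
    {"customer_id": "", "postcode": "12345", "status": "unknown"},
]
accepted, rejected = ingest(batch)
```

Keeping the reasons alongside each rejected record matters: it means you can tell the supplying organisation exactly what to fix, rather than just turning the tap off.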

An example you will have seen in recent years is selecting your address on most online shopping sites. You enter your postcode and get a drop-down list of addresses to choose from. This ensures you enter a valid address that matches one the retailer holds in its systems.

It’s no good constantly cleaning the lake if poor quality data keeps flowing in from the river. That’s why building data quality into your ingestion processes has to be your priority. Once that is in place, you can begin sampling the data you already hold to understand its quality. This lets you take remedial action, for example standardising formats or imputing missing values, which raises the quality of the data and allows better decisions to be made from it.
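As a rough sketch of what "cleaning the lake" might look like, the Python below samples some rows, measures how often a field is missing, standardises dates into one format and imputes missing numbers with the median. The field names, the list of accepted date formats and the choice of median imputation are all assumptions for illustration; the right remedial action depends on your data and how it is used.

```python
import random
from datetime import datetime
from statistics import median

# Assumed set of date formats seen in the stored data (illustrative only).
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %b %Y")


def sample_rows(rows, n, seed=0):
    """Take a reproducible random sample of up to n rows."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))


def missing_rate(rows, field):
    """Fraction of rows where the field is absent, None or an empty string."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(field) in (None, ""))
    return missing / len(rows)


def standardise_date(value):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except (ValueError, TypeError):
            continue
    return None


def impute_median(rows, field):
    """Return new rows with missing numeric values replaced by the median."""
    known = [row[field] for row in rows if row.get(field) is not None]
    if not known:
        return rows
    fill = median(known)
    return [{**row, field: fill if row.get(field) is None else row[field]}
            for row in rows]


rows = [
    {"joined": "2023-01-05", "age": 30},
    {"joined": "05/01/2023", "age": None},
    {"joined": "2023-02-10", "age": 40},
]
sampled = sample_rows(rows, 2)
cleaned = impute_median(rows, "age")
```

Imputed values are estimates rather than facts, so it is usually worth flagging which values were filled in so that downstream users can decide whether to trust them.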

There is probably a whole other set of analogies that could be explored around the recycling of wastewater and the archiving and disposal of data, but I’ll let someone more creative than me explore that! 

It is worth noting that, to my limited understanding, water in reservoirs is actually untreated and only goes through treatment on its way to our homes, so the diagram and my whole analogy are really quite false - but hopefully you still found them useful.

If your organisation wants to improve its data quality and make better decisions, get in touch. We’d love to get your data flowing in the right direction.
