In data science and other data-driven projects, data quality is one of the most important aspects to consider. When developing intelligent systems with AI (Artificial Intelligence) techniques, data is required to train the system, and if that data has quality issues or has not been prepared or processed properly, the decisions the system makes will be affected by those issues. It is therefore vital that the data meets all the relevant quality standards before the system ingests it, and this is where the tool ‘Great Expectations’ comes into the picture.

Great Expectations is an amazing tool that helps maintain and deliver data quality while working in a team. The concept of automated testing is worth mentioning here: code is handed to another piece of code or tool and run through a predefined set of rules to check whether it meets the agreed standard. Similarly, this tool automates the validation of data engineering, preparation, and pre-processing work in data-based projects, which helps identify data issues quickly. It provides a number of components, each with its own role in handling data.

Main Components:

Expectations

Expectations work as assertions for the data, written in Python. The Python code generally acts as a condition check whenever data passes through it. For instance, imagine a table containing customer information where the product requires customers to be above 16; an expectation checked against the customer age column would then return a success or failure depending on the result. Think of expectations as unit tests: each one is checked against the data, and if the data does not meet a specific expectation, the issue is logged. The package provides a wide range of pre-defined expectations and also allows the user to write their own custom expectations.
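A minimal sketch of what this looks like in Python, using the pandas-backed API from the 0.x releases of the library (the table and column names here are made up for illustration):

```python
import pandas as pd
import great_expectations as ge

# Hypothetical customer table with one under-age record
customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "age": [25, 17, 14],
})

# Wrap the DataFrame so expectation methods become available on it
customers_ge = ge.from_pandas(customers)

# Assert that every customer is above 16 (i.e. at least 17)
result = customers_ge.expect_column_values_to_be_between("age", min_value=17)

print(result.success)                            # False: one row violates the rule
print(result.result["partial_unexpected_list"])  # the offending value(s), e.g. [14]
```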

Automated Data Profiling 

Automated profiling observes the basic structure and content of the data, which reduces the effort of writing code and designing the data pipeline by hand. It automatically generates an expectation suite, or test suite, based on statistics extracted from the given data. In the example above, the expectation was written by hand; with profiling, the generated suite would already include constraints on the customer age column (such as a value range) derived from the observed structure and statistics of that column, which the team can then tighten to the business rule of being above 16.
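A rough sketch of how such a suite might be generated from an existing batch of data, assuming the UserConfigurableProfiler available in recent 0.x releases (the exact profiler and the expectations it proposes will vary by version and by the data it sees):

```python
import pandas as pd
import great_expectations as ge
from great_expectations.profile.user_configurable_profiler import (
    UserConfigurableProfiler,
)

# Hypothetical customer data to profile
customers_ge = ge.from_pandas(
    pd.DataFrame({"customer_id": [101, 102, 103], "age": [25, 17, 19]})
)

# Build an expectation suite from the observed structure and statistics
profiler = UserConfigurableProfiler(profile_dataset=customers_ge)
suite = profiler.build_suite()

# Inspect what the profiler proposed, e.g. type, null and range checks per column
for expectation in suite.expectations:
    print(expectation.expectation_type, expectation.kwargs)
```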

Data Validation

Data validation is the component that executes expectation suites on the data. It loads a batch of data from the whole set, runs all the expectations against it, and declares whether the batch has passed or failed the validation test. When a test fails, it returns the values that caused the failure, which speeds up debugging and the data quality process as a whole.
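A small sketch of validating a new batch against an existing suite, again assuming the 0.x pandas-backed API (the data and the single age rule are illustrative):

```python
import pandas as pd
import great_expectations as ge

# Define the rule once on a reference batch, then reuse the resulting suite
reference = ge.from_pandas(pd.DataFrame({"age": [25, 17, 19]}))
reference.expect_column_values_to_be_between("age", min_value=17)
suite = reference.get_expectation_suite()

# A new incoming batch, containing a value that breaks the age rule
new_batch = ge.from_pandas(pd.DataFrame({"age": [31, 12]}))
validation = new_batch.validate(expectation_suite=suite)

print(validation.success)  # False: the batch failed validation
for res in validation.results:
    if not res.success:
        # Each failed result names the expectation and the unexpected values
        print(res.expectation_config.expectation_type, res.result)
```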

Data Documentation

Documentation is produced in HTML format, containing the expectation suites and the results of validation tests. The status and result of each expectation are logged every time it runs on the data. This component provides a continuous record of each test, which is useful for the data team to keep track of quality issues.
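In a project initialized with the library's CLI (great_expectations init), the HTML documentation is typically rebuilt from the project's Data Context, roughly as follows:

```python
import great_expectations as ge

# Load the project's Data Context (configured by `great_expectations init`)
context = ge.get_context()

# Regenerate the HTML Data Docs from stored suites and validation results,
# then open the local site in a browser
context.build_data_docs()
context.open_data_docs()
```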

Benefits and Features:

  • Speeds up the data cleaning and pre-processing stages

  • Keeps track of data quality issues by logging them 

  • Helps raise any issues with the data team as soon as they get logged

  • Provides shared descriptive data documentation 

  • Provides data quality monitoring within the data pipeline for production

  • Crucially, it provides automated verification of test data or any new data that comes in

  • Reduces the chance of forgetting about any quality issues

  • A key feature is that it supports multiple data sources and storage backends, such as Spark dataframes, Pandas dataframes, SQL databases, etc. (see the sketch after this list)
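As a rough illustration of that last point, the same expectation from earlier can be run against a SQL table using the SQLAlchemy-backed dataset from the 0.x releases (the in-memory SQLite database, table, and column names are made up for the example):

```python
import pandas as pd
from sqlalchemy import create_engine
from great_expectations.dataset import SqlAlchemyDataset

# Load the same hypothetical customer data into an in-memory SQLite table
engine = create_engine("sqlite://")
pd.DataFrame({"customer_id": [101, 102, 103], "age": [25, 17, 14]}).to_sql(
    "customers", engine, index=False
)

# The identical expectation method works on the SQL-backed dataset
customers_sql = SqlAlchemyDataset(table_name="customers", engine=engine)
result = customers_sql.expect_column_values_to_be_between("age", min_value=17)
print(result.success)  # False, just as with the Pandas-backed version
```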

For further exploration of this incredibly helpful tool, the case studies (https://greatexpectations.io/case-studies/) show how companies have been using it.

For implementation details, this walkthrough is a helpful starting point: https://medium.com/hashmapinc/understanding-great-expectations-and-how-to-use-it-7754c78962f4
