Using Frictionless Framework in Python

It is not always easy to generate insight and conclusions from data. Data can be poorly structured, hard to find, archived in difficult-to-use formats, or incomplete. These problems create ‘friction’, making it difficult to use, publish, and exchange data. The Frictionless Data project strives to remove as much of this friction as possible, with the objective of making it simple to move data across tools and platforms for further analysis. It is an open-source initiative of software, tools, and specifications aimed at improving data and metadata interoperability.

The Frictionless Framework improves the usability of data by producing metadata and schemas and by validating data to ensure quality. To enhance your data, there are four basic functions that may be applied independently. The first is to ‘Describe your Data’: describe extracts and edits metadata from a data file - for example, it will provide metadata defining the data's layout (e.g. which row is the header) and produce a schema describing the data's contents (i.e. the type of data in each column). This is the first step in ensuring the quality and relevance of data.
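As a minimal sketch of this step (assuming the frictionless package is installed, e.g. via pip install frictionless, and that a local file named data.csv exists as a placeholder):

```python
from frictionless import describe

# Infer metadata and a schema from a local CSV file;
# "data.csv" is a placeholder for your own data.
resource = describe("data.csv")

# The inferred descriptor includes layout details (header row,
# delimiter, encoding) and a schema listing each field's type.
print(resource)
print(resource.schema)
```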

The second function is to ‘Extract your Data’, which reads and normalises a data file. By default, extract produces data that matches the metadata specified in the describe stage or inferred automatically; the user can choose to receive raw (unnormalised) data instead. Frictionless supports a variety of file protocols, including HTTP, FTP, and S3, as well as data formats such as CSV, XLS, JSON, SQL, and others.
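A minimal sketch of extraction under the same assumptions (a local data.csv; note that the exact return shape of extract - a flat list of rows or a mapping from resource names to rows - differs between framework versions):

```python
from frictionless import extract

# Read and normalise the rows of a local CSV file; the same call
# also accepts remote sources such as an HTTP URL.
rows = extract("data.csv")
print(rows)
```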

Once you have extracted your data, you need to ‘Validate’ it, finding errors in a data file. Validate analyses data tables, resources, and datasets for any errors (for example, are there any missing values?). These checks are modifiable and can be based on a given schema. While extract cleans the data by removing invalid cells, validate lets you see the full picture of the raw file.
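A minimal sketch of validation, again assuming a local data.csv:

```python
from frictionless import validate

# Check the file against its inferred (or user-supplied) schema.
report = validate("data.csv")

# report.valid is True when no errors were found; printing the
# report shows each error's location and type.
print(report.valid)
if not report.valid:
    print(report)
```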

The last function is to ‘Transform the Validated Data’, altering the metadata and contents of a data file. This stage may include changing the data, storing it in a different format, or loading it somewhere else. Frictionless offers pipeline capabilities as well as a lower-level interface for working with data.
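A minimal sketch of a transform pipeline, following the v4-style transform(source, steps=[...]) API (newer releases wrap the steps in a Pipeline object); the row_filter formula assumes data.csv has a hypothetical numeric id column:

```python
from frictionless import Resource, transform, steps

source = Resource("data.csv")

# Normalise the table, then keep only rows where the hypothetical
# "id" field is greater than 1.
target = transform(
    source,
    steps=[
        steps.table_normalize(),
        steps.row_filter(formula="id > 1"),
    ],
)

# Inspect the transformed schema and rows.
print(target.schema)
print(target.read_rows())
```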

This framework clearly offers flexible and useful capabilities for working with data; hopefully this overview has given some insight into how it can be leveraged effectively.


Reference

https://framework.frictionlessdata.io/index.html
