Pulse Plus

PhonePe recently released the Pulse repo, built from their payment data. It was hard to get an overview of the data without transforming it first.

The data is nested eight levels deep and spread across multiple files that serve similar purposes, which makes command-line aggregate queries for data exploration hard.

Any analysis is hard with 2000+ files, so I created an SQLite database of the data using the Python sqlite-utils library.
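To illustrate the kind of flattening involved, here is a minimal sketch. It uses the standard-library sqlite3 module instead of sqlite-utils so that it runs without extra installs, and the sample payload, the flatten and load helpers, and the column names are all assumptions for illustration, not the actual pulse-plus code or schema.

```python
import json
import sqlite3

# Hypothetical sample shaped like one nested aggregated-transaction file.
# The real Pulse files may use different field names.
SAMPLE = json.loads("""
{
  "data": {
    "transactionData": [
      {"name": "Recharge & bill payments",
       "paymentInstruments": [{"type": "TOTAL", "count": 100, "amount": 2500.5}]},
      {"name": "Peer-to-peer payments",
       "paymentInstruments": [{"type": "TOTAL", "count": 40, "amount": 9000.0}]}
    ]
  }
}
""")


def flatten(payload, year, quarter):
    """Yield one flat dict per (category, payment instrument) pair."""
    for entry in payload["data"]["transactionData"]:
        for instrument in entry["paymentInstruments"]:
            yield {
                "year": year,
                "quarter": quarter,
                "name": entry["name"],
                "type": instrument["type"],
                "count": instrument["count"],
                "amount": instrument["amount"],
            }


def load(conn, rows):
    """Create the (assumed) table if needed and insert the flat rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS aggregated_transaction "
        "(year INTEGER, quarter INTEGER, name TEXT, type TEXT, "
        "count INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO aggregated_transaction "
        "VALUES (:year, :quarter, :name, :type, :count, :amount)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory stand-in for pulse.db
load(conn, list(flatten(SAMPLE, year=2018, quarter=1)))
print(conn.execute("SELECT COUNT(*) FROM aggregated_transaction").fetchone()[0])
```

In this sketch the year and quarter are passed in by hand; in the real repo they would presumably come from the directory and file names.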

The SQLite database holds the aggregated data and the top data in five tables - aggregated_user, aggregated_user_device, aggregated_transaction, top_user, top_transaction. Link to the schema - https://github.com/kracekumar/pulse-plus#all-tables-schema.

The command python pulse/cli.py ../pulse/data --output pulse.db creates the SQLite file from the pulse repo data.

The same five tables are available as five CSV files in the data/v1/ sub-directory of the repo. All aggregated transaction CSV file.
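As a rough sketch of how a flat CSV file makes quick aggregation easy, the snippet below sums transaction counts per category with only the standard library. The CSV content, the column names, and the count_by_name helper are hypothetical; check the header of the real files in data/v1/ before reusing it.

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV content; the real column names may differ.
CSV_TEXT = """year,quarter,name,count,amount
2018,1,Recharge & bill payments,100,2500.5
2018,1,Peer-to-peer payments,40,9000.0
2018,2,Recharge & bill payments,60,1500.0
"""


def count_by_name(csv_file):
    """Sum the count column for each transaction category."""
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        totals[row["name"]] += int(row["count"])
    return dict(totals)


# For a real file, pass open(path, newline="") instead of the StringIO.
totals = count_by_name(io.StringIO(CSV_TEXT))
print(totals)
```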

The data is now flat in the CSV files and the SQLite file, so it is easy to explore in notebooks, Metabase, or any other data exploration tool. If you’re comfortable with SQL, analyze it using the Datasette tool.
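Once the data is flat, an aggregate query is a few lines of SQL. The sketch below runs a yearly total against an in-memory table; the rows and the columns (year, quarter, name, count, amount) are assumptions for illustration, so check the schema linked above before running it against pulse.db.

```python
import sqlite3

# In-memory stand-in; use sqlite3.connect("pulse.db") for the real file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE aggregated_transaction "
    "(year INTEGER, quarter INTEGER, name TEXT, count INTEGER, amount REAL)"
)
# Hypothetical rows, not real Pulse figures.
conn.executemany(
    "INSERT INTO aggregated_transaction VALUES (?, ?, ?, ?, ?)",
    [
        (2018, 1, "Recharge & bill payments", 100, 2500.5),
        (2018, 2, "Peer-to-peer payments", 40, 9000.0),
        (2019, 1, "Recharge & bill payments", 10, 100.0),
    ],
)

# Total transaction count and amount per year.
QUERY = """
SELECT year, SUM(count) AS total_count, SUM(amount) AS total_amount
FROM aggregated_transaction
GROUP BY year
ORDER BY year
"""

totals = conn.execute(QUERY).fetchall()
for year, total_count, total_amount in totals:
    print(year, total_count, total_amount)
```

The same query text can be pasted into Datasette's SQL editor, assuming the column names match.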

High-level data quality observations.

Releasing datasets should be simple and should keep the users (data scientists, analysts) in mind.

Tweet Thread
