PhonePe recently released the Pulse repo, built from their payment data. It was hard to get an overview of the data without doing some data transformation.
The data is nested eight levels deep, and similar-purpose data is spread across multiple files. That makes it hard to run command-line aggregate queries for data exploration.
It's also hard to do any analysis across 2000+ files. So I created an SQLite database of the data using the Python sqlite-utils library.
The SQLite database holds the aggregated data and the top data in five tables: aggregated_user, aggregated_user_device, aggregated_transaction, top_user, and top_transaction. Link to the schema: https://github.com/kracekumar/pulse-plus#all-tables-schema.
python pulse/cli.py ../pulse/data --output pulse.db
The command above creates the SQLite file from the Pulse repo data.
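To illustrate the kind of flattening involved, here is a minimal sketch. It is not the pulse-plus code: it uses the standard-library sqlite3 module instead of sqlite-utils, and the JSON keys (data, transactionData, paymentInstruments), the year/quarter directory layout, the column names, and the sample values are all assumptions made up for the example.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def iter_transaction_rows(data_dir):
    """Yield one flat row per (file, category, payment instrument)."""
    for path in sorted(Path(data_dir).rglob("*.json")):
        # Assumed layout: .../<year>/<quarter>.json
        year, quarter = int(path.parent.name), int(path.stem)
        payload = json.loads(path.read_text())
        # Assumed JSON shape: data -> transactionData -> paymentInstruments
        for category in payload["data"]["transactionData"]:
            for instrument in category["paymentInstruments"]:
                yield (
                    year,
                    quarter,
                    category["name"],
                    instrument["type"],
                    instrument["count"],
                    instrument["amount"],
                )


def build_db(data_dir, db_path):
    """Create the table (hypothetical columns) and load the flattened rows."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS aggregated_transaction ("
        "year INTEGER, quarter INTEGER, name TEXT, type TEXT, "
        "txn_count INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO aggregated_transaction VALUES (?, ?, ?, ?, ?, ?)",
        iter_transaction_rows(data_dir),
    )
    conn.commit()
    return conn


# A tiny made-up sample tree standing in for the real repo data.
sample = {
    "data": {
        "transactionData": [
            {
                "name": "Recharge & bill payments",
                "paymentInstruments": [
                    {"type": "TOTAL", "count": 10, "amount": 1234.5}
                ],
            }
        ]
    }
}
tmp = Path(tempfile.mkdtemp())
quarter_file = tmp / "aggregated" / "transaction" / "country" / "india" / "2021" / "1.json"
quarter_file.parent.mkdir(parents=True)
quarter_file.write_text(json.dumps(sample))

conn = build_db(tmp, ":memory:")
rows = conn.execute(
    "SELECT year, quarter, name, type, txn_count, amount FROM aggregated_transaction"
).fetchall()
print(rows)
```

The year and quarter come from the file path rather than the JSON body, since the path is the one place where that information is always present.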
The same five tables are available as five CSV files in the data/v1/ sub-directory of the repo, for example the CSV file with all the aggregated transactions.
The data is now flat in the CSV files and the SQLite file, so it is easy to explore in notebooks, Metabase, or any other data exploration tool. If you're comfortable with SQL, analyze it using the Datasette tool.
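Once the data is flat, an aggregate query is a few lines of SQL. The sketch below runs against an in-memory table with made-up rows; the column names (year, quarter, name, txn_count, amount) are assumptions for illustration, so check the linked schema before running something similar against pulse.db.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical columns; the real table may name them differently.
conn.execute(
    "CREATE TABLE aggregated_transaction ("
    "year INTEGER, quarter INTEGER, name TEXT, txn_count INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO aggregated_transaction VALUES (?, ?, ?, ?, ?)",
    [
        (2020, 1, "Peer-to-peer payments", 100, 2500.0),
        (2020, 2, "Peer-to-peer payments", 150, 4000.0),
        (2020, 1, "Merchant payments", 80, 1200.0),
        (2021, 1, "Merchant payments", 120, 1800.0),
    ],
)

# Total transaction count and amount per year and category.
query = """
SELECT year, name, SUM(txn_count) AS total_count, SUM(amount) AS total_amount
FROM aggregated_transaction
GROUP BY year, name
ORDER BY year, name
"""
totals = conn.execute(query).fetchall()
for row in totals:
    print(row)
```

The same query can be pasted into Datasette or a notebook once the table and column names are adjusted to the real schema.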
High-level data quality observations.
- There is no currency unit for the amount field in any of the datasets. 🤦 Is the transaction amount in rupees or paise? Example: the transaction data.
- The amount field is a float with arbitrary precision (poor JSON conversion). Example: 6611459.8729725825. Money is typically represented as an integer or as a decimal (a float in JSON) with two-digit precision. What do the ten digits after the decimal point represent?
- In some datasets, the "from" and "to" date information is available (transaction), and in others it is missing (user_device). The only reliable way to get the dates is from the directory and file location.
- Two entries in the top transactions by pin code for the state of Ladakh have no name; the pin codes are missing.
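Two of the observations above can be worked around in code. The sketch below derives a date range from a file path and rounds an amount to two decimal places. It assumes the path ends in year/quarter.json and that the quarter number means a calendar quarter (1 = January to March); both function names are hypothetical, and rounding does not answer the open question of whether the unit is rupee or paisa.

```python
from datetime import date, timedelta
from decimal import Decimal, ROUND_HALF_UP
from pathlib import PurePosixPath


def period_from_path(path):
    """Return (start, end) dates for a path ending in <year>/<quarter>.json."""
    p = PurePosixPath(path)
    year, quarter = int(p.parent.name), int(p.stem)
    if quarter not in (1, 2, 3, 4):
        raise ValueError(f"unexpected quarter in path: {path}")
    # Assumption: quarters are calendar quarters.
    start = date(year, 3 * quarter - 2, 1)
    if quarter == 4:
        end = date(year, 12, 31)
    else:
        # Last day of the quarter = first day of the next quarter minus one day.
        end = date(year, 3 * quarter + 1, 1) - timedelta(days=1)
    return start, end


def round_amount(value):
    """Round a float amount to two decimal places without float artifacts."""
    return Decimal(str(value)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


period = period_from_path("data/aggregated/user/country/india/2021/2.json")
amount = round_amount(6611459.8729725825)
print(period, amount)
```

Going through Decimal keeps the rounded value exact, which matters if the amounts are later summed across many rows.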
Releasing datasets should be simple: keep the users (data scientists, analysts) in mind.
Tweet Thread
1. @PhonePe_ recently released Pulse data from their payment data. It was hard to get an overview of the data without doing some data transformation. Here is a thread about data format, transformation, and feedback about data quality. https://t.co/7QP0RwnL1p 🧵
— kracekumar || கிரேஸ்குமார் (@kracetheking) September 5, 2021
References:
- Pulse Repo: https://github.com/PhonePe/pulse
- Pulse Announcement Tweet: https://twitter.com/PhonePe_/status/1434054060148084736
- Pulse Plus Repo: https://github.com/kracekumar/pulse-plus
- Pulse SQLite DB: https://github.com/kracekumar/pulse-plus/blob/main/data/v1/pulse.db
- Datasette: https://datasette.io/
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.