The data is nested eight levels deep, and data serving a similar purpose is spread across multiple files. That makes it hard to run aggregate queries from the command line during data exploration.
The SQLite database holds the aggregated data and the top data in five tables: aggregated_user, aggregated_user_device, aggregated_transaction, top_user, and top_transaction. Link to the schema: https://github.com/kracekumar/pulse-plus#all-tables-schema.
Running python pulse/cli.py ../pulse/data --output pulse.db creates the SQLite file from the Pulse repo data.
The data is now flat, in CSV and SQLite files, and easy to explore in notebooks, Metabase, or any other data exploration tool. If you’re comfortable with SQL, analyze it using Datasette.
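As a quick first look at the flat data, here is a minimal sketch that lists every table in a SQLite file along with its row count, using only Python's standard library. The function name table_row_counts is made up for illustration; it reads the table names from the database itself, so it assumes nothing about the columns.

```python
import sqlite3


def table_row_counts(conn):
    """Return {table_name: row_count} for every user table in the database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )
    ]
    # The names come straight from sqlite_master, so quoting them as
    # identifiers is enough to build the COUNT query.
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in tables
    }
```

Calling table_row_counts(sqlite3.connect("pulse.db")) should report the five tables listed earlier, assuming pulse.db sits in the current directory.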
High-level data quality observations.
None of the datasets states a currency unit for the amount field. 🤦 Is the transaction amount in rupees or paise? Example: transaction data.
The amount field is a float with arbitrary precision (poor JSON conversion). Example: 6611459.8729725825. Money is typically represented as an integer, or as a decimal with two-digit precision (a float in JSON). What do the ten digits after the decimal point represent?
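One way a consumer could work around this, sketched under assumptions: parse the JSON floats as Decimal and round to two places. The record shape (a single "amount" key) and the function name parse_amount are hypothetical, and two decimal places is a guess, since the data does not say what unit the amount is in.

```python
import json
from decimal import Decimal, ROUND_HALF_UP


def parse_amount(raw_json, places="0.01"):
    """Read the 'amount' field as a Decimal rounded to the given precision."""
    # parse_float=Decimal keeps the digits exactly as written in the JSON
    # text instead of passing them through a binary float first.
    record = json.loads(raw_json, parse_float=Decimal)
    # Decimal(...) also covers amounts that JSON encodes as plain integers.
    amount = Decimal(record["amount"])
    return amount.quantize(Decimal(places), rounding=ROUND_HALF_UP)


# Hypothetical record using the amount quoted above.
sample = '{"amount": 6611459.8729725825}'
print(parse_amount(sample))  # 6611459.87
```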
In some datasets, “from” and “to” date information is available (transaction), and in others it is missing (user_device). The only reliable way to get dates is from the directory and file location.
Two entries in the top transactions by pin code for the state of Ladakh have no name: the pin codes are missing.
Releasing datasets should be simple; keep the users (data scientists, analysts) in mind.
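A rough sketch of recovering the period from the file location. It assumes each path ends in a year directory followed by a quarter file, such as .../2018/1.json, and that quarters are calendar quarters; both are assumptions about the layout, and period_from_path is a made-up helper name.

```python
import calendar
from datetime import date
from pathlib import PurePosixPath


def period_from_path(path):
    """Return (first_day, last_day) for a path ending in <year>/<quarter>.json."""
    p = PurePosixPath(path)
    year = int(p.parent.name)  # directory name, e.g. "2018"
    quarter = int(p.stem)      # file name without ".json", e.g. "1"
    if not 1 <= quarter <= 4:
        raise ValueError(f"unexpected quarter in path: {path}")
    start_month = 3 * (quarter - 1) + 1
    end_month = start_month + 2
    # monthrange returns (weekday of the 1st, number of days in the month).
    last_day = calendar.monthrange(year, end_month)[1]
    return date(year, start_month, 1), date(year, end_month, last_day)
```

With this layout, a file at 2018/1.json would map to 1 January 2018 through 31 March 2018.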
1. @PhonePe_ recently released Pulse data from their payment data. It was hard to get an overview of the data without doing some data transformation. Here is a thread about data format, transformation, and feedback about data quality. https://t.co/7QP0RwnL1p 🧵— kracekumar || கிரேஸ்குமார் (@kracetheking) September 5, 2021
- Pulse Repo: https://github.com/PhonePe/pulse
- Pulse Announcement Tweet: https://twitter.com/PhonePe_/status/1434054060148084736
- Pulse Plus Repo: https://github.com/kracekumar/pulse-plus
- Pulse SQLite DB: https://github.com/kracekumar/pulse-plus/blob/main/data/v1/pulse.db
- Datasette: https://datasette.io/
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.