Incomplete data is useless - COVID-19 India data

Data is a representation of reality. When a value is missing from a piece of data, the data becomes less useful and less reliable. Every day, articles and news reports about COVID-19 discuss new cases, recovered cases, and deceased cases. This information gives you a sense of hope, reality, or confusion.

Regarding COVID-19, everyone believes or accepts specific details as fact, such as a mortality rate of 2 to 3 percent, or a 30 to 50 percent chance of death over the age of fifty. These figures are established from previously affected places, and some come out of simulations. The mortality rate, deceased age distribution, patient age distribution, and mode of spread differ from region to region and country to country. With accurate and complete data, one can understand the situation, make decisions, and update the facts.

Covid19India provides an API with details of all the COVID-19 cases. The dataset for the entire post comes from an API response on 18th April 2020. The API endpoints used for the analysis are raw_data.json and deaths_recoveries.json.

When one moves away from cumulative data like total cases to specific data like age and gender, the details are largely missing, which makes it difficult to comprehend what’s going on in a state. For all patients in India, only 11% of age brackets and 19.2% of gender details are available.
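The availability percentages above come down to counting blank fields. Here is a minimal sketch of that calculation, assuming each patient is a dictionary and that the field names are `agebracket` and `gender`; both names and the sample records are assumptions for illustration, not real patient data.

```python
def missing_share(records, field):
    """Return the percentage of records whose `field` is absent or blank."""
    if not records:
        return 0.0
    missing = 0
    for record in records:
        value = record.get(field)
        # Treat None, empty strings, and whitespace-only strings as missing.
        if value is None or not str(value).strip():
            missing += 1
    return 100.0 * missing / len(records)


# Made-up records for illustration only; field names are assumed.
sample = [
    {"agebracket": "45", "gender": "M"},
    {"agebracket": "", "gender": "F"},
    {"agebracket": "", "gender": ""},
    {"agebracket": " ", "gender": ""},
]

age_missing = missing_share(sample, "agebracket")
gender_missing = missing_share(sample, "gender")
print(f"age bracket missing: {age_missing:.1f}%")
print(f"gender missing: {gender_missing:.1f}%")
```

The share of available data is simply 100 minus the missing share.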

Analyzing missing data (empty value = '') state-wise shows how each state releases data. The image below compares missing data for the states of Karnataka and Maharashtra.

Looking further into each case, it’s clear that Karnataka officials release data at the individual level, such as age bracket, gender, and other details, in a tabular format (not machine-readable). Maharashtra, by comparison, releases only cumulative data like the total number of new cases and the total number of recovered patients. Each state follows its own format (not so useful). Next to Karnataka, Andhra Pradesh data contains more than 50% of age bracket and gender details. The rest of the states, except Kerala to a certain extent, have close to 90% missing data for gender and age bracket.
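The state-wise comparison can be sketched as a group-by over the patient records. This is only an illustration: the state field name `detectedstate` is an assumption, and the records below are invented to show the shape of the result.

```python
from collections import defaultdict


def missing_by_state(records, field, state_field="detectedstate"):
    """Return {state: percentage of records with a blank `field`}."""
    totals = defaultdict(int)
    blanks = defaultdict(int)
    for record in records:
        state = record.get(state_field) or "Unknown"
        totals[state] += 1
        value = record.get(field)
        if value is None or not str(value).strip():
            blanks[state] += 1
    return {state: 100.0 * blanks[state] / totals[state] for state in totals}


# Invented records; `detectedstate` and `agebracket` are assumed field names.
sample = [
    {"detectedstate": "Karnataka", "agebracket": "34"},
    {"detectedstate": "Karnataka", "agebracket": ""},
    {"detectedstate": "Maharashtra", "agebracket": ""},
    {"detectedstate": "Maharashtra", "agebracket": ""},
]

by_state = missing_by_state(sample, "agebracket")
for state, share in sorted(by_state.items()):
    print(f"{state}: {share:.0f}% of age brackets missing")
```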

With missing data, we can’t answer basic questions. Which age group is dying in Maharashtra? Which age group is most affected? Does the deceased age group vary across states? Is there a state where a considerable number of young people are dying?

Karnataka Case

If we divide the age bracket into ranges of ten, like 40 - 50, all the age groups are more or less equally affected. There is no significant variation. You can also use arbitrary age groups like 0 to 45, 45 to 60, 60 to 75, and 75+, as mentioned in the tweet.
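Both groupings are easy to try on the same ages. The sketch below is one possible way to bucket them, assuming ages arrive as strings and that blank or non-numeric values should be skipped; the sample ages are invented.

```python
from bisect import bisect_right
from collections import Counter


def parse_age(value):
    """Return the age as a float, or None when it is blank or non-numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def decade_bucket(age):
    """Map an age to a ten-year range such as '40-50'."""
    low = int(age // 10) * 10
    return f"{low}-{low + 10}"


def custom_bucket(age, edges=(45, 60, 75)):
    """Map an age to arbitrary groups; defaults give 0-45, 45-60, 60-75, 75+."""
    labels = [f"0-{edges[0]}"]
    labels += [f"{a}-{b}" for a, b in zip(edges, edges[1:])]
    labels.append(f"{edges[-1]}+")
    return labels[bisect_right(edges, age)]


# Invented ages, including a blank and a range that cannot be parsed.
raw_ages = ["45", "52", "", "8", "47", "80", "28-35"]
ages = [a for a in (parse_age(v) for v in raw_ages) if a is not None]

decade_counts = Counter(decade_bucket(a) for a in ages)
custom_counts = Counter(custom_bucket(a) for a in ages)
print(dict(decade_counts))
print(dict(custom_counts))
```

A boundary age such as 45 falls into the upper group (45-60) in this sketch; that choice is arbitrary and worth stating when comparing against someone else's numbers.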

The raw_data.json API provides the status of the patients. The patient status change numbers don’t match those in the primary web application. My guess is that this is because of the API update frequency.

The patients in the age group 0-10 take more time to recover compared to the rest of the age groups. The age group 10-20 takes less time to recover compared to the rest. There are two issues here. First, the dataset is small, only 60 cases. Second, the accuracy of dates makes a considerable difference; i.e., dateannounced and statuschangedate in the API are crucial. If the data were available for the top three affected states, Maharashtra, Delhi, and Gujarat, it would reveal which age groups and genders are recovering fast or dying, and in how many days the acute patients die.
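The recovery time is the gap between the two dates, which is why their accuracy matters so much. Here is a rough sketch of the calculation, assuming the fields are named `dateannounced` and `statuschangedate` and use a DD/MM/YYYY format; the format and the sample records are assumptions.

```python
from datetime import datetime
from statistics import median

DATE_FORMAT = "%d/%m/%Y"  # assumed date format, not confirmed by the post


def days_to_status_change(record):
    """Return days between announcement and status change, or None if unusable."""
    try:
        start = datetime.strptime(record["dateannounced"], DATE_FORMAT)
        end = datetime.strptime(record["statuschangedate"], DATE_FORMAT)
    except (KeyError, TypeError, ValueError):
        return None  # missing or malformed date
    delta = (end - start).days
    # A negative gap points to a data-entry error, so drop it.
    return delta if delta >= 0 else None


# Invented records for illustration only.
sample = [
    {"dateannounced": "01/04/2020", "statuschangedate": "15/04/2020"},
    {"dateannounced": "05/04/2020", "statuschangedate": "15/04/2020"},
    {"dateannounced": "10/04/2020", "statuschangedate": ""},
    {"dateannounced": "12/04/2020", "statuschangedate": "10/04/2020"},
]

durations = [d for d in map(days_to_status_change, sample) if d is not None]
median_days = median(durations)
print(f"usable records: {len(durations)}, median days: {median_days}")
```

With only a handful of usable records per age group, a single wrong date shifts the median noticeably, which is the second issue described above.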

Deceased Case

From the deaths_recoveries.json API, the age bracket is available for 39.6% of deceased cases. In Karnataka for

In the available data, the highest numbers of deaths are in the age groups 40-50 (44 deaths) and 20-30 (25 deaths). To get a better picture, one needs to compare with the population distribution as well. Without complete details, it’s challenging to make a claim such as the age group 20-30 having a mortality rate of 0.1%.

The mere number of deceased patients doesn’t represent anything close to reality. Out of 13 deaths in Karnataka, age bracket and gender are available for 10 cases. All deceased cases are in the age bracket of 65 to 80, two females and eight males.

The small dataset suggests all corona-related deaths happen within seven days of identification. Is this true for all states?


I’m aware that volunteers maintain the API. Their effort deserves a special mention, and analyzing the two sets of APIs makes it clear how hard it is to mark patients’ status changes. When the Government doesn’t release a unique patient id, it’s confusing, and local volunteers’ intuition and group knowledge take over.

Every data point helps us know more about the pandemic. All the state governments need to release complete data in a usable format. Remember, age and gender are necessary information; with more details like existing respiratory conditions and the hospital allocated, a model can help in prioritizing resources later. There is no replacement for testing, and testing early. Without clean data, it’s impossible to track the epidemic and the associated rampage. We haven’t seen cases of re-infection yet. Several other factors contribute to one’s survival, like place of residence, access to insurance, and socioeconomic status. We haven’t moved further than numbers; details like how critical a patient is when identified and how each patient is recovering are never released in public.

By furnishing incomplete data, the Government denies us the choice of making an informed decision, and there is no independent verification of the claims about the pattern of the spread. In the prevailing conditions, the only way to understand the scenario is through both qualitative and quantitative (data) stories.

Code: Age Analysis, Death Recoveries


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.