
The Synthetic Population

At the core of BharatSim is a synthetic population, generated using multiple data sources. The resulting population of individuals and households with demographic attributes resembles “reality” in the sense that if an identical survey were carried out on the synthetic population, it would yield results that are statistically similar to those from the true population.

A synthetic population is a simplified individual-level representation of the actual population. This means that while every person is represented individually in it, not all of their attributes are included (for example, hair colour or shoe size are deemed irrelevant for modelling epidemic spread, and are thus ignored, while the presence of comorbidities like diabetes would be included). As such, a synthetic population does not aim to be identical to the actual population, but instead attempts to match its various statistical distributions and correlations, thereby being sufficiently close to the true population to be used in modelling.

| Col | Title | Datatype |
| --- | --- | --- |
| 1 | AgentID | int64 |
| 2 | SexLabel | string |
| 3 | Age | int64 |
| 4 | Height | float64 |
| 5 | Weight | float64 |
| 6 | Religion | string |
| 7 | Caste | string |
| 8 | M_Fever | bool |
| 9 | M_Cough | bool |
| 10 | M_Diarrhea | bool |
| 11 | M_Cataract | bool |
| 12 | M_TB | bool |
| 13 | M_HighBP | bool |
| 14 | M_HeartDisease | bool |
| 15 | M_Diabetes | bool |
| 16 | M_Leprosy | bool |
| 17 | M_Cancer | bool |
| 18 | M_Asthma | bool |
| 19 | M_Polio | bool |
| 20 | M_Paralysis | bool |
| 21 | M_Epilepsy | bool |
| 22 | StateLabel | string |
| 23 | District | string |
| 24 | JobType | string |
| 25 | EssentialWorker | bool |
| 26 | AdminUnit_Name | string |
| 27 | AdminUnit_Lat | float64 |
| 28 | AdminUnit_Lon | float64 |
| 29 | HHID | int64 |
| 30 | H_Lat | float64 |
| 31 | H_Lon | float64 |
| 32 | WorkPlaceID | int64 |
| 33 | W_Lat | float64 |
| 34 | W_Lon | float64 |
| 35 | WorkPlace_AdminUnit | string |
| 36 | SchoolID | int64 |
| 37 | School_Lat | float64 |
| 38 | School_Lon | float64 |
| 39 | School_AdminUnit | string |
| 40 | PublicPlaceID | int64 |
| 41 | PublicPlace_Lat | float64 |
| 42 | PublicPlace_Lon | float64 |
| 43 | AdherenceToIntervention | float64 |
| 44 | UsesPublicTransport | bool |
Columns in the synthetic population. The synthetic population contains 44 attributes describing each agent. Columns 1-7 provide demographic and social identifiers; columns 8-21 describe comorbidities; columns 22-42 provide geographical information about homes, workplaces, and schools; and columns 43-44 describe behavioural and mobility attributes. Datatypes of each attribute are also indicated: integer, float, string, or boolean, as appropriate.

An example

The table below shows an example section of a synthetic population. Each row represents an individual with a unique ID, as well as certain attributes. These attributes could be related to the individual themselves (like their gender, age, height, and so on), or to their network (details pertaining to their homes, workplaces, and possibly schools). Additionally, the population could also contain information regarding the individual’s comorbidities (for example, whether they have diabetes or other preexisting conditions), if this is deemed relevant to the modelling exercise.

| Agent_ID | SexLabel | Age | Height | Weight | JobLabel | StateLabel | District | AdminUnitName | AdminUnitLatitude | AdminUnitLongitude | HHID | H_Lat | H_Lon | WorkPlaceID | W_Lat | W_Lon | school_id | school_lat | school_long | public_place_id | public_place_lat | public_place_long | essential_worker | Adherence_to_Intervention | M_High_BP | M_Diabetes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 99802077568 | Female | 76 | 147.37 | 49.05 | Construction | Maharashtra | Pune | Mohammadwadi-Kausar Baug | 18.47477 | 73.89257 | 99801684702 | 18.46709 | 73.90603 | 2001000003467 | 18.4679 | 73.89859 | 0 |  |  | 3001000000062 | 18.45035 | 73.87068 | 0 | 1 | 0 | 0 |
| 99801380107 | Male | 37 | 162.03 | 57.94 | Ag labour | Maharashtra | Pune | Nagpur Chawl-Phule Nagar | 18.55893 | 73.87609 | 99801248473 | 18.55952 | 73.87877 | 2001000006630 | 18.58283 | 73.91661 | 0 |  |  | 3001000001044 | 18.52699 | 73.83451 | 1 | 0 | 0 | 0 |
| 99802408169 | Male | 6 | 111.21 | 23.13 | Student | Maharashtra | Pune | Kharadi-Chandan Nagar | 18.56355 | 73.93256 | 99800525921 | 18.54846 | 73.94971 | 0 |  |  | 2001000002070 | 18.55683 | 73.94757 | 3001000000274 | 18.54904 | 73.9491 | 0 | 1 | 0 | 0 |
| 99800402683 | Female | 37 | 152.65 | 52.61 | Ag labour | Maharashtra | Pune | Karve Nagar | 18.49149 | 73.82173 | 99800473441 | 18.48539 | 73.82129 | 2001000006876 | 18.53875 | 73.92594 | 0 |  |  | 3001000000650 | 18.48382 | 73.79731 | 1 | 0 | 0 | 0 |
| 99800824202 | Female | 35 | 150.92 | 52.42 | Homebound | Maharashtra | Pune | Deccan Gymkhana-Model Colony | 18.51845 | 73.83391 | 99800895513 | 18.52335 | 73.85339 | 0 |  |  | 0 |  |  | 3001000000236 | 18.54026 | 73.91186 | 0 | 0 | 0 | 0 |
| 99801178045 | Female | 50 | 151.51 | 50.1 | Ag labour | Maharashtra | Pune | Shanivar Peth-Sadashiv Peth | 18.51944 | 73.85194 | 99801142021 | 18.51388 | 73.84935 | 2001000000636 | 18.50785 | 73.84921 | 0 |  |  | 3001000001403 | 18.51024 | 73.84731 | 0 | 0.9 | 0 | 0 |
Section of a synthetic population. Excerpt from a synthetically created population for the city of Pune.

All of these attributes are strongly correlated with each other, and a good synthetic population will ideally be able to reproduce the correlations that occur in the real world. However, this is a monumental task: real-world data is complex, and often contains many artifacts that need to be addressed.

Datasets and data preparation

We use a variety of data sources to generate a population of individuals and households with demographic attributes that are statistically similar to real data. This population is generated using statistical methods and machine learning algorithms that are flexible enough to generate data at various administrative levels, ranging from small communities to states. The primary sources of data for these algorithms include the Census of India, the India Human Development Survey (IHDS), the National Sample Survey (NSS), the National Family Health Survey (NFHS), and the Gridded Population of the World (GPW).

While the synthetic population should faithfully reproduce demographic statistics, it must also incorporate other realistic network structures, such as those appropriate to households and workplaces. (Otherwise, we could end up, for example, with “families” composed entirely of toddlers, or workplaces with strange mixes of professions.) Because different kinds of data respond well to different techniques, a hybrid process is used to scale up these datasets. First, the data is cleaned to remove obvious inconsistencies. Next, a customised hybrid of Iterative Proportional Fitting (IPF), Iterative Proportional Updating (IPU), and a specialised variant of a neural network, called a Conditional-Tabular Generative Adversarial Network (CTGAN), is used to generate new data.

Briefly, Iterative Proportional Fitting finds a joint distribution that matches the given marginals, while trying to stay as close to the sample distribution as possible. Iterative Proportional Updating is a heuristic iterative approach that can simultaneously fit multiple distributions (constraints). Finally, a Conditional-Tabular Generative Adversarial Network is a method for modelling the distribution of tabular data and sampling rows from that distribution. A Generative Adversarial Network (GAN) uses two “competing” neural networks, the generator and the discriminator. The generator creates realistic samples with the goal that the discriminator should be unable to differentiate between a real sample and a generated sample. In this zero-sum game, the capabilities of both networks are enhanced iteratively.
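To make the IPF step concrete, here is a minimal sketch in plain Python for a two-way table. It is an illustration under stated assumptions, not the BharatSim implementation: the function name `ipf`, the seed counts, and the target marginals are all hypothetical. The seed table plays the role of a survey sample, and the targets play the role of census marginals.

```python
def ipf(seed, row_targets, col_targets, iterations=100, tol=1e-9):
    """Rescale a 2-D seed table until its row and column sums match the targets.

    Illustrative sketch only: assumes positive seed counts and targets
    whose totals agree.
    """
    table = [row[:] for row in seed]  # work on a copy of the seed
    for _ in range(iterations):
        # Row step: scale each row so that it sums to its target.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [value * target / total for value in table[i]]
        # Column step: scale each column so that it sums to its target.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
        # Stop once the row sums still match after the column step.
        error = max(abs(sum(row) - t) for row, t in zip(table, row_targets))
        if error < tol:
            break
    return table


# Hypothetical sample counts: rows are age groups (child, adult),
# columns are sex (female, male).
seed = [[20.0, 30.0], [25.0, 25.0]]
fitted = ipf(seed, row_targets=[400.0, 600.0], col_targets=[480.0, 520.0])
```

Because each step only multiplies whole rows or whole columns by a constant, the odds ratio of the seed table is preserved. This is the sense in which the fitted table stays close to the sample while matching the marginals.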

Critically, our techniques are designed to work seamlessly across data-scarce and data-rich areas; even if a particular area has error-prone or missing data, a synthetic population can still be generated, albeit of a lower quality.

Generating a synthetic population

Schematic of population generation
Schematic overview of the synthetic population generation pipeline. Census and survey data are first processed and scaled up via IPU to create a population of the required size while respecting age, sex, and household-size distributions. Gridded density data is used to generate geo-locations, to which agents are assigned. Summary statistics and microdata are fed into a CTGAN to produce additional individual-level attributes.

We use IPF to generate a base population, using census data for the demographics and the IHDS survey dataset for personal and household attributes. The base population thus consists of individual data and household data. We assign each household to an administrative unit within a district.

We also experimented with CTGAN to generate a base population. The major advantage of IPU over CTGAN is that IPU is capable of matching both individual-level and household-level characteristics, while making sure that the members of each household have a realistic joint distribution of age and gender.

To assign job labels to individuals, the relevant data from the IHDS dataset is used. For the time being, we classify individuals below the age of 18 as students, but could easily relax this assumption. A subset of the population is also assumed to be home-bound. This subset consists of unemployed individuals, homemakers, infants and children under the age of 3, and elderly people over the retirement age. We use data from the NSS to determine the percentage of adult males and females in a city who are home-bound.
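The labelling rules above could be sketched roughly as follows. This is a simplified reading, not the project's code: the retirement age of 60, the home-bound fractions, and the function name `assign_job_label` are assumptions made for illustration, and the real fractions would come from the NSS data.

```python
import random

RETIREMENT_AGE = 60  # assumption: the source does not state the value used
# Placeholder fractions of home-bound adults by sex (NOT real NSS figures).
HOMEBOUND_FRACTION = {"Female": 0.5, "Male": 0.1}


def assign_job_label(age, sex, survey_job_label, rng,
                     homebound_fraction=HOMEBOUND_FRACTION):
    """Return a job label following a simplified version of the rules above."""
    if age < 3:
        return "Homebound"  # infants and children under 3
    if age < 18:
        return "Student"    # everyone else below 18 is treated as a student
    if age > RETIREMENT_AGE:
        return "Homebound"  # elderly people over the retirement age
    # Adults: a sex-specific fraction is marked home-bound at random.
    if rng.random() < homebound_fraction[sex]:
        return "Homebound"
    # Otherwise keep the label derived from the survey data.
    return survey_job_label


rng = random.Random(0)
label = assign_job_label(6, "Male", "Construction", rng)
```

Passing the random generator in explicitly keeps the assignment reproducible, which is convenient when a population has to be regenerated.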

Each student in the population is assigned a school. Similarly, each working individual is assigned a workplace based upon their job label. We generate a synthetic latitude and longitude pair for each home, school and workplace in our dataset using GADM grid population density data.
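One plausible way to turn gridded density data into synthetic coordinates is sketched below. The source does not describe the exact procedure, so everything here is an assumption: grid cells are taken to be latitude/longitude rectangles, a cell is picked with probability proportional to its density, and the point is placed uniformly inside it. The cell values and the name `sample_location` are hypothetical.

```python
import random


def sample_location(cells, rng):
    """Draw a (lat, lon) pair from density-weighted grid cells.

    Each cell is a tuple (lat_min, lat_max, lon_min, lon_max, density).
    Assumed scheme: choose a cell in proportion to its density, then
    place the point uniformly at random within that cell.
    """
    weights = [cell[4] for cell in cells]
    lat_min, lat_max, lon_min, lon_max, _ = rng.choices(cells, weights=weights, k=1)[0]
    return rng.uniform(lat_min, lat_max), rng.uniform(lon_min, lon_max)


# Hypothetical grid: one empty cell and one populated cell.
cells = [
    (18.4, 18.5, 73.8, 73.9, 0.0),
    (18.5, 18.6, 73.9, 74.0, 1200.0),
]
rng = random.Random(42)
home_lat, home_lon = sample_location(cells, rng)
```

With this scheme, cells with zero density never receive a home, school, or workplace, and denser cells receive proportionally more.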

To assign an individual a school, we sample from the list of schools within that geographical boundary, weighting each school by the inverse of the Euclidean distance between it and the individual's home. This weighting factor increases the probability of assigning an individual a school that is closer to their home. We follow a similar method to assign workplaces to adults. Additionally, based on every individual's job label, a workplace is assigned at random from a suitable subset of allowed workplaces.
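The inverse-distance weighting described above can be sketched in a few lines. This is an illustrative sketch rather than the actual pipeline code: the school identifiers, coordinates, the function name `assign_school`, and the small `eps` term (added only to avoid division by zero when a school coincides with the home) are assumptions.

```python
import math
import random


def assign_school(home, schools, rng, eps=1e-6):
    """Pick a school ID with probability proportional to 1 / distance.

    `home` is a (lat, lon) pair and `schools` maps a school ID to a
    (lat, lon) pair. Distances are Euclidean in coordinate space,
    as described in the text.
    """
    school_ids = list(schools)
    weights = [1.0 / (math.dist(home, schools[s]) + eps) for s in school_ids]
    return rng.choices(school_ids, weights=weights, k=1)[0]


# Hypothetical schools inside one geographical boundary.
schools = {"near": (18.551, 73.951), "far": (18.60, 73.85)}
rng = random.Random(1)
picks = [assign_school((18.55, 73.95), schools, rng) for _ in range(2000)]
```

Because the choice is probabilistic rather than "nearest school wins", distant schools still receive some students, only less often than nearby ones.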

A number of additional attributes are included in our synthetic population, including whether an individual uses public transport or whether an individual is an essential worker. These values are assigned using the individual's job label.

Verifying the generated population

The figure below compares the marginal distributions of age, height, and weight across samples of the survey and the synthetic population for the combined districts of Mumbai and Mumbai Suburban. We work with a randomly chosen subset comprising 10,000 individuals to compare with the survey data, although our synthetic population has 12 million individuals.

Histograms comparing age, height, and weight distributions between survey and synthetic population
Histograms of marginal distributions. The distributions of age, height, and weight in the synthetic population for the combined districts of Mumbai and Mumbai Suburban. Panels (A)–(C) show the distributions for adults; (D)–(F) show the same for children. In each case, the synthetic distribution is plotted alongside the survey distribution, and the similarity is assessed using the Kolmogorov–Smirnov test.

The table below summarises two statistical tests: the two-sample Kolmogorov–Smirnov (KS) test and the Chi-Squared (CS) test. These tests are conducted on compatible columns in both the survey and the synthetic population: the CS test is applied to categorical or Boolean columns (sex in our population), while the KS test is applied to numerical columns (age, height, and weight). For the CS test, the confidence level is the p-value; for the KS test we report 1 − (KS statistic).

| Test | Confidence level |
| --- | --- |
| CS Test for Sex | 0.99 |
| KS Test for Age | 0.98 |
| KS Test for Height | 0.91 |
| KS Test for Weight | 0.95 |
Statistical benchmarks of the synthetic population. To compare distributions of key columns, we use the chi-square test for discrete variables (reported as a p-value) and the Kolmogorov–Smirnov test for continuous variables (reported as 1 − (KS statistic)).
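The two reported quantities can be computed along the following lines. This is a sketch under assumptions, not the code used to produce the table: the function names and sample values are hypothetical, and the chi-squared helper is a simplified two-category goodness-of-fit test that treats the survey proportions as the expected ones (for one degree of freedom, the p-value equals erfc(sqrt(chi2 / 2))).

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def chi_square_two_categories(observed, expected_fractions):
    """Chi-squared statistic and p-value for a two-category variable (1 d.o.f.)."""
    n = sum(observed)
    chi2 = sum((o - n * f) ** 2 / (n * f)
               for o, f in zip(observed, expected_fractions))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical ages from the survey and from the synthetic population.
survey_ages = [5, 12, 23, 37, 41, 58, 76]
synthetic_ages = [6, 11, 25, 35, 44, 57, 70]
age_score = 1 - ks_statistic(survey_ages, synthetic_ages)  # reported as 1 - KS

# Hypothetical counts of (female, male) agents against survey proportions.
chi2, sex_p_value = chi_square_two_categories([4980, 5020], [0.5, 0.5])
```

In both cases a value closer to 1 indicates closer agreement between the survey and the synthetic population, which is how the figures in the table should be read.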

Critically, our techniques are designed to work across both data-scarce and data-rich areas: even if a particular area has error-prone or missing data, we can still generate a synthetic population for it, albeit of slightly poorer quality, without affecting anything else.
