Data Visualization¶
This data visualization module, created for the IFMR-LEAD staff meet to be held on 28th and 29th April 2016, will touch upon various topics (listed below). However, only basic data visualization with R and Tableau Public will be covered in detail.
If time permits (and if everyone’s interested), we could cover a few advanced tools as well. We encourage you to set up everything before the session, by following the steps described in ‘Getting Started’.
Getting Started¶
Please follow these steps to install necessary software before the session:
- Download and install R
- Download and install RStudio
- Download and install Tableau Public. You may be asked to sign up with your email address.
Tip
There are tons of resources for learning R online, but my favourite is Datacamp. Note that knowledge of R is not a prerequisite; we will walk through each line of code at the session.
Tip
If you’re interested in getting started with Tableau Public before the session, just open it up and play around with it. If you want more structure, here are some YouTube tutorials.
Exercises¶
In teams of 2, select one of the following 10 datasets to use. Your task is to:
- Get a basic understanding of the data set: use the internet, make summary statistics, etc.
- Explore the data and think about some questions you might want to answer through visual representation of the data
- Choose an audience that you will create a visualization and a report for: government officials, a private sector partner, your PIs, a classroom of college students, etc.
- Choose which tools you will use to produce the visualization (Raw, Tableau, R, Stata, etc.)
- Get to work and produce an interesting graphic that (i) clearly highlights what you are investigating (in terms of variables, measures, dimensions, etc.) (ii) is easily digestible and (iii) is supported by a description substantiating your findings (even if there are none)
- Email us your final visualization (PNG / URL). We will put each of these visualizations up, and you will get to quickly explain your intent and findings (if any).
Datasets¶
No | Link | Title | Description |
---|---|---|---|
1 | https://data.gov.in/catalog/basic-habitation-information | Basic Habitation Information As On 1st April 2012 | The data refers to the list of habitations, its population in different caste category (SC, ST and GENERAL) and status of availability of potable drinking water (Covered or Partially covered) all over India. |
2 | https://data.gov.in/catalog/human-development-index-and-its-components-states | Human Development Index and its Components by States, 1999-00 and 2007-08 | The standard of living and command over resources, as reflected in the monthly per capita expenditure adjusted for inflation and inequality. By state |
3 | https://data.gov.in/catalog/out-school-children-6-17-years | Out of school children (6 to 17 years), by major religious communities, 2007-08 | Children out of school are the number of primary-school-age children not enrolled in primary or secondary school. The data are disaggregated by social groups (Scheduled Caste, Scheduled Tribe, Other Backward Class) and major religious communities (Hindu, Muslim, Sikh, Christian). |
4 | https://data.gov.in/catalog/indian-railways-train-time-table-0 | Indian Railways Time Table for trains available for reservation as on 03.08.2015 | Indian Railways time table data. It contains train-wise departure and arrival times at various stations. It also provides information on the route, distance covered, source station, destination station, etc. |
5 | https://data.gov.in/catalog/time-taken-completing-roads-based-sample-survey | Time Taken For Completing The Roads - Based on Sample Survey 2010 | The data refers to information on Time taken for completing the Roads (as Percentage of Total Roads taken in each Sample State). It provides state-wise and scheme-wise details for number of roads in the sample, Percentage Unfinished, Percentage of Time Taken for Completion (1-9 Months, 9-12 Months, 12-18 Months, beyond 18 Months). The outcome is based on the sample survey undertaken by Programme Evaluation Organisation (PEO), Planning commission which covered 14 districts, 27 blocks, 138 roads, 138 habitations and 1380 beneficiary households spread over 7 states of India for 'Evaluation Study on Rural Roads component of Bharat Nirman, 2010'. The reference period for the study was 2005-06 to 2006-07. |
6 | https://data.gov.in/catalog/issuance-visa-various-foreign-nationals-against-various-categories-visas | Issuance of VISA to Various Foreign Nationals against Various Categories of VISAs | The data refers to issuance of VISAs to various foreign nationals against various Visa Types. Such VISA categories are Diplomatic, Employment, Tourist, Business, Conference, Entry, Medical, Missionary, Pilgrimage, Research, Transit, Student, Project, etc. Immigration, Visa, Foreigners Registration and Tracking (IVFRT) is one of the central MMPs in the National eGovernance Plan (NeGP), which is conceptualized with an aim to enhance the experience of in-bound and out-bound travellers from and to India by looking into the aspects of Passport, Visa, Immigration, Foreigners Registration and Tracking. |
7 | https://data.gov.in/resources/total-number-registered-motor-vehicles-india-during-1951-2012 | Total Number of Registered Motor Vehicles in India | The data refers to Total Number of Registered Motor Vehicles in India. Registered vehicles have been categorized as Two Wheelers, Cars, Jeeps and Taxis, Buses, Goods Vehicles and Others. |
8 | https://data.gov.in/catalog/tourism-statistics-india | Tourism Statistics of India | Foreign tourist arrivals refer to the number of arrivals of tourists/visitors. An individual who makes multiple trips to the country is counted each time as a new arrival. Foreign Exchange Earnings from tourism are the receipts of the country as a result of consumption expenditure, i.e. payments made for goods and services acquired, by foreign visitors in the economy out of the foreign currency brought by them. The number of Domestic Tourist Visits to different States and Union Territories (UTs) are being compiled based on the information received from them. |
9 | https://data.gov.in/resources/power-supply-position-september-2015-0 | Power Supply Position | Power supply position report prepared by Grid Operation & Distribution wing of CEA provides information about the monthly demand and availability of power / energy at various states of India. |
10 | https://data.gov.in/catalog/india-macro-economic-indicators-summary-table | Summary Table of Macro-economic Indicators of India | Summary Table of Macro-economic Indicators of India as on March 2013 |
ggplot2¶
We will use the mtcars dataset, which comes pre-loaded in the R session. Explore the dataset by executing str(mtcars) and summary(mtcars) in the R console.
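A quick first look might go as follows. This is a minimal sketch using only base R; the object names dims and hp_mpg are ours and are not part of the session script.

```r
# mtcars ships with base R (datasets package), so nothing needs loading
str(mtcars)        # column types and a preview of the values
summary(mtcars)    # min, quartiles, mean and max for each column

# Number of cars (rows) and variables (columns)
dims <- dim(mtcars)

# The two variables used in the scatterplots: horsepower and miles per gallon
hp_mpg <- mtcars[, c("hp", "mpg")]
head(hp_mpg)
```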
Note
You can download the full R Script to reproduce the analysis here.
Load the ggplot2 package using:
library(ggplot2)
Scatterplots¶
Plotting with base R:
plot(mtcars$hp,mtcars$mpg)
Plotting with ggplot2:
scatterplot <- ggplot(mtcars, aes(x = hp,y = mpg)) +
geom_point()
Add axis titles:
scatterplot <- scatterplot +
labs(title = "MPG vs HP",
x = "HP",
y = "Miles Per Gallon")
Convert gear to a factor so that the legend is discrete, and add a regression line for the final graph:
mtcars$gear <- as.factor(mtcars$gear)
scatterplot <- ggplot(mtcars, aes(x = hp,y = mpg)) +
  geom_point(aes(colour = gear)) +
  geom_smooth(method = "lm",se = FALSE) +
  scale_colour_discrete(name = 'Gear',
                        breaks = c('3','4','5'),
                        labels = c('Low','Medium','High')) +
  labs(title = "MPG vs HP",
       x = "HP",
       y = "Miles Per Gallon")
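geom_smooth(method = "lm") draws an ordinary least squares line of y on x. As a sketch, the same fit can be inspected numerically with base R's lm(), here assuming mpg on the y axis and hp on the x axis; the object name fit is ours.

```r
# Fit miles per gallon as a linear function of horsepower
fit <- lm(mpg ~ hp, data = mtcars)

# Intercept and slope of the regression line
coef(fit)

# Fuller summary: standard errors, R-squared, and so on
summary(fit)
```

A negative slope means that cars with more horsepower tend to cover fewer miles per gallon.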
Histograms¶
ggplot(mtcars, aes(x = mpg)) +
geom_histogram(binwidth = 5)
Make the graph pretty:
ggplot(mtcars, aes(x = mpg)) +
geom_histogram(binwidth = 5,
colour = "black",
fill = "white") +
labs(title = "Distribution - Miles Per Gallon",
x = "Miles Per Gallon",
y = "Count")
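To see how binwidth translates into actual bins, the computed layer data can be inspected without drawing the plot. The following is a small sketch; the object names p and bins are ours.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5)

# layer_data() returns the data ggplot2 computed for a layer (here, layer 1)
bins <- layer_data(p, 1)[, c("xmin", "xmax", "count")]
bins
```

Each row is one bar: xmin and xmax are the bin edges, and count is the number of cars falling in that bin. Trying a different binwidth and re-running shows how the shape of the distribution changes.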
Saving your graphs¶
?ggsave gives you a description of the parameters that the function takes, and their defaults:
ggsave(filename, plot = last_plot(), device = NULL, path = NULL,
scale = 1, width = NA, height = NA, units = c("in", "cm", "mm"),
dpi = 300, limitsize = TRUE, ...)
To save our scatterplot, we might want to use the following parameters:
ggsave(filename = 'FULL_FILE_PATH.FILE_TYPE',
plot = scatterplot,
width = 6,height = 5.4)
FILE_TYPE can be tex, pdf, jpeg, png, svg and others (refer to the documentation using ?ggsave).
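As a runnable sketch of the call above, the following saves a PNG into R's temporary folder. Here tempdir() merely stands in for FULL_FILE_PATH, and the file name scatterplot.png is our own choice.

```r
library(ggplot2)

scatterplot <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point()

# tempdir() is a stand-in for FULL_FILE_PATH; replace it with your own folder
out_file <- file.path(tempdir(), "scatterplot.png")

ggsave(filename = out_file,
       plot = scatterplot,
       width = 6, height = 5.4,   # inches, the default unit
       dpi = 300)

# TRUE if the file was written
file.exists(out_file)
```

The file type is inferred from the extension in filename, so changing .png to .pdf is enough to switch formats.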
Output

ggvis¶
- Data: Infant Mortality Rates (India and major states) by Sex - 2001 to 2012
- Data Source: NRHM
- Download Data
Note
You can download the full R Script to reproduce the analysis here.
Warning
ggvis is still under development (currently version 0.4.2), so some functions might not work as well as we would like them to.
Install / load required packages using:
# Check to see if packages are installed. Install them if they are not, then load them into the R session.
ipak <- function(pkg){
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg))
    install.packages(new.pkg, dependencies = TRUE)
  sapply(pkg, require, character.only = TRUE)
}
packages <- c("dplyr", "xts", "lubridate", "tidyr", "ggvis")
ipak(packages)
Set file paths:
#Set file paths
ROOT <- "FULL_FILE_PATH" #Note that you need to replace \ with /
DATA <- paste0(ROOT,"data/")
OUTPUT <- paste0(ROOT,"output/")
#---------------------------------------------------------------------
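Because paste0() joins strings without any separator, ROOT needs to end with a slash for DATA and OUTPUT to come out right. The sketch below shows one way to clean up a path copied from Windows; the path itself and the name raw_root are hypothetical.

```r
# A Windows-style path as it might be copied from Explorer (hypothetical example)
raw_root <- "C:\\Users\\me\\project"

# Replace every backslash with a forward slash
ROOT <- gsub("\\\\", "/", raw_root)

# paste0() adds no separator, so make sure ROOT ends with a slash
if (!grepl("/$", ROOT)) ROOT <- paste0(ROOT, "/")

DATA <- paste0(ROOT, "data/")
OUTPUT <- paste0(ROOT, "output/")
```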
Read CSV file:
imrData <- read.csv(paste0(DATA,'IMRsex2001-2012.csv'))
Prepare data for analysis:
imrData <- imrData %>%
  select(state = India.States.Uts, X2001.Total:X2012.Female) %>% #Keep selected variables
  gather(yearSex,imr,X2001.Total:X2012.Female) %>% #Reshape data
  separate(col = yearSex, into = c('year','sex'), sep = "\\.", remove = TRUE) %>% #Split yearSex column into year and sex columns
  filter(sex != 'Total') %>% #Remove observations of category 'Total'
  mutate(year = as.Date(paste0(substr(year,start = 2,stop = 5),'-12-31'))) #Remove 'X' at the beginning of year, convert variable into date.
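To see what each step of this pipeline does, it can help to run it on a tiny table first. The sketch below uses a toy data frame whose column names mimic the CSV; the state names and all values are made up purely for illustration and are not real IMR figures.

```r
library(dplyr)
library(tidyr)

# Toy data: made-up values, used only to show the reshape
toy <- data.frame(
  India.States.Uts = c("StateA", "StateB"),
  X2001.Total = c(10, 20),
  X2001.Male = c(9, 19),
  X2001.Female = c(11, 21),
  stringsAsFactors = FALSE
)

toyLong <- toy %>%
  select(state = India.States.Uts, X2001.Total:X2001.Female) %>%   # keep and rename
  gather(yearSex, imr, X2001.Total:X2001.Female) %>%               # wide to long
  separate(col = yearSex, into = c("year", "sex"), sep = "\\.") %>% # "X2001.Male" -> "X2001", "Male"
  filter(sex != "Total") %>%                                       # drop the 'Total' rows
  mutate(year = as.Date(paste0(substr(year, start = 2, stop = 5), "-12-31")))

toyLong
```

The result has one row per state, year and sex, which is the long format that ggvis expects for plotting.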
Plot bar graph:
imrData %>%
  filter(state != 'INDIA',year == max(year)) %>%
  mutate(state = as.character(state)) %>%
  ggvis(x = ~state,y = ~imr) %>%
  layer_bars() %>%
  add_axis("x",title = "", properties = axis_props(labels = list(angle = 270, dy = -5, align = "right")))
Output

Tip
ggvis enables you to add basic interactivity to your visualizations. RStudio has excellent tutorials for ggvis.
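As a small sketch of that interactivity, the following attaches two sliders to a scatterplot of the built-in mtcars data. The slider ranges, labels and the object name interactivePlot are our own choices.

```r
library(ggvis)

# := sets a raw (unscaled) property; each input_slider() supplies its value
interactivePlot <- mtcars %>%
  ggvis(x = ~hp, y = ~mpg) %>%
  layer_points(size := input_slider(10, 100, label = "Point size"),
               opacity := input_slider(0, 1, value = 0.8, label = "Opacity"))

# Printing the object in RStudio opens the plot together with its controls:
# interactivePlot
```

Interactive controls need a running R session behind them, so this kind of plot works in RStudio but not as a static image.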
htmlwidgets¶
htmlwidgets allows us to use the power of JavaScript-based visualization libraries from within R, to produce some fantastic visualizations.
Dygraphs¶
Used to plot time series data.
- Data: All India Area Weighted Annual Rainfall (in mm)
- Data Source: India Meteorological Department (IMD)
- Download Data
Note
You can download the full R Script to reproduce the analysis here.
Install / load required packages using:
# check to see if packages are installed. Install them if they are not, then load them into the R session.
ipak <- function(pkg){
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg))
    install.packages(new.pkg, dependencies = TRUE)
  sapply(pkg, require, character.only = TRUE)
}
# usage
packages <- c("dplyr", "xts", "zoo", "lubridate", "dygraphs", "tidyr", "htmlwidgets")
ipak(packages)
Exercise: manipulate the data and get it to this format. Use readRDS to read in the processed dataset to verify that you got everything right.
Convert the dataframe to a time series object using the xts package:
annualRainfall.xts <- xts(annualRainfall$rainfall,order.by = annualRainfall$year)
names(annualRainfall.xts) <- 'rainfall'
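The conversion can be tried on a toy data frame first. This sketch assumes the processed data has a Date column named year and a numeric column named rainfall, as the code above implies; the values are made up and are not real rainfall figures.

```r
library(xts)

# Toy values, made up purely to show the shape of the data
annualRainfall <- data.frame(
  year = as.Date(c("2001-12-31", "2002-12-31", "2003-12-31")),
  rainfall = c(1000, 1100, 1050)
)

# xts needs a time-based index, so 'year' must be a Date rather than a plain number
annualRainfall.xts <- xts(annualRainfall$rainfall, order.by = annualRainfall$year)
names(annualRainfall.xts) <- "rainfall"

str(annualRainfall.xts)
```

If year is stored as a number such as 2001, convert it to a Date before calling xts(), for example with as.Date(paste0(year, "-12-31")).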
Your dataset is now ready for plotting. You can use dygraphs to make an interactive graph:
dygraph(annualRainfall.xts,main = "Annual Rainfall (mm)") %>%
  dySeries('rainfall',label = 'Annual Rainfall (mm)') %>%
  dyAxis('x',label = 'Year') %>%
  dyRangeSelector() %>%
  saveWidget(paste0(OUTPUT,'annualRainfall.html'))
If you want to style the CSS of the graph further, you can use dyCSS:
#CODE is the path to the folder that holds dygraphs.css; set it in the same way as DATA and OUTPUT
dygraph(annualRainfall.xts,main = "Annual Rainfall (mm)") %>%
  dySeries('rainfall',label = 'Annual Rainfall (mm)') %>%
  dyAxis('x',label = 'Year') %>%
  dyRangeSelector() %>%
  dyCSS(paste0(CODE,'dygraphs.css')) %>%
  saveWidget(paste0(OUTPUT,'annualRainfall.html'))
Output
RBokeh¶
Standard plotting library for interactive charts.
http://hafen.github.io/rbokeh/
http://www.r-bloggers.com/a-quick-incomplete-comparison-of-ggplot2-rbokeh-plotting-idioms/
Tips for working with R¶
Dropbox Paths
If you’re on a project that uses Dropbox for file sharing, you can set Dropbox paths (assuming default directory) using the following snippet of code.
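One way to build such a path is sketched below. It assumes that Dropbox sits in its default location directly under the user's home folder; MY_PROJECT is a hypothetical placeholder for the shared project folder, and the variable names are ours.

```r
# Home folder: USERPROFILE on Windows (e.g. C:/Users/<name>), HOME elsewhere
home <- if (.Platform$OS.type == "windows") {
  Sys.getenv("USERPROFILE")
} else {
  Sys.getenv("HOME")
}

# Assumed default Dropbox location: <home>/Dropbox (backslashes replaced with /)
DROPBOX <- paste0(gsub("\\\\", "/", home), "/Dropbox/")

# "MY_PROJECT" is a placeholder; replace it with your shared project folder
ROOT <- paste0(DROPBOX, "MY_PROJECT/")
```

Since the home folder is looked up at run time, the same script can run unchanged on each team member's machine, provided everyone keeps Dropbox in the default location.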
Tableau Public¶
- Data: Total Sex Ratio (males / hundred females)
- Data Source: UN Population Division (posted on Gapminder)
- Download Data
Tableau Public is quite intuitive and easy to use. We will walk through each step to reach this visualization: