Data Visualization

This data visualization module, created for the IFMR-LEAD staff meet to be held on 28th and 29th April 2016, touches upon various topics (listed below). However, the only tools covered in detail are R and Tableau Public, which we will use for basic data visualizations.

If time permits (and if everyone’s interested), we could cover a few advanced tools as well. We encourage you to set up everything before the session, by following the steps described in ‘Getting Started’.

Tools

Contents

Getting Started

Please follow these steps to install necessary software before the session:

  • Download and install R
  • Download and install RStudio
  • Download and install Tableau Public. You may be asked to sign up with your email address.

Tip

There are tons of online resources for learning R, but my favourite is DataCamp. Note that knowledge of R is not a prerequisite; we will walk through each line of code at the session.

Tip

If you’re interested in starting out before the session with Tableau Public, just open it up and play around with it. If you want more structure, here are some YouTube tutorials.

Exercises

In teams of 2, select one of the following 10 datasets to use. Your task is to:

  1. Get a basic understanding of the data set: use the internet, make summary statistics, etc.
  2. Explore the data and think about some questions you might want to answer through visual representation of the data
  3. Choose an audience that you will create a visualization and a report for: government officials, a private sector partner, your PIs, a classroom of college students, etc.
  4. Choose the tools you will use to produce the visualization (Raw, Tableau, R, STATA, etc.)
  5. Get to work and produce an interesting graphic that (i) clearly highlights what you are investigating (in terms of variables, measures, dimensions, etc.), (ii) is easily digestible, and (iii) is supported by a description substantiating your findings (even if there are none)
  6. Email us your final visualization (PNG / URL). We will put each of these visualizations up, and you will get to quickly explain your intent and findings (if any).
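
For step 1, a first look at a dataset might be sketched as below. The helper name first_look is hypothetical, and the built-in mtcars data stands in for whichever dataset you pick (you would load your own with read.csv first).

```r
# Hypothetical helper: summarise the basic shape of a data frame.
first_look <- function(df) {
  list(
    dims    = dim(df),                                        # rows, columns
    classes = vapply(df, function(col) class(col)[1], character(1)),
    missing = colSums(is.na(df))                              # NAs per column
  )
}

overview <- first_look(mtcars)   # mtcars is only a stand-in dataset
overview$dims
overview$classes
overview$missing
summary(mtcars)                  # per-column summary statistics
```

Columns with many missing values or unexpected classes are usually worth checking before you start plotting.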

Datasets

| No | Link | Title | Description |
|----|------|-------|-------------|
| 1 | https://data.gov.in/catalog/basic-habitation-information | Basic Habitation Information As On 1st April 2012 | The data refers to the list of habitations, its population in different caste category (SC, ST and GENERAL) and status of availability of potable drinking water (Covered or Partially covered) all over India. |
| 2 | https://data.gov.in/catalog/human-development-index-and-its-components-states | Human Development Index and its Components by States, 1999-00 and 2007-08 | The standard of living and command over resources, as reflected in the monthly per capita expenditure adjusted for inflation and inequality. By state. |
| 3 | https://data.gov.in/catalog/out-school-children-6-17-years | Out of school children (6 to 17 years), by major religious communities, 2007-08 | Children out of school are the number of primary-school-age children not enrolled in primary or secondary school. The data is disaggregated by social groups (Scheduled Caste, Scheduled Tribe, Other Backward Class) and major religious communities (Hindu, Muslim, Sikh, Christian). |
| 4 | https://data.gov.in/catalog/indian-railways-train-time-table-0 | Indian Railways Time Table for trains available for reservation as on 03.08.2015 | Get data of Indian Railways Time Table. It contains train-wise departure and arrival times at various stations. It also provides information on the route, distance covered, source station and destination station, etc. |
| 5 | https://data.gov.in/catalog/time-taken-completing-roads-based-sample-survey | Time Taken For Completing The Roads - Based on Sample Survey 2010 | The data refers to information on Time taken for completing the Roads (as Percentage of Total Roads taken in each Sample State). It provides state-wise and scheme-wise details for number of roads in the sample, Percentage Unfinished, Percentage of Time Taken for Completion (1-9 Months, 9-12 Months, 12-18 Months, beyond 18 Months). The outcome is based on the sample survey undertaken by Programme Evaluation Organisation (PEO), Planning Commission, which covered 14 districts, 27 blocks, 138 roads, 138 habitations and 1380 beneficiary households spread over 7 states of India for 'Evaluation Study on Rural Roads component of Bharat Nirman, 2010'. The reference period for the study was 2005-06 to 2006-07. |
| 6 | https://data.gov.in/catalog/issuance-visa-various-foreign-nationals-against-various-categories-visas | Issuance of VISA to Various Foreign Nationals against Various Categories of VISAs | The data refers to issuance of VISAs to various foreign nationals against various Visa Types. Such VISA categories are Diplomatic, Employment, Tourist, Business, Conference, Entry, Medical, Missionary, Pilgrimage, Research, Transit, Student, Project, etc. Immigration, Visa, Foreigners Registration and Training (IVFRT) is one of the central MMPs in the National eGovernance Plan (NeGP), which is conceptualized with an aim to enhance the experience of in-bound and out-bound travellers from and to India by looking into the aspects of Passport, Visa, Immigration, Foreigners Registration and Tracking. |
| 7 | https://data.gov.in/resources/total-number-registered-motor-vehicles-india-during-1951-2012 | Total Number of Registered Motor Vehicles in India | The data refers to Total Number of Registered Motor Vehicles in India. Registered vehicles have been categorized as Two Wheelers, Cars, Jeeps and Taxis, Buses, Goods Vehicles and Others. |
| 8 | https://data.gov.in/catalog/tourism-statistics-india | Tourism Statistics of India | Foreign tourist arrivals refer to the number of arrivals of tourists/visitors. An individual who makes multiple trips to the country is counted each time as a new arrival. Foreign Exchange Earnings from tourism are the receipts of the country as a result of consumption expenditure, i.e. payments made for goods and services acquired, by foreign visitors in the economy out of the foreign currency brought by them. The number of Domestic Tourist Visits to different States and Union Territories (UTs) are being compiled based on the information received from them. |
| 9 | https://data.gov.in/resources/power-supply-position-september-2015-0 | Power Supply Position | Power supply position report prepared by Grid Operation & Distribution wing of CEA provides information about the monthly demand and availability of power / energy at various states of India. |
| 10 | https://data.gov.in/catalog/india-macro-economic-indicators-summary-table | Summary Table of Macro-economic Indicators of India | Summary Table of Macro-economic Indicators of India as on March 2013 |

ggplot2

We will use the mtcars dataset, which comes pre-loaded in the R session. Explore the dataset by executing str(mtcars) and summary(mtcars) in the R console.

Note

You can download the full R Script to reproduce the analysis here.

Load the ggplot2 package using:

library(ggplot2)

Scatterplots

Plotting with base R:

plot(mtcars$hp,mtcars$mpg)

Plotting with ggplot2:

scatterplot <- ggplot(mtcars, aes(x = hp,y = mpg)) +
        geom_point()

Add axis titles:

scatterplot <- scatterplot +
  labs(title = "MPG vs HP",
       x = "HP",
       y = "Miles Per Gallon")

Convert gear to a factor so that the legend is discrete, and add a regression line for the final graph:

mtcars$gear <- as.factor(mtcars$gear)

scatterplot <- ggplot(mtcars, aes(x = hp,y = mpg)) +
  geom_point(aes(colour = gear)) +
  geom_smooth(method = "lm",se = FALSE) +
  scale_colour_discrete(name = 'Gear',
                        breaks = c('3','4','5'),
                        labels = c('Low','Medium','High')) +
  labs(title = "MPG vs HP",
       x = "HP",
       y = "Miles Per Gallon")
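
Note that assigning a plot to an object only stores it; nothing is drawn until the object is printed. A small sketch, with p as an illustrative object name:

```r
library(ggplot2)

# Build the plot and store it; this draws nothing on its own.
p <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point()

# Typing the object name at the console, or calling print() inside a
# script, function or loop, renders the plot.
print(p)
```

The same applies to the scatterplot object built in this section.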

Histograms

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5)

Make the graph pretty:

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5,
                 colour = "black",
                 fill = "white") +
  labs(title = "Distribution - Miles Per Gallon",
       x = "Miles Per Gallon",
       y = "Count")
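
For comparison with base R, as in the scatterplot section, the same 5-mpg-wide bins can be sketched with hist(). The break points (10 to 35) are an assumption chosen to cover the range of mpg; ggplot2 may place its bin edges differently, so the bars need not match exactly.

```r
# Bin edges every 5 mpg, spanning the observed range of mpg.
breaks <- seq(10, 35, by = 5)

# plot = FALSE returns the binning without drawing anything.
bins <- hist(mtcars$mpg, breaks = breaks, plot = FALSE)
bins$counts   # number of cars in each bin

# Draw the base R histogram with similar styling.
hist(mtcars$mpg, breaks = breaks,
     col = "white", border = "black",
     main = "Distribution - Miles Per Gallon",
     xlab = "Miles Per Gallon", ylab = "Count")
```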

Saving your graphs

?ggsave gives you a description of the parameters that the function takes, and their defaults:

ggsave(filename, plot = last_plot(), device = NULL, path = NULL,
        scale = 1, width = NA, height = NA, units = c("in", "cm", "mm"),
        dpi = 300, limitsize = TRUE, ...)

To save our scatterplot, we might want to use the following parameters:

ggsave(filename = 'FULL_FILE_PATH.FILE_TYPE',
        plot = scatterplot,
        width = 6,height = 5.4)

FILE_TYPE can be tex, pdf, jpeg, png or svg, among others (refer to the documentation using ?ggsave).
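
As a concrete sketch, the call below writes a PNG into R's temporary directory. The file name scatterplot.png and the use of tempdir() are placeholders for your own output path.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point()

# file.path() joins folder and file name with "/".
out_file <- file.path(tempdir(), "scatterplot.png")

# The device is inferred from the ".png" extension; width and height
# are in inches by default.
ggsave(filename = out_file, plot = p, width = 6, height = 5.4)

file.exists(out_file)
```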

Output

ggplot2 image

ggvis

  • Data: Infant Mortality Rates (India and major states) by Sex - 2001 to 2012
  • Data Source: NRHM
  • Download Data

Note

You can download the full R Script to reproduce the analysis here.

Warning

ggvis is still under development (currently version 0.4.2), so some functions might not work as well as we would like them to.

  • Install / load required packages using:

    # Check to see if packages are installed. Install them if they are not, then load them into the R session.
    
    ipak <- function(pkg){
      new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
      if (length(new.pkg))
        install.packages(new.pkg, dependencies = TRUE)
      sapply(pkg, require, character.only = TRUE)
    }
    
    packages <- c("dplyr", "xts", "lubridate", "tidyr", "ggvis")
    ipak(packages)
    
  • Set file paths:

    #Set file paths
    
    ROOT <- "FULL_FILE_PATH"   #Replace \ with / in the path, and end it with / (paste0 below does not add one)
    DATA <- paste0(ROOT,"data/")
    OUTPUT <- paste0(ROOT,"output/")
    #---------------------------------------------------------------------
    
  • Read CSV file:

    imrData <- read.csv(paste0(DATA,'IMRsex2001-2012.csv'))
    
  • Prepare data for analysis:

    imrData <- imrData %>%
      select(state = India.States.Uts,
             X2001.Total:X2012.Female) %>%  #Keep selected variables
      gather(yearSex,imr,X2001.Total:X2012.Female) %>%    #Reshape data
      separate(col = yearSex,
               into = c('year','sex'),
               sep = "\\.",
               remove = TRUE) %>%         #Split yearSex column into year and sex columns
      filter(sex != 'Total') %>%          #Remove observations of category 'Total'
      mutate(year = as.Date(paste0(substr(year,start = 2,stop = 5),'-12-31')))   #Remove 'X' at the beginning of year, convert variable into date.
    
  • Plot bar graph:

    imrData %>%
      filter(state != 'INDIA',year == max(year)) %>%
      mutate(state = as.character(state)) %>%
      ggvis(x = ~state,y = ~imr) %>%
      layer_bars() %>%
      add_axis("x",title = "",
               properties = axis_props(labels = list(angle = 270,
                                                     dy = -5,
                                                     align = "right")))
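
The reshaping done in the 'Prepare data for analysis' step can be easier to follow on a toy data frame. In the sketch below the state names and values are made-up placeholders (not real IMR figures); only the column naming pattern follows the CSV used above.

```r
library(dplyr)
library(tidyr)

# Toy wide-format data: one column per year-sex combination.
toy <- data.frame(
  India.States.Uts = c("State A", "State B"),
  X2001.Male       = c(1, 2),      # placeholder values
  X2001.Female     = c(3, 4)
)

long <- toy %>%
  select(state = India.States.Uts, X2001.Male:X2001.Female) %>%
  gather(yearSex, imr, X2001.Male:X2001.Female) %>%    # wide to long
  separate(col = yearSex, into = c("year", "sex"),
           sep = "\\.") %>%                            # "X2001.Male" -> "X2001", "Male"
  mutate(year = as.Date(paste0(substr(year, 2, 5), "-12-31")))  # drop "X", make a Date

long   # one row per state, year and sex
```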
    

Output

ggplot2 image

Tip

ggvis enables you to add basic interactivity to your visualizations. RStudio has excellent tutorials for ggvis.

htmlwidgets

htmlwidgets allows us to use the power of JavaScript-based visualization libraries from within R, to produce some fantastic visualizations.

Dygraphs

Dygraphs is used to plot time series data.

  • Data: All India Area Weighted Annual Rainfall (in mm)
  • Data Source: India Meteorological Department (IMD)
  • Download Data

Note

You can download the full R Script to reproduce the analysis here.

  • Install / load required packages using:

    # check to see if packages are installed. Install them if they are not, then load them into the R session.
    
    ipak <- function(pkg){
      new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
      if (length(new.pkg))
        install.packages(new.pkg, dependencies = TRUE)
      sapply(pkg, require, character.only = TRUE)
    }
    
    # usage
    packages <- c("dplyr", "xts", "zoo", "lubridate",
                  "dygraphs", "tidyr", "htmlwidgets")
    ipak(packages)
    
  • Exercise: manipulate the data and get it to this format. Use readRDS to read in the processed dataset to verify that you got everything right.

  • Convert the dataframe to a time series object using the xts package:

    annualRainfall.xts <- xts(annualRainfall$rainfall,order.by = annualRainfall$year)
    names(annualRainfall.xts) <- 'rainfall'
    
  • Your dataset is now ready for plotting. You can use dygraphs to make an interactive graph:

    dygraph(annualRainfall.xts,main = "Annual Rainfall (mm)") %>%
      dySeries('rainfall',label = 'Annual Rainfall (mm)') %>%
      dyAxis('x',label = 'Year') %>%
      dyRangeSelector() %>%
      saveWidget(paste0(OUTPUT,'annualRainfall.html'))
    
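  • If you want to style the graph further with custom CSS, you can use dyCSS. CODE is not defined above; it is assumed here to be the path to the folder containing dygraphs.css, set up in the same way as DATA and OUTPUT: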

    dygraph(annualRainfall.xts,main = "Annual Rainfall (mm)") %>%
      dySeries('rainfall',label = 'Annual Rainfall (mm)') %>%
      dyAxis('x',label = 'Year') %>%
      dyRangeSelector() %>%
      dyCSS(paste0(CODE,'dygraphs.css')) %>%
      saveWidget(paste0(OUTPUT,'annualRainfall.html'))
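
xts needs a time-based class (such as Date) in order.by, so a plain numeric year column has to be converted before the xts step above. A minimal sketch, assuming the raw data stores years as integers; the three rows are made-up placeholders, not IMD figures, and the 31 December convention mirrors the ggvis section.

```r
library(xts)

# Placeholder data frame with the two columns used above.
annualRainfall <- data.frame(
  year     = c(2001L, 2002L, 2003L),
  rainfall = c(100, 200, 300)       # made-up values
)

# Convert integer years to Date (31 December of each year).
annualRainfall$year <- as.Date(paste0(annualRainfall$year, "-12-31"))

annualRainfall.xts <- xts(annualRainfall$rainfall,
                          order.by = annualRainfall$year)
names(annualRainfall.xts) <- "rainfall"
```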
    

Output

Leaflet

Interactive maps with Leaflet

https://rstudio.github.io/leaflet/
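
A minimal sketch of a Leaflet map, using the leaflet package documented at the link above. The coordinates (roughly Chennai) and the popup text are arbitrary illustrative choices.

```r
library(leaflet)

m <- leaflet() %>%
  addTiles() %>%                        # default OpenStreetMap tiles
  addMarkers(lng = 80.27, lat = 13.08,  # illustrative coordinates
             popup = "Example marker")

# Typing m at the console renders the interactive map in the viewer.
```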

Tips for working with R

Dropbox Paths

If you’re on a project that uses Dropbox for file sharing, you can set Dropbox paths (assuming default directory) using the following snippet of code.
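
A minimal sketch of such a snippet, assuming Dropbox sits in its default location (a Dropbox folder inside the user's home directory). The helper name dropbox_path and the folder name PROJECT_FOLDER are hypothetical placeholders.

```r
# Hypothetical helper: guess the default Dropbox folder on this machine.
dropbox_path <- function() {
  if (.Platform$OS.type == "windows") {
    # On Windows "~" usually points to Documents, so use the user profile
    # instead, and swap backslashes for forward slashes.
    home <- gsub("\\", "/", Sys.getenv("USERPROFILE"), fixed = TRUE)
  } else {
    home <- path.expand("~")
  }
  file.path(home, "Dropbox")
}

DROPBOX <- dropbox_path()

# file.path() inserts the "/" separators, so no trailing slash is needed.
ROOT   <- file.path(DROPBOX, "PROJECT_FOLDER")   # placeholder folder name
DATA   <- file.path(ROOT, "data")
OUTPUT <- file.path(ROOT, "output")

dir.exists(DROPBOX)   # FALSE suggests Dropbox is not in the default location
```

With paths built this way, files would be read with file.path(DATA, "file.csv") rather than paste0, so every team member can run the same script without editing paths.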

Tableau Public

  • Data: Total Sex Ratio (males / hundred females)
  • Data Source: UN Population Division (posted on Gapminder)
  • Download Data

Tableau Public is quite intuitive and easy to use. We will walk through each step to reach this visualization: