Introduction
Goal of this Visualization task is to create an alternative visualization to the Tableau’s visualization for Winter Olympics data for different countries
In this blog post, I’m trying to generate a World Choropleth Map showing the total counts of medals for each country.
Analysis
Cleaning up workspace and loading required libraries
rm(list = ls())
library(tidyverse) #Data Wrangling
library(httr)
library(readxl) #Data Ingestion
library(ggplot2) #Data Visualization
library(leaflet)
library(rgeos)
library(rgdal)
library(stringr)
Obtaining Data
Reading and viewing the dataset
GET("https://query.data.world/s/n5nc32oqhtb25hdt3vsa24rd4scs2w", write_disk(tf <- tempfile(fileext = ".xlsx")))
olympics = read_excel(tf)
olympics
## # A tibble: 2,865 x 9
## Year Sport Event Country Gender `Medal Rank` Medal `Name of Athlet…
## <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 1924 Bobs… Men'… Switze… Men 1 gold Switzerland-1
## 2 1924 Bobs… Men'… Britain Men 2 silv… Britain-1
## 3 1924 Bobs… Men'… Belgium Men 3 bron… Belgium-1
## 4 1924 Cros… Men'… Norway Men 1 gold Thorleif Haug
## 5 1924 Cros… Men'… Norway Men 2 silv… Johan Grøttums…
## 6 1924 Cros… Men'… Finland Men 3 bron… Tapani Niku
## 7 1924 Cros… Men'… Norway Men 1 gold Thorleif Haug
## 8 1924 Cros… Men'… Norway Men 2 silv… Thoralf Strøms…
## 9 1924 Cros… Men'… Norway Men 3 bron… Johan Grøttums…
## 10 1924 Curl… Men'… Britain Men 1 gold Britain
## # … with 2,855 more rows, and 1 more variable: `Age of Athlete` <dbl>
Summarizing and getting stats to better understand the dataset
olympics %>%
glimpse()
## Rows: 2,865
## Columns: 9
## $ Year <dbl> 1924, 1924, 1924, 1924, 1924, 1924, 1924, 1…
## $ Sport <chr> "Bobsled", "Bobsled", "Bobsled", "Cross-Cou…
## $ Event <chr> "Men's Four/Five", "Men's Four/Five", "Men'…
## $ Country <chr> "Switzerland", "Britain", "Belgium", "Norwa…
## $ Gender <chr> "Men", "Men", "Men", "Men", "Men", "Men", "…
## $ `Medal Rank` <dbl> 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3…
## $ Medal <chr> "gold", "silver", "bronze", "gold", "silver…
## $ `Name of Athlete or Team` <chr> "Switzerland-1", "Britain-1", "Belgium-1", …
## $ `Age of Athlete` <dbl> NA, NA, NA, 29, 24, 28, 29, 27, 24, NA, NA,…
olympics %>%
summary()
## Year Sport Event Country
## Min. :1924 Length:2865 Length:2865 Length:2865
## 1st Qu.:1972 Class :character Class :character Class :character
## Median :1992 Mode :character Mode :character Mode :character
## Mean :1986
## 3rd Qu.:2006
## Max. :2014
##
## Gender Medal Rank Medal Name of Athlete or Team
## Length:2865 Min. :1.000 Length:2865 Length:2865
## Class :character 1st Qu.:1.000 Class :character Class :character
## Mode :character Median :2.000 Mode :character Mode :character
## Mean :1.996
## 3rd Qu.:3.000
## Max. :3.000
##
## Age of Athlete
## Min. :14.00
## 1st Qu.:22.00
## Median :25.00
## Mean :25.15
## 3rd Qu.:28.00
## Max. :42.00
## NA's :692
Scrubbing data
As per the dataset requirement, East and West Germany are to be grouped with Germany and Soviet Union and the 1992 Unified Team needs to be combined with Russia
olympics = olympics %>%
mutate(Country = recode(Country, "Soviet Union" = "Russia", "Unified Team" = "Russia",
"East Germany" = "Germany", "West Germany" = "Germany"))
Reading in the ISO-3166 country codes dataset in order to generate the choropleth
countryCodes = read_csv("https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv")
countryCodes
## # A tibble: 249 x 11
## name `alpha-2` `alpha-3` `country-code` `iso_3166-2` region `sub-region`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Afgh… AF AFG 004 ISO 3166-2:… Asia Southern As…
## 2 Ålan… AX ALA 248 ISO 3166-2:… Europe Northern Eu…
## 3 Alba… AL ALB 008 ISO 3166-2:… Europe Southern Eu…
## 4 Alge… DZ DZA 012 ISO 3166-2:… Africa Northern Af…
## 5 Amer… AS ASM 016 ISO 3166-2:… Ocean… Polynesia
## 6 Ando… AD AND 020 ISO 3166-2:… Europe Southern Eu…
## 7 Ango… AO AGO 024 ISO 3166-2:… Africa Sub-Saharan…
## 8 Angu… AI AIA 660 ISO 3166-2:… Ameri… Latin Ameri…
## 9 Anta… AQ ATA 010 ISO 3166-2:… <NA> <NA>
## 10 Anti… AG ATG 028 ISO 3166-2:… Ameri… Latin Ameri…
## # … with 239 more rows, and 4 more variables: `intermediate-region` <chr>,
## # `region-code` <chr>, `sub-region-code` <chr>,
## # `intermediate-region-code` <chr>
Joining the 2 datasets and verifying if any country name mismatch happening in the 2.
olympics %>%
left_join(countryCodes, by=c("Country" = "name")) %>%
filter(is.na(`alpha-3`)) %>%
select(Country) %>%
unique()
## # A tibble: 8 x 1
## Country
## <chr>
## 1 Britain
## 2 United States
## 3 Czechoslovakia
## 4 Russia
## 5 North Korea
## 6 Yugoslavia
## 7 South Korea
## 8 Czech Republic
Looks like above 7 countries do not have a corresponding entry in the countryCodes dataset. Lets try to find out the corresponding names for each of the 7 in the countryCodes dataset.
countryCodes %>%
filter(str_detect(str_to_lower(name), "britain")) #United Kingdom of Great Britain and Northern Ireland
## # A tibble: 1 x 11
## name `alpha-2` `alpha-3` `country-code` `iso_3166-2` region `sub-region`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Unit… GB GBR 826 ISO 3166-2:… Europe Northern Eu…
## # … with 4 more variables: `intermediate-region` <chr>, `region-code` <chr>,
## # `sub-region-code` <chr>, `intermediate-region-code` <chr>
countryCodes %>%
filter(str_detect(str_to_lower(name), "states")) #United States of America
## # A tibble: 3 x 11
## name `alpha-2` `alpha-3` `country-code` `iso_3166-2` region `sub-region`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Micr… FM FSM 583 ISO 3166-2:… Ocean… Micronesia
## 2 Unit… US USA 840 ISO 3166-2:… Ameri… Northern Am…
## 3 Unit… UM UMI 581 ISO 3166-2:… Ocean… Micronesia
## # … with 4 more variables: `intermediate-region` <chr>, `region-code` <chr>,
## # `sub-region-code` <chr>, `intermediate-region-code` <chr>
countryCodes %>%
filter(str_detect(str_to_lower(name), "czech")) #Czech Republic
## # A tibble: 1 x 11
## name `alpha-2` `alpha-3` `country-code` `iso_3166-2` region `sub-region`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Czec… CZ CZE 203 ISO 3166-2:… Europe Eastern Eur…
## # … with 4 more variables: `intermediate-region` <chr>, `region-code` <chr>,
## # `sub-region-code` <chr>, `intermediate-region-code` <chr>
countryCodes %>%
filter(str_detect(str_to_lower(name), "russia")) #Russian Federation
## # A tibble: 1 x 11
## name `alpha-2` `alpha-3` `country-code` `iso_3166-2` region `sub-region`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Russ… RU RUS 643 ISO 3166-2:… Europe Eastern Eur…
## # … with 4 more variables: `intermediate-region` <chr>, `region-code` <chr>,
## # `sub-region-code` <chr>, `intermediate-region-code` <chr>
countryCodes %>%
filter(str_detect(str_to_lower(name), "korea")) #Korea (Democratic People's Republic of) = North Korea, Korea (Republic of) = South Korea
## # A tibble: 2 x 11
## name `alpha-2` `alpha-3` `country-code` `iso_3166-2` region `sub-region`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Kore… KP PRK 408 ISO 3166-2:… Asia Eastern Asia
## 2 Kore… KR KOR 410 ISO 3166-2:… Asia Eastern Asia
## # … with 4 more variables: `intermediate-region` <chr>, `region-code` <chr>,
## # `sub-region-code` <chr>, `intermediate-region-code` <chr>
countryCodes %>%
filter(str_detect(str_to_lower(name), "yugo")) #Macedonia (the former Yugoslav Republic of)
## # A tibble: 0 x 11
## # … with 11 variables: name <chr>, `alpha-2` <chr>, `alpha-3` <chr>,
## # `country-code` <chr>, `iso_3166-2` <chr>, region <chr>, `sub-region` <chr>,
## # `intermediate-region` <chr>, `region-code` <chr>, `sub-region-code` <chr>,
## # `intermediate-region-code` <chr>
Renaming mismatched countries in olympic dataset based on countryCodes dataset.
olympics = olympics %>%
mutate(Country = recode(Country,
"Britain" = "United Kingdom of Great Britain and Northern Ireland",
"United States" = "United States of America",
"Czechoslovakia" = "Czech Republic",
"Russia" = "Russian Federation",
"North Korea" = "Korea (Democratic People's Republic of)",
"South Korea" = "Korea (Republic of)",
"Yugoslavia" = "Macedonia (the former Yugoslav Republic of)"))
Joining and viewing the 2 datasets
olympics = olympics %>%
left_join(countryCodes, by=c("Country" = "name"))
olympics
## # A tibble: 2,865 x 19
## Year Sport Event Country Gender `Medal Rank` Medal `Name of Athlet…
## <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 1924 Bobs… Men'… Switze… Men 1 gold Switzerland-1
## 2 1924 Bobs… Men'… United… Men 2 silv… Britain-1
## 3 1924 Bobs… Men'… Belgium Men 3 bron… Belgium-1
## 4 1924 Cros… Men'… Norway Men 1 gold Thorleif Haug
## 5 1924 Cros… Men'… Norway Men 2 silv… Johan Grøttums…
## 6 1924 Cros… Men'… Finland Men 3 bron… Tapani Niku
## 7 1924 Cros… Men'… Norway Men 1 gold Thorleif Haug
## 8 1924 Cros… Men'… Norway Men 2 silv… Thoralf Strøms…
## 9 1924 Cros… Men'… Norway Men 3 bron… Johan Grøttums…
## 10 1924 Curl… Men'… United… Men 1 gold Britain
## # … with 2,855 more rows, and 11 more variables: `Age of Athlete` <dbl>,
## # `alpha-2` <chr>, `alpha-3` <chr>, `country-code` <chr>, `iso_3166-2` <chr>,
## # region <chr>, `sub-region` <chr>, `intermediate-region` <chr>,
## # `region-code` <chr>, `sub-region-code` <chr>,
## # `intermediate-region-code` <chr>
Aggregating per country to find the total number of medals for each country and its corresponding alpha-3 code.
TotalMedalsPerCountry = olympics %>%
group_by(Country, `alpha-3`) %>%
summarise(TotalMedals = n()) %>%
rename(Code = `alpha-3`)
TotalMedalsPerCountry
## # A tibble: 39 x 3
## Country Code TotalMedals
## <chr> <chr> <int>
## 1 Australia AUS 12
## 2 Austria AUT 218
## 3 Belarus BLR 15
## 4 Belgium BEL 5
## 5 Bulgaria BGR 6
## 6 Canada CAN 170
## 7 China CHN 53
## 8 Croatia HRV 11
## 9 Czech Republic <NA> 49
## 10 Denmark DNK 1
## # … with 29 more rows
Lets see the top countries based on total number of medals
TotalMedalsPerCountry %>%
arrange(desc(TotalMedals))
## # A tibble: 39 x 3
## Country Code TotalMedals
## <chr> <chr> <int>
## 1 Germany DEU 377
## 2 Russian Federation RUS 341
## 3 Norway NOR 329
## 4 United States of America USA 282
## 5 Austria AUT 218
## 6 Canada CAN 170
## 7 Finland FIN 161
## 8 Sweden SWE 144
## 9 Switzerland CHE 138
## 10 Italy ITA 114
## # … with 29 more rows
Germany obtained the most number of medals (377) closely followed by Russia with (341)
Exploring Data
Lets plot the above data on a map using leaflet.
Loading shape file data set from World Borders Dataset.
shape = readOGR("~/Downloads/TM_WORLD_BORDERS_SIMPL-0.3/TM_WORLD_BORDERS_SIMPL-0.3.shp")
## OGR data source with driver: ESRI Shapefile
## Source: "/Users/amangal/Downloads/TM_WORLD_BORDERS_SIMPL-0.3/TM_WORLD_BORDERS_SIMPL-0.3.shp", layer: "TM_WORLD_BORDERS_SIMPL-0.3"
## with 246 features
## It has 11 fields
## Integer64 fields read as strings: POP2005
names(shape)
## [1] "FIPS" "ISO2" "ISO3" "UN" "NAME" "AREA"
## [7] "POP2005" "REGION" "SUBREGION" "LON" "LAT"
TotalMedalsPerCountry = TotalMedalsPerCountry %>%
left_join(tbl_df(shape@data), by = c("Code"="ISO3")) %>%
na.omit()
TotalMedalsPerCountry = TotalMedalsPerCountry %>%
mutate(label = str_c(sep = " - ", NAME, "TotalMedals", TotalMedals))
bins = c(0, 10, 20, 30, 50, 100, 150, 200, 300, Inf)
pal = colorBin("RdYlGn", domain = TotalMedalsPerCountry$TotalMedals, bins = bins)
TotalMedalsPerCountry %>%
leaflet() %>%
addTiles() %>%
setView(53.019815, 1.369002, zoom = 1) %>%
addCircles(~LON, ~LAT, label = ~label, color = ~pal(TotalMedals), weight = 10)
As can be clearly seen above, Norway and USA closely followed in the total medals ranking. Visualizing data on a map can provide a clear view of the overall data.