Introduction
With India’s 2019 General Elections around the corner, I thought it’d be a good idea to analyse the election manifestos of its 2 biggest political parties, BJP and Congress. Let’s use text mining to understand what each party promises and prioritizes.
In Part 1, I’ll collect and clean the data and set up the groundwork for the project.
Analysis
Load libraries
rm(list = ls())
library(tidyverse)
library(pdftools)
library(tidylog)
library(hunspell)
library(tidytext)
library(ggplot2)
library(gridExtra)
library(scales)
library(reticulate)
library(widyr)
library(igraph)
library(ggraph)
theme_set(theme_light())
use_python("~/anaconda3/bin/python")
Downloading Manifestos
BJP’s manifesto is available on their website, bjp.org. I downloaded the English PDF manually and read it from my Downloads folder.
bjp_txt <- pdf_text("~/Downloads/BJP-Election-english-2019.pdf")
tibble(
page = seq_along(bjp_txt),
text = bjp_txt
) %>%
separate_rows(text, sep = "\n") %>%
group_by(page) %>%
mutate(line = row_number()) %>%
ungroup() %>%
select(page, line, text) -> bjp
bjp %>%
glimpse()
## Rows: 1,590
## Columns: 3
## $ page <int> 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ line <int> 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ text <chr> "", " Table of Content…
Congress’ manifesto is available on their website, inc.in. This one can be downloaded directly from R.
# mode = "wb" writes the PDF as a binary file, which matters on Windows
download.file("https://manifesto.inc.in/pdf/english.pdf", "~/Downloads/congress.pdf", mode = "wb")
congress_txt <- pdf_text("~/Downloads/congress.pdf")
tibble(
page = seq_along(congress_txt),
text = congress_txt
) %>%
separate_rows(text, sep = "\n") %>%
group_by(page) %>%
mutate(line = row_number()) %>%
ungroup() %>%
select(page, line, text) -> congress
congress %>%
glimpse()
## Rows: 1,490
## Columns: 3
## $ page <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ line <int> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
## $ text <chr> "CONGRESS", "WILL", "DELIVER", " MANIFESTO", " LOK SAB…
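The two pipelines above are identical apart from the input, so they could be folded into one helper. Here is a minimal sketch, assuming the input is a character vector with one element per page (which is what pdf_text() returns). The function name pages_to_lines() and the toy input are my own and not part of the original analysis.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical helper: one element per page in, one row per line out.
pages_to_lines <- function(pages) {
  tibble(
    page = seq_along(pages),
    text = pages
  ) %>%
    separate_rows(text, sep = "\n") %>%  # split each page on newlines
    group_by(page) %>%
    mutate(line = row_number()) %>%      # line number restarts on every page
    ungroup() %>%
    select(page, line, text)
}

# Toy input standing in for the output of pdf_text()
toy_pages <- c("Title", "first line\nsecond line\nthird line")
toy_lines <- pages_to_lines(toy_pages)
```

With this in place, the real calls would look like pages_to_lines(bjp_txt) and pages_to_lines(congress_txt).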
Cleaning
Page range
As we can see from the two documents, the first few pages contain the title and index of each manifesto, followed by notes from the party leaders. The actual plans for development and work start on page 11 in BJP’s manifesto and page 7 in Congress’. I’ll keep only those pages onwards for exploration.
bjp %>%
filter(page >= 11) -> bjp_content
congress %>%
filter(page >= 7) -> congress_content
Text NA
Dropping all the rows where we don’t have any text.
bjp_content %>%
filter(!is.na(text)) -> bjp_content
congress_content %>%
filter(!is.na(text)) -> congress_content
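One thing to watch: the glimpse() output earlier shows an empty string ("") rather than NA for a blank line, so an is.na() filter on its own may leave blank rows in place. Below is a small sketch of a stricter filter that also drops empty and whitespace-only lines. The toy data is made up for illustration, and whether the extra condition is needed depends on what the PDFs actually produce.

```r
library(dplyr)
library(tibble)

# Toy data standing in for a slice of bjp_content (values are made up)
toy_content <- tibble(
  page = c(11L, 11L, 11L, 11L),
  line = 1:4,
  text = c("some text", "", "   ", NA)
)

# Keep rows whose text is neither NA nor blank after trimming whitespace
toy_nonblank <- toy_content %>%
  filter(!is.na(text), trimws(text) != "")
```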
Normalize
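Normalizing the text lines. unnest_tokens() with token = "lines" re-tokenizes each row as a line and lowercases the text by default (to_lower = TRUE).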
bjp_content %>%
unnest_tokens(text, text, token = "lines") -> bjp_content
congress_content %>%
unnest_tokens(text, text, token = "lines") -> congress_content
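To see what this step does, here is a small sketch on toy data. The input lines are invented for illustration. The point is only that the text comes back lowercased while the page and line columns are carried along.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Toy data standing in for a slice of the manifesto tibble (values are made up)
toy_content <- tibble(
  page = c(11L, 11L),
  line = 1:2,
  text = c("Some Mixed-Case Line", "ANOTHER LINE IN CAPITALS")
)

# Same call as above: the text column is replaced by its lowercased lines
toy_normalized <- toy_content %>%
  unnest_tokens(text, text, token = "lines")
```

If the original casing is needed later (for example, to spot headings written in capitals), passing to_lower = FALSE to unnest_tokens() keeps it.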
I’ll take a deep dive into individual topics of the manifestos in separate blog posts. For now, I will export our cleaned and normalized data for future analysis.
Export Data
bjp_content %>%
write_csv("../data/indian_election_2019/bjp_manifesto_clean.csv")
congress_content %>%
write_csv("../data/indian_election_2019/congress_manifesto_clean.csv")
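write_csv() does not create missing folders, so the export fails if the target directory does not exist yet. Here is a minimal sketch of creating the folder first and reading the file back as a sanity check. It writes to a temporary directory with a made-up file name and toy data, so it does not touch the real project paths.

```r
library(readr)
library(tibble)

# Hypothetical output location inside a temporary directory
out_dir <- file.path(tempdir(), "indian_election_2019")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Toy data standing in for bjp_content (values are made up)
toy_content <- tibble(page = 11L, line = 1L, text = "some text")

out_file <- file.path(out_dir, "toy_manifesto_clean.csv")
write_csv(toy_content, out_file)

# Read the file back to confirm the export round-trips
round_trip <- read_csv(out_file, show_col_types = FALSE)
```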
Stay Tuned!
References
- Part 2 Economic Growth
- For all the parts go to Project Summary Page - India General Elections 2019 Analysis