Introduction
Analyzing the Special Counsel’s “Report On The Investigation Into Russian Interference In The 2016 Presidential Election”, published on justice.gov.
Analysis
Load libraries
rm(list = ls())
library(tidyverse)
library(pdftools)
library(tidylog)
library(hunspell)
library(tidytext)
library(ggplot2)
library(gridExtra)
library(scales)
library(reticulate)
library(widyr)
library(igraph)
library(ggraph)
theme_set(theme_light())
use_condaenv("stanford-nlp")
Downloading Report
If we want to download and parse the report ourselves, we can do so as follows -
download.file("https://www.justice.gov/storage/report.pdf", "~/Downloads/mueller-report.pdf")
report <- pdf_text("~/Downloads/mueller-report.pdf")
Instead, I will use the preconverted CSV version of the report from the gadenbuie/mueller-report repository on GitHub.
report <- read_csv("https://raw.githubusercontent.com/gadenbuie/mueller-report/master/mueller_report.csv")
report %>%
glimpse()
## Rows: 19,195
## Columns: 3
## $ page <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3…
## $ line <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 1, 2, 3, 4, 5, 6…
## $ text <chr> "U.S. Department of Justice", "AttarAe:,c\\\\'erlc Predtiet // M…
Cleaning
Page range
As the PDF shows, the first few pages contain the title and index of the report, and the actual report text starts on page 9. We filter out the earlier pages for this exploration.
report %>%
filter(page >= 9) -> content
Text NA
About 386 rows have NA as text. These may be due to parsing failures over redactions, among other things. Dropping them for now.
content %>%
filter(!is.na(text)) -> content
Misspelled Words
content %>%
  rowwise() %>%
  mutate(num_misspelled_words = length(hunspell(text)[[1]]),
         num_words = length(str_split(text, " ")[[1]]),
         perc_misspelled = num_misspelled_words / num_words) %>%
  select(-num_misspelled_words, -num_words) -> content
Dropping all rows (around 400) where more than 50% of the words are misspelled, on the assumption that these were introduced by PDF parsing errors over redactions.
content %>%
filter(perc_misspelled <= 0.5) -> content
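The same filtering idea can be sketched in Python. This is only an illustration under stated assumptions: the tiny `KNOWN_WORDS` set is a hypothetical stand-in for the hunspell dictionary, and `misspelled_ratio` is a name introduced here, not part of the analysis above. Like the R code, it divides the number of unrecognized tokens by the number of space-separated tokens in the line.

```python
import string

# Hypothetical stand-in for a real spell-checking dictionary such as hunspell's.
KNOWN_WORDS = {"the", "president", "asked", "comey", "to", "let", "flynn", "go"}

def misspelled_ratio(line: str) -> float:
    """Fraction of space-separated tokens that are not in KNOWN_WORDS."""
    tokens = line.split(" ")
    # Strip punctuation from both ends of each token and lowercase it.
    cleaned = [t.strip(string.punctuation).lower() for t in tokens]
    # Tokens that are empty after stripping are not counted as misspelled.
    misspelled = [t for t in cleaned if t and t not in KNOWN_WORDS]
    return len(misspelled) / len(tokens)

lines = [
    "The President asked Comey to let Flynn go.",
    "AttarAe:,c erlc Predtiet // the",
]

# Keep only lines where at most half of the tokens are unrecognized.
kept = [line for line in lines if misspelled_ratio(line) <= 0.5]
print(kept)
```

The second sample line mimics the garbled header text seen in the parsed report, and it is the one that gets dropped.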
Normalize
content %>%
unnest_tokens(text, text, token = "lines") -> content
Most popular words
tidy_content <- content %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
tidy_content %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  filter(!is.na(word)) %>%
  count(word, sort = TRUE) %>%
  filter(str_length(word) > 1,
         n > 400) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_segment(aes(x = word, xend = word, y = 0, yend = n), color = "skyblue", size = 1) +
  geom_point(color = "blue", size = 4, alpha = 0.6) +
  coord_flip() +
  theme(panel.grid.minor.y = element_blank(),
        panel.grid.major.y = element_blank(),
        legend.position = "none") +
  labs(x = "",
       y = "Number of Occurrences",
       title = "Most popular words from the Mueller Report",
       subtitle = "Words occurring more than 400 times",
       caption = "Based on data from the Mueller Report")
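For readers following along in Python, the counting step behind the plot can be sketched with the standard library. This is a rough equivalent rather than a port of the tidytext pipeline: `STOP_WORDS` is a tiny illustrative list (tidytext's `stop_words` table is far larger), tokenization is a plain regular expression, and `top_words` and `more_than` are names made up for this example.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; tidytext's stop_words table is far larger.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "that", "was", "on"}

def top_words(lines, more_than=0):
    """Count lowercase tokens, skipping stop words and one-letter tokens.

    Returns (word, count) pairs with count > more_than, most frequent first.
    """
    counts = Counter()
    for line in lines:
        for word in re.findall(r"[a-z']+", line.lower()):
            if len(word) > 1 and word not in STOP_WORDS:
                counts[word] += 1
    return [(w, n) for w, n in counts.most_common() if n > more_than]

sample = [
    "The President told Comey that the investigation was a cloud.",
    "Comey briefed the President on the investigation.",
]
print(top_words(sample, more_than=1))
```

On the full report, `more_than=400` would correspond to the `n > 400` filter used for the plot.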
Most dramatic pages
Using the VADER lexicon from Python’s NLTK to generate sentiment scores for each line.
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
text = pd.Series(r["content$text"])
sentiments = text.apply(sid.polarity_scores).apply(pd.Series)
content <- bind_cols(content, py$sentiments)
content %>%
  group_by(index = as.integer(page / 2)) %>%
  summarise(total_sentiment = sum(compound)) %>%
  mutate(sentiment_color = ifelse(total_sentiment > 0, "yes", "no")) %>%
  ggplot(aes(x = index, y = total_sentiment)) +
  geom_ribbon(aes(ymin = 0, ymax = total_sentiment, fill = sentiment_color),
              color = "black", alpha = 0.5, show.legend = FALSE) +
  scale_fill_manual(values = c("#271569", "#69b3a2")) +
  labs(x = "Page Index",
       y = "Total Polarity",
       title = "Text sentiment variance over pages",
       subtitle = "PageIndex = PageNumber/2, Compound polarity calculated using NLTK",
       caption = "Based on data from the Mueller Report")
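The aggregation behind this plot is simple enough to sketch in plain Python. The `(page, compound)` pairs below are made up for illustration and are not scores from the report, and `total_polarity_by_bucket` is a hypothetical helper name. Integer division by 2 plays the role of `as.integer(page / 2)`, so pages 2k and 2k+1 fall into the same bucket k.

```python
from collections import defaultdict

def total_polarity_by_bucket(rows, pages_per_bucket=2):
    """Sum compound scores over buckets of consecutive pages."""
    totals = defaultdict(float)
    for page, compound in rows:
        totals[page // pages_per_bucket] += compound
    # Return the buckets in page order.
    return dict(sorted(totals.items()))

# Made-up (page, compound) pairs, for illustration only.
rows = [(9, 0.5), (9, -0.25), (10, 0.25), (11, -0.75), (12, 0.5), (13, 0.25)]

totals = total_polarity_by_bucket(rows)
# Same labelling as sentiment_color in the R code: "yes" for a positive total.
signs = {bucket: ("yes" if total > 0 else "no") for bucket, total in totals.items()}
print(totals)
print(signs)
```

Because the scores are summed rather than averaged, a bucket with many mildly negative lines can outweigh one with a single strongly negative line.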
While the above plot shows the generally interesting areas of the report, let’s see which are some of the most polarized lines in the document, in either direction.
content %>%
arrange(desc(compound)) %>%
head(10) %>%
select(text, page)
## # A tibble: 10 x 2
## text page
## <chr> <dbl>
## 1 "crime\"). obstruction of justice can be motivated by a desire to prot… 369
## 2 "justice statutes to protect, among other things, the integrity of its… 388
## 3 "wealth fund and was interested in improving relations between the uni… 163
## 4 "protected the president. and (have great respect for that, i'll be ho… 322
## 5 "the same thing , but co-opt the liberal opposition and the gop opposi… 52
## 6 "go. he is a good guy. i hope you can let this go.\" 238 comey agreed … 252
## 7 "\"not based on a legal issue, but based on a trust issue, [where] a l… 250
## 8 "success would require u.s. support to succeed: \"all that is required… 138
## 9 "friends. please share with them - we believe this is a good foundatio… 166
## 10 "that the president still cared about him and encouraging him to stay … 333
content %>%
arrange(compound) %>%
head(10) %>%
select(text, page)
## # A tibble: 10 x 2
## text page
## <chr> <dbl>
## 1 "specialist who exposed a fraud and later died in a russian prison. 67… 120
## 2 "is terrible. this is the end of my presidency. i' m fucked .\" 504 th… 290
## 3 "be no serious argument against the president's potential criminal lia… 392
## 4 "worse , alfonse capone , legendary mob boss, killer and 'public enemy… 337
## 5 "foreign contributions ban, in violation of 18 u.s.c. § 371 ; the soli… 193
## 6 "horrible witch hunt and the dishonest media! 1025" 359
## 7 "had \"never been directed to do anything [he] believe[d] to be illega… 269
## 8 "one), three defendants with conspiracy to commit wire fraud and •bank… 182
## 9 "guilty, pursuant to a single-count information , to identity fraud , … 183
## 10 "difficult to argue that hacked emails in electronic form, which are t… 184
Search Engine
Let’s build a simple search engine based on cosine similarity (instead of the basic keyword search that comes with the PDF document), to make the report more easily searchable and to gain context from the lines most related to a given query. Using Python’s scikit-learn for this.
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.metrics.pairwise import linear_kernel
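# Recent scikit-learn releases expect stop_words to be a list (or 'english'), not a frozenset
stopwords = list(ENGLISH_STOP_WORDS)
vectorizer = TfidfVectorizer(analyzer='word', stop_words=stopwords, max_df=0.3, min_df=2)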
vector_train = vectorizer.fit_transform(r["content$text"])
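def get_related_lines(query):
    # Project the query into the TF-IDF space fitted on the report lines
    vector_query = vectorizer.transform([query])
    # TF-IDF rows are L2-normalized by default, so the dot product is the cosine similarity
    cosine_sim = linear_kernel(vector_query, vector_train).flatten()
    # argsort is ascending; the reversed slice keeps the 9 best-scoring indices, best first
    return cosine_sim.argsort()[:-10:-1]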
get_related_lines <- py_to_r(py$get_related_lines)
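To make the ranking less of a black box, here is a minimal standard-library sketch of the same idea. It is not scikit-learn's implementation: it uses a smoothed idf, applies no stop-word, `max_df` or `min_df` pruning, and every function name in it (`tokenize`, `fit_idf`, `tfidf_vector`, `related_lines`) is made up for this example. The point it illustrates is that once every TF-IDF vector is scaled to unit length, a plain dot product is the cosine similarity.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_vector(text, idf):
    """L2-normalized TF-IDF weights; terms unseen at fit time are ignored."""
    tf = Counter(t for t in tokenize(text) if t in idf)
    weights = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

def related_lines(query, docs, top_k=3):
    """0-based indices of the top_k lines most similar to the query, best first."""
    idf = fit_idf(docs)
    q = tfidf_vector(query, idf)
    doc_vectors = [tfidf_vector(d, idf) for d in docs]
    # Both vectors have unit length, so the dot product is the cosine similarity.
    scores = [sum(w * d.get(t, 0.0) for t, w in q.items()) for d in doc_vectors]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:top_k]

# Toy corpus, for illustration only.
docs = [
    "cohen called the white house",
    "flynn is awaiting sentencing",
    "michael cohen testified before congress",
    "the office interviewed michael flynn",
]
print(related_lines("michael cohen", docs, top_k=2))
```

The line containing both query terms ranks first, and lines sharing only one term follow. The returned indices are 0-based positions into `docs`.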
Search Query - Michael Cohen
content %>%
slice(get_related_lines("michael cohen")) %>%
select(text, page, line)
## # A tibble: 9 x 3
## text page line
## <chr> <dbl> <dbl>
## 1 "alphabetically by subject, are summarized below." 446 3
## 2 "flynn is awaiting sentencing ." 203 25
## 3 "called melania trump's cell phone several times between january … 357 39
## 4 "manafort \"a brave man\" for refusin to \"break\" and said that … 218 23
## 5 "(bob) spring 2016 to discuss russian foreign… 404 17
## 6 "p-sco-000000328 (5/9/17 letter, hpsci to cohen); p-sco-000000331… 351 31
## 7 "u.s. department of justice" 346 1
## 8 "988" 354 48
## 9 "organization before the house oversight and ref orm committee, 1… 348 50
One of the interesting results from above (page 357) -
“Toll records show that Cohen was connected to a White House phone number for approximately five minutes on January 19, 2018, and for approximately seven minutes on January 30, 2018, and that Cohen called Melania Trump’s cell phone several times between January 26, 2018, and January 30, 2018. Call Records of Michael Cohen.”
Search Query - Vladimir Putin
content %>%
slice(get_related_lines("vladimir putin")) %>%
select(text, page, line)
## # A tibble: 9 x 3
## text page line
## <chr> <dbl> <dbl>
## 1 "response to question i, part (h)" 431 18
## 2 "polon skaya, olga russian national introduced to george p… 408 30
## 3 "fine.\"); andrew rafferty, trump says he would \"get along very … 228 26
## 4 "and former aide to russia's mini ster of energy . he communicate… 406 6
## 5 "2016?" 420 32
## 6 "accompanied by a russian female named olga polonskaya. mifsud in… 92 9
## 7 "vt. why did you ultimately not give the speech you reference… 428 32
## 8 "candidates by speaking of closer ties with russia, 10 saying he … 228 6
## 9 "kuznetsov, sergey russian government official at the russ… 406 18
One of the interesting results from above (page 92) -
“On March 24, 2016, Papadopoulos met with Mifsud in London. Mifsud was accompanied by a Russian female named Olga Polonskaya. Mifsud introduced Polonskaya as a former student of his who had connections to Vladimir Putin.”
Late Night Hosts’ Jokes
Let’s see if we can catch (or fact-check, if you will) any of the late night hosts’ jokes about the Mueller Report.
Trevor Noah
content %>%
arrange(compound) %>%
head(10) %>%
select(text, page, line)
## # A tibble: 10 x 3
## text page line
## <chr> <dbl> <dbl>
## 1 "specialist who exposed a fraud and later died in a russian pris… 120 11
## 2 "is terrible. this is the end of my presidency. i' m fucked .\" … 290 20
## 3 "be no serious argument against the president's potential crimin… 392 26
## 4 "worse , alfonse capone , legendary mob boss, killer and 'public… 337 14
## 5 "foreign contributions ban, in violation of 18 u.s.c. § 371 ; th… 193 41
## 6 "horrible witch hunt and the dishonest media! 1025" 359 4
## 7 "had \"never been directed to do anything [he] believe[d] to be … 269 4
## 8 "one), three defendants with conspiracy to commit wire fraud and… 182 28
## 9 "guilty, pursuant to a single-count information , to identity fr… 183 15
## 10 "difficult to argue that hacked emails in electronic form, which… 184 27
The joke at around 34 seconds into the video appears above as the second most negative line in the report (page 290, line 20).
Stephen Colbert
content %>%
slice(get_related_lines("olc")) %>%
select(text, page, line)
## # A tibble: 9 x 3
## text page line
## <chr> <dbl> <dbl>
## 1 "concerns about sealed indictments. even if an indictment were se… 214 20
## 2 "alia, bribery,\" id. (citing u.s. const.artii,§ 4)." 382 28
## 3 "constitution confers no power in the president to receive bribes… 382 24
## 4 "second, while the olc opinion concludes that a sitting president… 213 29
## 5 "a sitting president 's amenability to indictment and criminal pr… 213 35
## 6 "l, § 3, cl. 7. impeachment is also a drastic and rarely invoked … 390 40
## 7 "department of justice and the framework of the special counsel r… 213 24
## 8 "govern and potentially preempt constitutional processes for addr… 213 28
## 9 "executive officials)." 382 18
The discussion about the OLC at around 9 minutes 48 seconds into the video above shows up in several of our search results, notably here (page 390) - “Impeachment is also a drastic and rarely invoked remedy, and Congress is not restricted to relying only on impeachment, rather than making criminal law applicable to a former President, as OLC has recognized.”
Jimmy Fallon
content %>%
slice(get_related_lines("crazy")) %>%
select(text, page, line)
## # A tibble: 9 x 3
## text page line
## <chr> <dbl> <dbl>
## 1 "president had said similar things about comey in an off-the-reco… 283 43
## 2 "ukraine to support the plan. 925 manafort also initially told th… 148 8
## 3 "president ' s request, he decided not to share details of the pr… 299 7
## 4 "office, prepared to submit a resignation letter with his chief o… 300 23
## 5 "information.\" 470 hicks said that when she told the president a… 283 16
## 6 "prior day to terminate comey, telling lavrov and kislyak: \"tjus… 283 8
## 7 "the president also told the russian foreign minister , \"i just … 274 21
## 8 "serious concerns about obstruction\" may have referred to concer… 262 38
## 9 "• ' investigative technique" 157 20
The joke at around 3 minutes into the video can be found with this search query, which leads to one of these interesting results on page 299 of the report - “That evening, McGahn called both Priebus and Bannon and told them that he intended to resign. McGahn recalled that, after speaking with his attorney and given the nature of the President’s request, he decided not to share details of the President’s request with other White House staff.”
Jimmy Kimmel
content %>%
slice(get_related_lines("election interference")) %>%
select(text, page, line)
## # A tibble: 9 x 3
## text page line
## <chr> <dbl> <dbl>
## 1 notwithstanding his recusal , he was going to confine the special… 310 28
## 2 members of the trump campaign conspired or coordinated with the r… 13 23
## 3 establish that members of the trump campaign conspired or coordin… 10 4
## 4 2. president-elect trump is briefed on the intelligence community… 239 3
## 5 (e.g., florida and pennsylvania) that were perceived as competiti… 51 11
## 6 is a good guy to negotiate .... 80 5
## 7 papadopoulos 's false statements in january 2017 impeded the fbi'… 201 27
## 8 for having interfered in the election. by early 2017, several con… 9 27
## 9 rosenstein, rod deput y attorney general (apr. 20 17 - pr… 409 13
The discussion at around 50 seconds into the video can be found with the above search query, on page 13 - “Although the investigation established that the Russian government perceived it would benefit from a Trump presidency and worked to secure that outcome, and that the Campaign expected it would benefit electorally from information stolen and released through Russian efforts, the investigation did not establish that members of the Trump Campaign conspired or coordinated with the Russian government in its election interference activities.”
Summary
All of the above tools can be pretty handy for a quick yet thorough investigation of a large document in very little time (and it’s pretty fun too!)