Introduction

Analyzing the Special Counsel’s “Report On The Investigation Into Russian Interference In The 2016 Presidential Election”, published on justice.gov.

Analysis

Load libraries

rm(list = ls())
library(tidyverse)
library(pdftools)
library(tidylog)
library(hunspell)
library(tidytext)
library(ggplot2)
library(gridExtra)
library(scales)
library(reticulate)
library(widyr)
library(igraph)
library(ggraph)

theme_set(theme_light())
use_condaenv("stanford-nlp")

Downloading Report

If we want to download and parse the report ourselves, we can do so as follows:

download.file("https://www.justice.gov/storage/report.pdf", "~/Downloads/mueller-report.pdf")

report <- pdf_text("~/Downloads/mueller-report.pdf")

Instead, I will use the pre-converted CSV version of the report, hosted in the gadenbuie/mueller-report repository on GitHub.

report <- read_csv("https://raw.githubusercontent.com/gadenbuie/mueller-report/master/mueller_report.csv")
report %>% 
  glimpse()
## Rows: 19,195
## Columns: 3
## $ page <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3…
## $ line <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 1, 2, 3, 4, 5, 6…
## $ text <chr> "U.S. Department of Justice", "AttarAe:,c\\\\'erlc Predtiet // M…

Cleaning

Page range

As the PDF shows, the first few pages contain the title and table of contents, and the actual report text starts on page 9. We filter out the earlier pages before exploring.

report %>% 
  filter(page >= 9) -> content

Text NA

About 386 rows have NA as their text. These may be due to parsing failures over redactions, among other things. Dropping them for now.

content %>% 
  filter(!is.na(text)) -> content

Misspelled Words

content %>% 
  rowwise() %>% 
  mutate(num_misspelled_words = length(hunspell(text)[[1]]),
         num_words = length(str_split(text, " ")[[1]]),
         perc_misspelled = num_misspelled_words/num_words) %>% 
  select(-num_misspelled_words, -num_words) -> content

Dropping all rows (around 400) where more than 50% of the words are misspelled, on the assumption that these come from PDF parsing errors over redactions.

content %>% 
  filter(perc_misspelled <= 0.5) -> content
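
For readers who prefer Python, the same filter can be sketched roughly as follows. This is a minimal illustration, not the hunspell-based code above: `KNOWN_WORDS` is a hypothetical stand-in for a real spelling dictionary, the helper name `misspelled_ratio` is made up, and the garbled sample line is invented.

```python
# Hypothetical mini-dictionary standing in for hunspell's English dictionary.
KNOWN_WORDS = {"the", "president", "was", "briefed", "on", "report", "special", "counsel"}

def misspelled_ratio(line, dictionary=KNOWN_WORDS):
    """Share of space-separated tokens in `line` that are not in `dictionary`."""
    tokens = [t for t in line.split(" ") if t]
    if not tokens:
        return 0.0
    # Strip surrounding punctuation and lowercase before the dictionary lookup.
    cleaned = [t.strip(".,;:!?\"'()").lower() for t in tokens]
    unknown = sum(1 for w in cleaned if w and w not in dictionary)
    return unknown / len(tokens)

lines = [
    "The president was briefed on the report",
    "AttarAe:,c erlc Predtiet // Mtt) Cetttaitt",  # invented parsing debris
]

# Keep only lines where at most half of the tokens look misspelled.
kept = [ln for ln in lines if misspelled_ratio(ln) <= 0.5]
```

The 0.5 threshold mirrors the one used above. With a real dictionary, proper nouns and citations would also count as "misspelled", which is one reason to keep the cut-off generous.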

Normalize

content %>% 
  unnest_tokens(text, text, token = "lines") -> content

Most dramatic pages

Using the VADER lexicon from Python’s NLTK to score the sentiment of each line.

import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
text = pd.Series(r["content$text"])
sentiments = text.apply(sid.polarity_scores).apply(pd.Series)
# Back in R: attach the VADER scores (neg, neu, pos, compound) as new columns
content <- bind_cols(content, py$sentiments)
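
To give a rough feel for where a `compound` score comes from, here is a toy sketch. It assumes, as I understand VADER, that word valences are summed and then squashed into [-1, 1] with x / sqrt(x² + 15). The lexicon values below are illustrative only, and the real analyzer also applies rules for negation, intensifiers and punctuation that are left out here.

```python
import math

# Illustrative valence values only; not to be read as the real VADER lexicon.
TOY_LEXICON = {"good": 1.9, "great": 3.1, "terrible": -2.1, "fraud": -2.8}

def toy_compound(line, lexicon=TOY_LEXICON, alpha=15):
    """Sum word valences, then squash the sum into the range [-1, 1]."""
    words = (w.strip(".,!?\"'") for w in line.lower().split())
    total = sum(lexicon.get(w, 0.0) for w in words)
    return total / math.sqrt(total * total + alpha)

positive = toy_compound("he is a good guy")
negative = toy_compound("this is terrible fraud")
neutral = toy_compound("alphabetically by subject")
```

Because of the squashing step, a line with many strongly charged words approaches ±1 but never exceeds it, which is what makes per-line scores comparable.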

content %>%
  group_by(index = as.integer(page/2)) %>% 
  summarise(total_sentiment = sum(compound)) %>% 
  mutate(sentiment_color=ifelse(total_sentiment>0, "yes", "no")) %>%
  ggplot(aes(x=index, y=total_sentiment)) +
  geom_ribbon(aes(ymin=0, ymax=total_sentiment, fill=sentiment_color), color="black", alpha=0.5, show.legend = FALSE) +
  scale_fill_manual(values=c("#271569","#69b3a2")) +
  labs(x = "Page Index",
       y = "Total Polarity",
       title = "Text sentiment variance over pages",
       subtitle = "PageIndex = PageNumber/2, Compound polarity calculated using NLTK",
       caption = "Based on data from the Mueller Report")
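
The binning behind the plot can be sketched in plain Python as well. This assumes rows of `(page, compound)` pairs; the scores are toy values, and `total_by_page_pair` is a made-up helper name.

```python
from collections import defaultdict

def total_by_page_pair(rows):
    """rows: iterable of (page, compound) tuples -> {page // 2: summed compound}."""
    totals = defaultdict(float)
    for page, compound in rows:
        # Integer division puts pages 10 and 11 in bin 5, pages 12 and 13 in bin 6, etc.
        totals[page // 2] += compound
    return dict(totals)

rows = [(9, 0.5), (9, -0.25), (10, 0.25), (11, -0.75), (12, 0.0)]  # toy scores
totals = total_by_page_pair(rows)
```

Summing rather than averaging means that bins with more scored lines can swing further from zero, so the plot reflects both how charged and how dense the text is.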

While the above plot shows the generally interesting areas of the report, let's see which lines in the document are the most polarized, in either direction.

content %>% 
  arrange(desc(compound)) %>% 
  head(10) %>% 
  select(text, page)
## # A tibble: 10 x 2
##    text                                                                     page
##    <chr>                                                                   <dbl>
##  1 "crime\"). obstruction of justice can be motivated by a desire to prot…   369
##  2 "justice statutes to protect, among other things, the integrity of its…   388
##  3 "wealth fund and was interested in improving relations between the uni…   163
##  4 "protected the president. and (have great respect for that, i'll be ho…   322
##  5 "the same thing , but co-opt the liberal opposition and the gop opposi…    52
##  6 "go. he is a good guy. i hope you can let this go.\" 238 comey agreed …   252
##  7 "\"not based on a legal issue, but based on a trust issue, [where] a l…   250
##  8 "success would require u.s. support to succeed: \"all that is required…   138
##  9 "friends. please share with them - we believe this is a good foundatio…   166
## 10 "that the president still cared about him and encouraging him to stay …   333
content %>% 
  arrange(compound) %>% 
  head(10) %>% 
  select(text, page)
## # A tibble: 10 x 2
##    text                                                                     page
##    <chr>                                                                   <dbl>
##  1 "specialist who exposed a fraud and later died in a russian prison. 67…   120
##  2 "is terrible. this is the end of my presidency. i' m fucked .\" 504 th…   290
##  3 "be no serious argument against the president's potential criminal lia…   392
##  4 "worse , alfonse capone , legendary mob boss, killer and 'public enemy…   337
##  5 "foreign contributions ban, in violation of 18 u.s.c. § 371 ; the soli…   193
##  6 "horrible witch hunt and the dishonest media! 1025"                       359
##  7 "had \"never been directed to do anything [he] believe[d] to be illega…   269
##  8 "one), three defendants with conspiracy to commit wire fraud and •bank…   182
##  9 "guilty, pursuant to a single-count information , to identity fraud , …   183
## 10 "difficult to argue that hacked emails in electronic form, which are t…   184

Most Common Correlated Words

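# Assumption: tidy_content (not defined above) is one word per row, stop words removed
tidy_content <- content %>% unnest_tokens(word, text) %>% anti_join(stop_words, by = "word")
word_cors <- tidy_content %>% 
  add_count(word) %>% 
  filter(n > stats::quantile(n, 0.7)) %>% 
  pairwise_cor(word, page, sort = TRUE)
set.seed(123)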

word_cors %>%
  filter(correlation > 0.25,
         !str_detect(item1, "\\d"),
         !str_detect(item2, "\\d")) %>% 
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void() + 
  labs(x = "",
       y = "",
       title = "Commonly Occurring Correlated Words",
       subtitle = "Per page correlation higher than 0.25",
       caption = "Based on data from the Mueller Report")

Words like “mcgahn” and “comey” show up as centers of the correlation network.
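
As a rough idea of what a per-page word correlation measures, the sketch below computes the Pearson correlation between two words' presence indicators across pages (the phi coefficient for binary data). It assumes each page is reduced to a set of words; the pages and the helper name `presence_correlation` are made up for illustration and are not widyr's internals.

```python
import math

def presence_correlation(word_a, word_b, pages):
    """Pearson correlation of two words' 0/1 presence indicators across pages."""
    a = [1.0 if word_a in words else 0.0 for words in pages]
    b = [1.0 if word_b in words else 0.0 for words in pages]
    n = len(pages)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")  # a word on every page (or none) has no defined correlation
    return cov / math.sqrt(var_a * var_b)

# Toy pages, each reduced to the set of words it contains.
pages = [
    {"mcgahn", "president", "resign"},
    {"mcgahn", "president"},
    {"comey", "fbi"},
    {"comey", "fbi", "president"},
]

together = presence_correlation("comey", "fbi", pages)      # always co-occur
apart = presence_correlation("mcgahn", "comey", pages)      # never co-occur
partial = presence_correlation("mcgahn", "president", pages)
```

Words that always appear on the same pages score close to 1, words that never share a page score close to -1, and the 0.25 cut-off used above keeps only the clearly positive pairs.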

Search Engine

Let's build a simple search engine based on cosine similarity (instead of the basic keyword search that comes with a PDF viewer), to make the document easier to search and to surface the lines most related to a given query as context. Using Python’s scikit-learn for this.

from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.metrics.pairwise import linear_kernel

stopwords = list(ENGLISH_STOP_WORDS)  # recent scikit-learn versions expect a list here, not a frozenset
vectorizer = TfidfVectorizer(analyzer='word', stop_words=stopwords, max_df=0.3, min_df=2)
vector_train = vectorizer.fit_transform(r["content$text"])

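def get_related_lines(query):
  vector_query = vectorizer.transform([query])
  # TF-IDF rows are L2-normalised by default, so the linear kernel equals cosine similarity
  cosine_sim = linear_kernel(vector_query, vector_train).flatten()
  # 0-based positions of the 9 highest-scoring lines, best first
  return cosine_sim.argsort()[:-10:-1]
# Back in R. dplyr::slice() is 1-based, so add 1 to these positions to select the matching lines themselves.
get_related_lines <- py_to_r(py$get_related_lines)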
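
To make the ranking step less of a black box, here is a dependency-free sketch of TF-IDF plus cosine similarity. It assumes whitespace tokenisation and a smoothed idf of ln((1 + n) / (1 + df)) + 1, which I believe matches scikit-learn's default; stop words and the `max_df`/`min_df` cut-offs are left out. The helper names and the three toy documents are made up.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """L2-normalised TF-IDF vectors (as dicts) for whitespace-tokenised docs."""
    tokenised = [doc.lower().split() for doc in docs]
    n = len(tokenised)
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}  # smoothed idf
    vectors = []
    for tokens in tokenised:
        weights = {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors, idf

def search(query, vectors, idf, k=3):
    """0-based positions of the k most similar docs, best first."""
    counts = Counter(query.lower().split())
    q = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    q_norm = math.sqrt(sum(w * w for w in q.values())) or 1.0
    # Document vectors are already unit length, so a dot product gives the cosine.
    scores = [sum(w * vec.get(t, 0.0) for t, w in q.items()) / q_norm for vec in vectors]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

docs = [
    "flynn is awaiting sentencing",
    "call records of michael cohen",
    "cohen called a white house phone number",
]
vectors, idf = tfidf_vectors(docs)
hits = search("michael cohen", vectors, idf, k=2)

# Python positions are 0-based; R row numbers are 1-based.
rows_for_r = [i + 1 for i in hits]
```

The document that contains both query terms ranks first, and the rarer term ("michael") carries more weight than the more common one ("cohen"). The last line shows the shift needed when such positions are handed to a 1-based tool.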

Search Query - Michael Cohen

content %>% 
  slice(get_related_lines("michael cohen")) %>% 
  select(text, page, line)
## # A tibble: 9 x 3
##   text                                                                page  line
##   <chr>                                                              <dbl> <dbl>
## 1 "alphabetically by subject, are summarized below."                   446     3
## 2 "flynn is awaiting sentencing ."                                     203    25
## 3 "called melania trump's cell phone several times between january …   357    39
## 4 "manafort \"a brave man\" for refusin to \"break\" and said that …   218    23
## 5 "(bob)                     spring 2016 to discuss russian foreign…   404    17
## 6 "p-sco-000000328 (5/9/17 letter, hpsci to cohen); p-sco-000000331…   351    31
## 7 "u.s. department of justice"                                         346     1
## 8 "988"                                                                354    48
## 9 "organization before the house oversight and ref orm committee, 1…   348    50

One interesting result from above (page 357):
“Toll records show that Cohen was connected to a White House phone number for approximately five minutes on January 19, 2018, and for approximately seven minutes on January 30, 2018, and that Cohen called Melania Trump’s cell phone several times between January 26, 2018, and January 30, 2018. Call Records of Michael Cohen.”

Search Query - Vladimir Putin

content %>% 
  slice(get_related_lines("vladimir putin")) %>% 
  select(text, page, line)
## # A tibble: 9 x 3
##   text                                                                page  line
##   <chr>                                                              <dbl> <dbl>
## 1 "response to question i, part (h)"                                   431    18
## 2 "polon skaya, olga        russian national introduced to george p…   408    30
## 3 "fine.\"); andrew rafferty, trump says he would \"get along very …   228    26
## 4 "and former aide to russia's mini ster of energy . he communicate…   406     6
## 5 "2016?"                                                              420    32
## 6 "accompanied by a russian female named olga polonskaya. mifsud in…    92     9
## 7 "vt.     why did you ultimately not give the speech you reference…   428    32
## 8 "candidates by speaking of closer ties with russia, 10 saying he …   228     6
## 9 "kuznetsov, sergey        russian government official at the russ…   406    18

One interesting result from above (page 92):
“On March 24, 2016, Papadopoulos met with Mifsud in London. Mifsud was accompanied by a Russian female named Olga Polonskaya. Mifsud introduced Polonskaya as a former student of his who had connections to Vladimir Putin.”

Late Night Hosts’ Jokes

Let's see if we can catch (or fact-check, if you will) any of the late night hosts’ jokes about the Mueller Report.

Trevor Noah

content %>% 
  arrange(compound) %>% 
  head(10) %>% 
  select(text, page, line)
## # A tibble: 10 x 3
##    text                                                               page  line
##    <chr>                                                             <dbl> <dbl>
##  1 "specialist who exposed a fraud and later died in a russian pris…   120    11
##  2 "is terrible. this is the end of my presidency. i' m fucked .\" …   290    20
##  3 "be no serious argument against the president's potential crimin…   392    26
##  4 "worse , alfonse capone , legendary mob boss, killer and 'public…   337    14
##  5 "foreign contributions ban, in violation of 18 u.s.c. § 371 ; th…   193    41
##  6 "horrible witch hunt and the dishonest media! 1025"                 359     4
##  7 "had \"never been directed to do anything [he] believe[d] to be …   269     4
##  8 "one), three defendants with conspiracy to commit wire fraud and…   182    28
##  9 "guilty, pursuant to a single-count information , to identity fr…   183    15
## 10 "difficult to argue that hacked emails in electronic form, which…   184    27

The joke at around 34 seconds into the video shows up clearly above as the second most negative line in the report (page 290, line 20).

Stephen Colbert

content %>% 
  slice(get_related_lines("olc")) %>% 
  select(text, page, line)
## # A tibble: 9 x 3
##   text                                                                page  line
##   <chr>                                                              <dbl> <dbl>
## 1 "concerns about sealed indictments. even if an indictment were se…   214    20
## 2 "alia, bribery,\" id. (citing u.s. const.artii,§ 4)."                382    28
## 3 "constitution confers no power in the president to receive bribes…   382    24
## 4 "second, while the olc opinion concludes that a sitting president…   213    29
## 5 "a sitting president 's amenability to indictment and criminal pr…   213    35
## 6 "l, § 3, cl. 7. impeachment is also a drastic and rarely invoked …   390    40
## 7 "department of justice and the framework of the special counsel r…   213    24
## 8 "govern and potentially preempt constitutional processes for addr…   213    28
## 9 "executive officials)."                                              382    18

The discussion about OLC at around 9 minutes 48 seconds into the video turns up in several of our search results, notably here (page 390): “Impeachment is also a drastic and rarely invoked remedy, and Congress is not restricted to relying only on impeachment, rather than making criminal law applicable to a former President, as OLC has recognized.”

Jimmy Fallon

content %>% 
  slice(get_related_lines("crazy")) %>% 
  select(text, page, line)
## # A tibble: 9 x 3
##   text                                                                page  line
##   <chr>                                                              <dbl> <dbl>
## 1 "president had said similar things about comey in an off-the-reco…   283    43
## 2 "ukraine to support the plan. 925 manafort also initially told th…   148     8
## 3 "president ' s request, he decided not to share details of the pr…   299     7
## 4 "office, prepared to submit a resignation letter with his chief o…   300    23
## 5 "information.\" 470 hicks said that when she told the president a…   283    16
## 6 "prior day to terminate comey, telling lavrov and kislyak: \"tjus…   283     8
## 7 "the president also told the russian foreign minister , \"i just …   274    21
## 8 "serious concerns about obstruction\" may have referred to concer…   262    38
## 9 "• ' investigative technique"                                        157    20

The joke at around 3 minutes into the video can be traced with this search query, which leads to this interesting result on page 299 of the report: “That evening, McGahn called both Priebus and Bannon and told them that he intended to resign. McGahn recalled that, after speaking with his attorney and given the nature of the President’s request, he decided not to share details of the President’s request with other White House staff.”

Jimmy Kimmel

content %>% 
  slice(get_related_lines("election interference")) %>% 
  select(text, page, line)
## # A tibble: 9 x 3
##   text                                                                page  line
##   <chr>                                                              <dbl> <dbl>
## 1 notwithstanding his recusal , he was going to confine the special…   310    28
## 2 members of the trump campaign conspired or coordinated with the r…    13    23
## 3 establish that members of the trump campaign conspired or coordin…    10     4
## 4 2. president-elect trump is briefed on the intelligence community…   239     3
## 5 (e.g., florida and pennsylvania) that were perceived as competiti…    51    11
## 6 is a good guy to negotiate ....                                       80     5
## 7 papadopoulos 's false statements in january 2017 impeded the fbi'…   201    27
## 8 for having interfered in the election. by early 2017, several con…     9    27
## 9 rosenstein, rod         deput y attorney general (apr. 20 17 - pr…   409    13

The discussion at around 50 seconds into the video can be traced with the above search query, which leads to page 13: “Although the investigation established that the Russian government perceived it would benefit from a Trump presidency and worked to secure that outcome, and that the Campaign expected it would benefit electorally from information stolen and released through Russian efforts, the investigation did not establish that members of the Trump Campaign conspired or coordinated with the Russian government in its election interference activities.”

Summary

It looks like the tools above can be pretty handy for a quick yet thorough investigation of a large document in very little time (and it's pretty fun too!)