Introduction
In this NLP getting-started challenge on Kaggle, we are given tweets labelled 1 if they are about real disasters and 0 if not. The goal is to predict, given the text of a tweet and some other metadata about it, whether it is about a real disaster or not.
In this part 1 on data preparation, I will do some basic exploration and vectorize the given tweet text into GloVe embedding vectors.
Analysis
Load Libraries
rm(list = ls())
library(tidyverse)
library(ggplot2)
library(GGally)
library(skimr)
library(tidymodels)
library(keras)
library(janitor)
theme_set(theme_light())
Read Data
tweets <- read_csv("../data/nlp_with_disaster_tweets/train.csv")
tweets_test <- read_csv("../data/nlp_with_disaster_tweets/test.csv")
skim(tweets)
| Name | tweets |
|---|---|
| Number of rows | 7613 |
| Number of columns | 5 |
| Column type frequency: | |
| character | 3 |
| numeric | 2 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| keyword | 61 | 0.99 | 4 | 21 | 0 | 221 | 0 |
| location | 2534 | 0.67 | 1 | 49 | 0 | 3279 | 0 |
| text | 0 | 1.00 | 7 | 157 | 0 | 7503 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 5441.93 | 3137.12 | 1 | 2734 | 5408 | 8146 | 10873 | ▇▇▇▇▇ |
| target | 0 | 1 | 0.43 | 0.50 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▆ |
Getting GloVe embeddings for tweet text and adding them as features
The simple workflow for vectorizing tweet text into GloVe embeddings is as follows:
- Tokenize incoming tweet texts in the training data.
- Download and parse GloVe embeddings into an embedding matrix for the tokenized words.
- Generate embedding vectors for the tweet text in the training data.
- Generate embedding vectors for the tweet text in the test data.
- Append to given tweets features and export.
Tokenize incoming tweet texts in the training data
Using keras’ text_tokenizer to tokenize the text in the tweets dataset.
text_tokenizer() %>%
  fit_text_tokenizer(tweets$text) -> tokenizer
num_words <- length(tokenizer$word_index) + 1
print(length(tokenizer$word_index))
## [1] 22700
A total of 22,700 unique words were assigned an index in the tokenization.
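To get a feel for the tokenizer, we can peek at the first few entries of its word index; this is an optional check, not part of the pipeline, and the exact words and indices depend on the corpus.

# Inspect the word -> index mapping learnt by the tokenizer
# (the actual words shown depend on the tweet corpus).
head(tokenizer$word_index, 5)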
Using the tokenizer fit above, we convert the tweet text to sequences of word indices.
sequences <- texts_to_sequences(tokenizer, tweets$text)
summary(map_int(sequences, length))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 13.00 17.00 16.84 21.00 33.00
maxlen <- max(map_int(sequences, length))
print(maxlen)
## [1] 33
Capping the maximum length of a tweet sequence at 33, we translate every tweet’s text into a sequence of length 33. If the original sequence is longer, it is truncated from the beginning; if it is shorter, it is padded at the beginning to bring the final length to 33 (keras’ default pre-truncation and pre-padding).
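To make this behaviour concrete, here is a small toy illustration (not part of the pipeline) using keras’ defaults:

# A sequence shorter than maxlen is zero-padded at the start,
# and one longer than maxlen is truncated from the start.
pad_sequences(list(c(5, 6, 7)), maxlen = 5)            # 0 0 5 6 7
pad_sequences(list(c(1, 2, 3, 4, 5, 6)), maxlen = 5)   # 2 3 4 5 6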
pad_sequences(sequences, maxlen = maxlen) -> padded_sequences
str(padded_sequences)
## int [1:7613, 1:33] 0 0 0 0 0 0 0 0 0 0 ...
As we see above, for each of the 7,613 tweets in the training data we have created a tokenized sequence of 33 elements.
Download and parse GloVe embeddings into an embedding matrix for the tokenized words
I downloaded the pre-trained GloVe embeddings trained on 2 billion tweets from Stanford’s NLP project page for GloVe, and borrowed the code for parsing the embeddings file and generating the embedding matrix from my deepSentimentR package.
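Before parsing the full file, it helps to know its format: each line holds a word followed by its space-separated embedding values (25 of them for the 25d file). A quick optional check, assuming the file sits at the path used in the call further below:

# Each line of the GloVe file looks like "<word> <v1> <v2> ... <v25>".
glove_path <- "../../../nlp_with_disaster_tweets/data/glove.twitter.27B/glove.twitter.27B.25d.txt"
first_line <- readLines(glove_path, n = 1)
length(strsplit(first_line, " ")[[1]])  # expect 26: the word plus 25 dimensions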
parse_glove_embeddings <- function(file_path) {
  lines <- readLines(file_path)
  embeddings_index <- new.env(hash = TRUE, parent = emptyenv())
  # each line holds a word followed by its space-separated embedding values
  for (i in 1:length(lines)) {
    line <- lines[[i]]
    values <- strsplit(line, " ")[[1]]
    word <- values[[1]]
    embeddings_index[[word]] <- as.double(values[-1])
  }
  cat("Found", length(embeddings_index), "word vectors.\n")
  return(embeddings_index)
}

generate_embedding_matrix <- function(word_index, embedding_dim, max_words, glove_file_path) {
  embeddings_index <- parse_glove_embeddings(glove_file_path)
  # row i + 1 holds the vector for the word with tokenizer index i;
  # row 1 (index 0) stays all zeros for the padding token, as do words
  # without a pre-trained GloVe vector
  embedding_matrix <- array(0, c(max_words, embedding_dim))
  for (word in names(word_index)) {
    index <- word_index[[word]]
    if (index < max_words) {
      embedding_vector <- embeddings_index[[word]]
      if (!is.null(embedding_vector)) {
        embedding_matrix[index + 1, ] <- embedding_vector
      }
    }
  }
  return(embedding_matrix)
}
embedding_dim <- 25
embedding_matrix <- generate_embedding_matrix(tokenizer$word_index,
                                              embedding_dim = embedding_dim,
                                              max_words = num_words,
                                              "../../../nlp_with_disaster_tweets/data/glove.twitter.27B/glove.twitter.27B.25d.txt")
saveRDS(embedding_matrix, "../data/nlp_with_disaster_tweets/embedding_matrix_25d.rds")
embedding_matrix <- readRDS("../data/nlp_with_disaster_tweets/embedding_matrix_25d.rds")
str(embedding_matrix)
## num [1:22701, 1:25] 0 0.7864 0.4186 0.7086 -0.0102 ...
Using only 25-dimensional embeddings in order to keep the computations fast, we have created an embedding matrix that holds the 25 dimension values for each of the 22,700 words in our tweet text vocabulary (plus an all-zero first row for the padding index).
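As a quick, optional check of the indexing convention, we can look up the vector for a single word; “fire” here is just an example token assumed to be in the vocabulary.

# The word with tokenizer index i sits in row i + 1 of the embedding matrix
# (row 1 is the all-zero row reserved for the padding index 0).
idx <- tokenizer$word_index[["fire"]]   # "fire" is just an illustrative word
embedding_matrix[idx + 1, ]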
Generate embedding vectors for tweet text in the training data
Using the keras modelling framework to generate embeddings for the given training data. We create a simple sequential model with a single embedding layer, whose weights we freeze to the embedding matrix created above, followed by a flattening layer that flattens the output into a 2D matrix of dimensions (7613, 33x25).
keras_model_sequential() %>%
  layer_embedding(input_dim = num_words, output_dim = embedding_dim,
                  input_length = maxlen, name = "embedding") %>%
  layer_flatten(name = "flatten") -> model_embedding

model_embedding %>%
  get_layer(name = "embedding") %>%
  set_weights(list(embedding_matrix)) %>%
  freeze_weights()

model_embedding %>%
  predict(padded_sequences) -> tweets_embeddings
str(tweets_embeddings)
## num [1:7613, 1:825] 0 0 0 0 0 0 0 0 0 0 ...
For each of the 7,613 padded tweet sequences of length 33, we use the keras model to “predict” (i.e. look up) the embedding of each of the 33 words in the sequence and subsequently flatten those into a single feature vector of 33x25 = 825 dimensions.
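A small sanity check (a sketch, not part of the pipeline) confirms that the flattened vector is just the 33 token embeddings laid out side by side: for the first tweet, its last 25 values should match the embedding-matrix row of its final token, up to float32 rounding.

# Compare the last 25 flattened values of tweet 1 against a manual lookup
# of its last token's embedding row (tolerance accounts for float32 rounding).
n_feat <- maxlen * embedding_dim          # 33 * 25 = 825
last_token <- padded_sequences[1, maxlen]
all.equal(
  as.numeric(tweets_embeddings[1, (n_feat - embedding_dim + 1):n_feat]),
  as.numeric(embedding_matrix[last_token + 1, ]),
  tolerance = 1e-6
)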
Generate embedding vectors for tweet text in the test data
Using the same approach as above (i.e. tokenize, pad and vectorize using GloVe embeddings) on the test data, we generate a similar embedding vector for the text in the tweets test data. Note that we use the text tokenizer previously fit on the training data to tokenize the test data.
sequences_test <- texts_to_sequences(tokenizer, tweets_test$text)
pad_sequences(sequences_test, maxlen = maxlen) -> padded_sequences_test
model_embedding %>%
  predict(padded_sequences_test) -> tweets_embeddings_test
str(tweets_embeddings_test)
## num [1:3263, 1:825] 0 0 0 0 0 0 0 0 0 0 ...
We similarly get 825 embedding dimensions for 3,263 tweets in the test data.
Append to given tweets features and export
tweets %>%
  bind_cols(as_tibble(tweets_embeddings)) %>%
  clean_names() -> tweets_proc

tweets_test %>%
  bind_cols(as_tibble(tweets_embeddings_test)) %>%
  clean_names() -> tweets_test_proc
saveRDS(tweets_proc, "../data/nlp_with_disaster_tweets/tweets_proc.rds")
saveRDS(tweets_test_proc, "../data/nlp_with_disaster_tweets/tweets_test_proc.rds")
Exporting the appended feature set will help us work on this dataset for modelling. I will cover the modelling using the tidymodels framework in my upcoming posts.
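In those follow-up modelling sessions, the exported files can simply be read back in (a minimal sketch, assuming the same relative paths):

# Read the processed feature sets back in a later modelling session.
tweets_proc <- readRDS("../data/nlp_with_disaster_tweets/tweets_proc.rds")
tweets_test_proc <- readRDS("../data/nlp_with_disaster_tweets/tweets_test_proc.rds")
dim(tweets_proc)  # 7613 rows; the original columns plus 825 embedding features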
References
- Project Summary Page - NLP with disaster tweets: Summary
- GloVe: Global Vectors for Word Representation - Stanford NLP Glove Project
- DeepSentimentR package - deepSentimentR