Introduction
In this NLP getting-started challenge on Kaggle, we are given tweets that are labelled 1 if they are about real disasters and 0 if not. The goal is to predict, from the text of a tweet and some other metadata about it, whether it is about a real disaster.
In this Part 2 on nearest neighbor modelling, I will use the processed data generated in Part 1 to train nearest neighbor models with the tidymodels framework and predict whether a tweet is about a real disaster or not.
Analysis
Load Libraries
rm(list = ls())
library(tidyverse)
library(ggplot2)
library(tidymodels)
library(silgelib)
theme_set(theme_plex())
Loading processed data from the previous part
tweets <- readRDS("../data/nlp_with_disaster_tweets/tweets_proc.rds")
tweets_final <- readRDS("../data/nlp_with_disaster_tweets/tweets_test_proc.rds")
tweets %>%
dim
## [1] 7613 830
tweets_final %>%
dim
## [1] 3263 829
Feature preprocessing and engineering
tweets %>%
mutate(target = as.factor(target),
id = as.character(id)) -> tweets
tweets %>%
count(target, sort = T)
## # A tibble: 2 x 2
## target n
## <fct> <int>
## 1 0 4342
## 2 1 3271
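Converting these counts to proportions shows a mild class imbalance, roughly 57% non-disaster to 43% disaster tweets (a quick sketch):
# Class proportions; these also drive the random baseline further below
tweets %>%
  count(target) %>%
  mutate(prop = n / sum(n))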
Split data
Splitting the data into 3 sets: a test set with 10% of the data, a cross-validation set with 20%, and a training set with 70%. The training and cross-validation sets will be used for training, tuning and validating the models and comparing among them. The test set will only be used for the final estimate of model performance on unseen data.
set.seed(42)
# With prop = 0.1, training() holds the 10% slice, which we keep aside as the test set
tweets_split <- initial_split(tweets, prop = 0.1, strata = target)
tweets_test <- training(tweets_split)
tweets_train_cv <- testing(tweets_split)
set.seed(42)
# Split the remaining 90% into training (7/9 of it, i.e. 70% of the total) and CV (20%)
tweets_split <- initial_split(tweets_train_cv, prop = 7/9, strata = target)
tweets_train <- training(tweets_split)
tweets_cv <- testing(tweets_split)
dim(tweets_train)
## [1] 5328 830
dim(tweets_cv)
## [1] 1522 830
dim(tweets_test)
## [1] 763 830
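Since we stratified every split on target, the class ratio should be preserved across all three sets; a quick sanity check (a sketch, using purrr from the tidyverse):
# Compare class proportions across the three splits
list(train = tweets_train, cv = tweets_cv, test = tweets_test) %>%
  map_dfr(~ count(.x, target) %>% mutate(prop = n / sum(n)), .id = "split")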
Preparation Recipe
I will use the recipes package from tidymodels to define a recipe for data preprocessing and feature engineering.
recipe(target ~ ., data = tweets_train) %>%
update_role(id, new_role = "ID") %>%
step_rm(location, keyword) %>%
step_mutate(len = str_length(text),
num_hashtags = str_count(text, "#")) %>%
step_rm(text) %>%
step_zv(all_numeric(), -all_outcomes()) %>%
step_normalize(all_numeric(), -all_outcomes()) %>%
step_pca(all_predictors(), -len, -num_hashtags, threshold = 0.80) -> tweets_recipe
Notes on the recipe above:
- We use the training dataset to create the recipe.
- We won't use the 'id' field as a predictor, only as an identifier.
- For the current analysis, we drop the location and keyword features.
- We create a length feature to capture the tweet length and another feature for the number of hashtags in the tweet.
- We drop the text field since we have generated all the features we want from it for now.
- We remove all predictors with zero variance.
- We normalize all numeric features, i.e. center and scale them.
- We add dimensionality reduction using PCA to keep 80% of the variance while reducing the number of features, keeping our two custom features out of the PCA step.
tweets_prep <- tweets_recipe %>%
prep(training = tweets_train,
strings_as_factors = FALSE)
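To verify what the recipe produced, we can inspect the processed training set; the column count shows how many PCA components survived the 80% variance threshold (a quick sketch):
# juice() returns the processed training data:
# PCA components plus len, num_hashtags, id and target
tweets_prep %>%
  juice() %>%
  dim()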
Modelling
Baseline model
I will first create a baseline model to beat. In this case, we predict randomly in proportion to the target class counts and evaluate the model's performance accordingly.
tweets_prep %>%
juice() %>%
count(target) %>%
mutate(prob = n/sum(n)) %>%
pull(prob) -> probs
set.seed(42)
tweets_prep %>%
bake(new_data = tweets_cv) %>%
mutate(predicted_target = as.factor(sample(0:1,
size = nrow(tweets_cv),
prob = probs, replace = T))) %>%
accuracy(target, predicted_target)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.512
set.seed(42)
tweets_prep %>%
bake(new_data = tweets_cv) %>%
mutate(predicted_target = as.factor(sample(0:1,
size = nrow(tweets_cv),
prob = probs, replace = T))) %>%
f_meas(target, predicted_target)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 f_meas binary 0.581
As we see above, we get a baseline f1-score of 0.5812 on the CV set. We need to build and train a model that beats this baseline.
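One caveat worth noting: yardstick treats the first factor level ("0" here) as the event by default, so the f1-score above is for the non-disaster class. In newer yardstick versions we could score the disaster class instead by passing event_level = "second" (a sketch):
set.seed(42)
tweets_prep %>%
  bake(new_data = tweets_cv) %>%
  mutate(predicted_target = as.factor(sample(0:1, size = nrow(tweets_cv),
                                             prob = probs, replace = TRUE))) %>%
  f_meas(target, predicted_target, event_level = "second")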
set.seed(42)
tweets_prep %>%
bake(new_data = tweets_test) %>%
mutate(predicted_target = as.factor(sample(0:1,
size = nrow(tweets_test),
prob = probs, replace = T))) %>%
accuracy(target, predicted_target)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.503
set.seed(42)
tweets_prep %>%
bake(new_data = tweets_test) %>%
mutate(predicted_target = as.factor(sample(0:1,
size = nrow(tweets_test),
prob = probs, replace = T))) %>%
f_meas(target, predicted_target)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 f_meas binary 0.574
Generating submission file
set.seed(42)
tweets_prep %>%
bake(new_data = tweets_final) %>%
mutate(target = as.factor(sample(0:1,
size = nrow(tweets_final),
prob = probs, replace = T))) %>%
select(id, target) %>%
write_csv("../data/nlp_with_disaster_tweets/submissions/baseline_cvf_57_testf_57.csv")
K-Nearest Neighbor model
Let's build a basic KNN model with a small fixed number of neighbors (K = 3) to see how modelling is done in this framework and what the modelling output looks like.
Basic
knn_spec <- nearest_neighbor(neighbors = 3) %>%
set_engine("kknn") %>%
set_mode("classification")
wf <- workflow() %>%
add_recipe(tweets_recipe)
knn_fit <- wf %>%
add_model(knn_spec) %>%
fit(data = tweets_train)
saveRDS(knn_fit, "../data/nlp_with_disaster_tweets/knn/knn_basic_fit.rds")
knn_fit <- readRDS("../data/nlp_with_disaster_tweets/knn/knn_basic_fit.rds")
knn_fit %>%
pull_workflow_fit() -> wf_fit
wf_fit$fit$MISCLASS
## optimal
## 3 0.3521021
The above shows a simple K-nearest neighbors model using the "kknn" engine, giving a minimal misclassification rate of about 0.352. Let's tune the number of neighbors (K) and see whether we can interpret the underlying problem space.
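But first, as a quick sanity check, we can score this basic fit on the held-out CV set so that it is directly comparable with the baseline above (a sketch; predict() on a fitted workflow applies the recipe automatically):
knn_fit %>%
  predict(new_data = tweets_cv) %>%
  bind_cols(tweets_cv %>% select(target)) %>%
  f_meas(target, .pred_class)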
Tuning number of neighbors
Using 5-fold cross-validation and values of K from 1 to 100.
set.seed(1234)
folds <- vfold_cv(tweets_train, strata = target, v = 5, repeats = 1)
tune_spec <- nearest_neighbor(neighbors = tune()) %>%
set_mode("classification") %>%
set_engine("kknn")
neighbor_grid <- expand.grid(neighbors = seq(1,100, by = 1))
set.seed(1234)
doParallel::registerDoParallel(cores = parallel::detectCores(logical = FALSE))
knn_grid <- tune_grid(
wf %>% add_model(tune_spec),
resamples = folds,
grid = neighbor_grid,
metrics = metric_set(accuracy, roc_auc, f_meas),
control = control_grid(save_pred = TRUE,
verbose = TRUE)
)
saveRDS(knn_grid, "../data/nlp_with_disaster_tweets/knn/knn_grid.rds")
knn_grid <- readRDS("../data/nlp_with_disaster_tweets/knn/knn_grid.rds")
knn_grid %>%
collect_metrics()
## # A tibble: 300 x 6
## neighbors .metric .estimator mean n std_err
## <dbl> <chr> <chr> <dbl> <int> <dbl>
## 1 1 accuracy binary 0.628 5 0.00597
## 2 1 f_meas binary 0.705 5 0.00426
## 3 1 roc_auc binary 0.604 5 0.00653
## 4 2 accuracy binary 0.628 5 0.00598
## 5 2 f_meas binary 0.705 5 0.00426
## 6 2 roc_auc binary 0.630 5 0.00928
## 7 3 accuracy binary 0.628 5 0.00544
## 8 3 f_meas binary 0.706 5 0.00349
## 9 3 roc_auc binary 0.639 5 0.00873
## 10 4 accuracy binary 0.628 5 0.00537
## # … with 290 more rows
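As an aside, tune provides a default visualization of tuning results via autoplot() (a quick sketch); below, I build a custom plot around model flexibility instead.
# Default metric-vs-parameter plot from the tune package
autoplot(knn_grid)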
knn_grid %>%
collect_metrics() %>%
mutate(flexibility = 1/neighbors,
.metric = str_to_title(str_replace_all(.metric, "_", " "))) %>%
ggplot(aes(flexibility, mean, color = .metric)) +
geom_errorbar(aes(ymin = mean - std_err,
ymax = mean + std_err), alpha = 0.5) +
geom_line(size = 1.5) +
facet_wrap(~.metric, scales = "free", nrow = 3) +
scale_x_log10() +
theme(legend.position = "none") +
labs(title = "Model performance against model flexibility",
subtitle = "F1-score peaks around lower flexibility values",
x = "Model flexibility i.e. Log(1/NumberOfNeighbors)",
y = "Mean metric value")
As we see in the plot above, the f1-score on the evaluation folds increases until around K=20 and then starts to fall. We plot flexibility (i.e. 1/NumberOfNeighbors) to visualize how performance varies as model flexibility increases: a KNN model with K=1 is highly flexible and thus has high variance, whereas K=100 gives a much more rigid, less flexible model that may suffer from high bias.
It looks like our underlying problem sits closer to the flexible end than the rigid one (the optimal K appears to be around 20). We should keep this in mind when picking further models for experimentation.
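Before finalizing, show_best() from the tune package gives a quick look at the top candidate values of K by f1-score (a sketch):
knn_grid %>%
  show_best("f_meas", n = 5)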
Let's pick out the best K based on the highest f1-score, train our final model on the full training dataset, and evaluate it against the cross-validation dataset.
knn_grid %>%
select_best("f_meas") -> highest_f_meas
final_knn <- finalize_workflow(
wf %>% add_model(tune_spec),
highest_f_meas
)
last_fit(final_knn,
tweets_split,
metrics = metric_set(accuracy, roc_auc, f_meas)) -> knn_last_fit
saveRDS(knn_last_fit, "../data/nlp_with_disaster_tweets/knn/knn_last_fit.rds")
knn_last_fit <- readRDS("../data/nlp_with_disaster_tweets/knn/knn_last_fit.rds")
knn_last_fit %>%
collect_metrics()
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.643
## 2 f_meas binary 0.756
## 3 roc_auc binary 0.687
Our final fitted KNN model with K=25 gives an f1-score of 0.756, much higher than our baseline model's score on the same CV dataset.
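We can also pull the CV-set predictions out of the last_fit object and inspect the confusion matrix (a sketch):
# Predictions on the held-out CV set are stored by last_fit()
knn_last_fit %>%
  collect_predictions() %>%
  conf_mat(target, .pred_class)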
Summary
Using a basic modelling algorithm like K-nearest neighbors, we can thus learn quite a few things about our underlying problem space, apply those learnings to further model selection and tuning, and also produce a fairly robust model that predicts far more effectively than the baseline.
The tidymodels framework also provides a clean modelling structure that can easily be reproduced and used to train a variety of models. In the next part of this series, I will work on another classic modelling algorithm, lasso regression, where we will also see whether any of these features are much more important than the others and whether our 2 custom features are useful.
References
- Project Summary Page - NLP with disaster tweets: Summary
- Project Part 1 - NLP with Disaster Tweets: Part 1 Data Preparation
- Lasso Regression using Tidymodels by Julia Silge
- An Introduction to Statistical Learning - Book