Authorship Prediction in Sci-Fi Literature Part II: Bag of Words in Neural Network
Part 1 ended with the dataset. Here, in Part 2, I’m going to cover a simple and intuitive machine learning strategy for classifying text known as the Bag of Words method. It fell short for this dataset, but keep in mind that many machine learning tasks are empirical in nature, and you never know what will work with a particular task/dataset until you try. So this experience does not invalidate the method itself.
I ran into two significant challenges in this part: running out of memory and feature engineering. The memory problem is resolved to the degree required for this project. Feature selection is resolved to some degree, but will be further improved in Part 3.
Before we start messing with this, split the dataset into training, validation, and test sets. Here's a simple function that takes in a pandas dataframe and a training_size between 0 and 1, returns a shuffled training set of the specified size (1 being the entire dataset and 0 being none of it), and splits the remainder equally into validation and test sets. I used 0.7 for training_size, and so have a 70%, 15%, and 15% split. A seed is set so the split can be reproduced if datasets get lost for whatever reason.
import numpy as np

def train_valid_test(data, training_size):
    np.random.seed(0)  # fixed seed so the split can be reproduced
    shuffled = data.iloc[np.random.permutation(data.index), ]  # assumes a default 0..n-1 index
    trainSetSize = int(shuffled.shape[0] * training_size)
    validSetSize = int(shuffled.shape[0] * ((1 - training_size) / 2))
    trainSet = shuffled.iloc[:trainSetSize + 1, :]
    validSet = shuffled.iloc[trainSetSize + 1:trainSetSize + validSetSize + 1, :]
    testSet = shuffled.iloc[trainSetSize + validSetSize + 1:, :]
    return trainSet, validSet, testSet

train, valid, test = train_valid_test(data, 0.7)
Everything from here on out happens on the training set, unless otherwise specified.
We have our sentences with authors assigned. Before we can even think of predicting the authors, we need to represent our data in a way that a machine learning algorithm “understands”. We need to convert sentences into numbers, preferably retaining some meaning so the algorithm is able to learn a pattern.
A simple and intuitive method is the bag of words representation, where each sentence is divided into words (known as tokenizing) and each unique word becomes a feature. For each sentence, its features would be the number of times each word is used in that sentence. In other words, we would have as many features as we have unique words in our dataset.
Here’s a toy example to better illustrate the idea:
>>> toy_data
sentence
0 Test! Sentence one has an exclamation mark, or...
1 Test2! Sentence two has a number too.
>>> bag_of_words
! , . ? a an exclamation has is it mark number one or point \
0 1 1 0 1 0 1 1 1 1 1 1 0 1 1 1
1 1 0 1 0 1 0 0 1 0 0 0 1 0 0 0
sentence test test2 too two
0 1 1 0 0 0
1 1 0 1 1 1
This can be implemented quite painlessly with the following packages:
from nltk.tokenize import TreebankWordTokenizer # for the tokenizer
import pandas as pd # for the dataframe creation
from sklearn.feature_extraction.text import CountVectorizer # for bag of words implementation
While we're at it, we can stem the words. Stemming is a way of reducing English words to a root form, like turning studies into study. Except stemming doesn't quite do that: it converts both studies and studying into studi (which is not an English word at all). To get grammatically correct conversions, you can do lemmatization instead, which is also provided by the nltk package. The reason I chose stemming is that we are dealing with science fiction texts and will very likely encounter words that can be stemmed but will not appear in the standard English dictionary.
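To see the difference for yourself, here's a quick sketch comparing nltk's Snowball stemmer with its WordNet lemmatizer (the lemmatizer needs the wordnet corpus downloaded, and it wants a part-of-speech hint to do its best work):
from nltk.stem import SnowballStemmer, WordNetLemmatizer

snow = SnowballStemmer("english")
print(snow.stem("studies"), snow.stem("studying"))   # both typically come out as "studi"

lemma = WordNetLemmatizer()                          # requires nltk.download("wordnet")
print(lemma.lemmatize("studies"))                    # "study" (default part of speech is noun)
print(lemma.lemmatize("studying", pos = "v"))        # "study"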
To implement stemming, we can define a tokenizer using nltk and feed this tokenizer into sklearn's CountVectorizer.
import nltk
import pandas as pd
from nltk.tokenize import TreebankWordTokenizer
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

data = pd.read_csv(file)  # file = path to the csv with the "sentence" column

# a callable tokenizer that stems each token with the Snowball stemmer
class Snowball():
    def __init__(self):
        self.snow = SnowballStemmer("english")
    def __call__(self, doc):
        return [self.snow.stem(t) for t in TreebankWordTokenizer().tokenize(doc)]

vectorizer = CountVectorizer(tokenizer = Snowball())
matrix = vectorizer.fit_transform(data["sentence"])  # sparse count matrix
dataframe = pd.DataFrame(matrix.A,                   # .A densifies the sparse matrix
                         columns = vectorizer.get_feature_names())
The dataframe would be the bag of words implemented on the sentences, with no filtering. This is a huge dataset: with my choice of segmentation, the bag of words implementation on the training set was close to 4GB, with more than 20,000 features. My six-year-old MacBook Pro has 4GB of RAM and runs out of memory around 20 epochs with 2,000 features (this can be remedied, to an extent, as I'll discuss later). We (I/my laptop) cannot efficiently train on this dataset. We need to do some feature engineering.
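As a rough sanity check on that figure, a dense float32 matrix of that size really is in the 4GB ballpark (back-of-the-envelope numbers; the exact feature count depends on the vocabulary):
rows = 47768         # training sentences (the row count mentioned below)
features = 20000     # roughly the number of unique stemmed tokens
bytes_per_value = 4  # float32

print(rows * features * bytes_per_value / 1e9)  # ~3.8 GB for the dense matrix alone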
To select features, we need to know what's going on. Since the dataset is close to the size of my RAM, I can't open it without crashing the Mac. There's a very simple way of doing this in R, but in an effort to avoid jumping back and forth between R and Python, I experimented with several Python-friendly ways of doing this. After suffering several crashes, I went back to R. This is not to say that nothing worked in Python, but the troubleshooting and learning process was too time consuming for me.
So let me take this opportunity to introduce the wonderful wizardry of memory management that is the data.table package in R. I do not pretend to know what happens under the hood, but working with big data feels absolutely no different from working with smaller-sized data. There's no delay in speed, your computer doesn't crash, there's no difference in syntax. It's really great for data exploration. Keep in mind, though, that this only holds inside the data.table package: if you try to plot the entire data.frame object obtained from fread and it's quite large, that may still crash your computer. But you can absolutely plot parts of it.
library(data.table) # after installing, attach the package
train <- fread("scifi_train_bagOfWords.csv",
               stringsAsFactors = FALSE, # almost always, set to FALSE
               data.table = FALSE,       # returns a data.table if TRUE, a data.frame if FALSE
               verbose = TRUE,           # for your sanity
               header = TRUE)
This took about five minutes on my laptop (with multiple Chrome windows open, each with at least 15 tabs). I've heard that it can read files in seconds. You can't complain about performance like that.
Preprocessing steps I did before this:
- Using regex, summed all features that contain numbers into a single NUMBERS feature.
- Calculated the length of each sentence (rowsum).
Now we'll look at the sums of each feature. And yes, not doing it iteratively crashed my Mac. So play it safe and do it iteratively.
train_columns <- colnames(new_train)
column_sums <- numeric(length(train_columns))
names(column_sums) <- train_columns
for (i in 1:length(train_columns)){
  if (!(train_columns[i] %in% c("author", "rowsum"))){
    column_sums[i] <- sum(new_train[, i])
    cat(c("Summed", i, ":", train_columns[i], "\n"), sep = " ") # make it verbose
  }
}
column_sums_dist <- quantile(column_sums, probs = seq(0, 1, 0.1))
The distribution is very skewed!
> column_sums_dist
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
0 1 1 1 2 2 4 7 15 46 59833
About 30% of the features have been used only once in the dataset, and about 70% have been used fewer than 10 times. In case you're wondering, this dataset has 47768 rows (sentences).
I've tried several feature selection strategies, detailed below. For each strategy, I trained a neural network constructed in tensorflow on the selected features. There are many hyperparameters you can tweak to improve performance, but it quickly became clear that the biggest factor in performance was feature selection. So I won't be detailing the hyperparameters, as none of these runs turned out to be good enough.
After selecting features in R, the validation set can be similarly processed in Python. The reasons to switch back to Python here are the following:
- Files are no longer larger than the RAM.
- We can load the training data into Python, process the validation set, turn both into standardized numpy arrays and a tensorflow dataset, and subsequently feed them into the neural network.
This can be done with the following, slightly overly verbose pipeline:
import numpy as np
import pandas as pd

# -------- Fill in the right info here ------------------
dataset = "scifi_sd_top2000"
pickle_file = dataset + "_train_and_valid.pckl"
author_scifi = {"ERB":0,
                "HGW":1,
                "JV":2,
                "PKD":3}
non_x_scifi = ["author"]

# Check if pickle_file exists
from os import listdir
if pickle_file in listdir():
    print(f"{pickle_file} already exists.")
    import sys
    print("Exiting.")
    sys.exit()

# Data loading and preparation
def loadingData(dataframe, author, non_x_columns):
    """Takes in a training, validation, or test dataframe
    from the scifi_authors project. File must have
    Bag of Words implemented with the authors' column named
    "author".
    Returns:
    1. features in a numpy array.
    2. labels with one hot encoding in a numpy array."""
    original_data = dataframe
    y = original_data[["author"]]
    X = original_data.drop(columns = non_x_columns)
    authors_one_hot = {"author":author}
    y = y.replace(authors_one_hot)
    y = np.eye(len(author))[y.values.flatten()]
    return X.values, y

def standardize(X, train = True, minimum = None, maximum = None):
    """X is the features to be standardized (min max scaling).
    If not standardizing training data (e.g. validation or test),
    set train to False and pass in the minimum and maximum arrays
    obtained from the training data."""
    if train:
        minimum = np.min(X, axis = 0)
        maximum = np.max(X, axis = 0)
    max_min_range = maximum - minimum
    # columns with zero range (constant columns) are left at 0
    standardized_X = np.divide(X - minimum, max_min_range[None, :],
                               out = np.zeros_like(X, dtype = float),
                               where = max_min_range[None, :] != 0)
    if train:
        return standardized_X, minimum, maximum
    return standardized_X

def loadTraining(dataframe, author, non_x_columns):
    X, y = loadingData(dataframe, author, non_x_columns)
    X, minimum, maximum = standardize(X)
    return X, y, minimum, maximum

def loadValidOrTest(dataframe, author, non_x_columns, minimum, maximum):
    X, y = loadingData(dataframe, author, non_x_columns)
    X = standardize(X, train = False, minimum = minimum, maximum = maximum)
    return X, y

def bagOfWords(file, train = True, columns = None):
    """Takes a csv file from the scifi_authors project as input,
    where the file has two columns: author, sentence. The author
    column has the author who wrote the sentence in the sentence
    column.
    Outputs a dataframe with bag of words implemented.
    If columns == None, nothing else is done.
    If columns != None, will return a dataframe with the
    columns specified."""
    print("Bag of Words function called.")
    import nltk
    from nltk.tokenize import TreebankWordTokenizer
    from nltk.stem import SnowballStemmer
    from sklearn.feature_extraction.text import CountVectorizer
    print("Packages loaded.")
    data = pd.read_csv(file)
    class Snowball():
        def __init__(self):
            self.snow = SnowballStemmer("english")
        def __call__(self, doc):
            return [self.snow.stem(t) for t in TreebankWordTokenizer().tokenize(doc)]
    print("Snowball tokenizer created.")
    vectorizer = CountVectorizer(tokenizer = Snowball())
    matrix = vectorizer.fit_transform(data["sentence"])
    dataframe = pd.DataFrame(matrix.A,
                             columns = vectorizer.get_feature_names())
    print("Bag of words implemented.")
    dataframe = dataframe.astype("float32")
    dataframe["rowsum"] = dataframe.sum(1)
    print("Rowsum added.")
    dataframe.index = data.index
    dataframe["author"] = data["author"]
    print("Author reattached.")
    # deal with numbers columns
    numbers_columns = dataframe.filter(regex = ("[0-9]"))
    dataframe["numbers"] = numbers_columns.sum(1)
    print("Numbers column added.")
    if not train:
        # keep only the columns seen in training; fill unseen ones with 0
        dataframe = dataframe.reindex(columns = columns)
        dataframe = dataframe.fillna(0)
        print("Selected columns for dataframe.")
    return dataframe

print("All functions loaded.")

# ----- Loading Data ---------------
trainData = pd.read_csv(dataset + ".csv")
trainX, trainY, minimum, maximum = loadTraining(trainData,
                                                author = author_scifi,
                                                non_x_columns = non_x_scifi)
train_columns = trainData.columns
validData = bagOfWords("valid.csv", train = False, columns = train_columns)
validX, validY = loadValidOrTest(validData,
                                 author = author_scifi,
                                 non_x_columns = non_x_scifi,
                                 minimum = minimum,
                                 maximum = maximum)
print("Data loaded.")

# ----- Saving Data ----------------
import pickle
f = open(pickle_file, "wb")
pickle.dump([trainX, trainY, validX, validY], f)
f.close()
print(f"Data saved in {pickle_file}.")
Having a data processing pipeline means I can do the data exploration and feature selection in R (which involves active decision making), come back here, change one line, and have everything ready to go for the neural network.
The neural network script is very automated as well, but I will go through it in chunks so it makes sense. I use tensorflow for the neural network because once you learn it, it's very easy and nice to use.
Load the processed data:
import tensorflow as tf
import numpy as np
import pickle
f = open("scifi_sd_top2000_train_and_valid.pckl", "rb")
trainX, trainY, validX, validY = pickle.load(f)
f.close()
trainSet = tf.data.Dataset.from_tensor_slices((trainX, trainY))
print("Data loaded.")
I use the tensorflow Dataset because it allows easy shuffling.
Next, some baseline performance values. The random_guess_fn makes a random guess for each entry and returns the accuracy. The zero_rule_fn guesses that every entry is the most frequently occurring class, so it may score better than random_guess_fn. For this project, random_guess_fn scores about 25% and zero_rule_fn about 30%. So anything higher than this means the neural network is learning something.
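The post doesn't show these two functions, so here is a minimal sketch of what they could look like, assuming the one-hot encoded label array (trainY) produced by the pipeline above:
def random_guess_fn(y):
    """Hypothetical reconstruction: guess a uniformly random class
    for every row of the one-hot label matrix y and return the accuracy."""
    n_classes = y.shape[1]
    guesses = np.random.randint(0, n_classes, size = y.shape[0])
    return np.mean(guesses == np.argmax(y, axis = 1))

def zero_rule_fn(y):
    """Hypothetical reconstruction: always predict the most frequent
    class in y and return the accuracy."""
    labels = np.argmax(y, axis = 1)
    most_common = np.bincount(labels).argmax()
    return np.mean(labels == most_common)

# e.g. random_guess_fn(trainY) ~ 0.25 and zero_rule_fn(trainY) ~ 0.30 for this dataset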
Now we want to construct the neural network. Depending on your perspective, tensorflow either makes this overly difficult or overly easy.
First, the architecture:
# Neural network architecture
input_units = trainX.shape[1]
hidden_units = 50
output_units = trainY.shape[1]
Tweaking the architecture may or may not have significant effects on the outcome. Worth checking!
Next, the neural network itself:
def neuralNetwork(x, hidden_units, output_units, lmb):
    regularizer = tf.contrib.layers.l2_regularizer(scale = lmb)
    # hidden layer: relu activation
    layer1 = tf.contrib.layers.fully_connected(x,
                                               hidden_units,
                                               activation_fn = tf.nn.relu,
                                               weights_regularizer = regularizer,
                                               weights_initializer = tf.contrib.layers.xavier_initializer(seed = 10101))
    # output layer: sigmoid activation (see the discussion below)
    output = tf.contrib.layers.fully_connected(layer1,
                                               output_units,
                                               activation_fn = tf.sigmoid,
                                               weights_regularizer = regularizer,
                                               weights_initializer = tf.contrib.layers.xavier_initializer(seed = 10102))
    return output
Yes, tensorflow makes this really easy. Let's talk about the activation function first. There are many common activation functions for neural networks, including sigmoid, tanh, and relu. They all have their pros and cons. Sigmoid and tanh can be susceptible to the vanishing (disappearing) gradient problem, where the values fed back to the first layer through backpropagation are so small that it no longer learns. Relu, on the other hand, can have the dead relu problem, where it outputs 0, can no longer be revived, and so stops learning. There is also the exploding gradient problem, which I haven't encountered yet.
For this project and dataset, I encountered the dead relu problem, so I changed the output layer to a sigmoid layer, and this seems to solve it. A sigmoid or tanh layer can help with dead relu because a vanishing gradient is still some gradient, whereas a dead relu gives nothing. Using one of each (relu in the hidden layer, sigmoid at the output) keeps either problem from getting out of hand and ensures the network keeps learning.
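Another common mitigation for dead relus, which I did not use here, is a leaky relu, which keeps a small gradient for negative inputs. In the neuralNetwork() function above it would be a drop-in change to the hidden layer (a sketch):
# inside neuralNetwork(): swap the hidden layer's activation for a leaky relu,
# which keeps a small slope for negative inputs instead of a flat 0
layer1 = tf.contrib.layers.fully_connected(x,
                                           hidden_units,
                                           activation_fn = tf.nn.leaky_relu,  # default alpha = 0.2
                                           weights_regularizer = regularizer,
                                           weights_initializer = tf.contrib.layers.xavier_initializer(seed = 10101))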
Next, the other parts of the neural network:
epochs = 20
report = 5
batch_size = int(trainX.shape[0]/56.0)
alpha = 0.05
lmb = 0.00005
x_data = tf.placeholder(dtype = tf.float32, shape =[None, input_units])
y_data = tf.placeholder(dtype = tf.float32, shape =[None, output_units])
output = neuralNetwork(x_data, hidden_units, output_units, lmb)
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
cost = tf.reduce_mean(tf.squared_difference(output, y_data)) + tf.reduce_sum(reg_losses)
optimizer = tf.train.AdamOptimizer(learning_rate = alpha)
gvs = optimizer.compute_gradients(cost)
train_op = optimizer.apply_gradients(gvs)
pred_correct = tf.equal(tf.argmax(output, 1), tf.argmax(y_data, 1))
pred_accuracy = tf.reduce_mean(tf.cast(pred_correct, "float"))
I'm doing mini batch gradient descent; the size of the mini batches is set by the batch_size parameter. You could do stochastic or full-batch gradient descent instead, but I wouldn't subject my laptop to batch gradient descent. As it is, this already takes about one minute per epoch. In Part 3, we'll have a better method that not only outperforms bag of words, but trains at 12 to 16 seconds per epoch.
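For concreteness, with the 47768 training sentences mentioned earlier, the batch_size expression above works out to 56 mini batches of 853 sentences each:
batch_size = int(47768 / 56.0)       # 853 sentences per mini batch
total_batches = int(47768 / 853)     # 56 mini batches per epoch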
In tensorflow, it is deceptively easy to set a regularizer in our neural network, but the regularizer is not applied if you do not manually add the regularization loss to the cost function! That is what reg_losses is doing. As for the cost function, the one above is the MSE; a better cost function for multiclass classification may be the log loss or softmax cross-entropy, which I tried. I prefer the log loss:
-tf.reduce_sum(y_data*tf.log(output + 1e-9) + (1-y_data)*tf.log(1 - output + 1e-9))
For some reason the built-in tensorflow log loss didn't work for me, and it needs to be coded explicitly.
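For reference, here's a sketch of how that hand-coded log loss would slot into the cost defined above, keeping the regularization term:
# replace the MSE term with the hand-coded log loss, keeping the regularization term
log_loss = -tf.reduce_sum(y_data * tf.log(output + 1e-9)
                          + (1 - y_data) * tf.log(1 - output + 1e-9))
cost = log_loss + tf.reduce_sum(reg_losses)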
The optimizer step can be simplified by calling optimizer.minimize(cost). But remember the disappearing gradient and dead relu problems? The way to check whether you have them is to print the gradients. By having an intermediate step, we can print the computed gradients by printing gvs, and then apply them by running train_op.
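As a sketch of what that inspection can look like inside a session (sess, X, and y here are the same session and mini batch used in the training function below), the gradient magnitudes make dead relus easy to spot:
# run the gradient tensors for one batch and look at their average magnitude;
# values stuck at (or near) zero for the hidden layer suggest dead relus or
# vanishing gradients
grad_values = sess.run([g for g, v in gvs], feed_dict = {x_data : X, y_data : y})
for g_val, (g, v) in zip(grad_values, gvs):
    print(v.name, np.abs(g_val).mean())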
And finally, we can calculate our accuracy.
This is the actual training:
def train_fn(filename, epochs = epochs, print_gradient = False, restore = None):
    saver = tf.train.Saver()
    with tf.Session() as sess:
        print("Training start")
        if restore is not None:
            # resume from a previously saved checkpoint
            saver.restore(sess, restore)
            print("Restored.")
        else:
            sess.run(tf.global_variables_initializer())
        print("Neural network architecture:")
        print(f"{input_units} input units, {hidden_units} hidden units, {output_units} output units.")
        print("")
        print(f"learning rate: {alpha}")
        print(f"regularization: {lmb}")
        total_batches = int(trainX.shape[0]/batch_size)
        for e in range(epochs):
            avg_cost = 0.0
            # shuffling and making training data into mini batches in hopes of
            # avoiding converging on bad local minima.
            shuffledTrain = trainSet.shuffle(buffer_size = trainX.shape[0])
            minibatch = shuffledTrain.batch(batch_size)
            mbIter = minibatch.make_initializable_iterator()
            current_batch = mbIter.get_next()
            sess.run(mbIter.initializer)
            for i in range(total_batches):
                try:
                    X, y = sess.run(current_batch)
                    c, grad, _ = sess.run([cost, gvs, train_op], feed_dict = {x_data : X, y_data : y})
                    avg_cost += c/total_batches
                except tf.errors.OutOfRangeError:
                    print(f"minibatch {i} of {total_batches}")
                    break
            if e % report == 0:
                if print_gradient:
                    print(grad)
                print(f"{e} of {epochs} epochs. Cost = {avg_cost}")
        print(f"{e} of {epochs} epochs. Cost = {avg_cost}")
        # prediction
        print("Training ends")
        train_accur = sess.run(pred_accuracy, feed_dict = {x_data : trainX, y_data : trainY})
        valid_accur = sess.run(pred_accuracy, feed_dict = {x_data : validX, y_data : validY})
        print(f"Training accuracy: {train_accur}")
        print(f"Validation accuracy: {valid_accur}")
        saver.save(sess, filename)

print("Ready to train.")
The simple restore feature is how I get around the running-out-of-memory problem. If you run out of memory in tensorflow, it doesn't tell you anything; it just restarts your shell. In the terminal, you get an equally unhelpful but more mysterious response: the message Killed: 9 before the shell restarts. The restore feature allows training in small chunks of epochs (say 20), saving a checkpoint file, and quitting. The function can then be called again with this checkpoint file as the restore argument; the checkpoint is loaded and training resumes from epoch 21.
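Usage looks roughly like this (the checkpoint filename is just an example; in a fresh shell, the graph-building code above has to be re-run first):
# first chunk of 20 epochs, saved to a checkpoint file
train_fn("bow_model.ckpt", epochs = 20)

# later, possibly in a new shell: resume from that checkpoint for 20 more epochs
train_fn("bow_model.ckpt", epochs = 20, restore = "bow_model.ckpt")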
This pipeline makes it very easy for me to take a new dataset and run it through the same neural network. I find it important to set this up because running new tests becomes almost effortless.
Now that the mechanics are out of the way, let’s talk about feature selection.
Column Sums
First off, the easiest approach is to calculate the number of times each word is used and apply a cutoff. This made sense to me because, in the beginning, I wanted to minimize the number of words in the training set that won't occur in the validation or test set. The higher the cutoff, the fewer features we have. What cutoff to use? I didn't know, so I tried several:
# cut off at 100
train100 <- new_train[, train_columns[column_sums > 100]]
train100$author <- new_train$author
# cut off at 250
train250 <- new_train[, train_columns[column_sums > 250]]
train250$author <- new_train$author
# cutoff at 500
train500 <- new_train[, train_columns[column_sums > 500]]
train500$author <- new_train$author
# cutoff at 1000
train1000 <- new_train[, train_columns[column_sums > 1000]]
train1000$author <- new_train$author
fwrite(train100, file = "train100.csv", verbose = TRUE)
fwrite(train250, file = "train250.csv", verbose = TRUE)
fwrite(train500, file = "train500.csv", verbose = TRUE)
fwrite(train1000, file = "train1000.csv", verbose = TRUE)
None of these are particularly good:
train1000:
training accuracy: 60.90%
validation accuracy: 52.42%
train500:
training accuracy: 70.14%
validation accuracy: 54.45%
train250:
training accuracy: 82.83%
validation accuracy: 59.29%
train100:
training accuracy: 89.23%
validation accuracy: 64.93%
Fairly bad. But you can see that performance gets better as we add more features, which suggests that key features were being missed at the higher cutoffs. And maybe there are other metrics for evaluating features that would better reflect whether they are important or not.
By the way, I hit a memory problem at train100, so we can't keep increasing the number of features. While I do have the workaround in my train_fn(), the workaround won't help if we can't even run one epoch. It's also highly inefficient.
TF-IDF
The next method is TF-IDF. TF-IDF measures how important a word is for distinguishing a document from other documents.
Term frequency (TF), can be calculated this way:
(number of times term occurs in document)/(number of terms in document)
Inverse document frequency (IDF), can be calculated this way:
log( (total number of documents)/(number of documents with term) )
TF-IDF is then:
TF*IDF
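In this project each author's combined text acts as one "document", so there are only four documents. A quick worked example of what that means for IDF:
import numpy as np

# a term used by only one of the four authors:
print(np.log(4 / 1))   # ~1.39, high IDF
# a term used by all four authors (e.g. "the" or "."):
print(np.log(4 / 4))   # 0.0, so its TF-IDF is 0 no matter how often it is used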
Doing this in R:
jv_tf <- numeric(length(train_columns))
hgw_tf <- numeric(length(train_columns))
erb_tf <- numeric(length(train_columns))
pkd_tf <- numeric(length(train_columns))
for (i in 1:length(train_columns)){
  if (!(train_columns[i] %in% c("author", "rowsum"))){
    jv_tf[i] <- sum(train[train$author == "JV", i])/sum(train$rowsum[train$author == "JV"])
    hgw_tf[i] <- sum(train[train$author == "HGW", i])/sum(train$rowsum[train$author == "HGW"])
    erb_tf[i] <- sum(train[train$author == "ERB", i])/sum(train$rowsum[train$author == "ERB"])
    pkd_tf[i] <- sum(train[train$author == "PKD", i])/sum(train$rowsum[train$author == "PKD"])
  }
  if (i %% 500 == 0){
    print(i)
  }
}
save(jv_tf, hgw_tf, erb_tf, pkd_tf, file = "term_frequencies.RData")
idf <- numeric(length(train_columns))
for (i in 1:length(train_columns)){
  if (!(train_columns[i] %in% c("author", "rowsum"))){
    idf[i] <- log(4/sum(c(jv_tf[i] > 0, hgw_tf[i] > 0, erb_tf[i] > 0, pkd_tf[i] > 0)))
  }
  if (i %% 500 == 0){
    print(i)
  }
}
jv_tfidf <- jv_tf*idf
erb_tfidf <- erb_tf*idf
hgw_tfidf <- hgw_tf*idf
pkd_tfidf <- pkd_tf*idf
save(jv_tfidf, erb_tfidf, hgw_tfidf, pkd_tfidf, file = "tfidf.RData")
For TF-IDF, I selected the top-scoring terms of each author, with different cutoffs. Here's an example using the top 50 terms of each author:
jv_tfidf_order <- rev(order(jv_tfidf))
hgw_tfidf_order <- rev(order(hgw_tfidf))
erb_tfidf_order <- rev(order(erb_tfidf))
pkd_tfidf_order <- rev(order(pkd_tfidf))
# highest tfidf scoring terms of each author:
top50_columns <- unique(c(train_columns[jv_tfidf_order[1:50]],
train_columns[hgw_tfidf_order[1:50]],
train_columns[erb_tfidf_order[1:50]],
train_columns[pkd_tfidf_order[1:50]],
"author",
"rowsum"))
top50 <- new_train[,top50_columns]
write.csv(top50, "scifi_top50.csv")
I tested using the largest feature sets I selected:
top 250 TF-IDF terms of each author:
training accuracy: 65.23%
validation accuracy: 62.24%
top 200 TF-IDF terms of each author:
training accuracy: 63.82%
validation accuracy: 61.24%
There is certainly a lot less overfitting with this set of features, but the performance is quite low. Then it occurred to me that IDF becomes 0 if a term occurs in all documents. That means terms like "a", "the", ".", and so on would never be selected, as all four authors have used them. It doesn't matter whether a term is used once or many times; if every author uses it, its IDF is 0.
This may be a good metric for search engines to identify documents by search terms, but it’s not suitable for our purposes.
Standard deviation of term frequencies
Since we've gone through the trouble of calculating term frequencies, I thought I could calculate the standard deviation of term frequencies between authors and select the terms with the highest standard deviation. In other words, the terms that are used most differently by each author.
This is easily calculated:
standard_deviation <- numeric(length(train_columns))
for (i in 1:length(train_columns)){
  if (!(train_columns[i] %in% c("author", "rowsum"))){
    a <- c(jv_tf[i], pkd_tf[i], hgw_tf[i], erb_tf[i])
    standard_deviation[i] <- sd(a)
  }
  if (i %% 500 == 0){
    print(i)
  }
}
Now again, create some cutoffs; this is one example:
sd_order <- rev(order(standard_deviation))
sd_top2000_columns <- unique(c(train_columns[sd_order[1:2000]],
"author",
"rowsum"))
sd_top2000 <- train[, sd_top2000_columns]
write.csv(sd_top2000, "scifi_sd_top2000.csv")
Again, testing the sets with the most features:
top 1800 features with highest standard deviation:
training accuracy: 92.09%
validation accuracy: 71.80%
top 2000 features with highest standard deviation:
training accuracy: 94.64%
validation accuracy: 74.78%
This is the best-performing of the bunch, but it is still not good enough. It's hard to say what "good enough" is, because this is my own dataset and no one has worked on it before. But here's how I know it can do better: every time I increase the number of features, I see an increase in performance. I haven't hit a plateau yet, which means it's not performing at its best.
But here's the problem: I can't afford to keep increasing the number of features, since the gain per added feature isn't much, and I'm already forced to train in chunks of 20 epochs at a time.
A combination
The TF-IDF method is really quite logical. You can't fault the math; the words picked with that method must indeed be very good at identifying authors. However, these are also words that may not show up in every sentence, and we would permanently miss those sentences. The dataset with the top 200 TF-IDF terms of each author has 764 features; the one with the top 250 terms has 948 features. So I'm getting about a 2% increase in performance for an increase of almost 200 features. To get a respectable accuracy, say 80%, and assuming a linear relationship, we would need almost 2000 more features. More or less.
But since the features selected with the standard deviation method work so well, I thought I would combine the two. However, this combined set of features resulted in extreme overfitting: 99.4% training accuracy and 31.4% validation accuracy. If you've been keeping track, that validation accuracy is no better than the zero rule baseline. We would do almost as well by finding the most frequently occurring author and predicting that author every time.
But before we end, here are some other ideas I thought about but did not explore:
- tf*log(sd)
- bag of n-grams
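For reference, the bag of n-grams idea would be a one-argument change to the CountVectorizer call from earlier (a sketch, not something I tried here):
# count unigrams and bigrams instead of individual words only
vectorizer = CountVectorizer(tokenizer = Snowball(), ngram_range = (1, 2))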
There is a better way to go about this, which I will cover in the next post. Again, depending on the task and dataset, one method may not necessarily be superior to another. These things need to be empirically tested.