Authorship Prediction in Sci-Fi Literature Part I: Data Collection, Cleaning, and Assembly in R
This is part 1 of a series of posts on my new project: identifying the authors of sentences from science fiction books. In this part I’ll be writing about how I went about acquiring the main requirement of a machine learning task: data. The finished dataset can be accessed here.
Acquiring Data
In the best case scenario, I would like to finish this task with data looking something like this:
author sentence
JV some sentence.
IA some other sentence.
HGW another sentence.
AH every sentence.
Where JV = Jules Verne, IA = Isaac Asimov, HGW = H. G. Wells, and AH = Aldous Huxley.
Surprisingly, however, most of Asimov’s and Huxley’s writings are not yet in the public domain (it takes quite a bit of work to get this data, so it would be nice if everyone could use it). So I replaced them with authors I am planning to read in the future: Edgar Rice Burroughs and Philip K. Dick.
Project Gutenberg is the perfect source of data for this project for the following reasons:
- Everything here is free and in the public domain.
- It’s very nice to get everything from one source. Merging data from different sources can be done but is not trivial.
- They host books in multiple formats, including the convenient .txt format I am after.
Here are the files I downloaded and the number of sentences I was able to harvest from them:
file book sentences
1 ./books/erb1.txt The Project Gutenberg EBook of A Princess of Mars, by Edgar Rice Burroughs 1966
2 ./books/erb2.txt Project Gutenberg's The Land That Time Forgot, by Edgar Rice Burroughs 1510
3 ./books/erb3.txt The Project Gutenberg EBook of The Lost Continent, by Edgar Rice Burroughs 1593
4 ./books/erb4.txt The Project Gutenberg EBook of Pellucidar, by Edgar Rice Burroughs 2460
5 ./books/erb5.txt The Project Gutenberg EBook of The Gods of Mars, by Edgar Rice Burroughs 3413
6 ./books/erb6.txt The Project Gutenberg EBook of Warlord of Mars, by Edgar Rice Burroughs 1768
7 ./books/erb7.txt The Project Gutenberg EBook of Thuvia, Maid of Mars, by Edgar Rice Burroughs 2412
8 ./books/erb8.txt The Project Gutenberg EBook of The Chessmen of Mars, by Edgar Rice Burroughs 3653
9 ./books/erb9.txt Project Gutenberg's The People that Time Forgot, by Edgar Rice Burroughs 1267
10 ./books/hgw1.txt 1691
11 ./books/hgw2.txt The Project Gutenberg EBook of The War of the Worlds, by H. G. Wells 2702
12 ./books/hgw3.txt The Project Gutenberg EBook of The Island of Doctor Moreau, by H. G. Wells 1729
13 ./books/hgw4.txt The Project Gutenberg EBook of The Invisible Man, by H. G. Wells 3160
14 ./books/hgw5.txt The Project Gutenberg EBook of The Red Room, by H. G. Wells 184
15 ./books/hgw6.txt Project Gutenberg"s When the Sleeper Wakes, by Herbert George Wells 5160
16 ./books/hgw7.txt The Project Gutenberg EBook of The World Set Free, by Herbert George Wells 2585
17 ./books/hgw8.txt The Project Gutenberg EBook of The First Men In The Moon, by H. G. Wells 3536
18 ./books/jv1.txt The Project Gutenberg EBook of Around the World in 80 Days, by Jules Verne 3279
19 ./books/jv2.txt Project Gutenberg's A Journey to the Centre of the Earth, by Jules Verne 4412
20 ./books/jv3.txt The Project Gutenberg Etext of 20000 Leagues Under the Seas by Jules 6530
21 ./books/jv4.txt The Project Gutenberg EBook of All Around the Moon, by Jules Verne 4319
22 ./books/pkd1.txt The Project Gutenberg EBook of Second Variety, by Philip Kindred Dick 1863
23 ./books/pkd2.txt The Project Gutenberg EBook of The Hanging Stranger, by Philip K. Dick 655
24 ./books/pkd3.txt The Project Gutenberg EBook of The Variable Man, by Philip K. Dick 2503
25 ./books/pkd4.txt The Project Gutenberg EBook of Mr. Spaceship, by Philip K. Dick 1069
The one with the missing title (row 10) is H. G. Wells’s The Time Machine. This metadata was generated by extracting the book title from a specific line in the text file, and that line wasn’t found in The Time Machine. The reality of data collection is that data is dirty, something we’ll encounter again and again in this project.
Removing smart quotes
Smart quotes are the bane of coding. If you don’t know what they are, they are these: “ ” or ‘ ’. They break your program and it will take you forever to figure out why. One wonders what the world would be like if they had never been created. If this is the first time you’re hearing of smart quotes, you can set your text editor and your Mac to turn them off in System Preferences.
Our problem here is that these smart quotes may appear in some text files and not others. This is a problem because our model will see these as different and separate features from "", when they should not be. Authors don’t decide whether or not to use smart quotes in their Gutenberg files; this is a result of idiosyncrasies in formatting! Why did I have the foresight to catch this bug? I didn’t. I went ahead without considering them, and they showed up after I had completed everything. So I went back and dealt with it.
I found this excellent blog that deals with this problem very simply with Unix. I highly recommend reading and employing this simple solution before progressing further even if you are not sure whether your files have smart quotes.
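If you would rather stay in R, the same idea can be sketched as below. This is an illustrative alternative, not the Unix approach from the blog mentioned above; normalizeQuotes is a hypothetical helper name, and the sketch assumes the lines were read as UTF-8.

```r
# Hypothetical helper (not from the original workflow): replace curly quotes
# with their plain ASCII equivalents in a character vector of lines.
normalizeQuotes <- function(lines) {
  curly <- c("\u201C", "\u201D", "\u2018", "\u2019")  # left/right double, left/right single
  plain <- c("\"", "\"", "'", "'")
  for (i in seq_along(curly)) {
    # fixed = TRUE treats the pattern as a literal string, not a regex
    lines <- gsub(curly[i], plain[i], lines, fixed = TRUE)
  }
  lines
}

demoLines <- c("\u201CWhy, Dolores?\u201D Gross asked.",
               "I wonder if he\u2019s still alive.")
cleaned <- normalizeQuotes(demoLines)
print(cleaned)
```

In practice this would be applied right after reading a file, for example to the result of readLines(path, encoding = "UTF-8").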
Getting the main text of the book
From this section on we will be operating in R. Why R? Because I foresee quite a bit of trial and error. RStudio allows viewing of data without explicitly using the print command. This is going to be a feature I will hate to lose if I progress too far in Python (I currently work in IDLE due to some technical difficulties/complications). RStudio is not the only reason to use R; after all, it also allows coding in Python. The other reason is that I used to do the majority of my bioinformatics work in R, which included a lot of data cleaning and string manipulation. In my mind, it works this way: R is easier when I’m not sure what I’m doing (exploratory work), and Python is easier when I have a concrete plan I want to execute (high utility functions, reproducibility, machine learning work). So it’s a matter of convenience and habit.
Books have a ton of information before getting to the actual story. From Gutenberg files, these are several things that may exist in a text file other than the main text of the book:
- Gutenberg opening metadata (information about the file, the date of last update, etc.)
- Index
- Foreword (may be part of the story, or not)
- Introduction (may be part of the story, or not)
- Translator’s note
- Chapter headings
- Gutenberg closing information (guidelines/policies regarding distribution, etc.)
- and more …
Before we can even get sentences, we need to know that each sentence is part of the story and written by the author. The elements above may all be present, or they may all be absent. We need two things: where the story begins and where the story ends.
Yes, there is the obvious solution of opening every single file and manually removing the undesired components. This is most probably the cleanest solution. However, while I currently have 14 files and doing this by hand would be extremely dull but doable, I’m hoping to find a way to do it programmatically.
Mercifully, after looking at several of my files, it is fairly easy to find where the story ends:
feet is a huge and hideous creature with a heart of gold.
I believe that they are waiting there for me, and something tells me
that I shall soon know.
End of Project Gutenberg's A Princess of Mars, by Edgar Rice Burroughs
*** END OF THIS PROJECT GUTENBERG EBOOK A PRINCESS OF MARS ***
***** This file should be named 62.txt or 62.zip *****
This and all associated files of various formats will be found in:
http://www.gutenberg.org/6/62/
Updated editions will replace the previous one--the old editions
will be renamed.
Yup, that End of Project Gutenberg's ... line.
So we can say:
End <- grep("^End of ", book) # note that grep uses regex
End <- End[length(End)]
Why not grep a longer string? Because that failed. This line is not constant across the files, and some files have variations, including some that just say End of <book title>. So to remedy this problem, we grep a very short string and use the last match found by grep to ensure we’re selecting the line we want.
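To see why the last match matters, here is a small sketch on an invented vector of lines (the toy text is made up for illustration and is not taken from any of the books):

```r
# Toy stand-in for the lines of a Gutenberg file (invented for illustration).
book <- c(
  "Some story text.",
  "End of the first part.",   # an in-story line that also starts with "End of "
  "More story text.",
  "End of Project Gutenberg's Example Book, by Some Author",
  "*** END OF THIS PROJECT GUTENBERG EBOOK ***"
)

matches <- grep("^End of ", book)  # indices of every line starting with "End of "
End <- tail(matches, 1)            # equivalent to matches[length(matches)]
print(matches)
print(End)
```

Note that grep is case-sensitive by default, so the upper-case *** END OF ... line is not picked up. If a file has no matching line at all, matches is empty and End ends up as integer(0), which is worth checking for before indexing.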
Now for a less elegant solution for finding the beginning of the main text. After pondering for a while, and trying a few different things (including trying to detect if a file has an introduction or not, only to realize some introductions are part of the story), a simple solution occurred to me. If we separate the book into paragraphs (lines of text separated by empty lines) and remove the first 20% of paragraphs, we would probably get rid of all of the unwanted content at the beginning. Yes, we would lose data, but we wouldn’t gain bad data.
After playing with the percentages a little bit, here we are:
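getMainText <- function(book){
  # book is a character vector of lines from a book (as returned by readLines)
  # getMainText cuts off the first 15% of paragraphs
  # and the Project Gutenberg policies at the end
  # of the file.
  whitespaces <- grep("^$", book)   # indices of empty lines (paragraph breaks)
  numOfParagraphs <- length(whitespaces)
  Beginning <- as.integer(numOfParagraphs*0.15)
  End <- grep("^End of ", book)
  End <- End[length(End)]           # the last match is the closing marker
  # ':' binds tighter than '-' in R, so the range is parenthesized explicitly:
  # start at the empty line that closes the skipped part and stop just
  # before the "End of" line.
  return(book[whitespaces[Beginning]:(End-1)])
}
text <- readLines("pkd4.txt")
mainText <- getMainText(text)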
Result:
> text[1:10]
[1] "The Project Gutenberg EBook of Mr. Spaceship, by Philip K. Dick"
[2] ""
[3] "This eBook is for the use of anyone anywhere at no cost and with"
[4] "almost no restrictions whatsoever. You may copy it, give it away or"
[5] "re-use it under the terms of the Project Gutenberg License included"
[6] "with this eBook or online at www.gutenberg.net"
[7] ""
[8] ""
[9] "Title: Mr. Spaceship"
[10] ""
> mainText[1:10]
[1] ""
[2] "She turned toward him. \"I have an idea. Do you remember that professor"
[3] "we had in college. Michael Thomas?\""
[4] ""
[5] "Kramer nodded."
[6] ""
[7] "\"I wonder if he's still alive.\" Dolores frowned. \"If he is he must be"
[8] "awfully old.\""
[9] ""
[10] "\"Why, Dolores?\" Gross asked."
So far so good. You can check what the first line is here to make sure you’ve picked a high enough percentage. Yes, it is slightly cheating since we’re making decisions on the fly rather than a programmed algorithm, but data work often requires active inspection and decision making.
Getting sentences aka REGEX REGEX REGEX
I was really hoping to use a package for this. However, the package I had planned to use doesn’t have a version for my R.
This task was, at the beginning, easier than I thought, and then later, harder than I thought. And the solution I arrived at hinged on one odd piece of knowledge: That older documents have two white spaces at the end of sentences instead of one.
We also have the good fortune of Gutenberg not breaking up words at the end of a line with a hyphen.
Essentially, what we need to do is this:
- concatenate all the elements in mainText with a single white space between lines
- split the resulting string by two white spaces
This actually worked beautifully with most of the files, but the newer ones only have a single white space at the end of sentences, and cannot be handled the same way. So I replaced all occurrences of a period, question mark, or exclamation mark followed by one white space (". ", "? ", and "! ") with the same mark followed by two white spaces (".  ", "?  ", and "!  ") respectively. We use regex to specify that we only want to replace the ones with exactly one white space, not zero white space or more than one white space. We then throw out anything that does not contain at least a period, question mark, or exclamation mark. This hopefully gets rid of chapter headings, or other non-sentence fragments.
getSentences <- function(mainText){
  # mainText is a character vector of lines
  # returned from the function
  # getMainText
  print(length(mainText))
  # pad ., ! or ? followed by exactly one white space to two white spaces;
  # the lookahead (?!\\s) skips sentence ends that already have two
  mainText <- gsub("([.!?])\\s(?!\\s)", "\\1  ", mainText, perl = TRUE)
  # join the lines with a single white space, then split on two white spaces
  sentences <- unlist(strsplit(paste(mainText, collapse = " "), "  ", fixed = TRUE))
  # keep only strings that contain .!?
  sentences <- sentences[grepl("[.!?]", sentences)]
  return(sentences)
}
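To make the three steps (pad, join, split) concrete, here is a sketch that walks through them on a few invented lines. The toy text is made up for illustration, and the lookahead regex is just one way of expressing "exactly one white space".

```r
# Invented lines: a chapter heading, a paragraph break, and a paragraph that
# mixes single-spaced and double-spaced sentence ends.
lines <- c(
  "CHAPTER I",
  "",
  "He stopped. \"Who is there?\" he",
  "asked.  Nobody answered!"
)

# 1. Pad punctuation followed by exactly one white space to two white spaces.
padded <- gsub("([.!?])\\s(?!\\s)", "\\1  ", lines, perl = TRUE)

# 2. Join all lines with a single white space. An empty line therefore turns
#    into two consecutive white spaces, which also acts as a split point.
joined <- paste(padded, collapse = " ")

# 3. Split on two white spaces and keep only pieces containing ., ! or ?.
pieces <- unlist(strsplit(joined, "  ", fixed = TRUE))
sentences <- pieces[grepl("[.!?]", pieces)]
print(sentences)
```

The heading CHAPTER I survives the split as its own piece but is dropped by the final filter because it contains no sentence-ending punctuation.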
Putting it all together
We now have everything we need to apply the functions above to all our books. The previous sections were hammered out using one book, so looping through all the files will surface some new problems (especially if you’re following along to make a similar dataset with different books). But that’s ok, just go back and fix them (regex is your friend). This happens because we are not cleaning web data, or something very structured. We are cleaning text files that have been generated at different times by different people and contain different information at different locations. It is difficult to predict the types, locations, and degree of variation we will encounter. I have gone back and incorporated fixes for the problems that surfaced when I first looped through into the functions above.
allBooks <- list.files(path = "./books", pattern = "\\.txt$", full.names = TRUE)
firstSentences <- character(length(allBooks))
finalData <- data.frame(author = character(0), sentence = character(0))
metaData <- data.frame(file = character(0), book = character(0), sentences = numeric(0))
authors <- c("jv", "erb", "hgw", "pkd")
for (i in seq_along(allBooks)){
  # the author is encoded in the file name, e.g. ./books/erb1.txt -> ERB
  for (a in authors){
    if (grepl(a, basename(allBooks[i]))){
      author <- toupper(a)
      break
    }
  }
  text <- readLines(allBooks[i])
  mainText <- getMainText(text)
  file <- allBooks[i]
  book <- text[1]                     # the first line usually holds the title
  sentence <- getSentences(mainText)
  firstSentences[i] <- sentence[1]    # kept for a manual sanity check later
  sentences <- length(sentence)
  mdf <- data.frame(file, book, sentences)
  metaData <- rbind(metaData, mdf)
  # makeIntoTable() is a helper not listed in this post; it presumably
  # pairs every sentence with its author in a data frame
  df <- makeIntoTable(sentence, author)
  finalData <- rbind(finalData, df)
}
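The loop calls makeIntoTable(), which is not listed in this post. A minimal stand-in, assuming it only needs to produce the author and sentence columns that finalData expects, might look like this (a hypothetical reconstruction, not the original helper):

```r
# Hypothetical reconstruction: builds one row per sentence, with the same
# author label repeated on every row.
makeIntoTable <- function(sentence, author) {
  data.frame(
    author = rep(author, length(sentence)),
    sentence = sentence
  )
}

df <- makeIntoTable(c("He did not know what to think.", "I stood stupefied."), "JV")
print(df)
```

Whether the two columns come out as factors or as plain character vectors depends on the R version's stringsAsFactors default.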
Print firstSentences and take a look. If they all pertain to the stories, you’re good; otherwise, increase the percentage in the getMainText() function. In hindsight, I should have made the percentage an argument you can set when calling the function.
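A sketch of what that could look like is below. The name getMainTextFrac and the argument skipFraction are hypothetical, and the two guards are additions that are not part of the function used in this post.

```r
# Hypothetical variant of getMainText() with the skipped share of paragraphs
# exposed as an argument (default 15%).
getMainTextFrac <- function(book, skipFraction = 0.15) {
  blankLines <- grep("^$", book)      # empty lines mark paragraph breaks
  endMarkers <- grep("^End of ", book)
  if (length(blankLines) == 0) stop("no empty lines found")
  if (length(endMarkers) == 0) stop("no 'End of ' line found")
  # Index of the paragraph break to start from; at least the first one.
  first <- max(1, as.integer(length(blankLines) * skipFraction))
  last <- endMarkers[length(endMarkers)] - 1
  book[blankLines[first]:last]
}

# Invented toy file for illustration.
toyBook <- c("Header", "", "Title: X", "", "Story line one.", "",
             "Story line two.", "End of Project Gutenberg's X", "License text")
trimmed <- getMainTextFrac(toyBook, skipFraction = 0.7)
print(trimmed)
```

With the fraction as an argument, the loop can retry a single book with a larger value when its first sentence turns out to be front matter, without editing the function itself.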
Here are twenty random sentences from the final dataset:
> random20$author
[1] JV JV ERB ERB JV HGW JV JV HGW JV JV ERB ERB HGW ERB JV ERB JV HGW ERB
Levels: ERB HGW JV PKD
> random20$sentence
[1] What you can see, what anybody can see on a clear night when the Moon is full--only our friends had all the advantages of a closer view.
[2] He did not know what to think.
[3] "What happened, Thirty-six?" I asked him.
[4] There was the same short, stocky trunk upon which rested an enormous head habitually bent forward into the same curvature as the back, the arms shorter than the legs, and the lower leg considerably shorter than that of modern man, the knees bent forward and never straightened.
[5] Its fins set vertically, its propeller thrown in gear at the captain's signal, the Nautilus rose with lightning speed, shooting upward like an air balloon into the sky.
[6] "You."
[7] The pilot swore an angry oath; the reward of two hundred pounds was evidently on the point of escaping him.
[8] The solar rays easily crossed this aqueous mass and dispersed its dark colors. I could easily distinguish objects 100 meters away.
[9] The third and fourth stood beside him in the water, one perhaps two hundred yards from me, the other towards Laleham.
[10] Some pipes full of opium lay upon the table.
[11] Now, the understanding was, that he was to take us to the village of Stapi, situated on the southern slope of the peninsula of Sneffels, at the very foot of the volcano.
[12] It would be futile to attempt to describe them to Earth men, since substance is the only thing which they possess in common with any creature of the past or present with which you are familiar--even their venom is of an unearthly virulence that, by comparison, would make the cobra de capello seem quite as harmless as an angleworm.
[13] I could scarce restrain a smile at Perry's use of the pronoun "we," yet I was glad to share the rejoicing with him as I shall always be glad to share everything with the dear old fellow.
[14] Up he drove and up, to that pulsating rhythm, until the country beneath was blue and indistinct, and London spread like a little map traced in light, like the mere model of a city near the brim of the horizon.
[15] With such as these I WOULD conquer one!
[16] His silence, of course, did not last long.
[17] Again I climbed to the ship's rail.
[18] I stood stupefied.
[19] Somehow, his manner made me feel ashamed of myself.
[20] They ranged in height from three to four feet, and were moving restlessly about the enclosure as though searching for food.
64263 Levels: "Another enemy to harass me in my misery?" ...
You can see that some of these are more than one sentence (for example, [8]). But it is hard to tell what bug caused it to be that way. And since we harvested about 68k sentences, it’s perhaps harder to identify the ones that need to be further split into proper sentences.
However, since everything is processed the same way, similar structures will be split similarly. Thus, we maintain consistency across the data.
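One plausible cause, offered as a guess that has not been checked against the actual files: when a sentence ends exactly at the end of a line, there is no white space after the period to pad, and joining the lines afterwards only puts a single white space there. The sketch below reproduces that on two invented lines and shows that padding after joining would split them.

```r
# Invented lines: the first sentence ends exactly at a line break.
lines <- c("The first sentence ends the line.",
           "The second starts the next line.")
padRegex <- "([.!?])\\s(?!\\s)"

# Pad each line first, then join: the period at the end of line 1 has no
# white space after it yet, so nothing is padded and nothing is split.
paddedFirst <- gsub(padRegex, "\\1  ", lines, perl = TRUE)
unsplit <- unlist(strsplit(paste(paddedFirst, collapse = " "), "  ", fixed = TRUE))

# Join first, then pad: the line break has become a single white space,
# which the regex can now see and widen.
joined <- paste(lines, collapse = " ")
splitAfter <- unlist(strsplit(gsub(padRegex, "\\1  ", joined, perl = TRUE),
                              "  ", fixed = TRUE))

print(length(unsplit))
print(splitAfter)
```

Changing the order would alter the sentence counts reported earlier, so treat this as a direction to investigate rather than a drop-in fix.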
That’s all for this post. Again, you can find the dataset generated here: click.