In the last month I've wrapped up the natural language processing (NLP) course, taken an intentional couple of weeks off from posting and one unintentional one due to the vagaries of pinkeye, and started my next course, which is titled Artificial Intelligence and Machine Learning.
If I had to define a topic for this work period, it would be "corpus building again." Corpus, in this context, means a collection of textual objects in plain text format that I would like to study. That perennial topic entails several perennial subtopics, including: existential bibliographic questions, remembering how to open all the flavors of text file in Python, going down OCR correction rabbit holes, being grateful for Project Gutenberg, and googling yet again to see whether someone has already shared a perfect corpus and I've just missed it until now.
And as a bonus dynamic, these subtopics present both methodological and technical questions, a rash of "what can I do" questions spawned like rain-wrapped tornadoes in the midst of the "what should I do" supercell.1
Given this range of insistent subtopics clamoring for attention as soon as you say the word "corpus," and the amount of time I've now spent in their midst, perhaps I should no longer be surprised that it feels like slow going on every front. For every useful discovery, it feels like there's a lot of wheel spinning. But if I don't spin the wheels, I don't get anywhere; I just think about all the places I could go.
It is a constant education in embracing the small steppy.
So, for the rest of this post, an account of the small steppies (sp?) of this month:
My final project for NLP was to train a classifier that could identify autobiographical prose from non-autobiographical prose. Step one, of course, is to have labeled examples of such prose.
Once upon a time in 2016, I scraped a bunch of autobiographical texts from PG using a now-deprecated Python library. I have those still, but I needed non-ab texts as well. I started with the assumption that I would need to assemble these by hand, and I could easily imagine doing so for a small training and test set--say 30 examples of each type of prose--just by copying and pasting from PG. I proposed that, and my instructor commented that a small dataset would be fine for the class, but did I know that all English language texts from PG had been assembled as a huggingface dataset? I spent a full day of work reading documentation for downloading datasets from huggingface (which is a platform for sharing machine learning data & models), and I got as far as seeing that all the metadata was in a single field formatted as a JSON object. I could not for the life of me, however, figure out how to access that field to filter based on any values, like having "autobiography" in the title.
However, in the course of doing the literature review for this project, I discovered the GutenTag tool, a web interface for creating corpora from PG with advanced filtering features, miraculously still functional 9 years after launch. A few minutes later, I had 100 fiction prose works and 87 nonfiction prose works with "autobiography" in the title, which I then eyeballed for works of fiction that I knew had been misclassified by the classifiers GutenTag uses to impute missing metadata.
I then proceeded to spend another day figuring out how to work with all of these in Python, including steps to create 500-word samples in case my computer ended up crashing when I tried to run full books.
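The sampling step itself was nothing fancier than this kind of thing (a sketch with made-up names, not the actual notebook code):

```python
# Split a full text into non-overlapping 500-word samples.
def make_samples(text, words_per_sample=500):
    words = text.split()
    return [
        " ".join(words[i:i + words_per_sample])
        for i in range(0, len(words), words_per_sample)
    ]

# Toy example: a 1,200-"word" book becomes samples of 500, 500, and 200 words.
book = "memoir " * 1200
print([len(s.split()) for s in make_samples(book)])  # [500, 500, 200]
```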
Results in a nutshell: my computer did not crash when running full books, and full books trained basic classifiers much better than samples did, getting up to 89% accuracy for a support vector machine model that kept stopwords but removed words used fewer than seven times across the corpus. I also fine-tuned a transformer model using Google Colab--running full books crashed it immediately, but on samples it got to 92% accuracy in one run. If someone were willing to fork over $ for GPU time, it seems like we could pretty quickly have all the autobiographical texts from PG in hand.
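For the curious, the shape of that best run was roughly this scikit-learn setup (a sketch: TF-IDF is standing in for whatever vectorizer I actually used, and it assumes you already have `texts` and `labels` lists in hand):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def train_autobiography_classifier(texts, labels):
    """texts: full-book strings; labels: e.g. 'auto' / 'non-auto'."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=42, stratify=labels
    )
    model = make_pipeline(
        # stop_words=None (the default) keeps stopwords; min_df=7 drops words
        # that appear in fewer than seven documents, approximating my
        # "used fewer than seven times across the corpus" filter
        TfidfVectorizer(min_df=7, stop_words=None),
        LinearSVC(),
    )
    model.fit(X_train, y_train)
    return model, accuracy_score(y_test, model.predict(X_test))
```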
After the project, I found myself still tantalized by All the English Language Books right there, on huggingface. I decided to look at my code notebook one more time to see if I could figure out how to parse that metadata field--because it included Library of Congress classifications, which are human-assigned. If I could filter for books with PS in the LoC, I would be a lot closer to having all the American literature on PG.
When I took one more look, the experience I'd built up working with huggingface in the later weeks of the course let me see pretty quickly what I had been doing wrong on that first fumbling day. It was less than one line of code; it was one optional parameter that made the difference between importing the dataset as a dataset, with all the parsing tools already built to use it, and importing it as a much less useful dataset dictionary, a format without the parsing conveniences I needed.
I added the parameter, left my future self a cautionary comment, and proceeded to have a way to get all of the American literature that was on PG as of roughly a year ago, any time I wanted it: 9,300-ish books.
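For anyone else staring at that dataset, the fix and the filter look roughly like this. The dataset id and field names below are placeholders, not the real ones; the part I can vouch for is the `split` parameter, which is what gets you a filterable Dataset instead of a DatasetDict.

```python
import json
from datasets import load_dataset

# Without split=..., load_dataset returns a DatasetDict keyed by split name;
# with it, you get a Dataset you can .filter() directly.
pg = load_dataset("someone/gutenberg-english", split="train")  # placeholder id

def is_american_lit(example):
    # The metadata arrives as one JSON-formatted string; parse it, then look
    # for the Library of Congress "PS" class (American literature).
    meta = json.loads(example["metadata"])   # field name is a placeholder
    return "PS" in meta.get("locc", "")      # key name is a placeholder

american_lit = pg.filter(is_american_lit)
print(len(american_lit))
```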
And then I got the reward you always get for solving a problem, which is the next problem.
Next problem: how do I winnow these down to twentieth-century texts only? It couldn't be as simple as using the metadata field marked "original publication date," could it? It could not. That date refers to the date the file was published on PG. I confirmed via their own documentation that they do not record original publication metadata. Their volunteer transcribers are not in the business of verifying publication dates and places for the specific volume they have in hand, let alone for the work as a whole. And it's one less field to have a typo in that could lead to a public-domain work being mistaken for an in-copyright work.
Yet I remembered that GutenTag had a publication date search feature. How did they do it? Machine learning! They've got a classifier on the back end that runs on the part of the transcription designated as front matter. So these are best guesses. Better than no guesses, so I tried downloading everything from 1900 through the public-domain cutoff with PS in the LoC. The tool seems to max out at 500 results, though, which seems low given that I know it's starting with 9,000-some. I went through again, decade by decade, and got a few more, but still fewer than 1,000. That could mean there really are fewer than that, or it could mean the guesser is conservative, or it could mean there's a whole lot of poetry and periodicals in the original set (I also filtered for prose, fiction and nonfiction).
Next problem: how can I further filter either of these sets for US modernists? I've typically thought of this as a bibliography problem. So I've gone through the indexes of a couple of modernist-lit survey books, and in terms of what's on PG, that gets me to 60-ish books, HEAVILY weighted toward Sherwood Anderson, Willa Cather, and someone I've never read named Ellen Glasgow. This is an interesting practice corpus, but I wouldn't make claims based on it. I'm going to need HathiTrust to build this out for sure.
But Andrew Goldstone's "Modernism Without Modernists" offered a different method: find the authors that have been the subject of articles in journals devoted to modernist literature, as cataloged in the MLA International Bibliography. He only shared the top ten list in the article, but he did share his code--maybe there's a way to get the rest of them using what he shared?
First look: he has a disclaimer that the data he shared is not full bibliographic entries, and that his searches can't be replicated because of how MLAIB has changed their search.
First response: well that's that, then.
But let's sigh and look one more time.
Second look: the data he shared is not full bibliographic entries, but all I need is the subjects--did he share that?
Second response: he did! But it's all the subjects together. How did he get the author names out? He did it in R. My brain is a Python brain now. I don't want to look at R.
Drink some coffee. Let's look one more time.
Next look: let's just open up the R. I don't think it's got what I want.
Read an article. Let's look one more time.
Next look: maybe what I'm looking for isn't in a standalone R script. Maybe it's in the full R markdown to replicate the paper. Oh wait, it is there! But it's calling on a function to determine whether a given subject is an author name that I don't see anywhere.
Pet the cat. Let's look one more time.
Next look: the function it's calling on is in a support library that it imports at the top, a library that Goldstone also wrote, and look, he linked to the repository for it, and he explained how to install it since it's not in the usual central repo. I can install it and see what happens.
And what happened is, with a few more looks, I could get the list. So now I have a list of authors who were the subject of articles in the Journal of Modern Literature from 1970-1990, potentially the canonical era if there ever was one. After a comparable process of looking at Python again and again, I now have a way to run the last names from this list against that big long PS list from the PG dataset.
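The matching itself is not much more than a last-name lookup; here's the gist, with a couple of made-up rows standing in for both lists (PG author fields are usually "Last, First"):

```python
# Stand-in for the JML subject-author list (not the real list).
jml_authors = ["Cather, Willa", "Anderson, Sherwood", "Glasgow, Ellen"]
last_names = {name.split(",")[0].strip().lower() for name in jml_authors}

# Stand-in for the big PS list pulled from the PG dataset.
pg_records = [
    {"title": "My Antonia", "author": "Cather, Willa"},
    {"title": "Moby Dick", "author": "Melville, Herman"},
]

matches = [
    rec for rec in pg_records
    if rec["author"].split(",")[0].strip().lower() in last_names
]
print([rec["title"] for rec in matches])  # ['My Antonia']
```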
More on what that last step looks like, and how it may or may not be the answer to my "who is a modernist" problem, in future posts.
Despite the fact that I'm still far from having settled either the technical or the methodological questions behind what it means to have a "modernist corpus," I'm really happy with my emerging ability to "just look at it one more time" and get one inch farther.
This sounds kind of dramatic. In my head, it is. I know that none of these questions are as consequential as the average brain surgery, or even the average plumbing repair. And yet.