The List
I've got a little one. And a bigger one.
So far in these posts, I've mostly talked about the technical side of corpus building, or the process of gathering plain text files for all the textual objects you want to study. Today I'm going to dig into the more methodological side of corpus building, which is: what group of textual objects do I want to study? I say "more methodological" because inevitably, the technical comes back into the decision process, in the form of "what texts have actually already been turned into plain text files and shared on a platform where I can access them, and how good is their OCR?" But to even get to these questions, you have to decide what you are ideally looking for.
My broad goal is to use computational textual analysis to better understand the discursive contexts of early-twentieth-century literature, as one way of evaluating claims about the exceptional novelty of a group of writers typically thought of as modernist. Then, using measures of what we think of as great about these writers, I want to see whether additional texts that have similar features but haven't yet been studied can be identified. The study of modernism has long been dominated by increasingly detailed attention to a very small group of authors whose value for study has simply been a given, and the teaching of modernism more or less still is. No doubt, these writers are pretty great, but there was a lot of other writing going on, and in the age of HathiTrust, we can get to a lot more of it quite quickly. But we can get to so much of it that a whole new set of questions has been raised about how to decide what to spend our time looking at. My project is one way of looking at those questions: are canonically modernist works truly exceptional at the level of the text? If not, what else might lie near them? If they are, how do we characterize that exceptionality at scale? Are there clusters of features that typify modernist themes, and could we use these to identify new objects of study?
I think these are questions where the interests of modernist studies scholars and the affordances of a computational view might intersect, although I will be the first to tell you that the computational view is neither an absolutely necessary nor an epistemologically superior way to answer them. Scholars have long since demonstrated that a lot of social capital went into the making of modernists as a coterie of (mostly male and nearly all white) geniuses. They have also, over the last thirty years or so, made a concerted effort to shift the study of modernists to modernist studies, making persuasive arguments that lots of kinds of writers were addressing the technological change, economic globalization, shifting forms of imperialism, racial inequity, gender inequity, and media explosion of the early twentieth century. We have nearly every kind of modernism now! I have even made my own kind!
I just want to make it very clear that my interest in the computational view is not a "well actually" endeavor. I might find the same things or I might find different things, but they will no more be the "right" or "complete" things than any other attempt. My hope is that they will be useful and maybe even provocative things, and along the way I will learn a lot more about the practicalities and epistemologies of computational methods.
Actually doing this will depend on making an actual list of who I want to consider a canonical modernist. The easiest way to make a list is always to find someone else's list, and there's no way I know all the modernists in my head anyway. So here are some of the lists I can choose from:
Authors who called themselves modernists
Implicit argument: the only way to know for sure if someone was a modernist is if they self-identified as one.
Downsides: I can't remember right now if any of them actually did, and if they didn't, what exact self-identified movement designations would I want to include, and anyway this list would be extremely short.
Authors assigned on syllabi with the word "modernism" in Open Syllabus Project data, free version
Implicit argument: the most frequently assigned writers represent The Canon
Upsides: that's a pretty good argument and a pretty good data source to back it up.
Downsides: I'd have to collect the data manually (i.e., copy and paste), and the interface isn't ideal for that.
Discussion of content: the usual suspects fully represented. But how far down the list should I go? What is the cut off rank between canonical and non-canonical?
Wikipedia's list of modernist writers or a list of selected modernists put together by a scholar teaching a class on the Rhetoric of Art in Consumer Culture circa 2006
Implicit argument: there really is a list most of us keep in our heads, and if a few of us put our heads together we'll get a good enough version of that list
Upsides: these are easy to copy and paste.
Downsides: the Wikipedia list is a bit broad and will still need cleaning.
Discussion of content: after taking out the non-Anglophone writers, these are short lists! 41 for Wikipedia and 28 for the course page, including poets and playwrights. The canon really is teeny!
Andrew Goldstone's list of authors who were subjects of articles published in the Journal of Modern Literature from its founding in 1970 to 1990, the start of the shift to modernist studies.
Implicit argument: if an author has been studied as modernist by a scholar, they are a modernist.
Upsides: this rationale makes sense, and it expands the data set in a way that gets past the list in our heads but still reflects scholarly discourse. This seems likely to produce the longest plausible list.
Downsides: see below.
Discussion of content: In his article “Modernist Studies Without Modernism,” Goldstone uses this list to demonstrate that the "top 10," aka The Most Canonical studied authors, barely shifts from the discipline's inception to now, after three decades of expanding the canon. So he doesn't share the whole list in the article, but it is gettable using the data and code he has shared. Upon review, the entire list is 245 authors, but these include a lot of very definitely not modernists, like Dante, and not writers, like Picasso and Einstein. Which is a good reminder that you can filter text based on whether or not it contains parentheses indicating birth and death dates, which is a good clue that the subject text is a name, but you can't filter based on whether or not someone is a writer unless you have...a list. After hand-cleaning out the names I knew to be non-Anglophone writers or not writers at all, the list is down to 167, but assuredly there are more to clean based on additional research.
And, most interestingly, there are a bunch of writers on this list whose inclusion or exclusion will be a judgment call. Stephen Crane is a good example: having died in 1900 at 28, he is technically a nineteenth-century author, but he was also clearly pushing formal boundaries in a way that might very well be seen as early modernism.
I'm guessing the ultimate list may be closer to 100 writers.
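For the curious, here is a minimal Python sketch of the parentheses filter described above. It is not the code Goldstone shared, and the subject strings are hypothetical examples whose format I am assuming; the point is only that a regular expression can separate names from non-names, while the writer/not-writer decision still has to come from a hand-made list.

```python
import re

# Matches a parenthetical life span such as "(1882-1941)", with either a
# hyphen or an en dash. The second year is optional, so "(1930-)" also matches.
LIFE_DATES = re.compile(r"\(\s*\d{4}\s*[-\u2013]\s*(\d{4})?\s*\)")


def looks_like_person(subject: str) -> bool:
    """Return True if the subject text contains birth/death dates in parentheses."""
    return LIFE_DATES.search(subject) is not None


# Hypothetical subject headings; the real data may be formatted differently.
subjects = [
    "Woolf, Virginia (1882-1941)",
    "Crane, Stephen (1871-1900)",
    "Picasso, Pablo (1881-1973)",
    "Dante Alighieri (1265-1321)",
    "Stream of consciousness",
]

# Step 1: the regex keeps anything that looks like a name with dates.
people = [s for s in subjects if looks_like_person(s)]

# Step 2: the regex cannot tell writers from painters, or modernists from
# medieval poets, so those exclusions come from a list made by hand.
excluded_by_hand = {
    "Picasso, Pablo (1881-1973)",
    "Dante Alighieri (1265-1321)",
}
candidates = [s for s in people if s not in excluded_by_hand]
```

Keeping the hand-made exclusions in their own named set also gives me one place to record the judgment calls, like what to do with Stephen Crane.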
What am I learning from all these lists?
For the purposes of modeling canonicity, I need to expand my focus from US modernists to Anglophone modernists if I want more than a very short list. There's nothing wrong with a short list, though, if it leads to a corpus big enough for corpus methods, so we'll see.
I am probably going to want at least two modernist lists: the Very Canonical List and the Somewhat Canonical List. If I can assemble one, the other won't be too hard, and neither of them will be too big. The argument for the first one is that it will be most representative of self-conscious experimentation, which has been one of the loose criteria of canonicity. The argument for the second is that formal experimentation took place outside of this core group, and these would still have been works thought of as modernist by some early practitioners.
In the assembly of the second, longer list, I'm going to need to do some additional research and this will entail additional judgment calls, and I won't be sure I made the "right" ones. So I'll need to document them and reflect on them, and there will always be room for another version.
The lists, of course, are only aspirational. What's available as plain text is always the next question.


