This is a version of the talk I’m giving today at the International Autobiography Association European regional conference.
In the 2022 documentary "Three Minutes: A Lengthening," director Bianca Stigter and writer Glenn Kurtz narrate the painstaking search to assemble the most basic identifying information for the places and people represented in three minutes of film taken on ---, 1938 by Kurtz's grandfather. Kurtz discovered the congealed, vinegarized film stock in his grandfather's closet in 2009 and sent it to the Holocaust Museum for assistance in preservation and digitization. The resulting three minutes of film are a rare look into a Jewish community in Poland before the Holocaust and World War II. The footage is an example of what Marlene Kadar theorized as the "autobiographical trace." It was never intended as a formal autobiographical statement, yet it represents a moment in the life of the person who filmed it and a moment in the lives of those he filmed. Paying attention to such traces, fragmentary and residual, is often the only way to encounter the lives of those caught in a historical catastrophe that has destroyed all else, both through literal destruction and through the destruction of the opportunity to create one's own text.
But while it may be immediately obvious what we are looking at when we watch this video, it is utterly opaque who we are looking at, and even where. The film on its own captures a lost moment, but without Kurtz and Stigter's efforts, that moment could only remain a symbol, a collection of nameless people, reduced to our knowledge of their later victimization. Kurtz, who wrote a book about his process of researching the film, and Stigter, who continued the research as she worked with Kurtz to compose this documentary, spend years deciphering these three minutes. The documentary itself runs an hour and nine minutes without incorporating external footage or images. It is a case study in how much skilled and persistent research can do even with very little obvious information. Through their research, Kurtz and Stigter determine the name of the town, the history of its briefly pictured synagogue, the economic context for its bright fabrics and buttons--and the names of eleven of the one hundred and fifty people whose faces appear on camera, one hundred and fifty of the three thousand known to have lived in this village. Two of these eleven were identified as among the roughly one hundred survivors of the Holocaust.
This film has become a touchpoint for me as I enter a year of learning as much as I can about the math and statistics behind computational textual analysis, with the goal of developing a computational literary studies project in the area of life writing. I watched it during my winter break, just as I began planning my year of research leave in earnest. I am excited to work with data hands-on after having worked with it conceptually for so long, but from what I have already experienced of computational textual analysis, part of me wonders whether that set of methods can ever live up to the ethical commitments of life writing scholarship. As I prepare to spend time learning how to model texts as bags of words and to identify stylistic or affective outliers by algorithm--to knowingly reduce the texts I'm working with to linguistic features, to condense the experience of reading dozens of books to modeling dozens of text files--the makers of this film dwell, dig, imagine, and, as the title conveys, lengthen their engagement with the sparse data before them. I am haunted by what I felt to be the climax of the film, in which the filmmaker creates a collage of the individual faces captured in the footage, slowly and deliberately compiling each individual data point, emphasizing how much has been lost yet somehow recovering their individuality. There is no way to do what Kurtz and Stigter do algorithmically, and without a massive influx of photographic evidence and metadata about the people pictured, there never will be. In short, for all my enthusiasm, I find that the thought of bringing computational methods to life narrative has provoked some level of epistemological crisis for me. What does it mean for me, as a scholar working in the field of life writing, to self-consciously turn away from close attention to individuals and toward the avowedly reductive, distancing methods of a computational lens?
I am far from alone in this hesitation or this observation. Katherine Bode summarizes [paywall]: "In arguments both for and against CLS, computation is a technological and material process, distinct from readers, texts, and reading. CLS proceeds, in these terms, by representing—in simplified, discrete, and manipulable forms—what occurs elsewhere in complex and continuous ways. For the field’s critics, this distinction makes CLS an oxymoron; for its proponents, both ways of knowing can contribute to literary studies, and there is critical potential in working across the divide."
This last sentence, gesturing toward the idea that there is critical potential in working across the divide, also gestures to the answer to the question that some of you may be asking right now: if I'm having an epistemological crisis about these methods, why seek to use them? As Julie Rak writes in her chapter on big data [paywall] for Kate Douglas and Ashley Barnwell's Research Methodologies for Auto/Biography Studies, "The methods of life writing scholarship at the present time would seem to be opposed to the methods of big data. And yet, the object of big data collection and analysis, human lives and online identities, is becoming increasingly important to the work of life writing scholarship. But the methods of life writing study have not yet been matched to this object of study. They need to be, because life writing scholarship is poised—as Smith and Watson point out—to address the ethical dimension of big data. Who is big data for, who is it about, and most importantly, how is the quantification of identity by machines affecting us all?" We can refuse to read with algorithms, but we cannot currently refuse to be read by algorithms.
I would also note that at last month’s Computational Literary Studies conference, Eitan Wagner, Renana Keydar, Amit Pinchevski, and Omri Abend from Hebrew University in Jerusalem presented their work on applying machine learning methods to identify topical segments in digitized testimonies of Holocaust survivors, and they frame the potential of these methods in ethical terms: "The imminent passing of the last remaining survivors coincides with the transformation from analog platforms (such as film, video, and television) to digital platforms (big data, online access, social media), which introduces great challenges―and great opportunities―to the future of Holocaust memory. As the phase of survivors’ testimony collection reaches its inevitable conclusion, pressing questions emerge: how can we approach and make sense of the enormous quantity of materials collected, which by now exceeds the capacity of human reception? How can we study and analyze the multitude of testimonies in a systematic yet ethical manner, one that respects the integrity of each personal testimony? How can new technology help us cope with the gap between mass atrocity and mass testimony?"
As someone who works in life writing and critical digital studies, someone trained in literary studies and information science, and, most simply and perhaps most importantly, someone interested in the conceptual frameworks and practical application of computation, I find myself well-positioned to work at the intersection of life writing and big data methods. I have done this work conceptually, and now I want to see how applied approaches extend these insights. I may end up with new insights into autobiographical texts, but I will certainly end up with deeper insight into how lives are read as data. Computationally enabled algorithmic modeling processes pervade our interactions with information and with each other. As a humanist working in literary and digital fields, I seek to position myself as a scholar and teacher who can grapple directly with these processes to the greatest extent possible. I want to be able to reflect on the application of these tools from the perspective of an experienced practitioner, to engage the potentials of computation while resisting computationalism, the conflation of computational logic with essential reality.
So, where to begin? The first requirement of any computational method is having data amenable to computation, which means, in Bode's words, data in "simplified, discrete, and manipulable forms." In the case of texts, that can mean several things:
- plain text files of the text itself, either in human readable, sentence format or in tables of word use and frequency
- spreadsheets of bibliographic information, such as author, publisher, publication date, etc.
- images, if you are using computer vision techniques
- other types of data created to model specific features of text--lists of geographic places mentioned if you want to create a spatial visualization, or, as Anna Poletti described in their presentation yesterday on reading Goodreads reviews, reviews hand-coded for sentiment and sampled by the quantitative rating assigned
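To make the first of these concrete, here is a minimal sketch in Python of reducing a text to a table of word use and frequency (a "bag of words"). The tokenization rule is a simplifying assumption of my own; a real pipeline would make that choice explicitly.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Reduce a text to a table of word frequencies (a 'bag of words').

    The tokenization rule here (lowercase runs of letters and
    apostrophes) is an illustrative assumption, not a standard.
    """
    return Counter(re.findall(r"[a-z']+", text.lower()))

counts = bag_of_words("The promised land was the promise of a land.")
```

Even this tiny example shows the reduction at work: word order, syntax, and voice are gone, and only countable features remain.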
Creating any of these types of data doesn't simply begin with generating full text; it begins with developing a data model.
As Julia Flanders and Fotis Jannidis describe, "data modeling is the modeling of some segment of the world in such a way to make some aspects computable." Johanna Drucker elaborates that a data model "determines what will be identified as a feature, how it will be made explicit, and what format it will have." In practical terms, a data model is how you capture the information you think is necessary for the study of your object of inquiry, in a format that allows you to undertake that study. While computation makes the data modeling process more explicit, traditional literary methods employ data models as well--the study of the text versus the study of an author's body of work versus the study of its publication history--each of these framings implies different relevant features and privileges one concept of a text's significance over others. To think about data as modeled calls attention, as Drucker has done, to the fact that "data are made," not found, and that "All data models embody values that carry implicit or explicit judgment and therefore often include biases. Almost all data are partial and represent some features of a phenomenon and not others." So, a key question is not just "what is the data of life narrative texts?" but rather "how can we form data models for life narrative texts that embody the methodological and ethical commitments central to the field?"
As examples, I am going to share two preliminary forays into this process.
### Experiment 1: What is early 20th century US immigrant autobiography?
Data source: Bibliographic info from Kaplan's bibliography
Rationale: Expanding the textual universe of US immigrant autobiography will give insight into what we are not reading
Method: Assemble bibliographic data
Workflow: For each entry, create a row in a spreadsheet, check for full text on Project Gutenberg, HathiTrust
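The workflow above can be restated as a small, explicit data model. The sketch below is a minimal Python version; the fields (author, title, year, full-text source) are illustrative assumptions of mine, not the actual columns of Kaplan's bibliography or of my spreadsheet.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class BibEntry:
    """One row of the bibliography spreadsheet: a minimal data model.

    These fields are illustrative assumptions; each one is a decision
    about which features of an entry the model makes computable.
    """
    author: str
    title: str
    year: Optional[int] = None               # publication date, if known
    full_text_source: Optional[str] = None   # e.g. "Project Gutenberg", "HathiTrust"

# Example entry: Antin's text, which is available on Project Gutenberg.
entry = BibEntry(author="Mary Antin", title="The Promised Land",
                 year=1912, full_text_source="Project Gutenberg")
row = asdict(entry)  # the dict form maps directly onto a spreadsheet row
```

Even a model this small embodies judgments: recording `full_text_source` as a single optional field, for instance, quietly privileges texts that happen to have been digitized.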
Outcomes/insights:
- There is a difference between the life narrative of immigrants and the life narrative about immigration. How did the latter come to claim our scholarly attention so firmly? This is a dynamic of literary methodology that is often left implicit: why pay attention to this text and not another? The choice is rarely explained directly, beyond the tacit understanding that since someone before us has already decided a text is worthy of sustained attention, we feel safer in doing the same.
- Incommensurate experience of reading: there is no way in which doing word frequency analysis or topic modeling on this group of texts is comparable to reading one of them. So, considering the less canonical texts solely from a distance does not inherently work against the canon.
### Experiment 2: Can we model individual early 20th century US immigrant narratives as an affective arc?
Data source: Mary Antin's The Promised Land
Rationale: Developmental and linear conceptions of form in immigrant narrative would seek to map a subject’s progress from departure to arrival to citizenship; from economic precarity to productive labor to material stability; from national otherness to national identity. Analysis at the level of the word/sentence has the potential to capture shifting affective relationships to the idea and reality of the US nation. Attempting to model immigrant narrative intersects with broader digital humanities efforts to model plot in literary texts.
Method: Sentiment analysis with Syuzhet package in R
Workflow: use the clean text from Project Gutenberg, model it with Syuzhet, and validate the sentiment analysis plot model using the middle reading technique presented by Elkins and Chun (2019) in their study of a modernist novel.
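Syuzhet itself is an R package; as a sketch of the underlying technique rather than my actual workflow, the Python below scores each sentence of a text against a tiny made-up lexicon and smooths the scores with a rolling mean to produce an affective arc. The lexicon, the sample "narrative," and the window size are all assumptions for illustration only.

```python
import re

# Tiny illustrative lexicon: a stand-in for Syuzhet's much larger one.
# These words and scores are assumptions for this sketch only.
LEXICON = {"free": 1, "joy": 1, "light": 1, "fear": -1, "dark": -1, "loss": -1}

def sentence_scores(text):
    """Score each sentence by summing the lexicon values of its words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [sum(LEXICON.get(w, 0) for w in re.findall(r"[a-z]+", s.lower()))
            for s in sentences]

def rolling_mean(values, window=3):
    """Smooth sentence-level scores into a continuous affective arc."""
    arc = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1):i + 1]
        arc.append(sum(chunk) / len(chunk))
    return arc

# A made-up miniature "narrative" whose arc rises from fear to freedom:
text = ("The dark ship filled us with fear. We felt the loss of home. "
        "Then came light. School brought joy. We were free.")
arc = rolling_mean(sentence_scores(text))
```

The smoothing step matters: the raw sentence scores are noisy, and it is the rolling mean (or, in Syuzhet's case, transformations such as DCT smoothing) that turns them into something readable as a "plot shape" at all.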
Outcomes/insights:
- The Syuzhet lexicon, developed in response to criticism of sentiment analysis lexicons, visualized using a rolling mean, appears to capture accurate highs and lows in the text. The two highest points are both scenes of reading as personal and civic empowerment, which aligns with the experience of reading the text itself and underscores that the narrative is built not around achieving formal citizenship but around cultural citizenship.
- This illustrates the current paradox of computational literary studies: if the findings seem valid, we feel we already knew them from what we had read. If the findings do not align with what we expect, we question the computation.
- What would come next? More narratives--but the corpus will not get far beyond the beaten path if I use only clean, readily available data.
### Closing
I began this talk by setting up the research and interpretive work of the film "Three Minutes: A Lengthening" as the polar opposite of computational approaches to text, and now I want to close by outlining how we might instead see it as a model of an ethical relationship to life data that holds for computational work as well:
- look for absence, document it, attend to it
- the information you need may lie outside the frame
- reconnect text and context
- From the narration of “Three Minutes”: “They say 'one picture is worth a thousand words', but for that phrase to make sense, you do need to know what it is you are looking at.”
  - how can we begin to see what we didn't already know? Look again.
- caring for data can be caring for people