Decoding Sex in the Humanities: Five Key Findings from MIT’s Gender/Novels Project

It’s been said plenty of times that language is a vehicle for culture, but language is also a time capsule.

Just as our modern vernacular is (finally) evolving to explicitly include singular “they/them” pronouns and other gender-flexible language, similar transition points are fossilized in the language of the past, recording implicit biases up and down the centuries. How our ancestors conceived of and expressed gender differences on the page can be as subtle as the choice of passive versus active voice or the tone connoted by a dialogue tag.

Tracing the evolution of those nuances over the course of centuries is a task almost too monumental and detailed to imagine, but that’s exactly what a team of researchers from MIT’s School of Humanities, Arts and Social Sciences Program in Digital Humanities set out to do.

In a lot of ways, the Digital Humanities team’s lab (DH Lab) doesn’t look like other spaces on campus: there are no robots or beakers or high-powered microscopes. But MIT is an ideal setting for an interdisciplinary workspace of this kind. Well known for its STEM fields, the campus also draws on superb arts and humanities fields, recently ranked, collectively, second in the world, including linguistics and economics departments that are longstanding global leaders.

The DH Lab is populated by a diverse, female-dominated group of undergraduates. (At large, women make up 46 percent of MIT undergraduates.) Associate Professor of Music Michael Cuthbert, a musicologist and computational innovator who previously taught at Smith and Mount Holyoke Colleges, spearheads the lab; he conceptualized and set in motion the Gender/Novels project in the fall of 2018. For the project, he works alongside postdocs Stephan Risi and Lisa Tagliaferri.

In an intense two-month sprint, the DH Lab hit the ground running with the Gender/Novels project. Their mission was twofold: to teach a program to deliver meaningful data from a truly colossal number of books on a sentence-by-sentence level of detail, and to establish their new lab as a force on campus.

The roughly 4,200 novels sampled for the project stretched as far as copyright and availability would allow—beginning in the 1770s with sparser samples, as the novel form was still in its infancy, up to the more prolific 1920s, with most of the samples falling within the late nineteenth and early twentieth centuries. To remove the complex subjectivity of translation, all analysis was done on works originally published in English.  

Precision and clarity were the name of the game: What defines a novel? How do you approach the metadata? When you’re coding, you have to have an A or B definition. A snagging point could be as mundane as getting a computer to recognize tables of contents, something that was instinctual for a human but essentially impossible for the computer program.

One of the challenges of the research was drawing on data without losing nuance. “Rather than just counting the pronoun occurrences,” explains Ife Ademolu-Odeneye, who worked as a web acquisition specialist on the project and built the corpus of books from which data was developed, “we had to standardize them as well.”

When all was said, done and counted, this was what the DH Lab learned.

#1: An author’s gender changes the powers of gendered pronouns.

When a man writes the pronoun “she,” it’s most likely to be followed by the verbs “cried,” “replied,” “answered,” “asked,” “seemed,” “laughed” or “murmured.” Women in these novels are tacitly reactive, rather than active: men are more likely to follow the pronoun “he” with “shouted,” “became,” “fired,” “pointed,” “yelled” or “reached.” The language makes men the instigators and aggressors; it places women in the role of respondent.

But when a woman writes “she,” the pronoun’s power changes. “She” is suddenly likely to be followed by verbs more attuned to a character’s internal life—“wondered,” “announced,” “wanted,” “reached,” “opened,” “longed”—and male pronouns are followed by emotive, less aggressive verbs like “said,” “liked,” “loved,” “smiled,” “kissed” and “wants.” These verbs might also reflect the kinds of commercial subjects that women most frequently wrote about in the eighteenth to the twentieth centuries: domestic life, romance and interpersonal drama. 
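For readers curious what this kind of count looks like in practice, here is a minimal Python sketch. It is an illustration only, not the lab's code: the function name `words_after_pronoun` and the sample sentence are invented for this example, and the sketch simply tallies whichever word comes directly after a pronoun, without checking that the word is actually a verb.

```python
import re
from collections import Counter

def words_after_pronoun(text, pronoun):
    """Count the word that immediately follows each occurrence of `pronoun`.

    Hypothetical helper: it does no part-of-speech tagging, so the
    following word is not guaranteed to be a verb.
    """
    # Lowercase the text and split it into simple alphabetic tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Pair every token with its successor and keep pairs led by the pronoun.
    return Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == pronoun)

# Invented sample text, not drawn from the project's corpus.
sample = "She cried. He shouted and she replied; then he shouted again."
he_counts = words_after_pronoun(sample, "he")
she_counts = words_after_pronoun(sample, "she")
print(he_counts.most_common(3))
print(she_counts.most_common(3))
```

Run separately over novels by men and novels by women, tallies like these could then be compared side by side.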

In adjective associations over the 150-year period covered by the project, the word most frequently associated with female pronouns is (wait for it) “beautiful.” That highly original adjective is trailed closely in the rankings by “pretty,” “sweet,” “lovely” and “dear,” as well as “rosy,” “alone,” “pale,” “childish” and “slim.” Adjectives most associated with male pronouns include “old,” “good,” “last,” “great” and “first,” with frequent cameos by “long,” “big,” “best” and “certain” as well.

While the active roles at the language level are largely reserved for male characters, female characters certainly have the monopoly on connotation-laden, physically descriptive language.  

#2: Female authors mention women much more often than male authors do.

Female authors tend to use slightly more female pronouns than male pronouns, but the numbers are pretty close to balanced: 53 percent female pronouns, 47 percent male pronouns. For male authors, the stats are more lopsided: 25 percent female pronouns, 75 percent male ones.
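A percentage like this boils down to simple arithmetic: female pronouns divided by all gendered pronouns. The Python sketch below shows one way to compute it. The pronoun lists and the name `female_pronoun_share` are assumptions made for illustration; the article does not say which pronoun forms the lab counted.

```python
import re

# Assumed pronoun sets; the lab's actual lists are not given in this article.
FEMALE = {"she", "her", "hers", "herself"}
MALE = {"he", "him", "his", "himself"}

def female_pronoun_share(text):
    """Return female pronouns as a fraction of all gendered pronouns.

    Returns None when the text contains no gendered pronouns at all.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    female = sum(t in FEMALE for t in tokens)
    male = sum(t in MALE for t in tokens)
    total = female + male
    return female / total if total else None

# Invented sample: two female pronouns (her, she), three male (he, his, him).
share = female_pronoun_share("He gave her his book, and she thanked him.")
print(share)
```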

There are plenty of books across these centuries that lack women characters. Not so for men.

#3: Men tend to get all of the (grammatical) agency.

No matter who was writing them, male characters were, by and large, the subjects of sentences in the study. Between 73 percent and 75 percent of the time, when a male character appears in a sentence, he is the active agent. 

#4: Male characters are mentioned with much more frequency than female ones.

Researchers also measured how far apart mentions of women by men are, and how far apart mentions of men by women are. The results are sadly predictable: The median of the greatest number of words between uses of he/him by a female author was 43; the median number of words between uses of she/her by a male author was 19,713.
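One plausible way to arrive at a figure like this is to find, for each book, the longest run of words between two mentions, and then take the median of those runs across books. The Python sketch below follows that reading. It is an assumption about the method rather than the lab's actual procedure, and the names `gaps_between` and `longest_gap` are invented for the example.

```python
import re
from statistics import median

def gaps_between(text, targets):
    """List the number of words separating consecutive uses of any target word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    positions = [i for i, tok in enumerate(tokens) if tok in targets]
    return [b - a - 1 for a, b in zip(positions, positions[1:])]

def longest_gap(text, targets):
    """Return the longest gap in one text, or None with fewer than two uses."""
    return max(gaps_between(text, targets), default=None)

# Two invented "books"; a real corpus would hold thousands of novels.
books = [
    "He spoke and he left.",
    "He waited a long while and then he slept.",
]
per_book = [longest_gap(book, {"he", "him"}) for book in books]
# Skip books that never mention the target at least twice.
typical_longest_gap = median(g for g in per_book if g is not None)
print(per_book, typical_longest_gap)
```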

#5: Gender norms shape women’s writing.

Female authors also tend to include honorifics, such as “sir” or “Mr.” in association with male characters much more often than male authors do. The lab concluded that this may be a direct reflection of how women of those periods were allowed to address their male counterparts off the page.

Dina Atia, an undergraduate student and HathiTrust corpus specialist at the DH Lab, wasn’t shocked by the results, but she was fascinated by observing sexism through computer science rather than social science.

“It was cool to see this thing that we had hypothesized in a very humanities-based setting was proven to be the case by data,” she says. “We can recognize that sexism exists and that sexism definitely existed in the nineteenth century. When you get the numbers behind it, you’re able to say: ‘We aren’t trying to push a narrative; this is something observed and quantified.’ I think we got what we expected. We didn’t expect to discover that the nineteenth century was really, really progressive.”

Those kinds of pleasant surprises come more often with the lab’s current project: examining gender roles in the roots of programming’s history, where women are far more numerous than is culturally acknowledged.   

The concrete findings about adjectives and pronoun frequency aren’t the be-all and end-all of the Gender/Novels project. The lab’s code is really a foundation for future analysis. Postdocs Tagliaferri and Risi explained that the code is open source, which means that anyone can pick up the program and set it running in a new context, such as analyzing social media or news articles. The lab itself hopes to reuse the code in future projects as well, such as applying the program to tweets and speeches by prominent people.

“It’s very easy to say that actions speak louder than words,” Atia says. However, analysis of language can reveal enduring, deeply ingrained patterns of oppression in a world that looks very different than it did one hundred years ago. “Some of the big, obvious ways that racism and sexism and other types of oppression were built into our law, for example, aren’t as evident anymore, but they’re definitely still there in the way we talk. Analyzing language is important because the way that we speak is very indicative of the way that we think.”  

“Bringing the humanistic inquiry to bear on this,” says Tagliaferri, “bringing in feminist theory, queer theory — I think that’s where we’ll see the richness of this data.”

This line of questioning is especially pertinent to MIT in 2019, as the Institute prepares to launch the new Schwarzman College of Computing, whose mission includes research on the ethical implications of computing and AI tools. The new college’s curriculum aims to educate “bilinguals” — the term MIT uses for students who have both technical expertise and a humanistic understanding of complex societal issues. As technology like facial-recognition software is shown to reproduce implicit biases of the programmer, data-driven tools to call out biases have their work cut out for them.

The Gender/Novels project is one more tool in that toolbox: a high-powered, intricate program that can manage a massive sample size and call it like it is.


Alison Lanier writes about the intersections of gender and media, which is the focus of her current studies at MIT. Her fiction, nonfiction, and poetry appear at BUST, Bitch, Origins, and elsewhere.