Jane Austen Corpus: textual analysis/distant reading report

Why This Corpus

When deciding which body of text to analyze for this corpus project, I thought Jane Austen would be advantageous since I have read a majority of her novels, and I also thought her works would be interesting since she is a well-known and influential writer. Her works are often belittled and labeled as “chick lit” or simple romance plots (which in and of itself is not a negative, but that’s a different issue), so analyzing the individual words used, something that can be done for any novel no matter the writer or genre, seemed like an interesting project. I also thought it would be interesting to compare her works to others of the era in terms of language and how that reflects thematic focus: whether Austen was straying from the norm in her topics, or whether she was aligned with the general consensus and was still able to stand out and remain well known to this day.

Collecting the Texts

I collected each of Austen’s novels from Project Gutenberg. Rather than use a single file containing all of the novels, I collected each one individually and then uploaded them together so that they could be compared. This allowed me to analyze the data not only on the macro level, i.e. her whole corpus and how certain language stands out, but also on a micro level, i.e. each book individually and how the language is distributed among the six novels. In terms of pre-washing the texts, I only needed to remove the text added by Project Gutenberg.
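The pre-washing step described above can also be scripted. The following is a minimal sketch, not the method I actually used: it assumes each file wraps the novel between "*** START OF ..." and "*** END OF ..." marker lines, which is the usual Project Gutenberg layout but should be checked against each download. The function name `strip_gutenberg` and the sample text are hypothetical.

```python
import re

# Assumed marker wording. Project Gutenberg files usually carry lines such as
# "*** START OF THE PROJECT GUTENBERG EBOOK EMMA ***", but verify per file.
START_RE = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_gutenberg(raw: str) -> str:
    """Return only the text between the start and end marker lines.

    If a marker is missing, fall back to the start or end of the file.
    """
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw)
    return raw[begin:finish].strip()


# Toy stand-in for a downloaded file.
sample = (
    "Project Gutenberg license header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EMMA ***\n"
    "Emma Woodhouse, handsome, clever, and rich\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EMMA ***\n"
    "License footer\n"
)
print(strip_gutenberg(sample))
```

Running this once per novel keeps the six files separate, so they can still be compared individually after cleaning.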

Voyant Process

First Cycle:

Initially, I wanted to see what themes or subjects stood out in Austen’s novels by looking at the most commonly used words in the corpus. Unfortunately, this didn’t prove fruitful, as the most common words are characters’ names or words too ambiguous to tie to a single theme without looking deeper into the text at the scene and context of their use. The result was illuminating, though, in showing that Voyant can’t simply hand you the information you are looking for. As with pre-washing, you need to know what you are looking for in the text, how to find it in the data, and how to prep the data accordingly.

That said, I did still use the information from my first cycle of Voyant. I noticed that three of the top four words in the corpus are “Mr,” “Mrs,” and “Miss,” and that “Mrs” appears about 600 more times than “Miss.” This was interesting considering that all of the novels’ main heroines are unmarried women, and it shows through the data the importance placed on marriage within Austen’s novels.
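For readers who want to see roughly what a "most frequent words" list involves, here is a small sketch. It is only an approximation of the idea: Voyant's own tokenizer and stopword list are not reproduced here, and the stopword set, the function name `top_words`, and the sample sentence are all made up for illustration.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list for illustration only.
STOPWORDS = {"the", "and", "to", "of", "a", "in", "was", "her", "she", "it", "had", "not"}


def top_words(text: str, n: int = 5) -> list[tuple[str, int]]:
    """Lowercase the text, split it into alphabetic words, drop stopwords,
    and return the n most frequent remaining words with their counts."""
    # Note: this simple pattern splits possessives such as "Austen's" in two.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)


sample = (
    "Mr. Darcy bowed to Mrs. Bennet, and Mrs. Bennet spoke "
    "to Miss Bingley and Mr. Bingley."
)
print(top_words(sample, 4))
```

Even in this toy sentence, titles and surnames crowd out everything else, which is the same pattern that made the first cycle hard to interpret thematically.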

Second Cycle:

This initial cycle pointed me towards the theme of marriage, which isn’t a surprising conclusion given that Austen’s novels are famous for their courtship-marriage plots, but my question now was whether the word data actually reflected this trademark aspect of Austen’s work. For this cycle I entered terms related to marriage: marriage, engagement, marry, proposal, elope, etc. The terms “marriage” and “engagement” were the most prominent of the ones I entered, both above 150 instances while the rest were below 100. It is also worth noting that “engagement” can be used in Austen’s novels in situations other than marriage, e.g. “a prior engagement,” which makes it particularly significant that “marriage,” which has a much more limited use, was more frequent than “engagement.” But compared to the earlier results for “Mr,” “Mrs,” and “Miss,” which are all at around 2000 instances, these frequencies are markedly low.

These results made me reflect on the theme of marriage in Austen’s novels. While marriage is a major focus of each, it is not marriage as an isolated concept; rather, it is the whole topic of marriage and how it is approached, motivated, achieved, etc. Marriage itself is usually not present in Austen’s novels until the final chapters; it is the end goal. What is actually in focus for the major chunk of each novel is, again, how this marriage is approached and achieved.
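The term counts from this cycle, and the ambiguity of “engagement,” could also be checked outside Voyant. The sketch below is an assumption-laden illustration rather than what Voyant does internally: it counts exact word matches only (so “marry” does not include “married”), and the helper names `count_terms` and `contexts` and the sample sentences are hypothetical.

```python
import re
from collections import Counter

MARRIAGE_TERMS = ["marriage", "engagement", "marry", "proposal", "elope"]


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def count_terms(text: str, terms: list[str]) -> dict[str, int]:
    """Count exact occurrences of each term (no stemming or wildcards)."""
    counts = Counter(tokenize(text))
    return {term: counts[term] for term in terms}


def contexts(text: str, term: str, window: int = 3) -> list[str]:
    """Return each occurrence of term with `window` words on either side,
    so a reader can judge which sense of the word is meant."""
    tokens = tokenize(text)
    return [
        " ".join(tokens[max(0, i - window): i + window + 1])
        for i, tok in enumerate(tokens)
        if tok == term
    ]


sample = (
    "She had a prior engagement that evening. "
    "Their engagement led to marriage, for he wished to marry her."
)
print(count_terms(sample, MARRIAGE_TERMS))
for line in contexts(sample, "engagement", window=2):
    print(line)
```

Reading the context lines is what separates a social “prior engagement” from a betrothal, something the raw count alone cannot do.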

Third Cycle:

For my third and final cycle through Voyant I turned from the question of marriage towards one of the major motivations for marriage in Austen’s novels: love. In all of her novels, the heroine ultimately ends up in a marriage of some type of mutual love, but again, I wanted to see if the data backed this notion. I entered the following terms: love, heart, affection, regard, fond, and esteem. While these terms still did not come close to the 2000 range of “Mr,” “Mrs,” and “Miss,” the term “love” does appear almost 500 times, making it more than twice as frequent as “marriage,” which was at only 214 instances. The terms “heart,” “affection,” and “regard” also appear more often than “marriage,” revealing the significance not only of love within Austen’s novels but of emotions as a whole. Whether it be, for example, “ill affections” or “high regard,” these more emotionally linked terms are more frequent, and thus arguably more important, than marriage.

An interesting note regarding these results is that while Pride & Prejudice is Austen’s most popular and well-known novel, it doesn’t hold a clear lead in any of the terms related to love and emotions; its count for “love” is quite high, but it is basically tied with Mansfield Park and Emma. If one looks at the data for the term “marriage,” however, one finds that Pride & Prejudice has a distinct and wide lead over the other books in word frequency. This result perhaps undermines the idea that love is a more important theme in Austen’s novels than marriage, and instead suggests that the courtship-marriage plot is crucial and that themes of both marriage and love need to be present for the novel to be successful.
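One caution when comparing novels against each other is that they differ in length, so raw counts can favor the longer books. A common workaround is a relative frequency, such as occurrences per 10,000 words. The sketch below shows that calculation under my own assumptions: the function name `rate_per_10k` is hypothetical, and the two "novels" are toy placeholder strings, not Austen's text or my Voyant figures.

```python
import re


def rate_per_10k(text: str, term: str) -> float:
    """Occurrences of `term` per 10,000 words of `text` (exact matches only)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return 10_000 * tokens.count(term) / len(tokens)


# Toy placeholder texts standing in for two novels of different lengths.
novels = {
    "Novel A": "love and marriage and love",
    "Novel B": "marriage " * 2 + "filler " * 8,
}

for title, text in novels.items():
    print(title, rate_per_10k(text, "love"), rate_per_10k(text, "marriage"))
```

Here Novel B contains “marriage” twice as often as Novel A in raw terms, yet the two have the same rate once length is taken into account, which is the kind of difference worth checking before reading too much into one novel's lead.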

It would also be interesting to analyze the data for other motivations for marriage in Austen’s novels, e.g. money, beauty, personality, parental pressures, etc., and see how they compare to “love” in terms of frequency and whether they add any new information to the issue discussed above concerning Pride & Prejudice.

Google Ngram

When using Ngram, the obvious terms to enter were “love” and “marriage,” in order to see how Austen’s novels compared to other writing of the period. The range I used was 1770 (Austen was born in 1775) to 1820 (Austen’s last novel was published in 1817), and I used the British English corpus. The term “marriage” was slightly increasing for about 15 years after Austen was born, so one can assume she read literature about it, but the term then declined around the time Austen actually began publishing, raising the question of both why she focused on such a subject and why her novels were, and still are, successful. One could perhaps look to the issue of gender, in that Austen’s novels were then, and are now, labeled as a type of “chick lit,” or novels without significant substance because they focused on female readers’ wants and interests. One could then conclude that themes and topics that more closely catered to women’s interests weren’t being written about as commonly, and thus they aren’t well represented in Google Ngram.

The term “love” was also on a slight increase around when Austen was born, but it then gradually declined in the following years and didn’t significantly increase again until around 1805; it then remained steady until 1812, when it more noticeably decreased again. These results align with Austen’s novels a bit more, in that Austen’s first novel was published in 1811, during one of the upswings of the term “love,” but the rest of her novels, as with the term “marriage,” were published during the decline of the word’s use. One can again point to the issue of gender and books during that era: female writers were not as common as male writers, and female readers were not paid as much attention as male readers. Female writers and readers had little agency or sway over mass popular opinion, and the data reflects that. The data from Google Ngram may be imperfect in that it is implicitly skewed towards male-influenced themes by virtue of unequal gender relations through different eras of literature.

It is worth noting, though, that when “marriage” and “love” are entered together into Google Ngram, “love” is much more frequent than “marriage,” mirroring the Voyant data, in which “love” has a much higher frequency than “marriage” in Jane Austen’s corpus. This reveals that while both terms may have been on a decline, the concept of love was still significantly more prominent than marriage, perhaps suggesting that, as in Austen’s works, marriage alone was not seen as important to write about, but rather the motivations and reasons for it were.

When I expanded the parameters for the two terms to all of English and from 1770 to 2008, the latest that Google Ngram can go, I found that while “marriage” has remained basically consistent, “love” has decreased significantly, though it is on a slight increase currently. The height of the term’s use was in the 1800s, peaking around 1811, when Austen’s first novel was published. This reveals that, looked at up close, these subjects may have been on a slight decline, but when compared against the expanded timeline of 1770 to 2008, Austen’s novels were characteristic of their time. Again, one wonders why and how her novels have remained so popular if “love” is on a decline. Is it a signal of nostalgia? A slowly coming reemergence of the theme of love? These Google Ngram results left me with new information that ultimately led me to questions that couldn’t be answered by data alone, but they were questions that I probably wouldn’t have found without analyzing such data.


This process allowed me to analyze Austen’s corpus both as a whole, spanning all six of her novels, and individually, in that I could compare the six novels against each other to see how any stand out. As I came up with questions to ask of the corpus, I realized that the data requires an interpretive approach. Being given the data is one tool, but you need to know how to use it and fit it to your analysis; the raw data isn’t enough. This also means you need some sort of context or familiarity with the works you are analyzing. While one doesn’t need background knowledge to understand word-frequency numbers, context is very important, and in my experience one can’t go into such a project with no familiarity.

I did find this process and project to be interesting, though, in that, as mentioned earlier, it raised questions about the popularity of Pride & Prejudice which might not have been noticed without data analysis. The analysis of word use is obviously important in the academic field, since the concepts of diction and syntax are crucial to the analysis of literature, but this large-scale data analysis also opens up different possibilities for examining texts on a more macro level. This macro-level research can give more encompassing evidence for questions about eras of literature as a whole. The data is also subject to drawbacks, though, in that it can’t avoid being biased, similar to human analysis: its reach is limited to only that which was published, and it favors whatever gives the most results. This can misrepresent different eras in that only those who were wealthy, had access to books, could read, were being published, etc., are mainly represented, rather than the society as a whole. This is not to say human analysis isn’t subject to this problem as well, but data analysis is sometimes given a gloss of being purely factual and unbiased when that is simply not the case.

The tools from Voyant and Google Ngram are quite useful, though, and the advice I would give to a “newbie” is to be aware of these biases, which are always going to be present, and to have some familiarity with whatever corpus they are analyzing. Relying purely on data will not get you as far, and it will make it difficult to use your tools effectively, since you won’t know how to shape them for the specific works you are analyzing.

1 thought on “Jane Austen Corpus: textual analysis/distant reading report”

  1. Very nice! I wonder too if the Miss/Mrs. distinction also says something about age/generation and gender in Austen’s novels? I.e. that even though the action is centered on the Miss-es, they are surrounded by many Mrs’s? The prevalence of these terms (Mr./Miss/Mrs.) also made me think of class distinctions. Is the abundance of these terms (perhaps in relation to Sir/Lady/etc.) a sign of a shift in the subject matter and reading audiences of the English novel in the period?

    No big surprises with “love” and “marriage,” but I found it pretty interesting that “heart” ranked so high in your analysis, along with “affection” and “regard.” The heart as the seat of sentiment and the importance of sentiment in 18th/early 19th c. British fiction seems borne out somewhat by your analysis. I wonder if one thing you’re analyzing is the emergence of a new discourse in the English novel about feeling and sentiment. Another cycle of Voyant, trying to pick up the context for/terms associated with “heart,” “regard,” “affection,” might be profitable here. And these really particular but loaded terms might make good candidates for the NGram dive, as opposed to the more general love, marriage, etc.

    Nice work and excellent use of Voyant graphs/figures to illustrate your findings.
