The Corpus of Nineteenth-Century Newspaper English (CNNE)

The English used in nineteenth-century newspapers is of considerable interest to linguists for at least two reasons. First, studies of late twentieth-century English have shown that newspaper language is responsive to two types of language change: colloquialization (the tendency for some written genres to become more similar to informal speech in their linguistic make-up) and densification (the tendency for a given meaning to be expressed using less linguistic material). This raises the question of whether such tendencies can also be identified in nineteenth-century newspaper English. Second, the newspaper became an increasingly central written genre in nineteenth-century Britain. Owing to factors such as advances in printing technology, the repeal of taxes, and increases in literacy, newspapers were bought and read by a far higher proportion of the British population in 1900 than had been the case 100 years earlier. Consequently, nineteenth-century newspaper English is of central importance not only for understanding how twentieth-century English developed, but also for describing the English of the 1800s in its own right.

The Corpus of Nineteenth-Century Newspaper English (CNNE) will enable scholars to study the language of English newspapers from the nineteenth century; in addition, the division of the corpus into two periods makes it possible to trace language change across the 1800s. The period division reflects a number of extralinguistic changes in the middle of the nineteenth century that had important consequences for the newspaper business, such as the repeal of the so-called Taxes on Knowledge (the stamp duties on newspapers and the customs and excise duties on paper) in 1855 and 1861 and the formation of the Press Association in 1868. Because CNNE contains texts from the decades both before and after these changes, the corpus can be used to study their potential effect on newspaper language.

CNNE is being compiled by Erik Smitterberg. Newspaper texts for the corpus are selected from online databases; the selected PDF files are converted to machine-readable text using OCR software, and the output is then proofread manually.