Uppsala Student English Corpus (USE)

The Uppsala Student English corpus (USE) is a machine-readable collection of essays from the Department of English, Uppsala University, spanning the years 1999-2001.

Aim

USE was set up by Ylva Berglund and Margareta Westergren Axelsson with the aim of creating a powerful tool for research into the process and results of foreign language teaching and acquisition, as manifest in the written English of Swedish university students.

Contents

The corpus consists of 1,489 essays written by 440 Swedish university students of English at three different levels, the majority in their first term of full-time studies. The total number of words is 1,221,265, which means an average essay length of 820 words. A typical first-term essay is somewhat shorter, averaging 777 words.
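The averages quoted above follow directly from the totals. As a quick check, here is a minimal Python sketch that uses only figures given in this manual; the per-table essay totals are those reported in Tables 1-4, and the function name is chosen for illustration.

```python
# Figures taken from the text and from the totals of Tables 1-4.
TOTAL_WORDS = 1_221_265
ESSAYS_PER_TABLE = {"Table 1": 1238, "Table 2": 214, "Table 3": 30, "Table 4": 7}
FIRST_TERM_WORDS = 961_557  # Table 1, total number of words


def average_length(words: int, essays: int) -> int:
    """Average essay length in words, rounded to the nearest whole word."""
    return round(words / essays)


total_essays = sum(ESSAYS_PER_TABLE.values())
print(total_essays)                                                   # 1489
print(average_length(TOTAL_WORDS, total_essays))                      # 820
print(average_length(FIRST_TERM_WORDS, ESSAYS_PER_TABLE["Table 1"]))  # 777
```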

The essays cover set topics of different types. They were written out of class against a deadline of two to three weeks, with a length limit imposed (usually 700-800 words) and a suitable text structure suggested. First-term students were admitted for both the spring term (January 20 - June 6) and the autumn term (September 1 - January 19).

First-term essays:

a1. "English, my English." Students describe their experience of the English language, evaluating their reading, writing, speaking, and listening proficiency. Personal, involved style. Written late January or early September.

a2. Argumentation. Students argue for or against a statement concerning a topical issue. Formal style. Written in mid-February or early October.

a3. Reflections. Students reflect on the medium of television and its impact on people, or on related issues of their choice. Personal/formal style. Written in March or October.

a4. Literature course assignment. Students choose between a discussion of theme/character/narrator and a close-reading based analysis of a set passage. Formal style. Written in early April or November.

a5. Culture course assignment. Students study topics in set secondary sources and compose an essay using this material, often quoting and listing these sources. Topics include issues such as 19th-century education of women, the industrial revolution, slavery, and utopias. Written in late April or November.

Second-term essays:

b1. Causal analysis. Students discuss causes of some recent trend of their choice. Formal style. Suitable in content and style for comparison or combination with essay a3.

b2. Argumentation. Students present counter-arguments to views expressed in articles or letters to the editor. Similar in approach and tone to essay a1.

b3. Short papers in English linguistics, on various topics, e.g. loan words in English, English spelling, British and American English, the semantic properties of synonymous pairs. Academic style. Lengthy tables, lists of words, and appendices, irrelevant to the study of learner English, were removed (the place was marked in the document). Essays may still contain words in other languages than English, or from earlier periods of English, items quoted directly from dictionaries, and lists of references.

b4. English literature. A discussion of character, theme etc., produced in a survey course, dealing with Shakespeare's Julius Caesar or contemporary novels. Essays may contain quotations, sometimes also references to secondary sources. Academic style.

b5. American literature. Similar to b4. Essays formed part of a course on American contemporary novels and may contain quotations and references to secondary sources. Academic style.

In the autumn of 1999, 30 additional essays (coded b6-b8) were produced by second-term teacher trainees:

b6. Taboo, not taboo. (12 essays)

b7. Politics and education. (15 essays)

b8. School visit reports. (3 essays)

Third-term essays:

c1. Collected only in the spring term, 2000. Seven longer essays, all literature course assignments.

A quantitative overview

Tables 1-4 provide a survey of the USE corpus, tabulating its content and size across the three years of collection, thus illuminating the history of the corpus production. The number of words has been calculated with the Wordlist option in WordSmith Tools, version 2.0, set to count numbers as words, and hyphenated words as one word.
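The two counting settings mentioned above (numbers count as words, hyphenated words count as one word) can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the regular expression and the function name count_words are illustrative and do not reproduce the tokenisation of WordSmith Tools, so its counts will not necessarily match the figures in the tables.

```python
import re

# Assumption: a "word" is a run of letters and/or digits, optionally joined
# to further runs by a hyphen or an apostrophe (well-known, 19th-century, don't).
WORD = re.compile(r"[^\W_]+(?:[-'][^\W_]+)*")


def count_words(text: str) -> int:
    """Count tokens, treating numbers as words and hyphenated words as one word."""
    return len(WORD.findall(text))


sample = "A well-known 19th-century novel cost 5 pounds in 1999."
print(count_words(sample))  # 9
```

A figure written with a thousands separator, such as 1,000, would be split into two tokens by this pattern; whether that matches the original counts is not known.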

Table 1. Number of essays written by first-term students (a) and number of words

| Essay type | Count | Spring 1999 | Autumn 1999 | Spring 2000 | Autumn 2000 | Spring 2001 | Autumn 2001 | Total | Average words/essay |
|---|---|---|---|---|---|---|---|---|---|
| Student id | | 0100-238 | 1000-1121 | 2000-75 | 3000-72 | 4000-48 | 5000-49 | | |
| Evaluation (a1) | Essays | 115 | 84 | 63 | 31 | 6 | 4 | 303 | |
| | Words | 83,285 | 64,802 | 45,319 | 12,702 | 2,586 | 1,656 | 210,349 | 703/414 |
| Argumentation (a2) | Essays | 105 | 67 | 58 | 26 | 42 | 46 | 344 | |
| | Words | 78,390 | 52,250 | 41,916 | 20,104 | 30,326 | 32,388 | 255,374 | 742 |
| Reflections (a3) | Essays | 94 | 56 | 54 | 18 | 36 | 34 | 292 | |
| | Words | 67,832 | 40,811 | 38,725 | 14,110 | 25,992 | 24,273 | 211,743 | 725 |
| Literature (a4) | Essays | 73 | 49 | 48 | 8 | 6 | 1 | 185 | |
| | Words | 66,125 | 43,123 | 43,140 | 8,085 | 5,218 | 1,214 | 166,905 | 902 |
| Culture (a5) | Essays | 50 | 40 | 24 | --- | --- | --- | 114 | |
| | Words | 51,736 | 39,867 | 25,583 | --- | --- | --- | 117,186 | 1,028 |
| Total | Essays | 437 | 296 | 247 | 83 | 90 | 85 | 1,238 | |
| | Words | 347,368 | 240,853 | 194,682 | 55,001 | 64,122 | 59,561 | 961,557 | 777 |

Notes on Table 1:

Evaluation (a1): From 2001 these essays were limited to about 400 words, and collection for the corpus was officially discontinued. A few essays nevertheless submitted were included.

Literature (a4): Collection for the corpus was discontinued after a sharp drop in students' interest toward the end of the autumn term, 2000. A few essays still submitted were included.

Culture (a5) was dropped from the curriculum as of autumn, 2000.

Table 2. Essays and papers written by second-term students (b) and number of words

| Text type | Count | Autumn 1999 | Spring 2000 | Autumn 2000 | Spring 2001 | Autumn 2001 | Total | Average words/essay |
|---|---|---|---|---|---|---|---|---|
| Student id | | 0100-318 | 1000-1500-41 | 2000- | 3000-3500-25 | 4000-4500-8 | | |
| Causal analysis (b1) | Essays | 15 | 21 | 12 | 22 | 6 | 76 | |
| | Words | 11,469 | 16,479 | 8,597 | 16,762 | 4,559 | 57,866 | 761 |
| Argumentation (b2) | Essays | 10 | 15 | 8 | 17 | 3 | 53 | |
| | Words | 9,728 | 14,386 | 5,866 | 14,730 | 2,741 | 47,451 | 895 |
| Linguistics (b3) | Essays | 18 | 15 | 2 | --- | --- | 35 | |
| | Words | 26,662 | 26,679 | 2,996 | --- | --- | 56,337 | 1,610 |
| Literature (b4) | Essays | 14 | 12 | 3 | --- | --- | 29 | |
| | Words | 17,437 | 14,131 | 3,307 | --- | --- | 34,875 | 1,203 |
| Literature (b5) | Essays | 6 | 8 | 5 | 2 | --- | 21 | |
| | Words | 8,118 | 11,368 | 6,328 | 2,355 | --- | 28,169 | 1,341 |
| Total | Essays | 63 | 71 | 30 | 41 | 9 | 214 | |
| | Words | 73,414 | 83,043 | 27,094 | 33,847 | 7,300 | 224,698 | 1,050 |

Notes on Table 2

In the last two terms, only causal analysis (b1) and argumentation (b2) essays were requested from the students.

Table 3. Number of essays written by second-term teacher trainees and number of words. Only from the autumn term, 1999, student id. codes 0100-238

| Text type | N of essays | N of words |
|---|---|---|
| Taboo, not taboo (b6) | 12 | 8,377 |
| Politics and education (b7) | 15 | 12,132 |
| School visit report (b8) | 3 | 3,782 |
| Total | 30 | 24,291 |

Table 4. Number of literature course essays written by third-term students (c) and number of words. Only from the spring term, 2000, student id. codes 0100-238, 0500-2

| Text type (c1) | N of essays | N of words |
|---|---|---|
| American literature (0140 & 0165) | 2 | 4,13 |
| English literature | 5 | 6,58 |
| Total | 7 | 10,71 |

Notes on Table 4

The literature essays (c1) were produced in elective courses of English and American literature. Five of the seven students taking part had also submitted essays on the lower levels. These students keep their original codes in the range 0100-238. The new participants have the codes 0500-2. All files are named with the student identification code plus the extension 'c1'.

File system and encoding

Each essay in USE is a separate file in plain text format. The first line always has a begin-document tag as the only word of the line. That tag also provides the file name of the text document (e.g. <doc.id = 2031.a3>). An end-document tag (</doc>) is the only word on the last line of the document. The file name shows the student identity number (2031) followed by an extension giving the term/level of writing and the type of essay (e.g. a3, where a = first term, 3 = essay 3, Reflections). As shown in the tables, the first digit of the student identity code denotes the term the student entered the project (0 = spring term, 1999; 5 = autumn term, 2001); the following digits are only numbers marking the order in which students volunteered.

The student identity codes thus make it possible to select essays from a particular term, if so desired. It is also possible to follow individual students over time, as, once a student entered the project, her/his identity code remained the same. The extension denotes the term or level the essay belongs to. Thus, student 2012 may have produced several essays, such as 2012.a2, 2012.a3 (all on first-term level) and 2012.b1 (second-term level).

Normally students continue their studies on consecutive terms, so that a student beginning first-term studies in the autumn of 2000 will proceed to second-term studies in the spring of 2001. Four students, however, interrupted their studies, returning to take the second-term courses one or more terms later. Such second-term essays have an "i" (for "interrupted" period of study) added to the file extension (2012.b1i). The exact term when the student wrote her/his second-term essays is shown in the database. This time factor may be important to consider, if such an essay is included in a longitudinal sub-corpus.
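The naming scheme described above lends itself to automatic parsing. The following Python sketch splits a file name into its parts. It rests on assumptions not stated outright in this manual: that student codes always have four digits, that essay numbers have one digit, and that the first-digit-to-term mapping for the digits 1-4 can be read off Table 1 (the text only spells out 0 and 5). The names EssayName and parse_essay_name are invented for illustration.

```python
import re
from typing import NamedTuple

# First digit of the student code -> term. Assumption: values for 1-4 are
# read off Table 1; the text itself only gives 0 and 5.
ENTRY_TERMS = {
    "0": "spring 1999", "1": "autumn 1999", "2": "spring 2000",
    "3": "autumn 2000", "4": "spring 2001", "5": "autumn 2001",
}

# Assumption: four-digit student code, level letter a-c, one-digit essay
# number, optional "i" for an interrupted period of study.
NAME = re.compile(r"(\d{4})\.([abc])(\d)(i?)")


class EssayName(NamedTuple):
    student_id: str    # kept as text so that leading zeros survive
    level: str         # a = first term, b = second term, c = third term
    essay_type: str    # e.g. "a3"
    interrupted: bool  # True if the extension ends in "i"
    entry_term: str


def parse_essay_name(name: str) -> EssayName:
    """Split a USE file name such as '2012.b1i' into its components."""
    match = NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a USE file name: {name!r}")
    student_id, level, number, flag = match.groups()
    term = ENTRY_TERMS.get(student_id[0], "unknown")
    return EssayName(student_id, level, level + number, flag == "i", term)


print(parse_essay_name("2012.b1i"))
```

Keeping the student code as a string rather than an integer preserves codes such as 0140, whose leading zero carries the term information.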

Some editing of the essays has been done: author names have been removed (deletion marked <name>) along with other identifying information. Apostrophes have been standardised. Formatting characters (hard line and page breaks, extra line spacing, etc.) have been removed. Paragraph breaks (end of paragraph) have been kept, standardised as CR + LF (return and line feed, ASCII 13, 10) to enable study of text organisation. Three spaces have been substituted for tabs (HT, ASCII 9). Titles, if any (some essays are untitled), are preceded by <title> and followed by </title> to enable exclusion of titles, if desired.
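Given these conventions, an essay can be split into its identifier, title and paragraphs with a short routine. The sketch below is a hedged illustration rather than a tool supplied with the corpus: the exact spelling of the begin-document tag is assumed from the single example above, the function name parse_essay is invented, and the sample text is made up.

```python
import re

# Assumption: the begin-document tag looks like the example "<doc.id = 2031.a3>".
DOC_START = re.compile(r"<doc\.id\s*=\s*([^>\s]+)\s*>")
TITLE = re.compile(r"<title>(.*?)</title>", re.DOTALL)


def parse_essay(raw: str) -> dict:
    """Return the document id, the title (or None) and the list of paragraphs."""
    lines = raw.replace("\r\n", "\n").strip().split("\n")
    start = DOC_START.fullmatch(lines[0].strip())
    if start is None or lines[-1].strip() != "</doc>":
        raise ValueError("missing begin- or end-document tag")
    body = "\n".join(lines[1:-1])
    title_match = TITLE.search(body)
    title = title_match.group(1).strip() if title_match else None
    body = TITLE.sub("", body)  # drop the title so it is not counted as a paragraph
    paragraphs = [p.strip() for p in body.split("\n") if p.strip()]
    return {"doc_id": start.group(1), "title": title, "paragraphs": paragraphs}


# Invented sample; paragraphs are separated by CR + LF as described above.
sample = (
    "<doc.id = 2031.a3>\r\n"
    "<title>Television</title>\r\n"
    "First paragraph.\r\n"
    "Second paragraph, signed <name>.\r\n"
    "</doc>\r\n"
)
essay = parse_essay(sample)
print(essay["doc_id"], essay["title"], len(essay["paragraphs"]))  # 2031.a3 Television 2
```

The <name> marker is left in place here; a study of vocabulary might want to strip it before counting words.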

Collection procedure

In connection with a grammar lecture by one of the compilers, students were informed about the USE project, its aims and practical organisation. They were encouraged to enrol, although on an entirely voluntary basis. Consent to enrol and permission to use essays in the corpus were given in writing (Appendix 1) and students also completed a questionnaire providing information for a database (see below and Appendix 2).

All essays were written without supervision or time constraints (apart from date deadlines), and with access to dictionaries and to written and electronic sources for facts. As essay deadlines approached, students were reminded to hand in electronic copies of their original essays to the USE compilers at the same time as they submitted a printed original to their essay tutors. Electronic copies were handed in on disk, copied into e-mails, or provided as e-mail attachments.

The USE compilers removed the students' names and other means of identification, converted the texts to plain text format, standardised certain items (see above) and saved the files under the identity codes allocated during the enrolment procedure.

Text types

A consequence of the set topics is that the essays can be expected to represent different text types and registers, i.e. they exhibit different levels of formality, certain kinds of vocabulary etc. This is why essay type (numbered a1-a5, b1-b8, and c1) rather than the term of production has been chosen as the main principle of organisation in the final version of the corpus, which facilitates grouping of similar texts, in order to obtain larger samples. It also makes it possible to compare similar text types on different levels. Table 5 shows the different types of essays. Evidently, some essays represent similar text types on different levels of proficiency. In terms of formality level and general topic areas, argumentation and discussion essays are related, usually dealing with topical subjects in society. This means that two large categories of essays can be discerned, one about matters of interest in society and the other about literature. Evaluation, culture and linguistics can be seen as more specialised categories.
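The grouping of essay types into broader categories, as laid out in Table 5, can be expressed as a simple lookup when pooling files for a larger sample. The Python sketch below copies the category assignments from Table 5; the function names and the sample file names are invented for illustration, and essay type b3 (linguistics), which Table 5 does not list, is deliberately left unassigned.

```python
# Category -> essay types, copied from Table 5.
ESSAY_CATEGORIES = {
    "Evaluation": ["a1", "b8"],
    "Argumentation": ["a2", "b2"],
    "Discussion": ["a3", "b1", "b6", "b7"],
    "Literature": ["a4", "b4", "b5", "c1"],
    "Culture": ["a5"],
}


def category_of(essay_type: str):
    """Return the Table 5 category of an essay type, or None if it is not listed."""
    for category, types in ESSAY_CATEGORIES.items():
        if essay_type in types:
            return category
    return None


def pool_by_category(file_names):
    """Group file names such as '2012.b1i' by the category of their essay type."""
    pools = {}
    for name in file_names:
        essay_type = name.split(".", 1)[1][:2]  # ignore a trailing "i"
        pools.setdefault(category_of(essay_type), []).append(name)
    return pools


# Invented file names, for illustration only.
print(pool_by_category(["2012.a3", "2012.b1i", "2031.a4", "0140.c1"]))
```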

Table 5. Essay types in the USE corpus and how they are interrelated

| | First term (a) | Second term (b) | Third term (c) |
|---|---|---|---|
| Evaluation | a1 | b8 | |
| Argumentation | a2 | b2 | |
| Discussion | a3 | b1, b6, b7 | |
| Literature | a4 | b4, b5 | c1 |
| Culture | a5 | | |

Student background information

All 440 students in the USE project filled in a questionnaire, answering questions about themselves, concerning their first language, parents' first language, grades in English, previous studies, exposure to English etc. This information is coded in a Microsoft Excel database (see below). Incomplete data in the database are due to some students overlooking the second page of the questionnaire or choosing not to answer all the questions, or (in the last two terms) to the use of a shortened, simplified form of the questionnaire.

USE resources

USE consists of three separate parts, shown in Figure 1.

USE corpus: 1,489 essays in plain text format, organised in 14 essay type categories (each contained in one folder marked a1, a2, etc.). Untagged text. For number of files, see Tables 1-4.

USE database: Information about the 440 students coded in a Microsoft Excel file.

USE manual: Detailed information about the corpus and the database in a Microsoft Word file.

Figure 1. USE resources

Researching the corpus

Interface

As yet, there is no special interface or search engine created for the USE corpus. Most studies conducted on the material have been carried out with the software program WordSmith Tools, version 3.0 or earlier.

Sampling

Depending on the research question, the corpus can be used in different ways. If the investigated feature is expected to be frequent, a smaller sample of texts may be sufficient. If the investigation deals with a feature sensitive to register variation, it is important to choose essay type(s) suitable for the purpose.
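When a smaller sample is sufficient, drawing it at random with a fixed seed keeps the selection reproducible. The following Python sketch shows one way to do this for a single essay type; the function name sample_essays, the seed value and the file names are all assumptions made for illustration.

```python
import random


def sample_essays(file_names, essay_type, k, seed=0):
    """Draw up to k file names of one essay type, reproducibly for a given seed."""
    pool = sorted(
        name for name in file_names
        if name.split(".", 1)[1].startswith(essay_type)
    )
    rng = random.Random(seed)  # private generator, so global state is untouched
    return rng.sample(pool, min(k, len(pool)))


# Invented file names: twenty a2 essays and one a3 essay.
names = [f"{2000 + i}.a2" for i in range(20)] + ["2001.a3"]
print(sample_essays(names, "a2", 5))
```

Sorting the pool before sampling makes the result independent of the order in which the file names were listed.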

Comparisons of Swedish students' English can be made with standard corpora of authentic written English or with other corpora of learner English (see Pravec 2002). Internal comparisons of samples of essays in the corpus may also be of interest, for instance, to see to what extent a syntactic construction or lexical unit is mastered on different levels of study (see Axelsson and Berglund 2002).

Database

The USE database provides an overview of the resources available. By using the sorting or filter functions in Excel, one can easily identify the essays relevant for a specific research question and then sample the selected essays from the corpus.

The information in the database makes it possible to study the progression of individual students who have submitted several essays over time. This can mean several essays in one term, two terms or three terms. The number of essays in each student’s production varies from one to eleven. About sixty students have handed in varying numbers of essays on both the first- and second-term levels, five of these even on the third-term level.
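Because the student code stays the same across terms, students who contributed on more than one level can also be found from the file names alone. The Python sketch below illustrates the idea; the function names are invented and the file names are made-up examples, not a listing of the actual corpus.

```python
from collections import defaultdict


def levels_by_student(file_names):
    """Map each student code to the set of levels (a, b, c) it has essays on."""
    levels = defaultdict(set)
    for name in file_names:
        student_id, extension = name.split(".", 1)
        levels[student_id].add(extension[0])
    return dict(levels)


def longitudinal_students(file_names, required=("a", "b")):
    """Return, in sorted order, the students with essays on all required levels."""
    wanted = set(required)
    return sorted(
        student for student, levels in levels_by_student(file_names).items()
        if wanted <= levels
    )


# Invented file names, for illustration only.
files = ["2012.a2", "2012.a3", "2012.b1i", "2031.a3", "0140.a1", "0140.b4", "0140.c1"]
print(longitudinal_students(files))                   # ['0140', '2012']
print(longitudinal_students(files, ("a", "b", "c")))  # ['0140']
```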

The following variables are coded in the database:

Column A. Student identity code. The code was assigned when the student first entered the project and stayed with him/her through all terms of participation. Table 1 shows the codes allotted in each first term. Students enrolling for the first time during the second-term programme have their own set of identity codes (see Table 2 and the database). The majority of students who contributed both first- and second-term essays did so on consecutive terms. As mentioned above, four students submitted second-term essays (b) after a break of one term or more. These essays are marked "i" for "interrupted" and their term of production is given in column AJ.

Columns B-O. Essays submitted. Each essay type has its own column.

Column Q. Sex. Female (f) or male (m).

Column R. Age.

Column S. Year of birth.

Column T. Course=programme. A1, B1, C1 = general programme. A2, B2, A4, B4 and A6 = programmes for teacher trainees (A2, B2: upper secondary level; A4, B4: school years 4-9, and A6: school years 1-7).

Column U. Mother tongue, defined as "language spoken at home". Abbreviations are self-explanatory, for example, sw=Swedish, fi=Finnish, nor=Norwegian, spa=Spanish, ger=German. If less than obvious, "ot" = "other" is given and specified in column AJ.

Column V. Mother's first language.

Column W. Father's first language.

Column X. Answers the question "How many years have you studied English at school?"

Column Y. Answers the question "When did you first go to university?"

Column Z. Answers the question "What was your grade in English in Swedish upper secondary/high school?" Several changes in the Swedish grading system explain the variation in the data:

A few, older students have grades from a system ranging across A, a, AB, Ba, B, B? (with A as the best, very unusual grade, and B? a bare pass).

Many have grades 5, 4 and 3 (5 being the best, and 3 or better the requirement for English at university level).

The most recent system comprises several programmes in English at upper-secondary level: a minimum, a standard and a supplementary course, each graded MVG, VG and G - excellent, pass with distinction, pass. G on the standard course is required for university studies. Single grades refer to the standard course, double grades to standard course/supplementary course.

Column AA. Grade in Swedish. See above.

Column AB. University credits (points) in language studies at a Swedish university. The figure entered = weeks of study. Linguistics has been counted in this category.

Column AC. University credits (points) in language studies at a university abroad. The figure entered = weeks of study.

Column AD. University credits (points) in other subjects than languages at a Swedish university. The figure entered = weeks of study.

Column AE. University credits (points) in other subjects than languages at a university abroad. The figure entered = weeks of study.

Column AF. Worked in an English-speaking country. The figure entered = months.

Column AG. Studied in an English-speaking country. The figure entered = months. A college year in the United States, and earlier periods of school or summer schools in some English-speaking country are coded here.

Column AH. Total time spent in an English-speaking environment, broadly defined as "where English is used every day, abroad or in Sweden". The figure entered = months.

Column AI. Answers the question "Is there anything in particular that has affected your command of English?" A common answer refers to visits to English-speaking countries and has been coded as "Stay in Eng", regardless of country.

Column AJ. Clarification as to the first language, term of resumed studies on the second-term level, foreign grades, and other pertinent information offered.

An empty cell means that the student did not answer the question. For the last two terms of the USE project, the questionnaire was simplified, which explains the empty columns AF, AG and AI.
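When the database is processed outside Excel, the column letters used above have to be translated into positions. The Python sketch below converts a column label to a 0-based index and uses it to select students by one variable. How the rows are obtained (for instance through an export of the Excel file) is left open, and the sample rows and function names are invented for illustration.

```python
def column_index(letters: str) -> int:
    """Convert a spreadsheet column label (A, ..., Z, AA, ...) to a 0-based index."""
    index = 0
    for char in letters.upper():
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index - 1


def students_where(rows, column, value):
    """Return the student codes (column A) of rows whose given column equals value."""
    col = column_index(column)
    return [row[0] for row in rows if len(row) > col and row[col] == value]


# Invented sample rows, padded to column AJ; real rows would come from the database.
width = column_index("AJ") + 1
rows = []
for code, sex, tongue in [("2012", "f", "sw"), ("2031", "m", "fi"), ("3005", "f", "sw")]:
    row = [""] * width
    row[column_index("A")] = code
    row[column_index("Q")] = sex     # Column Q: sex
    row[column_index("U")] = tongue  # Column U: mother tongue
    rows.append(row)

print(column_index("AJ"))               # 35
print(students_where(rows, "U", "sw"))  # ['2012', '3005']
```

Empty strings stand in for empty cells here, in line with the convention that an empty cell means the question was not answered.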

Acknowledgement of the origin of USE

Anyone using USE for research is obligated to acknowledge the source of data as follows: USE = Uppsala Student English corpus, compiled by Margareta Westergren Axelsson and Ylva Berglund, the Department of English, Uppsala University, 1999-2001.

Availability of USE

The corpus can be used for research and educational purposes. It can be accessed on the Internet from the Oxford Text Archive at http://hdl.handle.net/20.500.14106/2457. For Uppsala students and researchers, it is also available on a CD at the Department of English (Professor Merja Kytö and Senior lecturers supervising language project work).

About USE and other learner corpora

The two corpus compilers and students at the Department of English have used material from the corpus for investigations. The titles of studies finished so far are given in the list below.

• Axelsson, Margareta Westergren (1999) 'Project USE (Uppsala Student English),' ASLA Information 25:2, 25-6.
• Axelsson, Margareta Westergren (2000) 'USE - The Uppsala Student English Corpus: An instrument for needs analysis,' ICAME Journal 24:155-7. Available online at http://nora.hd.uib.no/icame/ij24/.
• Axelsson, Margareta Westergren (2000) 'The use of a corpus of students' written production in university English teaching,' in Gunilla Byrman, Hans Lindquist and Magnus Levin (eds.) Corpora in research and teaching: Papers from the ASLA symposium on corpora in research and teaching, Växjö, 11-12 November 1999. ASLA:s skriftserie 13. 293-303.
• Axelsson, Margareta Westergren and Angela Hahn (2001), 'The use of the progressive in Swedish and German advanced learner English - a corpus-based study,' ICAME Journal 25:5-30. Available online at http://nora.hd.uib.no/icame/ij25/.
• Axelsson, Margareta Westergren and Ylva Berglund (2002), 'The Uppsala Student English Corpus (USE): A multi-faceted resource for research and course development,' in Lars Borin (ed.) Parallel corpora, parallel worlds. Selected papers from a symposium on parallel and comparable corpora at Uppsala University, Sweden, 22-23 April, 1999. Amsterdam: Rodopi. 79-90.
• Berglund, Ylva and Oliver Mason (2002), 'The influence of external factors on learner performance,' in Bernhard Kettemann and George Marko (eds.) Teaching and learning by doing corpus analysis. Amsterdam: Rodopi. 205-215.
• Borin, Lars and Klas Prytz, '"New wine in old skins?" A corpus investigation of L1 syntactic transfer in learner language.' Poster at TALC 2002, Fifth International Conference on Teaching and Language Corpora, 27-31 July, 2002. Bertinoro, Italy.
• Granger, Sylviane (ed.) (1998) Learner English on computer. London: Longman.
• Mason, Oliver and Ylva Berglund (2002), 'Low-level parameters reflecting the naturalness of text,' in Conference publication of JADT 2002, 6th International Conference on the Statistical Analysis of Textual Data. Saint-Malo, France.
• Mason, Oliver and Ylva Berglund, '"But this formula doesn't mean anything!" - some reflections on parameters of texts and their significance.' Forthcoming in Festschrift for Geoffrey Leech (Peter Lang).
• Pravec, Norma A. (2002) 'A survey of learner corpora,' ICAME Journal 26:81-114. Available online at http://nora.hd.uib.no/icame/ij26/

Students' third-term (C) and fourth-term (D) papers (unpublished, filed by the Department of English)

• Blomberg, Karin (2000) Swedish learners' use of the progressive aspect in English.
• Eiman, Carin (2000) Adjectives and attitudes: A linguistic study of how male and female students use adjectives when describing their knowledge of and proficiency in the English language.
• Hellén, Christina (2001) Swedish students' use of hypothetical conditional sentences in English. (D-course)
• Linerstad, Andrea (2002) The development of students' skills in handling S-V concord.
• Svensson, Jenny (2001) Noun compound, noun-compound or nouncompound? Three different constructions of a noun+noun compound. (D-course).

Appendix 1

Uppsala Student English Project (USE)

Projektet syftar till att skapa en korpus (datorläsbar textsamling) bestående av material producerat av studenter. Korpusen kommer att användas för forskning, undervisning och läromedelsframställning. Att deltaga i projektet är frivilligt och de deltagande kommer att vara anonyma (namnen avlägsnas från korpusen). Frågor besvaras av projektledarna, Margareta Westergren Axelsson och Ylva Berglund.

The project aims at creating a corpus (computer-readable collection of texts) of material submitted by students. The corpus will be used for language research, teaching, and production of teaching material. All contribution to the project is voluntary and anonymity will be maintained by the removal of participants' names in the corpus. Questions will be answered by the project coordinators, Margareta Westergren Axelsson and Ylva Berglund.

Medgivande

Consent form

Jag ger 'The Uppsala Student English Project' rätten att använda det jag lämnar till projektet för forskning, undervisning, läromedelsframställning, publicering och presentation (tex. på konferenser och seminarier). Jag godkänner att mitt material eventuellt publiceras (helt eller delvis) i någon form, tex. elektroniskt eller i tryck.

I hereby give to 'The Uppsala Student English Project' the right to use the material I submit to the project for research, teaching, publication and presentation (conferences, workshops, etc.). I consent to the possible publication of my material (as a whole or in part) in various forms, including paper and electronic media.

Namnteckning / signature ....................................................................................................

Namnförtydligande /printed name ........................................................................................

Datum / date ...........................

Appendix 2

Background data for corpus of student English

Name
(in the final corpus, all contributors will be anonymous, with only a code for identification):

.................................................................................................................

Sex: [ ] female [ ] male

Year of birth: 19...........

Mother tongue (what language do you speak at home? ):

[ ] Swedish [ ] English [ ] Other, namely ...........................................................

Mother tongue of parents
a) mother
[ ] Swedish [ ] English [ ] Other, namely ...........................................................

b) father
[ ] Swedish [ ] English [ ] Other, namely ...........................................................

How many years have you studied English at school? ...................

What year did you first go to university? 19............

Have you taken any previous language courses at university level ?
[ ] no [ ] yes (please specify language and points, for example French 20 p, Russian 5 p, etc.):

.................................................................................................................

.................................................................................................................

Have you taken any other courses at university?
[ ] no [ ] yes (please specify course and points, for example Economics 20 p, Law 5 p, etc.):

.................................................................................................................

.................................................................................................................

.................................................................................................................

Have you studied/worked abroad? (See also the following question.)
[ ] no [ ] yes (please specify country, type of activity, length):

.................................................................................................................

.................................................................................................................

.................................................................................................................

How much time have you spent in an English-speaking environment (where English was used every day), abroad or in Sweden? Please specify where, for how long and to what extent if possible (for example prolonged stay in an English-speaking country, long holidays and travels abroad, work in an international environment):

.................................................................................................................

.................................................................................................................

.................................................................................................................

What was your grade in English in Swedish upper secondary/high school?

.........................

What was your grade in Swedish (language) in upper secondary/high school?

.....................

Is there anything in particular you feel has affected your command of English?
[ ] no [ ] yes (please specify):
