The SweLL Language Learner Corpus: From Design to Annotation

Elena Volodina
Lena Granstedt
Arild Matsson
Beáta Megyesi
Ildikó Pilán
Julia Prentice
Dan Rosén
Lisa Rudebeck
Carl-Johan Schenström
Gunlög Sundberg
Mats Wirén

Abstract

The article presents a new language learner corpus for Swedish, SweLL, and the methodology behind it, from collection and pseudonymisation (to protect learners' personal information) to annotation adapted to second language learning. The main aim is to deliver a well-annotated corpus of essays written by second language learners of Swedish and to make it available for research through a browsable environment. To that end, a new annotation tool and a new project management tool have been implemented, both with the main purpose of ensuring the reliability and quality of the final corpus. In the article we discuss the reasoning behind metadata selection and the principles of gold corpus compilation, and argue for the separation of normalization from correction annotation.
