The Koala Part-of-Speech Tagset for Written Swedish

Yvonne Adesam
Gerlof Bouma

Abstract

We present the Koala part-of-speech tagset for written Swedish. The categorization takes the Swedish Academy grammar (SAG) as its main starting point, so that it fits the current descriptive view of Swedish grammar. We argue that neither SAG as it stands nor any existing part-of-speech tagset meets our requirements for a broadly applicable categorization. We outline our proposal, compare it to the other descriptions, and discuss the motivations for both the tagset as a whole and the decisions about individual tags.
