Language Data

Researchers at the University of Cologne produce and publish a wide range of corpora and other datasets in the field of language research. These resources are often the result of years of work and are of great value to the research community.

The Data Center for the Humanities (DCH) has taken on stewardship of many of these corpora and datasets. We are committed to the long-term preservation of these resources and to making them available to the research community. The DCH is a member of the CLARIN and NFDI Text+ research infrastructures and is committed to the FAIR principles of data stewardship (Wilkinson et al. 2016). For language resources connected to indigenous communities, we are also committed to the CARE principles.

The corpora and other datasets are published through the Language Archive Cologne or archived via the DCH data archiving service. The Language Archive Cologne is a CoreTrustSeal-certified repository and is a member of the CLARIN and NFDI Text+ research infrastructures as well as of the DELAMAN network.

Researchers at the University of Cologne have also deposited corpora and other datasets in other DELAMAN language archives as well as in other repositories. These datasets are also listed below.

CELD native researcher Emanuel Tuturop (on the right) conducting Iha fieldwork (photo by Nikolaus P. Himmelmann)
CELD staff member Jean Lekeneney checking data with CELD international advisor Nikolaus P. Himmelmann (photo by Sonja Riesberg)

Language corpora and datasets

Bauer, Anastasia & Roman Poryadin. 2023. Russian Sign Language conversations. Data Center for the Humanities.

Bracks, Christoph A., Datra Hasan, Maria Bardají i Farré, Sumitro Pogi & Nikolaus P. Himmelmann. 2023. Totoli documentation corpus 2. Data Center for the Humanities.

Compes, Isabel & Birgit Hellwig. 2023. Beria Corpus Diverse. Data Center for the Humanities.

Compes, Isabel & Birgit Hellwig. 2023. Beria Corpus Naturalistic. Data Center for the Humanities.

Compes, Isabel & Birgit Hellwig. 2023. Beria Corpus Nouns. Data Center for the Humanities.

Compes, Isabel & Birgit Hellwig. 2023. Beria Corpus Verbs. Data Center for the Humanities.

Degener, Almuth, Eugen Hill & Daniel Kölligan. 2019. Nuristani Archive Cologne. Data Center for the Humanities.

Gipper, Sonja & Jeremías Ballivián Torrico. 2018. The Family Problems Picture Task in Yurakaré. Data Center for the Humanities.

Gipper, Sonja. 2023. SCOPIC Corpus Low German. Data Center for the Humanities.

Gipper, Sonja & Jeremías Ballivián Torrico. 2023. Yurakaré language class recordings. Data Center for the Humanities.

Gipper, Sonja & Jeremías Ballivián Torrico. 2023. Yurakaré interviews on language infrastructure use. Data Center for the Humanities.

Gipper, Sonja & Jeremías Ballivián Torrico. 2023. Yurakaré word list recordings. Data Center for the Humanities.

Gipper, Sonja, Jeremías Ballivián Torrico & Jildo Hinojosa. 2023. Yurakaré sociolinguistic interviews. Data Center for the Humanities.

Gipper, Sonja, Danielle Barth & Nicholas Evans. 2023. SCOPIC Corpus Kölsch. Data Center for the Humanities.

Gipper, Sonja, Vincent Hirtzel, Jeremías Ballivián Torrico & Daniel Chávez Orosco. 2023. Yurakaré Covid-19 interviews. Data Center for the Humanities.

Hannß, Katja. 2015. Etymological Kallawaya Dictionary. Data Center for the Humanities.

Heissig, Walther & Klaus Sagaster. 2019. Oral Tales of Mongolian Bards. Data Center for the Humanities.

Hellwig, Birgit. 2018. Zaghawa. Data Center for the Humanities.

Hellwig, Birgit, Carmen Dawuda, Henrike Frye & Steffen Reetz. 2023. Qaqet Child Language Corpus: Longitudinal Study. Data Center for the Humanities.

Lalinde, Miranda. 2022. Documentation of language variety in Latin America and the Caribbean. Data Center for the Humanities.

Lau, Jonas. 2020. Abesabesi Grammar.

Niethammer, Lutz. 2020. Lebensgeschichte und Sozialkultur im Ruhrgebiet 1930 bis 1960 (LUSIR). Data Center for the Humanities.

Oukafi, Issak Cheikh. 2022. Interviews about Rock Art. Data Center for the Humanities.

Rau, Felix. 2014. Gtaq Field Recordings. Data Center for the Humanities. doi:10.18716/dch/b.00000001.

Rau, Felix. 2021. Dora Telugu Recordings. Data Center for the Humanities.

Rau, Felix. 2021. Kuvi Recordings. Data Center for the Humanities.

Schnell, Stefan. 2018. Multi-CAST Vera'a. Data Center for the Humanities.

Taluah, Asangba Reginald. 2020. Grandmasters of the Drum. A literary linguistic analysis of the Dagbamba Panegyrics. Data Center for the Humanities.

Patrick Jahn recording Fidelia Amalgethu while she is planting new yam in her garden (photo by Henrike Frye 2023)
Shirley Balar and Patrick Jahn practising to set up the new camera (photo by Henrike Frye 2023)

External language corpora and datasets

Beaudry, Nicole & Ingeborg Fink. 2012–2013. Collection Délı̨nę. The Language Archive.

Belo, Maurício C. A., John Bowden, John Hajek, Alexandre V. Tilman & Nikolaus P. Himmelmann. 2002–2006. DoBeS Waima'a Documentation. The Language Archive.

Beuse, Silke Angelika, Katharina Haude & Miguel Ángel. 2001–2010. Collection Movima. The Language Archive.

Birch, Bruce, Nicholas Evans, Murray Garde, Linda Barwick, Kim Akerman, Oscar Whitehead, Ruth Singer, Bernhard Schebeck, Glenn Wightman, Heather Hinch (now Heather Hewitt) & Noreen Pym. 1966–2006. Collection Iwaidja team. The Language Archive.

Compes, Isabel. 2015–2016. Zaghawa-Wagi: Towards documenting the Sudanese dialectal variant of Zaghawa. Endangered Languages Archive.

Conceição Aparício Belo, Maurício da, Nikolaus P. Himmelmann, John F. Bowden, John Hajek, Alexandre V. Tilman, Alex Gusmao Freitas & José Costa Gomes. 2002–2006. Collection Waimaa team. The Language Archive.

Daigle, Benjamin & Sonny A. Djonler. 2014. “Demonstratives.” In Collection Aru languages. The Language Archive.

Djonler, Sonny A., Benjamin Daigle, Emilie Wellfelt, Antoinette Schapper, A. Ross Gordon, David de Winne & Jock Hughes. 1985–2016. Collection Aru languages. The Language Archive.

Döhler, Christian. 2017–2019. A comprehensive documentation of Bine – a language of Southern New Guinea. Endangered Languages Archive.

Fink, Ingeborg. 2010–2011. Dene Narratives – Language Documentation in Délįnę, NWT, Canada. Endangered Languages Archive.

Frye, Henrike. 2022. The Simbali Baining of Papua New Guinea: A community-based documentation project. Endangered Languages Archive.

Gijn, Rik van, Vincent Hirtzel, Sonja Gipper & Jeremías Ballivián Torrico. 2011. The Yurakaré Archive. The Language Archive.

Hellwig, Birgit & Gertrud Schneider-Blum. 2014. A Documentation of Tabaq, a Hill Nubian language of the Sudan, in its sociolinguistic context. Endangered Languages Archive.

Hellwig, Birgit. 2013. Language socialisation and the transmission of Qaqet Baining (Papua New Guinea): Towards a documentation project. Endangered Languages Archive.

Hellwig, Birgit. 2003. Goemai Texts. Endangered Languages Archive.

Hellwig, Birgit. 1999–2003. Goemai corpus. The Language Archive.

Hellwig, Birgit, Gertrud Schneider-Blum & Khaleel Bakheet Khaleel Ismail. 2022. “Tabaq (Karko) DoReCo dataset.” In Frank Seifart, Ludger Paschen & Matthew Stave (eds.), Language Documentation Reference Corpus (DoReCo) 1.2. Berlin & Lyon: Leibniz-Zentrum Allgemeine Sprachwissenschaft & laboratoire Dynamique Du Langage (UMR5596, CNRS & Université Lyon 2). doi:10.34847/nkl.eea8144j.

Hellwig, Birgit. 2022. “Goemai DoReCo dataset.” In Frank Seifart, Ludger Paschen & Matthew Stave (eds.), Language Documentation Reference Corpus (DoReCo) 1.2. Berlin & Lyon: Leibniz-Zentrum Allgemeine Sprachwissenschaft & laboratoire Dynamique Du Langage (UMR5596, CNRS & Université Lyon 2). doi:10.34847/nkl.b93664ml.

Himmelmann, Nikolaus P. & Sonja Riesberg. 2014. “Eipo Summits Collection.” In Collection CELD Papua. The Language Archive.

Kirihio, Jimmi Karter, Volker Unterladstetter, Apriani Arilaha, Freya Morigerowsky, Alexander Loch, Yusuf Sawaki & Nikolaus P. Himmelmann. 2005–2010. DoBeS Wooi Documentation. The Language Archive.

Lau, Jonas. 2018–2019. Documenting Àbèsàbèsì. Endangered Languages Archive.

Leto, Claudia, Winarno Salim Alamudi, Nikolaus P. Himmelmann, Sonja Riesberg, Jani Kuhnt-Saptodewo, Antara News Tolitoli & Bapak Zaharman. 1988–2010. Collection Totoli. The Language Archive.

Leto, Claudia, Winarno S. Alamudi, Jani Kuhnt-Saptodewo, Sonja Riesberg, Hasan Basri & Nikolaus P. Himmelmann. 2005. DoBeS Totoli Documentation. The Language Archive.

Mazzitelli, Lidia Federica. 2017. Documentation and Description of Lakurumau. Endangered Languages Archive.

Mitchell, Alice. 2017. Causality Across Languages (CAL): Datooga. Endangered Languages Archive.

Mueller, Gabriele, Dagmar Jung, Olga Charlotte Müller, Julia Colleen Miller, Kate Hennessy, Patrick Moore, Pat Moore, Gabriele Schwiertz & Amber Ridington. 2004–2009. Collection Beaver Archive. The Language Archive.

Narfafan, Sutriani & Emanuel Tuturop. 2009–2016. DoBeS Iha Documentation. The Language Archive.

Ozerov, Pavel. 2016–2017. A community-driven documentation of natural discourse in Anal, an endangered Tibeto-Burman language. Endangered Languages Archive.

Riesberg, Sonja & Nikolaus P. Himmelmann. 2010–2014. “Summits-PAGE Collection.” In Collection CELD Papua. The Language Archive.

Riesberg, Sonja, Nikolaus P. Himmelmann, Kristian Walianggen & Apriani Arilaha. 2012–2015. “Yali Summits Collection.” In Collection CELD Papua. The Language Archive.

Riesberg, Sonja, Kristian Walianggen & Siegfried Zöllner. 2012–2016. DoBeS Documentation Summits in the Central Mountains of Papua. The Language Archive.

Schapper, Antoinette. 2012–2014. Zapal, an oral literature genre of the Bunaq Lamaknen. Endangered Languages Archive.

Si, Aung. 2010–2011. Documentation of the language and biological knowledge of the Solega. Endangered Languages Archive.

Si, Aung. 2012–2013. Documentation of Danau, an endangered language of Myanmar (Burma). Endangered Languages Archive.

Si, Aung. 2014. “Kune.” In Collection SI1. PARADISEC.

Unterladstetter, Volker, Alexander Loch, Freya Morigerowsky & Yusuf Sawaki. 2009–2013. Collection Wooi. The Language Archive.

Wegener, Claudia. 2022. “Savosavo DoReCo dataset.” In Frank Seifart, Ludger Paschen & Matthew Stave (eds.), Language Documentation Reference Corpus (DoReCo) 1.2. Berlin & Lyon: Leibniz-Zentrum Allgemeine Sprachwissenschaft & laboratoire Dynamique Du Langage (UMR5596, CNRS & Université Lyon 2). doi:10.34847/nkl.b74d1b33.

Wegener, Claudia, Aurélie Cauchard, Ian Scales & Eva Schultze-Berndt (eds.). 2007–2015. Collection “Savosavo and Gela.” The Language Archive.

Winarno (on the right) interviewing Abdullah Allamudi (on the left) for the Totoli Documentation Project 2 (photo by Maria Bardají)
Christoph Bracks (on the right) recording the Totoli speaker Ramlin telling folk stories to children for the Totoli Documentation Project 2 (photo by Maria Bardají)

Language corpora and datasets in development

Compes, Isabel. In development. Recordings of Beria poetry, drumming, dances and songs. Data Center for the Humanities.

Gipper, Sonja & Jeremías Ballivián Torrico. 2023 (forthc.). Yurakaré Glottobank. Endangered Languages Archive.

Kapitonov, Ivan. Submitted. Kunbarlang, central Arnhem Land. PARADISEC.

Mitchell, Alice. In development. A video corpus of Barabaiga- and Gisamjanga-Datooga conversation.
