Language Data


Researchers at the University of Cologne produce and publish a wide range of corpora and other datasets in the field of language research. These resources are often the result of years of work and are of great value to the research community.

The Data Center for the Humanities (DCH) has taken on stewardship of many of these corpora and datasets. We are committed to the long-term preservation of these resources and to making them available to the research community. The DCH is a member of the CLARIN and NFDI Text+ research infrastructures and is committed to the FAIR principles of data stewardship (Wilkinson et al. 2016). For language resources connected to indigenous communities, we are also committed to the CARE principles.

The corpora and other datasets are published through the Language Archive Cologne or archived via the DCH data archiving service. The Language Archive Cologne is a CoreTrustSeal-certified repository and a member of the CLARIN and NFDI Text+ research infrastructures as well as the DELAMAN network.

Researchers at the University of Cologne have also deposited corpora and other datasets in other DELAMAN language archives as well as in other repositories. These datasets are also listed below.

Language corpora and datasets

Bracks, Christoph A., Datra Hasan, Maria Bardají i Farré, Sumitro Pogi & Nikolaus P. Himmelmann. 2023. Totoli documentation corpus 2. Data Center for the Humanities. https://doi.org/10.18716/dch/a.00000014

Degener, Almuth, Eugen Hill & Daniel Kölligan. 2019. Nuristani Archive Cologne. Data Center for the Humanities. https://doi.org/10.18716/dch/b.00000003

Gipper, Sonja. 2018. The Family Problems Picture Task in Yurakaré. Data Center for the Humanities. https://doi.org/10.18716/dch/b.00000009

Hannß, Katja. 2015. Etymological Kallawaya Dictionary. Data Center for the Humanities. https://doi.org/10.18716/dch/b.00000007

Heissig, Walther & Klaus Sagaster. 2019. Oral Tales of Mongolian Bards. Data Center for the Humanities. https://doi.org/10.18716/dch/b.00000002

Hellwig, Birgit. 2018. Zaghawa. Data Center for the Humanities. https://doi.org/10.18716/dch/b.00000005

Hellwig, Birgit, Carmen Dawuda, Henrike Frye & Steffen Reetz. 2023. Qaqet Child Language Corpus: Longitudinal Study. Data Center for the Humanities. https://doi.org/10.18716/dch/c.00000001

Lalinde, Miranda. 2022. Documentation of language variety in Latin America and the Caribbean. Data Center for the Humanities. https://doi.org/10.18716/dch/b.00000006

Niethammer, Lutz. 2020. Lebensgeschichte und Sozialkultur im Ruhrgebiet 1930 bis 1960 (LUSIR). Data Center for the Humanities. https://doi.org/10.18716/dch/b.00000000

Oukafi, Issak Cheikh. 2022. Interviews about Rock Art. Data Center for the Humanities. https://doi.org/10.18716/dch/b.00000008

Rau, Felix. 2014. Gtaq Field Recordings. Data Center for the Humanities. https://doi.org/10.18716/dch/b.00000001

Rau, Felix. 2021. Dora Telugu Recordings. Data Center for the Humanities. https://doi.org/10.18716/dch/a.00000001

Rau, Felix. 2021. Kuvi Recordings. Data Center for the Humanities. https://doi.org/10.18716/dch/a.00000002

Schnell, Stefan. 2018. Multi-CAST Vera'a. Data Center for the Humanities. https://doi.org/10.18716/dch/b.00000010

Taluah, Asangba Reginald. 2020. Grandmasters of the Drum. A literary linguistic analysis of the Dagbamba Panegyrics. Data Center for the Humanities. https://doi.org/10.18716/dch/b.00000004

External language corpora and datasets

Compes, Isabel. 2017. Zaghawa-Wagi: Towards documenting the Sudanese dialectal variant of Zaghawa. Endangered Languages Archive. http://hdl.handle.net/2196/00-0000-0000-000F-BF52-A.

Frye, Henrike. 2022. The Simbali Baining of Papua New Guinea: A community-based documentation project. Endangered Languages Archive. http://hdl.handle.net/2196/223p4444-is3o-5682-g99u-98oob457698m

Gijn, Rik van, Vincent Hirtzel, Sonja Gipper & Jeremías Ballivián Torrico. 2011. The Yurakaré Archive. DoBeS Archive, MPI Nijmegen. https://hdl.handle.net/1839/8df587ed-3d6e-4db8-bfe5-4ecad5cef3a2.

Hellwig, Birgit & Gertrud Schneider-Blum. 2014. A Documentation of Tabaq, a Hill Nubian language of the Sudan, in its sociolinguistic context. Endangered Languages Archive. http://hdl.handle.net/2196/eab29275-6e4f-4d3f-8b0d-2badfa322d55

Hellwig, Birgit. 2013. Language socialisation and the transmission of Qaqet Baining (Papua New Guinea): Towards a documentation project. Endangered Languages Archive. http://hdl.handle.net/2196/b0dbb431-a51e-4988-8a52-106d9d4f1406

Hellwig, Birgit. 2003. Goemai Texts. Endangered Languages Archive. http://hdl.handle.net/2196/4bf05980-7e71-4efb-b6d1-18713b8fcb02

Hellwig, Birgit. 1999–2003. Goemai corpus. The Language Archive. https://hdl.handle.net/1839/00-0000-0000-0000-6B5E-B

Hellwig, Birgit, Gertrud Schneider-Blum & Khaleel Bakheet Khaleel Ismail. 2022. “Tabaq (Karko) DoReCo dataset.” In Frank Seifart, Ludger Paschen & Matthew Stave (eds.), Language Documentation Reference Corpus (DoReCo) 1.2. Berlin & Lyon: Leibniz-Zentrum Allgemeine Sprachwissenschaft & laboratoire Dynamique Du Langage (UMR5596, CNRS & Université Lyon 2). https://doreco.huma-num.fr/languages/kark1256 / doi:10.34847/nkl.eea8144j.

Hellwig, Birgit. 2022. “Goemai DoReCo dataset.” In Frank Seifart, Ludger Paschen & Matthew Stave (eds.), Language Documentation Reference Corpus (DoReCo) 1.2. Berlin & Lyon: Leibniz-Zentrum Allgemeine Sprachwissenschaft & laboratoire Dynamique Du Langage (UMR5596, CNRS & Université Lyon 2). https://doreco.huma-num.fr/languages/goem1240 / doi:10.34847/nkl.b93664ml.

Mitchell, Alice. 2017. Causality Across Languages (CAL): Datooga. Endangered Languages Archive. http://hdl.handle.net/2196/00-0000-0000-000F-F304-0.

Wegener, Claudia. 2022. “Savosavo DoReCo dataset.” In Frank Seifart, Ludger Paschen & Matthew Stave (eds.), Language Documentation Reference Corpus (DoReCo) 1.2. Berlin & Lyon: Leibniz-Zentrum Allgemeine Sprachwissenschaft & laboratoire Dynamique Du Langage (UMR5596, CNRS & Université Lyon 2). https://doreco.huma-num.fr/languages/savo1255 / doi:10.34847/nkl.b74d1b33.

Wegener, Claudia, Aurélie Cauchard, Ian Scales & Eva Schultze-Berndt (eds.). 2007–2015. Collection “Savosavo and Gela”. The Language Archive. https://hdl.handle.net/1839/fe2f3be5-57dc-4ccd-912d-57ed111e653c.

Language corpora and datasets in development

Gipper, Sonja & Jeremías Ballivián Torrico. 2023 (forthc.). Yurakaré Glottobank. Endangered Languages Archive. http://hdl.handle.net/2196/p88m9804-7505-6k4w-990n-b5n00239047x.

Kapitonov, Ivan. Submitted. Kunbarlang, central Arnhem Land. PARADISEC.

Mitchell, Alice. In development. A video corpus of Barabaiga- and Gisamjanga-Datooga conversation.