01_corpus:01_subcorpora
Previous revision | |||
01_corpus:01_subcorpora [2020/04/16 11:34] – simone | — | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== 1.1 Sub-corpora ====== | ||
- | Based on the annotation of the languages per chat, different sub-corpora were created. | ||
- | |||
- | The following basic considerations were applied when creating the sub-corpora: | ||
- | |||
- | ===== Definitions for sub-corpora ===== | ||
- | |||
- | * Each chat was to be assigned to only one language-sub-corpus. | ||
- | * Additionally, | ||
- | * Where additional tasks were performed on individual chats (e.g. normalization or part-of-speech tagging), additional sub-corpora exist per language. | ||
- | |||
- | |||
- | ===== Main sub-corpora ===== | ||
- | |||
- | * WUS: All data, i.e. the whole corpus | ||
- | * WUS_DEU: All data where non-dialectal German provides the most messages | ||
- | * WUS_DEU_DEMOG: | ||
- | * WUS_FRA: All data where French provides the most messages | ||
- | * WUS_FRA_DEMOG: | ||
- | * WUS_GSW: All data where dialectal German provides the most messages | ||
- | * WUS_GSW_DEMOG: | ||
- | * WUS_ITA: All data where Italian provides the most messages | ||
- | * WUS_ITA_DEMOG: | ||
- | * WUS_ROH: All data where Romansh provides the most messages | ||
- | * WUS_ROH_DEMOG: | ||
- | |||
- | |||
- | ===== Smaller corpora ===== | ||
- | |||
- | Next to these main sub-corpora, | ||
- | * WUS_SMALL: Chats that either contain fewer than 100 messages or in which the majority of messages are not in a national language. | ||
- | * WUS_SMALL_DEMOG: | ||
- | * WUSdemographics: | ||
- | * WUS_ARGDROP and WUS_ARGDROP_language: | ||
- | |||
- | |||
- | ===== More information about the sub-corpora ===== | ||
- | The size and other properties of the individual sub-corpora are documented within the browsing tool. Check the corresponding [[02_browsing: | ||
- | |||
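The assignment rules listed above (one language sub-corpus per chat, chosen by the language that provides the most messages, with WUS_SMALL as the fallback) can be illustrated with a minimal Python sketch. This is not the procedure actually used to build the corpus. The function name `assign_subcorpus`, the input format (one language code per message), the tie-breaking rule, and the reading of "majority" as "more than half" are all assumptions made for illustration. The DEMOG, WUSdemographics and ARGDROP sub-corpora are not covered.

```python
from collections import Counter

# Language codes that have a main sub-corpus in the list above.
MAIN_LANGUAGES = {"deu", "fra", "gsw", "ita", "roh"}

# Chats with fewer messages than this go to WUS_SMALL.
MIN_MESSAGES = 100


def assign_subcorpus(message_languages):
    """Return a sub-corpus name for one chat.

    `message_languages` is a hypothetical input format: a list with one
    lowercase language code per message in the chat.
    """
    counts = Counter(message_languages)
    total = sum(counts.values())

    # Rule 1: small chats are collected in WUS_SMALL.
    if total < MIN_MESSAGES:
        return "WUS_SMALL"

    # Rule 2 (assumption): "majority not in a national language" is read
    # as "more than half of the messages carry another language code".
    other = sum(n for lang, n in counts.items() if lang not in MAIN_LANGUAGES)
    if other * 2 > total:
        return "WUS_SMALL"

    # Rule 3: the main language with the most messages wins.
    # Assumption: ties are broken alphabetically by language code.
    top = max(sorted(MAIN_LANGUAGES), key=lambda lang: counts[lang])
    return f"WUS_{top.upper()}"
```

For example, a chat with 80 messages labelled `gsw` and 40 labelled `deu` would be assigned to WUS_GSW under these assumptions, while a chat with only 50 messages would go to WUS_SMALL regardless of language.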
01_corpus/01_subcorpora.txt · Last modified: 2022/06/27 09:21 by 127.0.0.1