01_corpus:01_subcorpora
— | 01_corpus:01_subcorpora [2022/06/27 09:21] (current) – external edit 127.0.0.1 | ||
---|---|---|---|
Line 1: | Line 1: | ||
+ | ====== 1.1 Sub-corpora ====== | ||
+ | Based on the annotation of the languages per chat, different sub-corpora were created. | ||
+ | |||
+ | The following basic considerations were applied when creating the sub-corpora: | ||
+ | |||
+ | ===== Definitions for sub-corpora ===== | ||
+ | |||
+ | * Each chat was to be assigned to exactly one language sub-corpus. | ||
+ | * Additionally, | ||
+ | * Where additional tasks were performed on individual chats (e.g. normalization or part-of-speech tagging), we created additional sub-corpora per language. | ||
+ | |||
+ | |||
+ | ===== Main sub-corpora ===== | ||
+ | |||
+ | * WUS: All data, i.e. the whole corpus | ||
+ | * WUS_DEU: All data where non-dialectal German provides the most messages | ||
+ | * WUS_DEU_DEMOG: | ||
+ | * WUS_FRA: All data where French provides the most messages | ||
+ | * WUS_FRA_DEMOG: | ||
+ | * WUS_GSW: All data where dialectal German provides the most messages | ||
+ | * WUS_GSW_DEMOG: | ||
+ | * WUS_ITA: All data where Italian provides the most messages | ||
+ | * WUS_ITA_DEMOG: | ||
+ | * WUS_ROH: All data where Romansh provides the most messages | ||
+ | * WUS_ROH_DEMOG: | ||
+ | |||
+ | ===== Smaller corpora ===== | ||
+ | |||
+ | Next to these main sub-corpora, | ||
+ | * WUS_SMALL: Chats that either contain fewer than 100 messages or in which the majority of messages are not in a national language. | ||
+ | * WUS_SMALL_DEMOG: | ||
+ | * WUSdemographics: | ||
+ | * WUS_ARGDROP and WUS_ARGDROP_language: | ||
+ | |||
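The assignment rules listed above can be sketched roughly as follows. This is only an illustration, not the procedure actually used to build the corpus: the input format (one language code per message), the function name `assign_sub_corpus`, the code `ENG` in the usage example, the set of "national languages" (taken here to be the five languages that have a main sub-corpus), the reading of "majority" as more than half, and the alphabetical tie-breaking are all assumptions.

```python
from collections import Counter

# Language codes taken from the sub-corpus names above; treating exactly
# these five as the "national languages" is an assumption of this sketch.
NATIONAL_LANGUAGES = {"DEU", "FRA", "GSW", "ITA", "ROH"}
MIN_MESSAGES = 100  # chats with fewer messages go to WUS_SMALL


def assign_sub_corpus(message_languages):
    """Return a sub-corpus name for one chat.

    message_languages: one language code per message (hypothetical input format).
    """
    counts = Counter(message_languages)
    total = sum(counts.values())
    national = {lang: n for lang, n in counts.items() if lang in NATIONAL_LANGUAGES}
    non_national = total - sum(national.values())

    # WUS_SMALL: too few messages, or more than half of the messages
    # are not in a national language.
    if total < MIN_MESSAGES or non_national > total / 2:
        return "WUS_SMALL"

    # Otherwise pick the national language with the most messages.
    # Ties are broken alphabetically here; the source does not say
    # how ties were handled.
    best = min(national, key=lambda lang: (-national[lang], lang))
    return f"WUS_{best}"
```

For example, under these assumptions a chat with 60 messages tagged `DEU` and 50 tagged `FRA` would land in WUS_DEU, while a chat with 80 messages tagged `ENG` and 40 tagged `ITA` would land in WUS_SMALL.
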
+ | |||
+ | ===== Other corpora in the browsing tool ===== | ||
+ | In addition to these corpora, you will also see corpora with lowercase letters in the browser (e.g. deu-rftagged, | ||
+ | |||
+ | ===== More information about the sub-corpora ===== | ||
+ | The size and other properties of the individual sub-corpora are documented within the browsing tool. Check the corresponding [[02_browsing: | ||
+ | |||