01_corpus:02_preprocessing:04_languages
Previous revision | |||
— | 01_corpus:02_preprocessing:04_languages [2022/06/27 09:21] (current) – external edit 127.0.0.1 | ||
---|---|---|---|
Line 1: | Line 1: | ||
+ | ====== 1.2.4 Languages and varieties ====== | ||
+ | ===== Languages and varieties per chat ===== | ||
+ | To assign language tags to each chat, we examined its first 250 messages and assigned one of two possible attributes per language: | ||
+ | |||
+ | * lang_100_and_more: | ||
+ | * lang_less_than_100: | ||
+ | |||
+ | for the following languages: | ||
+ | * fra: French | ||
+ | * ita: Italian | ||
+ | * roh: Any variety of Romansh | ||
+ | * gsw: dialectal German as used in Switzerland | ||
+ | * deu: non-dialectal German | ||
+ | * eng: English | ||
+ | * spa: Spanish | ||
+ | * sla: Any Slavic language | ||
+ | |||
+ | Please note: In the browsing tool ANNIS, we created sub-corpora per language, where each message appears in one and only one sub-corpus. In most cases, this is the language that contributes more than 100 messages. If two languages contribute more than 100 messages each, we arbitrarily prioritized the languages: ROH > GSW > FRA > DEU > ITA > ENG/ | ||
+ | |||
+ | If you want to work with all chats that contain a specific language in more than 100 messages, use the query '' | ||
+ | |||
+ | For an overview of the languages and varieties in the corpus, consult: | ||
+ | Ueberwasser, | ||
+ | |||
+ | |||
+ | ===== Languages and varieties per message ===== | ||
+ | The main language of each message is stored in the annotation // | ||
+ | |||
+ | Available languages: | ||
+ | * fra: French | ||
+ | * ita: Italian | ||
+ | * roh: Any variety of Romansh | ||
+ | * gsw: dialectal German as used in Switzerland | ||
+ | * deu: non-dialectal German | ||
+ | * eng: English | ||
+ | * spa: Spanish | ||
+ | * sla: Any Slavic language | ||
+ | |||
+ | Romansh varieties: | ||
+ | |||
+ | * roh-ja: Jauer Romansh | ||
+ | * roh-sr: romontsch sursilvan | ||
+ | * roh-st: rumàntsch sutsilvan | ||
+ | * roh-sm: rumantsch surmiran | ||
+ | * roh-pt: rumauntsch puter | ||
+ | * roh-vl: rumantsch vallader | ||
+ | * roh-gr: rumantsch grischun |
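The chat-level tagging and the sub-corpus prioritization described above can be sketched roughly as follows. This is only an illustration, not the corpus team's actual pipeline: the function names (`base_language`, `tag_chat`, `pick_subcorpus`) are hypothetical, the threshold is read as "100 or more" from the attribute name, languages with no messages in the sample are assumed to receive no attribute, and variety codes such as `roh-sr` are assumed to count towards `roh`. The priority list contains only the part that is legible on this page; its tail after ENG is cut off and is deliberately not guessed.

```python
from collections import Counter

SAMPLE_SIZE = 250   # only the first 250 messages of a chat are inspected
THRESHOLD = 100     # assumption: "lang_100_and_more" means count >= 100

# Legible part of the priority order; the source list is truncated after ENG.
KNOWN_PRIORITY = ["roh", "gsw", "fra", "deu", "ita", "eng"]


def base_language(code):
    """Map a variety code such as 'roh-sr' to its base language 'roh'."""
    return code.split("-", 1)[0]


def tag_chat(message_langs):
    """Return {language: attribute} for one chat.

    message_langs is a list of per-message language codes in chat order.
    Languages absent from the sample get no attribute (assumption).
    """
    sample = message_langs[:SAMPLE_SIZE]
    counts = Counter(base_language(code) for code in sample)
    return {
        lang: "lang_100_and_more" if n >= THRESHOLD else "lang_less_than_100"
        for lang, n in counts.items()
    }


def pick_subcorpus(tags):
    """Choose the single sub-corpus for a chat, or None if undetermined."""
    dominant = [lang for lang, attr in tags.items()
                if attr == "lang_100_and_more"]
    if len(dominant) == 1:
        return dominant[0]
    # Several dominant languages: fall back to the known priority order.
    for lang in KNOWN_PRIORITY:
        if lang in dominant:
            return lang
    return None  # no dominant language, or only languages past the cut-off


# Hypothetical chat: 120 Swiss German, 110 Sursilvan, 20 English messages.
chat = ["gsw"] * 120 + ["roh-sr"] * 110 + ["eng"] * 20 + ["fra"] * 50
tags = tag_chat(chat)
subcorpus = pick_subcorpus(tags)
```

In this made-up chat the 50 French messages fall outside the 250-message sample, so French receives no attribute; both `gsw` and `roh` reach the threshold, and the priority order places the chat in the `roh` sub-corpus.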