01_corpus:01_subcorpora

Differences

This shows you the differences between two versions of the page.

01_corpus:subcorpora [2019/11/06 09:23] simone01_corpus:01_subcorpora [2022/06/27 09:21] (current) – external edit 127.0.0.1
Line 1: Line 1:
-====== 1.Sub-corpora ====== +====== 1.Sub-corpora ====== 
-Based on the annotation of the languages per chat, different sub-corpora were created. Sub-corpora were built for the needs of the project members. Hopefully, they will be useful for other research questions as well. +Based on the annotation of the languages per chat, different sub-corpora were created.  
  
 The following basic considerations were applied when creating the sub-corpora: The following basic considerations were applied when creating the sub-corpora:
Line 6: Line 6:
 ===== Definitions for sub-corpora ===== ===== Definitions for sub-corpora =====
  
-  * Each chat was to be assigned to only one language-sub-corpus. As mentioned [[01_corpus:02_preprocessing:03_language_per_chat|above,]] each chat has an annotation for languages that occur in 100 or more messages (meta::lang_100_and_more). However, there are chats with, e.g., more than 100 messages in French and more than 100 in German. In that case, the chat is assigned to the language-sub-corpus of the language that provides the most messages. Example: if a chat consists of 150 messages in French and 120 messages in German, it appears in the main corpus (WUS) as well as in the French sub-corpora (WUS_FRA and WUS_FRA_DEMOG) but not in any of the German sub-corpora.+  * Each chat was to be assigned to only one language-sub-corpus. 
   * Additionally, we differentiate between chats where we have demographic information for all participants and those where we do not. In the former case, the sub-corpus gets the extension _DEMOG.   * Additionally, we differentiate between chats where we have demographic information for all participants and those where we do not. In the former case, the sub-corpus gets the extension _DEMOG.
   * Where additional tasks were performed on individual chats (e.g. normalization or part-of-speech tagging) we created additional sub-corpora per language.   * Where additional tasks were performed on individual chats (e.g. normalization or part-of-speech tagging) we created additional sub-corpora per language.
Line 13: Line 13:
 ===== Main sub-corpora ===== ===== Main sub-corpora =====
  
-These rules result in the following main corpora: 
   * WUS: All data, i.e. the whole corpus   * WUS: All data, i.e. the whole corpus
   * WUS_DEU: All data where non-dialectal German provides the most messages   * WUS_DEU: All data where non-dialectal German provides the most messages
-  * WUS_DEU_DEMOG: subgroup thereof where we have the permission from all communication partners to use their texts.+  * WUS_DEU_DEMOG: subgroup thereof where we have demographic information from all communication partners.
   * WUS_FRA: All data where French provides the most messages   * WUS_FRA: All data where French provides the most messages
-  * WUS_FRA_DEMOG: subgroup thereof where we have the permission from all communication partners to use their texts.+  * WUS_FRA_DEMOG: subgroup thereof where we have demographic information from all communication partners.
   * WUS_GSW: All data where dialectal German provides the most messages   * WUS_GSW: All data where dialectal German provides the most messages
-  * WUS_GSW_DEMOG: subgroup thereof where we have the permission from all communication partners to use their texts.+  * WUS_GSW_DEMOG: subgroup thereof where we have demographic information from all communication partners.
   * WUS_ITA: All data where Italian provides the most messages   * WUS_ITA: All data where Italian provides the most messages
-  * WUS_ITA_DEMOG: subgroup thereof where we have the permission from all communication partners to use their texts. +  * WUS_ITA_DEMOG: subgroup thereof where we have demographic information from all communication partners.
-  * WUS_ITA_TT: Some Italian chats were manually normalized (i.e. converted into a standard spelling) and automatically annotated for part of speech.+
   * WUS_ROH: All data where Romansh provides the most messages   * WUS_ROH: All data where Romansh provides the most messages
-  * WUS_ROH_DEMOG: subgroup thereof where we have the permission from all communication partners to use their texts. +  * WUS_ROH_DEMOG: subgroup thereof where we have demographic information from all communication partners.
  
 ===== Smaller corpora ===== ===== Smaller corpora =====
  
-In addition to these main corpora, there are some smaller corpora: +In addition to these main sub-corpora, there are some smaller sub-corpora: 
-  * WUS_SMALL: chats that either contain fewer than 100 messages or where the languages cannot be clearly distinguished +  * WUS_SMALL: Chats that either contain fewer than 100 messages or where the majority of messages are not in a national language. 
-  * WUS_SMALL_DEMOG: subgroup thereof where we have the permission from all communication partners to use their texts+  * WUS_SMALL_DEMOG: subgroup thereof where we have demographic information from all communication partners. 
-  * WUSdemographics: only demographic data per person. This corpus is much faster if you want to look up demographic data only. +  * WUSdemographics: Only demographic data per person. This sub-corpus is much faster if you want to look up demographic data only. 
-  * WUS_ARGDROP and WUS_ARGDROP_language: corpora for which argument drop has been manually annotated.+  * WUS_ARGDROP and WUS_ARGDROP_language: Sub-corpora for which argument drop has been manually annotated. For the architecture of the annotations and the scientific considerations behind them, see [[http://www.unige.ch/lettres/linge/syntaxe/journal/Volume11/11_Stuntebeck_2018.pdf|Stuntebeck, Franziska (2018): "Annotating Argument Drop in the Swiss WhatsApp Corpus". In: Generative Grammar in Geneva (GG@G) XI, 175-187.]]
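
The assignment rules described on this page can be illustrated with a minimal sketch. It is not the project's actual tooling: the function name, the input format (a dictionary of message counts per language code), the lowercase language codes, and the tie-breaking behaviour are assumptions made for illustration. "Majority of messages" is read here as "the language with the most messages", which is also an assumption.

```python
# Hypothetical sketch of the sub-corpus assignment rules; names and input
# format are illustrative assumptions, not the project's real code.

# Language codes taken from the sub-corpus names (WUS_DEU, WUS_FRA, ...).
NATIONAL_LANGUAGES = {"deu", "fra", "gsw", "ita", "roh"}


def assign_sub_corpora(message_counts, demog_complete, min_messages=100):
    """Return the list of corpora in which a chat would appear.

    message_counts: dict mapping a language code to its number of messages.
    demog_complete: True if demographic information is available for all
                    participants of the chat.
    """
    corpora = ["WUS"]  # every chat is part of the whole corpus

    total = sum(message_counts.values())
    # Language with the most messages; on a tie, the first key wins
    # (the source does not say how ties are resolved).
    top_language = (
        max(message_counts, key=message_counts.get) if message_counts else None
    )

    if total < min_messages or top_language not in NATIONAL_LANGUAGES:
        base = "WUS_SMALL"
    else:
        base = "WUS_" + top_language.upper()
    corpora.append(base)

    # The _DEMOG variant is a subgroup of the language sub-corpus.
    if demog_complete:
        corpora.append(base + "_DEMOG")
    return corpora


# The example from the definitions: 150 French and 120 German messages.
print(assign_sub_corpora({"fra": 150, "deu": 120}, demog_complete=True))
```

With these assumptions, the French/German example lands in WUS, WUS_FRA and WUS_FRA_DEMOG and in none of the German sub-corpora, which matches the definition given above.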
  
  
 +===== Other corpora in the browsing tool =====
 +In addition to these corpora, the browser also lists corpora with lowercase names (e.g. deu-rftagged, ita-tagged, roh etc.). These corpora contain data from our [[https://wiki.linguistik.uzh.ch/sms4science|SMS project]].
  
 +===== More information about the subcorpora =====
 +The individual sub-corpora are documented within the browsing tool, including details such as their size. See the corresponding [[02_browsing:01_sub_corpora|section]] for more information.
  
  
01_corpus/01_subcorpora.1573028595.txt.gz · Last modified: 2022/06/27 09:21 (external edit)
