01_corpus:start

Differences

This shows you the differences between two versions of the page.

01_corpus:start [2019/10/30 13:37] – ↷ Page moved from corpus:start to 01_corpus:start simone
01_corpus:start [2022/06/27 09:21] (current) – external edit 127.0.0.1
Line 1: Line 1:
-====== The corpus ======
 +====== 1. THE CORPUS ======
 +The corpus consists of 617 chats that were sent in by the Swiss population in 2014 through a fixed procedure, which was announced in the press to attract participants. Each chat was checked for [[01_corpus:02_preprocessing|permission]] to use it, and chats without permission were removed. In addition, the available [[01_corpus:03_demographics|demographic data]] were linked to the chats.
  
-The corpus consists of 617 chats that were sent in by the Swiss population in 2014 through a [[corpus:01_collection|fixed procedure]] that was communicated in the press in order to get people interested. The individual chats were checked for their [[corpus:02_preprocessing|permission]] to use them and for chats that had to be [[corpus:02_preprocessing:03_removed|removed]]. Furthermore, [[corpus:03_demographics|demographic data]] (were provided) were linked to the chats.
 +The next processing steps comprised [[01_corpus:02_preprocessing:01_anonymization|anonymization]], the annotation of a [[01_corpus:02_preprocessing:04_languages|main language]] per chat and thus the creation of [[01_corpus:01_subcorpora|subcorpora]], the application of further annotations for [[01_corpus:02_preprocessing:04_languages|languages]] (i.e. each message was annotated for its most likely language, as opposed to the chat-level annotation performed in the first step), [[01_corpus:02_preprocessing:06_pos|part of speech annotations]], and [[01_corpus:02_preprocessing:07_normalization|normalization]] of part of the dialectal Swiss German data.
  
-In a first step the most basic processing of the data took place such as to allow the project members to work with the data. This included the [[corpus:02_preprocessing:01_anonymization|anonymization]] and the annotation of a [[corpus:02_preprocessing:02_language_per_chat|main language]] per chat and thus the creation of [[corpus:subcorpora|subcorpora]].
 +Our authentic WhatsApp chats were gathered in summer 2014. Not all of them made it into the corpus (e.g. duplicates, chats or messages without permission etc.). In its present form, the corpus comprises:
  
-In a later step, more [[corpus:04_annotations|annotations]] were applied to the corpus. This included a more profound annotation of [[corpus:02_preprocessing:02_languages|languages]] (i.e. each message was annotated for its language as opposed to the chat annotation performed in the first step), [[corpus:pos|part of speech annotations]] were applied and the German dialectal data was [[corpus:normalization|normalized]].
 +  * Number of chats: 617
 +  * Number of messages (with permission to be used): 763'644
 +  * Number of informants (who gave their permission): 944
 +  * Number of tokens: 5'155'476 (without redactedQ.* (cf. [[01_corpus:02_preprocessing:02_without_permission|Messages without permission]]))
 +  * Number of emojis: 382'116
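
The figures in the list above use the apostrophe as a thousands separator, so they need a small conversion step before any arithmetic. The following is a minimal sketch, not part of the corpus tooling: the names ''RAW_COUNTS'' and ''parse_swiss_int'' are hypothetical, and the derived ratios are only rough orientation values, since the token count excludes redacted tokens.

```python
# Figures copied from the list above; the apostrophe is a thousands separator.
RAW_COUNTS = {
    "chats": "617",
    "messages": "763'644",
    "informants": "944",
    "tokens": "5'155'476",
    "emojis": "382'116",
}


def parse_swiss_int(text: str) -> int:
    """Convert a string such as 5'155'476 to an int.

    Both the plain apostrophe and the typographic one (U+2019) are removed,
    because both spellings occur in practice.
    """
    return int(text.replace("'", "").replace("\u2019", ""))


counts = {name: parse_swiss_int(value) for name, value in RAW_COUNTS.items()}

# Rough averages derived from the published totals (hypothetical helper values).
tokens_per_message = counts["tokens"] / counts["messages"]
messages_per_chat = counts["messages"] / counts["chats"]

print(f"{tokens_per_message:.2f} tokens per message")
print(f"{messages_per_chat:.0f} messages per chat")
```

Keeping the raw strings next to the parsed values makes it easy to check the numbers against the list when the page is updated.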
 + 
 +The corpus is made up of chats in all four national languages of Switzerland, i.e. German (Swiss German dialect and non-dialectal German), French, Italian and varieties of Romansh. In more detail, the following languages and varieties can be found in the corpus:
 + 
 +Available languages: 
 +  * fra: French 
 +  * ita: Italian 
 +  * roh: Any variety of Romansh 
 +  * gsw: dialectal German as used in Switzerland 
 +  * deu: non-dialectal German 
 +  * eng: English 
 +  * spa: Spanish 
 +  * sla: Any Slavic language 
 + 
 +Romansh varieties: 
 + 
 +  * roh-ja: Jauer Romansh 
 +  * roh-sr: romontsch sursilvan 
 +  * roh-st: rumàntsch sutsilvan 
 +  * roh-sm: rumantsch surmiran 
 +  * roh-pt: rumauntsch puter 
 +  * roh-vl: rumantsch vallader 
 +  * roh-gr: rumantsch grischun  
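
When working with exported annotations, the language tags listed above can be kept in a simple lookup table. The sketch below is an illustration under assumptions: the tags and names are taken from the two lists, while the names ''LANGUAGES'', ''ROMANSH_VARIETIES'', ''base_language'' and ''describe'' are hypothetical and do not come from the corpus tools.

```python
# Language tags and names as given in the "Available languages" list.
LANGUAGES = {
    "fra": "French",
    "ita": "Italian",
    "roh": "Any variety of Romansh",
    "gsw": "dialectal German as used in Switzerland",
    "deu": "non-dialectal German",
    "eng": "English",
    "spa": "Spanish",
    "sla": "Any Slavic language",
}

# Variety tags and names as given in the "Romansh varieties" list.
ROMANSH_VARIETIES = {
    "roh-ja": "Jauer Romansh",
    "roh-sr": "romontsch sursilvan",
    "roh-st": "rumàntsch sutsilvan",
    "roh-sm": "rumantsch surmiran",
    "roh-pt": "rumauntsch puter",
    "roh-vl": "rumantsch vallader",
    "roh-gr": "rumantsch grischun",
}


def base_language(tag: str) -> str:
    """Return the part of the tag before the hyphen, e.g. roh-sr -> roh."""
    return tag.split("-", 1)[0]


def describe(tag: str) -> str:
    """Return the readable name for a language or variety tag."""
    if tag in ROMANSH_VARIETIES:
        return ROMANSH_VARIETIES[tag]
    if tag in LANGUAGES:
        return LANGUAGES[tag]
    raise KeyError(f"unknown language tag: {tag}")


print(describe("roh-sr"), "is a variety of", describe(base_language("roh-sr")))
```

Splitting on the hyphen lets a variety tag fall back to its base language, which is convenient when counting messages per language rather than per variety.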
 + 
 +The corpus can be browsed with [[https://corpus-tools.org/annis/|ANNIS]], which was developed and made available by Anke Lüdeling and her team:
 +
 +Krause, Thomas & Zeldes, Amir (2016): ANNIS3: A new architecture for generic corpus query and visualization. In: Digital Scholarship in the Humanities 2016 (31). [[http://dsh.oxfordjournals.org/content/31/1/118|http://dsh.oxfordjournals.org/content/31/1/118]]
  
  
01_corpus/start.1572439058.txt.gz · Last modified: 2022/06/27 09:21 (external edit)
