User Tools

Site Tools


01_corpus:02_preprocessing:06_pos

Differences

This shows you the differences between two versions of the page.


Previous revision
01_corpus:02_preprocessing:06_pos [2022/06/27 09:21] (current) – external edit 127.0.0.1
Line 1: Line 1:
 +====== 1.2.6 Part of Speech Tagging ======
 +Some sub-corpora have been annotated with Part Of Speech annotations. This concerns WUS_DIALOG_GSW, WUS_FRA, 
 +WUS_FRA_DEMOG, WUS_ITA, WUS_ITA_DEMOG.
 +
 +===== French =====
 +The whole French corpus has been annotated with [[https://team.inria.fr/almanach/fr/melt/|MElt]] (Modified French TreeBank) using the tag set [[http://french-postaggers.tiddlyspot.com/|CC Tagset]]. Available annotations are "mftb_pos" (for part of speech) and "mftb_lem" (for the lemma). The following tags are used:
 +
 +  * ''ADJ'' adjective
 +  * ''ADJWH'' interrogative adjective
 +  * ''ADV'' adverb
 +  * ''ADVWH'' interrogative adverb
 +  * ''CC'' coordinating conjunction
 +  * ''CLO'' object clitic pronoun
 +  * ''CLR'' reflexive clitic pronoun
 +  * ''CLS'' subject clitic pronoun
 +  * ''CS'' subordinating conjunction
 +  * ''DET'' determiner
 +  * ''DETWH'' interrogative determiner
 +  * ''ET'' foreign word
 +  * ''I'' interjection
 +  * ''NC'' common noun
 +  * ''NPP'' proper noun
 +  * ''P'' preposition
 +  * ''P+D'' preposition+determiner amalgam
 +  * ''P+PRO'' prepositon+pronoun amalgam
 +  * ''PONCT'' punctuation mark
 +  * ''PREF'' prefix
 +  * ''PRO'' full pronoun
 +  * ''PROREL'' relative pronoun
 +  * ''PROWH'' interrogative pronoun
 +  * ''V'' indicative or conditional verb form
 +  * ''VIMP'' imperative verb form
 +  * ''VINF'' infinitive verb form
 +  * ''VPP'' past participle
 +  * ''VPR'' present participle
 +  * ''VS'' subjunctive verb form
 +
 +
 +
 +===== Swiss German dialect =====
 +Five chats of the Swiss German dialectal data (34,683 tokens) have been manually normalized and annotated for Part of Speech. The according corpus is called WUS_DIALOG_GSW. Three annotations have been added to each token:
 +
 +  * gloss: The manual normalization
 +  * tt_pos: Part of Speech annotation with [[https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] based on the manually normalized tokens.
 +  * tt_lem: The lemma as assigned by TreeTagger
 +
 +
 +The [[https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/stts_guide.pdf|tagset]] uses the following tags:
 +
 +  * ''ADJA'' attributive adjective (including participles used adjectivally)
 +  * ''ADJD'' predicate adjective; adjective used adverbially
 +  * ''ADV'' adverb (never used as attributive adjective)
 +  * ''APPR'' preposition left hand part of double preposition
 +  * ''APPRART'' preposition with fused article
 +  * ''APPO'' postposition
 +  * ''APZR'' right hand part of double preposition
 +  * ''ART'' article (definite or indefinite)
 +  * ''CARD'' cardinal number (words or figures); also declined
 +  * ''FM'' foreign words (actual part of speech in original language may be appended, e.g. FMADV/ FM-NN)
 +  * ''ITJ'' interjection
 +  * ''KON'' co-ordinating conjunction
 +  * ''KOKOM'' comparative conjunction or particle
 +  * ''KOUI'' preposition used to introduce infinitive clause
 +  * ''KOUS'' subordinating conjunction
 +  * ''NA'' adjective used as noun
 +  * ''NE'' names and other proper nouns
 +  * ''NN'' noun (but not adjectives used as nouns)
 +  * ''PAV [PROAV]'' pronominal adverb
 +  * ''PAVREL'' pronominal adverb used as relative
 +  * ''PDAT'' demonstrative determiner
 +  * ''PDS'' demonstrative pronoun
 +  * ''PIAT'' indefinite determiner (whether occurring on its own or in conjunction with another determiner)
 +  * ''PIS'' indefinite pronoun
 +  * ''PPER'' personal pronoun
 +  * ''PRF'' reflexive pronoun
 +  * ''PPOSS'' possessive pronoun
 +  * ''PPOSAT'' possessive determiner
 +  * ''PRELAT'' relative depending on a noun
 +  * ''PRELS'' relative pronoun (i.e. forms of //der// or //welcher//)
 +  * ''PTKA'' particle with adjective or adverb
 +  * ''PTKANT'' answer particle
 +  * ''PTKNEG'' negative particle
 +  * ''PTKREL'' indeclinable relative particle
 +  * ''PTKVZ'' separable prefix
 +  * ''PTKZU'' infinitive particle //zu//
 +  * ''PWS'' interrogative pronoun
 +  * ''PWAT'' interrogative determiner
 +  * ''PWAV'' interrogative adverb
 +  * ''PWAVREL'' interrogative adverb used as relative
 +  * ''PWREL'' interrogative pronoun used as relative
 +  * ''TRUNC'' truncated form of compound
 +  * ''VAFIN'' finite auxiliary verb
 +  * ''VAIMP'' imperative of auxiliary
 +  * ''VAINF'' infinitive of auxiliary
 +  * ''VAPP'' past participle of auxiliary
 +  * ''VMFIN'' finite modal verb
 +  * ''VMINF'' infinitive of modal
 +  * ''VMPP'' past participle of auxiliary
 +  * ''VVFIN'' finite full verb
 +  * ''VVIMP'' imperative of full verb
 +  * ''VVINF'' infinitive of full verb
 +  * ''VVIZU'' infinitive with incorporated //zu//
 +  * ''VVPP'' past participle of full verb
 +
 +As in the French corpus, there are also combined tags such as //VAFIN+PPER// when a personal pronoun is agglutinated to a verb (//hätti// for 'hätte ich').
 +
 +
 +===== Italian =====
 +The Italian corpus is annotated with the [[https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]], too, but based on the original tokens, i.e. not manually normalized. 
 +
 +   * tt_pos: Part of Speech annotation with TreeTagger
 +   * tt_lem: The lemma as assigned by TreeTagger
 +
 +The following PoS [[https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/italian-tagset.txt|tagset]] was used:
 +  * ''ABR'' abbreviation
 +  * ''ADJ'' adjective
 +  * ''ADV'' adverb
 +  * ''CON'' conjunction
 +  * ''DET:def'' definite article
 +  * ''DET:indef'' indefinite article
 +  * ''FW'' foreign word
 +  * ''INT'' interjection
 +  * ''LS'' list symbol
 +  * ''NOM'' noun
 +  * ''NPR'' name
 +  * ''NUM'' numeral
 +  * ''PON'' punctuation
 +  * ''PRE'' preposition
 +  * ''PRE:det'' preposition+article
 +  * ''PRO'' pronoun
 +  * ''PRO:demo'' demonstrative pronoun
 +  * ''PRO:indef'' indefinite pronoun
 +  * ''PRO:inter'' interrogative pronoun
 +  * ''PRO:pers'' personal pronoun
 +  * ''PRO:poss'' possessive pronoun
 +  * ''PRO:refl'' reflexive pronoun
 +  * ''PRO:rela'' relative pronoun
 +  * ''SENT'' sentence marker
 +  * ''SYM'' symbol
 +  * ''VER:cimp'' verb conjunctive imperfect
 +  * ''VER:cond'' verb conditional
 +  * ''VER:cpre'' verb conjunctive present
 +  * ''VER:futu'' verb future tense
 +  * ''VER:geru'' verb gerund
 +  * ''VER:impe'' verb imperative
 +  * ''VER:impf'' verb imperfect
 +  * ''VER:infi'' verb infinitive
 +  * ''VER:pper'' verb participle perfect
 +  * ''VER:ppre'' verb participle present
 +  * ''VER:pres'' verb present
 +  * ''VER:refl:infi'' verb reflexive infinitive
 +  * ''VER:remo'' verb simple past
  

Donate Powered by PHP Valid HTML5 Valid CSS Driven by DokuWiki