01_corpus:02_preprocessing:01_anonymization
Differences
This shows you the differences between two versions of the page.
01_corpus:02_preprocessing:01_anonymization [2019/12/18 08:51] – simone | 01_corpus:02_preprocessing:01_anonymization [2022/06/27 09:21] (current) – external edit 127.0.0.1 | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Anonymization ====== | + | ====== |
- | The data in the corpus was anonymized with the same methodology that we already applied successfully in the [[http:// | + | |
===== General privacy ===== | ===== General privacy ===== | ||
Line 8: | Line 7: | ||
===== First names ===== | ===== First names ===== | ||
- | A reference list of first names in different languages was used to remove all first names. As always with such a task, it was a balance act between precision and recall. On the one hand, all first names should be removed from the data, on the other hand no information that is homograph to a first name should get lost. | + | First names were identified using freely available reference lists for the respective language. It was then decided not to remove first names outright but to rotate them, meaning that the name Peter in a chat would not get replaced by e.g. [FirstName], 
- | To get the best possible result, it was decided to not actually remove first names, but to rotate them, meaning that the name Peter in an chat would not get replaced by e.g. [FirstName], | + | |
- The text remains easy to read. | - The text remains easy to read. | ||
- | - Because Peter is always replaced with Ferdinand, all occurrences of the same name remain the same. Conversations can therefor | + | - Because Peter is always replaced with Ferdinand, all occurrences of the same name remain the same. Conversations can therefore |
- | - Names that did not get replaced because of homography are not recognizable as such, i.e. if the name //Minna// appears in a chat, nobody can know, whether this is a replaced name or whether it is a name that was not replaced because it is a homograph to some rare word in Romansh. The scientists working with the data will therefor always assume that first names they come across are actually not the real names used in the chat. | + | |
- | Tests show, that more than 95% of all first names were in fact removed. | + | |
- | We tried to assign the same sex to the rotated names as to the original one, such as to keep the text readable. Very often, this resulted in good replacements, | + | Tests showed that more than 95% of all first names were found and rotated in this way.
+ | |||
+ | We tried to assign the same sex to the rotated names as to the original ones, so as to keep the text readable. Very often, this resulted in good replacements, | ||
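The rotation described above can be illustrated with a minimal Python sketch. It is not the project's actual implementation: the name lists, the cyclic shift and the function names (`build_rotation`, `rotate_first_names`) are assumptions made for illustration only. It shows the two properties mentioned in the text: every occurrence of a name gets the same replacement, and replacements stay within the list for the same sex.

```python
import re

# Hypothetical reference lists; the real lists are not part of this page.
FEMALE_NAMES = ["Anna", "Maria", "Minna", "Petra"]
MALE_NAMES = ["Hans", "Peter", "Ferdinand", "Urs"]


def build_rotation(names, shift=1):
    """Map each name to the one `shift` positions later in its list (cyclic)."""
    return {name: names[(i + shift) % len(names)] for i, name in enumerate(names)}


# Separate rotations per list, so a replacement keeps the sex of the original.
ROTATION = {**build_rotation(FEMALE_NAMES), **build_rotation(MALE_NAMES)}

WORD = re.compile(r"\b\w+\b")


def rotate_first_names(text):
    """Replace every listed first name with its fixed counterpart.

    Words that are not in the lists are returned unchanged. Matching is
    case-sensitive here, which is a simplification.
    """
    return WORD.sub(lambda m: ROTATION.get(m.group(0), m.group(0)), text)
```

Because the whole text is rewritten in a single pass, a replacement such as Ferdinand is never rotated a second time.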
===== Last names ===== | ===== Last names ===== | ||
- | Only very few last names can in fact be found in the data. Because of this limitation, the same procedure as with first names could not be applied, because additionally some of the last names used are very rare if not unique. It was therefor | + | Only very few last names can in fact be found in the data. It was decided to replace all last names with [LastName]. |
===== Numbers ===== | ===== Numbers ===== | ||
- | In an effort | + | In order to remove information such as phone numbers and bank account numbers, all numbers with three or more digits were masked by replacing each digit with one N. Reliability here is 100%.
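The masking rule for numbers can be sketched in a few lines of Python. This is an illustration under assumptions, not the code used for the corpus: the function name `mask_numbers` is hypothetical, and a "number" is taken to mean an unbroken run of digits. How digit groups separated by spaces or punctuation were handled is not stated on this page.

```python
import re

# An unbroken run of three or more digits.
LONG_NUMBER = re.compile(r"\d{3,}")


def mask_numbers(text):
    """Replace each digit of every run of three or more digits with one N."""
    return LONG_NUMBER.sub(lambda m: "N" * len(m.group(0)), text)
```

Runs of one or two digits, such as times or small counts, are left untouched, so the text stays readable while longer identifiers lose their content but keep their length.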
===== E-Mail addresses ===== | ===== E-Mail addresses ===== | ||
- | All email adresses | + | All email addresses |
===== Street addresses ===== | ===== Street addresses ===== |
01_corpus/02_preprocessing/01_anonymization.1576655469.txt.gz · Last modified: 2022/06/27 09:21 (external edit)