After collecting the data, we had around 650 chats in different languages but no idea which chat was in which language. Furthermore, we had promised to anonymize the data, and we did not have a tool to browse the data in the available format. Thus, before making the data available to the research team, we had to pre-process them.
The data in the corpus were anonymized by means of the same methodology that we had already applied successfully in the SMS corpus.
While the project did not intend to collect private information about the informants (other than what they provided in the questionnaire), it could not be assumed that the informants would not send personal information in their SMS, so it was the team's task to remove specific pieces of information again. These steps were performed by means of computational linguistics. The stories told, however, might still allow you to recognize individual informants. If that is the case, we ask you to comply with common research ethics and keep that knowledge to yourself.
A reference list of first names in different languages was used to remove all first names. As always with such a task, it was a balancing act between precision and recall. On the one hand, all first names should be removed from the data; on the other hand, no information that is homographic with a first name should get lost. To get the best possible result, it was decided not to actually remove first names but to rotate them, meaning that the name Peter in an SMS would not be replaced by e.g. [FirstName], but by e.g. Ferdinand. This procedure has several advantages:
Tests show that more than 95% of all first names were in fact removed.
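The rotation idea can be illustrated with a minimal sketch. It is not the project's actual tool: the short name list, the fixed shift of one position, and the function names `build_rotation` and `rotate_first_names` are assumptions made for illustration only.

```python
import re

# Hypothetical reference list; the real list covered first names in several languages.
FIRST_NAMES = ["Anna", "Laura", "Peter", "Ferdinand"]


def build_rotation(names, shift=1):
    """Map each name to the name `shift` positions further along the list."""
    return {name: names[(i + shift) % len(names)] for i, name in enumerate(names)}


def rotate_first_names(text, rotation):
    """Replace every whole-word occurrence of a listed name with its substitute.

    A single regex pass is used so that a substituted name is never
    substituted a second time.
    """
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, rotation)) + r")\b")
    return pattern.sub(lambda m: rotation[m.group(1)], text)


rotation = build_rotation(FIRST_NAMES)
print(rotate_first_names("Peter meets Anna", rotation))
```

Matching here is case-sensitive and limited to whole words, so a string such as "Peterson" is left alone while a lowercase "peter" is missed. This is one small instance of the trade-off between precision and recall described above.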
Only very few last names can in fact be found in the data. Because of this, and because some of the last names used are very rare if not unique, the same procedure as with first names could not be applied. It was therefore decided to replace all last names with [LastName] instead. In a combined effort of manual analysis and computational linguistics, more than 95% of all last names were removed.
Numbers
In an effort to remove information about phone numbers, bank accounts etc., all numbers with three or more digits were removed, with each digit replaced by one N. The phone number 079 987 65 43 would thus become NNN NNN 65 43, while 0799876543 would be NNNNNNNNNN. Reliability here is 100%.
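The number rule can be expressed as a single regular-expression substitution. The following is a sketch under the assumption that a "number" means an unbroken run of digits; the function name `mask_numbers` is hypothetical and not taken from the project.

```python
import re


def mask_numbers(text):
    """Replace each digit of every run of three or more digits with 'N'.

    Runs of one or two digits are left untouched.
    """
    return re.sub(r"\d{3,}", lambda m: "N" * len(m.group(0)), text)


print(mask_numbers("079 987 65 43"))  # NNN NNN 65 43
print(mask_numbers("0799876543"))     # NNNNNNNNNN
```

Because each digit is replaced one for one, the length of the message is preserved, and short numbers such as times or ages remain readable.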
All email addresses were removed and replaced with a placeholder, while keeping the number of characters.
Street addresses were removed and replaced by [StreetAddress].
WWW addresses were kept since they contain publicly available information.
Names of cities were kept because they cannot be considered as private information and because they may be important for the understanding of the text.