02_browsing:04_queries:03_regex

Differences

This shows you the differences between two versions of the page.

02_browsing:04_queries:02_regex [2020/04/17 21:16] simone
02_browsing:04_queries:03_regex [2022/06/27 09:21] (current) – external edit 127.0.0.1
Line 1: Line 1:
-====== 2.4.Regular Expressions ======
+====== 2.4.Regular Expressions ======
 To search for spelling variants, different forms of a lemma, or similar patterns, you need to formulate regular expressions (RegEx) in ANNIS. To do so, put your query between slashes.
  
Line 10: Line 10:
  
 ===== A (very) short introduction to RegEx =====
-"In computing, regular expressions, also referred to as RegEx or RegExp, provide a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters." ([[http://en.wikipedia.org/wiki/Regular_expression][Wikipedia]]).
-
-As Wikipedia tells us, RegEx takes a pattern of characters you enter into the search field and looks for matches of these characters in the database. Let us assume that the database to be queried is a string of characters like "the man manually attached the tube in Manchester" without any separation of tokens and words. Let us now query the three letters //man//. In this case, RegEx looks for the letter <m> followed by an <a> and then an <n> in the database, regardless of what the pattern is preceded or followed by. As a result, you will get //man// and //manual//, but you will not get //Manchester//, because the RegEx search is case sensitive, see below.
+RegEx takes a pattern of characters you enter into the search field and looks for matches of these characters in the database. Let us assume that the database to be queried is a string of characters like "the man manually attached the tube in Manchester" without any separation of tokens and words. Let us now query the three letters //man//. In this case, RegEx looks for the letter <m> followed by an <a> and then an <n> in the database, regardless of what precedes or follows the pattern. As a result, you will get the //man// in //man// and the //man// in //manually//, but you will not get //Manchester//, because the RegEx search is case sensitive (see below).
 
 However, RegEx also allows you to search for such things as alternatives (//man// or //men//), word boundaries etc. RegEx syntax is widely supported in programming languages. In what follows, we try to offer a simple overview of the functions you are likely to use most often in this corpus. For more information, we refer you to [[http://www.regular-expressions.info/|regular-expressions.info]].
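
The substring behaviour described above can be tried outside ANNIS. The following is a minimal sketch in Python; its ''re'' module is only a stand-in for whatever engine ANNIS uses, and the variable names are illustrative assumptions.

```python
import re

# The unsegmented example string from the text above.
text = "the man manually attached the tube in Manchester"

# Case-sensitive search: start positions of every "man" in the string.
starts = [m.start() for m in re.finditer(r"man", text)]

# Ignoring case additionally finds the "Man" in "Manchester".
starts_ignore_case = [m.start() for m in re.finditer(r"man", text, re.IGNORECASE)]

# Alternatives: "man" or "men".
alternatives = re.findall(r"man|men", "one man, two men")

print(starts)
print(starts_ignore_case)
print(alternatives)
```

The case-sensitive search reports two positions (inside //man// and //manually//); only the case-insensitive variant also reaches //Manchester//.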
Line 43: Line 41:
  
 ==Variable letters==
-If you are looking for any letter, you can use ''\w'' (Remember as: word character.)
+If you are looking for any letter, you can use ''\w'' (remember as: word character).
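
As a small illustration of ''\w'', here is a hedged sketch in Python. Note that in Python ''\w'' also matches digits and the underscore, not only letters; whether ANNIS behaves identically is an assumption you should check against your corpus. The word list is made up for the example.

```python
import re

# Hypothetical candidates; only the middle character varies.
words = ["man", "men", "m1n", "m_n", "m-n", "mn"]

# "m", then exactly one word character, then "n".
pattern = re.compile(r"m\wn")

# fullmatch requires the whole word to fit the pattern.
matches = [w for w in words if pattern.fullmatch(w)]

print(matches)
```

The dash in //m-n// is not a word character, and //mn// has no character in the middle at all, so neither is matched.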
  
  
Line 85: Line 83:
  
 == Diacritica==
-This corpus is set up so as to recognize umlauts and letters with accents as individuals (Keep in mind that this is not the case in many other uses of RegEx. Especially in programs that were developed in the US, a <ü> is not considered as a letter but rather as a boundary). In our corpus, seearching for ''/mange/'' will therefore not find any occurrences of //mangé//.
+This corpus is set up to treat umlauts and accented letters as distinct letters. Keep in mind that this is not the case in many other uses of RegEx: especially in programs that were developed in the US, a <ü> is often not considered a letter but rather a boundary. In our corpus, searching for ''/mange/'' will therefore not find any occurrences of //mangé//.
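
The difference between engines that treat accented letters as letters and engines that do not can be sketched in Python, where the ''re.ASCII'' flag switches to the restricted behaviour. This is only an analogy under that assumption, not a description of how ANNIS is implemented.

```python
import re

mange_accent = "mang\u00e9"  # "mangé" with a precomposed é
fuer = "f\u00fcr"            # "für"

# A plain "e" is not the same character as "é".
plain_e = re.fullmatch(r"mange", mange_accent)

# Unicode-aware matching (Python's default for str): é counts as a word character.
unicode_match = re.fullmatch(r"mang\w", mange_accent)

# ASCII-only matching: é is no longer a word character.
ascii_match = re.fullmatch(r"mang\w", mange_accent, re.ASCII)

# ASCII-only word boundaries split "für" at the umlaut.
ascii_words = re.findall(r"\b\w+\b", fuer, re.ASCII)

print(plain_e, unicode_match, ascii_match, ascii_words)
```

In the ASCII-only case the <ü> acts as a separator, which is the behaviour the paragraph above warns about.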
  
 === Digits===
Line 103: Line 101:
 Many different characters can occur in between your letters and digits: commas, full stops, spaces etc. Most of these characters can be used for queries like letters or numbers:
    * space
-   * coma
+   * comma
    * dash (-)
    * semicolon (;)
Line 112: Line 110:
    * exclamation mark (!)
 
-NB: most of these characters do have a special function as well when they appear in a specific position. As you will see below, { } is one of the possible way to search for repeating characters. Thus, the character <{> can be recognized as a character in its own right or as a syntactic function depending on its position. The same goes for most of these characters.
+NB: most of these characters also have a special function when they appear in a specific position. As you will see below, { } is one of the possible ways to search for repeating characters. Thus, the character <{> can be treated either as a literal character or as a syntactic operator, depending on its position. The same goes for most of these characters.
 
 Other separators are reserved by the RegEx syntax. To use them with their literal value, you have to place a backslash in front of them. Thus, you type in ''/m\*n/'' to look for //m*n//. These characters are:
Line 141: Line 139:
 Accordingly, the system for querying is different. If you query for ''/man/'' on the token level, you will find exactly one occurrence, namely the token //man//, because all other tokens contain more than those three characters, e.g. //manually// contains five more characters.
 
-If you query for //man// on the message level, you will find nothing, because ANNIS will search for a whole message that contains only these three characters. In order to actually find the word you are looking for, you have to query for "any characters followed by the string //man// followed by any characters" (the function "any characters" will be introduces later on in detail). Such a string will look like:
+If you query for //man// on the message level, you will find nothing, because ANNIS will search for a whole message that contains only these three characters. In order to actually find the word you are looking for, you have to query for "any characters (''.*'') followed by the string //man// followed by any characters". The expression for //any characters// consists of the full stop, which stands for //any character// as shown above, and the asterisk, which stands for any number of repetitions (including none), as explained in the next section. Such a query will look like:
 
-''msg=/.*?man.*/''
+''msg=/.*man.*/''
 
 and will find messages containing //man// but also //manually//.
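
The whole-value matching described above can be imitated in Python with ''re.fullmatch'', which only succeeds if the pattern covers the entire string. This is a sketch under the assumption that ANNIS compares the pattern against the complete token or message, as the text states; splitting on spaces is a simplified stand-in for real tokenization.

```python
import re

message = "the man manually attached the tube in Manchester"
tokens = message.split()  # simplified tokenization, for illustration only

# Token level: only the token that is exactly "man" matches.
token_hits = [t for t in tokens if re.fullmatch(r"man", t)]

# Message level: the bare pattern cannot cover the whole message.
message_hit_plain = re.fullmatch(r"man", message)

# Wrapping the pattern in ".*" lets it cover everything around "man".
message_hit_wrapped = re.fullmatch(r".*man.*", message)

# The wrapped pattern on the token level also accepts "manually".
token_hits_wrapped = [t for t in tokens if re.fullmatch(r".*man.*", t)]

print(token_hits, message_hit_plain is None, message_hit_wrapped is not None, token_hits_wrapped)
```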
Line 166: Line 164:
 Example:
 ''/h+a+l+o+/''
-will find all variants of hallo
+will find all variants of //hallo//
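
To see which spellings the pattern above accepts, here is a minimal sketch in Python. The candidate list is invented for the example, and ''re.fullmatch'' is used on the assumption that the pattern must cover the whole token.

```python
import re

# One or more of each letter, in this order.
pattern = re.compile(r"h+a+l+o+")

# Hypothetical spellings to test.
candidates = ["halo", "hallo", "haaallooo", "hllo", "hall", "Hallo"]

matched = [c for c in candidates if pattern.fullmatch(c)]

print(matched)
```

Every letter has to occur at least once, so //hllo// and //hall// are rejected; //Hallo// is rejected because the search is case sensitive.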
  
  
02_browsing/04_queries/03_regex.1587150988.txt.gz · Last modified: 2022/06/27 09:21 (external edit)
