Differences
This shows you the differences between two versions of the page.
02_browsing:04_queries:02_regex [2020/04/17 21:00] – simone | 02_browsing:04_queries:03_regex [2022/06/27 09:21] (current) – external edit 127.0.0.1 | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Regular Expressions ====== | + | ====== |
In order to search for spelling variants, different forms of a lemma, or similar patterns, you need to formulate regular expressions (RegEx) in ANNIS. To do so, put your query between slashes. | In order to search for spelling variants, different forms of a lemma, or similar patterns, you need to formulate regular expressions (RegEx) in ANNIS. To do so, put your query between slashes. | ||
Line 10: | Line 10: | ||
===== A (very) short introduction to RegEx ===== | ===== A (very) short introduction to RegEx ===== | ||
- | "In computing, regular expressions, | + | RegEx takes a pattern of characters you enter into the search field and looks for matches of these characters in the database. Let us assume that the database to be queried is a string of characters like "the man manually attached the tube in Manchester" |
- | + | ||
- | As Wikipedia tells us, RegEx takes a pattern of characters you enter into the search field and looks for matches of these characters in the database. Let us assume that the database to be queried is a string of characters like "the man manually attached the tube in Manchester" | + | |
However, RegEx also allows you to search for such things as alternatives (//man// or //men//), word boundaries etc. RegEx syntax is widely used in programming languages. In what follows, we try to offer an accessible overview of the functions you might use most often in this corpus. | However, RegEx also allows you to search for such things as alternatives (//man// or //men//), word boundaries etc. RegEx syntax is widely used in programming languages. In what follows, we try to offer an accessible overview of the functions you might use most often in this corpus. | ||
Line 43: | Line 41: | ||
==Variable letters== | ==Variable letters== | ||
- | If you are looking for any letter, you can use '' | + | If you are looking for any letter, you can use '' |
Line 85: | Line 83: | ||
== Diacritica== | == Diacritica== | ||
- | This corpus is set up so as to recognize umlauts and letters with accents as individuals (Keep in mind that this is not the case in many other uses of RegEx. Especially in programs that were developed in the US, a <ü> is not considered as a letter but rather as a boundary). In our corpus, searching for ''/ | + | This corpus is set up so as to recognize umlauts and letters with accents as individuals (keep in mind that this is not the case in many other uses of RegEx. Especially in programs that were developed in the US, a <ü> is not considered as a letter but rather as a boundary). In our corpus, searching for ''/ | ||
=== Digits=== | === Digits=== | ||
Line 103: | Line 101: | ||
Many different characters can occur in between your letters and digits: commas, full stops, spaces etc. Most of these characters can be used for queries like letters or numbers: | Many different characters can occur in between your letters and digits: commas, full stops, spaces etc. Most of these characters can be used for queries like letters or numbers: | ||
* space | * space | ||
- | | + | |
* dash (-) | * dash (-) | ||
* semicolon (;) | * semicolon (;) | ||
Line 112: | Line 110: | ||
* exclamation mark (!) | * exclamation mark (!) | ||
- | NB: most of these characters do have a special function as well when they appear in a specific position. As you will see below, { } is one of the possible | + | NB: most of these characters do have a special function as well when they appear in a specific position. As you will see below, { } is one of the possible |
Other separators are reserved by the RegEx syntax. To use them by their ordinary value, you have to place a backslash in front of them. Thus, you type in ''/ | Other separators are reserved by the RegEx syntax. To use them by their ordinary value, you have to place a backslash in front of them. Thus, you type in ''/ | ||
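To illustrate the effect of the backslash outside ANNIS, here is a minimal Python sketch. The test strings and the choice of the question mark as the reserved character are assumptions made for illustration; they are not taken from the corpus. Python's ''re.fullmatch'' is used because it requires the whole string to match the pattern.

```python
import re

# Hypothetical test strings, not taken from the corpus.
candidates = ["hall", "hallo", "hallo?"]

# Unescaped: "?" acts as a quantifier and makes the preceding "o" optional.
unescaped = re.compile(r"hallo?")
# Escaped: "\?" stands for a literal question mark.
escaped = re.compile(r"hallo\?")

unescaped_hits = [s for s in candidates if unescaped.fullmatch(s)]
escaped_hits = [s for s in candidates if escaped.fullmatch(s)]

print(unescaped_hits)
print(escaped_hits)
```

Without the backslash the question mark is never matched as a character; with it, only the string that actually ends in a question mark is found.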
Line 129: | Line 127: | ||
===Word boundaries=== | ===Word boundaries=== | ||
- | In ANNIS you can query on different layers. | + | In ANNIS you can query on different layers. |
- | Let us look again at the phrase | + | Let us look again at the sentence |
|the|man|manually|attached|the|tube|in|manchester| | |the|man|manually|attached|the|tube|in|manchester| | ||
Line 139: | Line 137: | ||
|the man manually attached the tube in Manchester| | |the man manually attached the tube in Manchester| | ||
- | Accordingly, | + | Accordingly, |
- | If you query for //man// on the message level, you will find nothing, because ANNIS will search for a whole message that contains only these three characters. In order to actually find the word you are looking for, you have to query for "any characters followed by the string //man// followed by any characters" | + | If you query for //man// on the message level, you will find nothing, because ANNIS will search for a whole message that contains only these three characters. In order to actually find the word you are looking for, you have to query for "any characters |
- | msg=/.*?man.*/ | + | '' |
and will find //man// but also // | and will find //man// but also // | ||
- | If you want to find only //man//, you have to query for the three letters surrounded by boundaries (i.e. spaces, tabs, full stops, commas, newlines etc.). The string for a boundary is //\b//. The query for //man// and only //man// within a message would thus look as follows: | + | If you want to find only //man//, you have to query for the three letters surrounded by boundaries (i.e. spaces, tabs, full stops, commas, newlines etc.). The string for a boundary is '' | ||
- | msg=/ | + | '' |
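The difference between the three message-level queries can be tried out in Python. This is only a sketch: it assumes that ANNIS compares the pattern with the whole message (as described above), which ''re.fullmatch'' imitates, and the pattern ''.*\bman\b.*'' is an assumed spelling of the boundary query, not necessarily the exact one used on this page. The helper name ''matches_whole_value'' is hypothetical.

```python
import re

message = "the man manually attached the tube in Manchester"


def matches_whole_value(pattern: str, value: str) -> bool:
    """Return True only if the pattern covers the entire value."""
    return re.fullmatch(pattern, value) is not None


# Bare "man": fails, because the message is longer than these three letters.
bare = matches_whole_value(r"man", message)

# "man" with arbitrary context: succeeds, also for "manually".
loose = matches_whole_value(r".*?man.*", message)
loose_on_manually = matches_whole_value(r".*?man.*", "he manually attached it")

# "man" between word boundaries: succeeds only for the separate word.
strict = matches_whole_value(r".*\bman\b.*", message)
strict_on_manually = matches_whole_value(r".*\bman\b.*", "he manually attached it")

print(bare, loose, loose_on_manually, strict, strict_on_manually)
```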
Line 156: | Line 154: | ||
====Quantifiers==== | ====Quantifiers==== | ||
- | Sometimes you might be looking for an expression which can be written with or without repeating letters. E.g. you might want to look for //hallo, haaallo, halooooo// | + | Sometimes you might be looking for an expression which can be written with or without repeating letters |
- | | + | |
- | * ***** an asterisk means a repetition of 0 or more times | + | |
- | | + | |
- | | + | |
Example: | Example: | ||
- | / | + | '' |
- | will find all variants of hallo | + | will find all variants of //hallo// |
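As a rough illustration of quantifiers, the following Python sketch uses the plus sign (one or more repetitions) on each letter that may be repeated. The pattern ''ha+l+o+'' and the word list are assumptions chosen for this example and may differ from the query shown on this page.

```python
import re

# Assumed pattern: "h", then one or more "a", one or more "l", one or more "o".
hallo_variants = re.compile(r"ha+l+o+")

# Hypothetical tokens; the last two should not count as variants of "hallo".
tokens = ["hallo", "haaallo", "halooooo", "hllo", "hello"]

found = [t for t in tokens if hallo_variants.fullmatch(t)]
print(found)
```

Because every quantified letter must occur at least once, //hllo// is rejected, and //hello// fails because //e// is not part of the pattern.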
- | + | ||
- | + | ||
- | Using quantifiers is much more capable and demanding than this. The examples given here are called | + | |
- | + | ||
- | Hint: if you find these options too complicated, | + | |
==== Alternatives==== | ==== Alternatives==== | ||
- | Above, you have seen that you can query for different letters in one spot, e.g. you can search for //man// and //men// with the expression | + | Above, you have seen that you can query for different letters in one spot, e.g. you can search for //man// and //men// with the expression |
Example: | Example: | ||
- | n(8|acht|ight|uit) | + | '' |
will look for: | will look for: | ||
- | n8 | + | * //n8// |
- | nacht | + | * //nacht// |
- | night | + | * //night// |
- | nuit | + | * //nuit// |
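The alternatives example can be checked with a short Python sketch. The pattern is the one given above; the list of tokens, including the two non-matching ones, is made up for illustration.

```python
import re

# Pattern from the example above: "n" followed by one of four alternatives.
night_words = re.compile(r"n(8|acht|ight|uit)")

# Hypothetical tokens; "nite" and "n" are included to show what is not matched.
tokens = ["n8", "nacht", "night", "nuit", "nite", "n"]

matched = [t for t in tokens if night_words.fullmatch(t)]
print(matched)
```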
==== A final word==== | ==== A final word==== | ||
- | What you have read here, is only a selection of the possibilities RegEx offers. To keep things more or less simple for you, we tried to document all the features you are likely to use while omitting everything you probably will not care about. Also, there are different implementations of RegEx in different programs and they support different features | + | What you have read here is only a selection |
02_browsing/04_queries/03_regex.txt · Last modified: 2022/06/27 09:21 by 127.0.0.1