Regular Expressions
Latest revision as of 21:48, 2 August 2011
Introduction
Regular expressions (commonly shortened to "regex") are a language designed for writing very precise patterns for searching strings. The regex language has wildcards for virtually every pattern of characters you might want to search for. Only the most commonly used forms of regexes are described on this page. For more detail, search the internet, where you will find many thorough resources on the subject. This writer will point the student at this tutorial in particular.
Commonly used Regular Expressions
Literal Characters
A character will match the first instance of itself in a string.
- E.g. a will match the first a in "Jack is a dull boy."
- E.g. cat is a sequence of three literal characters, which will find a match in "about cats and dogs."
Special Characters
It's often more interesting to search for less specific patterns than literal characters. There are a number of characters reserved for this purpose. These special characters are often called "metacharacters". If you want to use any of these characters as a literal in a regex, you need to escape them with a backslash.
- Backslash \
- Used to grant special meaning to a normally literal character or employ a special character as a literal.
- E.g. If you want to find the beginning of a word, the combination \b will match a word's boundary.
- E.g. to match "1+1=2", the correct regex is 1\+1=2. Otherwise, the plus sign will have a special meaning.
- Question mark ?
- The question mark makes the preceding token in the regular expression optional.
- E.g. colou?r matches both "colour" and "color".
- Asterisk or star *
- The asterisk attempts to match the preceding token zero or more times.
- Plus sign +
- The plus attempts to match the preceding token once or more.
- Period or dot .
- Matches any character except for line breaks.
- Caret ^
- Matches the beginning of the string only.
- E.g. ^the will match only the first word in "the way of the world."
- Dollar sign $
- Matches the end of the string only.
- E.g. dog$ will match only the last word in "dog eat dog".
- Opening and closing round brackets ( and )
- Used for grouping, which allows a regex operator (like +) to be applied to the entire group. This also creates a capturing group that stores the matched text.
- Opening and closing square bracket [ and ]
- Used to create "character sets" that match exactly one of several characters. Inside these brackets, different rules apply to some characters, notably ^ and -: a ^ at the start of the set negates it, and a - between two characters defines a range.
- E.g. gr[ae]y will match either "gray" or "grey", but it will not match "graey".
- Opening and closing braces { and }
- This is a limited repetition operator: {min,max} matches the token that precedes it at least min and at most max times.
- E.g. \b[1-9][0-9]{2,4}\b matches a number between 100 and 99999. (\b matches a word boundary.)
- Vertical bar or pipe symbol |
- This is an "or" operator to match one of several possibilities.
- E.g. \b(cat|dog|fish)\b will match either "cat", "dog" or "fish".
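Since ASH hands its patterns to Java's java.util.regex, the examples above can be tried out in plain Java. The following is only a minimal sketch; the class name MetacharDemo, the helper firstMatch, and the sample strings other than those quoted above are made up for illustration. Note that backslashes are doubled inside Java string literals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetacharDemo {
    // Hypothetical helper: returns the first substring of text that
    // regex matches, or null if there is no match at all.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // ? makes the preceding token optional
        System.out.println(firstMatch("colou?r", "What color is it?"));          // color
        // a character set matches exactly one of its members
        System.out.println(firstMatch("gr[ae]y", "a grey cat"));                  // grey
        System.out.println(firstMatch("gr[ae]y", "graey"));                       // null
        // an escaped + is a literal plus sign
        System.out.println(firstMatch("1\\+1=2", "1+1=2"));                       // 1+1=2
        // {2,4} repeats the preceding token two to four times
        System.out.println(firstMatch("\\b[1-9][0-9]{2,4}\\b", "42 and 1234"));   // 1234
        // alternation between word boundaries skips the "dog" inside "hotdog"
        System.out.println(firstMatch("\\b(cat|dog|fish)\\b", "hotdog, dog"));    // dog
    }
}
```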
Other Common Matchers
\w | Matches any word character (alphanumeric & underscore). |
\W | Matches any character that is not a word character (alphanumeric & underscore). |
\d | Matches any digit character (0-9). |
\D | Matches any character that is not a digit character (0-9). |
\s | Matches any whitespace character (spaces, tabs, line breaks). |
\S | Matches any character that is not a whitespace character (spaces, tabs, line breaks). |
\n | Line break character. |
\t | Tab character. |
\b | Matches a word boundary position: the spot between a word character and a non-word character (such as whitespace), or the beginning or end of the string next to a word character. |
\B | Matches any position that is not a word boundary. |
[A-Za-z] | Matches any single character in the range a-z or A-Z. |
[^A-Za-z] | Matches any single character, except for the range a-z or A-Z. |
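As a rough illustration of the table above, this Java sketch collects every match of a few of these matchers in one string. The class name ShorthandDemo, the helper allMatches, and the sample text are assumptions made for the example, not part of KoLmafia.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShorthandDemo {
    // Hypothetical helper: collects every non-overlapping match of regex
    // in text, in the order the matches are found.
    static List<String> allMatches(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Meat: 1500, turns_left 42";
        System.out.println(allMatches("\\d+", text));       // [1500, 42]
        System.out.println(allMatches("\\w+", text));       // [Meat, 1500, turns_left, 42]
        System.out.println(allMatches("[A-Za-z]+", text));  // [Meat, turns, left]
    }
}
```

The underscore counts as a word character for \w, but it is not in the range [A-Za-z], which is why "turns_left" is split in the last line.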
Greedy vs Lazy Matching
Beware greedy matching! Matchers that can match multiple tokens will attempt to match as much as possible.
- E.g. <.+> will match "<b>test</b>" in "This is a <b>test</b>." instead of matching only the first html tag as you might have expected.
To fix this, you can make the match lazy by adding a ? after the greedy quantifier.
- E.g. <.+?> will match only "<b>" in "This is a <b>test</b>." so that you can pick out html tags.
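The difference is easy to check against the underlying Java engine. This is a small sketch using the same test string as above; the class name GreedyLazyDemo and the helper firstMatch are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyLazyDemo {
    // Hypothetical helper: returns the first match of regex in text, or null.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "This is a <b>test</b>.";
        // Greedy: .+ grabs as much as it can, up to the last ">"
        System.out.println(firstMatch("<.+>", text));   // <b>test</b>
        // Lazy: .+? stops at the first ">" that completes a match
        System.out.println(firstMatch("<.+?>", text));  // <b>
    }
}
```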
Special Groups
These are special types of groups that allow you to do more advanced things with regex. For the most part, they are declared by placing a question mark immediately after the group's opening parenthesis.
Non-capturing groups are declared by adding a colon after the question mark.
- E.g. (?:hello) will match "hello" in "helloworld" without actually creating a new capturing group.
Flags (mode modifiers) can be set in a standalone special group that matches nothing itself, so it has a length of 0.
- E.g. (?i)hello will match "HELLO", since (?i) tells the regex to ignore case.
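Negative/positive lookahead/lookbehind can help you fine-tune your regex by restricting what can appear around a match without capturing more of the target string.
- E.g. (?<!\\S)([\\d]+)(?!\\S) uses negative lookbehind and negative lookahead, so it will only match a series of digits that is not bounded by non-space characters, which is to say, a series of digits bounded by whitespace or by the beginning or end of the string. It is similar to, but not quite identical to, (?<=\\s)([\\d]+)(?=\\s): the positive version requires an actual whitespace character on both sides, so it will not match digits at the very beginning or end of the string. (The backslashes are doubled here because the patterns are written as ASH strings.)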
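Since ASH's regular expressions are passed directly to Java, ASH supports only what the underlying java.util.regex implementation supports. As of this revision, that means ASH does not support named groups (Java only added them in Java 7).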
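The behaviour of these groups can be checked directly in Java, which is what ASH passes its patterns to. The sketch below is illustrative only: the class name SpecialGroupsDemo, the helpers groupCount and allMatches, and the sample text are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpecialGroupsDemo {
    // Hypothetical helper: number of capturing groups in a pattern.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Hypothetical helper: every non-overlapping match of regex in text.
    static List<String> allMatches(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // (?:...) groups without capturing
        System.out.println(groupCount("(hello)(world)"));    // 2
        System.out.println(groupCount("(?:hello)(world)"));  // 1

        // (?i) switches on case-insensitive matching
        System.out.println(Pattern.matches("(?i)hello", "HELLO"));  // true

        String text = "12 7 abc3 45";
        // Negative lookarounds also accept the start and end of the string
        System.out.println(allMatches("(?<!\\S)([\\d]+)(?!\\S)", text));  // [12, 7, 45]
        // Positive lookarounds need a real whitespace character on both sides
        System.out.println(allMatches("(?<=\\s)([\\d]+)(?=\\s)", text));  // [7]
    }
}
```

In both lookaround patterns the "3" in "abc3" is rejected because it is preceded by a non-space character.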
Testing Resources
If you're not sure if your regex will work, try testing it with one of these resources:
Using Regexes in KoLmafia
Regular expressions in ASH are wrappers for the Java java.util.regex package. You can find detailed information about that in this Java Tutorial. Only the highlights will be described in this section.
Using regular expressions in ASH follows this basic formula:
- First, a regular expression needs to be defined with the matcher datatype. Defining a matcher also requires the use of the create_matcher() function.
  - Important: in ASH, backslashes have special meaning inside a string, so any backslash in a regex needs to be escaped with a second backslash, or else ASH will interpret it differently.
- Then the matcher can be operated upon by the various regex functions, notably find().
- Finally, if there were capturing groups in the matcher, they can be checked using the group() function.
This example will use a regular expression to determine how many chamois are left in the slime tube's bucket.
// Visit the bucket to get the page's text
string page = visit_url("clan_slimetube.php?action=bucket");
// In the following matcher: (is|are) will match either word
// (\\d+) will match and capture 1 or more digits. Note the double backslash!
// \\. matches a literal period (an unescaped . would match any character).
matcher cham_left = create_matcher("(There (is|are) (\\d+) chamoi(s|x))( in the bucket\\.)", page);
// Then use find() to look for the pattern and fill in the capturing groups.
if (cham_left.find()) {
    // Finally, group() is used to reference the text that was captured.
    print(cham_left.group(1) + " left" + cham_left.group(5), "blue");
} else {
    print("The bucket is empty.", "blue");
}
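For readers curious about what happens underneath, here is a rough Java sketch of the same three steps (define the matcher, call find(), read the groups) using java.util.regex directly. It is not how KoLmafia itself is implemented; the class name ChamoisDemo, the method describeBucket, and the sample page text are hypothetical stand-ins, since visit_url() is not available outside ASH.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChamoisDemo {
    // Hypothetical stand-in for the ASH script: takes the page text as a
    // parameter instead of fetching it with visit_url().
    static String describeBucket(String page) {
        // Step 1: define the matcher. The final dot is escaped so that it
        // matches a literal period.
        Matcher chamLeft = Pattern
                .compile("(There (is|are) (\\d+) chamoi(s|x))( in the bucket\\.)")
                .matcher(page);
        // Step 2: find() searches the page for the pattern.
        if (chamLeft.find()) {
            // Step 3: group(n) returns what the n-th capturing group matched.
            return chamLeft.group(1) + " left" + chamLeft.group(5);
        }
        return "The bucket is empty.";
    }

    public static void main(String[] args) {
        // Made-up page fragment, for illustration only
        String page = "<td>There are 7 chamoix in the bucket.</td>";
        System.out.println(describeBucket(page));           // There are 7 chamoix left in the bucket.
        System.out.println(describeBucket("<td></td>"));    // The bucket is empty.
    }
}
```

Groups are numbered by the position of their opening parenthesis, so group 1 is the whole "There are 7 chamoix" phrase, group 3 is the number alone, and group 5 is " in the bucket."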