Regular Expressions

From Kolmafia
Revision as of 07:07, 10 May 2010 by imported>Bale

Introduction

Regular expressions (commonly shortened to "regex") are a language for writing very explicit patterns for searching strings. The regex language has wildcards for virtually every pattern of characters you might want to search for. Only the most common forms of regexes are described on this page. For more details, search the internet, where you will find many detailed resources on the subject. This writer will point the student at this tutorial in particular.


Commonly used Regular Expressions

Literal Characters

A character will match the first instance of itself in a string.

  • E.g. a will match the first a in "Jack is a dull boy."
  • E.g. cat is a set of three literal characters which will find a match in "about cats and dogs."
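
The two examples above can be checked with a short Java sketch. Java is used here because ASH regexes wrap java.util.regex. The class name LiteralMatchDemo and the helper firstMatchIndex are made up for illustration and are not part of KoLmafia.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralMatchDemo {
    // Hypothetical helper: index of the first match of regex in text, or -1 if none.
    static int firstMatchIndex(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // The first "a" is the second character of "Jack".
        System.out.println(firstMatchIndex("a", "Jack is a dull boy."));     // 1
        // "cat" is found inside "cats".
        System.out.println(firstMatchIndex("cat", "about cats and dogs.")); // 6
        // Literal characters are case sensitive, so "j" does not match "J".
        System.out.println(firstMatchIndex("j", "Jack is a dull boy."));     // -1
    }
}
```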


Special Characters

It's often more interesting to search for less specific patterns than literal characters. There are a number of characters reserved for this purpose. These special characters are often called "metacharacters". If you want to use any of these characters as a literal in a regex, you need to escape them with a backslash.

  • Backslash \
    Used to grant special meaning to a normally literal character or employ a special character as a literal.
    E.g. If you want to find the beginning of a word, the combination \b will match a word's boundary.
    E.g. to match "1+1=2", the correct regex is 1\+1=2. Otherwise, the plus sign will have a special meaning.
  • Question mark ?
    The question mark makes the preceding token in the regular expression optional.
    E.g. colou?r matches both "colour" and "color".
  • Asterisk or star *
    The asterisk attempts to match the preceding token zero or more times.
  • Plus sign +
    The plus attempts to match the preceding token once or more.
  • Period or dot .
    Matches any character except for line breaks.
  • Caret ^
    Matches the beginning of the string only.
    E.g. ^the will match only the first word in "the way of the world."
  • Dollar sign $
    Matches the end of the string only.
    E.g. dog$ will match only the last word in "dog eat dog".
  • Opening and closing round brackets ( and )
    Used for grouping, allowing a regex operator (like +) to be applied to the entire group. It also creates a backreference (a captured group) storing the text that the group matched.
  • Opening and closing square bracket [ and ]
    Used to create "character sets" to match only one of several characters. Inside these brackets different rules apply to some characters: a leading ^ negates the set, and - between two characters defines a range.
    E.g. gr[ae]y will match either "gray" or "grey", but it will not match "graey".
  • Opening and closing braces { and }
    This is a limited repetition operator: {min,max} matches what precedes it at least min and at most max times.
    E.g. \b[1-9][0-9]{2,4}\b matches a number between 100 and 99999. (\b matches a word boundary.)
  • Vertical bar or pipe symbol |
    This is an "or" operator to match one of several possibilities.
    E.g. \b(cat|dog|fish)\b will match either "cat", "dog" or "fish".


Other Common Matchers

\w Matches any word character (alphanumeric & underscore).
\W Matches any character that is not a word character (alphanumeric & underscore).
\d Matches any digit character (0-9).
\D Matches any character that is not a digit character (0-9).
\s Matches any whitespace character (spaces, tabs, line breaks).
\S Matches any character that is not a whitespace character (spaces, tabs, line breaks).
\n Line break character.
\t Tab character.
\b Matches a word boundary position: the point between a word character and a non-word character, or the beginning or end of the string when it borders a word character.
\B Matches any position that is not a word boundary.
[A-Za-z] Matches any single character in the range a-z or A-Z.
[^A-Za-z] Matches any single character, except for the range a-z or A-Z.
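
A few of these matchers can be tried out with the Java sketch below, again assuming java.util.regex semantics. The class ShorthandDemo, the helper allMatches and all sample strings are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ShorthandDemo {
    // Hypothetical helper: every non-overlapping match of the regex, in order.
    static List<String> allMatches(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) out.add(m.group());
        return out;
    }

    public static void main(String[] args) {
        // \d+ picks out runs of digits.
        System.out.println(allMatches("\\d+", "3 turtles, 12 frogs"));      // [3, 12]
        // \w+ treats the underscore as a word character; + and spaces are not.
        System.out.println(allMatches("\\w+", "big_stick +5 damage"));      // [big_stick, 5, damage]
        // \b on both sides only matches "cat" as a whole word.
        System.out.println(allMatches("\\bcat\\b", "cat catalog bobcat"));  // [cat]
        // \B requires that "cat" does NOT start at a word boundary (only "bobcat").
        System.out.println(allMatches("\\Bcat", "cat catalog bobcat").size()); // 1
        // A negated set removes everything that is not a letter.
        System.out.println("R2-D2 unit".replaceAll("[^A-Za-z]", ""));       // RDunit
        // \s+ splits on any run of spaces or tabs.
        System.out.println("a  b\tc".split("\\s+").length);                 // 3
    }
}
```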


Testing Resources

If you're not sure if your regex will work, try testing it with one of these resources:


Using Regexes in KoLmafia

Regular expressions in ASH are wrappers for the Java java.util.regex package. You can find detailed information about that in this Java Tutorial. Only the highlights will be described in this section.

Using regular expressions in ASH follows this basic formula:

  1. First, a regular expression needs to be defined with the matcher datatype. Defining a matcher also requires the use of the create_matcher() function.
    Important: in ASH, backslashes have special meaning inside a string, so every backslash in a regex must be escaped with a second backslash, or else ASH will interpret it differently.
  2. Then the matcher can be operated upon by the various regex functions, notably find().
  3. Finally, if there were backreferences in the matcher, they can be checked using the group() function.
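
The same three steps, including the doubled backslash, look roughly like this in the underlying Java package. This is only a sketch of the java.util.regex side, not of how KoLmafia implements create_matcher(). The class EscapeDemo, the helper firstNumber and the sample sentence are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeDemo {
    // Hypothetical helper: the first run of digits in the text, or "" if there is none.
    static String firstNumber(String text) {
        // Step 1: define the matcher. The string literal "(\\d+)" is the regex (\d+).
        Matcher m = Pattern.compile("(\\d+)").matcher(text);
        // Step 2: find() looks for the next match.
        if (m.find()) {
            // Step 3: group(1) returns what the first pair of parentheses captured.
            return m.group(1);
        }
        return "";
    }

    public static void main(String[] args) {
        // The doubled backslash is one character once the string literal is parsed.
        System.out.println("(\\d+)".length());                            // 5
        System.out.println(firstNumber("You have 42 adventures left."));  // 42
    }
}
```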

This example will use a regular expression to determine how many chamois are left in the slime tube's bucket.

// Visit the bucket to get the page's text
string page = visit_url("clan_slimetube.php?action=bucket");

// In the following matcher: (is|are) will match either word
//     (\\d+) will match and capture 1 or more digits. Note the double-backslash!
matcher cham_left = create_matcher("(There (is|are) (\\d+) chamoi(s|x))( in the bucket.)" , page);

// Then you use find() to capture the patterns inside the parentheses.
if(cham_left.find()) {

   // Finally group() is used to reference the patterns that were backreferenced.
   print(cham_left.group(1)+ " left"+ cham_left.group(5), "blue");
} else
   print("The bucket is empty.", "blue");
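
To see why the example above uses group(1) and group(5), it helps to know that groups are numbered by the position of their opening parenthesis, counting from the left. The Java sketch below applies the same pattern to a made-up page string, since the real page text is not reproduced here. The class ChamoisDemo, its methods and the sample text are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChamoisDemo {
    // Same pattern as the ASH example. Group numbers by opening parenthesis:
    // 1 = whole "There ... chamoi(s|x)" phrase, 2 = is|are, 3 = digits, 4 = s|x, 5 = " in the bucket."
    static final Pattern CHAMOIS =
        Pattern.compile("(There (is|are) (\\d+) chamoi(s|x))( in the bucket.)");

    // Builds the same message the ASH example prints.
    static String describe(String page) {
        Matcher m = CHAMOIS.matcher(page);
        if (!m.find()) return "The bucket is empty.";
        return m.group(1) + " left" + m.group(5);
    }

    // Reads only the number, which is captured by group 3.
    static int chamoisCount(String page) {
        Matcher m = CHAMOIS.matcher(page);
        return m.find() ? Integer.parseInt(m.group(3)) : 0;
    }

    public static void main(String[] args) {
        String page = "<p>There are 5 chamoix in the bucket.</p>"; // hypothetical page text
        System.out.println(describe(page));      // There are 5 chamoix left in the bucket.
        System.out.println(chamoisCount(page));  // 5
        System.out.println(describe("<p>Nothing here.</p>")); // The bucket is empty.
    }
}
```

If only the number is needed, group 3 is the one to read. The outer groups exist so the surrounding sentence can be reused in the printed message.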