Difference between revisions of "Regular Expressions"

Revision as of 09:54, 8 May 2010

This is just a stub. A place to put some real information later...

Regular expressions in ASH are mostly wrappers around Java's java.util.regex package. For more information about that package, see the Java Regex Tutorial.

There's a good resource for regexp language here.

Awesome tools for testing regexp here:

Introduction

Regular expressions (commonly shortened to "regex") are a language for writing precise patterns to search strings. The regex language has constructs for virtually every pattern of characters you might want to search for. Only the most common forms are described on this page. For more detail, search the internet, where you will find many thorough resources on the subject. This writer recommends this tutorial in particular.

Commonly used Regular Expressions

Literal Characters

A literal character matches the first occurrence of itself in a string.

E.g. a will match the first a in "Jack is a dull boy."
E.g. cat is a sequence of three literal characters, which will find a match in "about cats and dogs."
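
The two literal examples above can be tried out directly with java.util.regex, the package ASH wraps. This is only a minimal sketch: the class name LiteralMatchDemo and the helper firstMatchIndex are made up for illustration and are not part of ASH or KoLmafia.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of ASH or KoLmafia.
class LiteralMatchDemo {
    // Returns the index where the first match of regex starts in text, or -1 if there is no match.
    static int firstMatchIndex(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // The first "a" is the second character of "Jack".
        System.out.println(firstMatchIndex("a", "Jack is a dull boy."));    // prints 1
        // "cat" is found inside "cats".
        System.out.println(firstMatchIndex("cat", "about cats and dogs.")); // prints 6
    }
}
```

Note that the search stops at the first occurrence; later occurrences are only found by searching again from where the previous match ended.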

Special Characters

It's often more interesting to search for less specific patterns than literal characters. There are a number of characters reserved for this purpose. These special characters are often called "metacharacters". If you want to use any of these characters as a literal in a regex, you need to escape them with a backslash.

Backslash \
Used to grant special meaning to a normally literal character or employ a special character as a literal.

E.g. If you want to find the beginning of a word, the combination \b will match a word's boundary.

E.g. to match "1+1=2", the correct regex is 1\+1=2. Otherwise, the plus sign will have a special meaning.
Question mark ?
The question mark makes the preceding token in the regular expression optional.

E.g. colou?r matches both "colour" and "color".
Asterisk or star *
The asterisk attempts to match the preceding token zero or more times.
Plus sign +
The plus attempts to match the preceding token once or more.
Period or dot .
Matches any character except for line breaks.
Caret ^
Matches the beginning of the string only.

E.g. ^the will match only the first word in "the way of the world."
Dollar sign $
Matches the end of the string only.

E.g. dog$ will match only the last word in "dog eat dog".
Opening and closing round brackets ( and )
Used for grouping, which allows a regex operator (like +) to be applied to the entire group. The group also captures the text it matched, which can be referred to later with a backreference.
Opening and closing square bracket [ and ]
Used to create "character sets" that match exactly one of several characters. Inside these brackets, different rules apply to some characters, notably ^ (negates the set when placed first) and - (defines a range).

E.g. gr[ae]y will match either "gray" or "grey", but it will not match "graey".
Opening and closing braces { and }
This is a limited repetition operator: {min,max} matches what precedes it at least min and at most max times.

E.g. \b[1-9][0-9]{2,4}\b matches a number between 100 and 99999. (\b matches a word boundary.)
Vertical bar or pipe symbol |
This is an "or" operator to match one of several possibilities.

E.g. \b(cat|dog|fish)\b will match either "cat", "dog" or "fish".
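
The examples from the list above can be checked with a short Java sketch. The class MetacharacterDemo and its two helpers are hypothetical names chosen for this illustration. The one Java-specific wrinkle is that each regex backslash has to be written as \\ inside a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of ASH or KoLmafia.
class MetacharacterDemo {
    // True if the regex matches somewhere inside the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    // True only if the regex matches the entire text.
    static boolean matchesWhole(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(found("1\\+1=2", "1+1=2"));           // true: escaped plus is a literal
        System.out.println(found("1+1=2", "1+1=2"));             // false: unescaped plus means "one or more 1"
        System.out.println(matchesWhole("colou?r", "color"));    // true: the u is optional
        System.out.println(matchesWhole("colou?r", "colour"));   // true
        System.out.println(matchesWhole("(ab)+", "ababab"));     // true: + applies to the whole group
        System.out.println(matchesWhole("gr[ae]y", "grey"));     // true
        System.out.println(matchesWhole("gr[ae]y", "graey"));    // false: the set matches one character
        System.out.println(found("^the", "the way of the world.")); // true: anchored at the start
        System.out.println(found("dog$", "dog eat dog"));        // true: anchored at the end
        System.out.println(found("\\b[1-9][0-9]{2,4}\\b", "99"));     // false: too few digits
        System.out.println(found("\\b[1-9][0-9]{2,4}\\b", "99999"));  // true
        System.out.println(found("\\b(cat|dog|fish)\\b", "a fish")); // true
    }
}
```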

Other Common Matchers

\w

Matches any word character (alphanumeric & underscore).

\W

Matches any character that is not a word character (alphanumeric & underscore).

\d

Matches any digit character (0-9).

\D

Matches any character that is not a digit character (0-9).

\s

Matches any whitespace character (spaces, tabs, line breaks).

\S

Matches any character that is not a whitespace character (spaces, tabs, line breaks).

\n

Matches a line break (newline) character.

\t

Matches a tab character.

\b

Matches a word boundary: a zero-width position between a word character and a non-word character, including the beginning or end of the string next to a word character.

[A-Za-z]

Matches any single character in the range a-z or A-Z.

[^A-Za-z]

Matches any single character, except for the range a-z or A-Z.
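
To see what these matchers select, one option is to replace everything they match and look at what is left. The sketch below does that with java.util.regex. ShorthandClassDemo and its replaceAll helper are illustrative names only, and the sample strings are invented.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of ASH or KoLmafia.
class ShorthandClassDemo {
    // Replaces every match of regex in text with the given replacement.
    static String replaceAll(String regex, String text, String replacement) {
        return Pattern.compile(regex).matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // In Java source, each regex backslash is written as "\\".
        System.out.println(replaceAll("\\d", "room 42b", "#"));          // room ##b
        System.out.println(replaceAll("\\s+", "a \t b\nc", " "));        // a b c
        System.out.println(replaceAll("\\W", "hi_there, you!", ""));     // hi_thereyou
        System.out.println(replaceAll("[^A-Za-z]", "R2-D2", ""));        // RD
        // \b keeps "cat" inside "concat" from being replaced.
        System.out.println(replaceAll("\\bcat\\b", "cat concat cat", "dog")); // dog concat dog
    }
}
```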

Using Regexes in KolMafia

Regular expressions in ASH are wrappers for the Java java.util.regex package. You can find detailed information about that in this Java Tutorial. Only the highlights will be described in this section.
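
For orientation, the underlying java.util.regex workflow is: compile a pattern, create a matcher over a string, call find() repeatedly, and read capture groups after each successful find. The sketch below shows that workflow in plain Java. It is not ASH code, the class and method names are hypothetical, and the sample text is invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of ASH or KoLmafia.
class MatcherWorkflowDemo {
    // Collects the text captured by group 1 for every match of regex in text.
    // The regex is assumed to contain at least one capturing group.
    static List<String> captureAll(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        List<String> captured = new ArrayList<>();
        while (m.find()) {              // advance to the next match, if any
            captured.add(m.group(1));   // text matched by the first (...) group
        }
        return captured;
    }

    public static void main(String[] args) {
        // Invented sample text; (\d+) captures the number before " meat".
        String text = "You gain 120 meat. You gain 35 meat.";
        System.out.println(captureAll("(\\d+) meat", text)); // [120, 35]
    }
}
```

Group 0 always refers to the whole match, and groups are numbered from 1 in the order of their opening round brackets.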

@@ Line 28: / Line 28: @@
 It's often more interesting to search for less specific patterns than literal characters. There are a number of characters reserved for this purpose. These special characters are often called "metacharacters". If you want to use any of these characters as a literal in a regex, you need to escape them with a backslash.
-*Opening and closing square bracket '''[''' and ''']'''
+* Backslash '''\'''
-*:Used to create "character sets" to match only one of several characters.
+*: Used to grant special meaning to a normally literal character or employ a special character as a literal.
-*:E.g. {{Pspan|gr[ae]y}} will match either "gray" or "grey", but it will not match "graey".
+*: E.g. If you want to find the beginning of a word, the combination {{Pspan|\b}} will match a word's boundary.
-*Backslash '''\'''
+*: E.g. to match "1+1=2", the correct regex is <span style="font-weight: bold; font-style: italic; padding-left: .2em; padding-right: .2em; color: #006400; background: #fff8dc;">1\+1=2</span>. Otherwise, the plus sign will have a special meaning.
-*:Used to grant special meaning to a normally literal character or employ a special character as a literal.
+* Question mark '''?'''
-*:E.g. If you want to find the beginning of a word, the combination {{Pspan|\b}} will match a word's boundary.
+*: The question mark makes the preceding token in the regular expression optional.
-*:E.g. to match "1+1=2", the correct regex is <span style="font-weight: bold; font-style: italic; padding-left: .2em; padding-right: .2em; color: #006400; background: #fff8dc;">1\+1=2</span>. Otherwise, the plus sign will have a special meaning.
+*: E.g. {{Pspan|colou?r}} matches both "colour" and "color".
-*Question mark '''?'''
+* Asterisk or star '''*'''
-*:The question mark makes the preceding token in the regular expression optional.
+*: The asterisk attempts to match the preceding token zero or more times.
-*:E.g. {{Pspan|colou?r}} matches both "colour" and "color".
+* Plus sign '''+'''
-*Asterisk or star '''*'''
+*: The plus attempts to match the preceding token once or more.
-*:The asterisk attempts to match the preceding token zero or more times.
+* Period or dot '''.'''
-*Plus sign '''+'''
+*: Matches any character except for line breaks.
-*:The plus attempts to match the preceding token once or more.
+* Caret '''^'''
-*Period or dot '''.'''
+*: Matches the beginning of the string only.
-*:Matches any character except for line breaks.
+*: E.g. {{Pspan|^the}} will match only the first word in "{{Pspan|the}} way of the world."
-*Caret '''^'''
+* Dollar sign '''$'''
-*:Matches the beginning of the string only.
+*: Matches the end of the string only.
-*:E.g. {{Pspan|^the}} will match only the first word in "{{Pspan|the}} way of the world."
+*: E.g. {{Pspan|dog$}} will match only the last word in "dog eat {{Pspan|dog}}".
-*Dollar sign '''$'''
+* Opening and closing round brackets '''(''' and ''')'''
-*:Matches the end of the string only.
+*: Used for grouping allowing a regex operator (like +) to be applied to the entire group. It also creates a backreference storing the match.
-*:E.g. {{Pspan|dog$}} will match only the last word in "dog eat {{Pspan|dog}}".
+* Opening and closing square bracket '''[''' and ''']'''
-*Opening and closing round brackets '''(''' and ''')'''
+*: Used to create "character sets" to match only one of several characters. Inside these brackets different rules apply to several characters: ^-
-*:Used for grouping allowing a regex operator (like +) to be applied to the entire group. It also creates a backreference storing the match.
+*: E.g. {{Pspan|gr[ae]y}} will match either "gray" or "grey", but it will not match "graey".
-*Opening and closing braces '''{''' and '''}'''
+* Opening and closing braces '''{''' and '''}'''
-*:This is a limited repetition operator matching only {min,max} of what precedes it.
+*: This is a limited repetition operator matching only {min,max} of what precedes it.
-*:E.g. {{Pspan|\b[1-9][0-9]{2,4}\b}} matches a number between 100 and 99999. (\b matches a word boundary.)
+*: E.g. {{Pspan|\b[1-9][0-9]{2,4}\b}} matches a number between 100 and 99999. (\b matches a word boundary.)
-*Vertical bar or pipe symbol '''|'''
+* Vertical bar or pipe symbol '''|'''
-*:This is an "or" operator to match one of several possibilities.
+*: This is an "or" operator to match one of several possibilities.
-*:E.g. <span style="font-weight: bold; font-style: italic; padding-left: .2em; padding-right: .2em; color: #006400; background: #fff8dc;">\b(cat|dog|fish)\b</span> will match either "cat", "dog" or "fish".
+*: E.g. <span style="font-weight: bold; font-style: italic; padding-left: .2em; padding-right: .2em; color: #006400; background: #fff8dc;">\b(cat|dog|fish)\b</span> will match either "cat", "dog" or "fish".
+===Other Common Matchers===
+* \w
+: Matches any word character (alphanumeric & underscore).
+* \W
+: Matches any character that is not a word character (alphanumeric & underscore).
+* \d
+: Matches any digit character (0-9).
+* \D
+: Matches any character that is not a digit character (0-9).
+* \s
+: Matches any whitespace character (spaces, tabs, line breaks).
+* \S
+: Matches any character that is not a whitespace character (spaces, tabs, line breaks).
+* \n
+: Matches a line break (newline) character.
+* \t
+: Matches a tab character.
+* \b
+: Matches a word boundary: a zero-width position between a word character and a non-word character, including the beginning or end of the string next to a word character.
+* [A-Za-z]
+: Matches any single character in the range a-z or A-Z.
+* [^A-Za-z]
+: Matches any single character, except for the range a-z or A-Z.
 ==Using Regexes in KolMafia==