The best way to learn the regular expressions used in Filter Rules > Pattern Filtering is to understand the basic principles and then work through some examples.
Pattern Filtering in SpamTitan is simple. It checks whether the pattern that you specify appears in the part of the message that you specify, and:
* If it is a Whitelisted Pattern, SpamTitan subtracts 100 from the spam score (putting it way below the Spam threshold)
* If it is a Blacklisted Pattern, SpamTitan adds 100 to the spam score (putting it way above the Spam threshold)
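The scoring rule above can be sketched in a few lines of Python. This is only an illustration: the threshold value and the function names are assumptions made up for the example, not SpamTitan settings or internals.

```python
# Hypothetical values for illustration only; these are not SpamTitan defaults.
SPAM_THRESHOLD = 5.0
PATTERN_WEIGHT = 100.0


def adjust_score(score, whitelisted_hit, blacklisted_hit):
    """Return the spam score after applying pattern filter hits."""
    if whitelisted_hit:
        score -= PATTERN_WEIGHT  # pushes the score far below the threshold
    if blacklisted_hit:
        score += PATTERN_WEIGHT  # pushes the score far above the threshold
    return score


def is_spam(score):
    """Treat a message as spam once its score reaches the threshold."""
    return score >= SPAM_THRESHOLD
```

With these assumed numbers, a message that scored 7.5 and hits a whitelisted pattern ends up at -92.5, well under any realistic threshold.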
Wildcards are a very important part of regular expressions. Here are the ones that you are going to use the most:
. -- matches any character, except the newline (including spaces, tabs, letters, and numbers)
\d -- matches any digit (0-9)
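\s -- matches any whitespace character (space, tab, or newline)
\w -- matches any word character (a-z, A-Z, 0-9, and the underscore)
[] -- matches any single character within the brackets (e.g., [btz] will match a b, t, or z; [a-z0-9] will match a to z or 0 to 9; do not separate the characters with commas, because the comma would be matched too)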
\ -- used as an "escape" to strip special properties of any character (e.g., if you want a period, use "\." instead of ".")
(|) -- matches any of the patterns inside the parentheses (e.g., (\.tw|\.jp) will match ".tw" or ".jp")
* -- matches zero or more of the preceding element or pattern
+ -- matches one or more of the preceding element or pattern
? -- matches zero or one of the preceding element or pattern
{n} -- matches exactly "n" number of the previous character or pattern
{n,m} -- matches between "n" and "m" number of the previous character or pattern
{n,} -- matches at least "n" of the previous character or pattern
^ -- matches the beginning of a string
$ -- matches the end of a string
\b -- word break (matches the beginning {\<} or end {\>} of a word, it is NOT a catchall for tabs, spaces, etc. between words)
==> These are different from the wildcards used in SQL or in file systems.
==> Always test your regex for matches and false positives, using a testing tool like http://regexpal.com/
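If you prefer to test from a script, a few of the wildcards above can be checked with Python's re module. This is a rough sketch: Python's regex flavor is close to the one described here, but it is an assumption that every pattern behaves identically inside SpamTitan, so confirm anything important in the product itself. The sample strings are made up.

```python
import re

# Each entry: (pattern, sample text, whether the pattern should be found)
checks = [
    (r"a.c", "abc", True),                  # . matches any single character
    (r"a\.c", "abc", False),                # \. matches only a literal period
    (r"\d{3}", "ab12", False),              # needs three digits in a row
    (r"\d{2,}", "ab12", True),              # two or more digits
    (r"colou?r", "color", True),            # ? makes the "u" optional
    (r"(\.tw|\.jp)$", "example.jp", True),  # alternation anchored at the end
    (r"^Re:", "Fwd: Re: hi", False),        # ^ anchors at the start
    (r"\bcat\b", "concatenate", False),     # \b needs a word boundary
    (r"\bcat\b", "the cat sat", True),
]

# True for every row where the regex behaves as the table above describes
results = [bool(re.search(p, text)) == expected for p, text, expected in checks]
print(all(results))
```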
One of the most common things to whitelist or blacklist is the sending email address, but an address can take many different forms, which makes it cumbersome to list every address individually.
Example: If you want to match all of the mktomail.com subdomains, use the following pattern:
/(From:|Return\-Path:) .+\@.*mktomail\.com/
==> This will match @mktomail.com and all subdomains.
==> The (From:|Return\-Path:) is used so that only inbound mail is affected (otherwise, if you had to whitelist/release an affected message, you couldn't reply to it).
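A quick way to sanity-check the mktomail.com pattern is to run it against a few header lines. The headers below are invented samples, and Python's re module is only a stand-in for SpamTitan's matcher.

```python
import re

SENDER = re.compile(r"(From:|Return\-Path:) .+\@.*mktomail\.com")

# Made-up header lines for illustration
headers = {
    "From: News <news@em-sj-77.mktomail.com>": True,   # subdomain
    "Return-Path: bounce@mktomail.com": True,          # bare domain
    "To: someone@mktomail.com": False,                 # not a sender header
    "From: alice@example.com": False,                  # different domain
    "From: x@notmktomail.com": True,                   # .* also allows a prefix
}

outcome = {line: bool(SENDER.search(line)) for line in headers}
print(outcome == headers)
```

The last sample shows a side effect worth knowing: because .* matches any characters, a domain that merely ends in mktomail.com is matched as well.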
Example: Blacklisting a top level domain under System Setup > Mail Relay > Sender Controls > Blacklisted Top Level Domains (TLDs) cannot be worked around by whitelisting the domain, email address, or even the IP address. To get around this limitation, you can apply the following blacklisted pattern instead (applied against "Any Header"):
/(From:|Return\-Path:) .+\@.+\.link\b/
==> If you don't have the \b at the end, you will also block any email address that has ".link" followed by more letters in it, like mail.linkedin.com.
==> /.link/ would match not only .link top level domains, but also blink and linkedin, because the unescaped period matches any character. Always make your patterns as specific as possible.
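The effect of the trailing \b and of the escaped period can be seen by comparing three variants of the pattern. This is a sketch using Python's re module with invented sender lines; the variable names are just labels for the example.

```python
import re

with_b = re.compile(r"(From:|Return\-Path:) .+\@.+\.link\b")
without_b = re.compile(r"(From:|Return\-Path:) .+\@.+\.link")
loose = re.compile(r".link")  # unescaped period matches any character

tld_sender = "From: promo@deals.link"
linkedin_sub = "From: jobs@mail.linkedin.com"
linkedin = "From: jobs@linkedin.com"

hits = {
    "with_b / .link sender": bool(with_b.search(tld_sender)),
    "with_b / mail.linkedin.com": bool(with_b.search(linkedin_sub)),
    "without_b / mail.linkedin.com": bool(without_b.search(linkedin_sub)),
    "loose / linkedin.com": bool(loose.search(linkedin)),
    "loose / blink": bool(loose.search("Subject: blink and you miss it")),
}
for label, matched in hits.items():
    print(label, matched)
```

Only the first variant separates the .link sender from the LinkedIn addresses; the other two produce false positives.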
If you want to blacklist emails that have a URL with an IP address in them, use the following pattern and apply it to the message body. Breaking this down:
-- To match the http:// or https:// protocol, use the following:
/https?:\/\//
==> the s? is used to match both http:// and https://
-- To match one IP address octet (0-255), build it up from the following ranges:
0-9 -- \d
10-99 -- [1-9]\d
==> 0-99 can be simplified to [1-9]?\d
100-199 -- 1[0-9][0-9] or 1\d\d
200-249 -- 2[0-4]\d
250-255 -- 25[0-5]
==> Do not shorten 200-255 to 2[0-5][0-5]; that misses values such as 209 and 216.
==> An "octet" can be specified with the following: ([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])
==> There are four octets, with a period between each of them: /(([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])\.){3}([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])\b/
Combining the two yields: /https?:\/\/(([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])\.){3}([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])\b/
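The IP address URL pattern can be assembled and tested piece by piece. In this sketch the octet is written as [1-9]?\d|1\d\d|2[0-4]\d|25[0-5] so that every value from 0 to 255 is covered. The sample URLs are invented, and Python's re module is used only as a stand-in for SpamTitan's matcher.

```python
import re

# One octet, 0-255
OCTET = r"([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])"

# Protocol, three "octet." groups, a final octet, then a word boundary
IP_URL = re.compile(r"https?:\/\/(" + OCTET + r"\.){3}" + OCTET + r"\b")

samples = {
    "Click http://192.168.10.209/login": True,
    "See https://10.0.0.1 for details": True,
    "Visit http://example.com/1.2.3": False,   # host name, not an IP
    "Bad octet http://999.1.1.1/": False,      # 999 is out of range
    "Bad octet http://1.2.3.256/": False,      # 256 is out of range
}

found = {text: bool(IP_URL.search(text)) for text in samples}
print(found == samples)
```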
There was recently a Spam Subject that was "Unpaid Invoic". To match this, but not Invoice, use:
/Unpaid Invoic\b/
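A short check of how \b separates the misspelled subject from the correctly spelled one. The subject lines are made-up samples, tested with Python's re module.

```python
import re

TYPO_SUBJECT = re.compile(r"Unpaid Invoic\b")

subjects = {
    "Subject: Unpaid Invoic": True,         # end of line counts as a boundary
    "Subject: Unpaid Invoic #4411": True,   # space after "Invoic"
    "Subject: Unpaid Invoice": False,       # "e" continues the word
    "Subject: Unpaid Invoices due": False,
}

matched = {s: bool(TYPO_SUBJECT.search(s)) for s in subjects}
print(matched == subjects)
```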
Unless you specifically want to match a single character from a list, never use unescaped brackets ("[" and "]"). If you need to match the bracket characters literally, escape them with the \. For example, if you want to match [Quote Number]:
- /[Quote Number]/ will match if the subject has any one of the characters inside the brackets: Q, u, o, t, e, N, m, b, r, or a space
- /\[Quote Number\]/ will match only if the literal text [Quote Number], brackets included, is there.
- An even better match would be to include the format of the numbers. If you want to match [Quote Number]: followed by 5-9 digits, use:
/\[Quote Number\]: \d{5,9}/
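The difference between unescaped and escaped brackets is easy to demonstrate. This sketch uses Python's re module and invented subject lines; the variable names are just labels for the example.

```python
import re

unescaped = re.compile(r"[Quote Number]")         # a character class
escaped = re.compile(r"\[Quote Number\]")         # the literal bracketed text
with_digits = re.compile(r"\[Quote Number\]: \d{5,9}")

bracket_checks = {
    "class matches unrelated text": bool(unescaped.search("Hello")),
    "literal rejects unrelated text": bool(escaped.search("Hello")),
    "literal matches real subject": bool(
        escaped.search("Re: [Quote Number]: 123456")
    ),
    "digits accepts 6 digits": bool(
        with_digits.search("Re: [Quote Number]: 123456")
    ),
    "digits rejects 3 digits": bool(
        with_digits.search("Re: [Quote Number]: 123")
    ),
}
for label, value in bracket_checks.items():
    print(label, value)
```

The unescaped version matches "Hello" simply because that word contains an "e" and an "o", which is exactly the kind of false positive the escaped forms avoid.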
ALWAYS monitor Reporting > History after you make changes to pattern filters.
Pattern Filtering is part of Spam Filtering, so if you are testing, make sure that you are not testing from a Whitelisted IP Address or Whitelisted email address.