Michael Ash's Regex Blog

Regex Musings

Just a dash?

I've seen plenty of tutorials and documentation that mention some of the special characters inside a character class.  The hat (^) symbol negates the class when it is the first character inside the brackets; anywhere else in the class it is just a literal ^.

 

            Ex.  [^ABC] matches any character that is not A, B, or C
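The negation rule is easy to check by running it. The sketch below uses Python's re module, chosen only for illustration since the post is not tied to one language; the sample strings are made up.

```python
import re

# A leading ^ negates the class: match anything except A, B, or C.
not_abc = re.compile(r"[^ABC]")
leftover = "".join(not_abc.findall("CABBAGE"))
print(leftover)  # GE

# A ^ that is not the first character is an ordinary member of the class.
literal_hat = re.findall(r"[A^BC]", "A^D")
print(literal_hat)  # ['A', '^']
```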

 

A lesser-known special character is the period (.), which outside a character class usually means any character except a newline, but inside a character class matches only a literal period.

In the string 123.456 

            .+ matches the entire string

 

            but [.]+ only matches the period in the string

 

I say this is lesser known because most of the expressions I’ve seen escape the period within the character class: [\.]  While this isn't wrong, it also isn't necessary, though I think a lot of people believe it is.
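As a rough check of the 123.456 example, here is a minimal sketch in Python's re module (an assumption for illustration; other flavors should treat the dot the same way inside a class).

```python
import re

text = "123.456"

# Outside a class the dot is a wildcard, so .+ consumes the whole string.
wildcard = re.search(r".+", text).group()
print(wildcard)  # 123.456

# Inside a class the dot is literal, so [.]+ finds only the period.
literal = re.findall(r"[.]+", text)
print(literal)  # ['.']

# Escaping the dot inside the class is allowed but changes nothing.
escaped = re.findall(r"[\.]+", text)
print(escaped == literal)  # True
```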

 

The only other special character I see mentioned regarding a character class is \b, which has a totally different meaning inside the brackets than outside: outside it is a word boundary, while inside it typically stands for the backspace character.
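The two meanings of \b can be seen side by side in a short sketch. It assumes Python's re module, where \b inside a class is the backspace character (0x08); the test strings are invented for the example.

```python
import re

# Outside a class, \b is a zero-width word boundary.
whole_words = re.findall(r"\bcat\b", "cat concat cats")
print(whole_words)  # ['cat']

# Inside a class, \b stands for the backspace character (0x08).
backspaces = re.findall(r"[\b]", "a\x08b")
print(backspaces == ["\x08"])  # True
```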

 

But the special character I haven't seen mentioned is the dash (-).

 

While working on the latest version of my datetime regex I realized something tricky about character classes.  I use the character class [./-] to match any of my three date separators: a period, a slash, or a dash.  Notice that none of the characters are escaped.  I mentioned to someone about adding a fourth separator, a colon (:), to the class when I realized you can't just add it to the end of the existing characters.  [./-:] does not simply add the colon to the previous group of three characters.  It also adds all of the digits and removes the dash, because /-: is now read as the range from slash to colon.

Most people know you can use the dash to specify a range; [a-zA-Z] for all the letters of the English alphabet and [0-9] for the digits zero through nine are common constructs.  But the dash is not restricted to those characters or combinations; they are just the ones used the most.  In fact the range can be between any two SINGLE characters.  So [&-Z] is perfectly valid and matches all digits and uppercase letters plus a lot of other characters, like = or @.  By default the range simply covers every character between the two ASCII or Unicode values, so you can type the characters themselves, use the ASCII \xXX syntax, or use the Unicode \uXXXX syntax.  The only catch is that the second character cannot have a lower (ASCII or Unicode) value than the first.
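To see what the extra colon does, the following sketch runs both classes over a made-up sample string and lists what [&-Z] accepts. It assumes Python's re module for illustration; the datetime regex itself may target a different engine, so treat this as a demonstration of the range rule rather than of that regex.

```python
import re

sample = "./-:059"

# The original class: period, slash, and a trailing (literal) dash.
three_separators = "".join(re.findall(r"[./-]", sample))
print(three_separators)  # ./-

# Appending a colon turns "/-:" into the range 0x2F..0x3A, which covers
# the slash, the digits 0-9, and the colon, but not the dash (0x2D).
with_colon = "".join(re.findall(r"[./-:]", sample))
print(with_colon)  # ./:059

# [&-Z] spans 0x26..0x5A; collect every ASCII character it accepts.
ascii_chars = [chr(code) for code in range(128)]
amp_to_z = "".join(c for c in ascii_chars if re.fullmatch(r"[&-Z]", c))
print(amp_to_z)

# The same range written with \xXX escapes accepts the same characters.
hex_range = "".join(c for c in ascii_chars if re.fullmatch(r"[\x26-\x5A]", c))
print(hex_range == amp_to_z)  # True

# A range whose end sorts before its start is rejected at compile time.
try:
    re.compile(r"[Z-&]")
    reversed_ok = True
except re.error:
    reversed_ok = False
print(reversed_ok)  # False
```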

 

But here's the tricky part: the dash is only a range indicator if it's between two SINGLE characters, or the escaped equivalent of a single character such as \xXX or \uXXXX.  Otherwise it matches only a dash.  In the character class used in my datetime regex, [./-], it matches only a dash because it is at the end of the list of characters.  If it were at the beginning it would also work, but [.-/] is not the same: there the dash defines the range from period to slash, so the dash itself doesn't match.  I think I ran into this in an earlier version of the datetime regex, but I didn't fully grasp what was happening.  You'll notice I stressed the words 'single character'.  That's because, as of this writing, if the dash is between a character escape that represents multiple characters (such as \d) and a single character, the dash matches only a dash.

For instance, [0-Z] matches a bunch of characters, including the digits and the uppercase English letters, but [\d-Z] matches any digit or a dash or a Z.
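The placement rules can be tried out with a short sketch. It assumes Python's re module, and one caveat matters: Python does not treat the dash in [\d-Z] as a literal the way the post describes. It rejects the pattern as a bad range, which is a good reminder that this corner of the syntax depends on the engine.

```python
import re

separators = "./-"

# Dash last or first: a literal dash, so all three separators match.
print(re.findall(r"[./-]", separators))  # ['.', '/', '-']
print(re.findall(r"[-./]", separators))  # ['.', '/', '-']

# Dash in the middle: the range 0x2E..0x2F, so the dash is lost.
middle = re.findall(r"[.-/]", separators)
print(middle)  # ['.', '/']

# [0-Z] is a genuine range: digits, uppercase letters, and the
# punctuation that sits between them in ASCII.
zero_to_z = re.findall(r"[0-Z]", "7-Q:z")
print(zero_to_z)  # ['7', 'Q', ':']

# Python refuses a range that starts with a multi-character escape.
try:
    re.compile(r"[\d-Z]")
    accepted = True
except re.error:
    accepted = False
print(accepted)  # False
```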

 

Bottom line: if you want to match a dash in a character class along with other characters, place it first or last, or be safe and escape it.
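As a sketch of that advice, here are the three safe spellings of a four-separator class, again assuming Python's re module. The sample dates and the dictionary labels are hypothetical.

```python
import re

dates = "2004-06-17 2004/06/17 2004.06.17 2004:06:17"

# Three ways to keep the dash literal once the colon is added.
safe_classes = {
    "dash first": r"[-./:]",
    "dash last": r"[./:-]",
    "dash escaped": r"[./\-:]",
}

results = {name: re.findall(pattern, dates) for name, pattern in safe_classes.items()}
for name, found in results.items():
    print(name, found)
```

All three classes should pick out the same eight separators and none of the digits.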
Published Thursday, June 17, 2004 5:37 PM by mash

Comments

 

mash said:

For what it's worth, the character '-' is called a hyphen, not a dash. A true dash when printed looks longer than a hyphen. Since keyboards lack a separate dash key, a dash is usually typed as two hyphens. The hyphen joins pieces into multi-word compounds, and the dash divides parts of a sentence--usually where one part is not a complete sentence.

I only bother to make this pedantic distinction because I've been puzzling over how to make a regular expression respect a double-hyphen dash as a separate. It's OK to split after the dash, but not before or in the middle of it.

So for example, given the string "foo--bar" I want my regular expression to create the following split

foo--
bar

but not

foo
--bar

and not

foo-
-bar

Anyone solved that problem?
June 23, 2004 9:44 AM
 

mash said:

You are correct, Sir! The character is the hyphen. I probably shouldn't have used the word dash, since there are different types of dashes. I hope that didn't confuse the point I was trying to get across.

BTW (?<=--)\b will split foo--bar the way you wish, if you can use lookbehinds.
June 24, 2004 1:37 AM
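For anyone who wants to try the lookbehind split, here is a minimal sketch. It assumes Python's re module, where splitting on a zero-width match requires Python 3.7 or later; other engines with lookbehind support should behave similarly.

```python
import re

# Split at a word boundary that is immediately preceded by "--".
# The match is zero-width, so no characters are consumed by the split.
parts = re.split(r"(?<=--)\b", "foo--bar")
print(parts)  # ['foo--', 'bar']
```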
 

mash said:

Alright, just a dash eh? how about just a quote?

Simple simple simple. In fact, I did it once before.
Subject: Perl, Regular Expressions
Problem: fields separated by double quotes
Solution: split() function

alright, here's a sample of what I had before, but I forget the Regular Expression:

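#!perl
use strict;
use warnings;

my $temp = 'someinformation"moreinformation"evenmoreinformation""""blankfields"""moreinfo';
# Split on every double quote; consecutive quotes produce empty fields.
my @rows = split( /"/, $temp );
for (@rows)
{
    # Perl joins strings with '.', not '+'.
    print $_ . "\n";
}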

Focus on the regular expression. If anybody can get this to work, email me.
June 30, 2004 11:25 AM
 
