I'm trying to parse data out of a report file using regex in the .NET Framework.
The format of the data looks like this: "recordNum","location","lastName","firstName","dob", etc.
I've been able to grab the first field using both "([0-9A-Z]*)" and "([^"]*)" . I was able to grab the second field using ","([^"]*)"
But what if I want to grab the 10th or 30th field? I can't seem to find the correct way to implement a \n where 'n' is the occurrence number.
I can't give exact data, because it's a medical record, but it looks a lot like this:
"PATIENT","000099999","O OUT","DOE","JOHN","","01/01/1950",...
"EMPLOYER","SELF","","","",""
"GUARANTOR","DOE","JOHN","","999-99-9999",...
Yes, to clarify, I do mean Nth per line. Well, I have to map all of them, so let's assume I'm looking to grab just the first name, or "John" in the example provided. I'm hoping the solution is something that will allow me to simply cut and paste the regex and change an iteration variable (i.e. \4 for the 4th field, \5 for the fifth, etc.).
But to further clarify, I'm essentially writing XML templates that will pore through 70,000 records per month. So the solution cannot be dependent on the data in the fields, since that changes. The regex has to work something like "after the 8th comma, grab the alpha-numeric characters between the quotes".
How about:
^("[^"]*",){2}("[^"]*")
with the 'multiline' option set. This assumes that each item in the source text is surrounded by double-quotes and that there are no 'escaped double quotes' within an item (e.g. "hello"" world"). Look at match group #2 for the item you are after.
Just set the quantifier to be 1 less than the number of the item you want. This works even if you are after the 1st item (in which case the quantifier would be {0}).
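The "quantifier is one less than the field number" rule can be tried out with a short script. This is only a sketch in Python rather than .NET; the helper names `anchored_pattern` and `nth_field_per_line` are made up for illustration, and the sample text is the example from the question with the trailing "..." left off. As noted above, match group #2 still includes the surrounding double-quotes.

```python
import re

# Sample adapted from the question (trailing "..." omitted).
SAMPLE = (
    '"PATIENT","000099999","O OUT","DOE","JOHN","","01/01/1950"\n'
    '"EMPLOYER","SELF","","","",""\n'
)


def anchored_pattern(n):
    """Build the anchored pattern for the n-th field (1-based).

    The quantifier is n - 1: skip that many '"...",' items, then
    capture the next quoted item in group 2.
    """
    return r'^("[^"]*",){%d}("[^"]*")' % (n - 1)


def nth_field_per_line(text, n):
    """Return group 2 (quotes included) for every line that has an n-th field."""
    regex = re.compile(anchored_pattern(n), re.MULTILINE)
    return [m.group(2) for m in regex.finditer(text)]


print(nth_field_per_line(SAMPLE, 1))  # quantifier {0}
print(nth_field_per_line(SAMPLE, 5))  # quantifier {4}
```

Lines with fewer than n fields simply produce no match, so they drop out of the result list.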
Susan
Susan, that is very close! You are a regex genius!
Because this is being used within some alpha software our vendor sent us (which is at least smart enough to allow me to select the desired line), I can't use the leading caret. So I used ("[^"]*",){4}("[^"]*"). That grabbed the 5th field, which is the first name. It selected "JOHN", including the quotes and trailing comma, but I think I can play with it from there and see if I can figure it out...not that I'd complain if you wanted to strip out the unnecessary chars.
I owe you dinner!
("[^"]*",){4}"([^"]*)"
Again, look at match group #2.
I'm not sure what is going on when you say that you were grabbing the trailing comma. I can understand the surrounding double-quotes, as they were inside the grouping brackets (they are outside in my suggestion immediately above), but in either case the trailing double-quote should stop the match! The comma WILL be included in the match group #1 item(s), but those are being ignored anyway.
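To see what each part of the match holds, here is a small sketch using Python's re module (an assumption on my part: the vendor software may report groups differently, although .NET's Group.Value also returns the last repetition of a repeated group). The sample line is the one from the question without the trailing "...".

```python
import re

LINE = '"PATIENT","000099999","O OUT","DOE","JOHN","","01/01/1950"'

m = re.search(r'("[^"]*",){4}"([^"]*)"', LINE)

whole_match = m.group(0)      # everything from the first field through the 5th
last_repetition = m.group(1)  # only the LAST repetition of group 1, comma included
wanted_field = m.group(2)     # the 5th field, quotes excluded

print(repr(whole_match))      # '"PATIENT","000099999","O OUT","DOE","JOHN"'
print(repr(last_repetition))  # '"DOE",'
print(repr(wanted_field))     # 'JOHN'
```

So the comma only shows up in group #1 (and the skipped fields in the whole match); group #2 is the bare field text.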
It all depends what your software supports, you might try:
(?m:(?:"[^"]*",){2}"([^"]*)")
(?m:            group, but do not capture (with ^ and $ matching start and end of line) (case-sensitive) (with . not matching \n) (matching whitespace and # normally):
  (?:           group, but do not capture (2 times):
    "           '"'
    [^"]*       any character except: '"' (0 or more times (matching the most amount possible))
    ",          '",'
  ){2}          end of grouping
  "             '"'
  (             group and capture to \1:
    [^"]*       any character except: '"' (0 or more times (matching the most amount possible))
  )             end of \1
  "             '"'
)               end of grouping
ddrudik, that did it!
With that code, I am able to simply alter the number between the braces to select the correct field. Thank you for the code break-down as well.
I appreciate the help from both you and Aussie Susan, my Regex-Fu is pretty "weak tea" compared to my other coding skills. I think I'm going to have to pick up a copy of Mastering Regular Expressions and start learning in earnest.
Damn, now I owe two dinners!
I see I left out the ^ from the original pattern, thus the (?m:) is not necessary:
(?:"[^"]*",){2}"([^"]*)"
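For mapping several fields at once, the pattern above can be wrapped in a small helper. This is a sketch in Python, not the .NET or vendor-tool code; `field` and `FIELD_MAP` are hypothetical names, and the field positions in `FIELD_MAP` are guessed from the sample line in the question (with the trailing "..." omitted). It uses re.match, which only matches at the start of the string, to stand in for the caret that was dropped from the pattern.

```python
import re

LINE = '"PATIENT","000099999","O OUT","DOE","JOHN","","01/01/1950"'

# Hypothetical mapping of names to 1-based field positions.
FIELD_MAP = {"lastName": 4, "firstName": 5, "dob": 7}


def field(line, n):
    """Return the n-th quoted field (1-based) without its quotes, or None."""
    # Skip n - 1 '"...",' items without capturing, then capture the next one.
    pattern = r'(?:"[^"]*",){%d}"([^"]*)"' % (n - 1)
    m = re.match(pattern, line)  # anchored at the start of the line
    return m.group(1) if m else None


record = {name: field(LINE, position) for name, position in FIELD_MAP.items()}
print(record)
```

An empty field comes back as an empty string, while a field number past the end of the line comes back as None, so the two cases can be told apart.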