Capture function call arguments
Last post 08-15-2008, 1:35 PM by kajic. 23 replies.
-
07-27-2008, 4:56 PM |
-
kajic
-
-
-
Joined on 04-18-2007
-
-
Posts 24
-
-
|
Capture function call arguments
Hi, I would like to parse some source code files for a particular function call. I am interested in the arguments sent to the function "Language::get()". Consider this code snippet:
$hello = Language::get('key1', array('firstName',$firstName), array('lastName', $obj->getLastName($arg1->get(), $arg2)) , array('something', $something)).Language::get('key2', array('gender', $gender));
$bar = Language('key3');
In this case I would expect the regexp to capture the following:
'key1', array('firstName', $firstName), array('lastName', $obj->getLastName($arg1->get(), $arg2)), array('something', $something)
'key2', array('gender', $gender)
'key3'
I have been attempting to construct a regexp, but I run into one of two problems: either I cannot stop matching where one function call ends and another begins, or I cannot continue past parentheses that are part of the arguments sent to the function. I have been able to capture everything up to the last parenthesis (i.e. ignore parentheses that are part of the arguments), but not to stop when the actual call to Language::get has ended and a new function call has begun. In that case my regexp matches through to the end of the second function call, all the way to the last ) of that call. My current regexp is shown below. I have made a lot of different attempts, but this is the simplest one that actually tackles one of the two described problems:
Language::get\((.*)\)
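To make the two failure modes concrete, here is a small PHP sketch. It is not from the original post: the shortened sample line and the variable names are illustrative assumptions. It runs the greedy pattern above and a lazy variant of it against two chained calls.

```php
<?php
// Shortened, illustrative version of the snippet from the post.
$code = <<<'PHP'
$hello = Language::get('key1', array('a', $a)).Language::get('key2', array('b', $b));
PHP;

// Greedy: .* grabs the rest of the line, then backs off to the LAST ")".
preg_match('/Language::get\((.*)\)/', $code, $greedy);

// Lazy: .*? stops at the FIRST ")", which closes the nested array( call.
preg_match('/Language::get\((.*?)\)/', $code, $lazy);

echo $greedy[1], "\n"; // runs on into the second Language::get call
echo $lazy[1], "\n";   // cut short inside the first call
```

Neither quantifier can tell which ) closes the call, which is why some form of parenthesis counting is needed.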
My second problem is the following: I will need another regexp to parse each of the three matches above. This time I need to figure out the name and value of each captured argument. For example, parsing the following:
'key1', array('firstName', $firstName), array('lastName', $obj->getLastName($arg1->get(), $arg2)), array('something', $something)
I would expect the regexp to create two capturing groups, where the first captures the names:
firstName
lastName
something
and the second captures the values:
$firstName
$obj->getLastName($arg1->get(), $arg2)
$something
Here the problem has been to not stop matching a value too early if a closing parenthesis is part of that value. I.e. when I match $obj->getLastName($arg1->get(), $arg2) my regexp stops as soon as it encounters the ) that is part of get(). The regexp looks like this:
, array\('([^']*)', ([^)]*)\)
I have been experimenting with lookahead / lookbehind, as it feels like the right way to solve these problems, but I have had very little luck there. I am doing this with two regular expressions because I think there is no way to do all this with a single expression, but if I can actually do it with one it would be great. I appreciate any help / pointers you can give me with these problems. Thanks!
/Robert Kajic
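The early stop described above can be reproduced with a short PHP sketch. The variable names are illustrative; the pattern and the sample arguments are the ones quoted in the post.

```php
<?php
$args = <<<'PHP'
'key1', array('firstName', $firstName), array('lastName', $obj->getLastName($arg1->get(), $arg2)), array('something', $something)
PHP;

// The value group is "anything up to the first ')'", so a nested call
// such as get() ends the value prematurely.
preg_match_all('/, array\(\'([^\']*)\', ([^)]*)\)/', $args, $m);

print_r($m[1]); // names
print_r($m[2]); // values; the second one is truncated at get(
```

The names come out as expected, but the second value ends at the ) of get() rather than at the ) that closes array(.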
|
|
-
07-27-2008, 7:52 PM |
|
|
Re: Capture function call arguments
What platform, regex engine and language are you using? The problem you are having can be solved using recursion (PCRE), balancing groups (.NET) or additional coding (Perl). (By the way, the 'Posting Guidelines' in the sticky note at the top of this forum list the things we really need to know before we can help you.) In general, regexes cannot count, so they cannot find (for example) 'matching parentheses', which is what you are trying to do. However, there are language extensions (listed above) which can be used for this. I would also suggest that you search Google for the phrase 'matching parentheses' along with the name of your regex engine; you should find many examples of how to do this. Once we know which style of solution to apply, we can help you further. Susan
|
|
-
07-28-2008, 4:02 AM |
-
kajic
|
Re: Capture function call arguments
Thanks for your answer, Susan. I have to admit I didn't read the sticky; I thought I would be able to tell you everything you needed without doing so, but I was wrong. Sorry. I am using PHP, so I have access to PCRE-style regexes. There is also a POSIX Extended-style regex library, but in my case it doesn't provide the functionality I need. I intend to use preg_replace_callback for my first regexp and preg_match_all for the second (perhaps this information is irrelevant, but I guess it doesn't hurt to tell you). I hope I have provided you with enough information about my platform and its regex capabilities to help me. In the meantime I will continue searching for a solution; thanks for pointing me to the search phrase "matching parentheses". /Robert Kajic
|
|
-
07-28-2008, 7:33 AM |
-
ddrudik
-
-
-
Joined on 05-24-2007
-
USA
-
Posts 2,079
-
-
|
Re: Capture function call arguments
This worked for me with your sample text saved as file.txt:
<?php
$file=file_get_contents('file.txt');
$names=Array();
$values=Array();

function getfunc($match) {
    global $results;
    echo '<pre>'.$match[1].'<hr>';
    $results[]=preg_split('/,(?![^()]*\))/',$match[1]);
}

// capture the argument list of each Language::get(...) call
preg_match_all('/Language::get\((.*?)(?=(?:\).)?Language::get|\);)/is',$file,$lines);
for ($i = 0; $i < count($lines[1]); $i++) {
    // pull the name and value out of each array('name', value) argument
    preg_match_all('/array\(\s*\'([^\']*)\'\s*,\s*(.*?)(?=\)(?:\s*,\s*array\(|\s*$))/i',$lines[1][$i],$line);
    $names=array_merge($names, $line[1]);
    $values=array_merge($values, $line[2]);
}
echo '<pre>'.print_r($names,true).print_r($values,true);
?>
|
|
-
07-28-2008, 10:00 AM |
-
kajic
|
Re: Capture function call arguments
Thanks for your help! I noticed I had to change the first regexp into
Language::get\((.*?)(?=(?:\).)?Language::get|\)(?:;|\.|\+))
in order for it to stop matching if I chained Language::get() with some other non-Language::get method. For example, like this:
$bar = Language::get('key3').substr($hello, 0, 1);
Other than this I haven't found anything wrong with the expression. The second one seems to work just fine as is. I kind of understand that you are using the same technique in both expressions, but I am not quite able to follow them to the end. I would love it if you could walk me through the expressions and their components so that I can understand them better. /Robert Kajic
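The difference between the two lookaheads can be checked with a short PHP sketch. The variable names are illustrative; both patterns and the sample line are the ones quoted above.

```php
<?php
$code = <<<'PHP'
$bar = Language::get('key3').substr($hello, 0, 1);
PHP;

// ddrudik's original pattern: only ");" (or another Language::get) ends the capture.
$original = '/Language::get\((.*?)(?=(?:\).)?Language::get|\);)/is';
// The modified pattern: ")" followed by ";", "." or "+" also ends the capture.
$modified = '/Language::get\((.*?)(?=(?:\).)?Language::get|\)(?:;|\.|\+))/is';

preg_match($original, $code, $a);
preg_match($modified, $code, $b);

echo $a[1], "\n"; // keeps going until the ");" at the end of the line
echo $b[1], "\n"; // stops before ")."
```

One trade-off, as far as I can tell: with the modified lookahead, a ). inside the arguments themselves would also end the capture.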
|
|
-
07-28-2008, 10:54 AM |
-
ddrudik
|
Re: Capture function call arguments
(?is-mx:Language::get\((.*?)(?=(?:\).)?Language::get|\);))
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?is-mx:                 group, but do not capture (case-insensitive) (with . matching \n) (with ^ and $ matching normally) (matching whitespace and # normally):
Language::get            'Language::get'
\(                       '('
(                        group and capture to \1:
.*?                      any character (0 or more times (matching the least amount possible))
)                        end of \1
(?=                      look ahead to see if there is:
(?:                      group, but do not capture (optional (matching the most amount possible)):
\)                       ')'
.                        any character
)?                       end of grouping
Language::get            'Language::get'
|                        OR
\)                       ')'
;                        ';'
)                        end of look-ahead
)                        end of grouping
----------------------------------------------------------------------
(?i-msx:array\(\s*'([^']*)'\s*,\s*(.*?)(?=\)(?:\s*,\s*array\(|\s*$)))
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?i-msx:                 group, but do not capture (case-insensitive) (with ^ and $ matching normally) (with . not matching \n) (matching whitespace and # normally):
array                    'array'
\(                       '('
\s*                      whitespace (\n, \r, \t, \f, and " ") (0 or more times (matching the most amount possible))
'                        '\''
(                        group and capture to \1:
[^']*                    any character except: ''' (0 or more times (matching the most amount possible))
)                        end of \1
'                        '\''
\s*                      whitespace (\n, \r, \t, \f, and " ") (0 or more times (matching the most amount possible))
,                        ','
\s*                      whitespace (\n, \r, \t, \f, and " ") (0 or more times (matching the most amount possible))
(                        group and capture to \2:
.*?                      any character except \n (0 or more times (matching the least amount possible))
)                        end of \2
(?=                      look ahead to see if there is:
\)                       ')'
(?:                      group, but do not capture:
\s*                      whitespace (\n, \r, \t, \f, and " ") (0 or more times (matching the most amount possible))
,                        ','
\s*                      whitespace (\n, \r, \t, \f, and " ") (0 or more times (matching the most amount possible))
array                    'array'
\(                       '('
|                        OR
\s*                      whitespace (\n, \r, \t, \f, and " ") (0 or more times (matching the most amount possible))
$                        before an optional \n, and the end of the string
)                        end of grouping
)                        end of look-ahead
)                        end of grouping
----------------------------------------------------------------------
|
|
-
07-28-2008, 10:58 AM |
-
ddrudik
|
Re: Capture function call arguments
And, your pattern reads as: (?is-mx:Language::get\((.*?)(?=(?:\).)?Language::get|\)(?:;|\.|\+)))
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?is-mx:                 group, but do not capture (case-insensitive) (with . matching \n) (with ^ and $ matching normally) (matching whitespace and # normally):
Language::get            'Language::get'
\(                       '('
(                        group and capture to \1:
.*?                      any character (0 or more times (matching the least amount possible))
)                        end of \1
(?=                      look ahead to see if there is:
(?:                      group, but do not capture (optional (matching the most amount possible)):
\)                       ')'
.                        any character
)?                       end of grouping
Language::get            'Language::get'
|                        OR
\)                       ')'
(?:                      group, but do not capture:
;                        ';'
|                        OR
\.                       '.'
|                        OR
\+                       '+'
)                        end of grouping
)                        end of look-ahead
)                        end of grouping
----------------------------------------------------------------------
Note that you can shorten (?:;|\.|\+) to [;.+] without issue.
|
|
-
07-28-2008, 12:29 PM |
-
kajic
|
Re: Capture function call arguments
Ah. I can get similar output from RegexBuddy, but I don't find it very informative :) It's only when I actually understand a regexp that the verbose output "makes sense" :) Anyway, thanks for your help.
|
|
-
07-28-2008, 7:02 PM |
-
kajic
|
Re: Capture function call arguments
I have been going over the expressions you gave me and I think I am starting to understand the first one a little better (it's a little modified):
Language::get\((.*?)(?=(?:\).)?Language::get|\)[;+])\)
Here is my "theory"; please correct me where I'm wrong.
The basic idea seems to be to force the lazy dot group to continue matching until the lookahead is satisfied. The lookahead in turn is constructed in such a way that it will first match the "end" of the function call, i.e. ); If another call to Language::get is located just after the end of the first, then the first option in the lookahead (Language::get) will kick in and prevent the lazy dot group from matching further. The part that I can't quite understand is (?:\).)? What is the practical meaning of that? Please explain :) When I fully understand the first regexp I will move on to the second, and chances are I will come back with more questions then :)
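One way to see what the optional group contributes is to run the lookahead with and without it. This is only a sketch under assumptions: the two-call sample line and the variable names are made up for illustration, and the patterns are trimmed-down versions of the one discussed above.

```php
<?php
$code = <<<'PHP'
$x = Language::get('a').Language::get('b');
PHP;

// Without the optional group, the capture only stops once "Language::get"
// is the very next thing, so the trailing ")." ends up inside the capture.
preg_match_all('/Language::get\((.*?)(?=Language::get|\);)/s', $code, $without);

// With (?:\).)? the lookahead may also skip ")" plus one character first,
// so the capture can stop in front of the closing parenthesis.
preg_match_all('/Language::get\((.*?)(?=(?:\).)?Language::get|\);)/s', $code, $with);

print_r($without[1]);
print_r($with[1]);
```

In this sample the first capture is 'a'). without the optional group and 'a' with it; the second capture is 'b' in both cases.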
|
|
-
07-28-2008, 8:01 PM |
|
|
Re: Capture function call arguments
Taking the example directly from the PCRE manual, the pattern:
#\(((?>[^()]+)|(?R))*\)#
when applied to your test text of:
$hello = Language::get('key1', array('firstName',$firstName), array('lastName', $obj->getLastName($arg1->get(), $arg2)) , array('something', $something)).Language::get('key2', array('gender', $gender)); $bar = Language('key3');
generates the matches:
[0] => array
    [0] => ('key1', array('firstName',$firstName), array('lastName', $obj->getLastName($arg1->get(), $arg2)) , array('something', $something))
    [1] => ('key2', array('gender', $gender))
    [2] => ('key3')
[1] => array
    [0] => ('something', $something)
    [1] => ('gender', $gender)
    [2] => 'key3'
As you can see, each match runs from an opening parenthesis to its matching closing parenthesis, even if there are intervening parentheses. (Don't worry about the match group #1 captures at the moment.)
I don't have time right now to address the 2nd part of your question, but the basic pattern will be the same. Susan
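For anyone who wants to reproduce this, here is a minimal PHP sketch of running that pattern with preg_match_all. The variable names are illustrative, and the test text is assumed to sit on a single line.

```php
<?php
$text = <<<'PHP'
$hello = Language::get('key1', array('firstName',$firstName), array('lastName', $obj->getLastName($arg1->get(), $arg2)) , array('something', $something)).Language::get('key2', array('gender', $gender)); $bar = Language('key3');
PHP;

// (?R) re-enters the whole pattern, so every nested "(" has to find its
// own ")" before the outer match is allowed to close.
preg_match_all('#\(((?>[^()]+)|(?R))*\)#', $text, $m);

// $m[0] holds the full balanced groups, one per top-level "(...)".
foreach ($m[0] as $group) {
    echo $group, "\n";
}
```

Note that the pattern matches every top-level parenthesised group, not only those that follow Language::get.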
|
|
-
07-28-2008, 11:03 PM |
-
ddrudik
|
Re: Capture function call arguments
Note that the function block in my code example can be removed; it was unused code from a previous test with preg_replace_callback which wasn't needed in the solution. When I read back through the pattern I tested and submitted, I can't find a good reason for the unescaped dot. My best guess is that I meant to escape it with \ or enclose it in [ ] to match a literal dot. Unfortunately I didn't notice my typo, because an unescaped . also matches a literal dot. It seems you have a good understanding of the (?=) construct. There are a number of different ways to tackle this, but the lookahead is used in this case to determine what constitutes the end of one match and the start of another; without a lookahead, the pattern couldn't easily separate the matches in this case. There are likely other patterns that would work equally well, if not better, than those that worked for me; if nothing else, maybe my code gave you a logic framework to work from. Aussie Susan's recursive pattern is a useful one in this case as it fits your matching-parens content well; however, I would pair it with something else to make sure that only the Language::get parens groups are returned.
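One possible way to do that pairing, sketched here as an assumption rather than as the approach anyone in this thread actually used: put the balanced-parens part in its own capturing group and recurse into that group with (?1) instead of (?R), so the Language::get prefix stays outside the recursion. The sample text and variable names are made up for illustration.

```php
<?php
// Illustrative sample; the chained and nested calls are assumptions.
$code = <<<'PHP'
$hello = Language::get('key1', array('lastName', $obj->getLastName($arg1->get(), $arg2))).Language::get('key2', array('gender', $gender)).somefunc('test');
$bar = NotLanguage::get('skip');
$baz = Language::get('key3');
PHP;

// Group 1 is one balanced "(...)" block. (?1) recurses into group 1 only,
// so nested parentheses do not need a Language::get prefix of their own.
// \b keeps NotLanguage::get from matching.
$pattern = '/\bLanguage::get(\((?:[^()]++|(?1))*\))/';
preg_match_all($pattern, $code, $m);

// Drop the outermost parentheses of each block.
$args = array();
foreach ($m[1] as $block) {
    $args[] = substr($block, 1, -1);
}
print_r($args);
```

This sketch does not look inside string literals, so a parenthesis inside a quoted argument such as 'a)' would throw the counting off.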
|
|
-
07-29-2008, 7:06 AM |
-
kajic
|
Re: Capture function call arguments
Aussie Susan, I think I will have to look into recursive patterns in order to understand how your example works. Thank you for showing it to me. By the way, could you recommend a tool similar to RegexBuddy that supports recursive patterns? I find RegexBuddy indispensable when testing expressions and it would be nice if I didn't have to resort to creating a test script in php in order to test a recursive pattern.
ddrudik, I can assure you that your examples have been very helpful. Not least in helping me to better understand how positive look ahead works. It felt really good when I finally realized the fact that the look ahead forced the lazy capturing group to keep on matching :)
I have been experimenting a little more with the first expression and here are some thoughts. It seems that the expression
Language::get\((.*?)(?=(?:\)\.)?Language::get|\);)\)
has the benefit of allowing ). to be captured, but on the other hand it will not properly stop matching if Language::get(..) is chained with anything other than another Language::get.
The more specific
Language::get\((.*?)(?=Language::get|\)[;.])\)
is able to stop matching when Language::get is chained, but it doesn't allow ). to be captured.
Ideally I would like to combine the behavior of both expressions, so that the new expression allows ). to be captured but also stops when Language::get is chained with something else. This is not a crucial feature for my application, but if there is some obvious and simple way to achieve it, please tell me about it.
|
|
-
07-29-2008, 9:05 AM |
-
ddrudik
|
Re: Capture function call arguments
kajic, the more text samples the better when testing patterns. Please provide a more extensive text example containing more Language::get blocks chained to each other and chained to other functions, as well as nested ( ) unrelated to Language::get. Consider this code (commented and echo'd to screen throughout to help explain):
<?php
// get the file contents into var
$file=file_get_contents('file.txt');
echo '<pre>$file before replacement:<br>'.$file.'<hr>';

// initialize vars
$names=Array();
$values=Array();
$languageblocksbalancedparensarray=Array();
$i=0;

// this function is to replace the balanced parens blocks with placeholders
function replacebalancedparens($match) {
    global $i;
    $replacement=chr(1).$i.chr(2);
    $i++;
    return $replacement;
}

// Aussie Susan's PCRE balanced parens groups pattern
$balancedparenspattern='/\(((?>[^()]+)|(?R))*\)/';

// create an array from the balanced parens groups
preg_match_all($balancedparenspattern,$file,$balancedparensarray);
echo '$balancedparensarray:<br>'.print_r($balancedparensarray,true).'<hr>';

// replace the balanced parens groups with placeholders for the lookaround operation
$file=preg_replace_callback($balancedparenspattern,'replacebalancedparens',$file);
echo '<pre>$file after replacement:<br>'.$file.'<hr>';

// lookbehind/lookahead operation to only match the placeholders preceded by Language::get
$languagegetblockspattern='/(?<=\bLanguage::get\x1)\d+(?=\x2)/i';
preg_match_all($languagegetblockspattern,$file,$languageblocksarray);
echo '$languageblocksarray:<br>'.print_r($languageblocksarray,true).'<hr>';

// populate an array with the balanced parens groups that are preceded by Language::get
foreach ($languageblocksarray[0] as $value) {
    $languageblocksbalancedparensarray[]=$balancedparensarray[0][$value];
}
echo '$languageblocksbalancedparensarray:<br>'.print_r($languageblocksbalancedparensarray,true).'<hr>';

// match on array constructs and populate $names and $values arrays
foreach ($languageblocksbalancedparensarray as $arrays) {
    // remove the outer-most parens to process
    $arrays=substr(substr($arrays, 1), 0, -1);
    // match on the array constructs, matching the outer-most closing parens with \)(?![^()]*\))
    preg_match_all("/array\s*\(\s*'([^']*)'\s*,\s*(.*?)\s*\)(?![^()]*\))/is",$arrays,$arrayparts);
    // populate $names and $values arrays
    $names=array_merge($names, $arrayparts[1]);
    $values=array_merge($values, $arrayparts[2]);
}
echo '$names array:<br>'.print_r($names,true).'<hr>$values array:<br>'.print_r($values,true).'<hr>';
?>
Tested against this file.txt file:
$hello = Language::get('key1', array('firstName',$firstName), array('lastName', $obj->getLastName($arg1->get(), $arg2)) , array('something', $something)).Language::get('key2', array('gender', $gender)).somefunc('test');
$test = foo('test');
$test2 = array(array('test',$test));
$bar = NotLanguage::get('key3');
$bar = Language::get('key3');
|
|
-
07-29-2008, 7:21 PM |
|
|
Re: Capture function call arguments
I don't know RegexBuddy myself, but looking at their web page, they say that they use PCRE as one of the regex engines (along with .NET and several other languages). I'm not sure if this means that it actually uses the regex engine or that it can translate what you have into the appropriate syntax. However, you may be able to test my suggested pattern directly by selecting the appropriate engine. Note that the recursive syntax is a PCRE extension and is not recognised by many other regex engines, hence my original question about making sure we knew what you were using. Susan
|
|
-
07-30-2008, 5:08 AM |
-
kajic
|
Re: Capture function call arguments
Susan, I don't think RegexBuddy has any such option. What program do you use to test your expressions?
ddrudik, ah, I think I partially understand the recursive expression \(((?>[^()]+)|(?R))*\) now. The part I don't understand is why the matching group is made atomic. What is the benefit of that? By the way,
$foo = Language::get('key1', array('argument1', $arrray[0]->getPart0().$obj->getPart1().$obj->getPart2()));
is not matchable by your new "array constructs" pattern. The original pattern
array\(\s*\'([^\']*)\'\s*,\s*(.*?)(?=\)(?:\s*,\s*array\(|\s*$))
is able to match it just fine, though. Thanks again for all your help, both of you. You've given me a lot to think about :)
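As a possible alternative for the name/value step, the same recursion idea can be applied to the array constructs. This is a sketch under assumptions, not either of the patterns posted in this thread: the value is described as a run of non-parenthesis text and balanced (...) blocks, so it ends at the ) that closes array(. The variable names and the shortened argument list are illustrative.

```php
<?php
$args = <<<'PHP'
'key1', array('lastName', $obj->getLastName($arg1->get(), $arg2)), array('argument1', $arrray[0]->getPart0().$obj->getPart1().$obj->getPart2())
PHP;

// Group 1: the name. Group 2: the value, built from non-paren runs and
// balanced "(...)" blocks. Group 3 is the balanced block that (?3)
// recurses into; it is only a helper and is ignored afterwards.
$pattern = '/array\(\s*\'([^\']*)\'\s*,\s*((?:[^()]++|(\((?:[^()]++|(?3))*+\)))*+)\)/';
preg_match_all($pattern, $args, $m);

// Pair each name with its value.
$pairs = array_combine($m[1], $m[2]);
print_r($pairs);
```

Like the other recursive sketches, this does not account for parentheses or quotes inside string literals, and a value that ends in whitespace would need a trim().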
|
|
Page 1 of 2 (24 items)