
preg_replace whitespaces into underscore between double quotes

Last post 08-30-2012, 7:08 PM by Aussie Susan. 7 replies.
  •  08-17-2012, 5:36 AM 86192

    preg_replace whitespaces into underscore between double quotes

    Hi,

    I need a preg_replace pattern to do the following:

    - change all whitespaces between words between double quotes into underscore

    In this forum I found a solution

     (?=[^"]*"([^"]*"[^"]*")*$)

    (with trailing space) but it does not seem to work with preg_replace. Can anybody help me out with the pattern?

    Thanks,

    Peter

  •  08-19-2012, 8:24 PM 86221 in reply to 86192

    Re: preg_replace whitespaces into underscore between double quotes

    Try:

    \x20(?![^"]*("[^"]*"[^"]*)*$)

    To understand the way it works you need to go back to some basics.

    How do you determine if you are within double quotes or not? Given that many regex variants don't allow variable-length lookbehind expressions, the easy way is to check whether there is an even or odd number of double quotes between where you are and what you consider to be a suitable "end" point. In this case we will assume the end of the entire string, but there could be performance issues if we use that for very long blocks of text.
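
    The parity idea can be sketched without any regex at all. The snippet below is only an illustration in Python (the helper name inside_quotes is made up for this example) and, like the rest of the thread, it assumes the double quotes are correctly paired:

```python
def inside_quotes(text: str, index: int) -> bool:
    """Return True if text[index] lies between a pair of double quotes.

    Counts the double quotes from just after `index` to the end of the
    string: an odd count means a closing quote is still pending.
    Assumes the quotes in `text` are correctly paired.
    """
    return text[index + 1:].count('"') % 2 == 1


sample = 'one "two three" four'
first_space = sample.index(" ")      # the space after one
inner_space = sample.index(" ", 5)   # the space between two and three
print(inside_quotes(sample, first_space))  # False
print(inside_quotes(sample, inner_space))  # True
```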

    Therefore the first step is to move forward looking for the first double-quote after the target character and stop just before it: '[^"]*'

    If we don't locate even 1 double-quote then we can't be inside a double-quoted string. If we do locate the next double-quote, then we need to keep searching to make sure that it is an "end" double-quote and not the start of one. The way to do this is to locate as many matching double-quotes as we can.

    Therefore we match the double-quote we have just found, skip forward to the next one and match that:

    "[^"]*"

    Of course, this could be followed by more characters that are not double-quotes, and so we get:

    "[^"]*"[^"]*

    As this will match pairs of double-quotes and the trailing characters after the 2nd of the pair, we need to let this repeat as many times as possible

    ("[^"]*"[^"]*)*

    At the end of this process we should be at the end of the string: the other alternative is that we are at a double-quote that is NOT part of a balanced pair and therefore would mean that the initial double-quote we found is actually the start of a pair and not the end (as we need). Therefore a check for being at the end of the text is all we need at this point. Put this all together and we get:

    [^"]*("[^"]*"[^"]*)*$
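
    As a quick sanity check, this sub-pattern can be tried on the text that would follow a candidate space. The sketch below uses Python's re module purely as a stand-in for the PHP functions; the helper name even_quotes_follow is hypothetical:

```python
import re

# Sub-pattern from the post: succeeds only when the remaining text
# holds an even number (including zero) of double quotes.
EVEN_QUOTES_TO_END = re.compile(r'[^"]*("[^"]*"[^"]*)*$')


def even_quotes_follow(rest: str) -> bool:
    # re.match anchors at the start of `rest`; the `$` anchors the end.
    return EVEN_QUOTES_TO_END.match(rest) is not None


print(even_quotes_follow('no quotes here'))    # True
print(even_quotes_follow('"two three" four'))  # True
print(even_quotes_follow('three" four'))       # False
```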

    As we don't want to actually capture all of those characters, we need to use a lookahead. Remember that what that pattern will do is to successfully match pairs of double-quotes (i.e. an even number of double-quotes): what we want to do is the exact opposite - find an odd number of double-quotes. Therefore we can use a negative lookahead to "flip" the result of the lookahead's match. When we add the check for the target character (the space in this case - and I've used the '\x20' form to make it easy to see that a space is being used) we get:

    \x20(?![^"]*("[^"]*"[^"]*)*$)
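
    A minimal sketch of the replacement, written in Python as a stand-in (with preg_replace the same pattern and a "_" replacement string would be used). The function name and the sample line are invented for the example:

```python
import re

# A space that is followed by an odd number of double quotes,
# i.e. a space inside a quoted section (quotes assumed to be paired).
INSIDE_QUOTES_SPACE = re.compile(r'\x20(?![^"]*("[^"]*"[^"]*)*$)')


def underscore_quoted_spaces(text: str) -> str:
    # Only the space itself is matched, so only it is replaced.
    return INSIDE_QUOTES_SPACE.sub("_", text)


print(underscore_quoted_spaces('run "my first arg" then "second arg" done'))
# run "my_first_arg" then "second_arg" done
```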

    Now, if you look at your pattern, and analyse what it is doing, you will see why it is possibly not working as you would expect. If we look at the inner-group:

    ([^"]*"[^"]*")*$

    you will see that it will skip non-double-quote characters, match the double-quote, skip more characters and end up matching the 2nd double-quote. At this point it will have 2 options: the first is to repeat the sub-pattern and find the end of the next pair of double-quotes, and the second is to be at the end of the string. The implication of this is that the string MUST end with a double-quote, i.e. with the double-quoted characters.

    My suggestion does not have this requirement as the order of the components in the repeating group is moved around a little to allow non-quoted characters to follow the ending double-quote. Such trailing characters may or may not be present; neither case is required. However, by moving the double-quote character to the start of the repeating group, I had to prefix that with something that would get me to the first double-quote but would NOT be part of the repeating group.
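
    The difference shows up as soon as the line does not end with a closing double-quote. This is a rough comparison in Python (standing in for the PHP functions); it assumes the original pattern was used with the space placed in front of the lookahead, and the sample strings are made up:

```python
import re

# Original pattern from the question (space assumed to precede the lookahead).
ORIGINAL = re.compile(r' (?=[^"]*"([^"]*"[^"]*")*$)')
# Suggested pattern with the reordered repeating group.
REORDERED = re.compile(r'\x20(?![^"]*("[^"]*"[^"]*)*$)')

ends_with_quote = 'say "hello world"'
trailing_text = 'say "hello world" now'

print(ORIGINAL.sub("_", ends_with_quote))  # say "hello_world"
print(ORIGINAL.sub("_", trailing_text))    # say "hello world" now
print(REORDERED.sub("_", trailing_text))   # say "hello_world" now
```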

    I hope all of this makes sense! I've tried it on some dummy text that I made up (actually the text from another question in this forum that used quoted strings as well) in a regex test platform that uses the PHP functions and it seems to work as you require. However it would have been useful to have some text sample to work with.

    Susan

     

  •  08-21-2012, 9:29 AM 86243 in reply to 86221

    Re: preg_replace whitespaces into underscore between double quotes

    Dear Susan,

    Thank you for the pattern and especially for the explanation. Actually there is no sample text; this regex pattern is used to create arguments for a function where the elements of the arguments are separated by spaces (that's why I have to use the underscore).

    The performance issues are not a great risk, because the whole string to be processed is limited to at most 150 characters.

    One more optional requirement for the pattern: Would it be possible to also remove the double quotes with the pattern after changing the given spaces into underscores?

     

    Thank you very much!
    Peter

  •  08-21-2012, 7:39 PM 86250 in reply to 86243

    Re: preg_replace whitespaces into underscore between double quotes

    Peter,

    You cannot remove the double-quotes using the same pattern as the one to change the spaces into underscores. This is because the one I suggested works by matching each individual space character in the text - only after it has found a space does it then check to see if it is within the double quotes. That is why the replacement string is just the single underscore character (and I now realise I didn't go into that part but I assume you know that aspect already). [Don't get confused by the fact that the lookahead checks for multiple characters - only the characters matched in the main part of the pattern (and not any lookahead/lookbehind) are part of the matched string used for replacement purposes.]

    Also, if you did match the text from the opening to the closing double-quote when you found the first suitable whitespace, then you would not be able to locate any subsequent spaces in the same section of text.

    However, you can remove the double-quotes as a separate pass with a pattern such as:

    "([^" ]*)"

    This will locate the first double-quote, scan forward until it finds either a space or the next double-quote and then require that the next character really IS a double-quote (and not a space; there is a 3rd alternative which is the end of the text but that is like finding another space in this situation). If the last character is NOT another double-quote then the first one cannot have been part of a "pair" and so the match will fail. However if it IS a double-quote, then you can replace the entire match (which will include the opening and closing quotes) with the text captured in match group #1 (the text without the double-quotes).

    What I have assumed here is that the first pass will have removed all space characters between the double quotes. Therefore there should be a contiguous series of characters in between that does NOT have a space. This is really just an attempt to double-check that we have found the correct pair of quotes.
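
    A small sketch of this second pass, again in Python as a stand-in for preg_replace (where the replacement string would refer to group 1). The function name and sample text are assumptions made for the example:

```python
import re

# An opening quote, a run of characters that are neither quotes nor
# spaces, and a closing quote; group 1 holds the text in between.
QUOTED_NO_SPACES = re.compile(r'"([^" ]*)"')


def strip_quote_pairs(text: str) -> str:
    # Replace the whole match (quotes included) with group 1 only.
    return QUOTED_NO_SPACES.sub(r"\1", text)


print(strip_quote_pairs('run "my_first_arg" then "second_arg" done'))
# run my_first_arg then second_arg done

# If the first pass was skipped, a space is still inside the quotes
# and the pattern deliberately leaves the text alone.
print(strip_quote_pairs('a "b c" d'))
# a "b c" d
```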

    Which brings me to another point I didn't mention before. There is a large assumption that the double-quote characters are correctly paired and you don't have text such as

    Hello "she said" to "the large dog

    However it should handle text such as

    She said "Hello ""Sam"" and how are you" to the large dog

    but you do need to be on the lookout for such oddities and perhaps extend the pattern to account for any you may encounter in your situation.
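
    To see what happens with the two sample lines above, here is a short Python experiment (Python's re standing in for the PHP functions). The quotes_are_paired helper is a hypothetical guard, not part of the pattern itself:

```python
import re

INSIDE_QUOTES_SPACE = re.compile(r'\x20(?![^"]*("[^"]*"[^"]*)*$)')


def quotes_are_paired(text: str) -> bool:
    # Hypothetical guard: an even count is necessary for correct pairing.
    return text.count('"') % 2 == 0


unbalanced = 'Hello "she said" to "the large dog'
doubled = 'She said "Hello ""Sam"" and how are you" to the large dog'

print(quotes_are_paired(unbalanced))  # False
# With an odd number of quotes the inside/outside decision is inverted:
print(INSIDE_QUOTES_SPACE.sub("_", unbalanced))
# Hello_"she said"_to_"the large dog
print(INSIDE_QUOTES_SPACE.sub("_", doubled))
# She said "Hello_""Sam""_and_how_are_you" to the large dog
```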

    Susan

  •  08-29-2012, 7:11 AM 86340 in reply to 86250

    Re: preg_replace whitespaces into underscore between double quotes

    Susan,

     

    Thank you very much. Sorry for the delay, I was on holiday. I feel a continuously increasing pressure to learn these regex patterns because they are so useful. As for your point regarding unpaired double quotes: there is a check on whether the number of double quotes is odd or even, and as far as I know it cannot be more than 4 (only two arguments can consist of several words).

    While trying to follow and understand what you said, one more question came to mind. My task is to convert one line of text into the arguments of my function. This line can contain words separated by spaces, and groups of words enclosed in double quotes. Because of this we need to change the spaces between words inside double quotes into underscores. After this change some trim(), implode(), etc. PHP functions are used to generate an array in which the arguments are the elements. So it is done by replacing spaces with underscores, replacing double quotes with nothing, trimming excess whitespace, and creating the array.
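
    The sequence of steps described above might look roughly like this. It is only a sketch in Python (standing in for the PHP string and regex functions); the function name line_to_args and the sample line are invented, and the quotes are assumed to be paired:

```python
import re


def line_to_args(line: str) -> list:
    # 1. Spaces inside double quotes become underscores.
    line = re.sub(r'\x20(?![^"]*("[^"]*"[^"]*)*$)', "_", line)
    # 2. Double quotes are replaced with nothing.
    line = line.replace('"', "")
    # 3. split() with no argument trims the ends and splits on
    #    runs of whitespace, giving the array of arguments.
    return line.split()


print(line_to_args(' set "my label" 42 "other label" '))
# ['set', 'my_label', '42', 'other_label']
```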

    Wouldn't it be more efficient to use a regex pattern and preg_match to create this array in one step? The trick is that we don't know how many elements there will be, so the number of match groups is unknown. Any help is welcome, but it's absolutely understandable if you don't want to get involved so deeply.

    Your help was so great, thank you very much!
    Peter

  •  08-29-2012, 7:52 PM 86343 in reply to 86340

    Re: preg_replace whitespaces into underscore between double quotes

    Please don't take this the wrong way, Peter, but if you had described this as your problem right from the start, then things might have been a little easier. In fact we have already solved this to a large extent.

    What you are looking at here is to split the string on the space character except when the space is between double-quotes.

    If you go back to my first response you will see that I started by looking for a pattern that would tell if a space was not between double-quotes, by ensuring that an even number (including 0) of double-quotes followed the space character. I then logically negated it as the focus was on those spaces that were between double-quotes.

    Now the focus is back on the unquoted spaces as these are the ones that you want to use to split your string. However we can use the same sub-pattern:

    [^"]*("[^"]*"[^"]*)*$

    The complete pattern is now to locate a space character and check that it IS outside double-quotes as in:

    \x20(?=[^"]*("[^"]*"[^"]*)*$)

    We can now look at another function available in most regex libraries called "split" (preg_split in PHP). For this function, you use a pattern to identify a character (sequence) that you want to use to break up the text. By giving it the pattern above, it will split the line up at each space character that is NOT in double-quotes and return the resulting sub-strings as an array (of as many elements as there are sub-strings).

    The advantage of this approach is that we don't need to convert the spaces in the double-quoted strings into underscores - they are simply ignored by the pattern used to do the splitting.
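
    A minimal sketch of the split, using Python's re.split as a stand-in for the PHP function. One deliberate change from the pattern above: the inner group is written as non-capturing, (?: ... ), because Python's split would otherwise also return the captured text. The sample line is made up:

```python
import re

# A space followed by an even number of double quotes, i.e. a space
# outside any quoted section (quotes assumed to be paired).
OUTSIDE_QUOTES_SPACE = re.compile(r'\x20(?=[^"]*(?:"[^"]*"[^"]*)*$)')

print(OUTSIDE_QUOTES_SPACE.split('one "two three" four'))
# ['one', '"two three"', 'four']
```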

    There are some "disadvantages" with this approach. The first is that the above pattern will split on EACH space character so if you have the string:

     Hello  "she said" to the large dog

    (note the space at the start and the two spaces between "Hello" and "she") you will get back the array values below, where "(null)" marks an empty string:

    [0] (null)
    [1] Hello
    [2] (null)
    [3] "she said"
    [4] to
    [5] the
    [6] large
    [7] dog

    You can get rid of the multiple space issue (if it IS an issue) by altering the pattern to:

    \x20+(?=[^"]*("[^"]*"[^"]*)*$)

    This will get rid of the item [2] in the above array as the multiple spaces are now treated as a single character sequence to use for the split.

    To get rid of the initial null array element, you can either use the "trim()" (or equivalent) string function before-hand or skip over null strings in your subsequent processing of the split output array.
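
    The three variants can be compared side by side. This is a sketch in Python (re.split standing in for the PHP split function, str.strip for trim()), with the inner group made non-capturing so that Python's split does not add captured text to the result:

```python
import re

SINGLE = re.compile(r'\x20(?=[^"]*(?:"[^"]*"[^"]*)*$)')
RUN = re.compile(r'\x20+(?=[^"]*(?:"[^"]*"[^"]*)*$)')

line = ' Hello  "she said" to the large dog'

# Splitting on EACH unquoted space leaves empty strings behind.
print(SINGLE.split(line))
# ['', 'Hello', '', '"she said"', 'to', 'the', 'large', 'dog']

# Treating a run of spaces as one separator removes the middle one.
print(RUN.split(line))
# ['', 'Hello', '"she said"', 'to', 'the', 'large', 'dog']

# Trimming first removes the leading empty element as well.
print(RUN.split(line.strip()))
# ['Hello', '"she said"', 'to', 'the', 'large', 'dog']
```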

    Susan

  •  08-30-2012, 11:06 AM 86359 in reply to 86343

    Re: preg_replace whitespaces into underscore between double quotes

    Dear Susan,

     Thanks for your answer. Sorry for the misunderstanding. My only thin excuse is that I did not know that this was my problem. I just realized - by reading your explanations - that my way of thinking was not right. Again, sorry for this and thank you for all your great help.

     

    Peter

  •  08-30-2012, 7:08 PM 86363 in reply to 86359

    Re: preg_replace whitespaces into underscore between double quotes

    As I said, please don't take it the wrong way. One thing I have learned over the (all too many) years is that you solve the problem you define.

    The most important thing is that you have learned more about the (generic) capabilities of regex libraries.

    Susan
