Handling BBcode mismatched tags.

Last post 09-18-2008, 3:02 PM by prometheuzz. 13 replies.
  •  09-13-2008, 2:39 AM 46294

    Handling BBcode mismatched tags.

    I'm using PHP's preg_replace (PCRE) and I'm trying to write a regex that handles bbcode and replaces it with the HTML equivalent. Just to avoid tripping this board's bbcode, I'm going to use parentheses for mine.

    We have 4 basic tags: (b) (i) (u) and (s), along with the end tags (/b) (/i) (/u) (/s). We're all familiar with how it's used.

    The problem comes when users start mixing and matching multiple tags. I don't want them to interleave the tags, because that breaks the HTML. Here are six cases to demonstrate.

    1) Good: (b)Bold(/b)(i)Italic(/i)

    2) Good: (b)(i)Both(/i)(/b)

    3) Bad: (b)(i)Alternating tags(/b)(/i)

    4) Bad: (b)(i)Misplaced tag(/b)

    5) Good: (b)(i)(u)(s)All tags(/s)(/u)(/i)(/b)

    6) Bad: (b)(i)(u)(s)Swapped end tags(/s)(/i)(/u)(/b)

     

    I've tried the following regular expression, which is supposed to detect mismatched tags:

    \(([a-z]*?)\)(.*?)\(([a-z]*?)\)(.*?)\(\/\1\)

    or if it's easier to read, here's the actual version which uses square brackets:

    \[([a-z]*?)\](.*?)\[([a-z]*?)\](.*?)\[\/\1\]

    I then replace each match with ($1)$2($3)$4(/$3)(/$1)

    This handles cases 1, 3, and 4 fine, but it can't handle cases 2, 5, and 6 properly.

     

    So my question is: how do I handle multiple nested tags? I know how to use \1 and \2, but here the number of groups isn't fixed.

    Should I just break out the old tokenizer?

  •  09-13-2008, 4:24 AM 46297 in reply to 46294

    Re: Handling BBcode mismatched tags.

    Could you give the desired output/replacements for all six of your cases?

    1) Good: (b)Bold(/b)(i)Italic(/i)

    2) Good: (b)(i)Both(/i)(/b)

    3) Bad: (b)(i)Alternating tags(/b)(/i)

    4) Bad: (b)(i)Misplaced tag(/b)

    5) Good: (b)(i)(u)(s)All tags(/s)(/u)(/i)(/b)

    6) Bad: (b)(i)(u)(s)Swapped end tags(/s)(/i)(/u)(/b)

     

    And in what programming language are you doing this?

  •  09-13-2008, 4:46 AM 46298 in reply to 46297

    Re: Handling BBcode mismatched tags.

    Try this:

    (?xis)
    (
         \((b|u|i|s)\)
        (?:(?:(?!\(/?(?2)\)).)++|(?1))+
         \(/\2\)
    )
    (?!\s*\(/(?2)\))


    http://portal-vreme.ro
  •  09-13-2008, 4:54 AM 46299 in reply to 46297

    Re: Handling BBcode mismatched tags.

    @KillahBeez: I'm not having much luck implementing what you wrote. Currently I'm using PHP's preg_replace('/\[([a-z]*?)\](.*?)\[([a-z]*?)\](.*?)\[\/\1\]/is','($1)$2($3)$4(/$3)(/$1)',$text);

    Could you clue me in on what I should put in there? I gather the leading (?xis) means that I put /xis at the end of the regex. Is the third section the replacement string? I tried preg_replace('/\[(b|u|i|s)\](?:(?:(?!\[/?(?2)\]).)++|(?1))+\[/\2\]/xis','(?!\s*\[/(?2)\])',$text); but it just errored with "Unknown modifier '?'".

     

    @prometheuzz

    Sorry, PHP, using preg_replace, which is PCRE. So every regex I say here also starts with / and ends with /is

    Now for the desired output. I plan to run through 2 replacements. One to ensure proper nesting, and the second to replace with the actual html tags. The second half is done, so here I'll just give the desired output for the first replacement (anything in bold indicates what changed):

    1) (b)Bold(/b)(i)Italic(/i)

    2) (b)(i)Both(/i)(/b)

    3a) (b)(i)Alternating tags(/i)(/b)

    3b) (b)(i)Alternating tags(/i)(/b)(/i)

    (either would be acceptable, since the second replacement algorithm ignores unmatched end tags)

    4a) (b)(i)Misplaced tag(/i)(/b)

    4b) (b)Misplaced tag(/b)

    4c) (b)(i)Misplaced tag(/b)

    (any would be acceptable, since the second replacement algorithm would simply ignore the unmatched (i) tag)

    5) (b)(i)(u)(s)All tags(/s)(/u)(/i)(/b)

    6a) (b)(i)(u)(s)Swapped end tags(/s)(/u)(/i)(/u)(/b)

    6b) (b)(i)(u)(s)Swapped end tags(/s)(/u)(/i)(/b)

    6c) (b)(u)(i)(s)Swapped end tags(/s)(/i)(/u)(/b)

    6d) (b)(i)(u)(s)Swapped end tags(/s)(/u)(/i)(/b)

    (any combination thereof, as long as all matched start and end tags are properly nested. As above, unmatched tags are acceptable, since the second replacement algorithm would ignore them)

     

    After my second replacement algorithm runs through it, it will find all matched tags and replace them with their HTML equivalents. I've already coded this. The end result for examples 3 and 4 would be:

    3a) <b><i>Alternating tags</i></b>

    3b) <b><i>Alternating tags</i></b>(/i)

    4a) <b><i>Misplaced tag</i></b>

    4b) <b>Misplaced tag</b>

    4c) <b>(i)Misplaced tag</b>

  •  09-13-2008, 5:08 AM 46300 in reply to 46298

    Re: Handling BBcode mismatched tags.

    You could also use the pattern without the last subpattern to match the balanced subgroup only.

    So it can be:

    (?xis)
    (
         \((b|u|i|s)\)
        (?:(?:(?!\(/?(?2)\)).)++|(?1))+
         \(/\2\)
    )

    It is not true that PCRE only accepts / as the delimiter. In PHP you can use any non-alphanumeric, non-backslash, non-whitespace character, for example # or ~.
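To illustrate the delimiter point with the balanced-group pattern above, here is a minimal sketch. It assumes the square-bracket form of the tags rather than the parentheses used in this thread, and the names BALANCED_TAGS and first_balanced_group are made up for the example. Because ~ is the delimiter, the / inside the pattern needs no escaping.

```php
<?php
// Balanced-group pattern from the post above, with [ ] instead of ( ).
// The x flag allows whitespace and # comments inside the pattern.
const BALANCED_TAGS = '~
    (
        \[(b|u|i|s)\]                             # opening tag, name captured in group 2
        (?: (?: (?! \[/?(?2)\] ) . )++ | (?1) )+  # text free of these tags, or a nested balanced group
        \[/\2\]                                   # closing tag with the same name
    )
~xis';

// Hypothetical helper: returns the first properly nested group, or null.
function first_balanced_group(string $text): ?string
{
    return preg_match(BALANCED_TAGS, $text, $m) === 1 ? $m[0] : null;
}

var_dump(first_balanced_group('[b][i]Both[/i][/b]'));
var_dump(first_balanced_group('[b][i]Alternating tags[/b][/i]'));
```

The pattern only finds groups that are already balanced; it does not repair the ones that are not.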


  •  09-13-2008, 11:29 AM 46304 in reply to 46294

    Re: Handling BBcode mismatched tags.

    Since the forum keeps interpreting parts of the source code as mark-up tags, I've uploaded it here: http://iruimte.nl/regex/bb-tags.txt

    Is that what you meant to do?

  •  09-13-2008, 3:20 PM 46307 in reply to 46294

    Re: Handling BBcode mismatched tags.

    If one option is to just convert the proper tag pairs and delete the rest of the tags:

    http://pastebin.com/f3a2d7c5


  •  09-13-2008, 5:02 PM 46310 in reply to 46307

    Re: Handling BBcode mismatched tags.

    The error you get comes from using / as the delimiter while / also appears unescaped in the pattern. Either escape it as \/ or pick a different delimiter, such as #.

    Have a look at this as well; it may help:

    http://pastebin.com/m3f22b955


  •  09-16-2008, 10:38 AM 46381 in reply to 46304

    Re: Handling BBcode mismatched tags.

    prometheuzz, that's very close to what I want, except it appears to be greedy and I'm having trouble extending it into multi-letter tags like (url) (/url)

    When I attempted to do the latter by using grouping (b|u|i|s|url), I found that the (/url) part got converted into (/l)(/r)(/u), heh.

     

    killahbeez and ddrudik, I'm not doing a direct conversion to the HTML equivalents. The site is W3C compliant, so (b) gets replaced with <strong>, (i) with <em>, (u) with <ins>, and (s) with <del>. This is the purpose of the secondary phase that I mentioned before.

    That said, I told the secondary phase to replace <b> with <strong> and so on, but killahbeez's code does not seem to pass the 3rd test, and ddrudik's code seems to result in the entire page content being flushed.

  •  09-16-2008, 1:12 PM 46390 in reply to 46381

    Re: Handling BBcode mismatched tags.

    IsmAvatar:

    prometheuzz, that's very close to what I want, except it appears to be greedy and I'm having trouble extending it into multi-letter tags like (url) (/url)

    ...

    Yes, you need to adjust the get_closing_tags function. The way I wrote it is a bit of a hack: it simply reverses the string of tag letters and then wraps '(/' and ')' around each one, so it only works for single-letter tags.

  •  09-17-2008, 11:28 AM 46425 in reply to 46390

    Re: Handling BBcode mismatched tags.

    Ah, took me a minute to figure out what that was doing, but I see now.

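    function get_closing_tags($opening_tags) {
        // "[b][url]" explodes to ["", "b]", "url]"]
        $arr = explode("[", $opening_tags);
        $arr = array_reverse($arr);
        // Joining gives "url][/b][/"; drop the trailing "[/" and prepend one.
        return "[/" . substr(implode("[/", $arr), 0, -2);
    }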

    This seems to be working well.

     

    Which brings up another problem. I don't believe it's related to your code, but I want to implement a (quote) tag that users can nest. For example:

    Depth 0 (quote) Depth 1 (quote) Depth 2 (/quote) Depth 1 (/quote) Depth 0

    But my usual approach won't work here:

    $str = preg_replace("#\[quote\](.*?)\[/quote\]#is","<div class='quote'>$1</div>",$str);

    Results in:

    Depth 0 <div class="quote"> Depth 1 (quote) Depth 2 </div> Depth 1 (/quote) Depth 0

     

    How would I write my regex to resolve from the inside out? My idea was to simply check that it didn't contain (quote) before attempting to resolve.

    #\[quote\]((?:^\[quote\])*?)\[/quote\]#is

    But then my preg_replace doesn't replace anything.

  •  09-18-2008, 2:38 AM 46451 in reply to 46425

    Re: Handling BBcode mismatched tags.

    IsmAvatar:

    ...

    #\[quote\]((?:^\[quote\])*?)\[/quote\]#is

    But then my preg_replace doesn't replace anything.

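    I didn't test it with your example string, but if you want some sort of 'does not contain string X' functionality for a string of more than one character, you can't use a negated character class: that only works for single characters. Also, (?:^\[quote\]) is not a negation at all: outside a character class, ^ is the start-of-string anchor.

    Try something like this, which puts a negative lookahead before each character. The lookahead excludes the closing tag as well, so that the match cannot run past the first [/quote]:

    #\[quote\]((?:(?!\[/?quote\]).)*)\[/quote\]#is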

    (untested!)
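One way to resolve nested quotes from the inside out is to match only innermost pairs and repeat the replacement until nothing changes. The following is a minimal sketch, assuming the square-bracket form of the tags; render_quotes is a made-up helper name, and the lookahead here rejects both [quote] and [/quote] inside the body so that each pass can only match a pair with no quotes inside it.

```php
<?php
// Hypothetical helper: converts nested [quote] blocks to <div> elements,
// innermost first, looping until a pass makes no replacement.
function render_quotes(string $str): string
{
    // The body may not contain [quote] or [/quote], so only innermost pairs match.
    $pattern = '#\[quote\]((?:(?!\[/?quote\]).)*)\[/quote\]#is';
    $replacement = '<div class="quote">$1</div>';

    do {
        $next = preg_replace($pattern, $replacement, $str, -1, $count);
        if ($next === null) {
            break; // regex engine error: keep what has been converted so far
        }
        $str = $next;
    } while ($count > 0);

    return $str;
}

echo render_quotes('Depth 0 [quote] Depth 1 [quote] Depth 2 [/quote] Depth 1 [/quote] Depth 0'), "\n";
```

Unmatched [quote] or [/quote] tags are left in the text as they are.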

  •  09-18-2008, 10:38 AM 46462 in reply to 46451

    Re: Handling BBcode mismatched tags.

    Ah, right. Thanks for pointing that out.

    One last question:

    Depth 0 (u) Depth 1 (b) Depth 2 (/b) Depth 1 (/u) Depth 0

    This should remain unchanged when I run it through your nesting fix. Unfortunately, this is the result:

    Depth 0 (u) Depth 1 (b) Depth 2 (/u) Depth 1 (/u) Depth 0

    Notice the first (/b) gets replaced with a (/u)

  •  09-18-2008, 3:02 PM 46477 in reply to 46462

    Re: Handling BBcode mismatched tags.

    My solution is not suited for nested tags.
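For nested tags, a small tokenizer with a stack (the option raised in the original question) is one possible approach. The following is only a sketch under assumptions: it uses the square-bracket form of the tags, fix_nesting is a made-up name, and it picks one of the acceptable outputs listed earlier (3a, 4a, 6b) by closing any tags still open inside a tag when that tag is closed, and dropping closing tags that have no opener.

```php
<?php
// Hypothetical helper: rewrites BBCode so the listed tags are properly nested.
function fix_nesting(string $text, array $tags = ['b', 'i', 'u', 's']): string
{
    $names = implode('|', array_map('preg_quote', $tags));
    // Split into text and tag tokens; the capture group keeps the tags in the result.
    $tokens = preg_split('#(\[/?(?:' . $names . ')\])#i', $text, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $stack = []; // names of the currently open tags, innermost last
    $out = '';
    foreach ($tokens as $token) {
        if (!preg_match('#^\[(/?)(' . $names . ')\]$#i', $token, $m)) {
            $out .= $token; // plain text passes through unchanged
            continue;
        }
        $name = strtolower($m[2]);
        if ($m[1] === '') {
            // Opening tag: remember it and emit it (lower-cased).
            $stack[] = $name;
            $out .= '[' . $name . ']';
        } elseif (in_array($name, $stack, true)) {
            // Closing tag: close everything opened after $name, then $name itself.
            do {
                $top = array_pop($stack);
                $out .= '[/' . $top . ']';
            } while ($top !== $name);
        }
        // A closing tag with no matching opener is dropped.
    }
    // Close whatever is still open at the end of the text.
    while ($stack) {
        $out .= '[/' . array_pop($stack) . ']';
    }
    return $out;
}

echo fix_nesting('[b][i]Alternating tags[/b][/i]'), "\n";
echo fix_nesting('Depth 0 [u] Depth 1 [b] Depth 2 [/b] Depth 1 [/u] Depth 0'), "\n";
```

Multi-letter tags can be tried by passing a different list, for example fix_nesting($text, ['b', 'i', 'u', 's', 'url']).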