
Retrieving html tags and according content out of an html-code block

Last post 07-27-2008, 11:23 PM by ddrudik. 4 replies.
  •  07-24-2008, 7:44 AM 44522

    Retrieving html tags and according content out of an html-code block

    Hi, I am working in JavaScript.

    I have an HTML string in a variable and would like to retrieve all the tag names in one array and all the non-HTML contents of those tags in a second array:

     e.g.

     html_string: <h1>This is a headline</h1>

    <p>But I have more <b> to offer </b>. There is still a lot of content to see</p><font size="+1">Do you see it?</font>

     
    Here I need two regular expressions. The first should result in an array with

    0 => h1 

    1 => p

    2 => b

    3 => font

     
    and the second with

    0 => This is a headline

    1 => But I have more to offer. There is still a lot of content to see

    2 => to offer

    3 => Do you see it?

    I just need to know what text (words) are in what html tags. Any other solution that provides me that information is fine too.

     

    Thanks a lot in advance.

    schingeldi

     

  •  07-24-2008, 7:04 PM 44543 in reply to 44522

    Re: Retrieving html tags and according content out of an html-code block

    How do you want to handle nested and overlapping tags?

    Must this be done in a single pass?

    Susan 

  •  07-25-2008, 2:58 AM 44554 in reply to 44543

    Re: Retrieving html tags and according content out of an html-code block

    Thanks for your answer. Nested and overlapping tags can also be handled in loops. The only restriction is that I am using JavaScript. ;-)
  •  07-27-2008, 7:23 PM 44623 in reply to 44554

    Re: Retrieving html tags and according content out of an html-code block

    OK, I have 2 solutions:

    1) Use the pattern:

    <(\w+)[^>]*>(.*?)</\1>

    with the 'singleline' option set. This gives you the tag name in match group #1 and the text between the opening and corresponding closing tag in match group #2. Each match will need to be re-processed to get any nested tags. In your example, the first time through I get:

    <h1>This is a headline</h1>
    <p>But I have more <b> to offer </b>. There is still a lot of content to see</p>
    <font size="+1">Do you see it?</font>

    as the complete matches (NB: match groups are NOT shown in the above). As you can see in the 2nd match, the nested <b> is included in the <p> tag text. However, you said that you can use programming loops to overcome this, so I'll leave that to you.
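
    A minimal sketch of that re-processing in JavaScript follows. The helper name collectTags is hypothetical, and the sketch makes two assumptions of its own: it uses [\s\S]*? in place of .*? so that no 'singleline' option is needed, and it strips nested markup from each text with a second, simple tag pattern.

```javascript
// Sketch of solution 1 with re-processing of nested tags.
// collectTags is a hypothetical helper name, not part of any library.
function collectTags(html, names, texts) {
  names = names || [];
  texts = texts || [];
  // [\s\S]*? matches across newlines, standing in for '.' with 'singleline'.
  var re = /<(\w+)[^>]*>([\s\S]*?)<\/\1>/g;
  var m;
  while ((m = re.exec(html)) !== null) {
    names.push(m[1]);
    // Keep only the text: drop any nested tags from the captured content.
    texts.push(m[2].replace(/<[^>]*>/g, ""));
    // Re-process the captured content to pick up the nested tags themselves.
    collectTags(m[2], names, texts);
  }
  return { names: names, texts: texts };
}

var html = '<h1>This is a headline</h1>\n' +
  '<p>But I have more <b> to offer </b>. There is still a lot of content to see</p>' +
  '<font size="+1">Do you see it?</font>';
var result = collectTags(html);
// result.names is ["h1", "p", "b", "font"]
```

    Because each match is re-processed right after it is recorded, the tag names come out in document order. The sketch does not handle overlapping tags or an element nested inside another element of the same name.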

    2) This is not a regex problem, and if I were to do this (as I have done in the past) I would use the HTML DOM, using the innerText property at the various levels through the structure. This has the advantage that you can do what you are asking for very simply (literally a few lines of code) and you won't get the nested tags within the text.
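
    A rough sketch of the DOM approach follows. The helper names collectFromDom and textOf are hypothetical. As an assumption of this sketch, the text is gathered by walking text nodes directly instead of reading innerText, so it does not depend on that property being available.

```javascript
// Sketch of solution 2: walk the DOM instead of matching text.
// collectFromDom and textOf are hypothetical helper names.
var ELEMENT_NODE = 1;
var TEXT_NODE = 3;

// Concatenate every text node below a node, skipping all markup.
function textOf(node) {
  if (node.nodeType === TEXT_NODE) {
    return node.nodeValue;
  }
  var text = "";
  for (var i = 0; i < node.childNodes.length; i++) {
    text += textOf(node.childNodes[i]);
  }
  return text;
}

// Visit every element below root in document order and record
// its tag name and the text it contains.
function collectFromDom(root, names, texts) {
  names = names || [];
  texts = texts || [];
  for (var i = 0; i < root.childNodes.length; i++) {
    var child = root.childNodes[i];
    if (child.nodeType === ELEMENT_NODE) {
      names.push(child.tagName.toLowerCase());
      texts.push(textOf(child));
      collectFromDom(child, names, texts);
    }
  }
  return { names: names, texts: texts };
}

// Assumed usage in a browser, with the HTML string in html_string:
//   var holder = document.createElement("div");
//   holder.innerHTML = html_string;
//   var result = collectFromDom(holder);
```

    Letting the browser parse the string means nested and badly formed markup is handled by the HTML parser rather than by a pattern.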

    Susan 

  •  07-27-2008, 11:23 PM 44627 in reply to 44522

    Re: Retrieving html tags and according content out of an html-code block

    schingeldi, if you are interested in seeing what a match loop would look like, the following is what I put together. Note that there is some issue in which the outer nested tag does not match properly on the second pass; if you happen to work that out, I would be curious to know where my code is at fault. In any case, I believe an approach like this can succeed, since it matches only the innermost tags and then replaces each matched element with its text content, so the do-while loop continues until no more tags are found.

    <html>
    <head>
    <script type="text/javascript">
    function regextest(){
      // Matches only innermost elements: the content may not contain < or >.
      var re = /<\s*(\w+)[^>]*>([^<>]+)<\s*\/\s*\1\s*>/g;
      var sourcestring = document.getElementById("targetspan").innerHTML;
      var tagnames = [];
      var nonhtmlcontent = [];
      var tag = "";      // printable list of tag names
      var nonhtml = "";  // printable list of text contents
      var zz = 0;        // running match index across all passes
      var yy = 1;        // pass counter
      do {
        alert("Pass: "+yy+"\r\nSourcestring:\r\n"+sourcestring);
        // Collect every innermost element found in this pass.
        for (var matches = re.exec(sourcestring); matches != null; matches = re.exec(sourcestring)) {
          alert("Match: "+zz);
          tagnames[zz] = matches[1];
          nonhtmlcontent[zz] = matches[2];
          tag = tag+"tagnames["+zz+"] = "+tagnames[zz]+"\r\n";
          nonhtml = nonhtml+"nonhtmlcontent["+zz+"] = "+nonhtmlcontent[zz]+"\r\n";
          zz++;
        }
        // Replace each matched element with its text so its parent becomes innermost.
        sourcestring = sourcestring.replace(re, "$2");
        yy++;
      } while (sourcestring.match(re));
      alert("Array results:\r\n"+tag+nonhtml);
    }
    </script>
    </head>
    <body onload="regextest();">
    <span id="targetspan">
     <h1>This is a headline</h1>

    <p>But I have more <b> to offer </b>. There is still a lot of content to see</p><font size="+1">Do you see it?</font>
    </span>
    </body>
    </html>
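
    The same innermost-first loop can also be sketched as a plain function on a string, which makes it easier to try outside a page. The function name collectInnermostFirst is hypothetical. The sketch works on the string exactly as given and does not read innerHTML, so it says nothing about what a particular browser returns from that property.

```javascript
// Sketch of the innermost-first loop as a plain function.
// collectInnermostFirst is a hypothetical name.
function collectInnermostFirst(html) {
  // Matches only elements whose content holds no further tags.
  var re = /<\s*(\w+)[^>]*>([^<>]+)<\s*\/\s*\1\s*>/g;
  var names = [];
  var texts = [];
  var found = true;
  while (found) {
    found = false;
    var m;
    // exec() returns null after the last match and resets lastIndex to 0.
    while ((m = re.exec(html)) !== null) {
      names.push(m[1]);
      texts.push(m[2]);
      found = true;
    }
    // Unwrap the matched elements so their parents become innermost.
    html = html.replace(re, "$2");
  }
  return { names: names, texts: texts };
}

var sample = '<h1>This is a headline</h1>\n' +
  '<p>But I have more <b> to offer </b>. There is still a lot of content to see</p>' +
  '<font size="+1">Do you see it?</font>';
var collected = collectInnermostFirst(sample);
// collected.names is ["h1", "b", "font", "p"]
```

    Note the order: outer tags are recorded only after their children have been unwrapped, so <p> comes last here, not second as in the original question.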

