
Extract content from HTML tags

Last post 09-04-2008, 11:30 AM by prometheuzz. 6 replies.
  •  09-03-2008, 11:01 AM 45893

    Extract content from HTML tags

    Hello everybody,

    I'm working in Java and I want to extract some content from the following text:

     <tr> 
    <td width="471" valign="middle"> <div align="justify"><span class="c10_justified"><strong><u>EXTRACTME</a></u></strong><br>
    Address
    , telphone:xxxxxx

    </span></div></td>
                              </tr>

    I need the text EXTRACTME and the "Address, telphone:xxxxxx" part.

    Keep in mind that the HTML also contains line breaks.

    Any suggestions?

    Regards,

     

  •  09-03-2008, 3:28 PM 45903 in reply to 45893

    Re: Extract content from HTML tags

    Try this:

     

    import java.util.regex.*;

    public class Test {   
        public static void main(String[] args) {
            String text =
                "some text...\n"+
                "<tr>\n"+
                "  <td width=\"471\" valign=\"middle\"> <div align=\"justify\">"+
                "  <span class=\"c10_justified\"><strong><u>EXTRACTME</a></u></strong><br>\n"+
                "    Address\n"+
                "       , telphone:xxxxxx\n"+
                "  </span></div></td>\n"+
                "\n"+
                "</tr> \n"+
                "some more text...";
            String regex = "(?is)<tr>.*<span.*?>(?:\\s*<.*?>)+(.*?)(?:\\s*<.*?>)+(.*?),(.*?):(.*?)</span>.*?</td>";
            Matcher m = Pattern.compile(regex).matcher(text);
            while(m.find()) {
                System.out.println("group(1) = "+m.group(1).trim());
                System.out.println("group(2) = "+m.group(2).trim());
                System.out.println("group(3) = "+m.group(3).trim());
                System.out.println("group(4) = "+m.group(4).trim());
            }
        }
    }

  •  09-03-2008, 3:28 PM 45904 in reply to 45893

    Re: Extract content from HTML tags

    Did you leave text out of the real sample?  It doesn't look formatted as one would expect.  Provide real text (change numbers to 1's and name letters to a's if you need to) to get a working pattern.


  •  09-04-2008, 9:41 AM 45948 in reply to 45904

    Re: Extract content from HTML tags

    The code snippet provided works in the context of this small string, but it cannot work as part of a whole web page.

    This is the target URL that I want to parse --> http://www.menuthessaloniki.gr/list_det.asp?area=1&offset=30

    Thanks 

  •  09-04-2008, 9:50 AM 45949 in reply to 45948

    Re: Extract content from HTML tags

    misge:

    The code snippet provided works in the context of this small string, but it cannot work as part of a whole web page.

    This is the target URL that I want to parse --> http://www.menuthessaloniki.gr/list_det.asp?area=1&offset=30

    Thanks 

    Could you be more specific? What exactly does "it cannot work" mean? I say it can work.

    Perhaps you could post your code here, then I (or someone else) can have a look at it and maybe spot an error.
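
    In the meantime, one possible cause worth ruling out (an assumption, since the failing code is not shown yet): the pattern starts with a greedy `<tr>.*<span`, so on a page with many rows the `.*` runs to the last `<span` it can find and every earlier row is skipped. A minimal sketch of that effect on a hypothetical two-row input, where the class name, the input string and the two shortened patterns are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazyDemo {
    // Hypothetical two-row input, far simpler than the real page.
    static final String TEXT =
        "<tr><span>A</span></tr>\n" +
        "<tr><span>B</span></tr>";

    // Greedy: ".*" first runs to the end of the input, then backtracks
    // only as far as the LAST "<span" that lets the rest match.
    static final String GREEDY = "(?is)<tr>.*<span.*?>(.*?)</span>";

    // Lazy: ".*?" stops at the FIRST "<span" after each "<tr>".
    static final String LAZY = "(?is)<tr>.*?<span.*?>(.*?)</span>";

    // Collects group(1) of every match of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println("greedy: " + findAll(GREEDY, TEXT));
        System.out.println("lazy:   " + findAll(LAZY, TEXT));
    }
}
```

    On this input the greedy version finds only "B", while the lazy version finds "A" and then "B". Whether this is what goes wrong on the real page depends on its markup.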

  •  09-04-2008, 10:33 AM 45953 in reply to 45949

    Re: Extract content from HTML tags

    OK, here is the code:

    <code>
    public static void main(String[] args) throws IOException,
            BadLocationException {
        URL url = new URL(
                "http://www.menuthessaloniki.gr/list_det.asp?area=1&offset=30");
        BufferedReader in = new BufferedReader(new InputStreamReader(url
                .openStream()));
        String str;
        StringBuffer buffer = new StringBuffer();

        while ((str = in.readLine()) != null) {
            buffer.append(str);
        }
        in.close();
        System.out.println(buffer.toString());
        str = buffer.toString().replaceAll("\n", " ");
        Pattern pattern = null;
        pattern = Pattern.compile("(?is)<tr>.*<span.*?>(?:\\s*<.*?>)+(.*?)(?:\\s*<.*?>)+(.*?),(.*?):(.*?)</span>.*?</td>");
        Matcher m = pattern.matcher(str);
        if(m.find()) {
            System.out.println("group(1) = "+m.group(1).trim());
            System.out.println("group(2) = "+m.group(2).trim());
            System.out.println("group(3) = "+m.group(3).trim());
            System.out.println("group(4) = "+m.group(4).trim());
        }
    }
    </code> 

     

  •  09-04-2008, 11:30 AM 45955 in reply to 45953

    Re: Extract content from HTML tags

    import java.io.*;
    import java.net.URL;
    import java.util.*;
    import java.util.regex.*;

    public class Test {
        public static void main(String[] args) throws IOException {
             URL url = new URL("http://www.menuthessaloniki.gr/list_det.asp?area=1&offset=30");
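             Scanner data = new Scanner(url.openStream());
             StringBuffer buffer = new StringBuffer();
             // Join the lines with a space so that words on adjacent lines do not run together.
             while(data.hasNextLine()) {
                buffer.append(data.nextLine()).append(' ');
             }
             String str = buffer.toString();
             // group(1): the text right after <strong><u>; group(2): the text up to the comma;
             // group(3): the digits and whitespace after the colon.
             Pattern pattern = Pattern.compile("(?is)<strong><u>([^<]++)(?:(?!</u>).)*+</u>"+
                                     "</strong>(?:<br>|\\s++)++([^,<]++),[^:]++:([\\s\\d]++)");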
             Matcher m = pattern.matcher(str);
             while(m.find()) {
                 System.out.println("group(1) = "+m.group(1).trim());
                 System.out.println("group(2) = "+m.group(2).trim());
                 System.out.println("group(3) = "+m.group(3).trim());
                 System.out.println();
             }
         }
    }
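
    If the page is not reachable, the same pattern can be tried offline. The sketch below runs it against two hard-coded rows modelled on the sample from the first post. The class name, the `extract` helper, the second row and the phone digits are all made up for illustration; the digits are needed because group 3 only accepts whitespace and digits, so the literal "xxxxxx" from the original sample would not match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OfflinePatternCheck {
    // Same pattern as in the post above, unchanged.
    static final Pattern ROW = Pattern.compile(
        "(?is)<strong><u>([^<]++)(?:(?!</u>).)*+</u>" +
        "</strong>(?:<br>|\\s++)++([^,<]++),[^:]++:([\\s\\d]++)");

    // Hypothetical helper: returns {name, address, phone} for every match.
    static List<String[]> extract(String html) {
        List<String[]> rows = new ArrayList<String[]>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            rows.add(new String[] {
                m.group(1).trim(), m.group(2).trim(), m.group(3).trim()
            });
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up input: lines already joined with spaces, as the Scanner loop does.
        String html =
            "<tr> <td><span class=\"c10_justified\"><strong><u>EXTRACTME</a></u></strong><br> " +
            "Address , telephone: 2310111111 </span></td> </tr> " +
            "<tr> <td><span class=\"c10_justified\"><strong><u>Second Place</u></strong><br> " +
            "Other Street 5 , telephone: 2310222222 </span></td> </tr>";
        for (String[] row : extract(html)) {
            System.out.println(row[0] + " | " + row[1] + " | " + row[2]);
        }
    }
}
```

    This only shows that the pattern copes with more than one row of this shape; the real page may contain variations (extra tags, non-digit phone text) that need the pattern adjusted.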