I found a regex to filter HTML (see the other post on extracting readable text from HTML). I would like to improve it so that it also filters out client-side code.
The regex to extract words from HTML is:
<\!\w+(?:\s+[^>]*?)+\s*>|<\w+(?:\s+\w+([^>]*)(?:\s*=\s*(?:\"[^\"]*\"|'[^']*'|[^\"'>\s]+))?)*\s*\/?>|<\/\w+\s*>|<\!--[^-]*-->
To test it, I pasted the source code of a random web page (for example http://www.tijd.be/) into a textbox and ran regex.replace(htmltext,"") with the above expression to get rid of all the HTML tags and other rubbish. However, it still leaves some client-side script code (see below) in the plain text, and I would like to get rid of that as well.
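To make the problem reproducible, here is a minimal sketch of the same test in Python rather than the regex.replace call mentioned above (the choice of Python and the names TAG_PATTERN and sample are assumptions for illustration only; the sample markup is invented, not taken from the real page). It shows that the tags are removed while the body of the script element is left behind:

```python
import re

# The tag-stripping expression from the post, unchanged.
TAG_PATTERN = re.compile(
    r"""<\!\w+(?:\s+[^>]*?)+\s*>|<\w+(?:\s+\w+([^>]*)(?:\s*=\s*(?:\"[^\"]*\"|'[^']*'|[^\"'>\s]+))?)*\s*\/?>|<\/\w+\s*>|<\!--[^-]*-->"""
)

# Invented sample input: one paragraph followed by an inline script.
sample = (
    "<p>Hello</p>"
    "<script type=\"text/javascript\">var pagekey = 'homepage';</script>"
)

# Equivalent of regex.replace(htmltext, ""): delete every match.
stripped = TAG_PATTERN.sub("", sample)
print(stripped)
```

The opening and closing script tags are matched and removed like any other tag, but the regex never looks at what sits between them, so the JavaScript ends up in the plain text.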
Can anyone extend the regex, or provide an additional one, to get rid of the JavaScript (or other scripting language) code that's still in there?
Some examples of code that remains in the text:
@import "http://static.tijd.be/adm/css/homepage.css";
var static = 'http://static.tijd.be'; var pagekey = 'homepage';
if (navigator.userAgent.indexOf('Mac') >= 0 && s.getTimezoneOffset() >= 720)
s.setTime (s.getTime() - 1440*60*1000);
add('homepage','homepage','skyscraper','2','','120x600,160x600,120x300');
&
if (document.cookie != null && document.cookie.length > 0) {
var cookie = document.cookie.match(new RegExp('user=[\"]?([^;\"]*)[\"]?'));
if (cookie!=null) $('personal').innerHTML='Welkom '+unescape(cookie[1])+' (Afmelden)';
}
'
€
h3.kopartikel{
text-align: left;
margin:5px 0 5px 40px;
}
-->
<!---->
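To make the request concrete, the kind of extra pass being asked for could look roughly like the following Python sketch. It is untested against the real page, and the names SCRIPT_OR_STYLE, COMMENT and strip_client_side_code are hypothetical. It assumes the leftovers come from script elements, style elements and HTML comments, and that it would run before the tag-stripping regex:

```python
import re

# Hypothetical patterns, not from the original post.
# Match a whole <script> or <style> element, including its content.
SCRIPT_OR_STYLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)
# Match HTML comments, including ones that contain hyphens.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_client_side_code(html: str) -> str:
    """Remove script/style elements and comments; other tags are left alone."""
    html = SCRIPT_OR_STYLE.sub("", html)
    html = COMMENT.sub("", html)
    return html


# Invented sample markup for illustration.
sample_page = (
    '<style type="text/css">h3.kopartikel{ text-align: left; }</style>'
    "<p>Nieuws</p>"
    "<script type=\"text/javascript\">var pagekey = 'homepage';</script>"
    "<!-- ad-slot -->"
)
cleaned = strip_client_side_code(sample_page)
print(cleaned)
```

A pass like this would not cope with a script whose string literals contain a closing script tag, so it is only a starting point, not a complete answer.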