match everything except CSS, javascript, urls, divs, spans using PCRE
Last post 02-15-2010, 1:42 PM by james438. 11 replies.
02-09-2010, 5:29 PM
james438
Joined on 10-10-2008
Posts 10
match everything except CSS, javascript, urls, divs, spans using PCRE
I am trying to create a PCRE pattern that will replace all of the hyphens in a document unless they are in JavaScript or CSS, or in a URL, div, or span tag. I have tried many different things and have discovered so many "anomalies" in PCRE that I don't really know where to begin, but many of my questions would probably be better put into separate threads. Here is the test script I am using:

    <?php
    $text="-te-st-te-st-";
    $text=preg_replace('/(.*?)(st.*?te)(.*?)/es',"str_replace('-','–','$1$3')",$text);
    echo "$text";
    ?>

In the above I am attempting to change all of the hyphens to short HTML dashes unless they are located between "st" and "te". The string might also be in the form of "-te-yy-st-", where all of the hyphens would be replaced, or "-pstp --- kktep--p-", where the three hyphens in a row would not be replaced because they are between "st" and "te". Here is some actual code/text I would like to test against:

    <style type="text/css">
    .q {
        font-family:courier;
        border-style: solid;
        border-width: 3px;
        border-color: #525252;
        padding-left:25px;
        padding-right:25px;
        PADDING-TOP:20px;
        PADDING-BOTTOM:20px;
        color:#ffffff;
        background:#434343;
        margin: 12px 80px 12px 40px;
    }
    </style>
    <form name="count">
    <input type="text" size="69" name="count2">
    </form>
    XXXX-XXXX
    <script>
    /*
    Count down until any date script- By JavaScript Kit (www.javascriptkit.com)
    Over 200+ free scripts here!
    */

    //change the text below to reflect your own,
    var before="Christmas!"
    var current="Today is Christmas. Merry Christmas!"
    var montharray=new Array("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")

    function countdown(yr,m,d){
        theyear=yr;themonth=m;theday=d
        var today=new Date()
        var todayy=today.getYear()
        if (todayy < 1000)
            todayy+=1900
        var todaym=today.getMonth()
        var todayd=today.getDate()
        var todayh=today.getHours()
        var todaymin=today.getMinutes()
        var todaysec=today.getSeconds()
        var todaystring=montharray[todaym]+" "+todayd+", "+todayy+" "+todayh+":"+todaymin+":"+todaysec
        futurestring=montharray[m-1]+" "+d+", "+yr
        dd=Date.parse(futurestring)-Date.parse(todaystring)
        dday=Math.floor(dd/(60*60*1000*24)*1)
        dhour=Math.floor((dd%(60*60*1000*24))/(60*60*1000)*1)
        dmin=Math.floor(((dd%(60*60*1000*24))%(60*60*1000))/(60*1000)*1)
        dsec=Math.floor((((dd%(60*60*1000*24))%(60*60*1000))%(60*1000))/1000*1)
        if(dday==0&&dhour==0&&dmin==0&&dsec==1){
            document.forms.count.count2.value=current
            return
        }
        else
            document.forms.count.count2.value="Only "+dday+ " days, "+dhour+" hours, "+dmin+" minutes, and "+dsec+" seconds left until "+before
        setTimeout("countdown(theyear,themonth,theday)",1000)
    }
    //enter the count down date using the format year/month/day
    countdown(2002,12,25)
    </script>
    <p align="center"><font face="arial" size="-2">This free script provided by</font><br>
    <font face="arial, helvetica" size="-2"><a href="http://javascriptkit.com">JavaScript Kit</a></font></p>
    <div class='q'><a href="http://www.marvel.com/universe/X-23">X-23</a> is a new character in the <span style='background-color:yellow;'>Marvel universe</span></div>
I would like to format the code above for readability, but I could not find any options for using BB code.
02-09-2010, 6:08 PM
Re: match everything except CSS, javascript, urls, divs, spans using PCRE
I'm not sure if I can help you with your specific problem, but your comment about there being "...so many "anomalies" in PCRE...", together with the combination of your pattern and description, makes me think that there are some fundamental aspects of regex patterns that you may be misunderstanding.

One of the first rules is to clearly state what you are trying to find using simple, "mechanical" rules that can be followed one character at a time. For example, you say you want to find a hyphen unless it is between the opening marker of "st" and the ending marker of "te". Imagine you are the regex engine and you are working your way through the text one character at a time: how would you recognise whether a hyphen is between the markers or not? Remember that, to make this determination, you also need to work one character at a time. Once you have figured this out you can start to write your pattern.

Secondly, you only need to match as much of the text as you need to achieve your ultimate goal. Looking at your pattern:

    (.*?)(st.*?te)(.*?)

what this does is start at the beginning of the string and match everything until you get to the first instance of the "st" character sequence - perhaps! In fact there is no requirement here for the regex engine to match any characters at all, so you could end up with anything (and what you do get may depend on the particular regex variant you are using). If there is a hyphen here that you would want to locate so you can replace it, then you would skip right over it and not even know it's there.

The next part matches the "st" characters and then matches all characters up to the next instance of the "te" character sequence. Going back to your requirement to look for a hyphen, there is no mention of a hyphen here at all, so I'm not sure what this part is really trying to do in terms of your requirements.

The last part again grabs anything (or nothing or everything) after the "te" sequence, possibly to the end of the string. The same problem occurs here as in the similar first part of your pattern: if there is a hyphen after the "te" then it will just be absorbed and not located by the pattern. The problem with this is that you are replacing a (possibly variable and unknown) block of text in the source string and doing nothing with any hyphens that may be buried in there.
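To make the three groups concrete, here is a minimal sketch that only inspects what the original pattern captures on the test string from the first post. It uses `preg_match` without the `e` modifier, and the variable name `$groups` is just for illustration.

```php
<?php
// Inspect what each group of the original pattern captures.
$text = '-te-st-te-st-';
preg_match('/(.*?)(st.*?te)(.*?)/s', $text, $groups);

// $groups[1]: everything before the first "st" (its hyphens are swallowed whole)
// $groups[2]: from "st" through the next "te"
// $groups[3]: the trailing lazy (.*?), which nothing forces to consume any text
print_r($groups);
```

On this input the match stops after the first "te" that follows "st", and the third group comes back empty, so the rest of the string is never examined by that match.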
So what might work: The first thing is to find a hyphen, so we start with the pattern:

    -

That will locate ALL hyphens, including those within the markers, so we need to somehow determine if this is the case. Because you are using PCRE, you have the capability of using lookaheads and lookbehinds. However, PCRE limits a lookbehind pattern to being fixed length, and your specification allows there to be any number of characters between the "st" and the subsequent hyphen. Therefore we have to use lookaheads, which can be of variable length.

So once we have found our hyphen we need to start scanning ahead to see if we can find either a "st" or "te" marker. If we first get to a "te" marker we can say that we must have been within the markers (assuming they are uniquely used within the text and that they cannot be nested). However, if we first get to a "st" marker, then we must have been outside the markers. There is also another possibility, and that is that we get to the end of the string. To do this we make the pattern:

    -(?!((?!st|te).)*te)

Once we have located a hyphen we go into a negative lookahead that is made up of 2 main parts. The first part - '((?!st|te).)*' - allows us to step forward one character at a time as long as the next 2 characters are neither "st" nor "te". If they are, then we have finished this part of the lookahead; if not, then we keep on going, possibly reaching the end of the string. Whatever causes this part of the lookahead to stop, we next check to see if the following characters are "te". If they are, then we must be between the markers, so the "negative" lookahead will fail and reject this particular hyphen; the regex engine will then step over it and carry on looking for the next one. The other cases are either the end of the string or the characters "st", and in either case the hyphen could not have been between the markers, so the overall match succeeds.

If you place this pattern within the "preg_replace" function, you can now tell it to replace the hyphen with whatever replacement text you want: something like (untested):

    $newText = preg_replace("/-(?!((?!st|te).)*te)/i", "–", $text);

Remember that a lookahead only does that - it remembers where it got up to in the source string and only checks out (but doesn't capture) whatever might follow. Therefore the regex replace function ONLY sees the hyphen, as this is the only character in the pattern that is actually matched. Also, the regex replace function removes ALL of the characters matched by the pattern and inserts whatever is in the replacement string. In this case we do not need to capture anything else within the pattern, so the replacement string is only the literal text.

I hope this makes sense.
Susan
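As a quick check of the lookahead pattern above, here is a small sketch that runs it over two sample strings. The replacement text `&ndash;` is an assumption (the thread later mentions that entity), and the first sample string is made up for illustration; the second comes from the original question.

```php
<?php
// Replace a hyphen unless the first marker found ahead of it is "te",
// in which case the hyphen is presumed to sit between "st" and "te".
$pattern = '/-(?!((?!st|te).)*te)/i';

$a = preg_replace($pattern, '&ndash;', 'a-b st c-d te e-f');
$b = preg_replace($pattern, '&ndash;', '-pstp --- kktep--p-');

echo $a, "\n"; // a&ndash;b st c-d te e&ndash;f
echo $b, "\n"; // &ndash;pstp --- kktep&ndash;&ndash;p&ndash;
```

Two things are worth keeping in mind. The pattern only looks forward, so a hyphen whose next marker is "te" is left alone even when no "st" comes before it (the first hyphen of "-te-st-te-st-" is an example). And without the `s` modifier the dot does not cross line breaks, so the scan ahead stops at the end of the current line.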
02-09-2010, 7:01 PM
james438
Re: match everything except CSS, javascript, urls, divs, spans using PCRE
Thank you very much for the informative post. It makes a lot of sense. One part is confusing me though. I have some understanding of lookaheads and use them on occasion, but how do they operate when they are nested like they are in the example you posted? Using English I tend to think of it as a double negative. I am going to try and study it some more and play around with it to see how it operates. Thanks again.

Update: here is the script that I am currently using. It seems to be working fine. The first part may not be quite right, but here is what I have:

    $text=preg_replace("/-(?!((?!\/script|\style).)*\/script>|\/style)/is", "–", $text);
    $text=preg_replace('/–(?=(.*?>))/',"-",$text);
02-09-2010, 10:27 PM
Re: match everything except CSS, javascript, urls, divs, spans using PCRE
I think of a lookahead (or lookbehind for that matter, but the "logic" starts to get mind-blowing on occasions!) like a recursive call to the regex engine. Normally the regex engine keeps a pointer into the text of the next character that it will use in the comparison process with the pattern. When it comes to a lookahead, it (in effect) starts a new match based on the pattern within the lookahead, with the text pointer copied from the "outer" level. In this way it carries on performing the normal pattern matching process until it gets to the end of the pattern, when it returns the "success"/"fail" result. At this point, the "outer" level still has the same text pointer reference but now has the result of the lookahead, which it uses directly (positive lookahead) or as the complement (negative lookahead). Therefore, if there are nested lookaheads, they each, individually, work the same way in returning the overall end result. Any help?
Susan
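The "pointer does not move" behaviour described above can be checked directly. The following is a small sketch with made-up sample strings; it shows that a lookahead consumes nothing, and that a negative lookahead wrapped around another negative lookahead behaves like a positive one.

```php
<?php
// A lookahead tests the text after the hyphen but consumes none of it,
// so the overall match is just the hyphen.
preg_match('/-(?=\d)/', 'X-23', $m);
$matched = $m[0];

// Negative lookahead: match only if a digit does NOT follow the hyphen.
$beforeDigit  = preg_match('/-(?!\d)/', 'X-23');    // 0, a digit follows
$beforeLetter = preg_match('/-(?!\d)/', 'X-Force'); // 1

// Nested negatives: before a digit the inner (?!\d) fails, so the outer
// negative lookahead succeeds. Overall this acts like (?=\d).
$nestedDigit  = preg_match('/-(?!(?!\d))/', 'X-23');    // 1
$nestedLetter = preg_match('/-(?!(?!\d))/', 'X-Force'); // 0
```

Each level simply reports success or failure to the level outside it, which is why the nesting reads less like a double negative once each lookahead is evaluated on its own.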
02-10-2010, 2:37 AM
james438
Re: match everything except CSS, javascript, urls, divs, spans using PCRE
Yes, that helps. Nested negative lookaheads make sense to me now and I no longer see them as double negatives. It seems that the more I stare at this problem the more sense it makes. My main goal is to replace the dashes unless they are a part of a stylesheet, javascript, or HTML tag like P, DIV, SPAN, or Anchor. My current PCRE based on the help you have given me looks like this:

    $text=preg_replace("/-(?!((?!<\/script|<\/style|<style|<script).)*?(\/script|\/style))/is", "–", $text);

One problem, and I am sure there are others, is what about hyphens located before "<style" or "<script" in the beginning of my string or between "</script" and "<style"? I updated the original post slightly with the updated version of the string I am testing against.
02-10-2010, 10:02 PM
Re: match everything except CSS, javascript, urls, divs, spans using PCRE
What the pattern does is locate the hyphen, look forward to the first starting or ending script/style tag, and then confirm that it is the "ending" version. This means that any hyphen found before a starting tag will be processed (the text ahead doesn't match the lookahead, so the overall result of the "negative" lookahead is "success").

By the way, you can make your pattern a little simpler (??) by using:

    -(?!((?!</?script|</?style).)*(/script|/style))

By using the '?' quantifier after the forward slash before the tag name, it will match both the starting and ending tags. This makes it a bit easier to read (well, I think so anyway).

Using this "character by character" style of checking and moving forward (i.e. the inner lookahead), the greediness of the '*' quantifier does not really matter. The number of characters matched is really determined by the lookahead and not by what follows it, which is where the '*' vs '*?' issue comes into play.

If you are using the forward slash within your pattern, you can simplify things a bit by choosing some other character as the pattern delimiter - I often use ~ as I rarely need it within the pattern itself. In this way you don't need to escape the forward slashes within the pattern - again this is an 'ease of reading' thing.

Susan
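Both remarks, about the delimiter and about greediness, are easy to try out. This is a minimal sketch using the earlier "st"/"te" pattern and a made-up sample string; the replacement `&ndash;` is an assumption.

```php
<?php
// Same pattern with two delimiters: with "/" every slash must be escaped,
// with "~" the pattern can be written as it reads.
$withSlash = preg_match('/<\/style>/i', '</STYLE>');
$withTilde = preg_match('~</style>~i', '</STYLE>');

// Greedy vs lazy repetition inside the lookahead. The lookahead only
// reports success or failure, so both forms select the same hyphens.
$sample = 'a-b st c-d te e-f';
$greedy = preg_replace('~-(?!((?!st|te).)*te)~i',  '&ndash;', $sample);
$lazy   = preg_replace('~-(?!((?!st|te).)*?te)~i', '&ndash;', $sample);
```

Here `$withSlash` and `$withTilde` are both 1, and `$greedy` and `$lazy` hold the same string; the two quantifiers may differ in how much work the engine does, but not in which hyphens are matched.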
02-11-2010, 5:18 AM
james438
Re: match everything except CSS, javascript, urls, divs, spans using PCRE
Very sleepy, but I wanted to post this before I fall asleep. I think that with this script we are on the wrong track. I think what is needed for this string is a conditional subpattern. The following is a typical, although short, string:

    $text="te-xt <style type=\"text/css\"> te-xt </style> te-st - comic character and is <a href=\"http://www.anime-views.com\">X-23</a> found in X-Force comics te-st.";

and next is the pattern that I am working with:

    $text=preg_replace("/-(?(?=(.*?(<)(style|script|a)))(.)|(?!(.*?(\/style|\/script|\")(>))))/is","–$4",$text);
    echo "$text";

The pattern works, but I still need to refine it a little more yet. I'll try and post more later.
02-11-2010, 6:02 PM
Re: match everything except CSS, javascript, urls, divs, spans using PCRE
OK, try this.

Text string:

    te-xt <style type=\"text/css\"> te-xt </style> te-st - comic character and is <a href=\"http://www.anime-views.com\">X-23</a> found in X-Force comics te-st.

Pattern:

    -(?!((?!</?(script|style|a)\b).)*(</(script|style|a)))

Replacement string:

    –

Options: Ignore case - on, Single line - on, all others off

Result:

    te–xt <style type=\"text/css\"> te-xt </style> te–st – comic character and is <a href=\"http://www.anime-views.com\">X-23</a> found in X–Force comics te–st.

Using Expresso (.NET based) test platform. I note that there were a couple of typos in my previous pattern - sorry. Also, I have factored out some of the repeating characters at the start (and end) of the various search strings. I've also included the "a" tag that you included - note that I've used the '\b' anchor after the "a" to stop it matching on tags such as "aardvark".
Susan
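For anyone wanting to repeat this test in PHP rather than Expresso, a sketch along these lines should do. It assumes `&ndash;` as the replacement text and writes the sample string with plain double quotes instead of the escaped ones shown above.

```php
<?php
$text = 'te-xt <style type="text/css"> te-xt </style> te-st - comic character and is '
      . '<a href="http://www.anime-views.com">X-23</a> found in X-Force comics te-st.';

// "~" is the delimiter so the slashes need no escaping.
// i = ignore case, s = single line (dot also matches line breaks).
$pattern = '~-(?!((?!</?(script|style|a)\b).)*(</(script|style|a)))~is';

$result = preg_replace($pattern, '&ndash;', $text);
echo $result, "\n";
```

The hyphens inside the style block and everything from `<a` through `</a>` stay as they are, while the hyphens in ordinary text are replaced, which agrees with the result listed above.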
02-12-2010, 2:40 AM
james438
Re: match everything except CSS, javascript, urls, divs, spans using PCRE
I am not sure what typos you were talking about in your earlier scripts.

    $text=preg_replace("/-(?!((?!<\/?(script|style|a)\b).)*(<\/(script|style|a>)))/is","–",$text);

works wonderfully. I tried many different things and was unable to get it to fail. I did escape the slashes though. Is it possible to modify it to also match the hyphens in hyperlinked text while failing the hyperlink itself? For example:

    <a href=\"http://www.anime-views.com\">X-23</a>

becomes:

    <a href=\"http://www.anime-views.com\">X–23</a>

At present the script works well enough that I have incorporated it into my site. The only places where it did not work properly were in places where I was using improper coding techniques such as inline styling, as in <div style='font-size:18px;text-align:center;'>ti-tle</div>. My plan is to update my pages to <div class='o'>title</div> instead. I have been meaning to correct that practice for a while now.
02-14-2010, 7:47 PM
Re: match everything except CSS, javascript, urls, divs, spans using PCRE
The problem you are starting to get into is that of different starting and ending criteria. It was (almost!) OK when you had multiple tags such as "script" and "style" (there was always the opportunity for a "<style...>" tag to be matched with a "</script>" tag in the pattern), but when you start mixing in "<a..." tags that end at the closing ">" and not the "</a>", then things are going to get rough. I suggest that you use the pattern you have at the moment as a first pass, as it seems to get the majority of the cases you are after. You can then construct a second pattern that will locate all hyphens that are not inside tags (or whatever the appropriate criteria are) and change those. Susan
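One possible shape for such a second pattern, offered only as a sketch and not something tested in this thread: treat a hyphen as "inside a tag" when the next angle bracket ahead of it is a closing ">". The variable name `$outsideTags` and the `&ndash;` replacement are assumptions.

```php
<?php
// Replace a hyphen unless the next angle bracket ahead of it is ">",
// which is taken to mean the hyphen sits inside a tag's angle brackets.
$outsideTags = '~-(?![^<>]*>)~';

$anchor = '<a href="http://www.anime-views.com">X-23</a>';
$styled = "<div style='font-size:18px;text-align:center;'>ti-tle</div>";

$anchorResult = preg_replace($outsideTags, '&ndash;', $anchor);
$styledResult = preg_replace($outsideTags, '&ndash;', $styled);

echo $anchorResult, "\n"; // <a href="http://www.anime-views.com">X&ndash;23</a>
echo $styledResult, "\n"; // <div style='font-size:18px;text-align:center;'>ti&ndash;tle</div>
```

On its own this pattern knows nothing about script or style bodies, so running it over a whole page would also change the hyphens in that code; it is better suited to fragments that have already been isolated, such as the contents of an anchor. A stray ">" in ordinary text would also fool it.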
02-14-2010, 8:36 PM
james438
Re: match everything except CSS, javascript, urls, divs, spans using PCRE
Thanks for your help, it has been tremendous and I learned a few lessons along the way! I am glad a forum like this exists. Good tutorials and lessons about PCRE are not very easy to find, and the PCRE lessons at php.net do not have many examples. My ultimate goal would be to match all hyphens that are not inside tags, while also avoiding those found in scripts and styles.
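That goal can be approximated by putting two negative lookaheads side by side: one that rejects hyphens inside a tag, and one that rejects hyphens followed by a closing script or style tag before any opening one. This is a sketch under those assumptions, checked only against the short sample string from this thread; the `&ndash;` replacement is also an assumption.

```php
<?php
$text = 'te-xt <style type="text/css"> te-xt </style> te-st - comic character and is '
      . '<a href="http://www.anime-views.com">X-23</a> found in X-Force comics te-st.';

// First lookahead:  the hyphen is not inside a tag (next bracket is not ">").
// Second lookahead: the next script/style tag ahead is not a closing one,
//                   so the hyphen is not inside a script or style body.
$pattern = '~-(?![^<>]*>)(?!((?!</?(script|style)\b).)*</(script|style)\b)~is';

$result = preg_replace($pattern, '&ndash;', $text);
echo $result, "\n";
```

With this sample the hyphen in the URL and the one inside the style block are kept, while the hyphen in the link text "X-23" is replaced along with the others in plain text. Real pages may contain cases this does not cover, for example a literal ">" in ordinary text.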
02-15-2010, 1:42 PM
james438
Re: match everything except CSS, javascript, urls, divs, spans using PCRE
I wanted to post what I am currently using for anyone else who comes across this thread looking for a solution.

    $text=preg_replace("/-(?!((?!<\/?(script|style|a|object|iframe)\b).)*(<\/(script|style|a>|object|iframe)))/is","–",$text);
    $text=preg_replace('/(<a.+>)(.+)(<\/a>)/Ue',"'$1'.str_replace('-','–','$2').'$3'",$text);
    $text=preg_replace('/(style=(\'|\"))(.+)(\'|\")/Ue',"'$1'.str_replace('–','-','$3').'$4'",$text);
    $text=preg_replace('/(<button\sonclick=(\"|\'))(.+)(\'|\")/Ue',"'$1'.str_replace('–','-','$3').'$4'",$text);

Line 1 will replace hyphens with dashes, but ignore those found within object, iframe, style, script, and anchor tags.
Line 2 will replace hyphens with dashes found between the anchor tags.
Line 3 will revert the &ndash; found in inline styling to hyphens.
Line 4 is just extra, but reverts the &ndash; found within the <button> tags to hyphens.

The script will still replace the hyphens found in <img src="/images/x-x.jpg">
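A note for later readers: the `e` modifier used in lines 2 to 4 was deprecated in PHP 5.5 and removed in PHP 7.0, so those lines need `preg_replace_callback` on current PHP versions. Below is a rough equivalent of line 2 only, as a sketch. The function name `dash_anchor_text` is hypothetical, the pattern is tightened slightly (`<a\b[^>]*>` instead of `<a.+>`), and `&ndash;` is assumed as the replacement.

```php
<?php
// Hypothetical helper: replace hyphens in the text between <a ...> and </a>,
// leaving the opening tag (and its href) untouched. No "e" modifier needed.
function dash_anchor_text(string $html): string
{
    return preg_replace_callback(
        '~(<a\b[^>]*>)(.*?)(</a>)~is',
        function (array $m): string {
            // $m[1] = opening tag, $m[2] = link text, $m[3] = closing tag
            return $m[1] . str_replace('-', '&ndash;', $m[2]) . $m[3];
        },
        $html
    );
}

echo dash_anchor_text('<a href="http://www.anime-views.com">X-23</a>'), "\n";
// <a href="http://www.anime-views.com">X&ndash;23</a>
```

Lines 3 and 4 could be converted the same way. Note that any tag nested inside the link text (an image, for instance) would have its hyphens replaced too, so this sketch shares the `<img src>` limitation mentioned above.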