Remove rtf tags (vba, vbscript-flavor)

Last post 11-10-2009, 1:17 AM by beginner_. 4 replies.
  •  11-06-2009, 4:07 AM 57183

    Remove rtf tags (vba, vbscript-flavor)

    Hi,

    Pattern = "({\\)(.+?)(})|(\\)(.+?)(\b)"

    The pattern seems to be OK most of the time, but for certain strings it does not work correctly and I do not see why.

     

    Example:

    {\rtf1\ansi\ansicpg1252\deff0{\fonttbl{\f0\fnil\fcharset0 MS Sans Serif;}{\f1\froman\fprq2\fcharset2 Symbol;}}
    {\colortbl ;\red0\green0\blue0;}
    \viewkind4\uc1\pard\lang2055\f0\fs17 EC50 dilute 200 \cf1\f1\fs16 m\cf0\f0\fs17 M - 0.01 \cf1\f1\fs16 m\cf0\f0\fs17 M\par
    }
     

    is returned as 

    }

     EC50 dilute 200  m M - 0.01  m

     

    The leftover bracket and whitespace are expected given my regex, and can easily be removed with a second regex and/or the VBA Replace function.

    What is missing is the last "M" before \par at the end of the string. Why is it removed?

    Second Example:

    {\rtf1\ansi\ansicpg1252\deff0{\fonttbl{\f0\fnil\fcharset0 MS Sans Serif;}}
    \viewkind4\uc1\pard\lang2055\f0\fs17 ( 755 )  coexpressed enzyme  2x educt                                                                                                                                                                                                                                                            ( 763 ) coexpressed enzyme 2x substrate  higher conc\par
                                                                                                                                                                                \par
    }

     

    is returned as:

    }
     ( 755 )  coexpressed enzyme  2x educt                                                                                                         

     

    so a lot of the text gets lost and I don't see why.

     

  •  11-08-2009, 5:06 PM 57217 in reply to 57183

    Re: Remove rtf tags (vba, vbscript-flavor)

    Using the "Microsoft VBScript regular expressions V5.5" and the following code:

    Option Explicit

    Public Sub test()
        Dim pattern As RegExp
        Dim result As String
        Dim text As String
       
        text = "{\rtf1\ansi\ansicpg1252\deff0{\fonttbl{\f0\fnil\fcharset0 MS Sans Serif;}{\f1\froman\fprq2\fcharset2 Symbol;}}" & _
               "{\colortbl ;\red0\green0\blue0;}" & _
               "\viewkind4\uc1\pard\lang2055\f0\fs17 EC50 dilute 200 \cf1\f1\fs16 m\cf0\f0\fs17 M - 0.01 \cf1\f1\fs16 m\cf0\f0\fs17 M\par" & _
               "}"
        Set pattern = New RegExp
        pattern.pattern = "({\\)(.+?)(})|(\\)(.+?)(\b)"
        pattern.Global = True
        pattern.IgnoreCase = True
        result = pattern.Replace(text, "")
        Debug.Print result
    End Sub

    the output from "result" is:

    } EC50 dilute 200  m M - 0.01  m M}

     

    You will note that I've lost the line breaks - adding them back in (using vbcrlf's in the appropriate places) I get:

    }

     EC50 dilute 200  m M - 0.01  m M
    }

    which would seem to be what you are after.

    Perhaps if you show us the code you are using it might help.

    Susan

  •  11-09-2009, 7:50 AM 57228 in reply to 57217

    Re: Remove rtf tags (vba, vbscript-flavor)

    This is my vba function (called in an MS Access query):

    Public Function remove_rtf(rtf_string As String) As String

        If Not rtf_string = "" Then

            Dim temp As String

            ' First pass: remove {\...} groups and \control words.
            Dim objRegEx As Object
            Set objRegEx = CreateObject("vbscript.regexp")
            With objRegEx
                .Global = True
                .IgnoreCase = False
                .MultiLine = True
                .Pattern = "({\\)(.+?)(})|(\\)(.+?)(\b)"
            End With

            temp = objRegEx.Replace(rtf_string, "")

            ' Second pass: strip leading and trailing whitespace on every line
            ' (MultiLine makes ^ and $ match at line breaks).
            Dim objRegEx2 As Object
            Set objRegEx2 = CreateObject("vbscript.regexp")
            With objRegEx2
                .Global = True
                .IgnoreCase = False
                .MultiLine = True
                .Pattern = "^[\s]+|[\s]+$"
            End With

            ' Drop the closing braces that the first pass leaves behind.
            temp = Replace(temp, "}", "")
            remove_rtf = objRegEx2.Replace(temp, "")

        Else

            ' Nothing to strip from an empty string.
            remove_rtf = rtf_string

        End If

    End Function

    EDIT:

    just read:

    VBScript does not have an option to make the dot match line break characters

     

    maybe that's the problem.

     

  •  11-09-2009, 5:08 PM 57235 in reply to 57228

    Re: Remove rtf tags (vba, vbscript-flavor)

    I have just run your code with a small calling function that passes the same input as I used before and the result was

    EC50 dilute 200  m M - 0.01  m M

    for the first sample text.

    Also, the fact that the "dot" doesn't match line terminators is not an issue for you given your sample text as there are no RTF tags that cross a line boundary. The only part that might be affected is the first alternative, but that would show up by the pattern not matching (and therefore not replacing) any text of the form:

    {\tag.....<line break>
    ...}

    You would be left with (at least) the "{" in the output text.
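
    If a group ever does span a line break, a common workaround in the VBScript flavor is to use "[\s\S]" in place of the dot. The sketch below is only an illustration: the sub name and the sample string are made up, and the pattern is a simplified form of the first alternative.

    ```vba
    ' Sketch only: compares "." with "[\s\S]" on a group that spans a line break.
    ' DemoDotVersusAnyChar and the sample string are made up for illustration.
    Public Sub DemoDotVersusAnyChar()
        Dim re As Object
        Dim sample As String

        Set re = CreateObject("VBScript.RegExp")
        re.Global = True

        ' A "{\tag ... }" group with a line break inside it.
        sample = "{\tag first line" & vbCrLf & "second line} body text"

        ' "." cannot cross the line break, so the match never reaches the "}".
        re.Pattern = "\{\\.+?\}"
        Debug.Print re.Test(sample)            ' False

        ' "[\s\S]" matches any character, including CR and LF.
        re.Pattern = "\{\\[\s\S]+?\}"
        Debug.Print re.Replace(sample, "")     ' " body text"
    End Sub
    ```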

    Just a note about your pattern: it will not handle correctly the situation (as in your first example) where there are nested "{" "}"s. Given the text:

    {\rtf1\{\fonttbl{\f0}{\f1}}

    (edited down from your first example), your pattern will first match the "{\rtf1\{\fonttbl{\f0}" part and then the "{\f1}" part, leaving the final "}" behind. This is because regular expressions cannot (in general) count, so they cannot pair up nested delimiters; this is often called the "balanced (or matching) parentheses problem". Some regex flavors have extensions that overcome this, but VBScript's is not one of them.
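
    One way around the counting limitation is to do the counting in plain VBA before any regex runs. The function below is a rough sketch, not a full RTF parser: StripNestedGroups is a made-up name, it assumes the visible text sits directly inside the outer {\rtf1 ...} group (depth 1), and it assumes the input contains no escaped braces ("\{" or "\}").

    ```vba
    ' Sketch only: drops everything inside nested {...} groups by counting depth.
    ' Assumes the body text sits at depth 1 and that there are no escaped braces.
    Public Function StripNestedGroups(ByVal rtf As String) As String
        Dim i As Long
        Dim depth As Long
        Dim ch As String
        Dim result As String

        For i = 1 To Len(rtf)
            ch = Mid$(rtf, i, 1)
            Select Case ch
                Case "{"
                    depth = depth + 1
                Case "}"
                    If depth > 0 Then depth = depth - 1
                Case Else
                    ' Keep only characters that are not inside a nested group.
                    If depth <= 1 Then result = result & ch
            End Select
        Next i

        StripNestedGroups = result
    End Function
    ```

    The result still contains the depth-1 control words (\rtf1, \pard, \par and so on), so it would then go through a regex that keeps only the control-word alternative of the pattern.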

    Susan

     

    Edit: Just a thought: try setting ".Global" to False and running the replace in a loop. This will not fully emulate a Global replace (each pass re-scans the text from the start, which a Global replace does not), but it might show you why the pattern is incorrectly matching the final "M" and removing it. If nothing else, you will see the text being modified incrementally, one intermediate stage at a time.
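
    The loop suggested above could look roughly like this. It is a sketch only: TraceReplacements is a made-up name, and the settings simply copy those in the posted function apart from Global.

    ```vba
    ' Sketch only: applies the pattern one match at a time and prints each step.
    ' TraceReplacements is a made-up name; pass it the raw RTF string.
    Public Sub TraceReplacements(ByVal text As String)
        Dim re As Object
        Dim m As Object
        Dim stepNo As Long

        Set re = CreateObject("VBScript.RegExp")
        re.Global = False          ' replace only the first match per call
        re.IgnoreCase = False
        re.MultiLine = True
        re.Pattern = "({\\)(.+?)(})|(\\)(.+?)(\b)"

        ' Every match is at least two characters long, so the text shrinks
        ' on each pass and the loop ends when nothing matches any more.
        Do While re.Test(text)
            Set m = re.Execute(text)(0)
            stepNo = stepNo + 1
            Debug.Print stepNo & ": removing [" & m.Value & "] at " & m.FirstIndex
            text = re.Replace(text, "")
            Debug.Print "   -> " & text
        Loop
    End Sub
    ```

    The output appears in the Immediate window; FirstIndex is zero-based.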

  •  11-10-2009, 1:17 AM 57240 in reply to 57235

    Re: Remove rtf tags (vba, vbscript-flavor)

    OMG, I must apologize. I found the error and, as always, it's an extremely stupid one. I copied the original data into a new column, but the original data was in a Memo field and the new column is a Text field (which is limited to 255 characters). "EC50 dilute 200  m M - 0.01  m" plus the preceding RTF tags is exactly 255 characters. That's why only the "M" is missing. Kind of a stupid coincidence.
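
    A quick length check before stripping the tags might catch this kind of truncation earlier. The helper below is only a sketch: LooksTruncated is a made-up name, and the test assumes a complete RTF value always ends with the closing "}" of its outer group, possibly followed by whitespace.

    ```vba
    ' Sketch only: flags a value that may have been cut off by a 255-character Text field.
    ' LooksTruncated is a made-up helper; see the assumptions stated above.
    Public Function LooksTruncated(ByVal rtf As String) As Boolean
        Dim trimmed As String

        ' Drop trailing line breaks and spaces before looking at the last character.
        trimmed = rtf
        Do While Len(trimmed) > 0 And _
                 InStr(" " & vbCr & vbLf, Right$(trimmed, 1)) > 0
            trimmed = Left$(trimmed, Len(trimmed) - 1)
        Loop

        ' Exactly 255 characters and no closing brace suggests a cut-off value.
        LooksTruncated = (Len(rtf) = 255) And (Right$(trimmed, 1) <> "}")
    End Function
    ```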