
Marking Science Homework with a More Sophisticated Regex

Last post 11-29-2012, 9:12 PM by Aussie Susan. 7 replies.
  •  11-24-2012, 10:30 AM 87117

    Marking Science Homework with a More Sophisticated Regex


    I am using a Regex to mark science homework in a little online program. The Regex searches for keywords (and their alternatives), without them having to be in any particular order. The Regex also includes a search for forbidden words. This is the Regex, where A B C D stand for necessary words and Z for a forbidden word:


     This has worked fantastically well and I now have about 2,500 questions posted using this basic construction with an average marking accuracy of about 99% (Thanks to the genius who posted this for me some 2 years ago!!). 
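
    The construction described above can be sketched in Python, modelled on the lookahead pattern quoted later in this thread. The keywords ("copper", "oxygen", "heat", "nitrogen") and the helper name `is_correct` are hypothetical, chosen only for illustration, and the '\b' on both sides of each word is an assumption.

```python
import re

# Hypothetical keywords for illustration: "copper", "oxygen" and "heat" are
# required in any order; "nitrogen" is the forbidden word.
ANY_ORDER = re.compile(
    r"^(?!.*?\bnitrogen\b)"  # fail if the forbidden word occurs anywhere
    r"(?=.*?\bcopper\b)"     # each lookahead scans from the start of the text
    r"(?=.*?\boxygen\b)"
    r"(?=.*?\bheat\b)"
    r".*$",
    re.IGNORECASE,
)


def is_correct(answer: str) -> bool:
    """Return True if the answer has every keyword and no forbidden word."""
    return ANY_ORDER.search(answer) is not None
```

    Because every lookahead starts from the beginning of the text, the order in which the keywords appear in the answer does not matter.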

    What I could do with is to be able to restrict the order of the keywords slightly. I can already fix them in A B C D order using   .*?\b   instead of   ?=.*?\b   . But what would be really useful is to link A with B and C with D so only the following would be allowed:

    A  B  C  D ; B  A  C  D ; A  B  D  C ; B  A  D  C

    I know I can simply do 4 Regexes, but this lacks elegance and I need more complex versions of the above (eg A  B  C  D  grouped separately from E and F). A good example of a use of this is a chemical equation where the reactants can be in any order and  products can be in any order but you can't mix reactants with product in the equation.


    PS I don't really understand the above, so explanations need to be simple!


  •  11-25-2012, 6:01 PM 87118 in reply to 87117

    Re: Marking Science Homework with a More Sophisticated Regex



    I've tried this against the following test cases:

    A  B  C  D
    B  A  C  D
    A  B  D  C
    B  A  D  C
    C  D  A  B
    D  C  A  B
    C  D  B  A
    D  C  B  A
    A  C  B  D
    A  C  C  D
    D  A  C  B

    and it matches only the first 4 cases.

    Let's start by treating the A's etc. as individual characters. Also, let's assume we are looking for "A B" or "B A" anywhere, with whitespace the only characters allowed in between.

    The first part is to skip over any character that is neither A nor B. This is simply '[^AB]*'.

    At this point we have 3 situations:
    1) we are at the end of the line (i.e. there are no A or B characters) - this will be handled as an error by the fact that neither of the other 2 options will be matched
    2) the next character is "A" and therefore we must have only whitespace to the next character which must be "B"
    3) the next character is "B" and therefore we must have only whitespace to the next character which must be "A"

    We can handle this with an alternation such as


    (remember that alternation has a very low precedence, so the entire pattern on either side will be considered as a whole)
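
    A small Python sketch of the precedence point, still treating A and B as single characters. The variable names are illustrative only; the patterns follow the description in this post rather than quoting it.

```python
import re

# Alternation splits the whole pattern: "A, whitespace, B" or "B, whitespace, A".
pair = re.compile(r"A\s+B|B\s+A")

# With a prefix in front, the alternation must be grouped; otherwise the
# prefix belongs to the left-hand branch only.
ungrouped = re.compile(r"[^AB]*A\s+B|B\s+A")
grouped = re.compile(r"[^AB]*(?:A\s+B|B\s+A)")

print(bool(pair.fullmatch("B  A")))
print(bool(ungrouped.fullmatch("x B A")))  # the prefix is not part of the right branch
print(bool(grouped.fullmatch("x B A")))
```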

    We can put this together and get:


    At this point, let us remind ourselves that A and B are complete words and not just letters. This means that the '[^AB]' part isn't going to work, but we can create an equivalent that WILL work with words:


    (There is an alternative that is often used in this situation, which is '.*?(A|B)', but I don't want to use it here because it will actually match the keyword and therefore set the text pointer to AFTER the A or B, whereas we want to check that text later on.)
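
    The difference between the two ways of skipping can be seen from where each one leaves the text pointer. This is a minimal Python sketch; the keywords "sodium" and "ionic" and the sample text are placeholders, not taken from the post.

```python
import re

text = "xx sodium rest"

# Skip character by character, but only while no keyword starts at the
# current position. The keyword itself is NOT consumed.
skip = re.compile(r"(?:(?!sodium|ionic).)*")
m1 = skip.match(text)

# The common alternative consumes the keyword as part of the match.
consume = re.compile(r".*?(?:sodium|ionic)")
m2 = consume.match(text)

print(m1.end(), repr(text[m1.end():]))  # pointer sits just before the keyword
print(m2.end(), repr(text[m2.end():]))  # pointer sits just after the keyword
```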

    Thus we have


    In this form, we really don't NEED to put parentheses around the A's and B's, but if we did then it would look like:


    This is beginning to look very like the first part of the pattern we had at the start. All we need to do is to create a similar pattern for the C and D values:


    using the same logic.

    If I extrapolate from your example about the reagents, and assume that you are NOT talking about an equilibrium relation where the left and right sides can be exchanged as whole entities (i.e. "A B C D" is valid but "C D A B" is not - I'll get to that later if necessary), and assuming that a single answer occurs on a single line, then we can have:

    ^      - start at the beginning of the line
    ((?!(A)|(B)).)*((A)\s+(B)|(B)\s+(A))    - require a match of A and B in either order
    ((?!(C)|(D)).)*((C)\s+(D)|(D)\s+(C))    - require a match of C and D in either order
    \s*$   - allow for trailing whitespace and then the end of  the line

    which, when all on one line, is what we started with.
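
    The four parts listed above can be tried against the test cases from the top of this post. This is a Python sketch: `re.VERBOSE` is used as Python's counterpart of the "Ignore Whitespace" option, and the names `ORDERED` and `CASES` are illustrative.

```python
import re

ORDERED = re.compile(
    r"""
    ^
    ((?!(A)|(B)).)* ((A)\s+(B) | (B)\s+(A))   # A and B in either order
    ((?!(C)|(D)).)* ((C)\s+(D) | (D)\s+(C))   # then C and D in either order
    \s*$
    """,
    re.VERBOSE | re.IGNORECASE | re.MULTILINE,
)

CASES = [
    "A  B  C  D", "B  A  C  D", "A  B  D  C", "B  A  D  C",
    "C  D  A  B", "D  C  A  B", "C  D  B  A", "D  C  B  A",
    "A  C  B  D", "A  C  C  D", "D  A  C  B",
]

matched = [case for case in CASES if ORDERED.search(case)]
print(matched)  # the first 4 cases only
```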

    Now, if you want to handle the "equilibrium" reaction case, we need to turn the middle parts into lookaheads and the last part into a match of everything (as you have in your original pattern). Therefore we add

    (?= ........  )

    around the 2 middle parts and use '.*$' at the end and get:


    This matches the first 8 of the test cases above.
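
    A Python sketch of that change, assembled from the description above (lookaheads around the two middle parts, '.*$' at the end). The names `EITHER_SIDE` and `CASES` are illustrative, and `re.VERBOSE` again stands in for "Ignore Whitespace".

```python
import re

EITHER_SIDE = re.compile(
    r"""
    ^
    (?= ((?!(A)|(B)).)* ((A)\s+(B) | (B)\s+(A)) )   # an A/B pair somewhere
    (?= ((?!(C)|(D)).)* ((C)\s+(D) | (D)\s+(C)) )   # a C/D pair somewhere
    .*$
    """,
    re.VERBOSE | re.IGNORECASE | re.MULTILINE,
)

CASES = [
    "A  B  C  D", "B  A  C  D", "A  B  D  C", "B  A  D  C",
    "C  D  A  B", "D  C  A  B", "C  D  B  A", "D  C  B  A",
    "A  C  B  D", "A  C  C  D", "D  A  C  B",
]

matched = [case for case in CASES if EITHER_SIDE.search(case)]
print(len(matched))  # 8
```

    Since both lookaheads start at the beginning of the line, neither pair has to come first.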

    By the way, in all of these I've used the "Ignore Case" and "Ignore Whitespace" options in my testing, which has helped me to create the patterns as separate lines, and also the "multiline" option, which lets me create the multiple-line test case.

    If I've made any incorrect assumptions about anything in deriving these patterns, then I hope you can see where to include the corrections, or please let me know and I'll see how we can incorporate them.

    I hope this all makes sense.


  •  11-26-2012, 10:03 AM 87124 in reply to 87118

    Re: Marking Science Homework with a More Sophisticated Regex

    That's excellent. The second version is the one I need and I think it tests every possible combination correctly. There is one slight problem that I omitted to flag up. For the second expression I can put any word before, in the middle (between AB and CD) and at the end without problem, but it won't allow any arbitrary words in between A and B or C and D. So, for our example of a chemical equation A + B → C + D, it won't tolerate the +'s. Another good example of how I might use this is the following:

    "Sodium chloride is an ionic compound while hydrogen chloride is a covalent one"

    ^(?=((?!(sodium chloride)|(ionic)).)*(((sodium chloride)\s*(ionic))|((ionic)\s*(sodium chloride))))(?=((?!(hydrogen chloride)|(covalent)).)*(((hydrogen chloride)\s*(covalent))|((covalent)\s*(hydrogen chloride)))).*$

    Obviously this won't work because of the extraneous words, but what it needs to do is prevent

    "Sodium chloride is a covalent compound while hydrogen chloride is an ionic one"

     Additionally, I have tried


    as a more complex one that seems to work OK (with above problem), but it does start to become unwieldy.

    I also tried


    for 3 couplets which seems to work OK, so presumably you can go on indefinitely!

    On the issue of white space the Regex seems to ignore it anyway, which is good and I presume  /i  at the end will make it case insensitive if needed.

     Thanks ever so much!


  •  11-26-2012, 5:23 PM 87125 in reply to 87124

    Re: Marking Science Homework with a More Sophisticated Regex

    First a quick explanation of the "ignore whitespace" option: I'm guessing from your comment that you are referring to the fact that the pattern ignores the whitespace in the text it is scanning. This is due to the way that the pattern processes the text and is NOT the meaning of the "ignore whitespace" option. What "ignore whitespace" does is tell the pattern parser (the part that scans your pattern and converts it into an internal state machine that the regex engine then uses to scan your text) to ignore any whitespace in the pattern itself. This lets you write patterns that (for example) can be split over multiple lines in your program, can contain comments, or can be spaced out so that it is more "obvious" to some subsequent maintainer how the pattern is constructed.

    In your case, you would probably NOT want to use this option when you have "sodium chloride" in your pattern, as the pattern parser would see this as "sodiumchloride".
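
    The same effect can be reproduced in Python, where `re.VERBOSE` plays the role of the "ignore whitespace" option (the thread itself is about a different engine, so treat this as an analogy):

```python
import re

plain = re.compile(r"sodium chloride")
verbose = re.compile(r"sodium chloride", re.VERBOSE)           # the space is dropped
verbose_fixed = re.compile(r"sodium \s chloride", re.VERBOSE)  # '\s' survives

print(bool(plain.search("sodium chloride")))
print(bool(verbose.search("sodium chloride")))   # pattern is really "sodiumchloride"
print(bool(verbose.search("sodiumchloride")))
print(bool(verbose_fixed.search("sodium chloride")))
```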

    You are correct about the use of the "ignore case" option - that DOES work on the text that is being scanned. (Confusing huh!)

    As for allowing words and other characters in between your pairs of keywords, if you go back to the part of the pattern that looks for the "2nd keyword", as in:

    ((sodium chloride)\s*(ionic))

    you will see the '\s*' bit in the middle. That is saying that only whitespace can occur between the "sodium chloride" and "ionic". The trick here is to work out what you CAN allow in between these words and try to keep this to a reasonable size, because it is going to be used (at least in this pattern) 4 times.

    The initial thought is to allow anything, i.e. to use:

    (sodium chloride).*?ionic

    This will certainly do the trick but it will also match your example where the keywords are incorrectly paired. As the keywords are the main thing here, my suggestion is to match everything EXCEPT any of the keywords:

    (sodium chloride)((?!hydrogen chloride|ionic|covalent|sodium chloride).)*(ionic)

    What this will do is to start by matching "sodium chloride" and then match everything except any of the keywords and finally make sure that the "match everything except.." stopped at the required paired keyword.
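
    The two candidates for the gap can be compared on the correctly and incorrectly paired sentences. In this Python sketch the variable names are illustrative, and case-insensitive matching is assumed.

```python
import re

good = "Sodium chloride is an ionic compound while hydrogen chloride is a covalent one"
bad = "Sodium chloride is a covalent compound while hydrogen chloride is an ionic one"

# "Allow anything" between the two keywords.
loose = re.compile(r"(sodium chloride).*?ionic", re.IGNORECASE)

# Allow anything EXCEPT the start of another keyword.
strict = re.compile(
    r"(sodium chloride)"
    r"((?!hydrogen chloride|ionic|covalent|sodium chloride).)*"
    r"(ionic)",
    re.IGNORECASE,
)

print(bool(loose.search(bad)))    # the loose gap runs past "covalent" to "ionic"
print(bool(strict.search(good)))
print(bool(strict.search(bad)))   # the gap stops at "covalent", so "ionic" is not next
```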

    If we put this into the full pattern we get something like:


    (You will see my use of '\s' instead of the space characters - this is because my testing is using the "ignore whitespace" option; see my comments at the start of this entry).

    Given the test strings:

    Sodium chloride is an ionic compound while hydrogen chloride is a covalent one
    Sodium chloride ionic compound while hydrogen chloride covalent one
    Sodium chloride = ionic compound while hydrogen chloride = covalent one
    Sodium chloride is a covalent compound while hydrogen chloride is an ionic one

    this matches the first 3.
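
    One way to assemble such a pattern without typing every repetition by hand is to generate it. The Python sketch below is a reconstruction under assumptions, not the pattern from this post: the helper `pair_lookahead` and the names `KEYWORDS`, `PAIRED` and `ANSWERS` are hypothetical, and literal spaces are kept because `re.VERBOSE` is not used.

```python
import re

KEYWORDS = ["sodium chloride", "ionic", "hydrogen chloride", "covalent"]


def pair_lookahead(first: str, second: str) -> str:
    """Lookahead requiring `first` and `second` with no other keyword between."""
    a, b = re.escape(first), re.escape(second)
    any_keyword = "|".join(re.escape(k) for k in KEYWORDS)
    gap = rf"(?:(?!{any_keyword}).)*"   # anything that does not start a keyword
    skip = rf"(?:(?!{a}|{b}).)*"        # skip up to the first word of this pair
    return rf"(?={skip}(?:{a}{gap}{b}|{b}{gap}{a}))"


PAIRED = re.compile(
    "^"
    + pair_lookahead("sodium chloride", "ionic")
    + pair_lookahead("hydrogen chloride", "covalent")
    + ".*$",
    re.IGNORECASE,
)

ANSWERS = [
    "Sodium chloride is an ionic compound while hydrogen chloride is a covalent one",
    "Sodium chloride ionic compound while hydrogen chloride covalent one",
    "Sodium chloride = ionic compound while hydrogen chloride = covalent one",
    "Sodium chloride is a covalent compound while hydrogen chloride is an ionic one",
]

accepted = [answer for answer in ANSWERS if PAIRED.search(answer)]
print(len(accepted))  # 3
```

    Adding a third couplet would then be one more `pair_lookahead(...)` call plus two more entries in `KEYWORDS`.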

    As for adding more components to the pattern, I hope that you can see the "pattern" in the pattern (as it were) and can therefore see how to extend it. The problem you will face is that it will become so big and complex (and also very sensitive to spelling - "sodium chloirde" will fail the student although it is probably clear that they know their chemistry) that it will be next to impossible to maintain. (Also, how about "NaCl is ionic, HCl is covalent" - feel like adding in more options???)

    One final comment (also pertaining to spelling): as written, the pattern will also match

    HydroHexafluroSodium Chloride when stuffed with an subionical compound such as water will generate hydrogen chlorideification as a incovalentious mixture

    The reason for this is that you are only looking for the (say) string of characters that make up "ionic" and are not limiting this to complete words. Unfortunately the cure is almost worse than the complaint: you need to add '\b' at the start and end of each keyword to force it to be a "whole word (or phrase)" only. This makes the pattern:
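
    The effect of '\b' can be checked in isolation with a small Python sketch (this is not the full pattern, only the keyword part; the variable names are illustrative):

```python
import re

loose = re.compile(r"ionic", re.IGNORECASE)
whole_word = re.compile(r"\bionic\b", re.IGNORECASE)
whole_phrase = re.compile(r"\bsodium chloride\b", re.IGNORECASE)

print(bool(loose.search("a subionical compound")))       # substring is enough
print(bool(whole_word.search("a subionical compound")))  # no boundary around "ionic"
print(bool(whole_word.search("an ionic compound")))
print(bool(whole_phrase.search("HydroHexafluroSodium Chloride")))
print(bool(whole_phrase.search("Sodium chloride is ionic")))
```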


    You will see that I have cheated in the part that skips everything but the keywords by placing the '\b's outside some parentheses. This is really the equivalent of factoring in an equation:


    is the same as



  •  11-28-2012, 6:52 AM 87131 in reply to 87125

    Re: Marking Science Homework with a More Sophisticated Regex

    That's brilliant! I think the subtleties of  \s  are beyond me and I've tried it with and without and can't detect any difference. But that doesn't matter, it won't stop me using it. I've found a couple of ways to pare it down a little so it now looks like this:


    I reduced  (A|B|C|D)\b).)* to (C|D)\b).)*  as it doesn't matter if they repeat A or B in the A/B side and I got rid of as many \b's as I could with a bit of bracketing.

    I've also worked out an    A goes with B  but C    version:


    and a A/B  C/D  E/F one:


    One great thing is I can exclude    A is not B    by doing   (C|D|not)\b).)*  or add any other forbidden word between A and B!

    A B C without D E looks daunting, but I could probably work it out!

    I can't thank you enough that really is terrific. I now have a powerful tool that I will be able to use in all sorts of situations.


    PS If you want to turn your mind to another problem .......?

    I need to deal with quite a lot of answers that take the form "The bigger A, the bigger B" for which I might use

     ^(?!.*?(n't|\b(can)?not\b|\bno\b|\bnor\b|\bsmaller))(?=.*?\b(A))(?=.*?\b(B)) (?=.*?\b(bigger|larger)).{0,100}$

    But this doesn't require   bigger|larger   twice and putting   (?=.*?\b(bigger|larger))     twice is redundant. Is there any way of requiring a key word (or alternatives) to be repeated?

  •  11-28-2012, 5:04 PM 87135 in reply to 87131

    Re: Marking Science Homework with a More Sophisticated Regex

    The general way of handling required keywords is with a lookahead of the form:


    This can only be successful if "keyword" appears at least once in the text.

    To make sure that it appears at least twice you use


    and more generally


    with suitable values for the repetition.

    If you want to provide a maximum number of times then you can use something like


    which will require 1 or 2 instances but reject cases with 0 keywords or more than 2. This form is a little more complex because the "natural" extension of the previous pattern might be


    which simply checks that whatever follows the 2 required keywords does not contain the keyword. However, you get caught out by the regex engine's backtracking: when it matches the first 2 "keywords" and then finds another, it will backtrack to the '.*?' part and let it add another character, which is part of the first keyword. That lets it carry on with the rest of the match and, in effect, locates the LAST 2 matches but doesn't stop the first ones being there.
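
    The backtracking trap can be reproduced in Python. In this sketch `bounded` is the 1-or-2 form quoted in the next reply (with non-capturing groups), while `naive` is only a guess at what the "natural" extension might look like, so treat it as an assumption.

```python
import re

# 1 or 2 keywords: every skipped character is checked not to start a keyword.
bounded = re.compile(
    r"^(?:(?:(?!\bkeyword\b).)*\bkeyword\b){1,2}(?:(?!\bkeyword\b).)*$"
)

# Assumed "natural" extension: two keywords, then no further keyword.
naive = re.compile(r"^(?=(?:.*?\bkeyword\b){2}(?!.*\bkeyword\b)).*$")

three = "keyword keyword keyword"
print(bool(bounded.search(three)))  # rejected: a third keyword cannot be skipped
print(bool(naive.search(three)))    # accepted: '.*?' stretches over a keyword
```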


  •  11-29-2012, 5:53 AM 87137 in reply to 87135

    Re: Marking Science Homework with a More Sophisticated Regex

    That looks fantastic, especially    (?=(.*?\bkeyword\b){4})     and    ^(((?!\bkeyword\b).)*\bkeyword\b){1,2}((?!\bkeyword\b).)*$

    But I can't get the first to work without adding     .*    to make     (?=(.*?\bkeyword\b){4}).*   which also works fine in my standard Regex:


    The second works fine on its own, but I can't integrate it into my standard Regex:


    will not allow A B or C before the minimum number of key words and doesn't have a maximum of    \bkeyword\b  ,


    allows A B C anywhere, in any order and split up, but still doesn't limit the number of    \bkeyword\b.

     Almost there!

    Can you recommend any good books on Regex (simple!)?

  •  11-29-2012, 9:12 PM 87139 in reply to 87137

    Re: Marking Science Homework with a More Sophisticated Regex

    I'm sorry, I should have made it clear that I was proposing a small pattern that should be added at the start of another pattern that would do the actual matching. Normally you put these lookahead phrases at the start to verify that it is worthwhile processing the rest of the pattern: if you have (say) too few instances of a required keyword, then the rest of the pattern might match but the result would be a waste of time.

    That was why things worked when you added the '.*' etc. - you were adding in the parts that would do the actual matching.

    Also, you need to make sure that all of the lookaheads start at the beginning of the text - when you start building up very complex patterns, it is easy to make a tiny error and end up breaking the whole thing. For example, in the last pattern you mention, the


    part actually does some matching and therefore will move the text pointer forward (in this case to after the 2nd "keyword"). I suspect that, if you make this a lookahead as well (including the bit that follows it), leaving just the '.{0,100}$' to do the matching, then it would work better.
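
    A Python sketch of that suggestion, based on the "The bigger A, the bigger B" pattern posted earlier: the counting part, including the bit that follows it up to '$', becomes one more lookahead, and only '.{0,100}$' consumes text. The keywords "mass" and "force" stand in for A and B and are hypothetical, as are the variable names; the count here is exactly two.

```python
import re

COMPARATIVE = r"\b(?:bigger|larger)\b"
NOT_COMPARATIVE = rf"(?:(?!{COMPARATIVE}).)*"

PATTERN = re.compile(
    r"^(?!.*?(?:n't|\b(?:can)?not\b|\bno\b|\bnor\b|\bsmaller))"  # forbidden words
    r"(?=.*?\bmass\b)"    # hypothetical keyword A
    r"(?=.*?\bforce\b)"   # hypothetical keyword B
    # exactly two comparatives, checked all the way to the end of the text
    rf"(?=(?:{NOT_COMPARATIVE}{COMPARATIVE}){{2}}{NOT_COMPARATIVE}$)"
    r".{0,100}$",
    re.IGNORECASE,
)

print(bool(PATTERN.search("The bigger the mass, the bigger the force")))
print(bool(PATTERN.search("The bigger the mass, the force")))
print(bool(PATTERN.search("The bigger the mass, the bigger the bigger force")))
```

    Because every check is a lookahead anchored at the start, the keywords can still appear anywhere and in any order.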

    You will have noticed that I've sometimes split the patterns over several lines and used indenting. While this has partly been to make the explanations clearer for you, it is also a good way to visualise the overall structure and to help you spot mistakes - it is simpler to look at a small bit and work out what it does than try to understand the monstrosities that some patterns turn into.

    As for the book, the one that I swear by is "Mastering Regular Expressions" by Jeffrey Friedl. However, it is not necessarily the best place to start, as it covers a lot of ground and also a lot of the different regex engine capabilities - unfortunately there is no such thing as a "standard" when it comes to regex patterns, and each engine has its own capabilities and (in some cases) syntax to let you express things. My suggestion is to start with a regex test platform that suits your needs and regex language (I use the Expresso one that is based on the .NET library as I can run this on my PC, but there are also lots of web-based regex testers out there that use other regex engines) so that you can "play" with the patterns and build them up slowly. Generally you can start with something simple and then add bits as you go, fixing things when the last addition breaks the whole pattern.

    Also I suggest that you have a definite purpose for building up a regex pattern. In your case you do, but for general learning, it is better to have a goal in mind (start small and then build up) as this will give direction to your learning rather than just picking up random bits and pieces.

