
Pulling page numbers out of a citation

Last post 06-29-2009, 5:20 PM by ddrudik. 3 replies.
  •  06-29-2009, 9:39 AM 54418

    Pulling page numbers out of a citation

    I am working with a tool our library uses, called MetaLib, which allows a person to search one or more of our databases for articles and then connects to the full text by pulling various metadata and putting it into an OpenURL.  MetaLib uses Perl-based regex in its configuration files to pull the data.  I'm new to regex and am having trouble setting up the configuration for a particular database.  The majority of the citation information is kept in one field, field "g".  My problem is that the article page numbers may be in one of two formats:

    1) g JOURNAL OF CHINA UNIVERSITY OF GEOSCIENCES v.18, no.1, pp.49-59, March 2007. (ISSN 1002-0705; Over 10 refs)-59

     2) g OIL GEOPHYS. PROSPECTING v.39, no.2, pp.VI,218-221, 4/5/2004. (ISSN 1000-7210; 2 refs; In Chinese)-221

     As you can see, the page numbers may come immediately after the "pp." or they may follow some Roman numerals.  I need to create a regular expression that will pull the page numbers either way.  I've created a couple of expressions, but neither works:

    (?<=pp\.)(?([A-Z][A-Z]?)(?<=,)(\d{1,}-\d{1,})|(?<=pp\.)(\d{1,}-\d{1,}))

    (?(pp\.[A-Z][A-Z]?)(\d{1,}-\d{1,})|((?<=pp\.)(\d{1,}-\d{1,})))

     Any assistance would be appreciated.

  •  06-29-2009, 10:35 AM 54422 in reply to 54418

    Re: Pulling page numbers out of a citation

    Raw Match Pattern:
    pp\.([A-Z\d,-]+)

    $matches Array:
    (
        [0] => Array
            (
                [0] => pp.49-59,
                [1] => pp.VI,218-221,
            )

        [1] => Array
            (
                [0] => 49-59,
                [1] => VI,218-221,
            )

    )
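For anyone who wants to reproduce the result above outside MetaLib, here is a minimal sketch in Python. It assumes Python's `re` module treats this pattern the same way the Perl-style engine does; the names `BROAD` and `broad_pages` are made up for illustration. Note that the capture keeps the Roman numerals and the trailing comma, exactly as in the array shown.

```python
import re

# The two sample citations from the original question.
citations = [
    "g JOURNAL OF CHINA UNIVERSITY OF GEOSCIENCES v.18, no.1, pp.49-59, March 2007. (ISSN 1002-0705; Over 10 refs)-59",
    "g OIL GEOPHYS. PROSPECTING v.39, no.2, pp.VI,218-221, 4/5/2004. (ISSN 1000-7210; 2 refs; In Chinese)-221",
]

# Broad pattern from this reply: everything after "pp." that is an
# uppercase letter, digit, comma, or hyphen.
BROAD = re.compile(r"pp\.([A-Z\d,-]+)")

def broad_pages(citation):
    """Return capture group 1 of the broad pattern, or None if no match."""
    m = BROAD.search(citation)
    return m.group(1) if m else None

broad_results = [broad_pages(c) for c in citations]
print(broad_results)  # ['49-59,', 'VI,218-221,']
```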

  •  06-29-2009, 4:57 PM 54431 in reply to 54422

    Re: Pulling page numbers out of a citation

    Thank you for the response.  How would I drop the Roman numeral so that, whether or not it's present, I would only pull the digits?

    Thanks again.

  •  06-29-2009, 5:20 PM 54432 in reply to 54431

    Re: Pulling page numbers out of a citation

    Can your product use regex capture groups?  If so, you could use capture group 1:

    pp\.(?:[IVXCLDM]+,)*((?:\d+(?:-\d+)?,)*\d+(?:-\d+)?)
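As a rough check of the capture-group pattern above, here is a sketch in Python (an assumption on my part; MetaLib itself uses Perl-based regex, but this pattern uses no syntax that differs between the two). The helper name `extract_pages` is hypothetical. The leading `(?:[IVXCLDM]+,)*` consumes any comma-terminated Roman numerals without capturing them, so group 1 holds only the digit ranges.

```python
import re

# Pattern from the post: skip optional Roman numerals, capture the page numbers.
PAGES = re.compile(r"pp\.(?:[IVXCLDM]+,)*((?:\d+(?:-\d+)?,)*\d+(?:-\d+)?)")

def extract_pages(citation):
    """Return capture group 1 (the numeric pages), or None if no match."""
    m = PAGES.search(citation)
    return m.group(1) if m else None

sample_1 = "g JOURNAL OF CHINA UNIVERSITY OF GEOSCIENCES v.18, no.1, pp.49-59, March 2007. (ISSN 1002-0705; Over 10 refs)-59"
sample_2 = "g OIL GEOPHYS. PROSPECTING v.39, no.2, pp.VI,218-221, 4/5/2004. (ISSN 1000-7210; 2 refs; In Chinese)-221"

print(extract_pages(sample_1))  # 49-59
print(extract_pages(sample_2))  # 218-221
```

The trailing comma is left out because the final `\d+(?:-\d+)?` must end the capture on a digit; the regex backtracks off the comma when no further number follows it.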

    If not, can the product apply a regex to the after-match data?  If so, you could use:

    \d+(?:-\d+)?
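Here is a sketch of that two-pass idea in Python. I am assuming "after-match data" means the text captured by the earlier broad pattern `pp\.([A-Z\d,-]+)`, and the function name `two_pass_pages` is made up; whether MetaLib can chain expressions like this is something you would need to check in its documentation.

```python
import re

# First pass: the broad pattern from the earlier reply.
FIRST_PASS = re.compile(r"pp\.([A-Z\d,-]+)")
# Second pass: the first number or number range in whatever the first pass captured.
SECOND_PASS = re.compile(r"\d+(?:-\d+)?")

def two_pass_pages(citation):
    """Apply the second pattern to group 1 of the first; None if either fails."""
    first = FIRST_PASS.search(citation)
    if first is None:
        return None
    second = SECOND_PASS.search(first.group(1))
    return second.group(0) if second else None

two_pass_1 = two_pass_pages("g JOURNAL OF CHINA UNIVERSITY OF GEOSCIENCES v.18, no.1, pp.49-59, March 2007.")
two_pass_2 = two_pass_pages("g OIL GEOPHYS. PROSPECTING v.39, no.2, pp.VI,218-221, 4/5/2004.")
print(two_pass_1)  # 49-59
print(two_pass_2)  # 218-221
```

One caveat with this version: it returns only the first number or range, so a field such as `pp.12,15-20` would yield `12`. Using `SECOND_PASS.findall(...)` instead would return every range as a list.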

