Got more questions? Find advice on: ASP | SQL | XML | Windows
in Search
Welcome to RegexAdvice Sign in | Join | Help

Parsing XML

Last post 11-19-2009, 6:27 AM by Dani3ll3. 3 replies.
Sort Posts: Previous Next
  •  11-13-2009, 12:15 PM 57313

    Parsing XML

    I would like to parse an xml file but I am having some trouble with the regular expression. I want to extract the contents of a tag but the problem is that this tag occurs serveral times in the XML file but I only want the contents of one particular occurence. Basically the problem is as follows;

    I want to extract
    <bp:NAME ***stufff***>(I want this part)</bp:NAME>

    This tag can occur is serval places. For example here;
    <bp:ORGANISM>
    ***bunch of tags***
    <bp:NAME ***stufff***>***stufff***</bp:NAME>
    ***bunch of tags***
    </bp:ORGANISM>

    or here;
    <bp:DATABASE>
    ***bunch of tags***
    <bp:NAME ***stufff***>***stufff***</bp:NAME>
    ***bunch of tags***
    </bp:DATABASE>

    I do not want the content of those tags. I want the content of the <NAME> tag that is not between either the <ORGANISM> tags or the <DATABASE> tags. These tags can be in any order. I for the life of me cannot seem to figure this problem out. I tried several different approaches. For example I tried using the following regex

    (?:<bp:NAME [^>]*>([^<]*).*?<bp:ORGANISM>.*?</bp:ORGANISM>|<bp:ORGANISM>.*?</bp:ORGANISM>.*?<bp:NAME [^>]*>([^<]*))

    This kind of works, the information I want is either in the first captured group or in the second one. So I just check which group is not empty and that is the one I want. But this only works if there is only one other tag containing the name tag (in this particular regular expression that is the organism tag). Since there is another tag (the database tag) I have to work around, and these tags can be in any order, the regular expression then becomes three times as large and then there are six different groups in which the information I want can occur. This does not seem like a good idea to me. There has to be another way to do this. So I tried using the following regex;

    (?:</bp:ORGANISM>)?.*?(?:</bp:DATABASE>)?.*?<bp:NAME [^>]*>([^<]*)

    I thought this would get rid of any occurences of the other tags in front of the name tag, but it doesn't work either. It seems like it is not greedy enough. Well I think you get the point. I don't know what to try next so I really need some help.

     Here is an example of the type of data I will run into. The tags can be in any order and they do not always have to occur. In the example below the <DATABASE> tag is not part of the data and the name tag I want just happens to be in front of the organism tag but this is not always the case. The underlined name tag is the one I want the data from.

    <bp:protein rdf:ID="CPATH-27885">
    −<bp:COMMENT rdf:datatype="xsd:string">
    Belongs to the nuclear hormone receptor family. NR3 subfamily. SIMILARITY: Contains 1 nuclear receptor DNA-binding domain. WEB RESOURCE: Name=NIEHS-SNPs; URL="http://egp.gs.washington.edu/data/pgr/"; WEB RESOURCE: Name=Wikipedia; Note=Progesterone receptor entry; URL="http://en.wikipedia.org/wiki/Progesterone_receptor"; GENE SYNONYMS: NR3C3. COPYRIGHT:  Protein annotation is derived from the UniProt Consortium (http://www.uniprot.org/).  Distributed under the Creative Commons Attribution-NoDerivs License.
    </bp:COMMENT>
    <bp:SYNONYMS rdf:datatype="xsd:string">Nuclear receptor subfamily 3 group C member 3</bp:SYNONYMS>
    <bp:SYNONYMS rdf:datatype="xsd:string">PR</bp:SYNONYMS>
    <bp:NAME rdf:datatype="xsd:string">Progesterone receptor</bp:NAME>
    <bp:ORGANISM>
    −<bp:bioSource rdf:ID="CPATH-LOCAL-112384">
    <bp:NAME rdf:datatype="xsd:string">Homo sapiens</bp:NAME>
    −<bp:TAXON-XREF>
    −<bp:unificationXref rdf:ID="CPATH-LOCAL-112385">
    <bp:DB rdf:datatype="xsd:string">NCBI_TAXONOMY</bp:DB>
    <bp:ID rdf:datatype="xsd:string">9606</bp:ID>
    </bp:unificationXref>
    </bp:TAXON-XREF>
    </bp:bioSource>
    </bp:ORGANISM>
    <bp:SHORT-NAME rdf:datatype="xsd:string">PRGR_HUMAN</bp:SHORT-NAME>
    −<bp:XREF>
    −<bp:relationshipXref rdf:ID="CPATH-LOCAL-112386">
    <bp:DB rdf:datatype="xsd:string">ENTREZ_GENE</bp:DB>
    <bp:ID rdf:datatype="xsd:string">5241</bp:ID>
    </bp:relationshipXref>
    </bp:XREF>
    −<bp:XREF>
    −<bp:unificationXref rdf:ID="CPATH-LOCAL-112387">
    <bp:DB rdf:datatype="xsd:string">UNIPROT</bp:DB>
    <bp:ID rdf:datatype="xsd:string">P06401</bp:ID>
    </bp:unificationXref>
    </bp:XREF>
    −<bp:XREF>
    −<bp:unificationXref rdf:ID="CPATH-LOCAL-112388">
    <bp:DB rdf:datatype="xsd:string">UNIPROT</bp:DB>
    <bp:ID rdf:datatype="xsd:string">A7X8B0</bp:ID>
    </bp:unificationXref>
    </bp:XREF>
    −<bp:XREF>
    −<bp:relationshipXref rdf:ID="CPATH-LOCAL-112389">
    <bp:DB rdf:datatype="xsd:string">GENE_SYMBOL</bp:DB>
    <bp:ID rdf:datatype="xsd:string">PGR</bp:ID>
    </bp:relationshipXref>
    </bp:XREF>
    −<bp:XREF>
    −<bp:relationshipXref rdf:ID="CPATH-LOCAL-112390">
    <bp:DB rdf:datatype="xsd:string">REF_SEQ</bp:DB>
    <bp:ID rdf:datatype="xsd:string">NP_000917</bp:ID>
    </bp:relationshipXref>
    </bp:XREF>
    −<bp:XREF>
    −<bp:unificationXref rdf:ID="CPATH-LOCAL-112391">
    <bp:DB rdf:datatype="xsd:string">UNIPROT</bp:DB>
    <bp:ID rdf:datatype="xsd:string">Q9UPF7</bp:ID>
    </bp:unificationXref>
    </bp:XREF>
    −<bp:XREF>
    −<bp:unificationXref rdf:ID="CPATH-LOCAL-113580">
    <bp:DB rdf:datatype="http://www.w3.org/2001/XMLSchema#string">CPATH</bp:DB>
    <bp:ID rdf:datatype="http://www.w3.org/2001/XMLSchema#string">27885</bp:ID>
    </bp:unificationXref>
    </bp:XREF>
    </bp:protein>

  •  11-13-2009, 4:43 PM 57317 in reply to 57313

    Re: Parsing XML

    Use either XPath or the XML DOM to parse XML file instead of a regular expression. XPath is specifically designed for what you are trying to accomplish.  Regexes are not well suited for this type of task.

    The following XPath expression matches the underlined value in your sample

    /bp:protein/bp:NAME/text()

    NOTE : a namespace had to be added to your root element

     


    Michael

    "In theory, theory and practice are the same. In practice, they are not."
    Albert Einstein
  •  11-19-2009, 4:52 AM 57412 in reply to 57317

    Re: Parsing XML

    But I have to use regular expression because it's for an assignment and part of a larger java program. How would I do this using regular expressions?
    Filed under:
  •  11-19-2009, 6:27 AM 57413 in reply to 57313

    Re: Parsing XML

    I solved the problem by usinga string-replace.

    //remove all other occurrences of the name tag
    String shortInput = input.replaceAll("<bp:ORGANISM>.*?</bp:ORGANISM>", "");
    shortInput = shortInput.replaceAll("<bp:DATABASE>.*?</bp:DATABASE>", "");
    //find the name tag
    nameMatcher = Pattern.compile("<bp:NAME [^>]*>([^<]*)").matcher(shortInput);

View as RSS news feed in XML