
Pulling strings from a single (massive) line of text - can't identify end of strings

Last post 02-27-2012, 5:15 PM by Aussie Susan. 3 replies.
  •  02-23-2012, 10:35 AM 84698

    Pulling strings from a single (massive) line of text - can't identify end of strings

    Hi,  I've got a script that extracts all of the text from a pdf document.  Because of the nature of pdf documents, all of the whitespace and formatting disappears, leaving me with a ~100 page pdf all on a single line of text.

     Within that line of text are a lot of little snippets that I want.  I have gotten so close to what I want, but I'm not quite there.  From what I can tell, I only reliably know the start of the string that I want.  Here's an example of what I'm looking at:

     /mmsi), Sea Tow (http://www.seatow.com/mmsi), or the U.S. Power Squadrons (http://www.usps.org/php/mmsi). Ensure any information originally provided is updated as changes occur. FCC regulations require DSC-equippedradios ?use MMSI assigned by the Commission or its designees? (47 CFR 80.103(b)).Then: Interconnect your radio to a GPS receiver using a two-wire NMEA 0183 interface on all DSC equipped marine radiosand on most GPS receivers. Instructions should be provided in the radio and GPS operator's manual. Further information isprovided and will be routinely updated at http://www.navcen.uscg.gov/?pageName=mtDsc.(USCG)SECTIONINM1/12CHARTCORRECTIONSI-2.1113463Ed.10/10LASTNM49/111/12AddDashed-line circle ?Discol. water? [Ke] (PA)29°0352.0N90°0903.0W(50/11CG8)1135741Ed.5/11LASTNM53/111/12AddDepth 27 feet Obstn [K41] (PA)29°0054N90°3031W(NOS)1135856Ed.7/10LASTNM49/111/12AddDashed-line circle ?Discol. water? [Ke] (PA)29°0352N90°0903W(50/11CG8)1136176Ed.8/11LASTNM53/111/12AddSubmerged well (cov 150ft) [L20]28°5147N89°2413W(See 48/11-11361)(50/11CG8)1137349Ed.9/10LASTNM51/111/12AddTabulation of controlling depths from Subsection I-3(See 42/11-11373)(NOS)1137435Ed.9/09LASTNM51/111/12(Side B)ChangeLegend to ?44 FT?30°2019.6N88°3028.8WLegend to ?29 FT 2002-2011?30°2050.1N88°3355.7WLegend to ?31 FT 2002-2011?30°2110.5N88°3416.7W(See 15, 24/11-11374)AddTabulation of controlling depths from Subsection I-3 (Supersedes 42/11-11374)(NOS)1137537Ed.6/10LASTNM49/111/12ChangeLegend to ?44 FT 2011?30°2021.3N88°3032.3WLegend to ?29 FT 2002-2011?30°2049.7N88°3353.9WLegend to ?31 FT 2002-2011?30°2108.2N88°3406.5W(See 15, 24/11-11375)AddTabulation of controlling depths from Subsection I-3 (Supersedes 42/11-11375)(NOS)1137655Ed.3/11LASTNM53/111/12DeleteDepth 46 feet30°1131N88°0242WDepth 46 feet30°1148N88°0238WDepth 47 feet30°1222N88°0226WDepth 47 feet30°1252N88°0217WDepth 53 feet30°1315N88°0211WDepth 58 feet30°1332N88°0209WDepth 47 feet30°1407N88°0212WAddDashed line (channel limit) 
[I20] between30°0821N88°0349W30°0846N88°0338WDashed line (channel limit) [I20] between30°0819N88°0343W30°0845N88°0332WSolid line (range line) between30°0820N88°0346W30°0845N88°0335WDashed line (channel limit) [I20] joining30°1116N88°0248W30°1335N88°0211W30°1339N88°0211W30°1432N88°0221WDashed line (channel limit) [I20] joining30°1116N88°0240W30°1334N88°0204W30°1340N88°0204W30°1430N88°0214WSolid line (range line) between30°1339N88°0208W30°1431N88°0218W(NOS)960438Ed.12/14/85LASTNM22/111/12AddDanger circle ?Obstn? [K40]42°4357N132°2415E(53(8204)07St.Petersburg)8069671Ed.1/2/10LASTNMN41/11N1/12AddLight Fl G 4s 14m 6M36°1436N129°2300ELight Fl G 5s 12m 6M36°1422N129°2311E(42(637)11Inchon) SECTION INM 1/12I-3.1SECTION INM 1/12I-3.2SECTION INM 1/12I-3.3SECTION INM 1/12I-3.4SECTION INM 1/12I-3.5SECTION INM 1/12I-3.6I-4.1 SECTION I NM 1/12CHARTS AFFECTED BY NOTICE TO MARINERSNM 33/09 THROUGH NM 1/12Note:N indicates Not For Sale; P indicates Preliminary; T indicates Temporary; * indicates New Edition/New Chart; ** indicates Chart CanceledChart No.Ed. No.Notice to Mariners No.10213/11 11233,47/0912133,34,45,47/09 13133,34,45,47/092037,36/1021446,48/0

     From the example above, I would like the following:

     1113463Ed.10/10LASTNM49/111/12AddDashed-line circle ?Discol. water? [Ke] (PA)29°0352.0N90°0903.0W(50/11CG8)1135741Ed.5/11LASTNM53/111/12AddDepth 27 feet Obstn [K41] (PA)29°0054N90°3031W(NOS)1135856Ed.7/10LASTNM49/111/12AddDashed-line circle ?Discol. water? [Ke] (PA)29°0352N90°0903W(50/11CG8)

    1136176Ed.8/11LASTNM53/111/12AddSubmerged well (cov 150ft) [L20]28°5147N89°2413W(See 48/11-11361)(50/11CG8)1137349Ed.9/10LASTNM51/111/12AddTabulation of controlling depths from Subsection I-3(See 42/11-11373)(NOS)

    1137435Ed.9/09LASTNM51/111/12(Side B)ChangeLegend to ?44 FT?30°2019.6N88°3028.8WLegend to ?29 FT 2002-2011?30°2050.1N88°3355.7WLegend to ?31 FT 2002-2011?30°2110.5N88°3416.7W(See 15, 24/11-11374)AddTabulation of controlling depths from Subsection I-3 (Supersedes 42/11-11374)(NOS)

    1137537Ed.6/10LASTNM49/111/12ChangeLegend to ?44 FT 2011?30°2021.3N88°3032.3WLegend to ?29 FT 2002-2011?30°2049.7N88°3353.9WLegend to ?31 FT 2002-2011?30°2108.2N88°3406.5W(See 15, 24/11-11375)AddTabulation of controlling depths from Subsection I-3 (Supersedes 42/11-11375)(NOS)

    1137655Ed.3/11LASTNM53/111/12DeleteDepth 46 feet30°1131N88°0242WDepth 46 feet30°1148N88°0238WDepth 47 feet30°1222N88°0226WDepth 47 feet30°1252N88°0217WDepth 53 feet30°1315N88°0211WDepth 58 feet30°1332N88°0209WDepth 47 feet30°1407N88°0212WAddDashed line (channel limit) [I20] between30°0821N88°0349W30°0846N88°0338WDashed line (channel limit) [I20] between30°0819N88°0343W30°0845N88°0332WSolid line (range line) between30°0820N88°0346W30°0845N88°0335WDashed line (channel limit) [I20] joining30°1116N88°0248W30°1335N88°0211W30°1339N88°0211W30°1432N88°0221WDashed line (channel limit) [I20] joining30°1116N88°0240W30°1334N88°0204W30°1340N88°0204W30°1430N88°0214WSolid line (range line) between30°1339N88°0208W30°1431N88°0218W(NOS)960438Ed.12/14/85LASTNM22/111/12AddDanger circle ?Obstn? [K40]42°4357N132°2415E(53(8204)07St.Petersburg)

    8069671Ed.1/2/10LASTNMN41/11N1/12AddLight Fl G 4s 14m 6M36°1436N129°2300ELight Fl G 5s 12m 6M36°1422N129°2311E(42(637)11Inchon)

     Essentially, there are a bunch of individual items that all strings share.  These include: 

    -Start with a 5-6 digit number

    -Followed by the phrase "Ed."

    -Followed by a 1-2 digit number

    -Followed by (potentially) a bunch of text with any number of special characters, brackets etc., 

    -Followed by (at least) one coordinate, which can be found with  (\d){1,2}.?[^\/,\-](\d){1,4}\.?\d{1,2}?(N|S)(\d){1,3}.?[^\/,\-](\d){1,4}\.?\d{1,2}?(W|E)

    - and what's causing my problems is the ending, which is text (which is not predictable) that is nestled in an unknown number of brackets

    - Also of note is the fact that once we find the start of the first string, all of the strings will be found immediately after each other.  This may be useful, but it also troubles me: if I can't specify the end of a string, my regex will not catch the last string in the text dump.
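
    For illustration, the coordinate item in the list above can be sketched with a simpler pattern in Python. This is only a sketch: it assumes the degree sign survives the PDF extraction as "°" and that minutes and seconds are always four digits with an optional decimal part, as in the sample. COORD and the sample string are names made up for this example.

```python
import re

# Assumed shape: 1-2 digit latitude degrees, "°", 4 digits (optionally ".d"), N or S,
# then 1-3 digit longitude degrees, "°", 4 digits (optionally ".d"), E or W.
COORD = re.compile(r"\d{1,2}°\d{4}(?:\.\d+)?[NS]\d{1,3}°\d{4}(?:\.\d+)?[EW]")

sample = (
    "AddDashed-line circle [Ke] (PA)29°0352.0N90°0903.0W(50/11CG8)"
    "AddDanger circle [K40]42°4357N132°2415E(53(8204)07St.Petersburg)"
)

# findall returns whole matches because the pattern has no capturing groups.
coords = COORD.findall(sample)
```

    On this sample, coords holds the two coordinate strings in the order they appear.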

    The closest  I have been able to get is using the following:

    \d{5,6}(\d{1,2})Ed\..*?(\d{1,4}(\d){1,2}.?[^\/,\-](\d){1,4}\.?\d{1,2}?(N|S)(\d){1,3}.?[^\/,\-](\d){1,4}\.?\d{1,2}?(W|E)).*?\)\b

    But ultimately this fails because I can't predict how brackets are used in the text, and I was trying to use them to identify the end of the string.

    Can anyone offer any suggestions?
     
    Thanks for reading! 
  •  02-23-2012, 4:55 PM 84700 in reply to 84698

    Re: Pulling strings from a single (massive) line of text - can't identify end of strings

    You are facing the classic "balanced parentheses" problem (literally in this case, as you are dealing with the parenthesis characters; the problem actually applies to any situation where you have nested opening and closing character sequences).

    The underlying issue is that a regex cannot count. There is no way of saying "if you have seen X of this sequence, then look for X of that sequence to follow". However, there are 2 (and shortly 3) regex variants with extensions that let you do this: the PCRE library (and languages such as PHP that use it), the .NET regex engine, and (in the future) Perl 6.

    You don't say which regex variant you are using, but if it is not one of these then you should either give up or move to one of them. Unfortunately, the syntax these regex variants use to provide this functionality is not the same.
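
    To make the "counting" concrete, here is a minimal sketch in Python of what a classic regex cannot do by itself: track nesting depth and stop at the parenthesis that balances the opening one. It is plain code rather than a regex extension, and match_paren_end is a hypothetical helper name, not part of any library.

```python
def match_paren_end(text, open_pos):
    """Return the index just past the ')' that balances the '(' at open_pos, or -1."""
    if text[open_pos] != "(":
        raise ValueError("open_pos must point at '('")
    depth = 0
    for i in range(open_pos, len(text)):
        if text[i] == "(":
            depth += 1          # one level deeper
        elif text[i] == ")":
            depth -= 1          # one level back out
            if depth == 0:      # the opening parenthesis is now balanced
                return i + 1
    return -1                   # ran out of text before balancing

tail = "(42(637)11Inchon) SECTION INM 1/12"
end = match_paren_end(tail, 0)
balanced = tail[:end]
```

    Here balanced is the nested group "(42(637)11Inchon)", however many levels deep the brackets go.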

    Susan

  •  02-24-2012, 7:08 AM 84702 in reply to 84700

    Re: Pulling strings from a single (massive) line of text - can't identify end of strings

    Hi Susan,

    Thanks for your response.  I'm still a regex novice, so I appreciate your comment and advice.  I'm not sure which regex variant I'm using, other than that it's the one you get in Python when you import the "re" package.

    To prevent anyone else from spending time trying to solve this, I'll throw a note in here that since I am using this in a program, I was able to determine the starting positions of each of the strings, and then create a loop that grabbed all of the characters between them -- something like string1 = text[start1:start2] (a Python slice already stops one character before start2).

    Getting the final string required a separate bit of logic, but I seem to be getting what I want.
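
    For anyone who finds this later, the approach described above can be sketched roughly like this. It is a reconstruction under assumptions, not the exact script: START is a guessed start marker (6-8 digits, i.e. chart number plus edition number, followed by "Ed."), and split_records and sample are names made up for the example.

```python
import re

# Assumed start marker: 6-8 digits (chart number plus edition number), then "Ed."
START = re.compile(r"\d{6,8}Ed\.")

def split_records(text):
    """Cut text into records, each running from one start marker to the next."""
    starts = [m.start() for m in START.finditer(text)]
    # Pair each start with the next one; the last record runs to the end of the text.
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]

sample = (
    "mtDsc.(USCG)SECTIONINM1/12CHARTCORRECTIONSI-2."
    "1135741Ed.5/11LASTNM53/111/12AddDepth 27 feet Obstn [K41] (PA)29°0054N90°3031W(NOS)"
    "1137349Ed.9/10LASTNM51/111/12AddTabulation of controlling depths "
    "from Subsection I-3(See 42/11-11373)(NOS)"
    "8069671Ed.1/2/10LASTNMN41/11N1/12AddLight Fl G 4s 14m 6M36°1436N129°2300E(42(637)11Inchon)"
)
records = split_records(sample)
```

    In the real text dump the final record would also swallow whatever follows it, which is why the last string still needs its own trimming logic.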

    Thanks! 

  •  02-27-2012, 5:15 PM 84712 in reply to 84702

    Re: Pulling strings from a single (massive) line of text - can't identify end of strings

    A quick look turns up references on the Internet to an alternative to the "re" package that is available for Python 3 and introduces recursive patterns using the same syntax as PCRE. If you have a suitable Python setup, then what you are trying to do should be possible.

    However, if you are already achieving your desired outcome, then I'd leave it alone.

    Susan
