Hi, I've got a script that extracts all of the text from a PDF document. Because of the nature of PDF documents, all of the whitespace and formatting disappears, leaving me with the text of a ~100-page PDF all on a single line.
Within that line of text are a lot of little snippets that I want. I have gotten close, but I'm not quite there. From what I can tell, the only thing I reliably know is the start of each string that I want. Here's an example of what I'm looking at:
/mmsi), Sea Tow (http://www.seatow.com/mmsi), or the U.S. Power Squadrons (http://www.usps.org/php/mmsi). Ensure any information originally provided is updated as changes occur. FCC regulations require DSC-equippedradios ?use MMSI assigned by the Commission or its designees? (47 CFR 80.103(b)).Then: Interconnect your radio to a GPS receiver using a two-wire NMEA 0183 interface on all DSC equipped marine radiosand on most GPS receivers. Instructions should be provided in the radio and GPS operator's manual. Further information isprovided and will be routinely updated at http://www.navcen.uscg.gov/?pageName=mtDsc.(USCG)SECTIONINM1/12CHARTCORRECTIONSI-2.1113463Ed.10/10LASTNM49/111/12AddDashed-line circle ?Discol. water? [Ke] (PA)29°0352.0N90°0903.0W(50/11CG8)1135741Ed.5/11LASTNM53/111/12AddDepth 27 feet Obstn [K41] (PA)29°0054N90°3031W(NOS)1135856Ed.7/10LASTNM49/111/12AddDashed-line circle ?Discol. water? [Ke] (PA)29°0352N90°0903W(50/11CG8)1136176Ed.8/11LASTNM53/111/12AddSubmerged well (cov 150ft) [L20]28°5147N89°2413W(See 48/11-11361)(50/11CG8)1137349Ed.9/10LASTNM51/111/12AddTabulation of controlling depths from Subsection I-3(See 42/11-11373)(NOS)1137435Ed.9/09LASTNM51/111/12(Side B)ChangeLegend to ?44 FT?30°2019.6N88°3028.8WLegend to ?29 FT 2002-2011?30°2050.1N88°3355.7WLegend to ?31 FT 2002-2011?30°2110.5N88°3416.7W(See 15, 24/11-11374)AddTabulation of controlling depths from Subsection I-3 (Supersedes 42/11-11374)(NOS)1137537Ed.6/10LASTNM49/111/12ChangeLegend to ?44 FT 2011?30°2021.3N88°3032.3WLegend to ?29 FT 2002-2011?30°2049.7N88°3353.9WLegend to ?31 FT 2002-2011?30°2108.2N88°3406.5W(See 15, 24/11-11375)AddTabulation of controlling depths from Subsection I-3 (Supersedes 42/11-11375)(NOS)1137655Ed.3/11LASTNM53/111/12DeleteDepth 46 feet30°1131N88°0242WDepth 46 feet30°1148N88°0238WDepth 47 feet30°1222N88°0226WDepth 47 feet30°1252N88°0217WDepth 53 feet30°1315N88°0211WDepth 58 feet30°1332N88°0209WDepth 47 feet30°1407N88°0212WAddDashed line (channel limit) [I20] 
between30°0821N88°0349W30°0846N88°0338WDashed line (channel limit) [I20] between30°0819N88°0343W30°0845N88°0332WSolid line (range line) between30°0820N88°0346W30°0845N88°0335WDashed line (channel limit) [I20] joining30°1116N88°0248W30°1335N88°0211W30°1339N88°0211W30°1432N88°0221WDashed line (channel limit) [I20] joining30°1116N88°0240W30°1334N88°0204W30°1340N88°0204W30°1430N88°0214WSolid line (range line) between30°1339N88°0208W30°1431N88°0218W(NOS)960438Ed.12/14/85LASTNM22/111/12AddDanger circle ?Obstn? [K40]42°4357N132°2415E(53(8204)07St.Petersburg)8069671Ed.1/2/10LASTNMN41/11N1/12AddLight Fl G 4s 14m 6M36°1436N129°2300ELight Fl G 5s 12m 6M36°1422N129°2311E(42(637)11Inchon) SECTION INM 1/12I-3.1SECTION INM 1/12I-3.2SECTION INM 1/12I-3.3SECTION INM 1/12I-3.4SECTION INM 1/12I-3.5SECTION INM 1/12I-3.6I-4.1 SECTION I NM 1/12CHARTS AFFECTED BY NOTICE TO MARINERSNM 33/09 THROUGH NM 1/12Note:N indicates Not For Sale; P indicates Preliminary; T indicates Temporary; * indicates New Edition/New Chart; ** indicates Chart CanceledChart No.Ed. No.Notice to Mariners No.10213/11 11233,47/0912133,34,45,47/09 13133,34,45,47/092037,36/1021446,48/0
From the example above, I would like the following:
1113463Ed.10/10LASTNM49/111/12AddDashed-line circle ?Discol. water? [Ke] (PA)29°0352.0N90°0903.0W(50/11CG8)1135741Ed.5/11LASTNM53/111/12AddDepth 27 feet Obstn [K41] (PA)29°0054N90°3031W(NOS)1135856Ed.7/10LASTNM49/111/12AddDashed-line circle ?Discol. water? [Ke] (PA)29°0352N90°0903W(50/11CG8)
1136176Ed.8/11LASTNM53/111/12AddSubmerged well (cov 150ft) [L20]28°5147N89°2413W(See 48/11-11361)(50/11CG8)1137349Ed.9/10LASTNM51/111/12AddTabulation of controlling depths from Subsection I-3(See 42/11-11373)(NOS)
1137435Ed.9/09LASTNM51/111/12(Side B)ChangeLegend to ?44 FT?30°2019.6N88°3028.8WLegend to ?29 FT 2002-2011?30°2050.1N88°3355.7WLegend to ?31 FT 2002-2011?30°2110.5N88°3416.7W(See 15, 24/11-11374)AddTabulation of controlling depths from Subsection I-3 (Supersedes 42/11-11374)(NOS)
1137537Ed.6/10LASTNM49/111/12ChangeLegend to ?44 FT 2011?30°2021.3N88°3032.3WLegend to ?29 FT 2002-2011?30°2049.7N88°3353.9WLegend to ?31 FT 2002-2011?30°2108.2N88°3406.5W(See 15, 24/11-11375)AddTabulation of controlling depths from Subsection I-3 (Supersedes 42/11-11375)(NOS)
1137655Ed.3/11LASTNM53/111/12DeleteDepth 46 feet30°1131N88°0242WDepth 46 feet30°1148N88°0238WDepth 47 feet30°1222N88°0226WDepth 47 feet30°1252N88°0217WDepth 53 feet30°1315N88°0211WDepth 58 feet30°1332N88°0209WDepth 47 feet30°1407N88°0212WAddDashed line (channel limit) [I20] between30°0821N88°0349W30°0846N88°0338WDashed line (channel limit) [I20] between30°0819N88°0343W30°0845N88°0332WSolid line (range line) between30°0820N88°0346W30°0845N88°0335WDashed line (channel limit) [I20] joining30°1116N88°0248W30°1335N88°0211W30°1339N88°0211W30°1432N88°0221WDashed line (channel limit) [I20] joining30°1116N88°0240W30°1334N88°0204W30°1340N88°0204W30°1430N88°0214WSolid line (range line) between30°1339N88°0208W30°1431N88°0218W(NOS)960438Ed.12/14/85LASTNM22/111/12AddDanger circle ?Obstn? [K40]42°4357N132°2415E(53(8204)07St.Petersburg)
8069671Ed.1/2/10LASTNMN41/11N1/12AddLight Fl G 4s 14m 6M36°1436N129°2300ELight Fl G 5s 12m 6M36°1422N129°2311E(42(637)11Inchon)
Essentially, all of the strings I want share a number of features:
-Start with a 5-6 digit number
-Followed by the phrase "Ed."
-Followed by a 1-2 digit number
-Followed by (potentially) a bunch of text with any number of special characters, brackets, etc.
-Followed by (at least) one coordinate, which can be found with (\d){1,2}.?[^\/,\-](\d){1,4}\.?\d{1,2}?(N|S)(\d){1,3}.?[^\/,\-](\d){1,4}\.?\d{1,2}?(W|E)
-And, what's causing my problems, the ending: text (which is not predictable) nested in an unknown number of brackets
-Also of note: once we find the start of the first string, all of the strings follow immediately after each other. This may be useful, but it troubles me, since if I can't specify the end of a string, my regex will not catch the last string in the text dump.
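One way that last observation might be put to use is to stop looking for the end altogether and cut the text at every record start instead. Below is a minimal sketch in Python, not a tested solution for the whole document. The start pattern (6-8 digits, then "Ed.", then a date such as 10/10 or 12/14/85) is an assumption read off the sample above, and `RECORD_START`, `split_records` and `sample` are names made up for illustration.

```python
import re

# Assumed record start: 6-8 digits (the leading number run), then "Ed."
# and a date such as 10/10 or 12/14/85. Adjust if real data differs.
RECORD_START = re.compile(r"\d{6,8}Ed\.\d{1,2}/\d{1,2}(?:/\d{1,2})?")

def split_records(text):
    """Slice text into chunks that each begin at a record start."""
    starts = [m.start() for m in RECORD_START.finditer(text)]
    # Each record ends where the next one begins; the last runs to the end.
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]

# Shortened sample built from fragments of the text dump above.
sample = (
    "updated at http://www.navcen.uscg.gov/?pageName=mtDsc.(USCG)"
    "1135741Ed.5/11LASTNM53/111/12AddDepth 27 feet Obstn [K41] (PA)"
    "29°0054N90°3031W(NOS)"
    "1135856Ed.7/10LASTNM49/111/12AddDashed-line circle ?Discol. water? "
    "[Ke] (PA)29°0352N90°0903W(50/11CG8)"
    "960438Ed.12/14/85LASTNM22/111/12AddDanger circle ?Obstn? [K40]"
    "42°4357N132°2415E(53(8204)07St.Petersburg)"
)

records = split_records(sample)
for record in records:
    print(record)
```

With this approach every record except the last is delimited by the start of its successor, so the unpredictable brackets no longer matter for those. The last record still runs to the end of the input, so any text that follows it would have to be trimmed separately.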
The closest I have been able to get is using the following:
\d{5,6}(\d{1,2})Ed\..*?(\d{1,4}(\d){1,2}.?[^\/,\-](\d){1,4}\.?\d{1,2}?(N|S)(\d){1,3}.?[^\/,\-](\d){1,4}\.?\d{1,2}?(W|E)).*?\)\b
But ultimately this fails because I can't predict how brackets are used in the text, and I was trying to use them to identify the end of the string.
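Since Python's `re` module has no recursive patterns, nested brackets like (53(8204)07St.Petersburg) may be easier to handle with a plain depth counter than with a regex. The following is only a sketch under a strong assumption: that the bracketed ending begins right at a known position (for example, directly after the final coordinate of the last string). The helper name `consume_bracket_groups` is hypothetical.

```python
def consume_bracket_groups(text, pos):
    """Return the index just past any run of balanced (...) groups at pos."""
    while pos < len(text) and text[pos] == "(":
        depth = 0
        i = pos
        while i < len(text):
            if text[i] == "(":
                depth += 1
            elif text[i] == ")":
                depth -= 1
                if depth == 0:
                    break  # found the close of the outermost bracket
            i += 1
        else:
            # Ran off the end without balancing: stop before this group.
            return pos
        pos = i + 1  # continue with the next group, if one follows directly
    return pos

# Tail of a record followed by unrelated trailing text.
tail = "(53(8204)07St.Petersburg) SECTION INM 1/12"
end = consume_bracket_groups(tail, 0)
trimmed = tail[:end]
print(trimmed)
```

This does not cover endings where plain text sits between the coordinate and the brackets, as in some of the examples above, so it would only help for trimming a string whose brackets follow the coordinate directly.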
Can anyone offer any suggestions?
Thanks for reading!