regex - Single line delimited text
I have a single line of text containing an invoice's rows, extracted from a PDF.
I need to parse it and get each row as an output token.
Here's a snippet:
1 nr 0pr 18ov dho1o154 occhiale acetato donna vista 1 nr 0pr 18ov nag1o152 occhiale acetato donna vista 1 nr 0pr 61qv 7ax1o156 occhiale metallo uomo vista descrizione causale vendita 2 nr 0an4007 41 / 87 66 occhiale nylon uomo sole descrizione causale vendita 1 nr 0ea4001 50638g56 valeria occhiale nylon uomo sole descrizione causale vendita - pag 1 di 3 - segue - 1 nr 0po3042s 972 / m351 sofia occhiale acetato uomo sole descrizione causale vendita 1 nr 0an3048 502 / 8g30 valeria occhiale metallo uomo sole descrizione causale vendita 6 nr 0dg4204 27648764 occhiale acetato uomo sole descrizione causale vendita 1 nr 0ox3123 31230453 valeria occhiale acciaio uomo vista
I want to get each row as a token. For example, the first one:
1 nr 0pr 18ov dho1o154 occhiale acetato donna vista
To explain, a token should be:
- starting with: integer + space + nr + space
- containing whatever comes after the start: strings, numbers, anything
- ending before: the next "x+nr" starting token, a fixed string (such as "descrizione causale vendita"), or the end of the file.
Using the regex (\b\d+\b nr)
I can match the "x+nr" starting tokens, but how can I select the part that follows, up to the next "x+nr" token?
Note the title! Everything is on one single line, so there are no newline separators!
Thank you
Building on the regex you have so far, you can use a positive lookahead:
(?:\b\d+\b nr).*?(?=\b\d+\b nr|$)
(?= ... ) is a positive lookahead, which does not count as part of the match. Therefore, the lazy .*? matches up to, but not including, the next \b\d+\b nr or the end of the string $.
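As a rough illustration, here is how the lookahead pattern might be applied in Python (the question does not say which language or regex engine is in use, so Python's re module is an assumption). Note that the pattern above only stops at the next "x+nr" token or the end of the string, so a fixed string such as "descrizione causale vendita" ends up inside the match. The second pattern, `extended`, is a hypothetical variation that adds the fixed strings to the lookahead; the names `text`, `stops`, `basic` and `extended` are made up for this sketch, and `text` is a shortened copy of the snippet from the question.

```python
import re

# Shortened copy of the sample line from the question.
text = (
    "1 nr 0pr 18ov dho1o154 occhiale acetato donna vista "
    "1 nr 0pr 18ov nag1o152 occhiale acetato donna vista "
    "1 nr 0pr 61qv 7ax1o156 occhiale metallo uomo vista descrizione causale vendita "
    "2 nr 0an4007 41 / 87 66 occhiale nylon uomo sole descrizione causale vendita "
    "1 nr 0ea4001 50638g56 valeria occhiale nylon uomo sole"
)

# Pattern from the answer: start token, then a lazy match up to the
# next start token or the end of the string.
basic = re.compile(r"\b\d+\b nr.*?(?=\b\d+\b nr|$)")
basic_rows = [m.strip() for m in basic.findall(text)]

# Hypothetical extension: also stop before any fixed delimiter string.
# The leading \s* in the lookahead keeps trailing spaces out of the match.
stops = ["descrizione causale vendita"]
extended = re.compile(
    r"\b\d+ nr .*?(?=\s*(?:\b\d+ nr |"
    + "|".join(re.escape(s) for s in stops)
    + r"|$))"
)
rows = extended.findall(text)

for row in rows:
    print(row)
```

With the extended pattern, each printed row starts at its "x nr" token and ends just before the next token or fixed string. Other fixed strings can be added to `stops` as needed. Either pattern can still split a row early if the text " nr" happens to follow a standalone number inside a description.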