Python - regex to complete hyphens in standard filename format
This program goes through a directory and fixes (if possible) filenames to a specific format of whitespace, hyphens, etc. The method regexsubfixgrouping() changes improper whitespace found in filenames into proper whitespace. The method checkproper() shows the format needed.
proper format:
201308 - (82608) - mac 2233-007-methods of calculus - klingler, lee.pdf
Everything works pretty well, except that the regex should also insert any of the first 4 hyphens that may be missing. I'm not overly concerned about the other hyphens at this point, maybe down the road. Mainly, I want it to insert any of the first 4 missing hyphens (and maintain its current functionality of correcting whitespace, etc.).
methods:
import os
import re

def readdir(path1):
    # list only regular files in the directory
    return [f for f in os.listdir(path1) if os.path.isfile(os.path.join(path1, f))]

def checkproper(f, term):
    # note: this pattern ends in .txt as posted, while the sample filenames end in .pdf
    return re.match(term + r'\s-\s\(\d{5}\)\s-\s\w{3}\s\d{4}\w?-\d{3}-[^\.]+\s-\s[^\.]+\.txt', f)

def regexsubfixgrouping(f, term):
    """ improved version of regexsubfix(). corrects improper whitespace in filename """
    return re.sub(
        term + r'\s*-\s*(\(\d{5}\))\s*-\s*(\w{3}\s\d{4}\w?-\d{3}\s*-\s*(?:[^.\s]|\b\s\b)+)\s*-\s*([^.]+\.pdf)$',
        lambda match: term + ' - {0} - {1} - {2}'.format(
            match.group(1),
            re.sub(r'\s*-\s*', '-', match.group(2)),
            match.group(3)),
        f)

def properfiles(dir1, term, path1):
    """ main functionality. goes through the list of files in a directory, separates good from bad, and fixes what it can. """
    goodmatch = []; stillwrong = []; goodfix = []  #; fixed = ""
    for f in dir1:
        result = checkproper(f, term)
        if result:
            goodmatch.append(result.group(0))
        else:
            fixed = regexsubfixgrouping(f, term)
            #print "^^^^^^ ",fixed
            if checkproper(fixed, term):
                os.rename(path1+'\\'+f, path1+'\\'+fixed)
                goodfix.append(fixed)
            else:
                # still not in proper format: prefix with '@ ' so the user can spot it
                os.rename(path1+'\\'+f, path1+'\\'+'@ '+fixed)
                stillwrong.append(fixed)
    goodtogo = len(goodmatch)+len(goodfix)
    total = len(dir1)
    successrate = (goodtogo/(float(total)))*100.0
    print("%d total files. %d files in proper format. %0.2f%% success rate." % (total, goodtogo, successrate))
    print("all files not in proper format are prepended with @ to be marked for the user.")
    return goodmatch, goodfix, stillwrong
So it should be able to fix filenames with these (missing hyphen) errors:
201308 - (82431) - mac 1105-006 college algebra - graziose, james.pdf
201308 - (82610) - mac 2233-009 methods of calculus - grigoriev, stepan.pdf
And errors where the 3 capital letters after the 2nd hyphen don't have a space before the 4 integers following them:
201308 - (91500) - mac1105-014 - college algebra - radulovic, aibeng.pdf
If possible, I'd like to adjust the regexsubfixgrouping() method rather than use system resources running more regexes than necessary. I'm teaching myself Python, so I'm sure even a junior programmer could do this; if a pro happens on this question, they could probably straighten it out easily.
edit: remaining outliers:
201308 - (82442) - mac 1105 - 012 - college algebra - harmon, drake.pdf
201308 - (92835) - mac 1105 - 017 - college algebra - harmon, drake.pdf
201308 - (95125) - mac1147-004 - precaclculus algebra & trig - greenberg, alisa.pdf
201308 - (82600) - mac1147-002 - precaclculus algebra & trig - greenberg, alisa.pdf
For the first 2, I'm not sure why it didn't catch them; they seem fixable. For the second 2, I'm not sure why it didn't separate mac and 1147 with a space.
You can edit the second re.sub in the function, although you'll also have to edit the first re.sub to accommodate the change:
return re.sub(
    term + r'\s*-\s*(\(\d{5}\))\s*-\s*(\w{3}\s?\d{4}\w?-?\d{3}\s*-?\s*(?:[^.\s]|\b\s\b)+)\s*-\s*([^.]+\.pdf)$',
    lambda match: term + ' - {0} - {1} - {2}'.format(
        match.group(1),
        re.sub(r'(\w{3})\s?(\d{4}\w?)\s*-?\s*(\d{3})\s*-?\s*(.*)', r'\1 \2-\3-\4', match.group(2)),
        match.group(3)),
    f)
The second re.sub parses the 'middle part' from scratch.
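As a rough illustration of what that inner substitution does on its own, here is a small sketch that applies it to middle parts taken from the question's sample filenames. The names MIDDLE and normalize_middle are hypothetical and used only for this example; the pattern and replacement string are the ones from the answer's second re.sub.

```python
import re

# Inner pattern from the answer: 3 word characters, 4 digits (plus an
# optional trailing letter), 3 digits, then the course title, with
# optional whitespace and hyphens in between.
MIDDLE = r'(\w{3})\s?(\d{4}\w?)\s*-?\s*(\d{3})\s*-?\s*(.*)'


def normalize_middle(middle):
    """Rebuild the middle part as 'abc 1234-567-title' (hypothetical helper)."""
    return re.sub(MIDDLE, r'\1 \2-\3-\4', middle)


for sample in ['mac 1105-006 college algebra',
               'mac1105-014 - college algebra']:
    print(normalize_middle(sample))
```

Because every separator is rebuilt from the captured groups, a middle part that is already in the proper form comes back unchanged.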
I don't know how this will affect the previous file names you have, though, since I added more flexibility to the regex so that it accepts the 'wrong formats'.
Edit: I didn't consider the " & ", and I forgot to allow for spaces around the third hyphen. Use this regex for the first re.sub:
\s*-\s*(\(\d{5}\))\s*-\s*(\w{3}\s*\d{4}\w?\s*-?\s*\d{3}\s*-?\s*(?:[^.\s]|\b\s\b|\s&\s)+)\s*-\s*([^.]+\.pdf)$
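Putting the pieces together, a minimal sketch of the adjusted function might look like the following. It assumes the regex above is used for the first re.sub and that the second re.sub from the answer stays as it is. The names TAIL, MIDDLE and fix_filename are hypothetical, and re.escape(term) is an extra precaution that is not in the original code (it makes no difference for a term like 201308).

```python
import re

# Outer pattern from the edit above (everything after the term).
TAIL = (r'\s*-\s*(\(\d{5}\))\s*-\s*'
        r'(\w{3}\s*\d{4}\w?\s*-?\s*\d{3}\s*-?\s*(?:[^.\s]|\b\s\b|\s&\s)+)'
        r'\s*-\s*([^.]+\.pdf)$')

# Inner pattern from the answer, used to rebuild the middle part.
MIDDLE = r'(\w{3})\s?(\d{4}\w?)\s*-?\s*(\d{3})\s*-?\s*(.*)'


def fix_filename(f, term):
    """Return f with whitespace and hyphens normalized (hypothetical name)."""
    def rebuild(match):
        # Re-parse the middle part and reinsert the space and hyphens.
        middle = re.sub(MIDDLE, r'\1 \2-\3-\4', match.group(2))
        return '{0} - {1} - {2} - {3}'.format(
            term, match.group(1), middle, match.group(3))
    # re.escape is an assumption added here; the original concatenates term directly.
    return re.sub(re.escape(term) + TAIL, rebuild, f)


samples = [
    '201308 - (82431) - mac 1105-006 college algebra - graziose, james.pdf',
    '201308 - (91500) - mac1105-014 - college algebra - radulovic, aibeng.pdf',
    '201308 - (82442) - mac 1105 - 012 - college algebra - harmon, drake.pdf',
    '201308 - (95125) - mac1147-004 - precaclculus algebra & trig - greenberg, alisa.pdf',
]
for name in samples:
    print(fix_filename(name, '201308'))
```

A filename that does not match the outer pattern at all is returned unchanged, so the existing checkproper() step is still needed to decide whether a file ends up fixed or gets marked with @.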