Python - regex to complete hyphens in standard filename format


This program goes through a directory and fixes (if possible) filenames so that they follow a specific format of whitespace, hyphens, etc. The method regexsubfixgrouping() changes improper whitespace found in filenames into proper whitespace. The method checkproper() shows the format needed.

Proper format:

201308 - (82608) - mac 2233-007-methods of calculus - klingler, lee.pdf

Everything works pretty well, except that the regex should also insert any of the first 4 hyphens that may be missing. I'm not overly concerned with the later hyphens at this point, maybe down the road. Mainly, I want it to insert any of the first 4 missing hyphens (and keep its current functionality of correcting whitespace, etc.).

Methods:

import os
import re

def readdir(path1):
    # keep only regular files, skipping subdirectories
    return [f for f in os.listdir(path1) if os.path.isfile(os.path.join(path1, f))]

def checkproper(f, term):
    # the extension is .pdf here so that the check agrees with the sample filenames
    return re.match(term + r'\s-\s\(\d{5}\)\s-\s\w{3}\s\d{4}\w?-\d{3}-[^\.]+\s-\s[^\.]+\.pdf', f)

def regexsubfixgrouping(f, term):
    """ improved version of regexsubfix(). corrects improper whitespace in filename """
    return re.sub(term + r'\s*-\s*(\(\d{5}\))\s*-\s*(\w{3}\s\d{4}\w?-\d{3}\s*-\s*(?:[^.\s]|\b\s\b)+)\s*-\s*([^.]+\.pdf)$',
                  lambda match: term + ' - {0} - {1} - {2}'.format(
                      match.group(1),
                      re.sub(r'\s*-\s*', '-', match.group(2)),
                      match.group(3)),
                  f)

def properfiles(dir1, term, path1):
    """ main functionality. goes through the list of files in the directory, separates the good from the bad, and fixes what it can. """
    goodmatch = []; stillwrong = []; goodfix = []
    for f in dir1:
        result = checkproper(f, term)
        if result:
            goodmatch.append(result.group(0))
        else:
            fixed = regexsubfixgrouping(f, term)
            # print("^^^^^^   ", fixed)
            if checkproper(fixed, term):
                os.rename(os.path.join(path1, f), os.path.join(path1, fixed)); goodfix.append(fixed)
            else:
                # names that still fail the check get an '@ ' prefix
                os.rename(os.path.join(path1, f), os.path.join(path1, '@ ' + fixed)); stillwrong.append(fixed)
    goodtogo = len(goodmatch) + len(goodfix); total = len(dir1); successrate = (goodtogo / float(total)) * 100.0
    print("%d total files. %d files in proper format. %0.2f%% success rate." % (total, goodtogo, successrate))
    print("All files not in proper format are prefixed with '@' to mark them for the user.")
    return goodmatch, goodfix, stillwrong

So it should be able to fix filenames with these (missing hyphen) errors:

201308 - (82431) - mac 1105-006 college algebra - graziose, james.pdf
201308 - (82610) - mac 2233-009 methods of calculus - grigoriev, stepan.pdf

It should also fix errors where the 3 capital letters after the 2nd hyphen don't have a space before the 4 digits that follow them:

201308 - (91500) - mac1105-014 - college algebra - radulovic, aibeng.pdf
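
A minimal way to reproduce the problem, assuming the pattern from regexsubfixgrouping() is copied unchanged. The helper name original_fix and the samples list are hypothetical and exist only for this illustration. Because the pattern insists on a space after the 3 letters and on a hyphen after the 3-digit section number, re.sub finds no match and returns these names untouched:

```python
import re

# Pattern copied from regexsubfixgrouping() in the question
# (the term prefix is added at call time).
ORIGINAL = (r'\s*-\s*(\(\d{5}\))\s*-\s*'
            r'(\w{3}\s\d{4}\w?-\d{3}\s*-\s*(?:[^.\s]|\b\s\b)+)'
            r'\s*-\s*([^.]+\.pdf)$')


def original_fix(name, term):
    """Hypothetical stand-in for the question's regexsubfixgrouping()."""
    def rebuild(match):
        # collapse whitespace around hyphens inside the middle part
        middle = re.sub(r'\s*-\s*', '-', match.group(2))
        return '{0} - {1} - {2} - {3}'.format(term, match.group(1), middle, match.group(3))
    return re.sub(term + ORIGINAL, rebuild, name)


samples = [
    '201308 - (82431) - mac 1105-006 college algebra - graziose, james.pdf',
    '201308 - (82610) - mac 2233-009 methods of calculus - grigoriev, stepan.pdf',
    '201308 - (91500) - mac1105-014 - college algebra - radulovic, aibeng.pdf',
]

for name in samples:
    # When the pattern does not match, re.sub returns its input unchanged.
    print(original_fix(name, '201308') == name)
```

Names whose only problem is whitespace still go through the pattern, which is why the existing behaviour looks fine on most of the directory.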

If possible, I'd like to adjust the regexsubfixgrouping() method rather than use system resources running more regexes than necessary. I'm teaching myself Python, so I'm sure a junior programmer could do this; if a pro happens upon this question, they could probably straighten it out easily.

Edit: remaining outliers:

201308 - (82442) - mac 1105 - 012 - college algebra - harmon, drake.pdf
201308 - (92835) - mac 1105 - 017 - college algebra - harmon, drake.pdf
201308 - (95125) - mac1147-004 - precaclculus algebra & trig - greenberg, alisa.pdf
201308 - (82600) - mac1147-002 - precaclculus algebra & trig - greenberg, alisa.pdf

For the first 2, I'm not sure why it didn't catch them; they seem fixable. For the second 2, I'm not sure why it didn't separate "mac" from "1147" with a space.

You can edit the second re.sub in the function, although you'll also have to edit the first re.sub to accommodate the change:

return re.sub(term + r'\s*-\s*(\(\d{5}\))\s*-\s*(\w{3}\s?\d{4}\w?-?\d{3}\s*-?\s*(?:[^.\s]|\b\s\b)+)\s*-\s*([^.]+\.pdf)$',
              lambda match: term + ' - {0} - {1} - {2}'.format(
                  match.group(1),
                  re.sub(r'(\w{3})\s?(\d{4}\w?)\s*-?\s*(\d{3})\s*-?\s*(.*)', r'\1 \2-\3-\4', match.group(2)),
                  match.group(3)),
              f)

The second re.sub parses the 'middle part' from scratch.
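
As a rough sketch of what that inner substitution does on its own, here it is applied directly to a few 'middle parts' taken from the sample filenames. The names MIDDLE and normalize_middle are hypothetical and used only for this illustration:

```python
import re

# Inner pattern from the answer: 3 letters, optional space, 4 digits plus an
# optional letter, then the 3-digit section with optional hyphens around it.
MIDDLE = r'(\w{3})\s?(\d{4}\w?)\s*-?\s*(\d{3})\s*-?\s*(.*)'


def normalize_middle(middle):
    """Rebuild the middle part as 'abc 1234-567-title' (hypothetical helper)."""
    return re.sub(MIDDLE, r'\1 \2-\3-\4', middle)


for middle in ('mac 1105-006 college algebra',
               'mac1105-014 - college algebra',
               'mac 1105 - 012 - college algebra'):
    print(normalize_middle(middle))
```

Since every hyphen and space between the pieces is optional in the pattern and written back explicitly in the replacement, a missing space or hyphen and surplus whitespace all normalize to the same output.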

I don't know how this will affect the file names you were handling before, though, since I added more flexibility so that the regex accepts 'wrong formats'.

Edit: I didn't consider the " & ", and I forgot to allow spaces around the third hyphen. Use this regex in the first re.sub:

\s*-\s*(\(\d{5}\))\s*-\s*(\w{3}\s*\d{4}\w?\s*-?\s*\d{3}\s*-?\s*(?:[^.\s]|\b\s\b|\s&\s)+)\s*-\s*([^.]+\.pdf)$ 
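
Putting the pieces together, here is a sketch of the whole fix under a few assumptions: the outer pattern is the one just above, the inner pattern is the one from the earlier re.sub, and the format check mirrors checkproper() with the extension set to .pdf to match the sample names. The names fix_name and is_proper are hypothetical, and re.escape(term) is an extra precaution that is not in the original code:

```python
import re

# Outer pattern from the edit above (term prefix added at call time).
OUTER = (r'\s*-\s*(\(\d{5}\))\s*-\s*'
         r'(\w{3}\s*\d{4}\w?\s*-?\s*\d{3}\s*-?\s*(?:[^.\s]|\b\s\b|\s&\s)+)'
         r'\s*-\s*([^.]+\.pdf)$')
# Inner pattern that rebuilds the middle part from scratch.
MIDDLE = r'(\w{3})\s?(\d{4}\w?)\s*-?\s*(\d{3})\s*-?\s*(.*)'
# Format check modelled on checkproper(); the .pdf extension is an assumption.
PROPER = r'\s-\s\(\d{5}\)\s-\s\w{3}\s\d{4}\w?-\d{3}-[^\.]+\s-\s[^\.]+\.pdf'


def fix_name(name, term):
    """Return name with whitespace and the leading hyphens normalized."""
    def rebuild(match):
        middle = re.sub(MIDDLE, r'\1 \2-\3-\4', match.group(2))
        return '{0} - {1} - {2} - {3}'.format(term, match.group(1), middle, match.group(3))
    return re.sub(re.escape(term) + OUTER, rebuild, name)


def is_proper(name, term):
    """True when name already follows the proper format."""
    return re.match(re.escape(term) + PROPER, name) is not None


samples = [
    '201308 - (82431) - mac 1105-006 college algebra - graziose, james.pdf',
    '201308 - (91500) - mac1105-014 - college algebra - radulovic, aibeng.pdf',
    '201308 - (82442) - mac 1105 - 012 - college algebra - harmon, drake.pdf',
    '201308 - (95125) - mac1147-004 - precaclculus algebra & trig - greenberg, alisa.pdf',
]

for name in samples:
    fixed = fix_name(name, '201308')
    print(fixed, is_proper(fixed, '201308'))
```

A name that is already in the proper format should pass through fix_name() unchanged, so running it over the whole directory again is harmless.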
