python - Optimizing searches in very large csv files


I have a csv file with a single column but 6.2 million rows, all containing strings between 6 and 20ish letters. Some strings will be found in duplicate (or more) entries, and I want to write these to a new csv file - a guess is that there should be around 1 million non-unique strings. That's all I want to do, really. Continuously searching through a dictionary of 6 million entries does take its time, however, and I'd appreciate any tips on how to do it. The script I've written so far takes at least a week (!) to run, according to some timings I did.

first try:

import csv

in_file_1 = open('uniprot trypsinome (full).csv','r')
in_list_1 = list(csv.reader(in_file_1))
out_file_1 = open('uniprot non-unique reference trypsinome.csv','w+')
out_file_2 = open('uniprot unique trypsin peptides.csv','w+')
writer_1 = csv.writer(out_file_1)
writer_2 = csv.writer(out_file_2)

# create trypsinome dictionary construct
ref_dict = {}
for row in range(len(in_list_1)):
    ref_dict[row] = in_list_1[row]

# find unique/non-unique peptides in the trypsinome
peptide_list = []
uniques = []
for n in range(len(in_list_1)):
    peptide = ref_dict.pop(n)
    if peptide in ref_dict.values():  # non-unique peptides
        peptide_list.append(peptide)
    else:
        uniques.append(peptide)  # unique peptides

for m in range(len(peptide_list)):
    write_list = (str(peptide_list[m]).replace("'","").replace("[",'').replace("]",''),'')
    writer_1.writerow(write_list)

second try:

import csv

in_file_1 = open('uniprot trypsinome (full).csv','r')
in_list_1 = list(csv.reader(in_file_1))
out_file_1 = open('uniprot non-unique reference trypsinome.csv','w+')
writer_1 = csv.writer(out_file_1)

ref_dict = {}
for row in range(len(in_list_1)):
    peptide = in_list_1[row]
    if peptide in ref_dict.values():
        write = (in_list_1[row],'')
        writer_1.writerow(write)
    else:
        ref_dict[row] = in_list_1[row]

edit: here are a few lines from the csv file:

selvqk
aklaeqaer
aklaeqaerr
laeqaer
laeqaeryddmaaamk
laeqaeryddmaaamkk
mtmdkselvqk
yddmaaamkavteqghelsneer
yddmaaamkavteqghelsneerr

first hint : Python has support for lazy evaluation, better use it when dealing with huge datasets. So:

  • iterate on your csv.reader instead of building a huge in-memory list (see the sketch after this list),
  • don't build huge in-memory lists with ranges - use enumerate(seq) instead if you need both the item and the index, and just iterate on your sequence's items if you don't need the index.
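For instance, a minimal sketch of lazy reading (assuming the single-column layout shown in the edit above, and reusing the filename from the question):

import csv

with open('uniprot trypsinome (full).csv', 'r') as in_file:
    # rows are produced one at a time - the 6.2 million rows are never all in memory
    for index, row in enumerate(csv.reader(in_file)):
        peptide = row[0]  # single-column file: each row is a one-item list
        ...               # process peptide here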

second hint : the main point of using a dict (hashtable) is the lookup on keys, not values... so don't build a huge dict that's only used as a list.
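To make the difference concrete, a toy example (the peptides are taken from the sample lines above): a membership test against dict.values() scans every stored value, while a test against the keys is a hash lookup.

ref_dict = {0: 'selvqk', 1: 'laeqaer'}   # the question's layout: row index -> peptide
'laeqaer' in ref_dict.values()           # linear scan over all values, O(n)

seen = {'selvqk': 0, 'laeqaer': 1}       # keyed on the peptide itself
'laeqaer' in seen                        # hash lookup on the keys, O(1) on average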

third hint : if you just want a way to store "already seen" values, use a set.
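Putting the three hints together, a possible rewrite could look like this (a sketch only, reusing the filenames from the question, not necessarily the exact code the answer had in mind):

import csv

seen = set()        # peptides encountered at least once
non_unique = set()  # peptides encountered more than once

with open('uniprot trypsinome (full).csv', 'r') as in_file:
    for row in csv.reader(in_file):
        peptide = row[0]
        if peptide in seen:
            non_unique.add(peptide)
        else:
            seen.add(peptide)

with open('uniprot non-unique reference trypsinome.csv', 'w') as out_file:
    writer = csv.writer(out_file)
    for peptide in non_unique:
        writer.writerow([peptide])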

