python - Optimizing searches in very large csv files
I have a csv file with a single column and 6.2 million rows, containing strings of between 6 and 20-ish letters. Some strings are found in duplicate (or more) entries, and I want to write these to a new csv file - my guess is there should be around 1 million non-unique strings. That's it, really. Continuously searching through a dictionary of 6 million entries does take its time, however, and I'd appreciate any tips on how to do it. The script I've written so far takes at least a week (!) to run, according to some timings I did.
First try:
    import csv

    in_file_1 = open('uniprot trypsinome (full).csv','r')
    in_list_1 = list(csv.reader(in_file_1))
    out_file_1 = open('uniprot non-unique reference trypsinome.csv','w+')
    out_file_2 = open('uniprot unique trypsin peptides.csv','w+')
    writer_1 = csv.writer(out_file_1)
    writer_2 = csv.writer(out_file_2)

    # create trypsinome dictionary construct
    ref_dict = {}
    for row in range(len(in_list_1)):
        ref_dict[row] = in_list_1[row]

    # find unique/non-unique peptides from trypsinome
    peptide_list = []
    uniques = []
    for n in range(len(in_list_1)):
        peptide = ref_dict.pop(n)
        if peptide in ref_dict.values():  # non-unique peptides
            peptide_list.append(peptide)
        else:
            uniques.append(peptide)  # unique peptides

    for m in range(len(peptide_list)):
        write_list = (str(peptide_list[m]).replace("'","").replace("[",'').replace("]",''),'')
        writer_1.writerow(write_list)
Second try:
    import csv

    in_file_1 = open('uniprot trypsinome (full).csv','r')
    in_list_1 = list(csv.reader(in_file_1))
    out_file_1 = open('uniprot non-unique reference trypsinome.csv','w+')
    writer_1 = csv.writer(out_file_1)

    ref_dict = {}
    for row in range(len(in_list_1)):
        peptide = in_list_1[row]
        if peptide in ref_dict.values():
            write = (in_list_1[row],'')
            writer_1.writerow(write)
        else:
            ref_dict[row] = in_list_1[row]
Edit: here are a few lines from the csv file:
    selvqk
    aklaeqaer
    aklaeqaerr
    laeqaer
    laeqaeryddmaaamk
    laeqaeryddmaaamkk
    mtmdkselvqk
    yddmaaamkavteqghelsneer
    yddmaaamkavteqghelsneerr
First hint: Python has support for lazy evaluation, and you'd better use it when dealing with huge datasets:
- iterate on your csv.reader instead of building a huge in-memory list,
- don't build huge in-memory lists just to loop over ranges - use enumerate(seq) instead if you need both the item and the index, and iterate directly on your sequence's items if you don't need the index (see the sketch just below).
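As a minimal sketch of what that looks like (assuming the single-column layout from the question; the file name here is hypothetical):

    import csv

    # Stream the rows lazily: csv.reader yields one row at a time, and
    # enumerate() wraps it with an index, so the 6.2M rows are never all
    # held in memory at once.
    with open('peptides.csv', newline='') as f:
        for i, row in enumerate(csv.reader(f)):
            peptide = row[0]  # first (and only) column of this row
            print(i, peptide)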
Second hint: the main point of using a dict (hashtable) is to look things up on keys, not on values... so don't build a huge dict that's only ever used as a list.
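To see the difference, compare a membership test on the keys with one on the values - the latter is what both scripts above do on every single iteration:

    # Key lookups hash the key: O(1) on average. Value lookups scan every
    # stored value: O(n) - repeated 6.2 million times, that's the week-long
    # runtime.
    d = {'selvqk': 1, 'laeqaer': 2}

    print('laeqaer' in d)    # True - hash lookup on the keys
    print(2 in d.values())   # True - linear scan over all the values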
Third hint: if you just want a way to store "already seen" values, use a set.
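Putting the three hints together, a single streaming pass with two sets does the whole job. This is only a sketch using the file names from the question, not tested against the real data:

    import csv

    seen = set()    # every peptide encountered so far
    dupes = set()   # peptides encountered more than once

    # One lazy pass over the input; each membership test is O(1) on average.
    with open('uniprot trypsinome (full).csv', newline='') as in_file:
        for row in csv.reader(in_file):
            peptide = row[0]
            if peptide in seen:
                dupes.add(peptide)
            else:
                seen.add(peptide)

    # Write the non-unique peptides out, one per row.
    with open('uniprot non-unique reference trypsinome.csv', 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        for peptide in sorted(dupes):
            writer.writerow([peptide])

Since set lookups don't degrade with size, the whole run is linear in the number of rows instead of quadratic, so it should finish in minutes rather than a week.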