r - How to optimize a for loop over an extremely large dataframe
I have a dataframe "x" with 5.9 million rows and 4 columns: idnumber (integer), compdate (integer), and judge (character), representing individual cases completed in an administrative court. The data were imported from a Stata dataset, and the date field came in as an integer, which is fine for my purposes. I want to create a caseload variable by calculating the number of cases completed by the judge within a 30-day window of the completion date of the case at issue.
Here are the first 34 rows of data:
    idnumber compdate judge
           1     9615   jvc
           2    15316   ban
           3    15887   wla
           4    11968   wfn
           5    15001   clr
           6    13914   ieb
           7    14760   hsd
           8    11063   rjd
           9    10948   ppl
          10    16502   ban
          11    15391   wcp
          12    14587   lrd
          13    10672   rtg
          14    11864   jcw
          15    15071   gmr
          16    15082   pam
          17    11697   dlk
          18    10660   adp
          19    13284   ecc
          20    13052   jwr
          21    15987   mak
          22    10105   hea
          23    14298   clr
          24    18154   mmt
          25    10392   hea
          26    10157   erh
          27     9188   rbr
          28    12173   jcw
          29    10234   par
          30    10437   adp
          31    11347   rdw
          32    14032   jtz
          33    11876   amc
          34    11470   amc
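For anyone wanting to follow along, here is a minimal sketch that rebuilds the first few rows as a dataframe (hypothetical reconstruction; the real dataset has 5.9 million rows). Since the file came from Stata, the integer dates count days since 1960-01-01, should calendar dates ever be needed:

    # minimal reconstruction of the first rows (hypothetical; real data has 5.9M rows)
    x <- data.frame(
      idnumber = 1:6,
      compdate = c(9615L, 15316L, 15887L, 11968L, 15001L, 13914L),
      judge    = c("jvc", "ban", "wla", "wfn", "clr", "ieb"),
      stringsAsFactors = FALSE
    )
    # Stata stores %td dates as days since 1960-01-01:
    x$date <- as.Date(x$compdate, origin = "1960-01-01")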
Here's what I came up with. For each record, I take a subset of the data for that particular judge, subset again to the cases decided within the 30-day window, and assign the length of a vector in the subsetted dataframe to the caseload variable for the subject case, as follows:
    for (i in 1:length(x$idnumber)) {
      e <- x$compdate[i]                                    # completion date of the case at issue
      f <- e - 29                                           # start of the 30-day window
      a <- x[x$judge == x$judge[i] & !is.na(x$compdate), ]  # all cases for this judge
      b <- a[a$compdate <= e & a$compdate >= f, ]           # cases decided within the window
      x$caseload[i] <- length(b$idnumber)                   # window count = caseload
    }
It works, but it is taking extremely long to complete. How can I optimize it, or do this more easily? Sorry if I'm naive about R and programming -- I'm a law professor trying to analyze court data.... Any help is appreciated. Thanks. Ken
I don't have much experience with rolling calculations, but...
- Calculate per-day, not per-case (since the count is the same for all cases completed on the same day).
- Calculate the cumulative sum of the number of cases, and take the difference between the current value of the sum and its value from 31 days ago (or min{daysago : daysago > 30}, since cases are not resolved every day). A base-R sketch of this idea follows right below.
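In base R, the per-day counting plus cumulative-sum idea looks roughly like this for a single judge (a sketch only, using the same >30-day cutoff as the data.table version below; "ban" is a judge code from the sample data):

    # sketch of the per-day + cumulative-sum idea for a single judge
    one_judge <- sort(x$compdate[x$judge == "ban"])
    days   <- unique(one_judge)              # each resolution day appears once
    counts <- as.integer(table(one_judge))   # cases completed on each day
    csum   <- cumsum(counts)                 # running total of cases
    # workload on day d = cases completed on days in [d - 30, d]
    wload <- sapply(seq_along(days), function(i) {
      past <- which(days < days[i] - 30)     # days strictly more than 30 days older
      csum[i] - if (length(past)) csum[max(past)] else 0L
    })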
It's fastest to use data.table. Here is my attempt, using simulated data from @nograpes. Comments start with #.
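@nograpes's simulation isn't reproduced here; a hypothetical stand-in of the right shape would be:

    # hypothetical stand-in for the simulated data (not @nograpes's actual code)
    set.seed(1)
    n <- 5.9e6
    judges <- replicate(300, paste(sample(letters, 3), collapse = ""))  # made-up 3-letter codes
    x <- data.frame(
      idnumber = 1:n,
      compdate = sample(9000:18500, n, replace = TRUE),
      judge    = sample(judges, n, replace = TRUE),
      stringsAsFactors = FALSE
    )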
    require(data.table)
    dt <- data.table(x)
    dt[, compdate := as.integer(compdate)]
    setkey(dt, judge, compdate)

    # count cases for each day
    ldt <- dt[, .N, by = 'judge,compdate']

    # cumulative sum of counts
    ldt[, nrun := cumsum(N), by = judge]

    # see how far back to look
    ldt[, lookbk := sapply(1:.N, function(i) {
      z <- compdate[i] - compdate[i:1]
      older <- which(z > 30)
      if (length(older)) min(older) - 1L else as(NA, 'integer')
    }), by = judge]

    # compute cumsum(today) - cumsum(more than 30 days ago)
    ldt[, wload := list(sapply(1:.N, function(i)
      nrun[i] - ifelse(is.na(lookbk[i]), 0, nrun[i - lookbk[i]])
    ))]
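ldt has one row per judge-day, while the question asks for a per-case variable; one way to get there (my addition, not part of the code above) is to join the day-level wload back onto the case-level table:

    # map the per-day workload back onto individual cases (my addition)
    setkey(ldt, judge, compdate)
    dt[ldt, caseload := i.wload]   # every case completed on that judge-day gets the day's count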
On my laptop, this takes under a minute. Run this command to see the output for one judge (substitute a real judge code for 'xyz'):
print(ldt['xyz'],nrow=120)
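To build confidence in the rolling logic, one can re-run the slow loop on a small slice and compare against the joined caseload column from the sketch above. One subtlety: the question's loop counts a 30-day window (compdate between e-29 and e), while the z > 30 cutoff above yields a 31-day window; change the cutoff to z > 29 if the two need to match exactly.

    # sanity check on a small slice (assumes dt$caseload exists from the join sketch above)
    some <- head(unique(dt$judge), 2)
    sub <- as.data.frame(dt[judge %in% some])
    sub$loop_count <- NA_integer_
    for (i in seq_len(nrow(sub))) {
      e <- sub$compdate[i]
      same <- sub$judge == sub$judge[i] & !is.na(sub$compdate)
      sub$loop_count[i] <- sum(same & sub$compdate <= e & sub$compdate >= e - 29)
    }
    all.equal(sub$loop_count, sub$caseload)   # TRUE once the window cutoffs agree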