Machine learning - Classifying event data
I'm getting started with machine learning and analytics, and my approach is to dive right into the data and learn the methods and tools as I need them.
I have a set of event data logging people's movement in and out of a room via RFID chips. The chips have unique IDs, and each event has a timestamp of when the chip enters the room and when it leaves. I want to classify the chip IDs based on their movements in and out of the room.
For instance:
- A chip that leaves and comes back regularly may belong to someone who works in the room.
- A chip that enters and stays for a long time may belong to electronic equipment that gets moved around.
- A chip that comes in and leaves a short while later, with only a few or one set of entrances/exits, is probably a visitor.
- A chip that spends minimal time in the room is not visiting, just walking through.
Those are the 4 main categories.
So far, I have tried k-means clustering. For each chip, I compute parameters that could indicate a particular category: average time spent in the room, number of days seen in a week, total time spent in the room, and average number of entrances/exits per day.
With this, I have seen reasonable results, but depending on which parameters I use the results vary drastically. Looking at the chip parameters I generate on a plot, there's a lot of variation in the classification. I do not have any training data, which is why I tried an unsupervised method first.
I'm looking for advice on what might be better algorithms or techniques to use, or whether my approach is way off. I can provide code or dummy data if needed, but I'm mostly looking for direction.
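For reference, the per-chip parameters listed above could be computed roughly as follows. This is only a minimal sketch: the `(chip_id, enter, leave)` tuple layout, the `chip_features` name, the feature names, and the dummy events are all assumptions made for illustration, not the actual data format. Days are counted over the whole log rather than per week.

```python
from collections import defaultdict
from datetime import datetime


def chip_features(events):
    """Summarize visits per chip.

    events: iterable of (chip_id, enter, leave) with datetime values.
    This layout is an assumption; adapt it to the real log format.
    """
    durations = defaultdict(list)  # chip_id -> visit lengths in minutes
    days = defaultdict(set)        # chip_id -> calendar dates the chip was seen
    for chip_id, enter, leave in events:
        durations[chip_id].append((leave - enter).total_seconds() / 60.0)
        days[chip_id].add(enter.date())

    features = {}
    for chip_id, minutes in durations.items():
        n_days = len(days[chip_id])
        features[chip_id] = {
            "avg_minutes_per_visit": sum(minutes) / len(minutes),
            "days_seen": n_days,
            "total_minutes": sum(minutes),
            "visits_per_day": len(minutes) / n_days,
        }
    return features


# Made-up dummy events, for illustration only.
events = [
    ("A", datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 12, 0)),
    ("A", datetime(2024, 1, 1, 13, 0), datetime(2024, 1, 1, 17, 0)),
    ("A", datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 17, 0)),
    ("B", datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 1)),
]
features = chip_features(events)
```

Each chip ends up as one row of numbers, which is the form a clustering algorithm such as k-means expects.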
This sounds like a fun problem to work on! Here are a couple of general suggestions that might give you some ideas.
First, it would be useful if you were able to identify some sort of goal that you're trying to achieve by modeling the data. Maybe you'd like to identify intruders, or learn how many different classes of RFID badge there are (you listed 4 in your question, but what if there are 5 or ten?), or some other task entirely. You might need different data depending on the task you identify, but having a goal is useful because it gives you a gauge of whether what you're trying is working. I hope that makes sense.
Second, if you're using k-means (or any clustering algorithm, really) to model your data in an unsupervised manner, it's a good idea to normalize the features of your data. Suppose you had 2 features: one indicates the hour of the last entrance to the room, and the other indicates the total number of entrances in a month. The values of the first feature are in the range [0, 24), while the second might range from 0 to 1000 (say). Then, when you're computing the distance between data points, the second feature might dominate the distance computation because its numeric values are larger. For starters, try subtracting the mean from each feature and dividing by the standard deviation. This should make your clusters more stable, if nothing else.
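To make the scaling point concrete, here is a small sketch of that standardization using only the Python standard library. The `standardize` helper and the three example rows are made up for illustration. It uses the population standard deviation, and leaving constant columns undivided is an assumption of this sketch.

```python
import math
import statistics


def standardize(rows):
    """Z-score each column: subtract its mean, divide by its standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # A constant column has zero deviation; dividing by 1.0 instead leaves it at 0.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Made-up rows: (hour of last entrance, entrances in the month).
rows = [
    [8.0, 100.0],   # chip 0
    [20.0, 110.0],  # chip 1: very different hour, similar count
    [9.0, 900.0],   # chip 2: similar hour, very different count
]
scaled = standardize(rows)

# Raw distances: the entrance count swamps the hour feature.
raw_hour_gap = math.dist(rows[0], rows[1])
raw_count_gap = math.dist(rows[0], rows[2])

# Scaled distances: both kinds of difference now contribute comparably.
scaled_hour_gap = math.dist(scaled[0], scaled[1])
scaled_count_gap = math.dist(scaled[0], scaled[2])
```

On the raw rows, chip 2 looks far more distant from chip 0 than chip 1 does, purely because of the units. After standardizing, the two gaps are of similar size, so neither feature decides the clustering on its own.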