Automatic document classification with Python: gaming articles being sorted into Sports


I have a corpus of 500 pre-categorized articles. I've taken the commonly used nouns and adjectives for each category and sorted them by relevance.

Each category (world, business, tech, entertainment, science, health, sports) has a few hundred words associated with it.

I'm having trouble with this article: http://www.techhive.com/article/2052311/hands-on-with-the-2ds-an-entry-level-investment.html

It is about gaming. Words such as "game" and "player" are closely associated with sports, based on the articles I've looked at.

This article scores as follows:

{u'business': 51, u'entertainment': 58, u'science': 48, u'sports': 62, u'health': 35, u'world': 48, u'technology': 59} 

As you can see, technology is up there at 59, but it is overtaken by sports at 62.
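For reference, here is a minimal sketch of the kind of keyword scoring described above. The actual scoring function is not shown in this question, so the plain hit count, the `score_article` name and the tiny word lists are assumptions for illustration only. It shows how a word like "game" that sits in two lists adds to both scores.

```python
import re
from collections import Counter

# Hypothetical word lists; the real ones hold a few hundred words per category.
CATEGORY_WORDS = {
    "sports": {"game", "player", "team", "season"},
    "technology": {"game", "console", "screen", "nintendo"},
}

def score_article(text, category_words=CATEGORY_WORDS):
    """Count how many tokens of the text appear in each category's word list."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(tokens)
    return {
        category: sum(counts[word] for word in words)
        for category, words in category_words.items()
    }

# "game" is in both lists, so it raises both scores.
scores = score_article("The game suits a younger player; the console screen is small.")
print(scores)
```

With only raw counts, a handful of shared words can be enough to tip an article from one category to another.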

I am hoping that if I increase the corpus to a few thousand articles, this problem will be solved, but I don't know if that is likely.

What are your ideas on solving this issue?

I thought of having a list of giveaway words, such as "twitter", "facebook", "technology", "nintendo", etc., that would automatically cluster an article into technology if any of them are present. The problem is finding the words to begin with, and avoiding clashes with business/world, etc.
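A rough sketch of that giveaway-word idea, layered on top of the existing scores. The `classify` function, the `GIVEAWAY_WORDS` table and the rule for clashes (fall back to the highest score when zero or several categories match) are assumptions, not something I have settled on.

```python
import re

# Hypothetical giveaway lists; only technology is filled in here.
GIVEAWAY_WORDS = {
    "technology": {"twitter", "facebook", "technology", "nintendo"},
}

def classify(text, scores, giveaway_words=GIVEAWAY_WORDS):
    """Return the giveaway category if exactly one matches, else the top score."""
    tokens = set(re.findall(r"[a-z0-9']+", text.lower()))
    matches = [cat for cat, words in giveaway_words.items() if tokens & words]
    if len(matches) == 1:
        return matches[0]
    # No giveaway word, or a clash between categories: trust the scores.
    return max(scores, key=scores.get)

# Scores taken from the article above.
article_scores = {"business": 51, "entertainment": 58, "science": 48,
                  "sports": 62, "health": 35, "world": 48, "technology": 59}

with_giveaway = classify("Hands-on with the Nintendo 2DS", article_scores)
without_giveaway = classify("Hands-on with the 2DS", article_scores)
print(with_giveaway, without_giveaway)
```

This does not solve the clash problem; it only makes a clash fall back to the score-based result instead of forcing a category.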

Thanks.

The gaming category should blur into hunting, war correspondence, pen-and-paper RPGs... everything has a game version of it.

I think you are looking to differentiate fact from fiction. The idea I derive from the one you proposed is to grab the fiction section and the fact section of a library and reduce them to a short-list and a long-list of keywords.

Ed: As you have discovered, it's the typical 'hello world' example, word frequency analysis, so a map-reduce framework such as Disco should let you point it at a set of URLs that you know are either fact or fiction. You should then have two lists of tuples, and you can filter these for keywords that speak of fact or fiction.
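To make the two-lists idea concrete, here is a small sketch in plain Python rather than Disco; `collections.Counter` stands in for the map-reduce word count. The sample documents, the `distinctive_words` helper and the `min_ratio` threshold are all made up for illustration.

```python
import re
from collections import Counter

def word_counts(documents):
    """Word-frequency count over a list of document strings."""
    counts = Counter()
    for text in documents:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

def distinctive_words(own, other, min_ratio=2.0):
    """Return (word, count) tuples for words at least min_ratio times
    more frequent in `own` than in `other` (raw counts)."""
    return [(word, count) for word, count in own.most_common()
            if count >= min_ratio * other[word]]

# Hypothetical stand-ins for text fetched from the known fact/fiction URLs.
fact_docs = ["The council approved the budget report.",
             "The report cites the budget figures."]
fiction_docs = ["The dragon guarded the castle.",
                "The knight fought the dragon."]

fact_counts = word_counts(fact_docs)
fiction_counts = word_counts(fiction_docs)

fact_keywords = distinctive_words(fact_counts, fiction_counts)
fiction_keywords = distinctive_words(fiction_counts, fact_counts)
print(fact_keywords)
print(fiction_keywords)
```

Words common to both sets, such as "the", drop out of both lists, which is the filtering step described above. The same comparison could be run between any two categories, for example technology and sports.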

