Automatic document classification with Python: gaming articles being sorted into Sports
I have a corpus of 500 pre-categorized articles. I've taken the most commonly used nouns and adjectives in each category and sorted them by relevance.
Each category (world, business, tech, entertainment, science, health, sports) has a few hundred words associated with it.
I am having trouble with this article: http://www.techhive.com/article/2052311/hands-on-with-the-2ds-an-entry-level-investment.html
It is about gaming. Words such as "game", "player", etc. are closely associated with sports, based on the articles I've looked at.
This article scores as follows:
{u'business': 51, u'entertainment': 58, u'science': 48, u'sports': 62, u'health': 35, u'world': 48, u'technology': 59}
As you can see, technology is up there at 59, but it is overtaken by sports at 62.
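For reference, here is a minimal sketch of one way scores like these could be computed. It assumes each category's score is simply the number of times its keywords occur in the article; the question does not say how the real scores are produced. The word lists, `tokenize`, and `score_article` are hypothetical names used only for illustration.

```python
import re
from collections import Counter

# Hypothetical keyword lists; the real ones hold a few hundred words each.
CATEGORY_WORDS = {
    "sports": {"game", "player", "team", "season"},
    "technology": {"console", "screen", "device", "game"},
}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def score_article(text, category_words=CATEGORY_WORDS):
    """Score each category by counting occurrences of its keywords."""
    counts = Counter(tokenize(text))
    return {
        category: sum(counts[word] for word in words)
        for category, words in category_words.items()
    }

sample = "The player picked up the console and started the game on its small screen."
scores = score_article(sample)
best = max(scores, key=scores.get)
```

Because "game" sits in both lists, it adds to both scores equally, so shared words like this cannot separate gaming from sports on their own.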
I am hoping that if I increase the corpus to a few thousand articles, the problem will be solved, but I don't know if that is likely.
What are your ideas on solving this issue?
I thought of having a list of giveaway words, such as "twitter, facebook, technology, nintendo, etc.", which would automatically cluster an article into technology if present. The problem is finding the words to start with, and avoiding clashes with business/world, etc.
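The giveaway-word idea could be sketched roughly as follows. To sidestep clashes, this version only overrides the scores when exactly one category's giveaway words appear, and otherwise falls back to the highest score. That tie-breaking rule, the business word list, and the name `classify_with_giveaways` are assumptions made for illustration.

```python
import re

# Technology words come from the question; the business list is hypothetical.
GIVEAWAY_WORDS = {
    "technology": {"twitter", "facebook", "technology", "nintendo"},
    "business": {"shareholder", "earnings"},
}

def classify_with_giveaways(text, scores, giveaways=GIVEAWAY_WORDS):
    """Return a category, letting an unambiguous giveaway word override scores."""
    tokens = set(re.findall(r"[a-z0-9']+", text.lower()))
    hits = [cat for cat, words in giveaways.items() if tokens & words]
    if len(hits) == 1:
        # Exactly one category claims the article: trust the giveaway word.
        return hits[0]
    # No giveaway word, or several categories clash: fall back to the scores.
    return max(scores, key=scores.get)

# Scores reported in the question for the 2DS article.
article_scores = {
    "business": 51, "entertainment": 58, "science": 48, "sports": 62,
    "health": 35, "world": 48, "technology": 59,
}

label = classify_with_giveaways("Hands-on with the Nintendo 2DS", article_scores)
```

With the scores above, a headline mentioning Nintendo is labelled technology even though sports scores higher, while a text that hits both the technology and business lists falls back to the plain scores.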
Thanks.
The gaming category should blur with hunting, war correspondence, pen-and-paper RPGs... nearly everything has a game version of it.
I think you are looking to differentiate fact from fiction. One idea for deriving the lists you proposed is to grab the fiction section and the fact section of a library and reduce them to a short-list and a long-list of keywords.
Edit: As you have discovered, this is the typical 'hello world' example, word frequency analysis, so a map-reduce framework such as Disco should let you point it at a set of URLs that you know to be either fact or fiction. You should then have two lists of tuples, and you can filter these for keywords that speak of fact or fiction.
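As a rough, single-machine sketch of that word-frequency step, the following uses plain Python with `collections.Counter` in place of a map-reduce framework such as Disco. The sample documents, the `min_count` threshold, and the function names are hypothetical, and fetching the text from the URLs is left out.

```python
import re
from collections import Counter

def word_counts(documents):
    """Count word occurrences across a list of document strings."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z']+", doc.lower()))
    return counts

def distinctive_words(own, other, min_count=2):
    """List (word, count) tuples seen at least min_count times in `own`
    and never in `other`, most frequent first."""
    return [
        (word, count)
        for word, count in own.most_common()
        if count >= min_count and word not in other
    ]

# Tiny made-up samples standing in for the two labelled sets of pages.
fact_docs = ["The report cites the census data.",
             "Census data shows the population grew."]
fiction_docs = ["The dragon guarded the castle.",
                "A dragon flew over the castle at night."]

fact_counts = word_counts(fact_docs)
fiction_counts = word_counts(fiction_docs)

fact_keywords = distinctive_words(fact_counts, fiction_counts)
fiction_keywords = distinctive_words(fiction_counts, fact_counts)
```

Common words such as "the" appear in both sets and drop out, leaving two lists of tuples with the words that are specific to one side.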