Python: matching n-grams from a dictionary to a string of text
I have a dictionary of 2- and 3-word phrases that I want to search for in RSS feeds. I grab the RSS feeds, process them, and they end up as strings in a list entitled "documents". I want to check the dictionary below, and if any of the phrases in the dictionary match part of a string of text, I want to return the values for the matching keys. I'm not sure of the best way to approach this problem. Any suggestions would be appreciated.
ngramlist = {
    "cash outflows": -1, "pull out": -1, "winding down": -1,
    "most traded": -1, "steep gains": -1, "military strike": -1,
    "resumed operations": +1, "state aid": +1, "bail out": -1,
    "cut costs": -1, "alleged violations": -1, "under perform": -1,
    "more expected": +1, "pay more taxes": -1, "not sale": +1,
    "struck deal": +1, "cash flow problems": -2,
}
I'm assuming the numbers (-2, -1, +1) in the dictionary are weights, so you need to count the occurrences of each phrase in each document to make them useful.
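If the numbers are treated as weights, one way to combine them with the counts is to sum weight times occurrences for each document. The sketch below is only an illustration under that assumption: `score_document` is a hypothetical helper name, and `weights` is a small subset of the question's dictionary.

```python
import re

# A small subset of the question's dictionary, used here for illustration.
weights = {"bail out": -1, "state aid": +1, "cash flow problems": -2}

def score_document(document, weights):
    """Return the sum of weight * occurrences over all phrases (hypothetical helper)."""
    total = 0
    for phrase, weight in weights.items():
        # Allow any run of whitespace between the words and ignore case.
        pattern = re.compile(r"\s+".join(map(re.escape, phrase.split())), re.IGNORECASE)
        total += weight * len(pattern.findall(document))
    return total

doc = "State aid was approved, but cash flow problems remain after the bail out."
print(score_document(doc, weights))  # (+1) + (-2) + (-1) = -2
```

A document that contains none of the phrases scores 0, so positive and negative totals can be read as a rough overall signal.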
So, in pseudocode:
- Split the document into a list of lines, and each line into a list of words.
- Loop through each word in the line, looking both forwards and backwards in the line to generate the various phrases.
- As each phrase is generated, keep a global dictionary of the phrase and its count of occurrences.
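The steps above can be sketched roughly as follows. This is a simplified version, not a definitive implementation: it uses a single forward sliding window over each line instead of looping both forwards and backwards, and the name `count_phrases` and its `sizes` parameter are made up for the example.

```python
import re
from collections import Counter

def count_phrases(document, sizes=(2, 3)):
    """Count every 2- and 3-word phrase in a document, line by line."""
    counts = Counter()
    for line in document.splitlines():
        # Lowercase and keep only word characters so "Bail out." matches "bail out".
        words = re.findall(r"\w+", line.lower())
        for n in sizes:
            # Slide a window of n words across the line.
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts

doc = "The bank may need a bail out.\nA second bail out would hurt cash flow."
phrase_counts = count_phrases(doc)
print(phrase_counts["bail out"])  # 2
```

Because phrases are built per line, a phrase that straddles a line break is not counted. The resulting counter can then be looked up against the keys of the phrase dictionary.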
Here is code for the simple case of finding the count of each phrase in a document, which seems to be what you are trying to do:
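import re

text = """
I have a dictionary of 2 and 3 word phrases that I want to search in rss feeds for a match.
I grab rss feeds, process them and they end up as a string in a list entitled "documents".
I want to check the dictionary below and if any of the phrases in the dictionary match part of
the string of text I want to return the values for the key.
I am not sure about the best way to approach this problem. Any suggestions would be appreciated.
"""

ngrams = ["grab rss", "approach this", "in"]

counts = {}
for ngram in ngrams:
    words = ngram.split()
    # Join the words with \s+ so any run of whitespace between them matches.
    pattern = re.compile(r"\s+".join(re.escape(w) for w in words), re.IGNORECASE)
    # findall returns every non-overlapping match, so its length is the count.
    counts[ngram] = len(pattern.findall(text))

print(counts)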
Output:
{'grab rss': 1, 'approach this': 1, 'in': 5}
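Note that a pattern without word boundaries also matches inside longer words, which is why "in" is counted inside "string". A possible variation, sketched below, wraps each phrase in `\b` and returns the dictionary values for the phrases found in each document, which is what the question asks for. The helper name `matched_values` and the sample `documents` are invented for the example, and `ngramlist` is a subset of the question's dictionary.

```python
import re

# Subset of the question's dictionary; the documents are invented samples.
ngramlist = {"bail out": -1, "state aid": +1, "struck deal": +1}
documents = [
    "Airline struck deal on state aid",
    "Investors are waiting",
    "The bank got a bail out. Bailout talks continue.",
]

def matched_values(document, ngramlist):
    """Return {phrase: value} for each phrase that occurs as whole words."""
    matches = {}
    for phrase, value in ngramlist.items():
        # \b stops the phrase from matching inside longer words.
        pattern = r"\b" + r"\s+".join(map(re.escape, phrase.split())) + r"\b"
        if re.search(pattern, document, re.IGNORECASE):
            matches[phrase] = value
    return matches

results = [matched_values(doc, ngramlist) for doc in documents]
print(results)
# [{'state aid': 1, 'struck deal': 1}, {}, {'bail out': -1}]
```

The single word "Bailout" in the third document is not matched, because the pattern requires whitespace between "bail" and "out".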