Python matching n-grams from a dictionary to a string of text

I have a dictionary of 2 and 3 word phrases that I want to search for in RSS feeds. I grab the RSS feeds, process them, and they end up as strings in a list entitled "documents". I want to check the dictionary below, and if any of the phrases in the dictionary match part of a string of text, I want to return the value for that key. I am not sure of the best way to approach this problem. Any suggestions would be appreciated.

    ngramlist = {
        "cash outflows": -1, "pull out": -1, "winding down": -1,
        "most traded": -1, "steep gains": -1, "military strike": -1,
        "resumed operations": +1, "state aid": +1, "bail out": -1,
        "cut costs": -1, "alleged violations": -1, "under perform": -1,
        "more expected": +1, "pay more taxes": -1, "not sale": +1,
        "struck deal": +1, "cash flow problems": -2,
    }

I'm assuming the numbers (-2, -1, +1) in the dictionary are weights, so you need to count each phrase in each document to make them useful.

So the pseudocode would be something like this:

Split the document into a list of lines, and each line into a list of words.
Then loop through each word in the line, looping both forward and backwards in the line to generate the various phrases.
As each phrase is generated, keep it in a global dictionary of phrase and count of occurrences.
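The steps above can be sketched roughly as follows. This is only one possible reading of the pseudocode: the function name `count_phrases`, the `max_len` parameter, and the sample text are made up for illustration, and the sketch only looks forward from each word, which is enough because every phrase starts at some word in the line.

```python
import re
from collections import Counter

def count_phrases(document, phrases, max_len=3):
    """Count how often each known phrase occurs in `document` (hypothetical helper)."""
    counts = Counter()
    for line in document.splitlines():
        # Lowercase and keep only word-like tokens so punctuation does not block a match.
        words = re.findall(r"[a-z0-9']+", line.lower())
        for start in range(len(words)):
            # Build every phrase of 1..max_len words that begins at `start`.
            for length in range(1, max_len + 1):
                if start + length > len(words):
                    break
                phrase = " ".join(words[start:start + length])
                if phrase in phrases:
                    counts[phrase] += 1
    return counts

# Made-up sample text; the phrases are a subset of the dictionary in the question.
sample = "The firm cut costs after cash flow problems.\nAnalysts say state aid may follow; cut costs again."
phrases = {"cut costs": -1, "cash flow problems": -2, "state aid": +1}
print(count_phrases(sample, phrases))
```

Because phrases are built per line, a phrase that is split across a line break would not be counted in this sketch.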

Here is code for the simple case of finding the count of each phrase in a document, which seems to be what you are trying to do:

    import re

    text = """I have a dictionary of 2 and 3 word phrases that I want to search for in RSS feeds for a match.   I grab   RSS feeds, process them, and they end up as a string in a list entitled "documents".  I want to check the dictionary below, and if any of the phrases in the dictionary match part of a string of text I want to return the values for the key.  I am not sure of the best way to approach this problem. Any suggestions would be appreciated."""

    ngrams = ["grab rss", "approach this", "in"]

    counts = {}
    for ngram in ngrams:
        words = ngram.split()
        # Allow any run of whitespace between the words of a phrase.
        pattern = re.compile(r"\s+".join(words), re.IGNORECASE)
        counts[ngram] = len(pattern.findall(text))

    print(counts)

Output:

{'grab rss': 1, 'approach this': 1, 'in': 5}
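Note that 'in' is counted 5 times because the pattern also matches inside longer words such as "string". If the numbers in the dictionary really are weights, one way to turn the counts into a per-document score might look like the sketch below. It assumes the score is simply the sum of weight times occurrences, and it adds `\b` word boundaries so short phrases do not match inside other words. The name `score_document` and the two sample documents are invented for illustration.

```python
import re

def score_document(text, weights):
    """Return (score, counts), where score = sum(weight * occurrences). Hypothetical helper."""
    counts = {}
    score = 0
    for phrase, weight in weights.items():
        # \b stops a phrase from matching inside a longer word;
        # \s+ tolerates any run of whitespace between the words of a phrase.
        pattern = re.compile(
            r"\b" + r"\s+".join(re.escape(w) for w in phrase.split()) + r"\b",
            re.IGNORECASE,
        )
        n = len(pattern.findall(text))
        if n:
            counts[phrase] = n
            score += weight * n
    return score, counts

# Subset of the dictionary from the question; the documents are made-up examples.
ngramlist = {"bail out": -1, "state aid": +1, "cash flow problems": -2}
documents = [
    "Ministers ruled out a bail out, but State  aid is still possible.",
    "Cash flow problems persist, and cash flow problems hurt suppliers.",
]
for doc in documents:
    print(score_document(doc, ngramlist))
```

Compiling one pattern per phrase on every call is fine for a handful of phrases. For a large dictionary it would probably be worth compiling the patterns once up front and reusing them across documents.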
