Python: matching n-grams from a dictionary to a string of text


I have a dictionary of 2- and 3-word phrases that I want to search for in RSS feeds. I grab RSS feeds, process them, and end up with a string in a list entitled "documents". I want to check the dictionary below, and if any of the phrases in the dictionary match part of a string of text, I want to return the values for the key. I am not sure of the best way to approach this problem. Any suggestions would be appreciated.

    ngramlist = {
        "cash outflows": -1, "pull out": -1, "winding down": -1,
        "most traded": -1, "steep gains": -1, "military strike": -1,
        "resumed operations": +1, "state aid": +1, "bail out": -1,
        "cut costs": -1, "alleged violations": -1, "under perform": -1,
        "more expected": +1, "pay more taxes": -1, "not sale": +1,
        "struck deal": +1, "cash flow problems": -2,
    }

I'm assuming the numbers (-2, -1, +1) in the dictionary are weights, so you need to count each phrase in each document to make them useful.
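To make that idea concrete, here is a minimal sketch that multiplies each phrase's weight by the number of times it occurs and sums the result into one score per document. The function name `score_document` and the sample sentence are invented for illustration; only the phrase-to-weight layout comes from the question, and treating the numbers as additive weights is the assumption stated above.

```python
import re

# Subset of the question's dictionary: phrase -> weight.
ngramlist = {"cash outflows": -1, "state aid": +1, "cash flow problems": -2}

def score_document(document, weights):
    """Return (total score, per-phrase counts) for one document string."""
    counts = {}
    score = 0
    for phrase, weight in weights.items():
        # Allow any run of whitespace between words and match whole words only.
        words = [re.escape(w) for w in phrase.split()]
        pattern = r"\b" + r"\s+".join(words) + r"\b"
        n = len(re.findall(pattern, document, re.IGNORECASE))
        if n:
            counts[phrase] = n
            score += weight * n
    return score, counts

# Hypothetical sample document, not taken from a real feed.
doc = "State aid arrived, but cash outflows continue and cash flow problems persist."
score, counts = score_document(doc, ngramlist)
print(score, counts)
```

Here each of the three phrases occurs once, so the score is -1 + 1 - 2 = -2.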

So the pseudocode would be:

  1. Split the document into a list of lines, and each line into a list of words.
  2. Then loop through each word in the line, looping both forwards and backwards in the line to generate the various phrases.
  3. As each phrase is generated, keep a global dictionary of the phrase and its count of occurrences.
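One possible reading of those steps is sketched below. It uses a forward sliding window of 2 and 3 words instead of looping in both directions, and it counts only phrases that are already in the dictionary. The name `count_phrases`, the `sizes` parameter, and the sample text are all assumptions made for the example.

```python
from collections import Counter

def count_phrases(document, phrases, sizes=(2, 3)):
    """Count occurrences of known phrases using sliding word windows."""
    counts = Counter()
    for line in document.splitlines():
        # Lowercase each word and strip surrounding punctuation.
        words = [w.strip('.,;:!?"\'()').lower() for w in line.split()]
        words = [w for w in words if w]
        for n in sizes:
            # Every run of n consecutive words is a candidate phrase.
            for i in range(len(words) - n + 1):
                phrase = " ".join(words[i:i + n])
                if phrase in phrases:
                    counts[phrase] += 1
    return counts

# Hypothetical sample; the phrases are keys from the question's dictionary.
phrases = {"bail out", "cut costs", "pay more taxes"}
doc = ("The firm will cut costs.\n"
       "Owners pay more taxes after the bail out; they cut costs again.")
result = count_phrases(doc, phrases)
print(result)
```

In this sample "cut costs" is counted twice and the other two phrases once. Because the windows are built per line, a phrase that spans a line break is not counted.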

Here is code for the simple case of finding the count of each phrase in a document, which seems to be what you are trying to do:

    import re

    # Sample document: the text of the question itself.
    text = """
    I have a dictionary of 2 and 3 word phrases that I want to search in rss feeds for a match.
    I grab   rss feeds, process them and end up with a string in a list entitled "documents".
    I want to check the dictionary below and if any of the phrases in the dictionary match
    part of a string of text I want to return the values for the key.
    I am not sure about the best way to approach this problem. Any suggestions would be appreciated.
    """

    ngrams = ["grab rss", "approach this", "in"]

    counts = {}
    for ngram in ngrams:
        words = ngram.split()
        # Join the words with \s+ so any run of whitespace between them matches.
        pattern = re.compile(r"\s+".join(words), re.IGNORECASE)
        counts[ngram] = len(pattern.findall(text))

    print(counts)

Output:

{'grab rss': 1, 'approach this': 1, 'in': 5} 
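Note that the pattern above matches substrings, so the count for "in" also includes the "in" inside "string". If only whole-word matches are wanted, one possible variant wraps the pattern in `\b` word boundaries and escapes the words with `re.escape`. The helper name `count_whole` and the sample sentence are hypothetical.

```python
import re

def count_whole(ngram, text):
    """Count whole-word, case-insensitive occurrences of ngram in text."""
    words = [re.escape(w) for w in ngram.split()]
    pattern = re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)
    return len(pattern.findall(text))

sample = "A string in a list, and a phrase in a string."
whole = count_whole("in", sample)        # standalone "in" only
loose = len(re.findall("in", sample))    # also counts "in" inside "string"
print(whole, loose)
```

For this sample the whole-word count is 2, while the plain substring count is 4.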
