java - Lucene: exact matches aren't shown first


I am using the demo IndexFiles and SearchFiles classes from the org.apache.lucene.demo package to index and search.

My issue is that when I use a query that contains more than one word, the results that have an exact match are not shown first. For instance:

Enter query: "natural language"
Searching for: "natural language"
298 total matching documents
1. download\researchers.uq.edu.au\fields-of-research\natural-language-processing.txt
2. download\researchers.uq.edu.au\research-project\16267.txt
3. download\researchers.uq.edu.au\research-project\16279.txt
4. download\researchers.uq.edu.au\research-project\18361.txt
5. download\www.uq.edu.au\news\%3farticle%3d2187.txt
6. download\researchers.uq.edu.au\researcher\2115.txt
7. download\ceit.uq.edu.au\content\2013-2014-summer-research-scholarship-projects-dr-alan-cody%3fpage%3d1.txt
8. download\ceit.uq.edu.au\content\2013-2014-summer-research-scholarship-projects-dr-alan-cody%3fpage%3d2.txt
9. download\ceit.uq.edu.au\content\2013-2014-summer-research-scholarship-projects-dr-alan-cody.txt
10. download\www.ceit.uq.edu.au\content\2013-2014-summer-research-scholarship-projects-dr-alan-cody.txt
Press (n)ext page, (q)uit or enter number to jump to a page.

does not give the same results as:

Enter query: natural language
Searching for: natural language
54307 total matching documents
1. download\cyberschool.library.uq.edu.au\display_resource.phtml%3frid%3d190.txt
2. download\cyberschool.library.uq.edu.au\display_resource.phtml%3frid%3d576.txt
3. download\cyberschool.library.uq.edu.au\display_resource.phtml%3frid%3d46.txt
4. download\espace.library.uq.edu.au\view\uq%3a166163.txt
5. download\cyberschool.library.uq.edu.au\display_resource.phtml%3frid%3d108.txt
6. download\cyberschool.library.uq.edu.au\display_resource.phtml%3frid%3d70.txt
7. download\cyberschool.library.uq.edu.au\display_resource.phtml%3frid%3d708.txt
8. download\researchers.uq.edu.au\fields-of-research\natural-language-processing.txt
9. download\researchers.uq.edu.au\research-project\16267.txt
10. download\cyberschool.library.uq.edu.au\display_resource.phtml%3frid%3d117.txt
Press (n)ext page, (q)uit or enter number to jump to a page.

For instance, the first matching document does not seem to contain the "language" keyword.

If I use the explain() method of the IndexSearcher class, I get this for the first result:

1. download\cyberschool.library.uq.edu.au\display_resource.phtml%3frid%3d190.txt
0.70643383 = (MATCH) sum of:
  0.5590494 = (MATCH) weight(contents:natural in 62541) [DefaultSimilarity], result of:
    0.5590494 = score(doc=62541,freq=4.0 = termFreq=4.0), product of:
      0.8091749 = queryWeight, product of:
        4.4216847 = idf(docFreq=13111, maxDocs=401502)
        0.18300149 = queryNorm
      0.6908882 = fieldWeight in 62541, product of:
        2.0 = tf(freq=4.0), with freq of:
          4.0 = termFreq=4.0
        4.4216847 = idf(docFreq=13111, maxDocs=401502)
        0.078125 = fieldNorm(doc=62541)
  0.1473844 = (MATCH) weight(contents:language in 62541) [DefaultSimilarity], result of:
    0.1473844 = score(doc=62541,freq=1.0 = termFreq=1.0), product of:
      0.5875679 = queryWeight, product of:
        3.2107275 = idf(docFreq=44012, maxDocs=401502)
        0.18300149 = queryNorm
      0.25083807 = fieldWeight in 62541, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        3.2107275 = idf(docFreq=44012, maxDocs=401502)
        0.078125 = fieldNorm(doc=62541)

If I click next, I find a result such as this:

19. download\www.uq.edu.au\news\%3farticle%3d2187.txt
0.47449595 = (MATCH) sum of:
  0.2795247 = (MATCH) weight(contents:natural in 35173) [DefaultSimilarity], result of:
    0.2795247 = score(doc=35173,freq=4.0 = termFreq=4.0), product of:
      0.8091749 = queryWeight, product of:
        4.4216847 = idf(docFreq=13111, maxDocs=401502)
        0.18300149 = queryNorm
      0.3454441 = fieldWeight in 35173, product of:
        2.0 = tf(freq=4.0), with freq of:
          4.0 = termFreq=4.0
        4.4216847 = idf(docFreq=13111, maxDocs=401502)
        0.0390625 = fieldNorm(doc=35173)
  0.19497125 = (MATCH) weight(contents:language in 35173) [DefaultSimilarity], result of:
    0.19497125 = score(doc=35173,freq=7.0 = termFreq=7.0), product of:
      0.5875679 = queryWeight, product of:
        3.2107275 = idf(docFreq=44012, maxDocs=401502)
        0.18300149 = queryNorm
      0.33182758 = fieldWeight in 35173, product of:
        2.6457512 = tf(freq=7.0), with freq of:
          7.0 = termFreq=7.0
        3.2107275 = idf(docFreq=44012, maxDocs=401502)
        0.0390625 = fieldNorm(doc=35173)

which is a page that contains the exact phrase "natural language". My questions are:

1) Why does Lucene not show exact matches first?

2) Why does Lucene show results that do not contain the keyword?

3) Where/how can I change this so that the exact matches are shown first, followed by the more relevant ones?

1 - It isn't intended to. See the documentation on Lucene query syntax. The query natural language is a query made of two separate terms, and a document matches if it contains either one. On its own, Lucene has no preference for terms that appear close together. If you want to find exact matches, a phrase query is the correct approach: "natural language"

2 - Both results in the included explanation contain matches for both terms, see:

0.2795247 = (MATCH) weight(contents:natural in 35173) [DefaultSimilarity], result of:
  0.2795247 = score(doc=35173,freq=4.0 = termFreq=4.0
...
0.19497125 = (MATCH) weight(contents:language in 35173) [DefaultSimilarity], result of:
  0.19497125 = score(doc=35173,freq=7.0 = termFreq=7.0

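According to Lucene, it found the term "natural" 4 times in that document, and "language" 7 times, in the contents field (which I assume is the default field). The explanation for the first result likewise reports one occurrence of "language" (freq=1.0) in document 62541.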
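
The figures in the two explanations can be reproduced by hand, which also shows why document 62541 outranks 35173 even though it contains "language" only once. The following is a rough sketch of the DefaultSimilarity arithmetic (tf is the square root of the term frequency, idf is 1 + ln(maxDocs / (docFreq + 1))), using only numbers copied from the output above. The class and method names are made up for illustration, and the code uses double rather than Lucene's float, so the last digits can differ slightly.

```java
public class ExplainArithmetic {
    // Values copied from the explain() output in the question.
    static final long MAX_DOCS = 401502;
    static final double QUERY_NORM = 0.18300149;

    // idf as DefaultSimilarity computes it: 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(long docFreq, long maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // Contribution of one term: queryWeight * fieldWeight.
    static double termScore(double freq, double idf, double queryNorm, double fieldNorm) {
        double queryWeight = idf * queryNorm;
        double fieldWeight = Math.sqrt(freq) * idf * fieldNorm; // tf = sqrt(freq)
        return queryWeight * fieldWeight;
    }

    // Score of a document for the two-term query natural language.
    static double docScore(double naturalFreq, double languageFreq, double fieldNorm) {
        double natural = termScore(naturalFreq, idf(13111, MAX_DOCS), QUERY_NORM, fieldNorm);
        double language = termScore(languageFreq, idf(44012, MAX_DOCS), QUERY_NORM, fieldNorm);
        return natural + language;
    }

    public static void main(String[] args) {
        // Document 62541: natural x4, language x1, fieldNorm 0.078125.
        System.out.println("doc 62541: " + docScore(4, 1, 0.078125));
        // Document 35173: natural x4, language x7, fieldNorm 0.0390625.
        System.out.println("doc 35173: " + docScore(4, 7, 0.0390625));
    }
}
```

The only input that favours document 62541 is fieldNorm, which is twice as large (0.078125 against 0.0390625). With DefaultSimilarity that value shrinks as the field gets longer, so the shorter document wins despite having fewer occurrences of "language". Nothing in this calculation looks at whether the two terms are adjacent.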

3 - Read up on the query parser syntax, and see what makes sense for you. It sounds like you might find proximity searches useful.

If you want phrase matches first, followed by the others, use something along the lines of:

"natural language" natural language 

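To see why adding the phrase as an extra clause pushes exact matches up, here is a toy model in plain Java. It is not Lucene's scoring and uses no Lucene classes: it only counts term occurrences and adjacent, in-order phrase occurrences, and sums them the way a query with several optional clauses sums its clause scores. The class name, method names and sample texts are invented for the illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PhraseClauseSketch {
    // Very rough stand-in for an analyzer: lowercase and split on whitespace.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    // Term clauses: each occurrence of any query term adds one point.
    static int termScore(List<String> doc, List<String> terms) {
        int score = 0;
        for (String term : terms) {
            score += Collections.frequency(doc, term);
        }
        return score;
    }

    // Phrase clause: counts positions where the terms appear adjacent and in order.
    static int phraseScore(List<String> doc, List<String> phrase) {
        int count = 0;
        for (int i = 0; i + phrase.size() <= doc.size(); i++) {
            if (doc.subList(i, i + phrase.size()).equals(phrase)) {
                count++;
            }
        }
        return count;
    }

    // Toy equivalent of the query "natural language" natural language.
    static int combinedScore(List<String> doc, List<String> phrase) {
        return phraseScore(doc, phrase) + termScore(doc, phrase);
    }

    public static void main(String[] args) {
        List<String> query = tokenize("natural language");
        List<String> scattered = tokenize("natural resources and language");
        List<String> adjacent = tokenize("natural language processing");

        System.out.println("scattered: terms=" + termScore(scattered, query)
                + " combined=" + combinedScore(scattered, query));
        System.out.println("adjacent:  terms=" + termScore(adjacent, query)
                + " combined=" + combinedScore(adjacent, query));
    }
}
```

With the term clauses alone both sample documents tie at 2. Once the phrase clause is added, the document with the adjacent terms scores 3 against 2, while the other document still matches and is simply ranked lower. Real Lucene scores also depend on idf and field length, so an exact match is favoured rather than guaranteed to come first.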