java - Lucene: exact matches aren't shown first
I am using the demo IndexFiles and SearchFiles classes in the org.apache.lucene.demo package to index and search.
My issue is that when I use a query that contains more than one word, I am not getting the results that have an exact match first. For instance:
Enter query: "natural language"
Searching for: "natural language"
298 total matching documents
1. download\researchers.uq.edu.au\fields-of-research\natural-language-processing.txt
2. download\researchers.uq.edu.au\research-project\16267.txt
3. download\researchers.uq.edu.au\research-project\16279.txt
4. download\researchers.uq.edu.au\research-project\18361.txt
5. download\www.uq.edu.au\news\%3farticle%3d2187.txt
6. download\researchers.uq.edu.au\researcher\2115.txt
7. download\ceit.uq.edu.au\content\2013-2014-summer-research-scholarship-projects-dr-alan-cody%3fpage%3d1.txt
8. download\ceit.uq.edu.au\content\2013-2014-summer-research-scholarship-projects-dr-alan-cody%3fpage%3d2.txt
9. download\ceit.uq.edu.au\content\2013-2014-summer-research-scholarship-projects-dr-alan-cody.txt
10. download\www.ceit.uq.edu.au\content\2013-2014-summer-research-scholarship-projects-dr-alan-cody.txt
Press (n)ext page, (q)uit or enter number to jump to a page.
This does not give the same results as:
Enter query: natural language
Searching for: natural language
54307 total matching documents
1. download\cyberschool.library.uq.edu.au\display_resource.phtml%3frid%3d190.txt
2. download\cyberschool.library.uq.edu.au\display_resource.phtml%3frid%3d576.txt
3. download\cyberschool.library.uq.edu.au\display_resource.phtml%3frid%3d46.txt
4. download\espace.library.uq.edu.au\view\uq%3a166163.txt
5. download\cyberschool.library.uq.edu.au\display_resource.phtml%3frid%3d108.txt
6. download\cyberschool.library.uq.edu.au\display_resource.phtml%3frid%3d70.txt
7. download\cyberschool.library.uq.edu.au\display_resource.phtml%3frid%3d708.txt
8. download\researchers.uq.edu.au\fields-of-research\natural-language-processing.txt
9. download\researchers.uq.edu.au\research-project\16267.txt
10. download\cyberschool.library.uq.edu.au\display_resource.phtml%3frid%3d117.txt
Press (n)ext page, (q)uit or enter number to jump to a page.
For instance, the first matching document does not contain the "language" keyword.
If I use the explain() method of the IndexSearcher class, I get this result for the first one:
1. download\cyberschool.library.uq.edu.au\display_resource.phtml%3frid%3d190.txt
0.70643383 = (match) sum of:
  0.5590494 = (match) weight(contents:natural in 62541) [defaultsimilarity], result of:
    0.5590494 = score(doc=62541,freq=4.0 = termfreq=4.0), product of:
      0.8091749 = queryweight, product of:
        4.4216847 = idf(docfreq=13111, maxdocs=401502)
        0.18300149 = querynorm
      0.6908882 = fieldweight in 62541, product of:
        2.0 = tf(freq=4.0), with freq of:
          4.0 = termfreq=4.0
        4.4216847 = idf(docfreq=13111, maxdocs=401502)
        0.078125 = fieldnorm(doc=62541)
  0.1473844 = (match) weight(contents:language in 62541) [defaultsimilarity], result of:
    0.1473844 = score(doc=62541,freq=1.0 = termfreq=1.0), product of:
      0.5875679 = queryweight, product of:
        3.2107275 = idf(docfreq=44012, maxdocs=401502)
        0.18300149 = querynorm
      0.25083807 = fieldweight in 62541, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termfreq=1.0
        3.2107275 = idf(docfreq=44012, maxdocs=401502)
        0.078125 = fieldnorm(doc=62541)
If I click next, I find a result such as this:
19. download\www.uq.edu.au\news\%3farticle%3d2187.txt
0.47449595 = (match) sum of:
  0.2795247 = (match) weight(contents:natural in 35173) [defaultsimilarity], result of:
    0.2795247 = score(doc=35173,freq=4.0 = termfreq=4.0), product of:
      0.8091749 = queryweight, product of:
        4.4216847 = idf(docfreq=13111, maxdocs=401502)
        0.18300149 = querynorm
      0.3454441 = fieldweight in 35173, product of:
        2.0 = tf(freq=4.0), with freq of:
          4.0 = termfreq=4.0
        4.4216847 = idf(docfreq=13111, maxdocs=401502)
        0.0390625 = fieldnorm(doc=35173)
  0.19497125 = (match) weight(contents:language in 35173) [defaultsimilarity], result of:
    0.19497125 = score(doc=35173,freq=7.0 = termfreq=7.0), product of:
      0.5875679 = queryweight, product of:
        3.2107275 = idf(docfreq=44012, maxdocs=401502)
        0.18300149 = querynorm
      0.33182758 = fieldweight in 35173, product of:
        2.6457512 = tf(freq=7.0), with freq of:
          7.0 = termfreq=7.0
        3.2107275 = idf(docfreq=44012, maxdocs=401502)
        0.0390625 = fieldnorm(doc=35173)
That page contains the exact phrase "natural language". My questions are:
1) Why does Lucene not show exact matches first?
2) Why does Lucene show a result that does not contain the keyword?
3) Where/how can I change this so that it shows the exact matches first, followed by the more relevant ones?
1 - It isn't intended to. See the documentation on Lucene query syntax. The query natural language is a query made up of two terms. On its own, Lucene has no preference for terms that are close together. If you want to find exact matches, a phrase query is the correct approach: "natural language"
2 - Both results in the explanations you included contain matches for both terms, see:
0.2795247 = (match) weight(contents:natural in 35173) [defaultsimilarity], result of:
  0.2795247 = score(doc=35173,freq=4.0 = termfreq=4.0
...
0.19497125 = (match) weight(contents:language in 35173) [defaultsimilarity], result of:
  0.19497125 = score(doc=35173,freq=7.0 = termfreq=7.0
According to Lucene, it found the term "natural" 4 times in that document, and "language" 7 times, in the contents field (which I assume is your default field). The explanation for the first result likewise shows "language" with freq=1.0, so that document does contain the keyword.
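To see why document 62541 still ranks above 35173 even though "language" occurs only once in it, the arithmetic in the two explanations can be redone by hand. The following is a minimal sketch, not Lucene's own code: the class and method names are made up, and it only reproduces the factors visible in the explain() output (tf as the square root of the frequency, queryweight = idf * querynorm, fieldweight = tf * idf * fieldnorm). All numbers are copied from the question.

```java
final class ClassicScoreSketch {

    /** tf as shown in the explain() output: the square root of the raw term frequency. */
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    /** Score of one term clause: queryweight * fieldweight, as in the explain() tree. */
    static double termScore(double freq, double idf, double queryNorm, double fieldNorm) {
        double queryWeight = idf * queryNorm;
        double fieldWeight = tf(freq) * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    public static void main(String[] args) {
        // Constants copied from the explain() output in the question.
        double queryNorm = 0.18300149;
        double idfNatural = 4.4216847;
        double idfLanguage = 3.2107275;

        // Document 62541: natural x4, language x1, fieldnorm 0.078125
        double doc62541 = termScore(4.0, idfNatural, queryNorm, 0.078125)
                        + termScore(1.0, idfLanguage, queryNorm, 0.078125);

        // Document 35173: natural x4, language x7, fieldnorm 0.0390625
        double doc35173 = termScore(4.0, idfNatural, queryNorm, 0.0390625)
                        + termScore(7.0, idfLanguage, queryNorm, 0.0390625);

        System.out.printf("doc 62541: %.6f%n", doc62541);
        System.out.printf("doc 35173: %.6f%n", doc35173);
    }
}
```

The two sums come out close to the 0.70643383 and 0.47449595 reported by explain(); small differences are expected because Lucene computes in single precision. For "natural" the only factor that differs between the documents is fieldnorm (0.078125 against 0.0390625), which with the default similarity mostly reflects field length, so the first document is most likely the shorter one and that is enough to outweigh the extra occurrences of "language" in the second.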
3 - Read up on the query parser syntax and see what makes sense for you. It sounds like you might find proximity searches useful.
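A few of the query-string forms from that syntax can be produced with a small helper such as the one below. This is only a sketch: the class and method names are hypothetical, it builds plain strings for the query parser rather than calling Lucene itself, and it assumes the input is plain words with no special query characters. In the parser syntax, ~N after a quoted phrase is the proximity (slop) operator and ^N is a boost.

```java
final class QuerySyntaxSketch {

    /** Wraps the words in double quotes so the parser treats them as a phrase. */
    static String phrase(String words) {
        return "\"" + words.trim() + "\"";
    }

    /** Proximity search: the words must occur within `slop` positions of each other. */
    static String proximity(String words, int slop) {
        return phrase(words) + "~" + slop;
    }

    /** Boosted phrase clause followed by the individual terms as ordinary clauses. */
    static String boostedPhraseThenTerms(String words, int boost) {
        return phrase(words) + "^" + boost + " " + words.trim();
    }

    public static void main(String[] args) {
        String input = "natural language";
        System.out.println(phrase(input));                    // "natural language"
        System.out.println(proximity(input, 5));              // "natural language"~5
        System.out.println(boostedPhraseThenTerms(input, 2)); // "natural language"^2 natural language
    }
}
```

Any of these strings can be typed at the SearchFiles prompt or handed to the query parser. The slop of 5 and the boost of 2 are arbitrary illustration values. If the input may contain characters that are special to the parser, escape it first, for example with the parser's static QueryParser.escape method.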
If you want phrase matches first, followed by the others, you could use something along the lines of:
"natural language" natural language