[Big Data Development and Operations Solutions] Analysis: why Solr 6.2's default similarity gives higher match scores than Solr 5.1

和通数据库 (htsjk.com), 2023-03-26 02:02. Source: unknown.

Tags: linux, big data, solr



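Note:
We previously ran Solr 5.1 with the jcseg 1.9.6 tokenizer and later moved to Solr 6.2 with jcseg 2.6.0. Using the same tables from the same Oracle database, we applied the same field configuration to the template collection configsets of Solr 5.1 and Solr 6.2, indexed the data successfully, and ran identical queries. The document scores returned by Solr 6.2 were far higher than those returned by Solr 5.1. The two Solr environments we use, plus a standalone Solr test environment, are listed below: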

Big data environment	Solr version
CDH	Solr 5.1
Huawei Cloud	Solr 6.2
Standalone	Open-source Solr 6.2

I. Reproducing the problem

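We have three Solr environments: Huawei Cloud Solr 6.2, CDH Solr 5.1, and open-source Solr 6.2. All three index data pulled from the same Oracle 11.2.0.4 tables with the same extraction logic. The collection or core names are uoc-buyer1, uoc-buyer, and uoc-buyer respectively. We ran the following query against each of the three environments: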

q=engname%3A(ADNAN+UL+HAQ)%5E5+buyeraddr%3A(PP+NO+AF1401302+ADD%5C%3ATOBA+TEK+SINGH%2CPAKISTAN)%5E2+flag%3A(0%5E2+1%5E1)
&fq=%7B!frange+l%3D1.8%7Dquery(%24q)
&fq=-accuracylel%3A1
&fq=countrycode%3APAK
&fq=flag%3A0
&sort=flag+asc%2Caccuracylel+desc%2Cscore+desc
&fl=chnname%2Cengname%2Cbuyeraddr%2Cpuppetbn%2Cflag%2Ctableflag%2Cscore%2Ccountrycode
&wt=json
&indent=true
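
The URL-encoded parameters above are hard to read at a glance. The sketch below decodes them with Python's standard library. `RAW_QUERY` is copied from the request above, `decode_params` is a helper name introduced here for illustration, and nothing is sent to Solr. Note the `{!frange l=1.8}query($q)` filter: it keeps only documents whose score for `q` is at least 1.8, so a change in the score scale directly changes which documents pass a threshold tuned on another version.

```python
from urllib.parse import parse_qsl

# Query string copied from the request above, one parameter per line.
RAW_QUERY = (
    "q=engname%3A(ADNAN+UL+HAQ)%5E5+buyeraddr%3A(PP+NO+AF1401302+ADD%5C%3ATOBA+TEK+SINGH%2CPAKISTAN)%5E2+flag%3A(0%5E2+1%5E1)"
    "&fq=%7B!frange+l%3D1.8%7Dquery(%24q)"
    "&fq=-accuracylel%3A1"
    "&fq=countrycode%3APAK"
    "&fq=flag%3A0"
    "&sort=flag+asc%2Caccuracylel+desc%2Cscore+desc"
    "&fl=chnname%2Cengname%2Cbuyeraddr%2Cpuppetbn%2Cflag%2Ctableflag%2Cscore%2Ccountrycode"
    "&wt=json"
    "&indent=true"
)


def decode_params(raw):
    """Return the decoded (name, value) pairs in request order."""
    return parse_qsl(raw, keep_blank_values=True)


if __name__ == "__main__":
    for name, value in decode_params(RAW_QUERY):
        print(f"{name} = {value}")
```

Decoded, the main query is three boosted clauses (`engname` with boost 5, `buyeraddr` with boost 2, and `flag`), which matters later because the two similarity implementations treat boosts differently.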

Scores from the open-source and Huawei Cloud Solr 6.2:

Scores from CDH Solr 5.1:

II. Problem analysis

1. Is the cause a tokenizer version mismatch?

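We started out on Solr 5.1, so the related code and the similarity score threshold were both calibrated against Solr 5.1. When we later carried the same collection logic over to the open-source and Huawei Cloud Solr 6.2, the scores differed greatly.
We first suspected the tokenizer, because both Solr 6.2 instances use jcseg 2.6.0 while the CDH Solr 5.1 uses 1.9.6. We therefore ran the address being queried through the analysis feature of all three Solr instances:
The two Solr 6.2 instances using jcseg 2.6.0: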

Solr 5.1 using jcseg 1.9.6:

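Comparing the two results, the English tokenization of 1.9.6 looks more usable. Version 2.6.0 splits too finely and breaks apart words that should stay together.
We therefore adjusted the jcseg 2.6.0 configuration to follow the jcseg 1.9.6 defaults. The final jcseg.properties inside jcseg-core-2.6.0.jar reads: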

# Jcseg properties file.
# @Note: 
# true | 1 | on for open the specified configuration or
# false | 0 | off to close it.
# bug report chenxin <chenxin619315@gmail.com>

# Jcseg function
#maximum match length. (5-7)
jcseg.maxlen = 5

#Whether to recognized the Chinese name.
jcseg.icnname = true

#maximum chinese word number of english chinese mixed word. 
jcseg.mixcnlen = 3

#maximum length for pair punctuation text.
jcseg.pptmaxlen = 7

#maximum length for Chinese last name andron.
jcseg.cnmaxlnadron = 1

#Whether to clear the stopwords.
jcseg.clearstopword = false

#Whether to convert the Chinese numeric to Arabic number. like '\u4E09\u4E07' to 30000.
jcseg.cnnumtoarabic = true

#Whether to convert the Chinese fraction to Arabic fraction.
#@Note: for lucene,solr,elasticsearch eg.. close it.
jcseg.cnfratoarabic = false

#Whether to keep the unrecognized word.
jcseg.keepunregword = true

#Whether to do the secondary segmentation for the complex English words
jcseg.ensecondseg = true

#min length of the secondary simple token. (better larger than 1)
jcseg.stokenminlen = 2

#minimum length of the secondary segmentation token.
jcseg.ensecminlen = 1

#Whether to do the English word segmentation
#the jcseg.ensecondseg must set to true before active this function
jcseg.enwordseg = false

#maximum match length for English extracted word
jcseg.enmaxlen = 16

#threshold for Chinese name recognize.
# better not change it before you know what you are doing.
jcseg.nsthreshold = 1000000

#The punctuation set that will be keep in an token.(Not the end of the token).
jcseg.keeppunctuations = @#%.&+

#Whether to append the pinyin of the entry.
jcseg.appendpinyin = false

#Whether to load and append the synonyms words of the entry.
jcseg.appendsyn = true


####for Tokenizer
#default delimiter for JcsegDelimiter tokenizer
#set to default or whitespace will use the default whitespace as delimiter
#or set to the char you want, like ',' or whatever
jcseg.delimiter = default

#default length for the N-gram tokenizer
jcseg.gram = 1


####about the lexicon
#absolute path of the lexicon file.
#Multiple path support from jcseg 1.9.2, use ';' to split different path.
#example: lexicon.path = /home/chenxin/lex1;/home/chenxin/lex2 (Linux)
#        : lexicon.path = D:/jcseg/lexicon/1;D:/jcseg/lexicon/2 (WinNT)
#lexicon.path=/Code/java/JavaSE/jcseg/lexicon
#lexicon.path = {jar.dir}/lexicon ({jar.dir} means the base directory of jcseg-core-{version}.jar)
#@since 1.9.9 Jcseg default to load the lexicons in the classpath
lexicon.path = {jar.dir}/lexicon

#Whether to load the modified lexicon file auto.
lexicon.autoload = true

#Poll time for auto load. (seconds)
lexicon.polltime = 30


####lexicon load
#Whether to load the part of speech of the entry.
jcseg.loadpos = true

#Whether to load the pinyin of the entry.
jcseg.loadpinyin = false

#Whether to load the synonyms words of the entry.
jcseg.loadsyn = true

#Whether to load the entity of the entry
jcseg.loadentity = true
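
When aligning one jcseg.properties with another, it helps to list only the keys whose values differ. The following is a minimal sketch, assuming simple `key = value` lines with `#` comments as in the file above; it does not handle the full Java properties syntax (escapes, `:` separators, line continuations). The helper names and the two sample fragments are illustrative only and are not the real defaults of either jcseg version.

```python
def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an '=' separator
            props[key.strip()] = value.strip()
    return props


def diff_properties(old, new):
    """Return {key: (old_value, new_value)} for keys that differ or exist on one side only."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Illustrative fragments only, NOT the actual defaults shipped with jcseg.
OLD_SAMPLE = "# old\njcseg.maxlen = 5\njcseg.ensecondseg = true\n"
NEW_SAMPLE = "# new\njcseg.maxlen = 7\njcseg.ensecondseg = true\njcseg.gram = 1\n"

if __name__ == "__main__":
    changes = diff_properties(parse_properties(OLD_SAMPLE), parse_properties(NEW_SAMPLE))
    for key, (old_value, new_value) in changes.items():
        print(f"{key}: {old_value!r} -> {new_value!r}")
```

In practice the two inputs would be the contents of the jcseg.properties files extracted from each jar.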

Tokenization result after modifying the jcseg configuration:

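The output is now essentially the same as with jcseg 1.9.6. We then re-indexed both Solr 6.2.0 instances and ran the earlier query again. The scores were still above 100.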

2. Is the cause a difference in the default similarity between Solr 6 and Solr 5?

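The experiment above suggested that tokenization differences were not the whole story, and that the larger factor was a change in the similarity scoring algorithm between Solr 5 and Solr 6. To rule out the tokenizer entirely, we replaced the tokenizer in Solr 6.2 with the exact one used by Solr 5.1, re-indexed the same data, and ran the same query. The scores were still very high, which shows that the main cause of the score gap is the different algorithms in the two Solr versions.
Searching online confirmed that the default similarity did change between Solr 5 and Solr 6: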

Default similarity changes

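When the schema does not explicitly define a global <similarity/>, Solr's default behavior depends on the
luceneMatchVersion specified in solrconfig.xml. When luceneMatchVersion < 6.0, an instance of
ClassicSimilarityFactory is used; otherwise an instance of SchemaSimilarityFactory is used.
Most notably, this change means that users can take advantage of per-field-type similarity declarations
without also needing to explicitly declare a global usage of SchemaSimilarityFactory.
Whether it is explicitly declared or used as the implicit global default, the implicit behavior of
SchemaSimilarityFactory when a field type does not declare an explicit <similarity/> has also been
changed to depend on luceneMatchVersion. When luceneMatchVersion < 6.0, an instance of
ClassicSimilarity is used; otherwise an instance of BM25Similarity is used. A
defaultSimFromFieldType init option may be specified on the SchemaSimilarityFactory declaration
to change this behavior. See the SchemaSimilarityFactory javadocs for more details.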
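
To illustrate why switching from ClassicSimilarity to BM25Similarity can change the score scale so much for a boosted query, here is a simplified sketch. It assumes the Lucene 6-era behavior as commonly documented: the classic TF-IDF path divides by a query norm computed from the boosted term weights, while BM25 applies boosts directly with no query normalization. The document statistics are made up, and the sketch ignores coord, the lossy encoding of length norms, and nested boolean boosts, so the numbers are not comparable to real Solr output.

```python
import math


def classic_idf(doc_freq, doc_count):
    # TF-IDF style idf in the form used by ClassicSimilarity (assumed here).
    return 1.0 + math.log((doc_count + 1) / (doc_freq + 1))


def bm25_idf(doc_freq, doc_count):
    return math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def classic_score(terms, doc_count, field_len):
    """terms: list of (freq, doc_freq, boost) for query terms that match the document."""
    weights = [classic_idf(df, doc_count) * boost for _, df, boost in terms]
    # queryNorm rescales all term weights, so a uniform boost cancels out.
    query_norm = 1.0 / math.sqrt(sum(w * w for w in weights))
    length_norm = 1.0 / math.sqrt(field_len)
    return sum(
        math.sqrt(freq) * classic_idf(df, doc_count) * w * query_norm * length_norm
        for (freq, df, _), w in zip(terms, weights)
    )


def bm25_score(terms, doc_count, field_len, avg_field_len, k1=1.2, b=0.75):
    """Same inputs as classic_score; boosts multiply the score directly."""
    score = 0.0
    for freq, df, boost in terms:
        tf = freq * (k1 + 1) / (freq + k1 * (1 - b + b * field_len / avg_field_len))
        score += boost * bm25_idf(df, doc_count) * tf
    return score


if __name__ == "__main__":
    # Made-up statistics: two matching terms with boosts 5 and 2.
    toy_terms = [(1, 100, 5.0), (1, 1000, 2.0)]
    print("classic:", classic_score(toy_terms, 1_000_000, 4))
    print("bm25:   ", bm25_score(toy_terms, 1_000_000, 4, 4.0))
```

Under these assumptions the BM25 result grows linearly with the boosts, whereas the classic result is damped by the query norm. That is consistent with the boosted query in this article scoring far higher on Solr 6.2, though the exact contribution in a real index depends on its statistics.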

We therefore edited the managed-schema of Solr 6.2 and added an explicit global similarity declaration:

<similarity class="solr.ClassicSimilarityFactory"/>
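
Before re-indexing, it can be worth confirming that the schema really declares the similarity at the global level, that is, as a direct child of `<schema>` rather than inside a field type. The following is a small sketch using only the standard library; `global_similarity_class` and the sample schema are hypothetical and much shorter than a real managed-schema.

```python
import xml.etree.ElementTree as ET


def global_similarity_class(schema_xml):
    """Return the class of the top-level <similarity> element, or None if absent."""
    root = ET.fromstring(schema_xml)
    node = root.find("similarity")  # matches direct children of <schema> only
    return None if node is None else node.get("class")


# Hypothetical, heavily shortened schema used only to exercise the helper.
SAMPLE_SCHEMA = """<schema name="example" version="1.6">
  <fieldType name="text_general" class="solr.TextField"/>
  <similarity class="solr.ClassicSimilarityFactory"/>
</schema>"""

if __name__ == "__main__":
    print(global_similarity_class(SAMPLE_SCHEMA))
```

A similarity declared inside a `<fieldType>` is deliberately not reported, since it is a per-field-type setting rather than the global one.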

Because indexing in this environment was slow, we also changed the indexing parallelism in solrconfig.xml at the same time:

<maxIndexingThreads>32</maxIndexingThreads>

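We restarted Solr and re-indexed. Indexing finished about an hour faster than before, and when we ran the same query again the scores were similar to those from Solr 5.1: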
