Solr - configuration for Chinese and correct results for German umlauts


Revision as of 13:09, 4 July 2014

Configuration for Chinese

This is a configuration of schema.xml (in /WEB-INF/solr/conf/) for the OpenCms Solr search on a Chinese website:

<analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PositionFilterFactory" />
</analyzer>

Configuration for Chinese search with Solr

<analyzer type="index">
   <tokenizer class="solr.StandardTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PositionFilterFactory" />
</analyzer>
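
In schema.xml, analyzers like these sit inside a field type definition. The following is a minimal sketch of how the pair above might be wrapped. The field type name `text_zh` and the field name `content_zh` are hypothetical and not taken from the OpenCms schema, so adapt them to the field type your installation actually uses.

```xml
<!-- Sketch only: "text_zh" and "content_zh" are hypothetical names. -->
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- Query time only: places all query tokens at the same position -->
        <filter class="solr.PositionFilterFactory"/>
    </analyzer>
</fieldType>

<!-- A field that uses the type defined above -->
<field name="content_zh" type="text_zh" indexed="true" stored="true"/>
```

Changes to the index analyzer only affect documents indexed afterwards, so the index has to be rebuilt after editing schema.xml.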

Alternative configuration

<analyzer>
   <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
   <filter class="solr.SmartChineseWordTokenFilterFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.PositionFilterFactory" />
</analyzer>

This configuration, suggested by someone on the OpenCms mailing list, did not work for us: with it, the indexing in the Admin area of OpenCms did not even start.

Additional info

The Solr wiki (http://wiki.apache.org/solr/LanguageAnalysis#Chinese.2C_Japanese.2C_Korean) says that the PositionFilterFactory should only be used at query time.
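
Following that advice, the alternative configuration could be split into separate index and query analyzers so that the PositionFilterFactory appears only on the query side. This is an untested sketch: it assumes the SmartChinese analysis classes are available on Solr's classpath, and it is not known whether this change would fix the indexing failure described above.

```xml
<!-- Untested sketch: PositionFilterFactory moved to the query analyzer only. -->
<analyzer type="index">
    <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
    <filter class="solr.SmartChineseWordTokenFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
    <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
    <filter class="solr.SmartChineseWordTokenFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PositionFilterFactory"/>
</analyzer>
```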

Further configuration (in Tomcat)

Our main problem was not finding the correct indexer, but configuring Tomcat properly. Because the Solr search sends GET requests, we had to set the correct URI encoding in server.xml:

   <Connector connectionTimeout="20000" port="8080" protocol="HTTP/1.1" redirectPort="8443" URIEncoding="UTF-8"/>

On our production server we use the AJP connector, so we had to add it there as well:

   <Connector port="8009" protocol="AJP/1.3" redirectPort="8443" URIEncoding="UTF-8"/>

This setting also solved the problem of searching for words with German umlauts (e.g. müller). Because the search still returned results with the wrong settings, it was hard to spot that these were the wrong results.
