Some languages, in particular those written in logographic scripts such as Chinese and Japanese, typically require special tools to split text into words.
This is necessary because such text lacks spaces between words and involves polysemantic phrases, set expressions, and complex lexical rules.
Currently, the GroupDocs.Search library does not include built-in algorithms for splitting Chinese, Japanese, or Korean text into words. However, it allows you to plug in external custom algorithms. To connect an external text segmentation library, implement the IWordSplitter interface and pass an object implementing this interface to the WordSplitter property of the FileIndexing event arguments.
The code example below demonstrates how to use the external library Jieba.NET for Chinese text segmentation.
C#
// Implementing a custom word splitter
public class JiebaWordSplitter : IWordSplitter
{
    private readonly JiebaSegmenter segmenter;

    public JiebaWordSplitter()
    {
        segmenter = new JiebaSegmenter();
    }

    public IEnumerable<string> Split(string text)
    {
        IEnumerable<string> segments = segmenter.Cut(text, cutAll: false);
        return segments;
    }
}

...

string indexFolder = @"c:\MyIndex\";
string documentsFolder = @"c:\MyDocuments\";

// Creating an index in the specified folder
Index index = new Index(indexFolder);

// Using the Jieba segmenter to break text into words
JiebaWordSplitter jiebaWordSplitter = new JiebaWordSplitter();
index.Events.FileIndexing += (s, e) =>
{
    if (e.DocumentFullPath.EndsWith("Chinese.txt"))
    {
        // We know that the text in this document is in Chinese
        e.WordSplitter = jiebaWordSplitter;
    }
};

// Indexing documents
index.Add(documentsFolder);

// Searching in the index
string query = "考虑"; // Consider
SearchResult result = index.Search(query);
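If no external segmenter is available for a particular language, one possible fallback is to index CJK text character by character (unigram segmentation): every ideograph becomes its own token, which preserves recall at the cost of some precision. The sketch below is a hypothetical illustration, not part of the library; only the IWordSplitter interface comes from GroupDocs.Search, the UnigramWordSplitter class name is our own, and the snippet assumes the same context and usings as the example above.

C#
// Hypothetical fallback splitter (not part of GroupDocs.Search):
// emits each CJK ideograph as its own token and groups runs of
// other non-space, non-punctuation characters into single tokens.
public class UnigramWordSplitter : IWordSplitter
{
    public IEnumerable<string> Split(string text)
    {
        var buffer = new System.Text.StringBuilder();
        foreach (char c in text)
        {
            bool isCjk = c >= 0x4E00 && c <= 0x9FFF; // CJK Unified Ideographs block
            if (isCjk || char.IsWhiteSpace(c) || char.IsPunctuation(c))
            {
                // Flush any accumulated non-CJK token first
                if (buffer.Length > 0)
                {
                    yield return buffer.ToString();
                    buffer.Clear();
                }
                if (isCjk)
                {
                    yield return c.ToString();
                }
            }
            else
            {
                buffer.Append(c);
            }
        }
        if (buffer.Length > 0)
        {
            yield return buffer.ToString();
        }
    }
}

An instance of such a splitter can be assigned to e.WordSplitter in the FileIndexing handler in exactly the same way as JiebaWordSplitter above.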
More resources
GitHub examples
You can easily run the code from the documentation articles and see the features in action in our GitHub examples: