The feature described in this article makes it possible to separate the operations of extracting data from a document and adding the extracted data to the index. The extracted data can be easily serialized and deserialized as needed.
This feature is useful because extracting text and other data from documents can take a long time for some formats, whereas adding already extracted data to the index is a relatively lightweight operation. Separating the two operations can therefore significantly increase the time during which the index is not busy with any operation. Adding more data extractors then significantly increases indexing performance. Data extractors can even run on separate servers, completely freeing the indexing server from extraction work.
The Extractor class is used to extract data from documents. When the operation completes, the Extract method returns an instance of the ExtractedData class, which can be added to the index directly.
The following example demonstrates how to perform separate data extraction and indexing.
C#
string indexFolder = @"c:\MyIndex";
string documentPath = @"c:\MyDocuments\MyDocument.pdf";

// Extracting data from a document
Extractor extractor = new Extractor();
Document document = Document.CreateFromFile(documentPath);
ExtractionOptions extractionOptions = new ExtractionOptions();
extractionOptions.UseRawTextExtraction = false;
ExtractedData extractedData = extractor.Extract(document, extractionOptions);

// Serializing the data
byte[] array = extractedData.Serialize();

// Deserializing the data
ExtractedData deserializedData = ExtractedData.Deserialize(array);

// Creating an index
Index index = new Index(indexFolder);

// Indexing the data
ExtractedData[] data = new ExtractedData[] { deserializedData };
index.Add(data, new IndexingOptions());

// Searching in the index
SearchResult result = index.Search("Einstein");
Note that when indexed documents change and need to be updated in the index, the same code must be executed. That is, data must be extracted separately from the updated documents and then the extracted data must be added to the index.
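To illustrate how extraction and indexing could run in different processes or on different servers, the following is a minimal sketch that hands the serialized data over through files. It uses only the GroupDocs.Search calls shown in the example above. The exchange folder, the ".bin" file extension, the class name, and the method names ExtractToFile and IndexFromFolder are assumptions made for illustration, and the using directives assume the usual GroupDocs.Search namespaces. Any other transport for the byte array (a queue, a database, a network call) would work in the same way.

```csharp
using System.IO;
using GroupDocs.Search;
using GroupDocs.Search.Common;
using GroupDocs.Search.Options;

static class SeparateExtractionSketch
{
    // Extraction side: can run on any machine that has access to the document.
    // Hypothetical helper; writes the serialized data to a file for later indexing.
    static void ExtractToFile(string documentPath, string exchangeFolder)
    {
        Extractor extractor = new Extractor();
        Document document = Document.CreateFromFile(documentPath);
        ExtractedData extractedData = extractor.Extract(document, new ExtractionOptions());

        // The ".bin" extension and the file naming scheme are assumptions of this sketch
        Directory.CreateDirectory(exchangeFolder);
        string outputPath = Path.Combine(exchangeFolder, Path.GetFileName(documentPath) + ".bin");
        File.WriteAllBytes(outputPath, extractedData.Serialize());
    }

    // Indexing side: only deserializes the data and adds it to the index.
    // Hypothetical helper; no document parsing happens here.
    static void IndexFromFolder(string indexFolder, string exchangeFolder)
    {
        string[] files = Directory.GetFiles(exchangeFolder, "*.bin");
        if (files.Length == 0)
        {
            return; // Nothing to index
        }

        ExtractedData[] data = new ExtractedData[files.Length];
        for (int i = 0; i < files.Length; i++)
        {
            data[i] = ExtractedData.Deserialize(File.ReadAllBytes(files[i]));
        }

        Index index = new Index(indexFolder);
        index.Add(data, new IndexingOptions());
    }

    static void Main()
    {
        // Example paths; adjust them to your environment
        string exchangeFolder = @"c:\MyExchange";
        ExtractToFile(@"c:\MyDocuments\MyDocument.pdf", exchangeFolder);
        IndexFromFolder(@"c:\MyIndex", exchangeFolder);
    }
}
```

In this sketch, updating a changed document means running ExtractToFile for it again and then running IndexFromFolder, which follows the same extract-then-add sequence described above.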
More resources
GitHub examples
You may easily run the code from documentation articles and see the features in action in our GitHub examples: