This page describes how to get a list of indexed documents from an index and how to get the text of indexed documents in HTML or plain text format.
Getting indexed documents
To get a list of indexed documents from an index, use the GetIndexedDocuments method of the Index class. Documents with the ZIP, PST, and OST extensions can also contain internal documents. To get a list of these internal documents, use the GetIndexedDocumentItems method of the Index class. For ZIP archives, this method gives access to documents at arbitrary nesting depth. The following example shows how to obtain a list of documents from an index.
C#
string indexFolder = @"c:\MyIndex\";
string documentsFolder = @"c:\MyDocuments\";

// Creating an index in the specified folder
Index index = new Index(indexFolder);

// Indexing documents from the specified folder
index.Add(documentsFolder);

// Getting list of indexed documents
DocumentInfo[] documents = index.GetIndexedDocuments();
for (int i = 0; i < documents.Length; i++)
{
    DocumentInfo document = documents[i];
    Console.WriteLine(document.FilePath);
    DocumentInfo[] items = index.GetIndexedDocumentItems(document); // Getting list of document items
    for (int j = 0; j < items.Length; j++)
    {
        DocumentInfo item = items[j];
        Console.WriteLine("\t" + item.InnerPath);
    }
}
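Because ZIP archives can be nested, the inner items of a document may themselves contain items. The following is a minimal sketch of a recursive traversal, not a definitive implementation. It assumes that GetIndexedDocumentItems returns only the direct children of the document passed to it and an empty array (or null) for items without children. The PrintItems helper and the IndexedDocumentTree class are hypothetical names introduced for illustration, and the namespace of DocumentInfo is an assumption.

```csharp
using System;
using GroupDocs.Search;
using GroupDocs.Search.Results; // assumed namespace of DocumentInfo

static class IndexedDocumentTree
{
    // Hypothetical helper: prints the inner items of 'parent',
    // then descends into each item to print its own inner items.
    static void PrintItems(Index index, DocumentInfo parent, int depth)
    {
        DocumentInfo[] items = index.GetIndexedDocumentItems(parent);
        if (items == null)
        {
            return; // assumption: guards against a null result for leaf items
        }

        foreach (DocumentInfo item in items)
        {
            // Indent by nesting depth so the hierarchy is visible in the output
            Console.WriteLine(new string('\t', depth) + item.InnerPath);
            PrintItems(index, item, depth + 1);
        }
    }

    static void Main()
    {
        Index index = new Index(@"c:\MyIndex\");
        index.Add(@"c:\MyDocuments\");

        foreach (DocumentInfo document in index.GetIndexedDocuments())
        {
            Console.WriteLine(document.FilePath);
            PrintItems(index, document, 1);
        }
    }
}
```

If GetIndexedDocumentItems already returns items from all nesting levels in one flat array, the recursion is unnecessary and the two-level loop from the example above is sufficient.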
Getting text of indexed documents
The text of an indexed document can also be obtained from an index. If the option to save the text of documents in the index was enabled, the saved text is used. If this option was not enabled when the index was created, then calling the GetDocumentText method of the Index class extracts the text of the document again. Details about saving the text of documents in an index can be found on the page Storing text of indexed documents.
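As a rough illustration of enabling text storage when an index is created, the sketch below sets a text storage option in the index settings. The IndexSettings, TextStorageSettings, and Compression names and their namespace are assumptions that do not appear on this page; the page Storing text of indexed documents is the authoritative description.

```csharp
using GroupDocs.Search;
using GroupDocs.Search.Options; // assumed namespace of IndexSettings, TextStorageSettings, Compression

static class TextStorageExample
{
    static void Main()
    {
        string indexFolder = @"c:\MyIndex\";
        string documentsFolder = @"c:\MyDocuments\";

        // Assumption: text storage is switched on through the index settings
        IndexSettings settings = new IndexSettings();
        settings.TextStorageSettings = new TextStorageSettings(Compression.High);

        // Creating an index that keeps the extracted text of each document
        Index index = new Index(indexFolder, settings);

        // Indexing documents from the specified folder
        index.Add(documentsFolder);
    }
}
```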
The generated text of the document is passed to an instance of a class derived from the abstract class OutputAdapter. Details on the output adapters are presented on the page Output adapters.
Once the text of a document has been written to a file, that file can be opened in a web browser. The following example shows how to extract document text from an index.
C#
string indexFolder = @"c:\MyIndex\";
string documentsFolder = @"c:\MyDocuments\";

// Creating an index in the specified folder
Index index = new Index(indexFolder);

// Indexing documents from the specified folder
index.Add(documentsFolder);

// Getting list of indexed documents
DocumentInfo[] documents = index.GetIndexedDocuments();

// Getting a document text
if (documents.Length > 0)
{
    FileOutputAdapter outputAdapter = new FileOutputAdapter(OutputFormat.Html, @"C:\Text.html");
    index.GetDocumentText(documents[0], outputAdapter);
}
The GetDocumentText method also has an overload that takes an instance of the TextOptions class as a parameter. The following options can be specified in this class:
CustomExtractor is the custom extractor that was used during indexing. It is required if the text of the document was not saved in the index.
AdditionalFields are the additional document fields that were added during indexing. They are also required if the text of the document was not saved in the index.
Cancellation is an object used to cancel the operation.
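The options listed above could be combined with the GetDocumentText overload roughly as follows. This is a sketch under stated assumptions: the Cancellation class with a CancelAfter method taking milliseconds, the OutputFormat.PlainText value, the output path, and the namespaces are not confirmed by this page and should be checked against the API reference.

```csharp
using GroupDocs.Search;
using GroupDocs.Search.Common;  // assumed namespace of FileOutputAdapter, Cancellation
using GroupDocs.Search.Options; // assumed namespace of TextOptions, OutputFormat
using GroupDocs.Search.Results; // assumed namespace of DocumentInfo

static class TextOptionsExample
{
    static void Main()
    {
        Index index = new Index(@"c:\MyIndex\");
        index.Add(@"c:\MyDocuments\");

        DocumentInfo[] documents = index.GetIndexedDocuments();
        if (documents.Length > 0)
        {
            TextOptions options = new TextOptions();

            // Assumption: the operation is cancelled if it runs longer than 30 seconds
            options.Cancellation = new Cancellation();
            options.Cancellation.CancelAfter(30000);

            // Writing the text in plain text format (assumed enum value and illustrative path)
            FileOutputAdapter outputAdapter = new FileOutputAdapter(OutputFormat.PlainText, @"c:\Text.txt");

            // Overload described above: document, output adapter, text options
            index.GetDocumentText(documents[0], outputAdapter, options);
        }
    }
}
```

If the text was not saved in the index and a custom extractor or additional fields were used during indexing, the same values would also need to be set on the options object before the call.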