Getting indexed documents

This page contains a description of how to get a list of indexed documents from an index, and how to get the text of indexed documents in HTML or plain text format.

Getting indexed documents

To get a list of indexed documents from an index, use the getIndexedDocuments method of the Index class. Documents with the extensions ZIP, PST, OST can also contain internal documents. To get a list of internal documents, use the getIndexedDocumentItems method of the Index class. For ZIP archives, this way you can access documents of arbitrary nesting depth. An example of obtaining a list of documents from an index is presented below.

String indexFolder = "c:\\MyIndex\\";
String documentsFolder = "c:\\MyDocuments\\";
 
// Creating an index in the specified folder
Index index = new Index(indexFolder);
 
// Indexing documents from the specified folder
index.add(documentsFolder);
 
// Getting list of indexed documents
DocumentInfo[] documents = index.getIndexedDocuments();
for (DocumentInfo document : documents) {
    System.out.println(document.getFilePath());
    DocumentInfo[] items = index.getIndexedDocumentItems(document); // Getting list of document items
    for (DocumentInfo item : items) {
        System.out.println("\t" + item.getInnerPath());
    }
}

Getting text of indexed documents

The text of the indexed document can also be extracted from an index if the option to save the text of documents in the index has been enabled. If this option was not enabled when creating an index, then when the getDocumentText method of the Index class is called, the text of the document will be retrieved again. Details about saving the text of documents in an index can be found on the page Storing text of indexed documents.

The generated text of the document is passed to an instance of a class derived from the abstract class OutputAdapter. Details on the output adapters are presented on the page Output adapters.

After generating the text of a document into a file, this file can be opened by an Internet browser. The following example shows how to extract document text from an index.

String indexFolder = "c:\\MyIndex\\";
String documentsFolder = "c:\\MyDocuments\\";

// Creating an index in the specified folder
Index index = new Index(indexFolder);

// Indexing documents from the specified folder
index.add(documentsFolder);

// Getting list of indexed documents
DocumentInfo[] documents = index.getIndexedDocuments();

// Getting a document text
if (documents.length > 0) {
    FileOutputAdapter outputAdapter = new FileOutputAdapter(OutputFormat.Html, "C:\\Text.html");
    index.getDocumentText(documents[0], outputAdapter);
}

To extract the text of a document from an index, the method overloading is also presented, which takes an instance of the TextOptions class as a parameter. In this class, the following options can be specified:

  • setCustomExtractor method sets an extractor used during indexing, it is necessary if the text of the document was not saved in the index;
  • setAdditionalFields method sets additional document fields added during document indexing which are also necessary if the document text was not saved in the index;
  • setCancellation method sets an object used to cancel the operation;
  • getMetadataIndexingOptions method returns an object for specifying metadata indexing options.

More resources

GitHub examples

You may easily run the code from documentation articles and see the features in action in our GitHub examples:

Free online document search App

Along with full featured .NET library we provide simple, but powerful free Apps.

You are welcome to search over your PDF, DOC, DOCX, PPT, PPTX, XLS, XLSX and more with our free online Free Online Document Search App.