GroupDocs.Parser for Java 18.11 Release Notes

Major Features

This release includes the following features:

  • Implemented the ability to retrieve the information of supported extractors for a document
  • Implemented IFastTextExtractor interface
  • Implemented IDocumentContentExtractor interface
  • Improved text area extraction for PDF documents

Full List of Issues Covering all Changes in this Release

| Key | Summary | Issue Type |
| --- | --- | --- |
| PARSERNET-1077 | Implement the ability to retrieve the information of supported extractors for a document | New feature |
| PARSERNET-1075 | Implement IFastTextExtractor interface | Enhancement |
| PARSERNET-1076 | Implement IDocumentContentExtractor interface | Enhancement |
| PARSERNET-1069 | Improve text area extraction for PDF documents | Enhancement |

Public API and Backward Incompatible Changes

Ability to retrieve the information of supported extractors for a document

Description

This feature reports which kinds of extraction (plain text, formatted text, metadata, nested documents) are supported for a document.

Public API changes

  • Added DocumentInfo class

  • Added getDocumentInfo methods to ExtractorFactory class

Usage

The DocumentInfo class has the following properties, exposed in Java as accessor methods:

| Property | Description |
| --- | --- |
| hasText | Boolean value indicating whether plain text can be extracted from the document |
| hasFormattedText | Boolean value indicating whether formatted text can be extracted from the document |
| hasMetadata | Boolean value indicating whether metadata can be extracted from the document |
| isContainer | Boolean value indicating whether the document contains other documents (such as email attachments or a zip archive) |

Usage:

void printDocumentInfo(String fileName) {
    ExtractorFactory factory = new ExtractorFactory();
    // Get the document info
    DocumentInfo info = factory.getDocumentInfo(fileName);
    System.out.println("This document contains:");
 
    // Check if a user can extract a plain text from a document
    if (info.hasText()) {
        System.out.println("text");
    }
 
    // Check if a user can extract a formatted text from a document
    if (info.hasFormattedText()) {
        System.out.println("formatted text");
    }
 
    // Check if a user can extract metadata from a document
    if (info.hasMetadata()) {
        System.out.println("metadata");
    }
 
    // Check if the document contains other documents
    if (info.isContainer()) {
        System.out.println("other documents");
    }
}
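
The same four checks can also be gathered into a list rather than printed one by one. The following is a minimal sketch under stated assumptions: `Capabilities` is a hypothetical stand-in that only mirrors the four DocumentInfo accessors used above, so the example runs without the library. `supportedKinds` and `of` are illustrative helpers, not part of the GroupDocs.Parser API.

```java
import java.util.ArrayList;
import java.util.List;

public class DocumentInfoSketch {
    // Hypothetical stand-in mirroring the four DocumentInfo accessors;
    // this is NOT the library class.
    interface Capabilities {
        boolean hasText();
        boolean hasFormattedText();
        boolean hasMetadata();
        boolean isContainer();
    }

    // Collects the names of the supported kinds of extraction, in a fixed order.
    static List<String> supportedKinds(Capabilities info) {
        List<String> kinds = new ArrayList<>();
        if (info.hasText()) kinds.add("text");
        if (info.hasFormattedText()) kinds.add("formatted text");
        if (info.hasMetadata()) kinds.add("metadata");
        if (info.isContainer()) kinds.add("other documents");
        return kinds;
    }

    // Illustrative helper that builds a stand-in with fixed flag values.
    static Capabilities of(final boolean text, final boolean formatted,
                           final boolean metadata, final boolean container) {
        return new Capabilities() {
            public boolean hasText() { return text; }
            public boolean hasFormattedText() { return formatted; }
            public boolean hasMetadata() { return metadata; }
            public boolean isContainer() { return container; }
        };
    }

    public static void main(String[] args) {
        // Prints: [text, metadata]
        System.out.println(supportedKinds(of(true, false, true, false)));
    }
}
```

With the real library, the object returned by getDocumentInfo would take the place of the stand-in.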

Improved text area extraction for PDF documents

Description

This enhancement improves text area extraction for PDF documents. The Y-coordinates of text areas are measured from the top of the page, and for some kinds of documents text areas contain more items.
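
To illustrate what a top-of-page origin means in practice: the default PDF user space places the origin at the bottom-left corner of the page, so a rectangle measured from the bottom has to be flipped to get its top-origin Y-coordinate. The sketch below shows only that arithmetic. The method name and the sample values are assumptions for illustration, and it does not describe how the library performs the conversion internally.

```java
public class TopOriginSketch {
    // Converts a rectangle's position from a bottom-origin system to a
    // top-origin system.
    //   pageHeight    - height of the page
    //   bottomOriginY - Y of the rectangle's lower edge, measured from the page bottom
    //   height        - height of the rectangle
    // Returns the Y of the rectangle's upper edge, measured from the page top.
    static double toTopOriginY(double pageHeight, double bottomOriginY, double height) {
        return pageHeight - (bottomOriginY + height);
    }

    public static void main(String[] args) {
        // A 20-unit-high area whose lower edge sits 700 units above the bottom
        // of a 792-unit-high page starts 72 units below the top.
        System.out.println(toTopOriginY(792.0, 700.0, 20.0)); // prints 72.0
    }
}
```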

Public API changes

No API changes.

Usage

// Create a text extractor
PdfTextExtractor extractor = new PdfTextExtractor("invoice.pdf");
  
// Create search options
TextAreaSearchOptions searchOptions = new TextAreaSearchOptions();
// Set a regular expression to search 'Invoice # XXX' text
searchOptions.setExpression("\\s?INVOICE\\s?#\\s?[0-9]+");
// Limit the search with a rectangle
searchOptions.setRectangle(new Rectangle(10, 10, 300, 150));
  
// Get text areas
java.util.List<TextArea> texts = extractor.getDocumentContent().getTextAreas(0, searchOptions);
  
// Iterate over a list
for (TextArea area : texts) {
    // Print a text
    System.out.println(area.getText());
}
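
To see what the search expression above matches, it can be tried on plain strings with the standard java.util.regex package. This is a minimal sketch that does not use the library. Whether TextAreaSearchOptions evaluates the expression with the same engine, or with the same case sensitivity, is an assumption not confirmed by these notes, and the sample strings are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InvoiceRegexSketch {
    // The same expression as in the usage example above.
    static final Pattern INVOICE = Pattern.compile("\\s?INVOICE\\s?#\\s?[0-9]+");

    // Returns every match in the text, with surrounding whitespace trimmed.
    static List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = INVOICE.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group().trim());
        }
        return found;
    }

    public static void main(String[] args) {
        // At most one whitespace character is allowed around '#'.
        System.out.println(findAll("Date: 2018-11-01 INVOICE # 1234 Total")); // [INVOICE # 1234]
        System.out.println(findAll("INVOICE#77"));                            // [INVOICE#77]
        // java.util.regex is case-sensitive by default, so this finds nothing.
        System.out.println(findAll("Invoice # 5"));                           // []
    }
}
```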

IFastTextExtractor interface

Description

This enhancement allows enabling fast text extraction via the **IFastTextExtractor** interface.

Public API changes

Added **IFastTextExtractor** interface

Added support for **IFastTextExtractor** interface to the following classes:

  • PdfTextExtractor class
  • CellsTextExtractor class
  • SlidesTextExtractor class

Usage

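IFastTextExtractor interface has only one property (shown here in property notation; the Java API exposes it through accessor methods such as setExtractMode):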

ExtractMode ExtractMode { get; set; }

This property gets or sets a value indicating the mode of text extraction. ExtractMode enumeration has the following members:

| Value | Description |
| --- | --- |
| Simple | Fast text extraction. Text is extracted less accurately but faster than in the standard mode. If fast text extraction doesn’t support the document format, this value is ignored and standard text extraction is used. |
| Standard | Standard text extraction. |

Usage:

void extractText(TextExtractor extractor) {
    // Check if extractor supports IFastTextExtractor interface
    if (extractor instanceof IFastTextExtractor) {
        // Set the mode of text extraction
        ((IFastTextExtractor) extractor).setExtractMode(ExtractMode.Simple);
    }
    // Extract a text
    System.out.println(extractor.extractAll());
}
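
The instanceof check above lets one method handle every extractor: those that support fast mode are switched to it, and the rest are left on standard extraction. The sketch below reproduces that pattern with stand-in types so it runs without the library. `Extractor`, `FastCapable`, `Mode`, `FastExtractor` and `PlainExtractor` are all hypothetical names, and the returned strings are placeholders, not real extraction results.

```java
public class FastModeSketch {
    // Hypothetical stand-in for the ExtractMode enumeration.
    enum Mode { SIMPLE, STANDARD }

    // Hypothetical stand-in for the TextExtractor base class.
    static class Extractor {
        String extractAll() { return "standard text"; }
    }

    // Hypothetical stand-in for the IFastTextExtractor interface.
    interface FastCapable {
        void setMode(Mode mode);
    }

    // An extractor that supports the fast mode.
    static class FastExtractor extends Extractor implements FastCapable {
        private Mode mode = Mode.STANDARD;

        public void setMode(Mode mode) { this.mode = mode; }

        @Override
        String extractAll() {
            return mode == Mode.SIMPLE ? "fast text" : "standard text";
        }
    }

    // An extractor without fast-mode support.
    static class PlainExtractor extends Extractor { }

    // Requests the fast mode when the extractor offers it, then extracts.
    static String extractPreferFast(Extractor extractor) {
        if (extractor instanceof FastCapable) {
            ((FastCapable) extractor).setMode(Mode.SIMPLE);
        }
        return extractor.extractAll();
    }

    public static void main(String[] args) {
        System.out.println(extractPreferFast(new FastExtractor()));  // fast text
        System.out.println(extractPreferFast(new PlainExtractor())); // standard text
    }
}
```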

IDocumentContentExtractor interface

Description

This enhancement provides access to the Text Analysis API via the **IDocumentContentExtractor** interface.

Public API changes

Added **IDocumentContentExtractor** interface

Added support for IDocumentContentExtractor interface to the following classes:

  • PdfTextExtractor class
  • CellsTextExtractor class
  • SlidesTextExtractor class
  • WordsTextExtractor class

Usage

IDocumentContentExtractor interface has only one property (shown here in property notation; the Java API exposes it through the getDocumentContent accessor):

DocumentContent DocumentContent { get; }

This property provides access to the document’s content.

Usage:

void extractText(TextExtractor extractor) {
    // Check if extractor supports IDocumentContentExtractor interface
    if (extractor instanceof IDocumentContentExtractor) {
        IDocumentContentExtractor contentExtractor = (IDocumentContentExtractor) extractor;
        // Iterate over pages
        for (int i = 0; i < contentExtractor.getDocumentContent().getPageCount(); i++) {
            // Iterate over text areas of the page
            for (TextArea textArea : contentExtractor.getDocumentContent().getTextAreas(i)) {
                System.out.println(textArea.getText());
            }
        }
    }
}