This page contains release notes for GroupDocs.Parser for Java 18.12.

Major Features

This release adds the following features:

  • Added the ability to extract tables from PDFs
  • Added support for text and presentation templates
  • Added the ability to detect the type of password-protected Office Open XML documents

Full List of Issues Covering all Changes in this Release

PARSERNET-1016 | Implement the ability to extract tables from PDFs | New feature
PARSERNET-1097 | Implement the support for text and presentation templates | New feature
PARSERNET-1092 | Implement the ability to detect the type of password-protected Office Open XML documents | Enhancement

 

Public API and Backward Incompatible Changes

This section lists public API changes introduced in GroupDocs.Parser for Java 18.12. It covers new and obsoleted public methods as well as changes to internal behavior in GroupDocs.Parser that may affect existing code. Any change that modifies existing behavior and could be seen as a regression is especially important and is documented here.

Ability to extract tables from PDFs

Description

This feature allows extracting tables from PDF documents.

Public API changes

  • Added TableArea class
  • Added TableAreaCell class
  • Added TableAreaLayout class
  • Added TableAreaDetector class
  • Added TableAreaDetectorParameters class
  • Added TableAreaParser class
  • Added TableAreaDetector property to PdfTextExtractor class
  • Added TableAreaParser property to PdfTextExtractor class

Usage

To extract tables from a PDF document, use the TableAreaParser class. An instance of TableAreaParser is available through the property of the same name on the PdfTextExtractor class:

Java

The ParseTableArea method extracts a table from a document page:

Java

This method accepts the zero-based page index and the layout of the table. The layout is represented by the TableAreaLayout class, which has the following members:

Member | Description
VerticalSeparators | A collection of vertical separators
HorizontalSeparators | A collection of horizontal separators

These collections represent the bounds of columns and rows. For example, a 2x2 table has 3 vertical and 3 horizontal separators:

 

---------
|   |   |
---------
|   |   |
---------
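The relationship between separators and cells can be sketched in a few lines. This is only an illustration of the counting rule above, not the library's implementation: the class SeparatorGridSketch, its methods, and the coordinate values are hypothetical, and separators are assumed to be sorted positions along one axis.

```java
import java.util.Arrays;

// Hypothetical helper, not part of GroupDocs.Parser: shows how N + 1 sorted
// separator positions bound N columns (or rows).
public class SeparatorGridSketch {

    /** Number of columns (or rows) bounded by the given separators. */
    public static int bandCount(double[] separators) {
        return Math.max(0, separators.length - 1);
    }

    /** Size of each column (or row): the gap between adjacent separators. */
    public static double[] bandSizes(double[] separators) {
        double[] sizes = new double[bandCount(separators)];
        for (int i = 0; i < sizes.length; i++) {
            sizes[i] = separators[i + 1] - separators[i];
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Made-up coordinates for a 2x2 table: 3 separators on each axis.
        double[] vertical = {10.0, 110.0, 210.0};
        double[] horizontal = {20.0, 50.0, 80.0};

        System.out.println(bandCount(horizontal) + " rows x " + bandCount(vertical) + " columns");
        System.out.println("column widths: " + Arrays.toString(bandSizes(vertical)));
        System.out.println("row heights: " + Arrays.toString(bandSizes(horizontal)));
    }
}
```

With three separators per axis the sketch reports two rows and two columns, matching the diagram above.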

The TableArea class has the following members:

Member | Description
int RowCount | Number of table rows
int ColumnCount | Number of table columns
TableAreaCell this[int row, int column] | Cell of the table
double GetRowHeight(int row) | Height of the row
double GetColumnWidth(int column) | Width of the column

The TableAreaCell class has the following members:

Member | Description
TextArea | Content of the cell.
Row | Zero-based index of the row.
Column | Zero-based index of the column.
RowSpan | Number of rows which the cell spans across.
ColumnSpan | Number of columns which the cell spans across.
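To make the span members concrete, the following sketch models a cell's position and spans and checks which grid positions it occupies. CellSpanSketch is a hypothetical stand-in written for this illustration, not the library's cell class, and it assumes that an unmerged cell has a span of 1 in both directions.

```java
// Hypothetical model of a cell's Row, Column, RowSpan and ColumnSpan values;
// not part of GroupDocs.Parser.
public class CellSpanSketch {
    public final int row;        // zero-based index of the first row
    public final int column;     // zero-based index of the first column
    public final int rowSpan;    // assumed to be 1 for an unmerged cell
    public final int columnSpan; // assumed to be 1 for an unmerged cell

    public CellSpanSketch(int row, int column, int rowSpan, int columnSpan) {
        this.row = row;
        this.column = column;
        this.rowSpan = rowSpan;
        this.columnSpan = columnSpan;
    }

    /** True if the grid position (r, c) lies inside the area this cell spans. */
    public boolean covers(int r, int c) {
        return r >= row && r < row + rowSpan
            && c >= column && c < column + columnSpan;
    }

    public static void main(String[] args) {
        // A header cell merged across both columns of the first row.
        CellSpanSketch header = new CellSpanSketch(0, 0, 1, 2);
        System.out.println(header.covers(0, 1)); // second column of the first row
        System.out.println(header.covers(1, 0)); // first column of the second row
    }
}
```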

Usage:

Java

A TableAreaLayout object can be created manually or by using the TableAreaDetector class. An instance of TableAreaDetector is available through the property of the same name on the PdfTextExtractor class:

Java

The TableAreaDetector class finds table bounds automatically. Its detectLayouts method searches for tables on a document page and returns a collection of table layouts:

 

Java

This method accepts the zero-based page index and optional parameters that help the detector find tables. If parameters are passed, the detector searches only for tables that meet those criteria; in this case, the total number of detected tables must equal the number of parameters passed.

TableAreaDetectorParameters class has the following members:

Member | Description
MinRowCount | Minimum number of table rows.
MinColumnCount | Minimum number of table columns.
HasMergedCells | Value indicating whether the table has merged cells.
MinVerticalSpace | Minimum width of vertical separators.
Rectangle | Rectangle which bounds a table detection region.

By setting these parameters, a user can tune the detector's behavior. For example, the following limits the page area in which to search for a table and disables the search for complex tables (those with merged cells):

Java

 

Support for text and presentation templates

Description

This feature allows extracting text and metadata from the following document types:

  • dotx (Template)
  • dotm (Macro-enabled template)
  • ott (OpenDocument Text Template)
  • potx (Template)
  • potm (Macro-enabled template)
  • ppsm (Macro-enabled slide show)
  • pptm (Macro-enabled presentation)
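An application that routes files by extension may want to recognize the types listed above before handing them to the extractor. The sketch below shows one way to do that with the standard library only. TemplateExtensionSketch and isNewlySupported are hypothetical names introduced for this example and are not part of GroupDocs.Parser; the library itself requires no API changes for these formats.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of GroupDocs.Parser: checks a file name
// against the extensions listed in these release notes.
public class TemplateExtensionSketch {

    private static final Set<String> ADDED_IN_18_12 = new HashSet<>(Arrays.asList(
            "dotx", "dotm", "ott", "potx", "potm", "ppsm", "pptm"));

    /** True if the file name ends with one of the extensions listed above. */
    public static boolean isNewlySupported(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false; // no extension, or the name ends with a dot
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return ADDED_IN_18_12.contains(extension);
    }

    public static void main(String[] args) {
        System.out.println(isNewlySupported("letterhead.DOTX"));
        System.out.println(isNewlySupported("notes.docx"));
    }
}
```

The check is case-insensitive and looks only at the file name, so it complements rather than replaces detection by content.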

Public API changes

No API changes.

Usage

Java

Ability to detect the type of password-protected Office Open XML documents

Description

This feature allows detecting password-protected Office Open XML documents by content.
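As background, an unprotected Office Open XML file is a ZIP package, while a password-protected one is wrapped in an OLE compound file, so its leading bytes no longer identify it as a Word, Excel, or PowerPoint document. This is presumably why the detector needs load options carrying a password to determine the actual type. The sketch below only distinguishes the two containers by their well-known signatures; ContainerSignatureSketch and its method are hypothetical, and this is not how MediaTypeDetector is implemented.

```java
// Hypothetical illustration, not part of GroupDocs.Parser: classifies a file's
// outer container from its leading bytes.
public class ContainerSignatureSketch {

    // ZIP local file header signature ("PK\u0003\u0004").
    private static final int[] ZIP = {0x50, 0x4B, 0x03, 0x04};
    // OLE compound file signature, used by encrypted Office Open XML files.
    private static final int[] CFB = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};

    /** Returns "zip", "compound-file" or "unknown" for the given leading bytes. */
    public static String sniff(byte[] head) {
        if (startsWith(head, CFB)) {
            return "compound-file";
        }
        if (startsWith(head, ZIP)) {
            return "zip";
        }
        return "unknown";
    }

    private static boolean startsWith(byte[] data, int[] signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] zipHead = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00};
        System.out.println(sniff(zipHead));
    }
}
```

A "compound-file" result alone cannot tell an encrypted .docx from an encrypted .xlsx or a legacy .doc; the content must be decrypted or inspected further to find out.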

Public API changes

  • Added the string Detect(Stream, LoadOptions) public method to the MediaTypeDetector class.
  • Added the string DetectByContent(Stream, LoadOptions) protected virtual method to the MediaTypeDetector class.
  • Marked the string DetectByContent(Stream) protected virtual method of the MediaTypeDetector class as obsolete.

Usage

To detect the media type of an encrypted Office Open XML document, use the Detect(Stream, LoadOptions) method:

 

Java

 

For batch document processing, a PasswordProvider is used:

 

Java