GroupDocs.Parser for .NET 18.7 Release Notes

Major Features

This release includes the following features:

  • API to provide data for text analysis
  • Support for text analysis API for PDF documents

Full List of Issues Covering all Changes in this Release

Key            Summary                                                        Issue Type
PARSERNET-992  Implement API to provide data for text analysis                New feature
PARSERNET-956  Implement the support for text analysis API for PDF documents  New feature

Public API and Backward Incompatible Changes

API to provide data for text analysis

Description

This feature allows extracting text areas from document pages.

Public API changes

Added DocumentContent class.
Added DocumentPage class.
Added TextArea class.
Added TextAreaItem class.
Added Rectangle class.
Added Font class.
Added Font property to TextProperties class.
IsBold and IsItalic properties of TextProperties class are marked as obsolete.

The abstract DocumentContent class provides the API for extracting text areas from document pages. The API is exposed as an abstract class so that it can be extended in future versions. To support it, a text extractor implements its own private subclass and exposes it through a DocumentContent property (see PdfTextExtractor for an example).

DocumentContent class has the following members:

Member        Description
PageCount     Returns the total number of document pages
Dispose       Releases resources used by the class
GetPage       Returns a document page (see below)
GetTextAreas  Returns a collection of TextArea objects (see below)

GetPage method returns an instance of DocumentPage class. This class has the following members:

Member     Description
Index      Zero-based page index
Width      Page width
Height     Page height
TextAreas  Collection of TextArea objects

GetPage method works the same way as GetTextAreas method, except that it returns a DocumentPage object instead of a collection of text areas.
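As a rough sketch of how PageCount and GetPage might be used together, the method below walks every page and prints its size and the text of its areas. It assumes that GetPage takes a zero-based page index (these notes only say that it works like GetTextAreas), that DocumentContent, DocumentPage and TextArea are in the GroupDocs.Parser namespace, and that TextAreas can be enumerated with foreach. PageDump and PrintPages are illustrative names, not part of the library.

```csharp
using System;
using GroupDocs.Parser; // assumption: namespace of DocumentContent, DocumentPage and TextArea

public static class PageDump
{
    // Prints the index, the size and the text areas of every page.
    public static void PrintPages(DocumentContent content)
    {
        for (int pageIndex = 0; pageIndex < content.PageCount; pageIndex++)
        {
            // Assumption: GetPage accepts the same zero-based index as GetTextAreas
            var page = content.GetPage(pageIndex);
            Console.WriteLine("Page {0}: {1} x {2}", page.Index, page.Width, page.Height);

            // The explicit element type works whether or not the collection is generic
            foreach (TextArea area in page.TextAreas)
            {
                Console.WriteLine("  " + area.Text);
            }
        }
    }
}
```

With the extractor from the Usage section, it could be called as PageDump.PrintPages(extractor.DocumentContent).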

GetTextAreas method returns a collection of TextArea objects. There are two overloads of this method:

C#

GetTextAreas(int pageIndex)
GetTextAreas(int pageIndex, TextAreaSearchOptions searchOptions)

The first overload returns all text areas on the page. The second searches the text with a regular expression and limits the search area to a rectangle. TextAreaSearchOptions class has the following members:

Member            Description
Expression        Regular expression; null if it isn't used
Rectangle         Rectangle that bounds the search area; null if it isn't used
IgnoreFormatting  A value indicating whether text formatting is ignored
UniteSegments     A value indicating whether adjacent text segments are united
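Before passing a pattern to Expression, it can help to check it against sample strings. The sketch below does this with the standard .NET Regex class and the pattern from the Usage section. These notes do not say which regular expression engine or options the library applies, so treat the default, case-sensitive behaviour shown here as an assumption. InvoicePatternCheck and Find are illustrative names.

```csharp
using System.Text.RegularExpressions;

public static class InvoicePatternCheck
{
    // Same pattern as in the Usage sample, written as a verbatim string
    public const string Pattern = @"\s?INVOICE\s?#\s?[0-9]+";

    // Returns the matched text, or null when the pattern does not match
    public static string Find(string text, RegexOptions options = RegexOptions.None)
    {
        Match match = Regex.Match(text, Pattern, options);
        return match.Success ? match.Value : null;
    }
}
```

With default options the pattern matches "INVOICE # 1024" but not "Invoice # 1024"; passing RegexOptions.IgnoreCase to Find makes the second string match as well.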

TextArea class has the following members:

Member     Description
Page       Document page
Text       Text of the text area
Rectangle  Rectangle of the text area
Items      Collection of TextAreaItem objects
Dispose    Removes an object from DocumentPage

TextArea class can be used as a stand-alone object. In this case, it has its own Rectangle and Text properties, and its Items collection is empty. In most cases, however, it contains items. If a TextArea object has items, its Text and Rectangle properties are calculated from the Items collection.

Rectangle class has the following members:

Member  Description
Left    X-coordinate of the upper-left corner
Top     Y-coordinate of the upper-left corner
Right   X-coordinate of the lower-right corner
Bottom  Y-coordinate of the lower-right corner
Width   Width of the rectangle
Height  Height of the rectangle
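The six members are related by simple geometry if the origin is the upper-left corner and the y-axis grows downward, which the corner descriptions suggest but do not state. The stand-in type below makes that relationship explicit. RectSketch is a hypothetical name, not the library's Rectangle class. It also shows the two possible readings of four arguments such as (10, 10, 300, 150); these notes do not say which one the GroupDocs.Parser.Rectangle constructor uses.

```csharp
// Hypothetical stand-in for illustration only; not GroupDocs.Parser.Rectangle
public readonly struct RectSketch
{
    // Reading 1: the four values are the edges
    public RectSketch(double left, double top, double right, double bottom)
    {
        Left = left;
        Top = top;
        Right = right;
        Bottom = bottom;
    }

    // Reading 2: the last two values are the size
    public static RectSketch FromSize(double left, double top, double width, double height)
    {
        return new RectSketch(left, top, left + width, top + height);
    }

    public double Left { get; }
    public double Top { get; }
    public double Right { get; }
    public double Bottom { get; }

    // Assumption: the y-axis grows downward, so Bottom >= Top
    public double Width => Right - Left;
    public double Height => Bottom - Top;
}
```

Under the first reading, (10, 10, 300, 150) is 290 wide and 140 high; under the second, its lower-right corner is at (310, 160).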

TextAreaItem class has the following members:

Member          Description
Text            Text of the text area
Rectangle       Rectangle of the text area
TextProperties  Text properties of the segment

TextProperties class has Font property. Font class has the following members:

Member    Description
IsBold    True if the text is bold; otherwise, false
IsItalic  True if the text is italic; otherwise, false
Name      Font name
Size      Font size
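Since IsBold and IsItalic on TextProperties are now obsolete, font details would presumably be read through the new Font property. The sketch below prints each item of a text area with its font. It assumes that TextArea and TextAreaItem are in the GroupDocs.Parser namespace, that Items can be enumerated with foreach, and that TextProperties or Font may be null for some items. FontDump and PrintItems are illustrative names, not part of the library.

```csharp
using System;
using GroupDocs.Parser; // assumption: namespace of TextArea and TextAreaItem

public static class FontDump
{
    // Prints every item of the text area together with its font details.
    public static void PrintItems(TextArea area)
    {
        foreach (TextAreaItem item in area.Items)
        {
            // Read the style from Font instead of the obsolete
            // TextProperties.IsBold and TextProperties.IsItalic
            var font = item.TextProperties?.Font;
            if (font == null)
            {
                // Assumption: font data may be missing for an item
                Console.WriteLine(item.Text);
                continue;
            }

            string style = (font.IsBold ? " bold" : "") + (font.IsItalic ? " italic" : "");
            Console.WriteLine("{0} [{1} {2}{3}]", item.Text, font.Name, font.Size, style);
        }
    }
}
```

It could be called for each TextArea returned by GetTextAreas in the Usage section.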

Usage

C#

// Create a text extractor
PdfTextExtractor extractor = new PdfTextExtractor("invoice.pdf");

// Create search options
TextAreaSearchOptions searchOptions = new TextAreaSearchOptions();
// Set a regular expression to search for 'INVOICE # XXX' text
searchOptions.Expression = "\\s?INVOICE\\s?#\\s?[0-9]+";
// Limit the search to a rectangle
searchOptions.Rectangle = new GroupDocs.Parser.Rectangle(10, 10, 300, 150);

// Get the text areas from the first page (the page index is zero-based)
IList<TextArea> texts = extractor.DocumentContent.GetTextAreas(0, searchOptions);

// Iterate over the list
foreach (TextArea area in texts)
{
    // Print the text of the area
    Console.WriteLine(area.Text);
}

Support for text analysis API for PDF documents

Description

This feature allows extracting text areas from document pages of PDF documents.

Public API changes

Added DocumentContent property to PdfTextExtractor class.

Usage

C#

// Create a text extractor
PdfTextExtractor extractor = new PdfTextExtractor("invoice.pdf");

// Create search options
TextAreaSearchOptions searchOptions = new TextAreaSearchOptions();
// Set a regular expression to search for 'INVOICE # XXX' text
searchOptions.Expression = "\\s?INVOICE\\s?#\\s?[0-9]+";
// Limit the search to a rectangle
searchOptions.Rectangle = new GroupDocs.Parser.Rectangle(10, 10, 300, 150);

// Get the text areas from the first page (the page index is zero-based)
IList<TextArea> texts = extractor.DocumentContent.GetTextAreas(0, searchOptions);

// Iterate over the list
foreach (TextArea area in texts)
{
    // Print the text of the area
    Console.WriteLine(area.Text);
}