GroupDocs.Parser for .NET 18.3 Release Notes

Major Features

There are the following new features in this release:

  • Ability to extract a formatted text from chm files
  • Ability to extract a text from chm documents by pages
  • Ability to extract topics from chm documents

Full List of Issues Covering all Changes in this Release

KeySummaryIssue Type
TEXTNET-546Implement the ability to extract a formatted text from chm filesNew feature
TEXTNET-883Implement the ability to extract a text from chm documents by pagesNew feature
TEXTNET-884Implement the ability to extract topics from chm documentsNew feature

Public API and Backward Incompatible Changes

Ability to extract a formatted text from chm files

Description

This feature allows extracting a formatted text from chm files.

Public API changes

Added ChmFormattedTextExtractor class.

Usage

Extracts a line of characters from a document:

C#

// Create a text extractor for chm documents
using (var extractor = new ChmFormattedTextExtractor(stream)) {
  // Extract a line of the text
  string line = extractor.ExtractLine();
  // If the line is null, then the end of the file is reached
  while (line != null) {
    // Print a line to the console
    Console.WriteLine(line);
    // Extract another line
    line = extractor.ExtractLine();
  }
}

Extracts all characters from a document:

C#

// Create a text extractor for chm documents
using (var extractor = new ChmFormattedTextExtractor(stream)) {
  // Extract a text
  Console.WriteLine(extractor.ExtractAll());
}

For setting a formatter DocumentFormatter property is used.

C#

// Create a formatted text extractor for text documents
var extractor = new ChmFormattedTextExtractor(stream);
// Set a HTML formatter for formatting
extractor.DocumentFormatter = new HtmlDocumentFormatter(); // all the text will be formatted as HTML

By default, a text is formatted as a plain text by PlainDocumentFormatter.

Ability to extract a text from chm documents by pages

Description

This feature allows extracting a text from chm documents by pages.

Public API changes

Added the implementation of IPageTextExtractor interface to ChmTextExtractor class.
Added PageCount property to ChmTextExtractor class.
Added ExtractPage property to ChmTextExtractor class.

Usage

This example shows how to extract a text by pages via a generic function:

C#

// Create a text extractor
ChmTextExtractor textExtractor = new ChmTextExtractor(stream);
// Invoke a function to print a text by pages
PrintPages(textExtractor);

// This function allows to extract a text by pages from any text extractor with IPageTextExtractor interface support
void PrintPages(TextExtractor textExtractor)
{ 
    // Check if IPageTextExtractor is supported
    IPageTextExtractor pageTextExtractor = textExtractor as IPageTextExtractor;
    if (pageTextExtractor != null)
    {
        // Iterate over all pages
        for (int i = 0; i < pageTextExtractor.PageCount; i++)
        {
            // Print a page number
            Console.WriteLine(string.Format("{0}/{1}", i, pageTextExtractor.PageCount));
            // Extract a text from the page
            Console.WriteLine(pageTextExtractor.ExtractPage(i));
        }
    }
}

Ability to extract TOC from chm documents

Description

This feature allows extracting the table of contents (TOC) from chm documents.

Public API changes

Added TableOfContentsItem class.
Added TableOfContents property to ChmTextExtractor class.

Usage

TableOfContents property is used to retrieve a list of TOC items:

C#

IList<TableOfContentsItem> TableOfContents { get; }

TableOfContentsItem has the following members:

NameDescription
TextThe text of the item. Usually, it is a chapter’s title.
PageIndexThe page index of the text. Null if it is just a node without content.
CountThe number of sub-items. Zero if the item hasn’t sub-items.
this[int index]Gets a sub-item.
ExtractPageExtracts a text of the item.

This example shows how to print TOC:

C#

// Create a text extractor
using (ChmTextExtractor extractor = new ChmTextExtractor(@"C:\Sources\GroupDocs.Parser\TestData\unit\chm\VBOB6.CHM"))
{
    // Print TOC on the screen
    PrintToc(extractor.TableOfContents, 0);
}

private static void PrintToc(IEnumerable<TableOfContentsItem> tableOfContents, int depth)
{
    // Use spaces to indicate the depth of the TOC item
    string spaces = new string(' ', depth);

    // Iterate over items
    foreach (TableOfContentsItem item in tableOfContents)
    {
        System.Console.Write(spaces);
        // Print the item's text
        System.Console.Write(item.Text);

        // If item has a text (it's not just a node)
        if (item.PageIndex.HasValue)
        {
            // Print the text length
            System.Console.Write(string.Format(" ({0})", item.ExtractPage().Length));
        }

        System.Console.WriteLine();

        // If the item has children
        if (item.Count > 0)
        {
            // Print them
            PrintToc(item, depth + 1);
        }
    }
}

This example shows how to extract a text of the item:

C#

// Create a text extractor
using (ChmTextExtractor extractor = new ChmTextExtractor(@"C:\Sources\GroupDocs.Parser\TestData\unit\chm\VBOB6.CHM"))
{
    // Print a content of the third sub-item of the second item
    Console.WriteLine(extractor.TableOfContents[1][1].ExtractPage());
}