
The code examples below use some methods defined in Common Utilities.

Extract Text from CHM Documents

This feature is supported in version 17.8.0 or later.

GroupDocs.Parser for .NET lets you extract raw text from CHM files. To extract text from a CHM file, use the ChmTextExtractor class. The API can extract the text one line at a time or return all characters of the document at once.

The Recipe

The steps involved in extracting text from a CHM document are given below:

  • Get the CHM file's path
  • Initialize a ChmTextExtractor object
  • Extract a single line of text using the ExtractLine() method, or read the whole text of the file using the ExtractAll() method

The Code

Extract a Line
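The line-by-line step of the recipe can be sketched as a small helper. This is a minimal sketch, not the library's own sample: ChmLineReader and ReadAllLines are hypothetical names, and the loop assumes that ExtractLine() returns null once the end of the document is reached. The helper takes a delegate so that it compiles without a reference to the library; the commented wiring at the bottom additionally assumes a ChmTextExtractor constructor that accepts a file path and an extractor that is disposable.

```csharp
using System;
using System.Collections.Generic;

static class ChmLineReader
{
    // Calls extractLine repeatedly and collects every returned line.
    // Assumption: the end of the document is signalled by a null return value.
    public static List<string> ReadAllLines(Func<string> extractLine)
    {
        var lines = new List<string>();
        string line;
        while ((line = extractLine()) != null)
        {
            lines.Add(line);
        }
        return lines;
    }
}

// Hypothetical wiring (constructor signature and file name are assumptions):
// using (var extractor = new ChmTextExtractor("sample.chm"))
// {
//     foreach (string line in ChmLineReader.ReadAllLines(extractor.ExtractLine))
//     {
//         Console.WriteLine(line);
//     }
// }
```

Reading line by line keeps memory use low for large documents; if the whole text is needed at once, the ExtractAll() method named in the recipe is the simpler choice.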

Extract all Characters

Extract Formatted Text from CHM Documents

Since version 18.3, GroupDocs.Parser can also extract formatted text from CHM documents. To extract formatted text, use the ChmFormattedTextExtractor class. You can extract the text one line at a time or read all characters of the document at once.

The Recipe

The steps involved in extracting formatted text from a CHM document are given below:

  • Get the CHM file's path
  • Initialize a ChmFormattedTextExtractor object
  • Extract a single line of text using the ExtractLine() method, or read the whole text of the file using the ExtractAll() method of the ChmFormattedTextExtractor class

The Code

Extract a Line

Extract all Characters

Extract Text using Document Formatter

You can also specify a document formatter when extracting formatted text from CHM documents, as shown in the code sample below.

Extract Text by Pages from CHM Documents

Since version 18.3 of GroupDocs.Parser, you can also extract text from CHM documents page by page. To support this, the ChmTextExtractor class implements the IPageTextExtractor interface. The following code sample shows how to extract text by pages.
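Page-by-page extraction amounts to a loop over page indexes. The sketch below is illustrative only: ChmPageReader and ReadPages are hypothetical names, and the commented wiring assumes that IPageTextExtractor exposes a PageCount property and an ExtractPage(int) method with zero-based page indexes, none of which is confirmed on this page. The helper takes the page count and a delegate so that it compiles without the library.

```csharp
using System;
using System.Collections.Generic;

static class ChmPageReader
{
    // Returns the text of every page in document order.
    // pageCount and extractPage stand in for the page count and the
    // per-page extraction call of the extractor (assumed, see wiring below).
    public static List<string> ReadPages(int pageCount, Func<int, string> extractPage)
    {
        var pages = new List<string>(pageCount);
        for (int pageIndex = 0; pageIndex < pageCount; pageIndex++)
        {
            pages.Add(extractPage(pageIndex));
        }
        return pages;
    }
}

// Hypothetical wiring (member names and constructor are assumptions):
// using (var extractor = new ChmTextExtractor("sample.chm"))
// {
//     List<string> pages = ChmPageReader.ReadPages(extractor.PageCount, extractor.ExtractPage);
//     Console.WriteLine(pages.Count + " pages extracted");
// }
```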

Extract Table of Contents from CHM Documents

Since version 18.3, you can also extract the table of contents (TOC) from CHM documents. To access the TOC, use the TableOfContents property of the ChmTextExtractor class. Each TOC item is represented by the TableOfContentsItem class, which exposes the following members.

Name             Description
Text             The text of the item. Usually, it is a chapter's title.
PageIndex        The page index of the text. Null if the item is just a node without content.
Count            The number of sub-items. Zero if the item has no sub-items.
this[int index]  Gets a sub-item.
ExtractPage      Extracts the text of the item.

Extracting TOC

The following code sample shows how to extract and print the TOC of a CHM document.
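Because TOC items can contain sub-items, printing the TOC is a depth-first walk over a tree. The sketch below shows one way to do that using only the Text, Count and this[int index] members listed in the table above. TocPrinter and Flatten are hypothetical names, and the three delegates stand in for those members so that the helper compiles without the library. The commented wiring assumes that the TableOfContents property can be enumerated as TableOfContentsItem values and that the extractor is constructed from a file path.

```csharp
using System;
using System.Collections.Generic;

static class TocPrinter
{
    // Walks the TOC depth-first and returns one line per item,
    // indented by two spaces per nesting level.
    public static List<string> Flatten<T>(
        IEnumerable<T> items,
        Func<T, string> text,      // stands in for item.Text
        Func<T, int> count,        // stands in for item.Count
        Func<T, int, T> child,     // stands in for item[index]
        int depth = 0)
    {
        var lines = new List<string>();
        foreach (T item in items)
        {
            lines.Add(new string(' ', depth * 2) + text(item));

            // Collect the sub-items through the indexer, then recurse.
            var children = new List<T>();
            for (int i = 0; i < count(item); i++)
            {
                children.Add(child(item, i));
            }
            lines.AddRange(Flatten(children, text, count, child, depth + 1));
        }
        return lines;
    }
}

// Hypothetical wiring (enumerability of TableOfContents is an assumption):
// using (var extractor = new ChmTextExtractor("sample.chm"))
// {
//     var lines = TocPrinter.Flatten(
//         extractor.TableOfContents,
//         item => item.Text,
//         item => item.Count,
//         (item, i) => item[i]);
//     lines.ForEach(Console.WriteLine);
// }
```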

Extracting Text of the TOC Item

The following code sample shows how to extract the text of a TOC item.
