Extract text from images and PDFs

GroupDocs.Parser for .NET can extract text from image files and from PDFs composed of images by using optical character recognition (OCR).

Note
To use the OCR functionality in .NET Framework, set PlatformTarget to x64. If the downloadable (MSI or ZIP) version of GroupDocs.Parser is used, see the readme.txt file for additional information.
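
As a minimal sketch of the PlatformTarget setting mentioned in the note, the fragment below shows the standard MSBuild property as it would typically appear in a project file. It assumes an ordinary .csproj; the placement inside an existing PropertyGroup is illustrative, and the same setting can usually be changed from the project's build properties in Visual Studio instead.

```xml
<!-- Illustrative .csproj fragment: build the project as a 64-bit target -->
<PropertyGroup>
  <PlatformTarget>x64</PlatformTarget>
</PropertyGroup>
```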

Not all languages represented by the Language class can currently be recognized without downloading additional resources from the internet. If internet access is available, the necessary resources are downloaded automatically when a recognition language is selected. Languages currently supported without additional downloads: English, Chinese, Japanese, Korean, and Arabic.

The following example shows how to extract text from images and PDFs:

// Create an instance of Parser class
using (Parser parser = new Parser("scanned.pdf"))
{
    // Set OCR options: the second constructor argument enables OCR
    TextOptions options = new TextOptions(false, true);
    options.OcrOptions = new OcrOptions();
    // Set the recognition language
    options.OcrOptions.Language = Language.Chinese;
    // Set the page preview resolution (DPI)
    options.OcrOptions.PagePreviewOptions = new PagePreviewOptions();
    options.OcrOptions.PagePreviewOptions.Dpi = 144;
    // Extract text using OCR
    using (TextReader reader = parser.GetText(options))
    {
        // Print text or 'not supported' message
        Console.WriteLine(reader == null ? "Text extraction isn't supported" : reader.ReadToEnd());
    }
}

TextOptions can be omitted if the file is an image:

// Create an instance of Parser class
using (Parser parser = new Parser("scanned.jpg"))
{
    // Extract text using OCR
    using (TextReader reader = parser.GetText())
    {
        // Print text or 'not supported' message
        Console.WriteLine(reader == null ? "Text extraction isn't supported" : reader.ReadToEnd());
    }
}
