How to get PDF document information and generate a preview

Extract text information from PDF

When working with a PDF document programmatically, annotations are not the only way to interact with its text; direct access to the text itself is just as important. The ability to scan the text and split it into pages, paragraphs and lines is an essential tool, and our .NET API provides this capability. Load a PDF document and you can retrieve its full text page by page, and even line by line, with just a few lines of code.

Example

Since version 21.3, the PageInfo structure has changed. You can take advantage of the new functionality by calling the GetDocumentInfo() method of the Document class.

Each page, represented by a PageInfo structure, now contains a list of TextLineInfo items. Every TextLineInfo holds the text's top and left indents, its width and height, and the text itself. In other words, each page is represented as a sequence of positioned text lines that you can read programmatically.

The code example below shows how to read data from these structures:

using (Annotator annotator = new Annotator("input.pdf"))
{
    IDocumentInfo documentInfo = annotator.Document.GetDocumentInfo();

    foreach (PageInfo page in documentInfo.PagesInfo)
    {
        // Here you can access PageInfo fields
        Console.WriteLine("Page number {0}, width: {1} and height: {2}", page.PageNumber, page.Width, page.Height);

        foreach (TextLineInfo textLine in page.TextLines)
        {
            // Here you can access TextLineInfo fields
            Console.WriteLine("\tText line. '{0}'", textLine.Text);
        }
    }
}
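
The introduction mentions splitting text into paragraphs as well as lines, while the structures above expose only lines. One possible way to recover paragraphs, sketched below, is to start a new paragraph whenever the vertical gap between consecutive lines exceeds a fraction of the line height. The tuple fields Top and Height stand in for the top indent and height that each TextLineInfo is described as carrying; the actual property names may differ, and the GroupIntoParagraphs helper and the 0.5 gap factor are assumptions made for illustration, not part of the API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: groups lines (sorted top to bottom) into paragraphs.
// Each tuple mirrors data a TextLineInfo is described as holding; the
// property names on the real type may differ.
static List<string> GroupIntoParagraphs(
    IEnumerable<(string Text, double Top, double Height)> lines,
    double gapFactor = 0.5)
{
    var paragraphs = new List<string>();
    var current = new List<string>();
    double? previousBottom = null;

    foreach (var line in lines)
    {
        // A gap larger than gapFactor * line height starts a new paragraph.
        bool startsNew = previousBottom.HasValue
            && line.Top - previousBottom.Value > gapFactor * line.Height;

        if (startsNew && current.Count > 0)
        {
            paragraphs.Add(string.Join(" ", current));
            current.Clear();
        }

        current.Add(line.Text);
        previousBottom = line.Top + line.Height;
    }

    // Flush the last paragraph, if any.
    if (current.Count > 0)
    {
        paragraphs.Add(string.Join(" ", current));
    }

    return paragraphs;
}

// Sample data: two tightly spaced lines, then a third after a larger gap.
var sample = new List<(string Text, double Top, double Height)>
{
    ("First line", 10, 12),
    ("continues here.", 24, 12),
    ("Second paragraph.", 50, 12),
};

foreach (string paragraph in GroupIntoParagraphs(sample))
{
    Console.WriteLine(paragraph);
}
// Prints:
// First line continues here.
// Second paragraph.
```

In a real application the sample list would be filled from page.TextLines inside the loop shown earlier. A fixed gap factor is a simplification; documents with mixed font sizes or multiple columns would likely need a more careful heuristic.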

Supported formats

Text retrieval is implemented for most supported formats: Word, PDF, Excel, Visio diagrams, PowerPoint presentations, HTML and email. It works directly on all of these formats except HTML (.htm, .html, etc.) and email (.eml, .msg, etc.), which are first converted to a Word document (.docx). As a result, the text parameters reported for HTML and email files correspond to those of their Word counterparts.
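
Because line positions for HTML and email files reflect the converted Word layout rather than the original, it can be useful to flag such files before relying on their coordinates. The sketch below shows one way to do that. The IsConvertedViaDocx helper is hypothetical, and the extension set contains only the four extensions named above, so it should be treated as illustrative rather than exhaustive.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Extensions named in the article as being converted to .docx first.
// The article lists them with "etc.", so this set is not exhaustive.
var convertedViaDocx = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    ".htm", ".html", ".eml", ".msg",
};

// Hypothetical helper, not part of the library: true when the file's
// text metrics would come from an intermediate Word conversion.
bool IsConvertedViaDocx(string path) =>
    convertedViaDocx.Contains(Path.GetExtension(path));

Console.WriteLine(IsConvertedViaDocx("newsletter.HTML")); // True
Console.WriteLine(IsConvertedViaDocx("report.pdf"));      // False
```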

Generate document preview

When annotating a document, it is very important to be able to see how the document would look in printed form. After all, most documents end up on paper. Of course, this can be achieved by standard means - opening the document and sending it to print. Modern operating systems usually show a preview of a document before printing it. But what if it needs to be done programmatically and much faster?

Our .NET API makes it possible. You can generate a preview right after annotating. This can be achieved by writing just a few lines of code:

using (Annotator annotator = new Annotator("input.pdf"))
{
    PreviewOptions previewOptions = new PreviewOptions(pageNumber =>
    {
        var pagePath = $"D:/result_{pageNumber}.png";
        return File.Create(pagePath);
    });
    previewOptions.PreviewFormat = PreviewFormats.PNG;
    previewOptions.PageNumbers = new int[] { 1, 2, 3, 4 };
    annotator.Document.GeneratePreview(previewOptions);
}
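
The snippet above writes pages to a hard-coded D:/ path, which will fail on machines without that drive. A minimal sketch of a more portable page-stream factory follows, using only standard .NET file APIs. The CreatePageStreamFactory helper and the annotation-previews folder name are hypothetical; the assumption is that the returned delegate is called from the lambda passed to PreviewOptions, as in the example above.

```csharp
using System;
using System.IO;

// Hypothetical helper: returns a factory that creates one PNG file per page
// inside targetDirectory, creating the directory first if it is missing.
static Func<int, Stream> CreatePageStreamFactory(string targetDirectory)
{
    Directory.CreateDirectory(targetDirectory);
    return pageNumber =>
    {
        string pagePath = Path.Combine(targetDirectory, $"result_{pageNumber}.png");
        return File.Create(pagePath);
    };
}

// Illustrative output location under the system temp folder.
string previewDirectory = Path.Combine(Path.GetTempPath(), "annotation-previews");
Func<int, Stream> createPageStream = CreatePageStreamFactory(previewDirectory);

// Assumed usage with the snippet above:
//   new PreviewOptions(pageNumber => createPageStream(pageNumber));
// Here the factory is exercised on its own to show that the file is created.
using (Stream stream = createPageStream(1))
{
    Console.WriteLine(File.Exists(Path.Combine(previewDirectory, "result_1.png"))); // True
}
```

Whoever receives a stream from the factory is responsible for closing it; the using block here does so for the demonstration call.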

The preview generator provides more properties and settings than shown above. It is far more configurable, but covering every detail is beyond the scope of this article.

Conclusion

In short, you have learned how to extract data from a PDF document within .NET applications. You have also seen how to generate a preview of any PDF file. You should now feel confident building your own document annotator .NET application.

More resources

Advanced Usage Topics

To learn more about document annotation features, please refer to the advanced usage section.

Free Online App

Along with the full-featured .NET library, we provide simple but powerful free apps. You are welcome to annotate your PDF, DOC or DOCX, XLS or XLSX, PPT or PPTX, PNG and other documents with the free online GroupDocs Annotation App.