Getting information about document content

Since version 21.3, the PageInfo structure has changed. You can now get information not only about the width and height of each page but also about its text content. Page numbering has also been added to the PageInfo structure. To use the new functionality, call the GetDocumentInfo() method of the Document class.

Each page, represented by a PageInfo structure, now contains a list of TextLineInfo objects. Every TextLineInfo holds the text itself together with its top and left indents, width and height. In other words, each page is represented as a sequence of text lines, and every text line is described by its parameters (width, height and indents).

All numerical values are given in pixels relative to the top-left corner of the page.

The code example below shows how to read data from the structures described above:

using (Annotator annotator = new Annotator("input.docx"))
{
    IDocumentInfo documentInfo = annotator.Document.GetDocumentInfo();

    foreach (PageInfo page in documentInfo.PagesInfo)
    {
        //Here you can access PageInfo fields
        Console.WriteLine("Page number {0}, width: {1} and height: {2}", page.PageNumber, page.Width, page.Height);

        foreach (TextLineInfo textLine in page.TextLines)
        {
            //Here you can access TextLineInfo fields
            Console.WriteLine("\tText line. '{0}'", textLine.Text);
            Console.WriteLine("\t\tText width {0} and height {1}. Top indent: {2}, left indent: {3}", 
                textLine.Width, textLine.Height, textLine.TopIndent, textLine.LeftIndent);
        }
    }
}

Supported formats

The ability to retrieve text information is implemented for most supported formats: Word, PDF, Excel, Visio diagrams, PowerPoint presentations, HTML and email. Text retrieval works directly on all of these formats except HTML (.htm, .html, etc.) and email (.eml, .msg, etc.), which are first converted to a Word document (.docx). Therefore, the text parameters for these formats correspond to those of their Word counterparts.

How it works

The example below shows how these parameters are calculated:

Example of how text parameters are calculated

The example above is schematic, but fairly accurate.

Imagine you have a block of text like the one in the picture. Each text line can be thought of as enclosed in a rectangle. The width and height of this rectangle are the width and height of the text, and the distances from the top-left corner of the page to the top and left edges of the rectangle are the top and left indents.

In the illustration above, the possible text rectangles are marked in black. The arrows indicate corresponding text parameters.
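
The rectangle described above can be derived from the four text parameters with simple arithmetic. The following is a minimal sketch, not part of the GroupDocs API: TextLineRect is a hypothetical helper type, and its use of double values is an assumption about how the indents and sizes are represented.

```csharp
using System;

// Hypothetical helper (not part of the GroupDocs API): turns the four values
// reported for a text line into the edges of its bounding rectangle.
// All values are in pixels, measured from the top-left corner of the page.
public readonly struct TextLineRect
{
    public double Left { get; }
    public double Top { get; }
    public double Right { get; }
    public double Bottom { get; }

    public TextLineRect(double leftIndent, double topIndent, double width, double height)
    {
        Left = leftIndent;            // distance from the left edge of the page
        Top = topIndent;              // distance from the top edge of the page
        Right = leftIndent + width;   // right edge of the rectangle
        Bottom = topIndent + height;  // bottom edge of the rectangle
    }

    // Returns true when the point (x, y), in page pixels, lies inside the rectangle.
    public bool Contains(double x, double y)
    {
        return x >= Left && x <= Right && y >= Top && y <= Bottom;
    }
}
```

Assuming the TextLineInfo values are numeric, such a rectangle could be built with new TextLineRect(textLine.LeftIndent, textLine.TopIndent, textLine.Width, textLine.Height), for example to check whether a given point on the page falls within a text line.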

More resources

GitHub Examples

You may easily run the code above and see the feature in action in our GitHub examples:

Free Online App

Along with the full-featured .NET library, we provide simple but powerful free apps. You are welcome to annotate your PDF, DOC or DOCX, XLS or XLSX, PPT or PPTX, PNG and other documents with the free online GroupDocs Annotation App.