GroupDocs.Parser lets you extract text from PDF files, emails, e-books, Microsoft Office formats such as Word (DOC, DOCX), PowerPoint (PPT, PPTX), and Excel (XLS, XLSX), LibreOffice formats, and many others (see the full list in the supported document formats article).
The GroupDocs.Parser text extractor is easy to use yet powerful; for complex scenarios, see the advanced usage section.
This article demonstrates the simplest scenario: extracting text from any supported format without additional settings.
Extract text from documents
To extract text from a document, call the GetText method:
TextReader GetText();
The method returns an instance of the TextReader class that contains the extracted text, or null if text extraction isn't supported for the document.
Here are the steps to extract text from a document:
Instantiate a Parser object for the document;
Call the GetText method to obtain a TextReader;
Check that the reader isn't null (a null reader means text extraction isn't supported for the document);
Read the text from the reader.
The following example shows how to extract text from a document:
using System;
using System.IO;
using GroupDocs.Parser;

// Create an instance of the Parser class
using (Parser parser = new Parser(filePath))
{
    // Extract the text into the reader
    using (TextReader reader = parser.GetText())
    {
        // Print the text from the document
        // If text extraction isn't supported, the reader is null
        Console.WriteLine(reader == null ? "Text extraction isn't supported" : reader.ReadToEnd());
    }
}
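For large documents it may be preferable to process the extracted text line by line instead of loading it all at once with ReadToEnd. The following is a minimal sketch of a null-safe helper for that purpose. The ExtractedText class and its ReadLines method are hypothetical names introduced here for illustration and are not part of GroupDocs.Parser; the helper relies only on the standard TextReader.ReadLine method and treats a null reader as "extraction not supported".

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not part of GroupDocs.Parser): reads all lines
// from a TextReader and tolerates a null reader.
public static class ExtractedText
{
    public static IReadOnlyList<string> ReadLines(TextReader reader)
    {
        var lines = new List<string>();

        // A null reader means text extraction isn't supported,
        // so return an empty list instead of throwing.
        if (reader == null)
        {
            return lines;
        }

        // ReadLine returns null once the end of the text is reached.
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            lines.Add(line);
        }

        return lines;
    }
}
```

Under these assumptions, the reader returned by parser.GetText() in the example above could be passed directly to ExtractedText.ReadLines(reader). An empty result then means either that the document contains no text or that extraction isn't supported, so callers that need to tell the two cases apart should still check the reader for null themselves.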