Welcome to GroupDocs.Parser for .NET. GroupDocs.Parser is a text extraction API for pulling raw or formatted text out of a range of document formats. It is not limited to text: it can also extract document metadata.
GroupDocs.Parser for .NET
Overview
GroupDocs.Parser for .NET is a powerful document data extraction API that enables developers to parse and extract information from over 50 document types. This robust library allows you to extract text, metadata, images, tables, and other structured data from various file formats without requiring additional software installations.
Key Highlights
50+ Document Formats - Support for PDF, Office documents, images, archives, and more
Template-Based Parsing - Extract structured data using predefined templates
Multiple Extraction Modes - Raw text, formatted text, and precise data extraction
Enterprise Ready - Scalable solution for high-volume document processing
Cross-Platform - Compatible with .NET Framework, .NET Core, and .NET 5+
What Makes GroupDocs.Parser Unique?
Template-Driven Extraction
One of the most powerful features is parsing documents with predefined templates. Easily define templates to extract specific data from invoices, receipts, contracts, and other structured documents with high precision.
Versatile Text Extraction
Raw Text Mode - Fast extraction for basic text processing
Formatted Text Mode - Preserve document formatting and structure
Search & Highlight - Advanced text search with regex support and keyword highlighting
Extract attachments from containers and email files
Process PDF forms and fillable fields
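The formatted-text and search modes listed above might be used along the following lines. This is a minimal sketch rather than a definitive implementation: the file name `sample.docx` and the date-like search pattern are placeholders, and the sketch assumes the `GetFormattedText` and `Search` methods together with the `FormattedTextOptions` and `SearchOptions` types from the `GroupDocs.Parser.Options` namespace. Check the API reference for the version you have installed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Options;

class TextModesSketch
{
    static void Main()
    {
        // "sample.docx" is a placeholder path, not a file shipped with the library.
        using (Parser parser = new Parser("sample.docx"))
        {
            // Formatted text mode: request Markdown so headings and lists survive.
            FormattedTextOptions options = new FormattedTextOptions(FormattedTextMode.Markdown);
            using (TextReader reader = parser.GetFormattedText(options))
            {
                // A null reader means formatted extraction isn't supported for this format.
                Console.WriteLine(reader == null
                    ? "Formatted text extraction isn't supported for this format"
                    : reader.ReadToEnd());
            }

            // Search with a regular expression.
            // Assumed argument order: matchCase, matchWholeWord, useRegularExpression.
            IEnumerable<SearchResult> results =
                parser.Search(@"\d{4}-\d{2}-\d{2}", new SearchOptions(false, false, true));

            if (results == null)
            {
                Console.WriteLine("Search isn't supported for this format");
                return;
            }

            foreach (SearchResult result in results)
            {
                Console.WriteLine($"Found \"{result.Text}\" at position {result.Position}");
            }
        }
    }
}
```

Both calls are checked for `null` before use, following the same pattern as the raw-text check elsewhere in this document.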
Installation
Install via NuGet Package Manager:
Install-Package GroupDocs.Parser
Or via .NET CLI:
dotnet add package GroupDocs.Parser
Quick Start Example
    using System;
    using System.IO;
    using GroupDocs.Parser;

    // Create an instance of Parser class
    using (Parser parser = new Parser("sample.pdf"))
    {
        // Extract text from the document
        using (TextReader reader = parser.GetText())
        {
            // Check if text extraction is supported
            if (reader != null)
            {
                // Print the extracted text
                Console.WriteLine(reader.ReadToEnd());
            }
            else
            {
                Console.WriteLine("Text extraction isn't supported for this format");
            }
        }
    }
Core Features
Text Extraction
Raw and formatted text extraction
Page-wise text extraction
Text areas extraction with coordinates
Search text with advanced options (case sensitivity, whole words, regex)
Content Indexing - Extract and index document content for search engines
Document Classification - Categorize documents based on extracted content
Legal & Compliance
eDiscovery Solutions - Extract and analyze text from legal documents
Contract Analysis - Parse contracts for key terms and clauses
Regulatory Compliance - Extract data for compliance reporting
Financial Services
Invoice Processing - Parse invoices, receipts, and financial documents
Automated Data Entry - Extract structured data from forms
Financial Document Analysis - Process statements and reports
Data Migration & Integration
Legacy System Migration - Extract content from old document formats
System Integration - Parse documents for API integrations
Data Transformation - Convert unstructured data to structured formats
Advanced Features
Template-Based Data Extraction
Create custom templates to extract specific data fields:
    using GroupDocs.Parser;
    using GroupDocs.Parser.Data;
    using GroupDocs.Parser.Templates;

    // Create a template with fixed positions
    Template template = new Template(new TemplateItem[]
    {
        new TemplateField(
            new TemplateFixedPosition(new Rectangle(new Point(35, 135), new Size(100, 10))),
            "CompanyName"),
        new TemplateField(
            new TemplateFixedPosition(new Rectangle(new Point(35, 150), new Size(100, 10))),
            "InvoiceNumber")
    });

    // Parse document using template
    using (Parser parser = new Parser("invoice.pdf"))
    {
        DocumentData data = parser.ParseByTemplate(template);
        // Process extracted data
    }
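The "process extracted data" step could look something like the sketch below, which walks the returned `DocumentData` and prints each field. It assumes that `DocumentData` exposes `Count` and an indexer returning `FieldData`, and that text fields carry a `PageTextArea`; the `null` check on the result is a defensive assumption for formats that don't support template parsing. The single-field template and `invoice.pdf` are illustrative only.

```csharp
using System;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Templates;

class TemplateResultsSketch
{
    static void Main()
    {
        // One fixed-position field, reusing the coordinates from the template example.
        Template template = new Template(new TemplateItem[]
        {
            new TemplateField(
                new TemplateFixedPosition(new Rectangle(new Point(35, 135), new Size(100, 10))),
                "CompanyName")
        });

        // "invoice.pdf" is a placeholder path.
        using (Parser parser = new Parser("invoice.pdf"))
        {
            DocumentData data = parser.ParseByTemplate(template);
            if (data == null)
            {
                // Assumption: a null result signals that template parsing is unsupported.
                Console.WriteLine("Parsing by template isn't supported for this format");
                return;
            }

            for (int i = 0; i < data.Count; i++)
            {
                FieldData field = data[i];

                // Only text areas carry a Text value; other area types are reported as such.
                PageTextArea area = field.PageArea as PageTextArea;
                Console.WriteLine(area == null
                    ? $"{field.Name}: (not a text field)"
                    : $"{field.Name}: {area.Text}");
            }
        }
    }
}
```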
Batch Processing
Process multiple documents efficiently:
    using System.IO;
    using GroupDocs.Parser;

    string[] files = Directory.GetFiles(@"C:\Documents", "*.*");
    foreach (string file in files)
    {
        using (Parser parser = new Parser(file))
        {
            // Extract and process data from each file
            using (TextReader reader = parser.GetText())
            {
                if (reader != null)
                {
                    string content = reader.ReadToEnd();
                    // Process the content
                }
            }
        }
    }
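Because a `*.*` pattern will also pick up files the library cannot open, a batch run may want to isolate failures per file. The variation below is one possible sketch, not the library's prescribed approach: it catches the general `Exception` type on purpose, since the specific exception classes thrown for unsupported or corrupt files are not described in this document, and the `C:\Documents` path is carried over from the example above.

```csharp
using System;
using System.IO;
using GroupDocs.Parser;

class BatchSketch
{
    static void Main()
    {
        // EnumerateFiles yields paths lazily instead of building the whole array first.
        foreach (string file in Directory.EnumerateFiles(@"C:\Documents"))
        {
            try
            {
                using (Parser parser = new Parser(file))
                using (TextReader reader = parser.GetText())
                {
                    if (reader == null)
                    {
                        Console.WriteLine($"Skipped (text extraction not supported): {file}");
                        continue;
                    }

                    string content = reader.ReadToEnd();
                    Console.WriteLine($"{file}: {content.Length} characters extracted");
                }
            }
            catch (Exception ex)
            {
                // Broad catch is deliberate here: one bad file should not stop the batch.
                Console.WriteLine($"Failed: {file} ({ex.Message})");
            }
        }
    }
}
```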
Performance & Scalability
High Performance - Optimized for processing large documents
Memory Efficient - Stream-based processing for large files
Thread Safe - Support for multi-threaded applications
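To illustrate the stream-based point, the sketch below opens a document from a `Stream` and reads it one page at a time, so only a single page of text is held in memory at once. It assumes the `Parser(Stream)` constructor, the `GetText(int pageIndex)` overload, and a `PageCount` property on the object returned by `GetDocumentInfo()`; `large.pdf` is a placeholder. Actual memory behavior depends on the format and the library version.

```csharp
using System;
using System.IO;
using GroupDocs.Parser;

class StreamSketch
{
    static void Main()
    {
        // "large.pdf" is a placeholder; any readable Stream could be passed instead of a file.
        using (Stream stream = File.OpenRead("large.pdf"))
        using (Parser parser = new Parser(stream))
        {
            // Assumption: the returned document info exposes PageCount.
            var info = parser.GetDocumentInfo();

            for (int page = 0; page < info.PageCount; page++)
            {
                using (TextReader reader = parser.GetText(page))
                {
                    if (reader == null)
                    {
                        Console.WriteLine("Page-wise text extraction isn't supported for this format");
                        return;
                    }

                    // Read line by line rather than materializing the whole page as one string.
                    int characters = 0;
                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        characters += line.Length;
                    }

                    Console.WriteLine($"Page {page + 1}: {characters} characters");
                }
            }
        }
    }
}
```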