GroupDocs.Parser allows you to extract hyperlinks from specific pages of a document, enabling page-by-page hyperlink processing.
Prerequisites
GroupDocs.Parser for Python via .NET installed
Sample documents with hyperlinks
Understanding of page indexing (zero-based)
Extract hyperlinks from a specific page
To extract hyperlinks from a particular page:
from groupdocs.parser import Parser

# Create an instance of Parser class
with Parser("./document.pdf") as parser:
    # Check if hyperlink extraction is supported
    if not parser.features.hyperlinks:
        print("Hyperlink extraction not supported")
    else:
        # Extract hyperlinks from page 0 (first page)
        page_index = 0
        hyperlinks = parser.get_hyperlinks(page_index)
        if hyperlinks:
            print(f"Hyperlinks on page {page_index + 1}:")
            for link in hyperlinks:
                print(f"  Text: {link.text}")
                print(f"  URL: {link.url}")
                print()
The following sample file is used in this example: document.pdf
Expected behavior: Returns only hyperlinks from the specified page.
Extract hyperlinks from all pages
To process hyperlinks page by page:
from groupdocs.parser import Parser

# Create an instance of Parser class
with Parser("report.docx") as parser:
    if not parser.features.hyperlinks:
        print("Hyperlink extraction not supported")
    else:
        # Get document info
        info = parser.get_document_info()
        if info.page_count == 0:
            print("Document has no pages")
        else:
            # Iterate over pages
            total_links = 0
            for page_index in range(info.page_count):
                print(f"=== Page {page_index + 1}/{info.page_count} ===")
                # Extract hyperlinks from current page
                hyperlinks = parser.get_hyperlinks(page_index)
                if hyperlinks:
                    page_links = list(hyperlinks)
                    print(f"Found {len(page_links)} hyperlinks:")
                    for link in page_links:
                        print(f"  • {link.text} -> {link.url}")
                    total_links += len(page_links)
                else:
                    print("No hyperlinks on this page")
            print(f"Total hyperlinks: {total_links}")
The following sample file is used in this example: report.docx
Expected behavior: Iterates through all pages and extracts hyperlinks from each page individually.
Export page hyperlinks to JSON
To export hyperlinks organized by page:
from groupdocs.parser import Parser
import json

def export_page_hyperlinks(file_path, output_json):
    """
    Extract hyperlinks by page and export to JSON.
    """
    with Parser(file_path) as parser:
        if not parser.features.hyperlinks:
            print("Hyperlink extraction not supported")
            return False
        info = parser.get_document_info()
        # Build structure
        document_links = {
            'file': file_path,
            'total_pages': info.page_count,
            'pages': []
        }
        for page_index in range(info.page_count):
            hyperlinks = parser.get_hyperlinks(page_index)
            page_data = {
                'page_number': page_index + 1,
                'hyperlinks': []
            }
            if hyperlinks:
                for link in hyperlinks:
                    page_data['hyperlinks'].append({
                        'text': link.text or '',
                        'url': link.url or ''
                    })
            document_links['pages'].append(page_data)
        # Save to JSON
        with open(output_json, 'w', encoding='utf-8') as f:
            json.dump(document_links, f, indent=2, ensure_ascii=False)
        print(f"Hyperlinks exported to {output_json}")
        return True

# Usage
export_page_hyperlinks("document.pdf", "hyperlinks_by_page.json")
The following sample file is used in this example: document.pdf
Expected behavior: Creates a JSON file with hyperlinks organized by page number.
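Once exported, the file can be inspected with the standard library alone. The snippet below is a small usage sketch that reads the JSON produced by export_page_hyperlinks above; the query shown (finding the page with the most links) is purely illustrative.

import json

# Load the JSON written by export_page_hyperlinks
with open("hyperlinks_by_page.json", "r", encoding="utf-8") as f:
    document_links = json.load(f)

# Illustrative query: which page carries the most hyperlinks?
busiest = max(document_links['pages'], key=lambda p: len(p['hyperlinks']), default=None)
if busiest and busiest['hyperlinks']:
    print(f"Page {busiest['page_number']} has the most links: {len(busiest['hyperlinks'])}")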
Find pages with external links
To identify pages containing external links:
from groupdocs.parser import Parser

def find_pages_with_external_links(file_path):
    """
    Find all pages containing external (http/https) links.
    """
    with Parser(file_path) as parser:
        if not parser.features.hyperlinks:
            print("Hyperlink extraction not supported")
            return []
        info = parser.get_document_info()
        pages_with_external_links = []
        for page_index in range(info.page_count):
            hyperlinks = parser.get_hyperlinks(page_index)
            if hyperlinks:
                has_external = False
                external_links = []
                for link in hyperlinks:
                    if link.url and (link.url.startswith('http://') or link.url.startswith('https://')):
                        has_external = True
                        external_links.append(link.url)
                if has_external:
                    pages_with_external_links.append({
                        'page': page_index + 1,
                        'links': external_links
                    })
        return pages_with_external_links

# Usage
pages = find_pages_with_external_links("article.pdf")
print(f"Pages with external links: {len(pages)}")
for page_info in pages:
    print(f"Page {page_info['page']}: {len(page_info['links'])} external links")
    for url in page_info['links'][:3]:  # Show first 3
        print(f"  - {url}")
The following sample file is used in this example: article.pdf
Expected behavior: Returns a list of pages containing external hyperlinks.
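The same per-page loop can be adapted to classify links by URL scheme rather than only flagging external ones. The sketch below uses urllib.parse from the standard library and assumes, as in the example above, that link.url may be empty for internal links; the file name is illustrative.

from collections import Counter
from urllib.parse import urlparse

from groupdocs.parser import Parser

def count_link_schemes(file_path):
    """Count hyperlinks per URL scheme (http, https, mailto, ...) across all pages."""
    schemes = Counter()
    with Parser(file_path) as parser:
        if not parser.features.hyperlinks:
            return schemes
        info = parser.get_document_info()
        for page_index in range(info.page_count):
            # Pages without hyperlinks return an empty collection (see Notes below)
            for link in parser.get_hyperlinks(page_index):
                scheme = urlparse(link.url).scheme if link.url else "internal"
                schemes[scheme or "other"] += 1
    return schemes

# Usage
print(count_link_schemes("article.pdf"))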
Compare hyperlinks across pages
To analyze hyperlink distribution:
from groupdocs.parser import Parser
from collections import defaultdict
from urllib.parse import urlparse

def analyze_hyperlink_distribution(file_path):
    """
    Analyze how hyperlinks are distributed across pages.
    """
    with Parser(file_path) as parser:
        if not parser.features.hyperlinks:
            print("Hyperlink extraction not supported")
            return
        info = parser.get_document_info()
        # Collect statistics
        stats = []
        domain_frequency = defaultdict(int)
        for page_index in range(info.page_count):
            hyperlinks = parser.get_hyperlinks(page_index)
            if hyperlinks:
                link_list = list(hyperlinks)
                page_stat = {
                    'page': page_index + 1,
                    'count': len(link_list),
                    'unique_urls': len(set(link.url for link in link_list))
                }
                stats.append(page_stat)
                # Count domains
                for link in link_list:
                    if link.url:
                        domain = urlparse(link.url).netloc or "internal"
                        domain_frequency[domain] += 1
        # Print analysis
        print("Hyperlink Distribution Analysis:")
        print(f"{'Page':<10}{'Links':<10}{'Unique URLs':<15}")
        print("-" * 35)
        for stat in stats:
            print(f"{stat['page']:<10}{stat['count']:<10}{stat['unique_urls']:<15}")
        print("Top domains:")
        for domain, count in sorted(domain_frequency.items(), key=lambda x: x[1], reverse=True)[:5]:
            print(f"  {domain}: {count}")

# Usage
analyze_hyperlink_distribution("documentation.pdf")
The following sample file is used in this example: documentation.pdf
Expected behavior: Provides statistical analysis of hyperlink distribution across pages.
Extract hyperlinks from page range
To extract hyperlinks from a range of pages:
from groupdocs.parser import Parser

def extract_hyperlinks_from_range(file_path, start_page, end_page):
    """
    Extract hyperlinks from a range of pages.

    Args:
        file_path: Path to document
        start_page: Starting page number (1-based)
        end_page: Ending page number (1-based, inclusive)
    """
    with Parser(file_path) as parser:
        if not parser.features.hyperlinks:
            print("Hyperlink extraction not supported")
            return []
        info = parser.get_document_info()
        # Validate range
        start_idx = max(0, start_page - 1)
        end_idx = min(info.page_count - 1, end_page - 1)
        all_links = []
        for page_index in range(start_idx, end_idx + 1):
            hyperlinks = parser.get_hyperlinks(page_index)
            if hyperlinks:
                for link in hyperlinks:
                    all_links.append({
                        'page': page_index + 1,
                        'text': link.text,
                        'url': link.url
                    })
        return all_links

# Usage - extract hyperlinks from pages 1-5
links = extract_hyperlinks_from_range("document.pdf", 1, 5)
print(f"Found {len(links)} hyperlinks in pages 1-5")
for link in links:
    print(f"Page {link['page']}: {link['text']} -> {link['url']}")
The following sample file is used in this example: document.pdf
Expected behavior: Extracts hyperlinks only from the specified page range.
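If the extracted range needs to be shared outside Python, the list of dictionaries returned by extract_hyperlinks_from_range can be written out with the standard csv module. A small usage sketch follows; the output file name is arbitrary.

import csv

# Reuse the function defined above
links = extract_hyperlinks_from_range("document.pdf", 1, 5)

with open("hyperlinks_pages_1_5.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["page", "text", "url"])
    writer.writeheader()
    writer.writerows(links)

print(f"Wrote {len(links)} rows to hyperlinks_pages_1_5.csv")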
Notes
Page indices are zero-based (first page is index 0)
Use get_document_info() to determine total page count
Check parser.features.hyperlinks before extraction
Empty collections are returned for pages without hyperlinks (not None), as illustrated in the sketch after these notes
Page-by-page extraction is memory-efficient for large documents
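A minimal sketch that ties these notes together; the file name is illustrative, and the API calls are the same ones used in the examples above.

from groupdocs.parser import Parser

with Parser("document.pdf") as parser:
    # Check support before extracting (see notes above)
    if parser.features.hyperlinks:
        # Use get_document_info() to determine the total page count
        info = parser.get_document_info()
        for page_index in range(info.page_count):  # zero-based page indices
            # Pages without hyperlinks yield an empty collection, not None
            links = list(parser.get_hyperlinks(page_index))
            print(f"Page {page_index + 1}: {len(links)} hyperlink(s)")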