GroupDocs.Parser provides functionality to extract hyperlinks from documents, including their text, URL, and position information.
Prerequisites
GroupDocs.Parser for Python via .NET installed
Sample documents containing hyperlinks
Basic understanding of hyperlink structure
Extract hyperlinks from a document
To extract all hyperlinks from a document:
fromgroupdocs.parserimportParser# Create an instance of Parser classwithParser("./sample.pdf")asparser:# Check if hyperlink extraction is supportedifnotparser.features.hyperlinks:print("Document doesn't support hyperlink extraction")return# Extract hyperlinks from the documenthyperlinks=parser.get_hyperlinks()ifhyperlinks:# Iterate over hyperlinksforhyperlinkinhyperlinks:# Print the hyperlink textprint(f"Text: {hyperlink.text}")# Print the hyperlink URLprint(f"URL: {hyperlink.url}")# Print page numberprint(f"Page: {hyperlink.page.index+1}")print()
The following sample file is used in this example: sample.pdf
Expected behavior: Returns a collection of PageHyperlinkArea objects representing all hyperlinks found in the document, or None if hyperlink extraction is not supported.
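The call shown above walks the whole document. The underlying .NET API also exposes a GetHyperlinks(int pageIndex) overload for page-by-page processing; the sketch below assumes the Python wrapper mirrors it as get_hyperlinks(page_index) and exposes the page count through get_document_info().page_count, so verify both names against the API reference.

from groupdocs.parser import Parser

# A minimal sketch, assuming get_hyperlinks(page_index) and
# get_document_info().page_count are available in the Python wrapper
# (they mirror the .NET API; verify against the API reference).
with Parser("./sample.pdf") as parser:
    if parser.features.hyperlinks:
        info = parser.get_document_info()
        for page_index in range(info.page_count):
            page_links = parser.get_hyperlinks(page_index)  # assumed overload
            count = sum(1 for _ in page_links) if page_links else 0
            print(f"Page {page_index + 1}: {count} hyperlink(s)")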
Extract hyperlinks with position information
To extract hyperlinks with detailed position data:
fromgroupdocs.parserimportParser# Create an instance of Parser classwithParser("document.docx")asparser:# Check if hyperlink extraction is supportedifnotparser.features.hyperlinks:print("Hyperlink extraction not supported")return# Extract hyperlinkshyperlinks=parser.get_hyperlinks()ifhyperlinks:foridx,hyperlinkinenumerate(hyperlinks):print(f"Hyperlink {idx+1}:")print(f" Text: {hyperlink.text}")print(f" URL: {hyperlink.url}")print(f" Page: {hyperlink.page.index+1}")print(f" Position: ({hyperlink.rectangle.left}, {hyperlink.rectangle.top})")print(f" Size: {hyperlink.rectangle.width}x{hyperlink.rectangle.height}")print()
The following sample file is used in this example: document.docx
Expected behavior: Displays comprehensive information about each hyperlink including text, URL, page number, and position coordinates.
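Since each hyperlink carries its page index and bounding rectangle, you can also reconstruct a rough reading order. The following sketch relies only on the properties shown above and groups hyperlinks by page, sorted top to bottom:

from groupdocs.parser import Parser
from collections import defaultdict

# Group hyperlinks by page and sort them top-to-bottom using the
# rectangle coordinates already shown in this example
with Parser("document.docx") as parser:
    if parser.features.hyperlinks:
        hyperlinks = parser.get_hyperlinks()
        if hyperlinks:
            by_page = defaultdict(list)
            for hyperlink in hyperlinks:
                by_page[hyperlink.page.index].append(hyperlink)
            for page_index in sorted(by_page):
                # Sort by vertical, then horizontal position
                ordered = sorted(by_page[page_index],
                                 key=lambda h: (h.rectangle.top, h.rectangle.left))
                print(f"Page {page_index + 1}:")
                for hyperlink in ordered:
                    print(f"  ({hyperlink.rectangle.left}, {hyperlink.rectangle.top}) {hyperlink.url}")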
Extract and validate URLs
To extract hyperlinks and validate URL formats:
from groupdocs.parser import Parser
import re

def is_valid_url(url):
    """
    Basic URL validation.
    """
    url_pattern = re.compile(
        r'^(https?|ftp)://'   # Protocol
        r'([a-zA-Z0-9.-]+)'   # Domain
        r'(:[0-9]+)?'         # Optional port
        r'(/.*)?$'            # Optional path
    )
    return url_pattern.match(url) is not None

# Create an instance of Parser class
with Parser("webpage.html") as parser:
    if not parser.features.hyperlinks:
        print("Hyperlink extraction not supported")
    else:
        # Extract hyperlinks
        hyperlinks = parser.get_hyperlinks()
        if hyperlinks:
            valid_links = []
            invalid_links = []
            for hyperlink in hyperlinks:
                # Treat hyperlinks without a URL as invalid
                if hyperlink.url and is_valid_url(hyperlink.url):
                    valid_links.append(hyperlink)
                else:
                    invalid_links.append(hyperlink)
            print(f"Valid hyperlinks: {len(valid_links)}")
            print(f"Invalid hyperlinks: {len(invalid_links)}")
            print("\nValid URLs:")
            for link in valid_links[:10]:  # Show first 10
                print(f"  {link.text} -> {link.url}")
The following sample file is used in this example: webpage.html
Expected behavior: Categorizes hyperlinks into valid and invalid based on URL format validation.
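If you prefer to avoid hand-written regular expressions, urlparse from the Python standard library can perform an equivalent check. This is a sketch of a drop-in replacement for is_valid_url above:

from urllib.parse import urlparse

def is_valid_url(url):
    """Alternative validation using the standard library instead of a regex."""
    if not url:
        return False
    parsed = urlparse(url)
    # Require a known scheme and a network location (domain)
    return parsed.scheme in ("http", "https", "ftp") and bool(parsed.netloc)

print(is_valid_url("https://example.com/page"))  # True
print(is_valid_url("not a url"))                 # False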
Export hyperlinks to CSV
To export hyperlinks to a CSV file:
from groupdocs.parser import Parser
import csv

def export_hyperlinks_to_csv(file_path, output_csv):
    """
    Extract hyperlinks and save to CSV file.
    """
    with Parser(file_path) as parser:
        if not parser.features.hyperlinks:
            print("Hyperlink extraction not supported")
            return False

        hyperlinks = parser.get_hyperlinks()
        if not hyperlinks:
            print("No hyperlinks found")
            return False

        # Write to CSV
        with open(output_csv, 'w', newline='', encoding='utf-8') as csvfile:
            fieldnames = ['Text', 'URL', 'Page', 'X', 'Y', 'Width', 'Height']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()

            for hyperlink in hyperlinks:
                writer.writerow({
                    'Text': hyperlink.text or '',
                    'URL': hyperlink.url or '',
                    'Page': hyperlink.page.index + 1,
                    'X': hyperlink.rectangle.left,
                    'Y': hyperlink.rectangle.top,
                    'Width': hyperlink.rectangle.width,
                    'Height': hyperlink.rectangle.height
                })

        print(f"Hyperlinks exported to {output_csv}")
        return True

# Usage
export_hyperlinks_to_csv("document.pdf", "hyperlinks.csv")
The following sample file is used in this example: document.pdf
The extracted hyperlinks are written to the output file: hyperlinks.csv
Expected behavior: Creates a CSV file containing all hyperlinks with their properties.
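To verify the export, the generated file can be read back with the standard csv module. A minimal sketch, assuming hyperlinks.csv was produced by export_hyperlinks_to_csv above:

import csv

# Read the exported CSV back and print a quick summary
with open("hyperlinks.csv", newline='', encoding='utf-8') as csvfile:
    rows = list(csv.DictReader(csvfile))

print(f"Rows: {len(rows)}")
pages = {row['Page'] for row in rows}
print(f"Pages with hyperlinks: {sorted(pages, key=int)}")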
Group hyperlinks by domain
To analyze hyperlink distribution by domain:
from groupdocs.parser import Parser
from urllib.parse import urlparse
from collections import defaultdict

def analyze_hyperlinks_by_domain(file_path):
    """
    Group hyperlinks by domain and show statistics.
    """
    with Parser(file_path) as parser:
        if not parser.features.hyperlinks:
            print("Hyperlink extraction not supported")
            return

        hyperlinks = parser.get_hyperlinks()
        if not hyperlinks:
            print("No hyperlinks found")
            return

        # Group by domain
        by_domain = defaultdict(list)
        for hyperlink in hyperlinks:
            try:
                parsed_url = urlparse(hyperlink.url)
                domain = parsed_url.netloc or "internal"
                by_domain[domain].append(hyperlink)
            except Exception:
                by_domain["invalid"].append(hyperlink)

        # Print statistics
        print(f"Total hyperlinks: {sum(len(links) for links in by_domain.values())}")
        print(f"Unique domains: {len(by_domain)}")

        # Sort by count
        sorted_domains = sorted(by_domain.items(), key=lambda x: len(x[1]), reverse=True)

        print(f"{'Domain':<40}{'Count':<10}")
        print("-" * 50)
        for domain, links in sorted_domains[:20]:  # Top 20
            print(f"{domain:<40}{len(links):<10}")

# Usage
analyze_hyperlinks_by_domain("article.html")
The following sample file is used in this example: article.html
Expected behavior: Provides statistical analysis of hyperlinks grouped by domain.
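A related analysis is spotting duplicate links. The sketch below counts how often each exact URL appears, using collections.Counter and the same extraction call:

from groupdocs.parser import Parser
from collections import Counter

# Count how often each exact URL appears to spot duplicate links
with Parser("article.html") as parser:
    if parser.features.hyperlinks:
        hyperlinks = parser.get_hyperlinks()
        if hyperlinks:
            counts = Counter(h.url for h in hyperlinks if h.url)
            print("Most frequent URLs:")
            for url, count in counts.most_common(10):
                print(f"{count:>4}  {url}")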
Extract hyperlinks with filtering
To extract only specific types of hyperlinks:
from groupdocs.parser import Parser
import re

def extract_filtered_hyperlinks(file_path, url_pattern=None, text_pattern=None):
    """
    Extract hyperlinks with optional filtering by URL or text pattern.
    """
    with Parser(file_path) as parser:
        if not parser.features.hyperlinks:
            print("Hyperlink extraction not supported")
            return []

        hyperlinks = parser.get_hyperlinks()
        if not hyperlinks:
            return []

        filtered_links = []
        for hyperlink in hyperlinks:
            # Apply URL pattern filter (treat a missing URL as an empty string)
            if url_pattern and not re.search(url_pattern, hyperlink.url or '', re.IGNORECASE):
                continue
            # Apply text pattern filter
            if text_pattern and hyperlink.text and not re.search(text_pattern, hyperlink.text, re.IGNORECASE):
                continue
            filtered_links.append(hyperlink)

        return filtered_links

# Usage examples
# Extract only PDF links
pdf_links = extract_filtered_hyperlinks("document.html", url_pattern=r'\.pdf$')
print(f"Found {len(pdf_links)} PDF links")

# Extract links containing "download"
download_links = extract_filtered_hyperlinks("document.html", text_pattern=r'download')
print(f"Found {len(download_links)} download links")

# Extract external links (http/https)
external_links = extract_filtered_hyperlinks("document.html", url_pattern=r'^https?://')
print(f"Found {len(external_links)} external links")
The following sample file is used in this example: document.html
Expected behavior: Returns only hyperlinks matching the specified criteria.
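The same helper can isolate other link types. For example, e-mail links can be collected by filtering on the mailto: scheme (this only matches hyperlinks whose target actually uses mailto:):

# Reusing extract_filtered_hyperlinks from above to collect e-mail links
email_links = extract_filtered_hyperlinks("document.html", url_pattern=r'^mailto:')
print(f"Found {len(email_links)} e-mail links")
for link in email_links:
    print(f"  {link.text} -> {link.url}")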
Batch hyperlink extraction
To extract hyperlinks from multiple documents:
from groupdocs.parser import Parser
from pathlib import Path
import json

def batch_extract_hyperlinks(input_dir, output_json):
    """
    Extract hyperlinks from all documents in a directory.
    """
    extensions = ['.pdf', '.docx', '.doc', '.html', '.htm']
    all_hyperlinks = {}

    for file_path in Path(input_dir).rglob('*'):
        if file_path.suffix.lower() in extensions:
            print(f"Processing: {file_path.name}")
            try:
                with Parser(str(file_path)) as parser:
                    if not parser.features.hyperlinks:
                        continue

                    hyperlinks = parser.get_hyperlinks()
                    if hyperlinks:
                        link_list = []
                        for link in hyperlinks:
                            link_list.append({
                                'text': link.text or '',
                                'url': link.url or '',
                                'page': link.page.index + 1
                            })
                        all_hyperlinks[file_path.name] = {
                            'count': len(link_list),
                            'links': link_list
                        }
            except Exception as e:
                print(f"  Error: {e}")

    # Save to JSON
    with open(output_json, 'w', encoding='utf-8') as f:
        json.dump(all_hyperlinks, f, indent=2, ensure_ascii=False)

    total_links = sum(doc['count'] for doc in all_hyperlinks.values())
    print(f"Total documents: {len(all_hyperlinks)}")
    print(f"Total hyperlinks: {total_links}")
    print(f"Saved to {output_json}")

# Usage
batch_extract_hyperlinks("documents", "all_hyperlinks.json")