There are various situations in which you might need to extract the domain from a URL while working with web data in Python.
For tasks such as web scraping, data analysis, and website security, obtaining the domain can be crucial.
This article covers three techniques for extracting the domain from a URL in Python: the urlparse function, regular expressions (regex), and the tldextract module.
Advertising links are marked with *. We receive a small commission on sales, nothing changes for you.
Understanding URLs and Domains in Python
URLs, which are made up of multiple components such as the scheme, host, and path, are used to access web pages and other online resources.
The domain is an important aspect of a URL and is commonly known as the website address that you enter into the address bar of your browser.
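To make these components concrete, here is a minimal sketch using Python's standard library that splits an example URL into its parts:

```python
from urllib.parse import urlparse

parsed = urlparse("https://www.example.com/path/to/page.html?q=1")
print(parsed.scheme)  # "https"
print(parsed.netloc)  # "www.example.com" (the domain)
print(parsed.path)    # "/path/to/page.html"
```

The netloc attribute holds the host portion, which is the part this article focuses on.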
While constructing websites with Python, it is critical to extract the domain from the URL. The extraction of the domain is essential for tasks such as web scraping, data analysis, and security.
Extracting domains from URLs helps verify that the data collected during web scraping projects comes from a trustworthy source.
Furthermore, extracting domains from URLs aids in the detection and prevention of phishing attacks.
We will describe these concepts in plain terms throughout this text, avoiding technical jargon as much as possible.
Our goal is to provide a straightforward, understandable explanation of how to extract domains from URLs using Python.
Tip: Find out if your URL is valid with Python (blog post).
3 Popular Python Techniques for Extracting Domains from URLs
As a Python web developer, you might frequently need to extract the domain from a given URL. Fortunately, Python offers several ways to do this.
This section covers three popular techniques for extracting domains from URLs in Python, along with practical examples.
#1: Using the urlparse Module
One simple method for extracting the domain from a URL is the urlparse function from Python's built-in urllib.parse module. This module offers several functions for breaking URLs down into their constituent parts, including the domain.
Here is an example that uses urlparse to extract the domain from a URL:
```python
from urllib.parse import urlparse

url = "https://www.example.com/path/to/page.html"
parsed_url = urlparse(url)
domain = parsed_url.netloc
print(domain)  # Output: "www.example.com"
```
The code above first imports the urlparse function from the urllib.parse module.
It then defines the URL from which we want to extract the domain.
The urlparse function breaks the URL down into its component parts, and the domain is read from the netloc attribute.
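One detail worth knowing: netloc keeps any port number or credentials that appear in the URL, while the hostname attribute returns just the lowercased host. A minimal sketch:

```python
from urllib.parse import urlparse

url = "https://user:pass@www.example.com:8080/path"
parsed = urlparse(url)
print(parsed.netloc)    # "user:pass@www.example.com:8080"
print(parsed.hostname)  # "www.example.com" (no port or credentials)
```

If you only want the bare host, hostname is usually the safer choice.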
#2: Using Regular Expressions
If you prefer regular expressions (regex), you can extract the domain from a URL by matching it against a specified pattern.
This method can be quite helpful if you need to extract the domain from a URL that doesn't follow a conventional format.
Here is an example:
```python
import re

url = "https://www.example.com/path/to/page.html"
pattern = r"https?://(?P<domain>[\w\-]+(?:\.[\w\-]+)+)"
match = re.search(pattern, url)
domain = match.group("domain")
print(domain)  # Output: "www.example.com"
```
The code above first imports the re module, which provides regular-expression functionality.
We then specify the URL from which we want to extract the domain and define a regex pattern that recognizes domain names. We locate the pattern in the URL with the re.search function and extract the domain via the named group (?P<domain>...).
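Keep in mind that re.search returns None when nothing matches, so guarding the call avoids an AttributeError on malformed input. A small sketch (the helper name extract_domain is our own):

```python
import re

def extract_domain(url):
    # Hypothetical helper: returns the domain, or None if no match is found
    pattern = r"https?://(?P<domain>[\w\-]+(?:\.[\w\-]+)+)"
    match = re.search(pattern, url)
    return match.group("domain") if match else None

print(extract_domain("https://www.example.com/page.html"))  # www.example.com
print(extract_domain("not a url"))                          # None
```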
#3: Utilizing the tldextract Module
The tldextract module can also be used to extract the domain from a URL. This third-party library is particularly helpful because it can separate a URL's top-level domain (TLD), domain name, and subdomains.
Here is an example of some code that uses tldextract to extract the domain from a URL:
```python
import tldextract

url = "https://www.example.com/path/to/page.html"
extracted = tldextract.extract(url)
domain = extracted.domain
print(domain)  # Output: "example"
```
The code above first imports the tldextract module, whose extract function splits a URL into its domain components.
We specify the URL from which we want to extract the domain and call tldextract.extract on it. The domain name is then read from the extracted.domain attribute.
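What makes tldextract useful is that it consults the Public Suffix List rather than guessing. As a rough illustration of why that matters, here is a naive "second-to-last label" split (a sketch only, not a replacement for tldextract) that mis-handles multi-part suffixes like .co.uk:

```python
from urllib.parse import urlparse

def naive_domain(url):
    # Naive sketch: assume the second-to-last dot-separated label is the domain
    host = urlparse(url).hostname
    labels = host.split(".")
    return labels[-2] if len(labels) >= 2 else host

print(naive_domain("https://www.example.com/"))    # "example" -- correct
print(naive_domain("https://www.example.co.uk/"))  # "co" -- wrong
```

For the .co.uk case, tldextract correctly reports "example" as the domain because it knows "co.uk" is a suffix.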
Examples and Use Cases in Real Life
This section examines real-world applications and use cases for domain extraction from URLs in Python. We'll cover the following:
- Web scraping
- Data analysis
- Security applications
Each use case has its own subsection with useful code samples and explanations.
#1: Web Scraping
Web scraping is one of the most popular uses for extracting domains from URLs in Python. It's crucial to confirm that the data being scraped from the web comes from a reliable source.
By extracting the domain from the URL, you can quickly determine whether it is valid. You can also use the domain information to exclude illegitimate domains and only scrape data from reputable sources.
Here is some sample web-scraping code that extracts the domain from a list of URLs:
```python
from urllib.parse import urlparse

# list of URLs to scrape
urls = [
    "https://www.example.com/page1.html",
    "https://www.example.com/page2.html",
    "https://www.notlegit.com/page3.html",
]

# extract domains from URLs
domains = []
for url in urls:
    parsed_url = urlparse(url)
    domain = parsed_url.netloc
    domains.append(domain)

# filter out non-legit domains
legit_domains = []
for domain in domains:
    if "example.com" in domain:
        legit_domains.append(domain)

# scrape only from legit domains
for url in urls:
    if urlparse(url).netloc in legit_domains:
        # scrape data from this URL
        pass
```
The code above uses the urlparse function to extract the domain names from the list of URLs to scrape. It then excludes invalid domains and only scrapes information from URLs with valid domains.
#2: Data Analysis
Organizing data by domain is frequently helpful when analyzing web data. For instance, you might want to examine online traffic by domain to identify the most popular domains.
Example:
```python
from urllib.parse import urlparse

data = [
    {"url": "https://www.example.com/page1.html", "views": 100},
    {"url": "https://www.example.com/page2.html", "views": 200},
    {"url": "https://www.notlegit.com/page3.html", "views": 50},
    {"url": "https://www.otherexample.com/page4.html", "views": 300},
]

# group data by domain
domains = {}
for d in data:
    domain = urlparse(d["url"]).netloc
    if domain not in domains:
        domains[domain] = {"views": 0, "pages": []}
    domains[domain]["views"] += d["views"]
    domains[domain]["pages"].append(d["url"])

# print results
for domain, stats in domains.items():
    print(domain, stats["views"], stats["pages"])
```
The code above defines a list of data objects, each of which has a URL and a view count.
We extract the domain from each URL using the urlparse function and group the data by domain. The total views and pages for each domain are then printed.
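For a simple sum of views per domain, the grouping can also be written more compactly with collections.defaultdict; a sketch using the same assumed data shape:

```python
from collections import defaultdict
from urllib.parse import urlparse

data = [
    {"url": "https://www.example.com/page1.html", "views": 100},
    {"url": "https://www.example.com/page2.html", "views": 200},
]

# accumulate view counts keyed by domain
views_by_domain = defaultdict(int)
for item in data:
    views_by_domain[urlparse(item["url"]).netloc] += item["views"]

print(dict(views_by_domain))  # {'www.example.com': 300}
```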
#3: Security Applications
Security applications can also benefit from domain extraction from URLs. One such application is determining whether the domain in a URL is real or not in order to identify phishing attacks.
Below is an example of code that shows how to determine whether a URL is trustworthy or not:
```python
from urllib.parse import urlparse

def is_legit_url(url):
    # extract domain from URL
    domain = urlparse(url).netloc
    # check if domain is legitimate
    legit_domains = ["example.com", "google.com", "amazon.com"]
    return domain in legit_domains
```
The code above defines a function that accepts a URL as input and uses the urlparse function to retrieve the domain.
The domain is then compared against a list of legitimate domains. The function returns True if the domain is on the list and False otherwise.
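Note that urlparse(url).netloc for "https://www.example.com/" is "www.example.com", which would not pass an exact membership test against "example.com". If subdomains of trusted sites should also count as legitimate, a suffix check handles that; a sketch:

```python
from urllib.parse import urlparse

LEGIT_DOMAINS = ["example.com", "google.com", "amazon.com"]

def is_legit_url(url):
    host = urlparse(url).hostname or ""
    # accept the domain itself or any subdomain of it
    return any(host == d or host.endswith("." + d) for d in LEGIT_DOMAINS)

print(is_legit_url("https://www.example.com/login"))  # True
print(is_legit_url("https://evil-example.com/"))      # False
```

The endswith check with a leading dot is important: a plain substring test would wrongly accept look-alike hosts such as "evil-example.com".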
The detection and prevention of cross-site scripting (XSS) attacks is another security use of domain extraction from URLs. By extracting the domain from a URL and comparing it with the domain of the currently viewed page, you can ensure that scripts are only run on valid pages.
Here is some sample code that illustrates how to stop XSS attacks:
```python
from flask import Flask, request
from urllib.parse import urlparse

app = Flask(__name__)

@app.route("/submit", methods=["POST"])
def submit():
    url = request.form.get("url")
    domain = urlparse(url).netloc
    # check if domain is the same as current page
    current_domain = urlparse(request.referrer).netloc
    if domain != current_domain:
        return "Error: Cross-site scripting (XSS) attack detected!"
    # submit data
    return "OK"
```
The code above defines a Flask app with a route that accepts data submissions via POST requests.
Using the urlparse function, we extract the domain from the submitted URL and compare it to the domain of the currently displayed page, obtained from the referrer attribute of the request object.
If the domains do not match, we return an error message to block a possible XSS attack.
Conclusion
In this article, we looked at several Python techniques for extracting domains from URLs.
We've discussed how crucial it is to be able to determine the domain from a URL for applications like web scraping, data analysis, and security. We've examined the urlparse function, regex, and the tldextract module as three popular approaches for extracting domains from URLs in Python.
For each technique, we've also included real-world examples and use cases, such as spotting phishing scams and preventing cross-site scripting (XSS) attacks.
By implementing these methods in your Python programs, you can extract domains from URLs and use them for a variety of tasks.
In conclusion, web developers and data analysts can benefit from learning how to extract domains from URLs in Python.
You can now increase the effectiveness and security of your Python projects by applying the techniques presented in this article 😎.