If you're using Python to do things like web scraping, there will be times when you want to process a full URL and extract just some of its parts. These could include the protocol (http or https), the domain name, the subdomain, or just the request path.
Python's urllib module handles all things URL. You can dissect and process a URL using the urlparse function in the urllib.parse submodule, which splits the URL into its scheme (http or https), netloc (subdomain, domain, and TLD), and path.
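Besides scheme, netloc, and path, the result also carries params, query, and fragment fields when they are present in the URL. Here is a minimal sketch of that, using a made-up URL that includes a query string and a fragment:

#!/usr/bin/env python3
from urllib.parse import urlparse

# Hypothetical URL chosen to exercise the query and fragment fields.
parts = urlparse('https://docs.example.com/guide/page.html?lang=en#intro')

print(parts.scheme)    # https
print(parts.netloc)    # docs.example.com
print(parts.path)      # /guide/page.html
print(parts.query)     # lang=en
print(parts.fragment)  # intro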
Start the Python shell.

$ ipython3
Python 3.8.2 (default, Apr 27 2020, 15:53:34)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.13.0 -- An enhanced Interactive Python. Type '?' for help.
Import the urllib.parse module.

In [1]: import urllib.parse
Parse a URL using the urlparse function from the urllib.parse module.

In [2]: parsed_url = urllib.parse.urlparse('https://www.example.com/page.html')
Print the parsed URL output.

In [3]: print(parsed_url)
ParseResult(scheme='https', netloc='www.example.com', path='/page.html', params='', query='', fragment='')
Print just the host name using the netloc attribute.

In [4]: print(parsed_url.netloc)
www.example.com
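The parsed result also exposes convenience attributes such as hostname and port, and urllib.parse provides parse_qs for decoding a query string into a dictionary. A minimal sketch, using a hypothetical URL that includes a port and a query string:

#!/usr/bin/env python3
from urllib.parse import urlparse, parse_qs

# Hypothetical URL used only to illustrate the extra attributes.
parsed = urlparse('https://shop.example.com:8443/search?q=shoes&page=2')

print(parsed.hostname)         # shop.example.com (netloc without the port)
print(parsed.port)             # 8443
print(parse_qs(parsed.query))  # {'q': ['shoes'], 'page': ['2']}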
Here is a simple Python script that accepts a URL as a parameter and prints the corresponding parsed URL along with the host name.

#!/usr/bin/env python3
import urllib.parse
import sys

# Read the URL from the first command-line argument and parse it.
url = sys.argv[1]
parsed_url = urllib.parse.urlparse(url)
print(parsed_url)
print("Host name: ", parsed_url.netloc)
Run the script with a URL as a parameter.

$ python3 get-host-name-from-url.py https://www.example.com/page.html
ParseResult(scheme='https', netloc='www.example.com', path='/page.html', params='', query='', fragment='')
Host name:  www.example.com
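One caveat when feeding arbitrary input to a script like this: urlparse only recognizes the host part when the URL includes a scheme (or starts with //). Otherwise the whole string lands in path and netloc comes back empty. A small sketch of the behavior and a common workaround, using a hypothetical schemeless input:

#!/usr/bin/env python3
from urllib.parse import urlparse

# Without a scheme, everything ends up in .path and .netloc is empty.
print(urlparse('www.example.com/page.html').netloc)   # ''

# Workaround: treat the input as scheme-relative and supply a default scheme.
print(urlparse('//www.example.com/page.html', scheme='https').netloc)  # www.example.com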