Remove URLs from string in Python
A regular expression (regex) is a sequence of characters that defines a search pattern in text. To remove URLs from a string in Python, you can match them with regular expressions through the built-in re module, or split the text into words and test each word with urllib.parse, which is also part of the standard library. In this article, we will see how to remove URLs from a string with each approach.
Python Remove URLs from a String
- Using the re.sub() function
- Using the re.findall() function
- Using the re.search() function
- Using the urllib.parse module
Python Remove URLs from String Using re.sub() function
In this example, the code defines a function ‘remove_urls’ to find URLs in text and replace them with a placeholder [URL REMOVED], using regular expressions for pattern matching and the re.sub() method for substitution.
Python3
import re

def remove_urls(text, replacement_text="[URL REMOVED]"):
    # Define a regex pattern to match URLs
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    # Use the sub() method to replace URLs with the specified replacement text
    text_without_urls = url_pattern.sub(replacement_text, text)
    return text_without_urls

# Example:
input_text = "Visit on w3wiki Website: https://www.w3wiki.net/"
output_text = remove_urls(input_text)
print("Original Text:")
print(input_text)
print("\nText with URLs Removed:")
print(output_text)
Output:
Original Text:
Visit on w3wiki Website: https://www.w3wiki.net/
Text with URLs Removed:
Visit on w3wiki Website: [URL REMOVED]
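One thing to watch with this pattern: `\S+` runs to the next whitespace, so a full stop or comma that directly follows a URL is removed along with it. The sketch below is one possible workaround, not part of the original example. The function name `remove_urls_keep_punctuation` and the `TRAILING_PUNCTUATION` set are assumptions made for illustration, and the heuristic will also strip a closing bracket that genuinely belongs to a URL.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

# Assumed set of characters that usually belong to the sentence, not the URL.
TRAILING_PUNCTUATION = '.,;:!?)]}\'"'

def remove_urls_keep_punctuation(text, replacement_text="[URL REMOVED]"):
    def replace(match):
        url = match.group(0)
        # Split the match into the URL proper and any trailing punctuation.
        stripped = url.rstrip(TRAILING_PUNCTUATION)
        trailing = url[len(stripped):]
        return replacement_text + trailing

    # re.sub() accepts a function that builds the replacement for each match.
    return URL_PATTERN.sub(replace, text)

print(remove_urls_keep_punctuation("Read https://www.w3wiki.net/. Then practice!"))
```

Passing a function to sub() keeps the single-pass behaviour of the original example while letting each match decide what to put back.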
Remove URLs from String Using re.findall() function
In this example, the Python code defines a function ‘remove_urls_findall’ that uses regular expressions to find all URLs using re.findall() method in a given text and replaces them with a replacement text “[URL REMOVED]”.
Python3
import re

def remove_urls_findall(text, replacement_text="[URL REMOVED]"):
    # Compile the URL pattern and collect every match as a list of strings
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    urls = url_pattern.findall(text)
    # Replace each URL that was found with the replacement text
    for url in urls:
        text = text.replace(url, replacement_text)
    return text

# Example:
input_text = "Check out the latest Python tutorials on w3wiki: https://www.w3wiki.net/category/python/"
output_text_findall = remove_urls_findall(input_text)
print("\nUsing re.findall():")
print("Original Text:")
print(input_text)
print("\nText with URLs Removed:")
print(output_text_findall)
Output:
Using re.findall():
Original Text:
Check out the latest Python tutorials on w3wiki: https://www.w3wiki.net/category/python/
Text with URLs Removed:
Check out the latest Python tutorials on w3wiki: [URL REMOVED]
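Because this approach replaces each URL with str.replace(), a short URL that is a prefix of a longer one can be replaced first and leave part of the longer URL behind. A minimal sketch of one way around that, assuming it is acceptable to process the longest matches first, is shown here. The function name `remove_urls_findall_longest_first` and the sample text are hypothetical.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def remove_urls_findall_longest_first(text, replacement_text="[URL REMOVED]"):
    # Deduplicate the matches, then replace the longest URLs first so that
    # a shorter URL can never cut a longer one in half.
    urls = set(URL_PATTERN.findall(text))
    for url in sorted(urls, key=len, reverse=True):
        text = text.replace(url, replacement_text)
    return text

sample = "Home: https://www.w3wiki.net and tutorials: https://www.w3wiki.net/category/python/"
print(remove_urls_findall_longest_first(sample))
```

When the URLs themselves are not needed afterwards, re.sub() avoids the problem entirely because it replaces each match at its own position.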
Remove URLs from String in Python Using re.search() function
In this example, the Python code defines a function ‘remove_urls_search’ using regular expressions and re.search() to find and replace URLs in a given text with a replacement text “[URL REMOVED]”.
Python3
import re

def remove_urls_search(text, replacement_text="[URL REMOVED]"):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    # Keep searching until no URL is left in the text
    while True:
        match = url_pattern.search(text)
        if not match:
            break
        # Rebuild the text around the match using its start and end positions
        text = text[:match.start()] + replacement_text + text[match.end():]
    return text

# Example:
input_text = "Visit our website at https://w3wiki.net/ for more information. Follow us on Twitter: @w3wiki"
output_text_search = remove_urls_search(input_text)
print("\nUsing re.search():")
print("Original Text:")
print(input_text)
print("\nText with URLs Removed:")
print(output_text_search)
Output:
Using re.search():
Original Text:
Visit our website at https://w3wiki.net/ for more information. Follow us on Twitter: @w3wiki
Text with URLs Removed:
Visit our website at [URL REMOVED] for more information. Follow us on Twitter: @w3wiki
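The loop above searches from the start of the string on every pass, so it would never finish if the replacement text itself matched the URL pattern (for example a replacement such as "http://redacted"). The following sketch shows one way to guard against that by resuming the search just after the inserted replacement. It relies on the optional pos argument of a compiled pattern's search() method. The function name `remove_urls_search_from_pos` is an assumption for illustration.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def remove_urls_search_from_pos(text, replacement_text="[URL REMOVED]"):
    pos = 0
    while True:
        # Start each search where the previous replacement ended.
        match = URL_PATTERN.search(text, pos)
        if not match:
            break
        text = text[:match.start()] + replacement_text + text[match.end():]
        # Skip past the replacement so it is never matched again.
        pos = match.start() + len(replacement_text)
    return text

print(remove_urls_search_from_pos("Go to https://w3wiki.net/ now", "http://redacted"))
```

This also avoids rescanning text that has already been checked.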
Remove URLs from String Using urllib.parse
In this example, the Python code defines a function ‘remove_urls_urllib’ that uses urllib.parse to check and replace URLs in a given text with a replacement text “[URL REMOVED]”.
Python3
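# Using urllib.parse
from urllib.parse import urlparse

def remove_urls_urllib(text, replacement_text="[URL REMOVED]"):
    # Split on whitespace and test each word separately
    words = text.split()
    for i, word in enumerate(words):
        parsed_url = urlparse(word)
        # Treat a word as a URL only if it has both a scheme and a network location
        if parsed_url.scheme and parsed_url.netloc:
            words[i] = replacement_text
    return ' '.join(words)

# Example:
# The input below is reconstructed from the output shown; the exact URL is an assumption
input_text = "Check out the w3wiki website at https://www.w3wiki.net/ for programming tutorials."
output_text_urllib = remove_urls_urllib(input_text)
print("Using urllib.parse:")
print("Text with URLs Removed:")
print(output_text_urllib)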
Output:
Using urllib.parse:
Text with URLs Removed:
Check out the w3wiki website at [URL REMOVED] for programming tutorials.
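To see why the function checks both scheme and netloc, it can help to look at what urlparse() returns for a few individual words. The short sketch below is illustrative only. The helper name `looks_like_url` and the sample words are assumptions, not part of the original example.

```python
from urllib.parse import urlparse

def looks_like_url(word):
    # Same test as in remove_urls_urllib: both parts must be non-empty.
    parsed = urlparse(word)
    return bool(parsed.scheme and parsed.netloc)

for word in ["https://www.w3wiki.net/category/python/", "www.w3wiki.net", "Website:"]:
    parsed = urlparse(word)
    print(f"{word!r}: scheme={parsed.scheme!r}, netloc={parsed.netloc!r}, "
          f"treated as URL: {looks_like_url(word)}")
```

A bare address such as www.w3wiki.net has no scheme, so this approach leaves it in place, whereas the regex pattern used in the earlier sections removes it through its `www\.\S+` branch.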