Python – Filtering text using Enchant

Enchant is a Python module (distributed as PyEnchant) that checks the spelling of a word and suggests corrections for misspelled words. It can also tell whether a word exists in a dictionary.

Enchant also provides the enchant.tokenize module to tokenize text. Tokenizing means splitting a body of text into individual words. Not every part of the text should always be tokenized, though: when spell checking, it is customary to ignore email addresses and URLs. This can be achieved by modifying the tokenization process with filters.
The filters covered here are:

  • EmailFilter
  • URLFilter
  • WikiWordFilter
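
To make the idea of a filter concrete, here is a minimal pure-Python sketch of what filtering does conceptually: the text is split on whitespace, any chunk that looks like an email address or a URL is dropped, and the remaining chunks are split into words with their offsets. The regular expressions, the sketch_tokenize helper and the sample sentence are simplified assumptions made for illustration; they are not Enchant's own implementation.

```python
import re

# Simplified stand-in patterns (assumptions, not Enchant's own regexes).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[a-z]{2,}$", re.IGNORECASE)
URL_RE = re.compile(r"^[a-zA-Z]+://\S+$")
WORD_RE = re.compile(r"[A-Za-z']+")


def sketch_tokenize(text, skip_patterns=()):
    """Yield (word, offset) pairs, skipping whitespace-delimited chunks
    that match any pattern in skip_patterns."""
    for chunk in re.finditer(r"\S+", text):
        if any(p.match(chunk.group()) for p in skip_patterns):
            continue  # the whole chunk is dropped before word splitting
        for word in WORD_RE.finditer(chunk.group()):
            yield word.group(), chunk.start() + word.start()


# Hypothetical sample text with placeholder addresses.
sample = "Mail user@example.com or see https://example.org/docs today"

unfiltered = list(sketch_tokenize(sample))
filtered = list(sketch_tokenize(sample, skip_patterns=(EMAIL_RE, URL_RE)))

print(unfiltered)
print(filtered)
```

Without the patterns, the address and the URL are broken into pieces such as "user", "example" and "https"; with them, only the ordinary words remain, and each word keeps its offset in the original text.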

Example 1 : EmailFilter

# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import EmailFilter

# the text to be tokenized
text = "The email is"

# get the tokenizer for US English
tokenizer = get_tokenizer("en_US")

# print tokens without filtering
print("Printing tokens without filtering:")
token_list = []
for token in tokenizer(text):
    token_list.append(token)
print(token_list)

# get the tokenizer with the email filter applied
tokenizer_filter = get_tokenizer("en_US", filters=[EmailFilter])

# print tokens after filtering
print("\nPrinting tokens after filtering:")
token_list_filter = []
for token in tokenizer_filter(text):
    token_list_filter.append(token)
print(token_list_filter)

Output :

Printing tokens without filtering:
[('The', 0), ('email', 4), ('is', 10), ('abc', 13), ('gmail', 17), ('com', 23)]

Printing tokens after filtering:
[('The', 0), ('email', 4), ('is', 10)]

Example 2 : URLFilter

# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import URLFilter

# the text to be tokenized
text = "This is an URL:"

# get the tokenizer for US English
tokenizer = get_tokenizer("en_US")

# print tokens without filtering
print("Printing tokens without filtering:")
token_list = []
for token in tokenizer(text):
    token_list.append(token)
print(token_list)

# get the tokenizer with the URL filter applied
tokenizer_filter = get_tokenizer("en_US", filters=[URLFilter])

# print tokens after filtering
print("\nPrinting tokens after filtering:")
token_list_filter = []
for token in tokenizer_filter(text):
    token_list_filter.append(token)
print(token_list_filter)

Output :

Printing tokens without filtering:
[('This', 0), ('is', 5), ('an', 8), ('URL', 11), ('https', 16), ('www', 24), ('w3wiki', 28), ('org', 42)]

Printing tokens after filtering:
[('This', 0), ('is', 5), ('an', 8), ('URL', 11)]

Example 3 : WikiWordFilter
A WikiWord is a word which consists of two or more words with initial capitals, run together.
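
The definition above can be approximated with a regular expression. The sketch below is an illustration only: the pattern and the is_wikiword helper are assumptions chosen to mirror the definition (a capitalized word part, repeated at least twice), and may differ from the pattern Enchant uses internally.

```python
import re

# Assumed approximation of the definition: two or more capitalized parts
# (an uppercase letter followed by lowercase letters or digits), run together.
WIKIWORD_RE = re.compile(r"^(?:[A-Z][a-z0-9]+){2,}$")


def is_wikiword(word):
    """Return True if the word looks like a WikiWord under the assumed pattern."""
    return bool(WIKIWORD_RE.match(word))


for candidate in ["VersionFiveDotThree", "WikiWord", "example", "Python", "NASA"]:
    print(candidate, is_wikiword(candidate))
```

Under this pattern, "VersionFiveDotThree" and "WikiWord" qualify, while a single capitalized word such as "Python" or an all-caps word such as "NASA" does not.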

# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import WikiWordFilter

# the text to be tokenized
text = "VersionFiveDotThree is an example of WikiWord"

# get the tokenizer for US English
tokenizer = get_tokenizer("en_US")

# print tokens without filtering
print("Printing tokens without filtering:")
token_list = []
for token in tokenizer(text):
    token_list.append(token)
print(token_list)

# get the tokenizer with the WikiWord filter applied
tokenizer_filter = get_tokenizer("en_US", filters=[WikiWordFilter])

# print tokens after filtering
print("\nPrinting tokens after filtering:")
token_list_filter = []
for token in tokenizer_filter(text):
    token_list_filter.append(token)
print(token_list_filter)

Output :

Printing tokens without filtering:
[('VersionFiveDotThree', 0), ('is', 20), ('an', 23), ('example', 26), ('of', 34), ('WikiWord', 37)]

Printing tokens after filtering:
[('is', 20), ('an', 23), ('example', 26), ('of', 34)]