Cleaning Categorical Data in Python

To illustrate this problem, we create a new DataFrame with a single feature: phone numbers.

Python3




import random

import pandas as pd

phone_numbers = []

for i in range(100):
    # phone numbers can have 9 or 10 digits
    number = random.randint(100000000, 9999999999)

    # the +91 country code is prepended to every other number
    if i % 2 == 0:
        phone_numbers.append('+91 ' + str(number))
    else:
        phone_numbers.append(str(number))

phone_numbers_data = pd.DataFrame({
    'phone_numbers': phone_numbers
})

phone_numbers_data.head()



Depending on the use case, the country code in front of the numbers can either be dropped everywhere or added where it is missing. Similarly, phone numbers with fewer than 10 digits should be discarded.

Python3




# remove the literal '+91 ' prefix (regex=False treats '+' as a plain character)
phone_numbers_data['phone_numbers'] = phone_numbers_data['phone_numbers']\
    .str.replace('+91 ', '', regex=False)

# drop the rows whose number has fewer than 10 digits
num_digits = phone_numbers_data['phone_numbers'].str.len()
invalid_numbers_index = phone_numbers_data[num_digits < 10].index
phone_numbers_data = phone_numbers_data.drop(invalid_numbers_index)

phone_numbers_data.head()


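The other option mentioned above, adding the code where it is missing instead of dropping it, could look roughly like the following. This is a minimal sketch on a small hand-written sample; the names `sample`, `digits`, `valid` and `standardized` are hypothetical and not part of the original example.

```python
import pandas as pd

# Hypothetical sample: some numbers carry the prefix, some are too short.
sample = pd.DataFrame({
    'phone_numbers': ['+91 9876543210', '9123456780', '+91 812345678', '701234567']
})

# Strip the prefix first so every value is compared on its digits alone.
digits = sample['phone_numbers'].str.replace('+91 ', '', regex=False)

# Keep only the 10-digit numbers, then add the prefix back uniformly.
valid = digits[digits.str.len() == 10]
standardized = pd.DataFrame({'phone_numbers': '+91 ' + valid})

print(standardized)
```

Stripping the prefix before measuring the length keeps the length check independent of whether a given row originally had the code.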

Finally, we can verify whether the data is clean or not.

Python3




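# no number may still contain the '+91 ' prefix
assert not phone_numbers_data['phone_numbers'].str.contains('+91 ', regex=False).any()
# every remaining number must be exactly 10 characters long
assert (phone_numbers_data['phone_numbers'].str.len() == 10).all()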
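Both checks can also be folded into a single pattern match, which additionally rejects non-digit characters. The sketch below assumes that a clean number is exactly ten ASCII digits; the helper `is_clean` and the two sample Series are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical samples: one already cleaned, one still containing bad entries.
clean = pd.Series(['9876543210', '9123456780'])
dirty = pd.Series(['9876543210', '+91 9123456780', '812345678'])


def is_clean(numbers: pd.Series) -> bool:
    """Return True only if every entry consists of exactly ten digits."""
    return bool(numbers.str.fullmatch(r'[0-9]{10}').all())


print(is_clean(clean))
print(is_clean(dirty))
```

A check like this could replace the two assertions when stray characters other than the prefix are also a concern.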


Handling Categorical Data in Python

Categorical data is data whose values come from a set of predefined categories or groups that an observation can fall into. It is found everywhere, for instance in survey responses such as marital status, profession, and educational qualifications. However, categorical data can pose certain problems that must be dealt with before proceeding with any other task. This article looks at some of those problems and at various methods for handling categorical data in a DataFrame.

As mentioned earlier, categorical data can take only a finite set of values. However, because of human error while filling out a survey form, or for any other reason, some bogus values may end up in the dataset.
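One way to spot such bogus values is to compare a column against its set of allowed categories. The following is a minimal sketch: the `marital_status` column, its values, and the `valid_categories` set are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical survey column containing one misspelled entry.
survey = pd.DataFrame({
    'marital_status': ['married', 'unmarried', 'divorced', 'maried', 'unmarried']
})

# Hypothetical set of predefined categories.
valid_categories = {'married', 'unmarried', 'divorced', 'separated'}

# Values outside the predefined set are treated as bogus.
is_valid = survey['marital_status'].isin(valid_categories)
bogus_values = set(survey.loc[~is_valid, 'marital_status'])

# Keep only the rows whose value is a known category.
cleaned_survey = survey[is_valid]

print(bogus_values)
print(len(cleaned_survey))
```

Whether the offending rows are dropped, as here, or corrected by hand depends on how many there are and how obvious the intended value is.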
