What is Duplicate Data?

Sometimes a dataset contains rows or columns that are identical, or nearly identical, to others; such records are known as duplicate data. Because of duplicates, the same thing can be counted two or more times, depending on how often a value is repeated. This distorts the results of an analysis, so dealing with duplicates is important. To understand this better, we will create a fictional dataset that stores the ID, name, age, and salary of employees. Duplicate values in such a dataset can cause serious confusion.

R

# Create a sample dataset with duplicate data
example_data <- data.frame(
  ID = c(1, 2, 3, 4, 5, 1, 6, 2),
  Name = c("Alice", "Bob", "Charlie", "David", "Eve", "Alice", "Frank", "Bob"),
  Age = c(25, 30, 35, 22, 28, 25, 40, 30),
  Salary = c(50000, 60000, 70000, 45000, 55000, 50000, 80000, 60000)
)
 
# Display the dataset with duplicate data
print("Dataset with Duplicate Data:")
print(example_data)

Output:

[1] "Dataset with Duplicate Data:"
  ID    Name Age Salary
1  1   Alice  25  50000
2  2     Bob  30  60000
3  3 Charlie  35  70000
4  4   David  22  45000
5  5     Eve  28  55000
6  1   Alice  25  50000
7  6   Frank  40  80000
8  2     Bob  30  60000

Here row 6 is a duplicate of row 1 and row 8 is a duplicate of row 2, so the same employees are recorded more than once.
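To see how duplicates distort a result, the sketch below compares a simple payroll total computed with and without the repeated rows. It rebuilds the same fictional dataset so that it runs on its own; the variable names (total_with_duplicates and so on) are illustrative choices, not part of any library.

```r
# Same fictional dataset as above, rebuilt so this snippet runs on its own
example_data <- data.frame(
  ID = c(1, 2, 3, 4, 5, 1, 6, 2),
  Name = c("Alice", "Bob", "Charlie", "David", "Eve", "Alice", "Frank", "Bob"),
  Age = c(25, 30, 35, 22, 28, 25, 40, 30),
  Salary = c(50000, 60000, 70000, 45000, 55000, 50000, 80000, 60000)
)

# Naive total: Alice and Bob are counted twice
total_with_duplicates <- sum(example_data$Salary)

# Total after dropping exact duplicate rows
total_without_duplicates <- sum(unique(example_data)$Salary)

# Number of rows versus number of distinct employee IDs
n_rows <- nrow(example_data)
n_employees <- length(unique(example_data$ID))

print(total_with_duplicates)
print(total_without_duplicates)
```

The naive total is 470000, while the total over distinct rows is 360000; the dataset has 8 rows but only 6 distinct employee IDs.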

Identify Duplicate Data

The example dataset is small, so the duplicates are easy to spot by reading through it. With a large dataset, checking every row by hand is impractical and time-consuming. The duplicated() function can find the repeated rows for us, as in the code below:

R

# Identify duplicate rows based on all columns
duplicates_all <- example_data[duplicated(example_data), ]
 
# Identify duplicate rows based on selected columns (e.g., ID and Name)
duplicates_selected <- example_data[duplicated(example_data[c("ID", "Name")]), ]
 
# Display duplicate rows
print("Duplicate Rows (All Columns):")
print(duplicates_all)
 
print("Duplicate Rows (Selected Columns):")
print(duplicates_selected)

Output:

[1] "Duplicate Rows (All Columns):"
  ID  Name Age Salary
6  1 Alice  25  50000
8  2   Bob  30  60000

[1] "Duplicate Rows (Selected Columns):"
  ID  Name Age Salary
6  1 Alice  25  50000
8  2   Bob  30  60000

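This code returns the duplicated rows in our dataset. Note that duplicated() marks only the second and later occurrences of a row, which is why rows 6 and 8 are listed but rows 1 and 2 are not.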
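When inspecting duplicates, it can help to see every copy of a repeated record, including the first one. A minimal sketch using base R follows: anyDuplicated() gives a quick yes/no style check, and combining duplicated() with its fromLast = TRUE variant flags all members of each duplicated group. The names first_dup, in_dup_group, and all_copies are illustrative.

```r
# Same fictional dataset as above, rebuilt so this snippet runs on its own
example_data <- data.frame(
  ID = c(1, 2, 3, 4, 5, 1, 6, 2),
  Name = c("Alice", "Bob", "Charlie", "David", "Eve", "Alice", "Frank", "Bob"),
  Age = c(25, 30, 35, 22, 28, 25, 40, 30),
  Salary = c(50000, 60000, 70000, 45000, 55000, 50000, 80000, 60000)
)

# Index of the first duplicated row, or 0 if the data has no duplicates
first_dup <- anyDuplicated(example_data)

# Number of rows that repeat an earlier row
n_duplicates <- sum(duplicated(example_data))

# Flag every row in a duplicated group: scanning from the top marks the
# later copies, scanning from the bottom marks the earlier copies
in_dup_group <- duplicated(example_data) | duplicated(example_data, fromLast = TRUE)
all_copies <- example_data[in_dup_group, ]

print(first_dup)
print(n_duplicates)
print(all_copies)
```

For this dataset, all_copies contains rows 1, 2, 6, and 8, so both copies of Alice and Bob can be compared side by side before deciding what to do with them.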

Dealing with Duplicate Data

There are several ways of dealing with duplicate data, such as deleting the duplicated rows or aggregating them. We will go through each approach using the employee example above, with its ID, name, age, and salary columns.

Deleting duplicate values

We can delete rows that appear two or more times in the dataset, keeping a single copy of each.

R

# Remove duplicate rows and create a new dataset
no_duplicates_data <- unique(example_data)
 
# Display the dataset after removing duplicates
print("Dataset after Removing Duplicates:")
print(no_duplicates_data)

Output:

[1] "Dataset after Removing Duplicates:"
  ID    Name Age Salary
1  1   Alice  25  50000
2  2     Bob  30  60000
3  3 Charlie  35  70000
4  4   David  22  45000
5  5     Eve  28  55000
7  6   Frank  40  80000
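Notice that the row labels in the output above jump from 5 to 7, because unique() keeps the original row names. If consecutive labels are preferred, one option is to reset them, as in this small sketch (the dataset is rebuilt so the snippet is self-contained):

```r
# Same fictional dataset as above, rebuilt so this snippet runs on its own
example_data <- data.frame(
  ID = c(1, 2, 3, 4, 5, 1, 6, 2),
  Name = c("Alice", "Bob", "Charlie", "David", "Eve", "Alice", "Frank", "Bob"),
  Age = c(25, 30, 35, 22, 28, 25, 40, 30),
  Salary = c(50000, 60000, 70000, 45000, 55000, 50000, 80000, 60000)
)

# Remove exact duplicate rows
no_duplicates_data <- unique(example_data)

# Reset row names so they run 1, 2, ..., n again
rownames(no_duplicates_data) <- NULL

print(no_duplicates_data)
```

Resetting the labels is optional; keeping the original ones can be useful when you want to trace a row back to its position in the raw data.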

Aggregating Duplicate Data

If the repeated rows represent values recorded for different periods, we can merge them into one row per group instead of deleting them. The code below does this:

R

# Aggregate data by summing Salary for each unique combination of ID, Name, and Age
aggregated_data <- aggregate(Salary ~ ID + Name + Age, data = example_data, sum)
 
# Display the aggregated dataset
print("Aggregated Dataset:")
print(aggregated_data)

Output:

[1] "Aggregated Dataset:"
  ID    Name Age Salary
1  4   David  22  45000
2  1   Alice  25 100000
3  5     Eve  28  55000
4  2     Bob  30 120000
5  3 Charlie  35  70000
6  6   Frank  40  80000
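Summing is only appropriate when the repeated rows are genuinely separate amounts. If they are instead the same record entered twice, the sum doubles the salary, as happens for Alice and Bob above. As a hedged alternative, the sketch below aggregates with mean and also counts how many records each employee has; the names mean_salary, record_count, and the Records column are illustrative.

```r
# Same fictional dataset as above, rebuilt so this snippet runs on its own
example_data <- data.frame(
  ID = c(1, 2, 3, 4, 5, 1, 6, 2),
  Name = c("Alice", "Bob", "Charlie", "David", "Eve", "Alice", "Frank", "Bob"),
  Age = c(25, 30, 35, 22, 28, 25, 40, 30),
  Salary = c(50000, 60000, 70000, 45000, 55000, 50000, 80000, 60000)
)

# Average Salary per employee: identical copies collapse to the original value
mean_salary <- aggregate(Salary ~ ID + Name + Age, data = example_data, FUN = mean)

# Count how many records exist per employee
record_count <- aggregate(Salary ~ ID + Name + Age, data = example_data, FUN = length)
names(record_count)[names(record_count) == "Salary"] <- "Records"

print(mean_salary)
print(record_count)
```

With mean, Alice stays at 50000 and Bob at 60000, and the Records column shows that each of them appears twice in the raw data.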

Data Matching

This is done when we want to keep only one of the duplicated records, such as the earliest one. It keeps a single value out of the multiple values, so the order of the rows decides which record survives. The !duplicated() condition keeps only the first occurrence of each unique combination of the selected columns.

R

# Keep only the first occurrence of each unique combination of ID and Name
matched_data <- example_data[!duplicated(example_data[c("ID", "Name")]), ]
 
# Display the dataset after matching duplicates
print("Dataset after Matching Duplicates:")
print(matched_data)

Output:

[1] "Dataset after Matching Duplicates:"
  ID    Name Age Salary
1  1   Alice  25  50000
2  2     Bob  30  60000
3  3 Charlie  35  70000
4  4   David  22  45000
5  5     Eve  28  55000
7  6   Frank  40  80000
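Keeping the first occurrence is not always what we want. When duplicates disagree and the rows are ordered from oldest to newest, the last record may be the more relevant one. The sketch below uses a small hypothetical dataset (salary_history, not part of the example above) and the fromLast = TRUE argument of duplicated() to keep the last occurrence instead.

```r
# Hypothetical records: Bob appears twice with different salaries,
# and rows are assumed to be ordered from oldest to newest
salary_history <- data.frame(
  ID = c(1, 2, 3, 2),
  Name = c("Alice", "Bob", "Charlie", "Bob"),
  Salary = c(50000, 60000, 70000, 65000)
)

key_columns <- c("ID", "Name")

# Default behaviour: keep the first (oldest) record per key
keep_first <- salary_history[!duplicated(salary_history[key_columns]), ]

# fromLast = TRUE scans from the bottom, so the last (newest) record is kept
keep_last <- salary_history[!duplicated(salary_history[key_columns], fromLast = TRUE), ]

print(keep_first)
print(keep_last)
```

Here keep_first retains Bob's salary of 60000, while keep_last retains 65000. If the data is not already in a meaningful order, it would need to be sorted first, for example with order() on a date column.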

Conclusion

In this article, we learned how to deal with missing, invalid, and duplicate data in the R programming language with the help of different examples. We also visualized the original and cleaned datasets to understand the difference between them.

Coping with Missing, Invalid and Duplicate Data in R

Data is the foundation of statistical analysis and machine learning. Freely available data is often raw and has many issues, such as invalid entries and missing or duplicate values, that can significantly affect model training and estimation.

We use past data to train a model and predict values based on it. Issues such as invalid data or missing values can lower the accuracy of prediction models; therefore, handling these problems is an important part of data processing. In this article, we will learn how to cope with missing, invalid, and duplicate data in the R programming language.
