What is Duplicate Data?

Duplicate data occurs when a dataset contains rows (or, less often, columns) that are identical or nearly identical. Duplicates cause the same record to be counted two or more times, which distorts totals, averages, and model results, so they need to be handled before analysis. To illustrate, we will create a fictional employee dataset with the columns ID, Name, Age, and Salary, in which some records appear more than once.

R




# Create a sample dataset with duplicate data
example_data <- data.frame(
  ID = c(1, 2, 3, 4, 5, 1, 6, 2),
  Name = c("Alice", "Bob", "Charlie", "David", "Eve", "Alice", "Frank", "Bob"),
  Age = c(25, 30, 35, 22, 28, 25, 40, 30),
  Salary = c(50000, 60000, 70000, 45000, 55000, 50000, 80000, 60000)
)
 
# Display the dataset with duplicate data
print("Dataset with Duplicate Data:")
print(example_data)


Output:

[1] "Dataset with Duplicate Data:"
  ID    Name Age Salary
1  1   Alice  25  50000
2  2     Bob  30  60000
3  3 Charlie  35  70000
4  4   David  22  45000
5  5     Eve  28  55000
6  1   Alice  25  50000
7  6   Frank  40  80000
8  2     Bob  30  60000

Here row 6 is a duplicate of row 1 and row 8 is a duplicate of row 2, so Alice and Bob are each recorded twice.
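
To make the double-counting problem concrete, the following minimal sketch compares the total salary computed on the raw data with the total computed after dropping exact duplicate rows. The variable names total_raw and total_dedup are introduced here for illustration only.

```r
# Recreate the sample dataset so this snippet runs on its own
example_data <- data.frame(
  ID = c(1, 2, 3, 4, 5, 1, 6, 2),
  Name = c("Alice", "Bob", "Charlie", "David", "Eve", "Alice", "Frank", "Bob"),
  Age = c(25, 30, 35, 22, 28, 25, 40, 30),
  Salary = c(50000, 60000, 70000, 45000, 55000, 50000, 80000, 60000)
)

# Total salary with Alice and Bob counted twice
total_raw <- sum(example_data$Salary)

# Total salary after dropping exact duplicate rows
total_dedup <- sum(unique(example_data)$Salary)

print(total_raw)    # 470000
print(total_dedup)  # 360000
```

The difference of 110000 is exactly the salaries of the two repeated records.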

Identify Duplicate Data

Because the example dataset is small, the duplicates are easy to spot by eye. With a large dataset, checking every row manually is impractical and time-consuming. The duplicated() function does this for us: it returns TRUE for every row that repeats an earlier row, so it can be used to extract the duplicates:

R




# Identify duplicate rows based on all columns
duplicates_all <- example_data[duplicated(example_data), ]
 
# Identify duplicate rows based on selected columns (e.g., ID and Name)
duplicates_selected <- example_data[duplicated(example_data[c("ID", "Name")]), ]
 
# Display duplicate rows
print("Duplicate Rows (All Columns):")
print(duplicates_all)
 
print("Duplicate Rows (Selected Columns):")
print(duplicates_selected)


Output:

[1] "Duplicate Rows (All Columns):"
  ID  Name Age Salary
6  1 Alice  25  50000
8  2   Bob  30  60000

[1] "Duplicate Rows (Selected Columns):"
  ID  Name Age Salary
6  1 Alice  25  50000
8  2   Bob  30  60000

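This code returns the duplicated rows in our dataset. The first occurrence of each record is not flagged; only the later copies (rows 6 and 8) are returned.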
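
Since duplicated() skips the first occurrence, it can be useful to count the duplicates and to list every copy of a repeated record side by side. The sketch below does both by combining duplicated() with its fromLast = TRUE argument; the names n_duplicates, in_dup_group, and all_copies are illustrative and not part of the original example.

```r
# Recreate the sample dataset so this snippet runs on its own
example_data <- data.frame(
  ID = c(1, 2, 3, 4, 5, 1, 6, 2),
  Name = c("Alice", "Bob", "Charlie", "David", "Eve", "Alice", "Frank", "Bob"),
  Age = c(25, 30, 35, 22, 28, 25, 40, 30),
  Salary = c(50000, 60000, 70000, 45000, 55000, 50000, 80000, 60000)
)

# Count the rows that repeat an earlier row
n_duplicates <- sum(duplicated(example_data))

# Flag every member of a duplicated group, including the first occurrence:
# a row is in such a group if it repeats an earlier row OR a later row
in_dup_group <- duplicated(example_data) |
  duplicated(example_data, fromLast = TRUE)
all_copies <- example_data[in_dup_group, ]

print(n_duplicates)
print(all_copies)
```

For this dataset the count is 2, and all_copies contains rows 1, 2, 6, and 8.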

Dealing with Duplicate Data

There are several ways of dealing with duplicate data, such as deleting the duplicated rows, aggregating them, or keeping a single record per key. We will apply each approach to the employee example above.

Deleting duplicate values

We can delete rows that appear two or more times so that only one copy of each remains. The unique() function returns the data frame with the duplicated rows removed.

R




# Remove duplicate rows and create a new dataset
no_duplicates_data <- unique(example_data)
 
# Display the dataset after removing duplicates
print("Dataset after Removing Duplicates:")
print(no_duplicates_data)


Output:

[1] "Dataset after Removing Duplicates:"
  ID    Name Age Salary
1  1   Alice  25  50000
2  2     Bob  30  60000
3  3 Charlie  35  70000
4  4   David  22  45000
5  5     Eve  28  55000
7  6   Frank  40  80000
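
Notice that the row labels in the output skip 6, because the original row numbers are kept. As a small sketch, the code below shows the equivalent !duplicated() form of the same deletion and one way to renumber the rows afterwards. The names via_unique and via_duplicated are illustrative.

```r
# Recreate the sample dataset so this snippet runs on its own
example_data <- data.frame(
  ID = c(1, 2, 3, 4, 5, 1, 6, 2),
  Name = c("Alice", "Bob", "Charlie", "David", "Eve", "Alice", "Frank", "Bob"),
  Age = c(25, 30, 35, 22, 28, 25, 40, 30),
  Salary = c(50000, 60000, 70000, 45000, 55000, 50000, 80000, 60000)
)

# Two ways to drop exact duplicate rows
via_unique <- unique(example_data)
via_duplicated <- example_data[!duplicated(example_data), ]

# Check that both approaches keep the same rows
print(all.equal(via_unique, via_duplicated))

# Optionally renumber the rows so the labels run from 1 to 6 without gaps
rownames(via_unique) <- NULL
print(via_unique)
```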

Aggregating Duplicate Data

If the duplicated rows represent values recorded for different periods, we may want to merge them into one row instead of deleting them. The code below groups the rows by ID, Name, and Age and sums Salary within each group:

R




# Aggregate data by summing Salary for each unique combination of ID, Name, and Age
aggregated_data <- aggregate(Salary ~ ID + Name + Age, data = example_data, sum)
 
# Display the aggregated dataset
print("Aggregated Dataset:")
print(aggregated_data)


Output:

[1] "Aggregated Dataset:"
  ID    Name Age Salary
1  4   David  22  45000
2  1   Alice  25 100000
3  5     Eve  28  55000
4  2     Bob  30 120000
5  3 Charlie  35  70000
6  6   Frank  40  80000
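
Summing is appropriate only when each copy carries a separate amount. In this example the copies are exact repeats, so the sum doubles the salaries of Alice and Bob. As a hedged alternative, the sketch below aggregates with mean, which returns the original salary when the copies are identical, and with length, which counts how many copies each employee has. The names mean_salary and copies, and the column name Copies, are illustrative choices.

```r
# Recreate the sample dataset so this snippet runs on its own
example_data <- data.frame(
  ID = c(1, 2, 3, 4, 5, 1, 6, 2),
  Name = c("Alice", "Bob", "Charlie", "David", "Eve", "Alice", "Frank", "Bob"),
  Age = c(25, 30, 35, 22, 28, 25, 40, 30),
  Salary = c(50000, 60000, 70000, 45000, 55000, 50000, 80000, 60000)
)

# Average Salary per employee: identical copies collapse to the original value
mean_salary <- aggregate(Salary ~ ID + Name, data = example_data, FUN = mean)

# Number of rows recorded per employee
copies <- aggregate(Salary ~ ID + Name, data = example_data, FUN = length)
names(copies)[names(copies) == "Salary"] <- "Copies"

print(mean_salary)
print(copies)
```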

Data Matching

Data matching is used when we want to keep only one record per key, for example the earliest one. The !duplicated() condition keeps the first occurrence of each unique combination of the selected columns and drops the later ones. Note that this keeps the first record, not necessarily the most relevant one, so sort the data first if a particular record should win.

R




# Keep only the first occurrence of each unique combination of ID and Name
matched_data <- example_data[!duplicated(example_data[c("ID", "Name")]), ]
 
# Display the dataset after matching duplicates
print("Dataset after Matching Duplicates:")
print(matched_data)


Output:

[1] "Dataset after Matching Duplicates:"
ID Name Age Salary
1 1 Alice 25 50000
2 2 Bob 30 60000
3 3 Charlie 35 70000
4 4 David 22 45000
5 5 Eve 28 55000
7 6 Frank 40 80000
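
When the copies of a record differ, which one is kept matters. The sketch below uses a small hypothetical dataset (records, not part of the original example) in which ID 1 appears twice with different salaries. It shows two variations: keeping the last occurrence with fromLast = TRUE, and sorting first so that the highest salary per ID is the one kept.

```r
# Hypothetical records: ID 1 appears twice with different salaries
records <- data.frame(
  ID = c(1, 2, 1, 3),
  Name = c("Alice", "Bob", "Alice", "Charlie"),
  Salary = c(50000, 60000, 52000, 70000)
)

# Keep the last occurrence of each ID instead of the first
latest <- records[!duplicated(records$ID, fromLast = TRUE), ]

# Sort by ID, then by Salary in descending order, and keep the first row per ID,
# so the highest salary for each ID is retained
ordered <- records[order(records$ID, -records$Salary), ]
highest <- ordered[!duplicated(ordered$ID), ]

print(latest)
print(highest)
```

In both results, the record kept for ID 1 is the one with a salary of 52000.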

Conclusion

In this article, we learned how to deal with missing, invalid, and duplicate data in the R programming language with the help of different examples. We also visualized the original and cleaned datasets to understand the difference between them.



Coping with Missing, Invalid and Duplicate Data in R

Data is the basis of statistical analysis and machine learning. Freely available data is often raw and has many issues, such as invalid entries and missing or duplicate values, that can significantly affect model training and estimation.

We use past data to train a model and predict values. Issues such as invalid data or missing values can lower the accuracy of prediction models, so handling them is an important part of data processing. In this article, we will learn how to cope with missing, invalid, and duplicate data in the R programming language.
