What is a Toy Dataset? – Types, Purpose, Benefits and Applications
Toy datasets are small, simple datasets commonly used in the field of machine learning for training, testing, and demonstrating algorithms. These datasets are typically clean, well-organized, and structured in a way that makes them easy to use for instructional purposes, reducing the complexities associated with real-world data processing.
What is a Toy Dataset?
A toy dataset is a small, simplified set of data used in machine learning and statistics, created primarily for teaching and experimentation. These datasets are basic enough to help beginners get started, while still giving practitioners a convenient testbed for exploring concepts in more depth.
Table of Contents
- What is a Toy Dataset?
- Characteristics of Toy Datasets
- Types of Toy Datasets
- 1. Iris Plants Dataset
- 2. Diabetes Dataset
- 3. Optical Recognition of Handwritten Digits Dataset
- 4. Linnerrud Dataset
- 5. Wine Recognition Dataset
- 6. Breast Cancer Wisconsin (Diagnostic) Dataset
- Purpose and Benefits of Toy Datasets
- Limitations of Toy Datasets
- Conclusion
Characteristics of Toy Datasets
Here’s a breakdown of their key characteristics:
- Simple and Understandable: A toy dataset is easy to comprehend and analyze, typically containing only a small number of variables and observations.
- Controlled Environment: The data is often synthetic or deliberately curated to remove complications such as noise and missing values, making it possible to study specific concepts in isolation.
- Focus on Learning: Essentially, they are employed for pedagogical purposes so that newbies can get acquainted with data analysis, learn how to use algorithms, and understand core machine learning concepts.
Scikit-learn ships with several small standard datasets that do not need to be downloaded from any external site.
Types of Toy Datasets
Some of the most popular Toy Datasets include:
- Iris plants dataset
- Diabetes dataset
- Optical recognition of handwritten digits dataset
- Linnerrud dataset
- Wine recognition dataset
- Breast cancer wisconsin (diagnostic) dataset
Let us look at each of them one by one:
1. Iris Plants Dataset
This dataset contains 150 records of iris flowers, each with measurements of sepal length, sepal width, petal length, and petal width. The task is typically to classify these records into one of three iris species.
Classes | 3
Samples per class | 50
Samples total | 150
Dimensionality | 4
Features | real, positive
Example: loading the Iris dataset

```python
from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset
iris = load_iris()

# Create a DataFrame from the dataset for easier manipulation
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Print the first few rows of the DataFrame
print(iris_df.head())
```
Output:
```
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
species
0 setosa
1 setosa
2 setosa
3 setosa
4 setosa
```
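Beyond inspecting the rows, the usual exercise with this dataset is the three-class classification task mentioned above. A minimal sketch follows; the choice of a k-nearest-neighbors classifier and the split parameters are illustrative, not part of the dataset itself:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the features and the species labels directly as arrays
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a k-nearest-neighbors classifier and score it on the held-out set
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.2f}")
```

Because the classes are well separated, almost any standard classifier reaches high accuracy here, which is exactly why the dataset is a good first exercise.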
2. Diabetes Dataset
The load_diabetes function from scikit-learn provides a dataset for regression analysis, featuring physiological measurements and diabetes progression indicators from 442 patients.
Samples total | 442
Dimensionality | 10
Features | real, -.2 < x < .2
Targets | integer 25 – 346
Example: loading the Diabetes dataset

```python
from sklearn.datasets import load_diabetes
import pandas as pd

# Load the Diabetes dataset
diabetes = load_diabetes()

# Create a DataFrame from the dataset for easier manipulation
diabetes_df = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)
diabetes_df['target'] = diabetes.target

# Print the first few rows of the DataFrame
print(diabetes_df.head())
```
Output:
```
age sex bmi bp s1 s2 s3 \
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412
2 0.085299 0.050680 0.044451 -0.005670 -0.045599 -0.034194 -0.032356
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142
s4 s5 s6 target
0 -0.002592 0.019907 -0.017646 151.0
1 -0.039493 -0.068332 -0.092204 75.0
2 -0.002592 0.002861 -0.025930 141.0
3 0.034309 0.022688 -0.009362 206.0
4 -0.002592 -0.031988 -0.046641 135.0
```
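Since this is a regression dataset, a typical first experiment is to fit a regressor to the disease-progression target. A minimal sketch, with ordinary least squares chosen purely for illustration:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load features and the continuous progression target
X, y = load_diabetes(return_X_y=True)

# Split into training and held-out sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a plain linear regression and report R^2 on the held-out data
reg = LinearRegression().fit(X_train, y_train)
print(f"R^2 on the test set: {reg.score(X_test, y_test):.2f}")
```

The modest R^2 a linear model achieves here is itself instructive: even on a clean toy dataset, the target is only partly explained by the ten features.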
3. Optical Recognition of Handwritten Digits Dataset
The load_digits function from scikit-learn loads a dataset of 1,797 samples of 8×8 images of handwritten digits, useful for practicing image classification techniques in machine learning with 10 class labels (0-9).
Classes | 10
Samples per class | ~180
Samples total | 1797
Dimensionality | 64
Features | integers 0-16
Example: loading the digits dataset

```python
from sklearn.datasets import load_digits
import pandas as pd

# Load the digits dataset
digits = load_digits()

# Create a DataFrame with readable pixel column names
digits_df = pd.DataFrame(
    data=digits.data,
    columns=[f'pixel_{i}' for i in range(digits.data.shape[1])])
digits_df['target'] = digits.target

# Print the first few rows of the DataFrame
print(digits_df.head())
```
Output:
```
pixel_0 pixel_1 pixel_2 pixel_3 pixel_4 pixel_5 pixel_6 pixel_7 \
0 0.0 0.0 5.0 13.0 9.0 1.0 0.0 0.0
1 0.0 0.0 0.0 12.0 13.0 5.0 0.0 0.0
2 0.0 0.0 0.0 4.0 15.0 12.0 0.0 0.0
3 0.0 0.0 7.0 15.0 13.0 1.0 0.0 0.0
4 0.0 0.0 0.0 1.0 11.0 0.0 0.0 0.0
pixel_8 pixel_9 ... pixel_55 pixel_56 pixel_57 pixel_58 pixel_59 \
0 0.0 0.0 ... 0.0 0.0 0.0 6.0 13.0
1 0.0 0.0 ... 0.0 0.0 0.0 0.0 11.0
2 0.0 0.0 ... 0.0 0.0 0.0 0.0 3.0
3 0.0 8.0 ... 0.0 0.0 0.0 7.0 13.0
4 0.0 0.0 ... 0.0 0.0 0.0 0.0 2.0
pixel_60 pixel_61 pixel_62 pixel_63 target
0 10.0 0.0 0.0 0.0 0
1 16.0 10.0 0.0 0.0 1
2 11.0 16.0 9.0 0.0 2
3 13.0 9.0 0.0 0.0 3
4 16.0 4.0 0.0 0.0 4
[5 rows x 65 columns]
```
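The 64 columns are not arbitrary: each row is a flattened 8×8 grayscale image, and scikit-learn also exposes the unflattened pixel grids via the `images` attribute. A short sketch of the correspondence:

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Each 64-column row of digits.data is a flattened 8x8 grayscale image;
# the same pixels are available unflattened in digits.images
print(digits.data.shape)    # (1797, 64)
print(digits.images.shape)  # (1797, 8, 8)

# The first sample as an 8x8 pixel grid, together with its label
print(digits.images[0])
print(digits.target[0])     # 0 -- matching the first row of the head() output above
```

Working from `digits.images` is convenient for plotting the samples (for instance with matplotlib's `imshow`), while `digits.data` is the shape classifiers expect.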
4. Linnerrud Dataset
The load_linnerud function in scikit-learn provides a multi-output regression dataset containing exercise and physiological measurements from twenty middle-aged men, useful for fitness-related studies.
Samples total | 20
Dimensionality | 3 (for both data and target)
Features | integer
Targets | integer
Example: loading the Linnerrud dataset

```python
from sklearn.datasets import load_linnerud
import pandas as pd

# Load the Linnerud dataset
linnerud = load_linnerud()

# Features DataFrame (exercise measurements)
features_df = pd.DataFrame(data=linnerud.data, columns=linnerud.feature_names)

# Targets DataFrame (physiological measurements)
targets_df = pd.DataFrame(data=linnerud.target, columns=linnerud.target_names)

# Print the first few rows of the features DataFrame
print("Features DataFrame:")
print(features_df.head())
```
Output:
```
Features DataFrame:
Chins Situps Jumps
0 5.0 162.0 60.0
1 2.0 110.0 60.0
2 12.0 101.0 101.0
3 12.0 105.0 37.0
4 13.0 155.0 58.0
```
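What distinguishes this dataset is that the target itself has three columns, making it a natural playground for multi-output regression. A minimal sketch (the use of a single `LinearRegression` fitted jointly on all targets is one illustrative choice):

```python
from sklearn.datasets import load_linnerud
from sklearn.linear_model import LinearRegression

# Both X (exercise measurements) and y (physiological measurements)
# have three columns each
X, y = load_linnerud(return_X_y=True)

# A single LinearRegression can fit all three targets jointly
model = LinearRegression().fit(X, y)

# Predictions come back with one column per target
print(model.predict(X[:2]).shape)  # (2, 3)
```

With only twenty samples, any model fitted here will be fragile; the dataset is useful for checking that multi-output plumbing works, not for drawing conclusions.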
5. Wine Recognition Dataset
The load_wine function from scikit-learn offers a dataset for classification tasks, featuring chemical analyses of three different types of Italian wine.
Classes | 3
Samples per class | [59, 71, 48]
Samples total | 178
Dimensionality | 13
Features | real, positive
Example: loading the Wine recognition dataset

```python
from sklearn.datasets import load_wine
import pandas as pd

# Load the wine dataset
wine = load_wine()

# Create a DataFrame from the dataset for easier manipulation
wine_df = pd.DataFrame(data=wine.data, columns=wine.feature_names)
wine_df['target'] = wine.target

# Add a new column with target names for better readability
wine_df['target_name'] = wine_df['target'].apply(lambda x: wine.target_names[x])

# Print the first few rows of the DataFrame
print(wine_df.head())
```
Output:
```
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols \
0 14.23 1.71 2.43 15.6 127.0 2.80
1 13.20 1.78 2.14 11.2 100.0 2.65
2 13.16 2.36 2.67 18.6 101.0 2.80
3 14.37 1.95 2.50 16.8 113.0 3.85
4 13.24 2.59 2.87 21.0 118.0 2.80
flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue \
0 3.06 0.28 2.29 5.64 1.04
1 2.76 0.26 1.28 4.38 1.05
2 3.24 0.30 2.81 5.68 1.03
3 3.49 0.24 2.18 7.80 0.86
4 2.69 0.39 1.82 4.32 1.04
od280/od315_of_diluted_wines proline target target_name
0 3.92 1065.0 0 class_0
1 3.40 1050.0 0 class_0
2 3.17 1185.0 0 class_0
3 3.45 1480.0 0 class_0
4 2.93 735.0 0 class_0
```
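One useful lesson this dataset teaches is feature scaling: as the output above shows, `proline` takes values around 1000 while `hue` sits near 1, so scale-sensitive models benefit from standardization. A minimal sketch, where the choice of logistic regression and the split parameters are illustrative:

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize the features before fitting, since their scales differ
# by several orders of magnitude (proline vs. hue)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.2f}")
```

Putting the scaler inside a pipeline keeps the scaling parameters learned on the training split only, avoiding leakage into the test set.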
6. Breast Cancer Wisconsin (Diagnostic) Dataset
The load_breast_cancer function in scikit-learn provides a dataset for binary classification between benign and malignant breast tumors based on features derived from cell nucleus images.
Classes | 2
Samples per class | 212 (malignant), 357 (benign)
Samples total | 569
Dimensionality | 30
Features | real, positive
Example: loading the Breast Cancer Wisconsin dataset

```python
from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load the breast cancer dataset
breast_cancer = load_breast_cancer()

# Create a DataFrame from the dataset for easier manipulation
cancer_df = pd.DataFrame(data=breast_cancer.data, columns=breast_cancer.feature_names)
cancer_df['target'] = breast_cancer.target

# Add a new column with target names for better readability
cancer_df['diagnosis'] = cancer_df['target'].apply(lambda x: breast_cancer.target_names[x])

# Print the first few rows of the DataFrame
print(cancer_df.head())
```
Output:
```
mean radius mean texture mean perimeter mean area mean smoothness \
0 17.99 10.38 122.80 1001.0 0.11840
1 20.57 17.77 132.90 1326.0 0.08474
2 19.69 21.25 130.00 1203.0 0.10960
3 11.42 20.38 77.58 386.1 0.14250
4 20.29 14.34 135.10 1297.0 0.10030
mean compactness mean concavity mean concave points mean symmetry \
0 0.27760 0.3001 0.14710 0.2419
1 0.07864 0.0869 0.07017 0.1812
2 0.15990 0.1974 0.12790 0.2069
3 0.28390 0.2414 0.10520 0.2597
4 0.13280 0.1980 0.10430 0.1809
mean fractal dimension ... worst perimeter worst area worst smoothness \
0 0.07871 ... 184.60 2019.0 0.1622
1 0.05667 ... 158.80 1956.0 0.1238
2 0.05999 ... 152.50 1709.0 0.1444
3 0.09744 ... 98.87 567.7 0.2098
4 0.05883 ... 152.20 1575.0 0.1374
worst compactness worst concavity worst concave points worst symmetry \
0 0.6656 0.7119 0.2654 0.4601
1 0.1866 0.2416 0.1860 0.2750
2 0.4245 0.4504 0.2430 0.3613
3 0.8663 0.6869 0.2575 0.6638
4 0.2050 0.4000 0.1625 0.2364
worst fractal dimension target diagnosis
0 0.11890 0 malignant
1 0.08902 0 malignant
2 0.08758 0 malignant
3 0.17300 0 malignant
4 0.07678 0 malignant
[5 rows x 32 columns]
```
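Because the two classes are unequal in size (212 malignant vs. 357 benign), this dataset is a good place to practice stratified splitting, which preserves the class ratio in both the training and test sets. A minimal sketch; the split parameters are illustrative:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y keeps the malignant/benign ratio the same in both splits,
# which matters when the classes are imbalanced
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

print(np.bincount(y_train))  # class counts in the training split
print(np.bincount(y_test))   # class counts in the test split
```

Without `stratify`, a random split on a small imbalanced dataset can leave one split with a noticeably different class ratio, which biases accuracy estimates.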
Purpose and Benefits of Toy Datasets
- Educational Tools: Toy datasets serve as excellent resources for teaching and learning machine learning concepts. They allow beginners to focus on understanding algorithms and techniques without getting bogged down by the challenges of data cleaning, preprocessing, or large-scale data management.
- Benchmarking: These datasets provide a standardized framework for evaluating and comparing the performance of various algorithms and models. Since the results are easily reproducible, researchers and developers can benchmark their methods against established baselines.
- Rapid Prototyping: They are ideal for prototyping machine learning models quickly. Developers can test the viability of an algorithm or model design before applying it to more complex and larger datasets.
- Algorithm Development and Testing: Developers use toy datasets to test new algorithms for accuracy, efficiency, and other performance metrics. This testing can reveal fundamental strengths and weaknesses in algorithmic approaches under controlled conditions.
Limitations of Toy Datasets
While toy datasets are valuable educational tools, they do have limitations:
- Simplicity: Toy datasets are often too simple and fail to represent the complexity and noise found in real-world data. This can lead to overly optimistic performance estimates for models trained on these datasets.
- Size: Due to their small size, models trained on toy datasets might not scale well or might overfit when applied to larger, real-world datasets.
- Lack of Diversity: These datasets might not capture the diverse scenarios and variations found in real-world applications, which can limit the generalizability of the insights gained.
Conclusion
Toy datasets, with their simplicity and structured format, play a crucial role in the field of machine learning, particularly in education and preliminary testing. They offer an excellent starting point for beginners to understand fundamental concepts and for experts to test and benchmark new algorithms efficiently. The manageable size of these datasets allows for quick computational tasks and easy visualization, which are invaluable for instructional purposes and algorithm development.