KMeans Clustering with Iris Dataset

K-means clustering is an unsupervised machine learning algorithm that partitions the data into K clusters. It works as follows:

  • First, choose the number of clusters K
  • Randomly select K points from the dataset as the initial centroids
  • Assign every point to its closest cluster centroid
  • Recompute each centroid as the mean of the points assigned to it
  • Repeat the assignment and recomputation steps until the centroids converge
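
The steps above can be sketched in plain Python. This is a minimal illustration rather than the implementation scikit-learn uses: the function names `squared_distance` and `kmeans` and the toy points are made up for this example, and the initial centroids are picked at random, whereas the code in this article asks scikit-learn for `k-means++` initialization.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=300, seed=0):
    """Plain k-means on a list of numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Assign every point to its closest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[c])
                continue
            dim = len(members[0])
            new_centroids.append(
                tuple(sum(m[d] for m in members) / len(members)
                      for d in range(dim))
            )
        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Toy data (made up for illustration): two well-separated groups.
toy_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
              (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
toy_centroids, toy_labels = kmeans(toy_points, k=2)
print(toy_labels)
```

The first three points end up in one cluster and the last three in the other. Which cluster gets index 0 depends on the random initialization, so cluster indices should not be read as meaningful labels.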

Python3




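from sklearn.cluster import KMeans

# X is the feature data prepared in the dataset-loading step.
# Fit K-means for k = 1..10 and record the within-cluster
# sum of squares (WCSS) for each k.
wcss = []

for i in range(1, 11):
    kmeans = KMeans(n_clusters=i,
                    init='k-means++',
                    max_iter=300,
                    n_init=10,
                    random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# Using the WCSS values above with the elbow method,
# we choose the number of clusters (here 3) for the final model.
kmeans = KMeans(n_clusters=3,
                init='k-means++',
                max_iter=300,
                n_init=10,
                random_state=0)
y_kmeans = kmeans.fit_predict(X)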


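The loop above computes the within-cluster sum of squares (WCSS) for k = 1 to 10. Plotting WCSS against k and looking for the elbow, the point where the curve stops dropping sharply, gives k = 3, which is the value used to fit the final model.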
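
The quantity stored in `inertia_` is the sum of squared distances from each sample to its closest cluster center. To make that concrete, here is a small sketch that computes it by hand. The helper name `within_cluster_ss` and the four toy points are assumptions made up for this illustration, not part of the article's dataset.

```python
def within_cluster_ss(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centroids[label]
        total += sum((a - b) ** 2 for a, b in zip(point, center))
    return total


# Toy data (made up for illustration): two pairs of points far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# k = 1: a single centroid at the overall mean.
wcss_k1 = within_cluster_ss(points, [(5.0, 1.0)], [0, 0, 0, 0])

# k = 2: one centroid per pair.
wcss_k2 = within_cluster_ss(points,
                            [(0.0, 1.0), (10.0, 1.0)],
                            [0, 0, 1, 1])

print(wcss_k1, wcss_k2)  # 104.0 4.0
```

The sharp drop from 104 to 4 when going from one cluster to two is the kind of bend the elbow method looks for. WCSS keeps shrinking as k grows, so the aim is to find where the improvement levels off, not the minimum.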

Visualizing the Clusters

Python3




import matplotlib.pyplot as plt

# Visualising the clusters on the first two feature columns.
# Cluster indices are arbitrary: the species names in the labels
# assume clusters 0, 1, 2 map to these species, so check the
# mapping against the cross-tabulation before relying on it.
cols = iris.columns
plt.scatter(X.loc[y_kmeans == 0, cols[0]],
            X.loc[y_kmeans == 0, cols[1]],
            s=100, c='purple',
            label='Iris-setosa')
plt.scatter(X.loc[y_kmeans == 1, cols[0]],
            X.loc[y_kmeans == 1, cols[1]],
            s=100, c='orange',
            label='Iris-versicolour')
plt.scatter(X.loc[y_kmeans == 2, cols[0]],
            X.loc[y_kmeans == 2, cols[1]],
            s=100, c='green',
            label='Iris-virginica')

# Plotting the centroids of the clusters
plt.scatter(kmeans.cluster_centers_[:, 0],
            kmeans.cluster_centers_[:, 1],
            s=100, c='red',
            label='Centroids')

plt.legend()
plt.show()


Output:

Clusters obtained by using the K-means algorithm

 

Accuracy and Performance of Model

Now let’s check the performance of the model.

Python3




# Rows: true species labels; columns: K-means cluster indices
pd.crosstab(iris.target, y_kmeans)


Output:

 

Because K-means is an unsupervised algorithm, there is no held-out test set to score the model on. Instead, the cross-tabulation compares each cluster with the known species labels. The Setosa class is clustered perfectly. Versicolor has only 2 misclassifications. Virginica overlaps with Versicolor, so it has 14 misclassifications.
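
The counts described above can be turned into a single agreement score, often called purity: for each cluster, count the members of its most common species, then divide the sum by the number of points. The sketch below rebuilds the table from the counts quoted in the text (50 flowers per species, 2 and 14 misassigned). Which cluster index each species falls into, and the assumption that the misassigned points swap between the Versicolor and Virginica clusters, are illustrative guesses, not values read from the actual output.

```python
# Rows: true species; columns: cluster index.
# Counts follow the description in the text; the column each species
# lands in is an assumption made for illustration.
table = {
    "setosa":     [0, 50, 0],
    "versicolor": [48, 0, 2],
    "virginica":  [14, 0, 36],
}
n_clusters = 3

total = sum(sum(row) for row in table.values())

# For each cluster, count the members of its most common species.
majority = [max(row[c] for row in table.values())
            for c in range(n_clusters)]

purity = sum(majority) / total
print(f"{sum(majority)} of {total} points match their cluster's majority species")
print(f"purity = {purity:.3f}")  # purity = 0.893
```

Under these assumptions, 134 of the 150 flowers sit in a cluster dominated by their own species.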



Analyzing Decision Tree and K-means Clustering using Iris dataset

The Iris dataset is one of the best-known datasets in the pattern recognition literature. It contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two; the latter are not linearly separable from each other.

Attribute Information:

  1. Sepal Length in cm
  2. Sepal Width in cm
  3. Petal Length in cm
  4. Petal Width in cm
  5. Class:
    • Iris Setosa
    • Iris Versicolour
    • Iris Virginica

Let’s perform exploratory data analysis on the dataset as an initial investigation.
