Step 1: Import the required modules

Import NumPy along with the functions from scipy.cluster.vq that are used for K-Means clustering and related operations in Python.

# import modules
import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq, kmeans2

Step 2: Import/generate data. Normalize the data

Here we normalize a dataset with the whiten function from the SciPy library. whiten divides each feature (column) by its standard deviation across all observations, so every feature ends up with unit variance; it does not subtract the mean. Rescaling this way keeps features with large numeric ranges from dominating the distance calculations in the clustering step, which is why it is a common preprocessing step before K-Means.

# observations
data = np.array([[1, 3, 4, 5, 2],
                 [2, 3, 1, 6, 3],
                 [1, 5, 2, 3, 1],
                 [3, 4, 9, 2, 1]])

# normalize
data = whiten(data)
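
As a quick check of what whiten does, the sketch below compares its output with a manual division by the per-column standard deviation. The names whitened and manual are illustrative and not part of the original example.

```python
import numpy as np
from scipy.cluster.vq import whiten

data = np.array([[1, 3, 4, 5, 2],
                 [2, 3, 1, 6, 3],
                 [1, 5, 2, 3, 1],
                 [3, 4, 9, 2, 1]])

# whiten divides every column by its standard deviation (NumPy's default, ddof=0)
whitened = whiten(data)
manual = data / data.std(axis=0)

print("Same as manual scaling :", np.allclose(whitened, manual))
print("Column std after whiten :", whitened.std(axis=0))
print("Column mean after whiten :", whitened.mean(axis=0))
```

The column means stay non-zero, because whiten only rescales the data and does not center it.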



Step 3: Calculate the centroids and generate the code book for mapping using kmeans() method

Here we run the K-Means clustering algorithm with the kmeans function from the SciPy library. It returns the cluster centroids (the code book) and the mean Euclidean distance between the data points and their nearest centroids. When it is given an integer K, kmeans randomly chooses K data points as the initial centroids. These serve as the starting points for the clustering process, so the result can differ from run to run.

# code book generation
centroids, mean_value = kmeans(data, 3)

print("Code book :\n", centroids, "\n")
print("Mean of Euclidean distances :", mean_value)
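
Because the starting centroids are random, repeated runs can give different code books. One way to make a run repeatable is to pass an array of initial centroids instead of the integer K. The sketch below assumes, purely for illustration, that the first three observations are used as the starting centroids; the name initial_guess is not part of the original example.

```python
import numpy as np
from scipy.cluster.vq import whiten, kmeans

data = whiten(np.array([[1, 3, 4, 5, 2],
                        [2, 3, 1, 6, 3],
                        [1, 5, 2, 3, 1],
                        [3, 4, 9, 2, 1]]))

# Assumption for illustration: start from the first three observations
initial_guess = data[:3]

# With an explicit initial guess there is no random initialization
centroids_a, distortion_a = kmeans(data, initial_guess)
centroids_b, distortion_b = kmeans(data, initial_guess)

print("Identical code books :", np.allclose(centroids_a, centroids_b))
print("Code book shape :", centroids_a.shape)
```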


Step 4: Map the centroids calculated in the previous step to the clusters

Here we use the vq function from the SciPy library to assign each data point to its nearest pre-calculated centroid and to compute the distance between the data point and that centroid. The code prints the cluster assignments and the distance of each data point to its assigned centroid.

# mapping the centroids
clusters, distances = vq(data, centroids)

print("Cluster index :", clusters, "\n")
print("Distance from the centroids :", distances)
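
To make the mapping step concrete, the following sketch reproduces what vq computes with plain NumPy: the Euclidean distance from every observation to every centroid, followed by the index and value of the smallest one. The arrays obs and code_book are small made-up values chosen for illustration, not the data from the steps above.

```python
import numpy as np
from scipy.cluster.vq import vq

# Made-up observations and centroids, for illustration only
obs = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [4.1, 3.9]])
code_book = np.array([[0.0, 0.0], [4.0, 4.0]])

# Distance from every observation (rows) to every centroid (columns)
all_dist = np.linalg.norm(obs[:, None, :] - code_book[None, :, :], axis=2)
manual_clusters = all_dist.argmin(axis=1)   # index of the nearest centroid
manual_distances = all_dist.min(axis=1)     # distance to that centroid

clusters, distances = vq(obs, code_book)

print("Cluster index :", clusters)
print("Matches manual result :", np.array_equal(clusters, manual_clusters))
```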


Consider the same example with kmeans2(). It returns the centroids together with the cluster index of every data point, so the additional call to vq() is not required. Repeat steps 1 and 2, then use the following snippet.

# assign centroids and clusters
centroids, clusters = kmeans2(data, 3)

print("Centroids :\n", centroids, "\n")
print("Clusters :", clusters)
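
kmeans2 also accepts explicit starting centroids through minit='matrix', in which case the second argument is read as the initial code book rather than as a cluster count. The sketch below uses small made-up arrays (points and initial are illustrative names) to show that the centroids and labels come back from a single call.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Made-up data with two well-separated groups, for illustration only
points = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [4.2, 3.9]])

# Assumed starting centroids: one point from each group
initial = np.array([[0.0, 0.0], [4.0, 4.0]])

# With minit='matrix' the second argument is the initial code book
centroids, labels = kmeans2(points, initial, minit='matrix')

print("Centroids :\n", centroids)
print("Clusters :", labels)
```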


Example 2: K-Means clustering of Diabetes dataset

The dataset contains the following attributes, based on which each patient is placed in either the diabetic cluster or the non-diabetic cluster.

  • Pregnancies
  • Glucose
  • Blood Pressure
  • Skin Thickness
  • Insulin
  • BMI
  • Diabetes Pedigree Function
  • Age

This code demonstrates a basic example of using clustering techniques to analyze diabetes patient data and visualize the distribution of diabetic and non-diabetic patients using a pie chart. The code uses Python libraries such as NumPy, SciPy, and Matplotlib.

# import modules
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq

# load the dataset
dataset = np.loadtxt(r"{your-path}\diabetes-train.csv", delimiter=",")

# excluding the outcome column
dataset = dataset[:, 0:8]

print("Data :\n", dataset, "\n")

# normalize
dataset = whiten(dataset)

# generate code book
centroids, mean_dist = kmeans(dataset, 2)
print("Code-book :\n", centroids, "\n")

clusters, dist = vq(dataset, centroids)
print("Clusters :\n", clusters, "\n")

# count non-diabetic patients
non_diab = list(clusters).count(0)

# count diabetic patients
diab = list(clusters).count(1)

# depict illustration
x_axis = [diab, non_diab]

colors = ['green', 'orange']

print("No.of.diabetic patients : " + str(x_axis[0]) +
      "\nNo.of.non-diabetic patients : " + str(x_axis[1]))

y = ['diabetic', 'non-diabetic']

plt.pie(x_axis, labels=y, colors=colors, shadow=True)
plt.show()
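
The counting step can also be written with np.bincount, which returns the size of every cluster in one call. The sketch below runs on a small made-up array rather than the diabetes file, and it fixes the initial centroids so that the result is repeatable; both choices are assumptions for illustration. Keep in mind that K-Means numbers its clusters arbitrarily, so which index corresponds to which group has to be checked separately.

```python
import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq

# Hypothetical stand-in data: 6 rows, 2 features, two obvious groups
dataset = whiten(np.array([[1.0, 10.0], [1.2, 11.0], [0.9, 9.5],
                           [1.1, 10.5], [8.0, 50.0], [8.3, 52.0]]))

# Fixed initial centroids (one row from each group) keep the run repeatable
centroids, _ = kmeans(dataset, dataset[[0, 4]])
clusters, _ = vq(dataset, centroids)

# counts[i] is the number of rows assigned to cluster i
counts = np.bincount(clusters, minlength=2)
print("Cluster sizes :", counts)
```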


In conclusion, this example clusters a diabetes dataset with K-Means and visualizes the result. K-Means splits the patients into two groups using only the eight attributes, and the pie chart shows the relative size of each group. Because clustering is unsupervised, the cluster indices 0 and 1 are assigned arbitrarily, so the diabetic and non-diabetic labels should be checked against the outcome column before any medical conclusions are drawn.

K-means clustering in Python is one of the most widely used unsupervised machine-learning techniques for data segmentation and pattern discovery. This article will explore K-means clustering in Python using the powerful SciPy library. With a step-by-step approach, we will cover the fundamentals, implementation, and interpretation of K-Means clustering, providing you with a comprehensive understanding of this essential data analysis technique.

K-Means clustering with SciPy library

The K-means clustering in Python can be done on given data by executing the following steps....

