Clustering

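Clustering is an unsupervised learning approach that groups data points so that points in the same group (a "cluster") are more similar to each other than to points in other groups. What counts as similar depends on the clustering technique. There are many varieties of clustering, such as hierarchical and k-means.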

Common Applications

Common Industries

Agriscience
Healthcare
Marketing
Tech/Social Media

Common Problem Types

Anomaly Identification
Market Segmentation
Genetic/Biological Analysis
Recommender Systems

Different Types of Clustering

There is a remarkable diversity of clustering methods, and at least one scholar attributes this to the vagueness of the term "cluster". In general, clustering algorithms group data in some way, but how they accomplish this varies widely. Here is a short list of clustering algorithms:

k-Means
BIRCH
DBSCAN
OPTICS
Hierarchical
Expectation-Maximization (EM)
Subspace
Fuzzy
Biclustering

Partitional vs. Hierarchical

Clustering algorithms are commonly divided into partitional and hierarchical algorithms. Partitional clustering methods produce a single flat partition, where each data point belongs to one and only one cluster. Hierarchical clustering methods create multiple nested layers of partitioning, where each cluster can contain subclusters.

Figure 1. quantdare.com/hierarchical-clustering/

Hierarchical clustering methods:

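Agglomerative (bottom-up): start with each data point as its own cluster, then repeatedly merge the closest clusters into larger and larger ones
Divisive (top-down): start with all data points in one cluster, then repeatedly split it into smaller and smaller clusters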
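
To make the layered structure concrete, here is a minimal sketch of agglomerative clustering in plain Python. It is not taken from the notebooks below. It assumes one-dimensional points and single linkage (the distance between two clusters is the distance between their closest members), and the function name `single_linkage` is made up for this illustration. In practice a library routine would normally be used instead.

```python
from itertools import combinations

def single_linkage(points):
    """Agglomerative clustering of 1-D points.

    Returns the partition after each merge, from all singletons
    down to one cluster containing everything.
    """
    clusters = [[p] for p in sorted(points)]  # every point starts as its own cluster
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Pick the pair of clusters whose closest members are nearest (single linkage).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(
                abs(a - b) for a in clusters[ij[0]] for b in clusters[ij[1]]
            ),
        )
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        clusters.sort()
        levels.append([list(c) for c in clusters])
    return levels

# Toy data chosen for illustration only.
points = [1.0, 1.5, 5.0, 5.4, 9.0]
levels = single_linkage(points)
for level in levels:
    print(len(level), "clusters:", level)
```

Each printed line is one layer of the hierarchy, and every cluster in one layer is contained in a cluster of the next. Stopping at any single layer gives a flat partition, which is the kind of result a partitional method returns directly.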

Categories of Clustering Models

Below is a short list of categories of cluster modeling techniques:

Centroid-based clustering (like k-means)
Density-based clustering (like DBSCAN and OPTICS)
Distribution-based clustering (using statistical distributions like the multivariate normal to cluster)
Hierarchical clustering (also called connectivity models)
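
As a rough illustration of how two of these categories behave differently, the sketch below implements a bare-bones k-means (centroid-based) and a bare-bones DBSCAN (density-based) in plain Python and runs both on the same toy data. The data, the starting centroids, and the `eps` and `min_pts` values are assumptions picked for the example; libraries such as scikit-learn provide tuned implementations of both algorithms. Distribution-based and hierarchical models are not covered here.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: assign each point to its nearest centroid, then move the centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

def dbscan(points, eps, min_pts):
    """Density-based labels; -1 marks points that belong to no dense region (noise)."""
    labels = [None] * len(points)

    def neighbors(i):
        # Includes point i itself, as in the usual DBSCAN definition.
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nearby = neighbors(j)
            if len(nearby) >= min_pts:  # core point: keep growing the cluster
                queue.extend(nearby)
    return labels

# Two tight groups plus one isolated point (toy data for illustration).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.8, 5.2),
          (9.0, 0.0)]

km_labels, km_centroids = kmeans(points, centroids=[points[0], points[3]])
db_labels = dbscan(points, eps=0.5, min_pts=3)
print("k-means:", km_labels)
print("DBSCAN: ", db_labels)
```

In this sketch k-means places every point, including the isolated one, in one of its two clusters, while DBSCAN labels the isolated point as noise (-1). That difference is one reason density-based methods are often considered for anomaly identification.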

Code Examples

All of the code examples are written in Python, unless otherwise noted.

Containers

These code examples are Jupyter notebooks running in containers that come with all the data, libraries, and code you’ll need to run them.

Quickstart: Download Docker, then run the commands below in a terminal.

k-Means Clustering

An implementation of k-Means Clustering to do an RFM marketing analysis.

#pull container, only needs to be run once
docker pull ghcr.io/thedatamine/starter-guides:k-means-clustering

#run container
docker run -p 8888:8888 -it ghcr.io/thedatamine/starter-guides:k-means-clustering
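
For readers unfamiliar with RFM, it summarizes each customer by recency (how long since the last purchase), frequency (number of purchases), and monetary value (total spent). The sketch below shows one way those three features might be computed before clustering. It is not the notebook's code: the transactions, the snapshot date, and the function name `rfm_table` are all hypothetical.

```python
from datetime import date

# Hypothetical toy transactions: (customer_id, purchase_date, amount)
transactions = [
    ("a", date(2024, 1, 5), 20.0),
    ("a", date(2024, 3, 1), 35.0),
    ("b", date(2023, 11, 20), 500.0),
    ("c", date(2024, 2, 25), 15.0),
    ("c", date(2024, 3, 10), 10.0),
    ("c", date(2024, 3, 28), 12.0),
]
snapshot = date(2024, 4, 1)  # assumed reference date for measuring recency

def rfm_table(transactions, snapshot):
    """Return {customer: (recency_in_days, frequency, monetary_total)}."""
    table = {}
    for customer, day, amount in transactions:
        recency, frequency, monetary = table.get(customer, (None, 0, 0.0))
        days_ago = (snapshot - day).days
        # Recency is the gap to the most recent purchase, so keep the smallest gap.
        recency = days_ago if recency is None else min(recency, days_ago)
        table[customer] = (recency, frequency + 1, monetary + amount)
    return table

rfm = rfm_table(transactions, snapshot)
for customer, (r, f, m) in sorted(rfm.items()):
    print(customer, r, f, m)
```

The resulting three-column table is what a k-means step would cluster. Because the three features are on very different scales, they are usually standardized first so that no single feature dominates the distance calculation.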

Hierarchical Clustering

An implementation of (agglomerative) hierarchical clustering using national socioeconomic data.

#pull container, only needs to be run once
docker pull ghcr.io/thedatamine/starter-guides:hierarchical-clustering

#run container
docker run -p 8888:8888 -it ghcr.io/thedatamine/starter-guides:hierarchical-clustering

Need help implementing any of this code? Feel free to reach out to datamine-help@purdue.edu and we can help!

Resources

All resources are chosen by Data Mine staff to be of decent quality, and most if not all content is free.

