Using k-means Clustering in TensorFlow

Sergey Kovalev


The goal of this TensorFlow tutorial is to use the k-means algorithm for grouping data into clusters with similar characteristics. When working with k-means, the data in a training set does not need labels. As an unsupervised learning method, the algorithm builds clusters based on the data itself.
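Before wiring the algorithm up in TensorFlow, here is the whole loop condensed into a plain-NumPy sketch, just for orientation (an illustrative reference implementation; the function name and defaults are made up, not part of the tutorial's code):

```python
import numpy as np

def kmeans_numpy(points, k, iterations=100, seed=0):
    """Plain-NumPy k-means: alternate assignment and update steps."""
    rng = np.random.RandomState(seed)
    # Pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        distances = ((points[None, :, :] - centroids[:, None, :]) ** 2).sum(axis=2)
        assignments = distances.argmin(axis=0)
        # Update step: each centroid moves to the mean of its points.
        for c in range(k):
            members = points[assignments == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return centroids, assignments
```

The TensorFlow version below expresses exactly these two alternating steps as operations in a computation graph.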

 

Running a TensorFlow graph

First, let’s generate random data points with a uniform distribution and assign them to a 2D tensor constant. Then, randomly choose initial centroids from the set of data points.

points = tf.constant(np.random.uniform(0, 10, (points_n, 2)))
centroids = tf.Variable(tf.slice(tf.random_shuffle(points), [0, 0], [clusters_n, -1]))

For the next step, we want element-wise subtraction of points and centroids, which are 2D tensors. Because the tensors have different shapes, let's expand points and centroids into three dimensions: points becomes a (1, points_n, 2) tensor and centroids a (clusters_n, 1, 2) tensor. This lets us rely on the broadcasting behavior of the subtraction operation.

points_expanded = tf.expand_dims(points, 0)
centroids_expanded = tf.expand_dims(centroids, 1)
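To see what the expanded shapes buy us, here is the same trick in plain NumPy, which follows the same broadcasting rules as TensorFlow (the sizes below are just the tutorial's defaults; this snippet is for illustration only):

```python
import numpy as np

points_n, clusters_n = 200, 3
points = np.random.uniform(0, 10, (points_n, 2))
centroids = points[:clusters_n].copy()

# (points_n, 2) -> (1, points_n, 2)
points_expanded = points[np.newaxis, :, :]
# (clusters_n, 2) -> (clusters_n, 1, 2)
centroids_expanded = centroids[:, np.newaxis, :]

# Broadcasting stretches each size-1 axis, yielding the coordinate
# difference for every (centroid, point) pair in one operation.
diff = points_expanded - centroids_expanded
print(diff.shape)  # (3, 200, 2)
```

Summing the squares over the last axis then gives a (clusters_n, points_n) matrix of squared distances, which is exactly what the next step computes.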

Then, calculate the distances between points and centroids and determine the cluster assignments.

distances = tf.reduce_sum(tf.square(tf.subtract(points_expanded, centroids_expanded)), 2)
assignments = tf.argmin(distances, 0)

Next, for each cluster, we can select the points assigned to it (by comparing the cluster index with the assignments vector) and calculate their mean. These mean values are the refined centroids, so let's update the centroids variable with them.

means = []
for c in range(clusters_n):
    means.append(tf.reduce_mean(
      tf.gather(points,
                tf.reshape(
                  tf.where(
                    tf.equal(assignments, c)
                  ), [1, -1])
               ), reduction_indices=[1]))

new_centroids = tf.concat(means, 0)
update_centroids = tf.assign(centroids, new_centroids)
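Per cluster, the gather/where construction computes the same thing as this plain-NumPy snippet (an illustrative sketch with made-up stand-in labels, not the article's TensorFlow code):

```python
import numpy as np

np.random.seed(0)
points = np.random.uniform(0, 10, (200, 2))
assignments = np.random.randint(0, 3, 200)  # stand-in cluster labels

# For each cluster index, average the points currently assigned to it,
# then stack the per-cluster means into a (clusters_n, 2) array.
new_centroids = np.stack(
    [points[assignments == c].mean(axis=0) for c in range(3)]
)
print(new_centroids.shape)  # (3, 2)
```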

It's time to run the graph. On each iteration, we update the centroids and fetch their values, along with the cluster assignments.

with tf.Session() as sess:
  sess.run(init)
  for step in range(iteration_n):
    [_, centroid_values, points_values, assignment_values] = sess.run([update_centroids, centroids, points, assignments])

Lastly, we display the coordinates of the final centroids and a multi-colored scatter plot showing how the data points have been clustered.

print("centroids:\n", centroid_values)

plt.scatter(points_values[:, 0], points_values[:, 1], c=assignment_values, s=50, alpha=0.5)
plt.plot(centroid_values[:, 0], centroid_values[:, 1], 'kx', markersize=15)
plt.show()


 

Source code

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

points_n = 200
clusters_n = 3
iteration_n = 100

points = tf.constant(np.random.uniform(0, 10, (points_n, 2)))
centroids = tf.Variable(tf.slice(tf.random_shuffle(points), [0, 0], [clusters_n, -1]))

points_expanded = tf.expand_dims(points, 0)
centroids_expanded = tf.expand_dims(centroids, 1)

distances = tf.reduce_sum(tf.square(tf.subtract(points_expanded, centroids_expanded)), 2)
assignments = tf.argmin(distances, 0)

means = []
for c in range(clusters_n):
    means.append(tf.reduce_mean(
      tf.gather(points, 
                tf.reshape(
                  tf.where(
                    tf.equal(assignments, c)
                  ),[1,-1])
               ),reduction_indices=[1]))

new_centroids = tf.concat(means, 0)

update_centroids = tf.assign(centroids, new_centroids)
init = tf.global_variables_initializer()

with tf.Session() as sess:
  sess.run(init)
  for step in range(iteration_n):
    [_, centroid_values, points_values, assignment_values] = sess.run([update_centroids, centroids, points, assignments])
    
  print("centroids:\n", centroid_values)

plt.scatter(points_values[:, 0], points_values[:, 1], c=assignment_values, s=50, alpha=0.5)
plt.plot(centroid_values[:, 0], centroid_values[:, 1], 'kx', markersize=15)
plt.show()

The data in the training set is now grouped into clusters of similar points, the result of implementing the k-means algorithm in TensorFlow.


Sergey Kovalev is a senior software engineer with extensive experience in high-load application development, big data and NoSQL solutions, cloud computing, data warehousing, and machine learning. He has strong expertise in back-end engineering, applying the best approaches for development, architecture design, and scaling. He has a solid background in software development practices, such as the Agile methodology, prototyping, patterns, refactoring, and code review. Now, Sergey's main interest lies in big data distributed computing and machine learning.


12 Comments
  • Arnaud Sors

    Hi! Thanks for this blog post. I implemented the ‘centroid update’ part in the following way, which might be a bit simpler:
    list_by_centroids = tf.dynamic_partition(points, assignments, num_partitions=clusters_n)
    means = [tf.reduce_mean(datapoints, 0) for datapoints in list_by_centroids]
    new_centroids = tf.pack(means)
    For more speed, it would be nice to get rid of the list comprehension (or for loop). Due to the varying number of datapoints per cluster, though, I cannot think of a nice way to do this…

    • Sergey Sintsov

      Hi! Thanks a lot for your comment! Did you try to run the code with the ‘centroid update’ on CPU or GPU?
      We experimented on CPU, and the running time was actually a bit longer than for the original code…

      • Arnaud Sors

        Hi Sergey! Okay, well this was quite some time ago so I do not remember but I would say it was on GPU…

        • Sergey Sintsov

          Arnaud, how much do you expect the ‘centroid update’ part to improve the speed? Any computational complexity assumptions? Maybe I used too small a dataset to see a speedup.
          Thanks!

  • Spandan Samiran

    Can you suggest how to do this procedure on images?

    • Sergey Sintsov

      Hello! You could use the k-means algorithm for image clustering by
      a. Representing each image as an N-dimensional vector, where the i-th component corresponds to the i-th pixel
      b. Introducing some distance function (L1, L2, or any other norm) between pairs of images
      However, the quality of such clustering will be low. There are a number of factors one needs to take into account while assessing the “similarity” of images (background clutter, deformation, illumination conditions, intra-class variation, occlusion, scale variation, viewpoint variation, etc.).
      Simply put, it all depends on how we define the clusters and what features we choose to group pictures into clusters. Probably, in your case, it's better to use other algorithms, for instance, those based on neural networks.

  • Hao Yu (Cody)

    Thanks for this helpful tutorial. Two minor changes for the example code to fit the latest TensorFlow:
    1. tf.sub has been changed to tf.subtract.
    2. The argument order of tf.concat has changed to (values, axis).

    I successfully ran through the example after modifying the above two problems.

    • Prashant Upadhyay

      What exact code did you replace for tf.concat?

      • Tran Duc Nguyen

        Change it to concat(means, 0)

    • Sergey Sintsov

      Hi Cody! Thanks a lot for the valuable input. Indeed, the code has to be slightly modified to fit Python3 and the latest TensorFlow.

  • Dane Lee

    On May 5, 2018, I changed a few lines to fit the latest version. I ran it myself, and it works well.

    Code Link : https://github.com/danelee2601/Test01/blob/master/a21.py

    • Sergey Sintsov

      Hey Dane! Many thanks for posting the updated version! Indeed, the code in the article has been written for the older versions of TensorFlow and Python.
      For getting the code running on Python v3.6 and TensorFlow v1.8.0, one needs to make the following changes:

      1) line 15
      distances = tf.reduce_sum(tf.square(tf.sub(points_expanded, centroids_expanded)), 2)
      Replace by
      distances = tf.reduce_sum(tf.square(tf.subtract(points_expanded, centroids_expanded)), 2)

      2) line 19
      for c in xrange(clusters_n):
      Replace by
      for c in range(clusters_n):

      3) line 28
      new_centroids = tf.concat(0, means)
      Replace by
      new_centroids = tf.concat(means, 0)

      4) line 31
      init = tf.initialize_all_variables()
      Replace by
      init = tf.global_variables_initializer()

      5) line 35
      for step in xrange(iteration_n):
      Replace by
      for step in range(iteration_n):

      6) line 38
      print "centroids" + "\n", centroid_values
      Replace by
      print("centroids", centroid_values)
