## Using k-means Clustering in TensorFlow

Posted by Sergey Kovalev, Senior Software Engineer, in Machine Learning

Tags: Machine Learning, TensorFlow

19 Apr 2016

The goal of this TensorFlow tutorial is to use the *k*-means algorithm to group data into clusters with similar characteristics. With *k*-means, the training data does not need labels: as an unsupervised learning method, the algorithm builds clusters from the data itself.
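Before translating the algorithm to TensorFlow, its two alternating steps — assign each point to the nearest centroid, then move each centroid to the mean of its points — can be sketched in plain NumPy (function and variable names here are illustrative, not part of the tutorial's code):

```python
import numpy as np

def kmeans_step(points, centroids):
    """One k-means iteration: assign each point to its nearest
    centroid, then move each centroid to the mean of its points."""
    # Pairwise squared distances, shape (clusters_n, points_n)
    diffs = points[np.newaxis, :, :] - centroids[:, np.newaxis, :]
    distances = np.sum(diffs ** 2, axis=2)
    # Nearest centroid index for every point
    assignments = np.argmin(distances, axis=0)
    # New centroid = mean of the points assigned to it
    new_centroids = np.array([points[assignments == c].mean(axis=0)
                              for c in range(len(centroids))])
    return new_centroids, assignments
```

The TensorFlow version below builds exactly this computation as a graph, one piece at a time.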

### Running a TensorFlow graph

First, let’s generate random data points with a uniform distribution and assign them to a 2D tensor constant. Then, randomly choose initial centroids from the set of data points.

```python
points = tf.constant(np.random.uniform(0, 10, (points_n, 2)))
centroids = tf.Variable(tf.slice(tf.random_shuffle(points),
                                 [0, 0], [clusters_n, -1]))
```

For the next step, we want to do element-wise subtraction of *points* and *centroids*, which are 2D tensors. Because the tensors have different shapes, let’s expand *points* and *centroids* into three dimensions, which allows us to use the broadcasting behavior of the subtraction operation.

```python
points_expanded = tf.expand_dims(points, 0)
centroids_expanded = tf.expand_dims(centroids, 1)
```
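The effect of the expansion can be checked in plain NumPy, whose broadcasting rules TensorFlow follows (shapes shown assume points_n = 200 and clusters_n = 3; taking the first three points as centroids is just a stand-in for the random shuffle above):

```python
import numpy as np

points_n, clusters_n = 200, 3
points = np.random.uniform(0, 10, (points_n, 2))
centroids = points[:clusters_n]

# Mirror tf.expand_dims: shapes (1, 200, 2) and (3, 1, 2)
points_expanded = points[np.newaxis, :, :]
centroids_expanded = centroids[:, np.newaxis, :]

# Broadcasting stretches both operands to (3, 200, 2) before subtracting,
# giving every (centroid, point) difference in a single operation
diff = points_expanded - centroids_expanded
print(diff.shape)  # (3, 200, 2)
```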

Then, calculate the distances between points and centroids and determine the cluster assignments.

```python
distances = tf.reduce_sum(tf.square(tf.sub(points_expanded, centroids_expanded)), 2)
assignments = tf.argmin(distances, 0)
```
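Note the axis choice: *distances* has shape (clusters_n, points_n), so the argmin must run over axis 0 to yield one cluster index per point (axis 1 would instead give the nearest point per centroid). A tiny NumPy check with hand-picked distances (illustrative values, three centroids by three points):

```python
import numpy as np

distances = np.array([[0.0, 4.0, 9.0],   # distances from centroid 0
                      [1.0, 0.5, 8.0],   # distances from centroid 1
                      [5.0, 6.0, 0.2]])  # distances from centroid 2

# One cluster index per point (per column)
assignments = np.argmin(distances, axis=0)
print(assignments)  # [0 1 2]
```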

Next, for each cluster we can compare its index against the assignments vector, gather the points assigned to that cluster, and calculate their mean. These mean values are the refined centroids, so let’s update the *centroids* variable with the new values.

```python
means = []
for c in xrange(clusters_n):
    means.append(tf.reduce_mean(
        tf.gather(points,
                  tf.reshape(
                      tf.where(tf.equal(assignments, c)),
                      [1, -1])),
        reduction_indices=[1]))

new_centroids = tf.concat(0, means)
update_centroids = tf.assign(centroids, new_centroids)
```
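The where/gather/reduce_mean chain is how per-cluster averaging is expressed in the TensorFlow graph; conceptually it is a boolean mask per cluster. A NumPy sketch of the same update (the function name is illustrative):

```python
import numpy as np

def recompute_centroids(points, assignments, clusters_n):
    """Move each centroid to the mean of its assigned points,
    mirroring the tf.where/tf.gather/tf.reduce_mean chain."""
    return np.stack([points[assignments == c].mean(axis=0)
                     for c in range(clusters_n)])
```

For example, with points [0, 0], [0, 2] assigned to cluster 0 and [10, 10] assigned to cluster 1, the refined centroids are [0, 1] and [10, 10].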

It’s time to run the graph. On each iteration, we update the centroids and fetch their values along with the cluster assignments.

```python
init = tf.initialize_all_variables()

with tf.Session() as sess:
    sess.run(init)
    for step in xrange(iteration_n):
        [_, centroid_values, points_values, assignment_values] = sess.run(
            [update_centroids, centroids, points, assignments])
```
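The tutorial runs a fixed number of iterations (iteration_n); in practice, k-means is often run until the centroids stop moving. A NumPy sketch of that stopping rule (function name and tolerance are illustrative, not part of the tutorial's code):

```python
import numpy as np

def kmeans_numpy(points, centroids, max_iters=100, tol=1e-6):
    """Iterate assign/update until centroids move less than tol."""
    for _ in range(max_iters):
        # Squared distances, shape (clusters_n, points_n)
        d = np.sum((points[None, :, :] - centroids[:, None, :]) ** 2, axis=2)
        assignments = np.argmin(d, axis=0)
        new_centroids = np.stack([points[assignments == c].mean(axis=0)
                                  for c in range(len(centroids))])
        # Stop early once the update no longer changes the centroids
        if np.max(np.abs(new_centroids - centroids)) < tol:
            break
        centroids = new_centroids
    return centroids, assignments
```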

Lastly, we display the coordinates of the final centroids and a multi-colored scatter plot showing how the data points have been clustered.

```python
print "centroids" + "\n", centroid_values

plt.scatter(points_values[:, 0], points_values[:, 1],
            c=assignment_values, s=50, alpha=0.5)
plt.plot(centroid_values[:, 0], centroid_values[:, 1],
         'kx', markersize=15)
plt.show()
```

### Source code

The complete listing follows. (It targets the TensorFlow 0.x API and Python 2, which were current when this post was written; `tf.sub`, `tf.concat(0, ...)`, and `tf.initialize_all_variables` were later renamed.)

```python
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

points_n = 200
clusters_n = 3
iteration_n = 100

points = tf.constant(np.random.uniform(0, 10, (points_n, 2)))
centroids = tf.Variable(tf.slice(tf.random_shuffle(points),
                                 [0, 0], [clusters_n, -1]))

points_expanded = tf.expand_dims(points, 0)
centroids_expanded = tf.expand_dims(centroids, 1)

distances = tf.reduce_sum(tf.square(tf.sub(points_expanded, centroids_expanded)), 2)
assignments = tf.argmin(distances, 0)

means = []
for c in xrange(clusters_n):
    means.append(tf.reduce_mean(
        tf.gather(points,
                  tf.reshape(
                      tf.where(tf.equal(assignments, c)),
                      [1, -1])),
        reduction_indices=[1]))

new_centroids = tf.concat(0, means)
update_centroids = tf.assign(centroids, new_centroids)

init = tf.initialize_all_variables()

with tf.Session() as sess:
    sess.run(init)
    for step in xrange(iteration_n):
        [_, centroid_values, points_values, assignment_values] = sess.run(
            [update_centroids, centroids, points, assignments])

print "centroids" + "\n", centroid_values

plt.scatter(points_values[:, 0], points_values[:, 1],
            c=assignment_values, s=50, alpha=0.5)
plt.plot(centroid_values[:, 0], centroid_values[:, 1],
         'kx', markersize=15)
plt.show()
```

As a result, the *k*-means algorithm implemented in TensorFlow groups the training data into clusters, which the scatter plot visualizes.

### Related:

- *Basic Concepts and Manipulations with TensorFlow*
- *Visualizing TensorFlow Graphs with TensorBoard*
- *Using Linear Regression in TensorFlow*
- *Using Logistic and Softmax Regression in TensorFlow*
- *Performance Benchmark: Caffe, Deeplearning4j, TensorFlow, Theano, and Torch*

**Sergey Kovalev** is a senior software engineer with extensive experience in high-load application development, big data and NoSQL solutions, cloud computing, data warehousing, and machine learning. He has strong expertise in back-end engineering, applying the best approaches to development, architecture design, and scaling. He has a solid background in software development practices, such as the Agile methodology, prototyping, patterns, refactoring, and code review. Sergey’s main interests now lie in big data distributed computing and machine learning.
