August 28, 2015

Posted by: [personal profile] t_fischer

I spent the better part of today's evening preparing and testing .ebuild files (available in my GitLab repository) for the so-called ‘free branch’ versions of targetcli, rtslib, and configshell, called targetcli-fb, rtslib-fb, and configshell-fb. The ‘free branch’ versions are the recommended way of setting up an iSCSI target on Linux, according to the Arch Linux wiki.



August 26, 2015

This is a tutorial on how to use scipy's hierarchical clustering.

One of the benefits of hierarchical clustering is that you don't need to know the number of clusters k in your data in advance. Sadly, there doesn't seem to be much documentation on how to actually use scipy's hierarchical clustering to make an informed decision and then retrieve the clusters.

In the following I'll explain:

  • how to perform the hierarchical clustering and what the resulting linkage matrix contains
  • how to plot (and truncate) a dendrogram
  • how to select a distance cut-off, aka determine the number of clusters k
  • how to retrieve and visualize the resulting clusters

Naming conventions:

Before we start, since I know it's easy to get lost, some naming conventions:

  • X samples (n x m array), aka data points or "singleton clusters"
  • n number of samples
  • m number of features
  • Z cluster linkage array (contains the hierarchical clustering information)
  • k number of clusters

So, let's go.

Imports and Setup

In [1]:
# needed imports
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
import numpy as np
In [2]:
# some settings for this notebook to actually show the graphs inline; you probably won't need this
%matplotlib inline
np.set_printoptions(precision=5, suppress=True)  # suppress scientific float notation

Generating Sample Data

You obviously won't need this step if you're running the clustering on your own data.

The only thing you need to make sure of is that your data ends up as a matrix X with n samples and m features, so that X.shape == (n, m).
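
For example (a minimal sketch with a hypothetical measurements list; you won't need this if your data already is a numpy array):

import numpy as np

# hypothetical raw data: 4 samples with 3 features each
measurements = [[1.0, 2.0, 0.5],
                [1.1, 1.9, 0.4],
                [8.0, 0.3, 2.2],
                [7.9, 0.5, 2.1]]

X_own = np.asarray(measurements)
print X_own.shape  # (4, 3), i.e. n = 4 samples with m = 3 features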

In [3]:
# generate two clusters: a with 100 points, b with 50:
np.random.seed(4711)  # for repeatability of this tutorial
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[100,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[50,])
X = np.concatenate((a, b),)
print X.shape  # 150 samples with 2 dimensions
plt.scatter(X[:,0], X[:,1])
plt.show()
(150, 2)

Perform the Hierarchical Clustering

Now that we have some very simple sample data, let's do the actual clustering on it:

In [4]:
# generate the linkage matrix
Z = linkage(X, 'ward')

Done. That was pretty simple, wasn't it?

Well, sure it was, this is Python ;), but what does the weird 'ward' mean there, and how does this actually work?

As the scipy linkage docs tell us, 'ward' is one of the methods that can be used to calculate the distance between newly formed clusters. 'ward' causes linkage() to use the Ward variance minimization algorithm.

I think it's a good default choice, but it never hurts to play around with other common linkage methods like 'single', 'complete', 'average', ... and with different distance metrics like 'euclidean' (the default), 'cityblock' aka Manhattan, 'hamming', 'cosine', ... especially if you have the feeling that your data shouldn't just be clustered to minimize the overall intra-cluster variance in Euclidean space. You should have such a feeling with long (binary) feature vectors, for example word vectors in text clustering.
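
To give an idea of what that looks like (just a sketch, not used further in this tutorial; the method and metric names come straight from the scipy docs, and note that 'ward' itself only works with the euclidean metric):

# alternative linkage choices, passed on to pdist() for the distance computation
Z_complete = linkage(X, method='complete', metric='cityblock')
Z_average = linkage(X, method='average', metric='cosine')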

As you can see there's a lot of choice here, and while Python and scipy make the clustering itself very easy, it's you who has to understand and make these choices. If I find the time, I might give some more practical advice about this, but for now I'd urge you to at least read up on the linked methods and metrics to make a somewhat informed choice. Another thing you can and definitely should do is check the Cophenetic Correlation Coefficient of your clustering with the help of the cophenet() function. This (very briefly) correlates the actual pairwise distances of all your samples with those implied by the hierarchical clustering. The closer the value is to 1, the better the clustering preserves the original distances, and in our case it's pretty close:

In [5]:
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

c, coph_dists = cophenet(Z, pdist(X))
c
Out[5]:
0.98001483875742679

No matter what method and metric you pick, the linkage() function will use them to calculate the distances between clusters (starting with your n individual samples, aka data points, as singleton clusters) and in each iteration will merge the two clusters with the smallest distance according to the selected method and metric. It returns an array of length n - 1 describing the n - 1 merges needed to reduce the n singleton clusters to a single one. Z[i] tells us which clusters were merged in the i-th iteration; let's take a look at the first two points that were merged:

In [6]:
Z[0]
Out[6]:
array([ 52.     ,  53.     ,   0.04151,   2.     ])

We can see that each row of the resulting array has the format [idx1, idx2, dist, sample_count].

In its first iteration the linkage algorithm decided to merge the two clusters (original samples here) with indices 52 and 53, as they only had a distance of 0.04151. This created a cluster with a total of 2 samples.

In [7]:
Z[1]
Out[7]:
array([ 14.     ,  79.     ,   0.05914,   2.     ])

In the second iteration the algorithm decided to merge the clusters (original samples here as well) with indices 14 and 79, which had a distance of 0.05914. This again formed another cluster with a total of 2 samples.

The indices of the clusters until now correspond to our samples. Remember that we had a total of 150 samples, so indices 0 to 149. Let's have a look at the first 20 iterations:

In [8]:
Z[:20]
Out[8]:
array([[  52.     ,   53.     ,    0.04151,    2.     ],
       [  14.     ,   79.     ,    0.05914,    2.     ],
       [  33.     ,   68.     ,    0.07107,    2.     ],
       [  17.     ,   73.     ,    0.07137,    2.     ],
       [   1.     ,    8.     ,    0.07543,    2.     ],
       [  85.     ,   95.     ,    0.10928,    2.     ],
       [ 108.     ,  131.     ,    0.11007,    2.     ],
       [   9.     ,   66.     ,    0.11302,    2.     ],
       [  15.     ,   69.     ,    0.11429,    2.     ],
       [  63.     ,   98.     ,    0.1212 ,    2.     ],
       [ 107.     ,  115.     ,    0.12167,    2.     ],
       [  65.     ,   74.     ,    0.1249 ,    2.     ],
       [  58.     ,   61.     ,    0.14028,    2.     ],
       [  62.     ,  152.     ,    0.1726 ,    3.     ],
       [  41.     ,  158.     ,    0.1779 ,    3.     ],
       [  10.     ,   83.     ,    0.18635,    2.     ],
       [ 114.     ,  139.     ,    0.20419,    2.     ],
       [  39.     ,   88.     ,    0.20628,    2.     ],
       [  70.     ,   96.     ,    0.21931,    2.     ],
       [  46.     ,   50.     ,    0.22049,    2.     ]])

We can observe that until iteration 13 the algorithm only directly merged original samples. We can also observe the monotonic increase of the distance.
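
A quick way to verify that monotonicity on the linkage matrix itself (it holds for 'ward' and most other methods, though 'centroid' and 'median' can produce so-called inversions):

print (np.diff(Z[:, 2]) >= 0).all()  # True: merge distances never decrease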

In iteration 13 the algorithm decided to merge cluster indices 62 and 152. If you paid attention, the 152 should astonish you, as we only have original sample indices 0 to 149 for our 150 samples. All indices idx >= len(X) refer to the cluster formed in Z[idx - len(X)].

This means that while idx 149 corresponds to X[149], idx 150 corresponds to the cluster formed in Z[0], idx 151 to Z[1], idx 152 to Z[2], ...

Hence, merge iteration 13 merged sample 62 with the cluster of samples 33 and 68, which had been formed in iteration 2 (152 - len(X) = 2).
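
If you ever need to resolve such a cluster index back to the original samples it contains, a small recursive helper will do (just a sketch, not part of scipy's API; scipy.cluster.hierarchy.to_tree() offers something similar via its ClusterNode objects):

def cluster_members(Z, idx, n):
    # indices < n are original samples; indices >= n refer to row Z[idx - n]
    if idx < n:
        return [idx]
    left, right = int(Z[idx - n, 0]), int(Z[idx - n, 1])
    return cluster_members(Z, left, n) + cluster_members(Z, right, n)

print cluster_members(Z, 152, len(X))  # [33, 68]
print cluster_members(Z, 163, len(X))  # [62, 33, 68] -- the cluster formed in iteration 13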

Let's check out the points' coordinates to see if this makes sense:

In [9]:
X[[33, 68, 62]]
Out[9]:
array([[ 9.83913, -0.4873 ],
       [ 9.89349, -0.44152],
       [ 9.97793, -0.56383]])

Seems pretty close, but let's plot the points again and highlight them:

In [10]:
idxs = [33, 68, 62]
plt.figure(figsize=(10, 8))
plt.scatter(X[:,0], X[:,1])  # plot all points
plt.scatter(X[idxs,0], X[idxs,1], c='r')  # plot interesting points in red again
plt.show()

We can see that the 3 red dots are pretty close to each other, which is a good thing.

The same happened in iteration 14, where the algorithm merged index 41 with the cluster of samples 15 and 69:

In [11]:
idxs = [33, 68, 62]
plt.figure(figsize=(10, 8))
plt.scatter(X[:,0], X[:,1])
plt.scatter(X[idxs,0], X[idxs,1], c='r')
idxs = [15, 69, 41]
plt.scatter(X[idxs,0], X[idxs,1], c='y')
plt.show()

This shows that the 3 yellow dots are also quite close to each other.

And so on...

We'll come back to visualizing the clusters later, but first let's have a look at what's called a dendrogram of this hierarchical clustering:

Plotting a Dendrogram

A dendrogram is a visualization in form of a tree showing the order and distances of merges during the hierarchical clustering.

In [12]:
# calculate full dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
    Z,
    leaf_rotation=90.,  # rotates the x axis labels
    leaf_font_size=8.,  # font size for the x axis labels
)
plt.show()


If this is the first time you see a dendrogram, it's probably quite confusing, so let's take this apart...

  • On the x axis you see labels. If you don't specify anything else they are the indices of your samples in X.
  • On the y axis you see the distances (of the 'ward' method in our case).

Starting from each label at the bottom, you can see a vertical line up to a horizontal line. The height of that horizontal line tells you about the distance at which this label was merged into another label or cluster. You can find that other cluster by following the other vertical line down again. If you don't encounter another horizontal line, it was just merged with the other label you reach, otherwise it was merged into another cluster that was formed earlier.

Summarizing:

  • horizontal lines are cluster merges
  • vertical lines tell you which clusters/labels were part of merge forming that new cluster
  • heights of the horizontal lines tell you about the distance that needed to be "bridged" to form the new cluster

You can also see that above a distance of about 25 there's a huge jump to the final merge at a distance of approx. 180. Let's have a look at the distances of the last 4 merges:

In [13]:
Z[-4:,2]
Out[13]:
array([  15.11533,   17.11527,   23.12199,  180.27043])

Such distance jumps / gaps in the dendrogram are pretty interesting for us. They indicate that something was merged here that maybe shouldn't have been merged. In other words: the things merged at this point probably don't belong to the same cluster, suggesting that there are really just 2 clusters here.

Looking at the indices in the above dendrogram also shows us that the green cluster only has indices >= 100, while the red one only has indices < 100. This is a good thing, as it shows that the algorithm re-discovered the two classes in our toy example.

In case you're wondering where the colors come from, have a look at the color_threshold argument of dendrogram(). Since we didn't specify it, it automagically picked a distance cut-off value of 70 % of the final merge distance and then colored all clusters below that threshold in individual colors.
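
If you'd rather pick that threshold yourself, you can pass it explicitly; a minimal sketch:

plt.figure(figsize=(25, 10))
dendrogram(
    Z,
    color_threshold=50,  # color all merges below a distance of 50 individually
    leaf_rotation=90.,
    leaf_font_size=8.,
)
plt.show()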

Dendrogram Truncation

As you might have noticed, the above dendrogram is already pretty big for 150 samples, and you'll probably have way more in real scenarios, so let me spend a few seconds highlighting some other features of the dendrogram() function:

In [14]:
plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
    Z,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=12,  # show only the last p merged clusters
    show_leaf_counts=False,  # otherwise numbers in brackets are counts
    leaf_rotation=90.,
    leaf_font_size=12.,
    show_contracted=True,  # to get a distribution impression in truncated branches
)
plt.show()

The above shows a truncated dendrogram, which only shows the last p=12 out of our 149 merges.

The first thing you should notice is that most labels are missing. This is because, except for X[40], all other samples were already merged into clusters before the last 12 merges.

The parameter show_contracted allows us to draw black dots at the heights of those previous cluster merges, so we can still spot gaps even if we don't want to clutter the whole visualization. In our example we can see that the dots are all at pretty small distances when compared to the huge last merge at a distance of 180, telling us that we probably didn't miss much there.

As it's kind of hard to keep track of the cluster sizes just by the dots, dendrogram() will by default also print the cluster sizes in brackets () if a cluster was truncated:

In [15]:
plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('sample index or (cluster size)')
plt.ylabel('distance')
dendrogram(
    Z,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=12,  # show only the last p merged clusters
    leaf_rotation=90.,
    leaf_font_size=12.,
    show_contracted=True,  # to get a distribution impression in truncated branches
)
plt.show()

We can now see that the rightmost cluster already consisted of 33 samples before the last 12 merges.

Eye Candy

Even though this already makes for quite a nice visualization, we can pimp it even more by also annotating the distances inside the dendrogram, using some of the useful return values of dendrogram():

In [16]:
def fancy_dendrogram(*args, **kwargs):
    # two extra kwargs on top of dendrogram()'s own:
    #   max_d: draw a horizontal cut-off line at this distance
    #   annotate_above: only annotate merges above this distance
    max_d = kwargs.pop('max_d', None)
    if max_d and 'color_threshold' not in kwargs:
        kwargs['color_threshold'] = max_d
    annotate_above = kwargs.pop('annotate_above', 0)

    ddata = dendrogram(*args, **kwargs)

    if not kwargs.get('no_plot', False):
        plt.title('Hierarchical Clustering Dendrogram (truncated)')
        plt.xlabel('sample index or (cluster size)')
        plt.ylabel('distance')
        for i, d, c in zip(ddata['icoord'], ddata['dcoord'], ddata['color_list']):
            x = 0.5 * sum(i[1:3])  # x position of the merge (middle of the bracket)
            y = d[1]  # height (merge distance)
            if y > annotate_above:
                plt.plot(x, y, 'o', c=c)  # mark the merge point
                plt.annotate("%.3g" % y, (x, y), xytext=(0, -5),
                             textcoords='offset points',
                             va='top', ha='center')  # print its distance next to it
        if max_d:
            plt.axhline(y=max_d, c='k')  # the cut-off line
    return ddata
In [17]:
fancy_dendrogram(
    Z,
    truncate_mode='lastp',
    p=12,
    leaf_rotation=90.,
    leaf_font_size=12.,
    show_contracted=True,
    annotate_above=10,  # useful in small plots so annotations don't overlap
)
plt.show()

Selecting a Distance Cut-Off aka Determining the Number of Clusters

As explained above already, a huge jump in distance is typically what we're interested in if we want to argue for a certain number of clusters. If you have the chance to do this manually, I'd always opt for that, as it allows you to gain some insight into your data and to perform some sanity checks on the edge cases. In our case I'd probably just say that our cut-off is 50, as the jump is pretty obvious:

In [18]:
# set cut-off to 50
max_d = 50  # max_d as in max_distance

Let's visualize this in the dendrogram as a cut-off line:

In [19]:
fancy_dendrogram(
    Z,
    truncate_mode='lastp',
    p=12,
    leaf_rotation=90.,
    leaf_font_size=12.,
    show_contracted=True,
    annotate_above=10,
    max_d=max_d,  # plot a horizontal cut-off line
)
plt.show()

As we can see, we ("surprisingly") have two clusters at this cut-off.

In general, for a chosen cut-off value max_d you can simply count the number of intersections of a horizontal line at that height with the vertical lines of the dendrogram to get the number of formed clusters. Say we choose a cut-off of max_d = 16; we'd get 4 final clusters:

In [20]:
fancy_dendrogram(
    Z,
    truncate_mode='lastp',
    p=12,
    leaf_rotation=90.,
    leaf_font_size=12.,
    show_contracted=True,
    annotate_above=10,
    max_d=16,
)
plt.show()
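
If you'd rather compute that cluster count than read it off the plot, you can simply count the merges above the cut-off (a sketch that relies on the merge distances being monotonic, which is the case for 'ward'):

def n_clusters_for_cutoff(Z, max_d):
    # every merge happening above the cut-off leaves one more cluster in the result
    return int((Z[:, 2] > max_d).sum()) + 1

print n_clusters_for_cutoff(Z, 50)  # 2
print n_clusters_for_cutoff(Z, 16)  # 4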

Automated Cut-Off Selection (or why you shouldn't rely on this)

Now while this manual selection of a cut-off value offers a lot of benefits when it comes to checking for a meaningful clustering and cut-off, there are cases in which you want to automate this.

The problem again is that there is no golden method to pick the number of clusters for all cases (which is why I think the investigative & backtesting manual approach is preferable). Wikipedia lists a couple of common methods. Reading it, you should realize how different the approaches are and how vague their descriptions are.

I honestly think it's a really bad idea to just use any of those methods, unless you know the data you're working on really really well.

Inconsistency Method

For example, let's have a look at the "inconsistency" method, which is the default criterion of the fcluster() function in scipy.

The question driving the inconsistency method is "what makes a distance jump a jump?". It answers this by comparing each cluster merge's height h to the average avg of the merge heights over the previous depth levels of the tree, normalizing the difference by their standard deviation std:

$$inconsistency = \frac{h - avg}{std}$$

The following shows the avg, std, count and inconsistency values for each of the last 10 merges of our hierarchical clustering, calculated with depth = 5:

In [21]:
from scipy.cluster.hierarchy import inconsistent

depth = 5
incons = inconsistent(Z, depth)
incons[-10:]
Out[21]:
array([[  1.80875,   2.17062,  10.     ,   2.44277],
       [  2.31732,   2.19649,  16.     ,   2.52742],
       [  2.24512,   2.44225,   9.     ,   2.37659],
       [  2.30462,   2.44191,  21.     ,   2.63875],
       [  2.20673,   2.68378,  17.     ,   2.84582],
       [  1.95309,   2.581  ,  29.     ,   4.05821],
       [  3.46173,   3.53736,  28.     ,   3.29444],
       [  3.15857,   3.54836,  28.     ,   3.93328],
       [  4.9021 ,   5.10302,  28.     ,   3.57042],
       [ 12.122  ,  32.15468,  30.     ,   5.22936]])

Now you might be tempted to say "yay, let's just pick 5" as a limit in the inconsistencies, but look at what happens if we set depth to 3 instead:

In [22]:
depth = 3
incons = inconsistent(Z, depth)
incons[-10:]
Out[22]:
array([[  3.63778,   2.55561,   4.     ,   1.35908],
       [  3.89767,   2.57216,   7.     ,   1.54388],
       [  3.05886,   2.66707,   6.     ,   1.87115],
       [  4.92746,   2.7326 ,   7.     ,   1.39822],
       [  4.76943,   3.16277,   6.     ,   1.60456],
       [  5.27288,   3.56605,   7.     ,   2.00627],
       [  8.22057,   4.07583,   7.     ,   1.69162],
       [  7.83287,   4.46681,   7.     ,   2.07808],
       [ 11.38091,   6.2943 ,   7.     ,   1.86535],
       [ 37.25845,  63.31539,   7.     ,   2.25872]])

Oops! This should make you realize that the inconsistency values heavily depend on the depth of the tree over which you calculate the averages.

Another problem with this calculation is that the merge heights of the previous depth levels aren't normally distributed but are expected to increase, so you can't really treat the current level as an "outlier" of a normal distribution; it's expected to be bigger.

Elbow Method

Another thing you might see out there is a variant of the "elbow method". It tries to find the clustering step where the acceleration of distance growth is the biggest (the "strongest elbow" of the blue line graph below, which is the highest value of the green graph below):

In [23]:
last = Z[-10:, 2]
last_rev = last[::-1]
idxs = np.arange(1, len(last) + 1)
plt.plot(idxs, last_rev)

acceleration = np.diff(last, 2)  # 2nd derivative of the distances
acceleration_rev = acceleration[::-1]
plt.plot(idxs[:-2] + 1, acceleration_rev)
plt.show()
k = acceleration_rev.argmax() + 2  # if idx 0 is the max of this we want 2 clusters
print "clusters:", k
clusters: 2

While this works nicely in our simplistic example (the green line takes its maximum for k=2), it's pretty flawed as well.

One issue of this method has to do with the way an "elbow" is defined: you need at least a point to the left and one to the right, which implies that this method will never be able to tell you that all your data belongs to one single cluster.

Another problem with this variant lies in the np.diff(Z[:, 2], 2) itself. The order of the distances in Z[:, 2] doesn't properly reflect the order of merges within one branch of the tree. In other words: there is no guarantee that the merge of Z[i] lies in the branch leading to Z[i+1]. By simply computing np.diff(Z[:, 2], 2) we assume that this doesn't matter and end up comparing distance jumps from different branches of our merge tree.

If you still don't want to believe this, let's just construct another simplistic example but this time with very different variances in the different clusters:

In [24]:
c = np.random.multivariate_normal([40, 40], [[20, 1], [1, 30]], size=[200,])
d = np.random.multivariate_normal([80, 80], [[30, 1], [1, 30]], size=[200,])
e = np.random.multivariate_normal([0, 100], [[100, 1], [1, 100]], size=[200,])
X2 = np.concatenate((X, c, d, e),)
plt.scatter(X2[:,0], X2[:,1])
plt.show()

As you can see we have 5 clusters now, but they have increasing variances... let's have a look at the dendrogram again and how you can use it to spot the problem:

In [25]:
Z2 = linkage(X2, 'ward')
plt.figure(figsize=(10,10))
fancy_dendrogram(
    Z2,
    truncate_mode='lastp',
    p=30,
    leaf_rotation=90.,
    leaf_font_size=12.,
    show_contracted=True,
    annotate_above=40,
    max_d=170,
)
plt.show()

When looking at a dendrogram like this and trying to put a cut-off line somewhere, you should notice the very different distributions of merge distances below that cut-off line. Compare the distribution in the cyan cluster to the one in the red, the green, or even the two blue clusters (which have even been truncated away). In the cyan cluster below the cut-off there is no real discontinuity of merge distances until very close to the cut-off line. The two blue clusters, on the other hand, are each fully merged below a distance of 25 and have a gap of > 155 to our cut-off line.

The variant of the "elbow" method will incorrectly see the jump from 167 to 180 as minimal and tell us we have 4 clusters:

In [26]:
last = Z2[-10:, 2]
last_rev = last[::-1]
idxs = np.arange(1, len(last) + 1)
plt.plot(idxs, last_rev)

acceleration = np.diff(last, 2)  # 2nd derivative of the distances
acceleration_rev = acceleration[::-1]
plt.plot(idxs[:-2] + 1, acceleration_rev)
plt.show()
k = acceleration_rev.argmax() + 2  # if idx 0 is the max of this we want 2 clusters
print "clusters:", k
clusters: 4

The same happens with the inconsistency metric:

In [27]:
print inconsistent(Z2, 5)[-10:]
[[  13.99222   15.56656   30.         3.86585]
 [  16.73941   18.5639    30.         3.45983]
 [  19.05945   20.53211   31.         3.49953]
 [  19.25574   20.82658   29.         3.51907]
 [  21.36116   26.7766    30.         4.50256]
 [  36.58101   37.08602   31.         3.50761]
 [  12.122     32.15468   30.         5.22936]
 [  42.6137   111.38577   31.         5.13038]
 [  81.75199  208.31582   31.         5.30448]
 [ 147.25602  307.95701   31.         3.6215 ]]

I hope you can now understand why I'm warning against blindly using any of those methods on a dataset you know nothing about. They can give you some indication, but you should always go back and check whether the results make sense, for example with a dendrogram, which is a great tool for that (especially if you have higher-dimensional data that you can't simply visualize anymore).

Retrieve the Clusters

Now, let's finally have a look at how to retrieve the clusters, for different ways of determining k. We can use the fcluster function.

Knowing max_d:

Let's say we determined the max distance with the help of a dendrogram; then we can do the following to get the cluster id for each of our samples:

In [28]:
from scipy.cluster.hierarchy import fcluster
max_d = 50
clusters = fcluster(Z, max_d, criterion='distance')
clusters
Out[28]:
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

Knowing k:

Another way, starting from the dendrogram, is to say "I can see I have k=2 clusters". You can then use:

In [29]:
k=2
fcluster(Z, k, criterion='maxclust')
Out[29]:
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

Using the Inconsistency Method (default):

If you're really sure you want to use the inconsistency method to determine the number of clusters in your dataset, you can use the default criterion of fcluster() and hope you picked the correct values:

In [30]:
from scipy.cluster.hierarchy import fcluster
fcluster(Z, 8, depth=10)
Out[30]:
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

Visualizing Your Clusters

If you're lucky and your data is very low-dimensional, you can visualize the resulting clusters very easily:

In [31]:
plt.figure(figsize=(10, 8))
plt.scatter(X[:,0], X[:,1], c=clusters, cmap='prism')  # plot points with cluster dependent colors
plt.show()

I hope you enjoyed this tutorial. Feedback welcome ;)


August 22, 2015

Some notes on the MPS-110NF music player, in particular when using it on Linux. The player is sold under the brand "Denver" or "Bazzix" in different places.

The manual for this player is very terse and printed in a tiny font, so here's what I figured out. (TL;DR: it plays OGG and it works well with Linux.)

Basics

  • Well-suited for listening during exercise
    • Very lightweight
    • Has a clip
  • Very cheap
  • Working audio formats are MP3 and WMA, but it also turns out to play OGG! :)
  • Difficult to see how full the battery is
  • File transfer is via the FAT filesystem on the SD card (the player functions as a micro SD card reader when plugged in)

Navigation

  • Press "+" and "-": change the volume
  • Press "|<<" and ">>|": previous and next track
  • Long press "|<<" and ">>|": switch on/off shuffle mode (green LED will blink)

Navigating between files can become unreliable when the file system is broken; the player's file system driver has a lower tolerance for corruption than Linux does.

Managing music with Banshee

The Banshee music player seems to get things right — if your USB automounting works. :)

For Banshee to recognize the device as a player and to know the supported filetypes, add the file .is_audio_player to the player's root directory, with the following content:

name="Bazzix"
output_formats=audio/ogg,audio/mpeg,audio/mp3,audio/x-ms-wma

From there on, Banshee will recognize the device just fine. The output_formats line will tell Banshee that the player understands OGG files, so that it won't try to convert to MP3 first.

Managing music on the player — command line

You want full control! You despise GUI applications! You're in for a surprise!

The player plays files in the order in which they were created, not in the alphabetical order of the filenames.

Files easily end up in the wrong order, but luckily this is fixable by "touching" (renaming to a temporary name and back) all files in a directory in the right order, which recreates their directory entries in that order.

I conveniently placed this script on the player:

#!/usr/bin/python2
"""Recursively touches all file entries in alphabetical order.

Usage:
  ./sortfiles.py /mnt/bazzix
"""
import os
import sys

def walk(basedir):
  for dirpath, dirnames, filenames in os.walk(basedir):
    touchfiles(dirpath, filenames)

def touchfiles(basedir, files):
  for filename in sorted(files):
    tempname = "temp_%s" % filename
    full_filename = os.path.join(basedir, filename)
    full_tempname = os.path.join(basedir, tempname)
    # Touch the directory entry.
    print full_filename
    os.rename(full_filename, full_tempname)
    os.rename(full_tempname, full_filename)

def main(args):
  if not args:
    sys.exit("Needs directory argument, e.g. /mnt/bazzix")
  walk(args[0])

if __name__ == "__main__":
  main(sys.argv[1:])

Other people recommend running the tool fatsort on the block device. It aims to solve exactly the same problem, but fsck.vfat frequently complained about the results and I ended up with broken file systems.

August 17, 2015

Weblog der Fachschaft Informatik

Back again


This afternoon the time has finally come:
From 15:00 the new Fachschaft office will finally be officially opened, including coffee, cake, cold drinks, and our new water carbonator.
As a welcome offer we're selling Krombacher Pils, 0.5 l, for just 20 ct a bottle.
So come by and inaugurate our new room 48-463 with us!

August 09, 2015

The following two releases of KBibTeX are now available:



August 06, 2015

Recently I had a create form where one should be able to choose an associated model from a large set of possible values. It wouldn't be too pleasant to have a drop-down listing all these options. My idea was to render paginated radio buttons and a search field to limit the items available. Submitting […]

August 01, 2015

select concat ("Updated ", row_count(), " rows") as ''; #DATE(DATE_SUB(NOW(), INTERVAL ROUND(RAND(1)*10) DAY)); select concat ("Random date: ", DATE(DATE_SUB(NOW(), INTERVAL ROUND(RAND(1)*100) DAY))) as ''; Examples Output a random date: select addtime(concat_ws(' ','1964-12-25' + interval rand()*20000 [...]

July 06, 2015

Weblog der Fachschaft Informatik

Fachschaft closed from 27 July to 16 August


Hello everyone,

Starting at the end of the month, renovation work will take place in building 48-4, which also affects us as the Fachschaft. The Fachschaft rooms (GmF, Sofaraum, Gruft) will therefore be closed from 27 July until presumably 16 August, and the seminar room will not be usable as a study room during that time either. During this period we can only be reached by e-mail.
If you need old exams that are only available from the Fachschaft, a terminal in the SCI will be at your disposal during this time.

After the renovation you will find us in 48-463/464. This room will then be the new GmF, where, as usual, you will find us, our services, and drinks, sweets, coffee, old exams, and so on.

Best regards,
Your Fachschaft

July 03, 2015

Guenther Noack

Deep Dream

Guenther Noack

From the Google Research Blog: DeepDream - a code example for visualizing Neural Networks — it's available with source code to play around with now.

On Social Networks, pictures generated with DeepDream are now popping up under the hashtag #DeepDream. The DeepDream pictures generated from fractals by Bernard Geiger are very nice, for instance.

previously on the Google Research Blog: Inceptionism - going deeper into Neural Networks

June 24, 2015

A little-known SSH trick: when using key-based authentication, you can restrict what each key is allowed to do on the server side by attaching a set of options to the key in ~/.ssh/authorized_keys2.

For example, with the following key you can only get a port forward to localhost:8118 on the server side.

# One key per line, with space-separated fields:
# options, keytype, base64-encoded-key, comment
no-pty,no-agent-forwarding,no-X11-forwarding,permitopen="localhost:8118" ssh-rsa BLAH...

There are many more options available; they are documented in the sshd man page.

Why would you do that?

In short: Principle of Least Privilege.

This is good for keys that are stored in higher-risk places, for example keys used for unattended automated tasks or keys on devices that can easily get lost or stolen. In that case, losing the key only grants limited capabilities to a third party getting access to it.

June 23, 2015

Guenther Noack

Frameworks

Guenther Noack

When you're using a library, you have a box.
When you're using a framework, you're in the box.

June 19, 2015

The following two pre-releases of KBibTeX are now available:



June 06, 2015

Git hooks are a great way to automate your workflow and to check for faulty commits. I am using them to prevent pushing work-in-progress commits (i.e. commits with a line starting with the string WIP) to the master branch. But wait – this script differs from the sample found in .git/hooks/pre-push.sample in two ways:

  1. Only pushes to special heads are inspected. This still allows you to back up your WIP commit on the remote server or to share it with your colleagues, but prevents integration into the master branch.
  2. Only commits which would actually be merged into the master are checked – not all commits ever pushed. This way, even if a WIP commit was pushed to the master in the past, the script does not prevent you from pushing new commits. The pre-push.sample would explicitly disallow that.

We have two staging areas in our development environment: All pushes to ready/<any name> or directPatch/<any name> trigger our continuous integration process and eventually merge the changes into our master (which you cannot push to directly). Pushes to <developer acronym>/<any name> are always allowed and do not trigger any merging. So we want to check only the pushes to ready and directPatch. Of course, you might want to adapt the script to your needs:

  1. Changing the heads to be checked – see the $remote_ref checks at the top of the while loop
  2. Changing the word to look for – see the git rev-list --grep '^WIP' line

The following hook can be activated by storing it in the file <your repository>/.git/hooks/pre-push (no file extension) and making it executable:

#!/bin/bash

# This hook is called with the following parameters:
#
# $1 -- Name of the remote to which the push is being done
# $2 -- URL to which the push is being done
#
# Information about the commits which are being pushed is
# supplied as lines to the standard input in the form:
#
#   <local ref> <local sha1> <remote ref> <remote sha1>
#
# This hook prevents pushing commits whose log message starts
# with "WIP" (work in progress) to refs/heads/ready/* or
# refs/heads/directPatch/*

remote="$1"
url="$2"

z40=0000000000000000000000000000000000000000

IFS=' '
while read local_ref local_sha remote_ref remote_sha
do
	if [[ $remote_ref != refs/heads/ready/* ]] && [[ $remote_ref != refs/heads/directPatch/* ]]
	then
		# Do not check the WIP
		continue
	fi

	if [ "$local_sha" = $z40 ]
	then
		# Handle delete
		:
	else
		# Only inspect commits not yet merged into origin/master
		range="origin/master..$local_sha"

		# Check for WIP commit
		commit=`git rev-list -n 1 --grep '^WIP' "$range"`
		if [ -n "$commit" ]
		then
			echo "Found WIP commit in $local_ref: $commit, not pushing."
			exit 1
		fi
	fi
done

exit 0

June 03, 2015

Weblog der Fachschaft Informatik

[Events] Bike tour to the Gelterswoog


Hello folks,

It's that time again!

The next event is coming up, and we want to welcome the summer with you. The plan is to go on a bike tour to the Gelterswoog and spend a pleasant afternoon there with a cozy picnic and the occasional sip of beer. We'll meet on Sunday (07.06) at 11 am at the university. If you need help getting your bike into shape beforehand, you can come by from 10 am.

So dust off your bikes, get them into shape, pack some food and drinks, and look forward to the bike tour.

For the bike tour you'll need:

  • a bike, plus water or something else to drink for the ride
  • picnic food and drinks
  • 3 € for admission to the Gelterswoog

June 02, 2015

After a long time since the last release (13 months ago), I am currently working towards a number of releases for KBibTeX.

Read more... )

comment count unavailable comments

May 01, 2015

Weblog der Fachschaft Informatik

New Fachschaftsrat


At the past general assembly on 30 April, the following people were elected as the new Fachschaftsrat:

  • Rouven Bauer
  • Lisa Busser
  • Šandor Dalecke
  • Sarah Dossinger
  • Markus Fögen
  • Patrick Hansert
  • Marc Hauer
  • Philipp Schmitt
  • Marco Stricker

We thank everyone who placed their trust in us.
Directly following the general assembly, the constituent meeting took place, at which the posts were reassigned. As usual, you can look up your current representatives under Fachschaft.

April 29, 2015

Using the client

$client = new Client();
$client->setUri('unix:///run/docker.sock:/v1.13/version');
$client->setOptions(array(
    'maxredirects' => 0,
    'timeout' => 30
));
$response = $client->send();
echo $response->getContent();

{"ApiVersion":"1.14","Arch":"amd64","GitCommit":"fa7b24f","GoVersion":"go1.3.1","KernelVersion":"3.16.1-1-ARCH","Os":"linux","Version":"1.2.0"}

Using the socket

$uri = new [...]

April 27, 2015

GPG can be used as an SSH agent, which allows you to store your keys on a smartcard rather than on an (easy to compromise) computer. This is how it's done.

I'm assuming the smartcard is already set up. To try, ask GPG to print out information about the smartcard:

gpg --card-status

Set up the server side

Register your key on the machine you want to log in to.

gpgkey2ssh $KEYID | ssh user@host 'cat >> .ssh/authorized_keys2'

(You may want to use an intermediate file if you need to edit the list of authorized keys further, e.g. to restrict what the key is allowed to do.)

Set up the client side

Enable SSH support in GPG agent

echo enable-ssh-support >> $HOME/.gnupg/gpg-agent.conf
gpg-connect-agent /bye  # restart agent

In your shell, set the SSH agent environment variables, so that SSH knows how to talk to the GPG agent. (This is also documented in the gpg-agent man page.)

unset SSH_AGENT_PID
export SSH_AUTH_SOCK=$HOME/.gnupg/S.gpg-agent.ssh

You may want to put this into your .bashrc.

Is this even a good idea?

Some relevant questions:

  • Do you trust gpg-agent more than ssh-agent?
    • Do you trust it enough to tunnel it to remote hosts? (If you're using SSH agent forwarding)
  • Do you want to plug in your PGP keys and enter the PIN when doing SSH authentication? (If you're storing them on the same card; avoiding unnecessary exposure seems like a good idea.)

You can compare the source code here:

April 17, 2015

Weblog der Fachschaft Informatik

Glühweinwanderung (mulled wine hike)


This is the route the Glühwein cart took this semester:
Link to the track

As every semester, the Glühweinwanderung to the Humbergturm was once again great fun. With the refurbished Glühwein cart, whose features were put to the test this time and which will be improved further for the next GwwzHt, we chose a rather scenic, if not particularly direct, route, which I personally liked a lot.

GwwzHt SS15

April 10, 2015

Weblog der Fachschaft Informatik

Scotland Yard 3.3 Rev. B


The pictures from yesterday's Scotland Yard 3.3 Rev. B are online =]

http://ewoche.de/galleries/9022/27/
(login required)

Who wants to join?