Associating Region Index With True Labels
Solution 1:
If I understand you correctly, you want to assign the actual digit value as a cluster label that matches that cluster, correct? If that is the case, I don't think it is possible.
K-Means is an unsupervised learning algorithm. It does not understand what it is looking at and the labels it assigns are arbitrary. Instead of 0, 1, 2, ... it could have called them 'apple', 'orange', 'grape' ... . All K-Means can ever do, is to tell you that a bunch of data points are similar to each other based on some metric, that is all. It is great for data exploration or pattern finding. But not for telling you "What" it actually is.
It does not matter what post processing you do, because the computer can never, programmatically, know what the true labels are, unless you, the human, tell it. In which case you might as well use a supervised learning algorithm.
If you want to train a model, that, when it see's a number, it can then assign the correct label to it, you must use a supervised learning method (where labels are a thing). Look into Random Forest instead, for instance. Here is a similar endeavor.
Solution 2:
Here is the code to use my solution:
from sklearn.cluster import KMeans
import utils
# Extraction du dataset
x_train, y_train = utils.get_train_data()
x_test, y_test = utils.get_test_data()
kmeans = KMeans(n_clusters=10)
kmeans.fit(x_train)
training_labels = kmeans.labels_
switches_to_make = utils.find_closest_digit_to_centroids(kmeans, x_train, y_train) # Obtaining the most probable labels (digits) for each region
utils.treat_data(switches_to_make, training_labels, y_train)
predictions = kmeans.predict(x_test)
utils.treat_data(switches_to_make, predictions, y_test)
And utils.py
:
import csv
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import pairwise_distances_argmin_min
use_reduced = True# Flag variable to use the reduced datasets (generated by 'pre_process.py')
verbose = False# Should debugging prints be showndefget_data(reduced_path, path):
"""
Pour obtenir le dataset désiré.
:param reduced_path: path vers la version réduite (générée par 'pre_process.py')
:param path: path vers la version complète
:return: numpy arrays (data, labels)
"""if use_reduced:
data = open(reduced_path)
else:
data = open(path)
csv_file = csv.reader(data)
data_points = []
for row in csv_file:
data_points.append(row)
data_points.pop(0) # On enlève la première ligne, soit les "headers" de nos colonnes
data.close()
# Pour passer de String à intfor i inrange(len(data_points)): # for each imagefor j inrange(len(data_points[0])): # for each pixel
data_points[i][j] = int(data_points[i][j])
# # Pour obtenir des valeurs en FLOAT normalisées entre 0 et 1:# data_points[i][j] = np.divide(float(data_points[i][j]), 255)# Pour séparer les labels du data
y_train = [] # labelsfor row in data_points:
y_train.append(row[0]) # first column is the label
x_train = [] # datafor row in data_points:
x_train.append(row[1:785]) # other columns are the pixels
x_train = np.array(x_train)
y_train = np.array(y_train)
print("Done with loading the dataset.")
return x_train, y_train
defget_test_data():
"""
Retourne le dataset de test désiré.
:return: numpy arrays (data, labels)
"""return get_data('../data/reduced_mnist_test.csv', '../data/mnist_test.csv')
defget_train_data():
"""
Retourne le dataset de training désiré.
:return: numpy arrays (data, labels)
"""return get_data('../data/reduced_mnist_train.csv', '../data/mnist_train.csv')
defdisplay_data(x_train, y_train):
"""
Affiche le digit voulu.
:param x_train: le data (784D)
:param y_train: le label associé
:return:
"""# Exemple pour afficher: conversion de notre vecteur d'une dimension en 2 dimensions
matrix = np.reshape(x_train, (28, 28))
plt.imshow(matrix, cmap='gray')
plt.title("Voici un " + str(y_train))
plt.show()
defgenerate_mean_images(x_train, y_train):
"""
Retourne le tableau des images moyennes pour chaque classe
:param x_train:
:param y_train:
:return:
"""
counts = np.zeros(10).astype(int)
for label in y_train:
counts[label] += 1
sum_pixel_values = np.zeros((10, 784)).astype(int)
for img inrange(len(y_train)):
for pixel inrange(len(x_train[0])):
sum_pixel_values[y_train[img]][pixel] += x_train[img][pixel]
pixel_probability = np.zeros((len(counts), len(x_train[0]))) # (10, 784)for classe inrange(len(counts)):
for pixel inrange(len(x_train[0])):
pixel_probability[classe][pixel] = np.divide(sum_pixel_values[classe][pixel] + 1, counts[classe] + 2)
mean_images = []
if verbose:
plt.figure(figsize=(20, 4)) # values of the size of the plot: (x,y) in INCHES
plt.suptitle("Such wow, much impress !")
for classe inrange(len(counts)):
class_mean = np.reshape(pixel_probability[classe], (28, 28))
mean_images.append(class_mean)
# Aesthetics
plt.subplot(1, 10, classe + 1)
plt.title(str(classe))
plt.imshow(class_mean, cmap='gray')
plt.xticks([])
plt.yticks([])
plt.show()
return mean_images
########## used for "k_mean" (for now)defcount_labels(name, data):
"""
S'occupe de compter le nombre de data associé à chacun des labels.
:param name: nom de ce que l'on compte
:param data: doit être 1D
:return: counts = le nombre pour chaque label
"""
header = "-- " + str(name) + " -- "# making sure it's a String
counts = [0]*10# initializing the counting arrayfor label in data:
counts[label] += 1if verbose: print(header, "Amounts for each label:", counts)
return counts
defget_list_of_indices(data, label):
"""
Retourne une liste d'indices correspondant à tous les endroits
où on peut trouver dans 'data' le 'label' spécifié
:param data:
:param label: le numéro associé à une région générée par k_mean
:return:
"""return (np.where(data == label))[0].tolist()
defrearrange_labels(switches_to_make, to_change):
"""
Takes region numbers and assigns the most probable digit (label) to it.
For example, if switches_to_make[3] = 5, it means that the 4th region (index 3 of the list)
should be considered as representing the digit "5".
:param switches_to_make: list of changes to make
:param to_change: this table will be changed according to 'switches_to_make'
:return: nothing, the change is made in-situ
"""for region inrange(len(to_change)):
for label inrange(len(switches_to_make)):
if to_change[region] == label: # if it corresponds to the "wrong" label given by scikit
to_change[region] = switches_to_make[label] # assign the "most probable" labelbreakdefcount_error_rate(found, truth):
wrong = 0for i inrange(len(found)):
if found[i] != truth[i]:
wrong += 1
percent = wrong / len(found) * 100print("Error rate = ", percent, "%")
return percent
deftreat_data(switches_to_make, predictions, truth):
rearrange_labels(switches_to_make, predictions) # Rearranging the training labels
count_error_rate(predictions, truth) # Counting error rate# TODO: reassign in case of doubles# adapted from https://stackoverflow.com/a/45275056/9768291deffind_closest_digit_to_centroids(kmean, data, labels):
"""
The array 'closest' will contain the index of the point in 'data' that is closest to each centroid.
Let's say the 'closest' gave output as array([0,8,5]) for the three clusters. So data[0] is the
closest point in 'data' to centroid 0, and data[8] is the closest to centroid 1 and so on.
If the returned list is [9,4,2,1,3] it would mean that the region #0 (index 0) represents the digit 9 the best.
:param kmean: the variable where the 'fit' data has been stored
:param data: the actual data that was used with 'fit' (x_train)
:param labels: the true labels associated with 'data' (y_train)
:return: list where each region is at its index and the value at that index is the digit it represents
"""
closest, _ = pairwise_distances_argmin_min(kmean.cluster_centers_,
data,
metric="euclidean")
switches_to_make = []
for region inrange(len(closest)):
truth = labels[closest[region]]
print("The assigned region", region, "should be considered as representing the digit ", truth)
switches_to_make.append(truth)
print("Digits associated to each region (switches_to_make):", switches_to_make)
return switches_to_make
Essentially, here is the function that solved my problems:
# adapted from https://stackoverflow.com/a/45275056/9768291def find_closest_digit_to_centroids(kmean, data, labels):
"""
The array 'closest' will contain the index of the point in 'data' that is closest to each centroid.
Let's say the 'closest' gave output asarray([0,8,5]) for the three clusters. So data[0] is the
closest point in 'data' to centroid 0, and data[8] is the closest to centroid 1 and so on.
If the returned list is [9,4,2,1,3] it would mean that the region #0 (index 0) represents the digit 9 the best.
:param kmean: the variable where the 'fit' data has been stored
:param data: the actual data that was used with 'fit' (x_train)
:param labels: the true labels associated with 'data' (y_train)
:return: list where each region is at its index and the value at that index is the digit it represents
"""
closest, _ = pairwise_distances_argmin_min(kmean.cluster_centers_,
data,
metric="euclidean")
switches_to_make = []
for region inrange(len(closest)):
truth = labels[closest[region]]
print("The assigned region", region, "should be considered as representing the digit ", truth)
switches_to_make.append(truth)
print("Digits associated to each region (switches_to_make):", switches_to_make)
return switches_to_make
Post a Comment for "Associating Region Index With True Labels"