Visualize a high-dimensional dataset in a 2D chart.
In this post, I'll use the well-known MNIST handwritten digits dataset. It contains 70,000 images, each of size 28x28 pixels.
First, import the libraries we are going to use.
import matplotlib.pyplot as plt
I'm using matplotlib and seaborn for visualization, numpy and pandas to handle numerical arrays and dataframes, and scikit-learn to fetch the data and run t-SNE.
Download the dataset
data = datasets.fetch_openml('mnist_784', version=1, return_X_y=True)
(70000, 784)
The downloaded dataset has 70,000 records, and each record has 784 columns, one per pixel (28 x 28 = 784).
Let's plot an image to see what it looks like.
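A minimal sketch of unpacking the downloaded pair into plain arrays. It assumes that `data` is an `(X, y)` pair, that the features may arrive as either a NumPy array or a pandas DataFrame depending on the scikit-learn version, and that the labels may arrive as strings. The helper name `split_features_targets` is hypothetical, and a tiny stand-in tuple replaces the real download so the snippet runs offline.

```python
import numpy as np


def split_features_targets(data):
    """Unpack an (X, y) pair into a float array and an integer label array."""
    pixel_values, targets = data
    # np.asarray accepts both NumPy arrays and pandas objects.
    pixel_values = np.asarray(pixel_values, dtype=np.float64)
    # Labels such as "5" are converted to the integer 5.
    targets = np.asarray(targets).astype(int)
    return pixel_values, targets


# Stand-in for the tuple returned by the download step (assumption:
# 5 fake records instead of 70,000, all pixels set to zero).
fake_data = (np.zeros((5, 784)), np.array(["5", "0", "4", "1", "9"]))

pixel_values, targets = split_features_targets(fake_data)
print(pixel_values.shape)  # (5, 784)
print(targets)             # [5 0 4 1 9]
```

Converting to NumPy up front keeps positional indexing such as `pixel_values[0, :]` working regardless of which container the download step returned.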
image = pixel_values[0, :].reshape(28, 28)

Each image in the dataset is stored as a flat vector of 784 values, so I need to reshape it to 28x28.
Now the important part: compute t-SNE.
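To make the reshape concrete, here is a small sketch using a stand-in vector instead of a real MNIST row (the values are just 0 to 783, an assumption for illustration). It shows that the reshape is row-major: the pixel at row `r`, column `c` of the image is element `r * 28 + c` of the flat vector.

```python
import numpy as np

# Stand-in for one flattened record: 784 values, numbered 0..783.
flat = np.arange(784)

# Same call shape as in the post: 784 values become a 28x28 grid.
image = flat.reshape(28, 28)
print(image.shape)   # (28, 28)

# Row-major layout: image[r, c] == flat[r * 28 + c].
print(image[3, 5])   # 89

# Flattening the grid again recovers the original vector.
restored = image.ravel()
print(np.array_equal(restored, flat))  # True
```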
tsne = manifold.TSNE(n_components=2, random_state=42)
(6000, 2)
In this example, I'm using only the first 6,000 rows and reducing the columns from 784 to 2, which is enough to plot the data on a 2D chart.
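The same step can be tried without downloading anything. The sketch below swaps in scikit-learn's small bundled digits dataset (`load_digits`, 8x8 images with 64 features) as a stand-in for MNIST, and takes only 300 rows so it runs in a few seconds. The dataset swap and the row count are my assumptions, not part of the original workflow.

```python
from sklearn import datasets, manifold

# Stand-in data: 8x8 digit images bundled with scikit-learn (64 columns).
digits = datasets.load_digits()
subset = digits.data[:300]
print(subset.shape)    # (300, 64)

# Same configuration as in the post: 2 output components, fixed seed.
tsne = manifold.TSNE(n_components=2, random_state=42)
embedded = tsne.fit_transform(subset)
print(embedded.shape)  # (300, 2)
```

Note that `TSNE` has no separate `transform` method for new points; `fit_transform` learns and returns the embedding for exactly the rows it is given, which is why the subset is chosen before the call.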
Let's visualize the transformed dataset
tsne_df = pd.DataFrame(np.column_stack((transformed_data, targets[:6000])), columns=['x', 'y', 'targets'])
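One detail worth seeing in isolation: `np.column_stack` produces a single array with one dtype, so integer labels stacked next to float coordinates come out as floats (5.0 rather than 5). The sketch below uses random stand-in coordinates and six made-up labels (both assumptions) and casts the label column back to integers, which gives cleaner legend entries when the points are later colored by label.

```python
import numpy as np
import pandas as pd

# Stand-ins for the t-SNE output and the matching labels.
rng = np.random.default_rng(0)
transformed_data = rng.normal(size=(6, 2))
targets = np.array([5, 0, 4, 1, 9, 2])

tsne_df = pd.DataFrame(
    np.column_stack((transformed_data, targets)),
    columns=["x", "y", "targets"],
)
# After stacking, every column is float, including the labels.
print(tsne_df["targets"].dtype)  # float64

# Cast the labels back to integers so they read as 5, 0, 4, ...
tsne_df["targets"] = tsne_df["targets"].astype(int)
print(tsne_df["targets"].tolist())  # [5, 0, 4, 1, 9, 2]
```

A frame in this shape can be handed to a plotting call that colors points by the `targets` column, for example a seaborn `FacetGrid` with `hue="targets"` mapped onto a scatter of `x` against `y`.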

This is one way to visualize a dataset. Plotting it on a chart shows that the digits 0 and 6 are easy to distinguish, while 4 and 9 are harder to distinguish.