The MNIST database of handwritten digits

The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. The digits have been size-normalized and centered in a fixed-size image.

This dataset is sourced from THE MNIST DATABASE of handwritten digits. It's a subset of the larger NIST Handprinted Forms and Characters Database published by the National Institute of Standards and Technology.

Storage Location

  • Blob account: azureopendatastorage

  • Container name: mnist

Four files are available in the container directly:

  • train-images-idx3-ubyte.gz: training set images (9912422 bytes)

  • train-labels-idx1-ubyte.gz: training set labels (28881 bytes)

  • t10k-images-idx3-ubyte.gz: test set images (1648877 bytes)

  • t10k-labels-idx1-ubyte.gz: test set labels (4542 bytes)
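
Given the account and container above, each file's public URL can be assembled from its name. The following is a minimal sketch: the helper name `blob_url` is hypothetical, and the URL pattern is the one used by the direct-download cell further down this page.

```python
# Hypothetical helper: build the public HTTPS URL of a blob in the MNIST container.
ACCOUNT = "azureopendatastorage"
CONTAINER = "mnist"
FILES = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
]

def blob_url(name, account=ACCOUNT, container=CONTAINER):
    """Return the blob URL for one file name."""
    return f"https://{account}.blob.core.windows.net/{container}/{name}"

for name in FILES:
    print(blob_url(name))
```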

Notices

MICROSOFT PROVIDES AZURE OPEN DATASETS ON AN “AS IS” BASIS. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, GUARANTEES OR CONDITIONS WITH RESPECT TO YOUR USE OF THE DATASETS. TO THE EXTENT PERMITTED UNDER YOUR LOCAL LAW, MICROSOFT DISCLAIMS ALL LIABILITY FOR ANY DAMAGES OR LOSSES, INCLUDING DIRECT, CONSEQUENTIAL, SPECIAL, INDIRECT, INCIDENTAL OR PUNITIVE, RESULTING FROM YOUR USE OF THE DATASETS.

This dataset is provided under the original terms under which Microsoft received the source data. The dataset may include data sourced from Microsoft.

Access

Available in, and when to use each:

  • Azure Notebooks: Quickly explore the dataset with Jupyter notebooks hosted on Azure or your local machine.

  • Azure Databricks: Use this when you need the scale of an Azure managed Spark cluster to process the dataset.

Azure Notebooks

Language: Python

Load MNIST into a data frame using Azure Machine Learning tabular datasets.

See https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-create-register-datasets to learn more about datasets.

Get complete dataset into a data frame

In [1]:
from azureml.opendatasets import MNIST

mnist = MNIST.get_tabular_dataset()
mnist_df = mnist.to_pandas_dataframe()
mnist_df.info()
ActivityStarted, get_tabular_dataset
ActivityCompleted: Activity=get_tabular_dataset, HowEnded=Success, Duration=8343.18 [ms]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Columns: 785 entries, 0 to label
dtypes: int64(785)
memory usage: 419.2 MB
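
The reported memory usage follows from the frame's shape and dtype. A quick arithmetic check, assuming the figure is expressed in units of 2^20 bytes and ignoring the small index overhead:

```python
# Back-of-the-envelope check of the memory figure reported by DataFrame.info().
rows, cols = 70_000, 785          # 784 pixel columns plus the label column
bytes_int64 = rows * cols * 8     # int64 uses 8 bytes per value
bytes_uint8 = rows * cols * 1     # pixel values 0-255 and labels 0-9 fit in one byte

print(f"int64: {bytes_int64 / 2**20:.1f} MB")
print(f"uint8: {bytes_uint8 / 2**20:.1f} MB")
```

If memory is tight, casting the frame to `uint8` is one option to consider. Whether that suits the downstream model is a separate question.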

Get train and test data frames

In [2]:
mnist_train = MNIST.get_tabular_dataset(datasetFilter='train')
mnist_train_df = mnist_train.to_pandas_dataframe()
X_train = mnist_train_df.drop("label", axis=1).values/255.0
y_train = mnist_train_df.filter(items=["label"]).values

mnist_test = MNIST.get_tabular_dataset(datasetFilter='test')
mnist_test_df = mnist_test.to_pandas_dataframe()
X_test = mnist_test_df.drop("label", axis=1).values/255.0
y_test = mnist_test_df.filter(items=["label"]).values
ActivityStarted, get_tabular_dataset
ActivityCompleted: Activity=get_tabular_dataset, HowEnded=Success, Duration=3537.69 [ms]
ActivityStarted, get_tabular_dataset
ActivityCompleted: Activity=get_tabular_dataset, HowEnded=Success, Duration=4404.79 [ms]
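
Note that `filter(items=["label"]).values` returns a two-dimensional array of shape `(n, 1)`. Many estimators expect one-dimensional labels, so flattening may be needed. A minimal sketch on a small synthetic frame follows; the toy data is made up, and only the column name `label` comes from the cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for mnist_train_df: 3 rows, 4 "pixel" columns plus a label column.
toy_df = pd.DataFrame(
    {"0": [0, 255, 128], "1": [0, 0, 64], "2": [255, 0, 32], "3": [0, 51, 0],
     "label": [7, 2, 1]}
)

X = toy_df.drop("label", axis=1).values / 255.0   # scale pixels to [0, 1]
y_2d = toy_df.filter(items=["label"]).values      # shape (3, 1)
y_1d = y_2d.ravel()                               # shape (3,)

print(X.shape, y_2d.shape, y_1d.shape)
```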

Plot some images of the digits

In [3]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Show some randomly chosen images from the training set.
count = 0
sample_size = 30
plt.figure(figsize=(16, 6))
for i in np.random.permutation(X_train.shape[0])[:sample_size]:
    count = count + 1
    plt.subplot(1, sample_size, count)
    plt.axhline('')
    plt.axvline('')
    plt.text(x=10, y=-10, s=y_train[i], fontsize=18)
    plt.imshow(X_train[i].reshape(28, 28), cmap=plt.cm.Greys)
plt.show()

Download or mount the MNIST raw files using Azure Machine Learning file datasets.

This works only on Linux-based compute. See https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-create-register-datasets to learn more about datasets.

In [4]:
mnist_file = MNIST.get_file_dataset()
mnist_file
ActivityStarted, get_file_dataset
ActivityCompleted: Activity=get_file_dataset, HowEnded=Success, Duration=2272.94 [ms]
Out[4]:
{
  "source": [
    "https://azureopendatastorage.blob.core.windows.net/mnist/**/*.gz"
  ],
  "definition": [
    "GetFiles"
  ]
}
In [5]:
mnist_file.to_path()
Out[5]:
array(['/t10k-images-idx3-ubyte.gz', '/t10k-labels-idx1-ubyte.gz',
       '/train-images-idx3-ubyte.gz', '/train-labels-idx1-ubyte.gz'],
      dtype=object)

Download files to local storage

In [6]:
import os
import tempfile

data_folder = tempfile.mkdtemp()
mnist_file.download(data_folder, overwrite=True)
Out[6]:
array(['/tmp/tmpxqh3jcf_/t10k-images-idx3-ubyte.gz',
       '/tmp/tmpxqh3jcf_/t10k-labels-idx1-ubyte.gz',
       '/tmp/tmpxqh3jcf_/train-images-idx3-ubyte.gz',
       '/tmp/tmpxqh3jcf_/train-labels-idx1-ubyte.gz'], dtype=object)
In [7]:
os.listdir(data_folder)
Out[7]:
['train-images-idx3-ubyte.gz',
 't10k-images-idx3-ubyte.gz',
 't10k-labels-idx1-ubyte.gz',
 'train-labels-idx1-ubyte.gz']
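
One way to sanity-check the download is to compare each file's size on disk with the byte counts listed under Storage Location. The following is a minimal sketch: `find_size_mismatches` is a hypothetical helper, and the demonstration uses a throwaway file rather than the real data.

```python
import os
import tempfile

# Byte counts as listed for the container's four files.
EXPECTED_SIZES = {
    "train-images-idx3-ubyte.gz": 9912422,
    "train-labels-idx1-ubyte.gz": 28881,
    "t10k-images-idx3-ubyte.gz": 1648877,
    "t10k-labels-idx1-ubyte.gz": 4542,
}

def find_size_mismatches(folder, expected=EXPECTED_SIZES):
    """Return {name: actual_size_or_None} for files that are missing or the wrong size."""
    problems = {}
    for name, size in expected.items():
        path = os.path.join(folder, name)
        actual = os.path.getsize(path) if os.path.exists(path) else None
        if actual != size:
            problems[name] = actual
    return problems

# Demonstration: a dummy file of the right size for the test labels only.
demo_folder = tempfile.mkdtemp()
with open(os.path.join(demo_folder, "t10k-labels-idx1-ubyte.gz"), "wb") as f:
    f.write(bytes(4542))
print(find_size_mismatches(demo_folder))
```

With the real download, `find_size_mismatches(data_folder)` would be expected to return an empty dict if all four files arrived intact.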

Mount files. This is useful when the training job will run on remote compute.

In [8]:
import gzip
import struct
import pandas as pd
import numpy as np

# load compressed MNIST gz files and return pandas dataframe of numpy arrays
def load_data(filename, label=False):
    with gzip.open(filename) as gz:
        gz.read(4)  # skip the 4-byte magic number at the start of the file
        n_items = struct.unpack('>I', gz.read(4))
        if not label:
            n_rows = struct.unpack('>I', gz.read(4))[0]
            n_cols = struct.unpack('>I', gz.read(4))[0]
            res = np.frombuffer(gz.read(n_items[0] * n_rows * n_cols), dtype=np.uint8)
            res = res.reshape(n_items[0], n_rows * n_cols)
        else:
            res = np.frombuffer(gz.read(n_items[0]), dtype=np.uint8)
            res = res.reshape(n_items[0], 1)
    return pd.DataFrame(res)
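
The loader above skips the first four header bytes. In the IDX format those bytes hold a big-endian magic number (2051 for image files, 2049 for label files), and checking it can catch a mixed-up file early. The following is a minimal sketch that uses a tiny in-memory file so it runs without the dataset; `read_idx_header` and `make_fake_idx_images` are hypothetical helper names.

```python
import gzip
import io
import struct

IMAGE_MAGIC = 2051  # 0x00000803: unsigned bytes, 3 dimensions
LABEL_MAGIC = 2049  # 0x00000801: unsigned bytes, 1 dimension

def read_idx_header(fileobj):
    """Return (magic, dims) read from an open, decompressed IDX stream."""
    magic, = struct.unpack(">I", fileobj.read(4))
    if magic not in (IMAGE_MAGIC, LABEL_MAGIC):
        raise ValueError(f"unexpected IDX magic number: {magic}")
    n_dims = magic & 0xFF  # low byte of the magic number is the dimension count
    dims = struct.unpack(">" + "I" * n_dims, fileobj.read(4 * n_dims))
    return magic, dims

def make_fake_idx_images(n_items, n_rows, n_cols):
    """Build gzip-compressed bytes that mimic an IDX image file (all-zero pixels)."""
    header = struct.pack(">IIII", IMAGE_MAGIC, n_items, n_rows, n_cols)
    return gzip.compress(header + bytes(n_items * n_rows * n_cols))

blob = make_fake_idx_images(2, 28, 28)
with gzip.open(io.BytesIO(blob)) as gz:
    magic, dims = read_idx_header(gz)
    pixels = gz.read()  # remaining bytes are the raw pixel data

print(magic, dims, len(pixels))
```
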
In [9]:
import sys
mount_point = tempfile.mkdtemp()
print(mount_point)
print(os.path.exists(mount_point))
print(os.listdir(mount_point))

if sys.platform == 'linux':
  print("start mounting....")
  with mnist_file.mount(mount_point):
    print("list dir...")
    print(os.listdir(mount_point))
    print("get the dataframe info of mounted data...")
    train_images_df = load_data(os.path.join(mount_point, 'train-images-idx3-ubyte.gz'))
    print(train_images_df.info())
/tmp/tmpmtcdrdqr
True
[]
start mounting....
list dir...
['t10k-images-idx3-ubyte.gz', 't10k-labels-idx1-ubyte.gz', 'train-images-idx3-ubyte.gz', 'train-labels-idx1-ubyte.gz']
get the dataframe info of mounted data...
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Columns: 784 entries, 0 to 783
dtypes: uint8(784)
memory usage: 44.9 MB
None
In [10]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
In [11]:
import urllib.request
import os

data_folder = os.path.join(os.getcwd(), 'data')
os.makedirs(data_folder, exist_ok=True)

urllib.request.urlretrieve('https://azureopendatastorage.blob.core.windows.net/mnist/train-images-idx3-ubyte.gz',
                           filename=os.path.join(data_folder, 'train-images.gz'))
urllib.request.urlretrieve('https://azureopendatastorage.blob.core.windows.net/mnist/train-labels-idx1-ubyte.gz',
                           filename=os.path.join(data_folder, 'train-labels.gz'))
urllib.request.urlretrieve('https://azureopendatastorage.blob.core.windows.net/mnist/t10k-images-idx3-ubyte.gz',
                           filename=os.path.join(data_folder, 'test-images.gz'))
urllib.request.urlretrieve('https://azureopendatastorage.blob.core.windows.net/mnist/t10k-labels-idx1-ubyte.gz',
                           filename=os.path.join(data_folder, 'test-labels.gz'))
Out[11]:
('D:\\TEMP\\jupyter\\data\\test-labels.gz',
 <http.client.HTTPMessage at 0x21ac002cb38>)
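
The four calls above follow one pattern, so they can be driven from a mapping of remote names to local names. The following is a minimal sketch: `download_all` and its `retrieve` and `skip_existing` parameters are hypothetical, with `retrieve` added so the loop can be exercised without network access.

```python
import os
import urllib.request

BASE_URL = "https://azureopendatastorage.blob.core.windows.net/mnist/"

# Remote blob name -> shorter local file name, as used in the cell above.
FILE_MAP = {
    "train-images-idx3-ubyte.gz": "train-images.gz",
    "train-labels-idx1-ubyte.gz": "train-labels.gz",
    "t10k-images-idx3-ubyte.gz": "test-images.gz",
    "t10k-labels-idx1-ubyte.gz": "test-labels.gz",
}

def download_all(folder, retrieve=urllib.request.urlretrieve, skip_existing=True):
    """Fetch every file in FILE_MAP into folder and return the local paths."""
    os.makedirs(folder, exist_ok=True)
    paths = []
    for remote, local in FILE_MAP.items():
        target = os.path.join(folder, local)
        # Only hit the network when the file is absent or re-download is requested.
        if not (skip_existing and os.path.exists(target)):
            retrieve(BASE_URL + remote, filename=target)
        paths.append(target)
    return paths
```

Calling `download_all(data_folder)` would then stand in for the four explicit `urlretrieve` calls.
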
In [12]:
import gzip
import struct

# load compressed MNIST gz files and return numpy arrays
def load_data(filename, label=False):
    with gzip.open(filename) as gz:
        struct.unpack('>I', gz.read(4))  # read and discard the big-endian magic number
        n_items = struct.unpack('>I', gz.read(4))
        if not label:
            n_rows = struct.unpack('>I', gz.read(4))[0]
            n_cols = struct.unpack('>I', gz.read(4))[0]
            res = np.frombuffer(gz.read(n_items[0] * n_rows * n_cols), dtype=np.uint8)
            res = res.reshape(n_items[0], n_rows * n_cols)
        else:
            res = np.frombuffer(gz.read(n_items[0]), dtype=np.uint8)
            res = res.reshape(n_items[0], 1)
    return res
In [13]:
# note we also shrink the intensity values (X) from 0-255 to 0-1. This helps the model converge faster.
X_train = load_data(os.path.join(
    data_folder, 'train-images.gz'), False) / 255.0
X_test = load_data(os.path.join(data_folder, 'test-images.gz'), False) / 255.0
y_train = load_data(os.path.join(
    data_folder, 'train-labels.gz'), True).reshape(-1)
y_test = load_data(os.path.join(
    data_folder, 'test-labels.gz'), True).reshape(-1)

# Show some randomly chosen images from the training set.
count = 0
sample_size = 30
plt.figure(figsize=(16, 6))
for i in np.random.permutation(X_train.shape[0])[:sample_size]:
    count = count + 1
    plt.subplot(1, sample_size, count)
    plt.axhline('')
    plt.axvline('')
    plt.text(x=10, y=-10, s=y_train[i], fontsize=18)
    plt.imshow(X_train[i].reshape(28, 28), cmap=plt.cm.Greys)
plt.show()

Azure Databricks

Language: Python

Load MNIST into a data frame using Azure Machine Learning tabular datasets.

See https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-create-register-datasets to learn more about datasets.

Get complete dataset into a data frame

In [1]:
# This is a package in preview.
from azureml.opendatasets import MNIST

mnist = MNIST.get_tabular_dataset()
mnist_df = mnist.to_spark_dataframe()
ActivityStarted, get_tabular_dataset
ActivityCompleted: Activity=get_tabular_dataset, HowEnded=Success, Duration=6821.78 [ms]
In [2]:
display(mnist_df.limit(5))
[Table output: first five rows of mnist_df, a label column plus pixel-value columns]