Introduction
On a whim, I decided to learn machine learning. These are notes from my learning process.
Preparation
I made these preparations:
- A MacBook with Python environment set up, and numpy and matplotlib installed
- Registered for Udacity’s free “Deep Learning” course (in collaboration with Google)
- Studied Liao Xuefeng’s Python introductory tutorial
- Spent two days roughly browsing through “Machine Learning in Action”
Learning these fundamentals should be sufficient for the upcoming Udacity course.
Course One: From Machine Learning to Deep Learning
Sections 1-8 mainly introduce the current state of deep learning and related knowledge.
Sections 9-12 introduce the softmax model.
From my rough pass through “Machine Learning in Action,” machine learning appears, on the surface, to be a collection of classification and clustering algorithms. One of the algorithms it introduces is logistic regression classification.
In sections 9-12, the focus is on the classifier model—logistic regression, using the softmax function as the classification function.
- What is the softmax function?
The softmax function works as follows: for each number z_i in the original sequence we compute exp(z_i), and each value’s share of the total, softmax(z_i) = exp(z_i) / Σ_j exp(z_j), is the softmax probability for that number.
- Properties
If the inputs are all scaled up proportionally, the classifier’s outputs become more polarized and confident; if they are scaled down proportionally, the outputs tend toward uniform and the classifier is less confident (see the quick check after the code below).
- Algorithm
import numpy as np

def softmax(x):
    """Compute softmax values for each set of scores in x."""
    expList = [np.exp(i) for i in x]      # exponentiate every score
    expSum = sum(expList)                 # sum of the exponentials
    return np.array([i / expSum for i in expList])  # each score's share of the total
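A quick check of the scaling property described above (my own example, not from the course):

scores = np.array([1.0, 2.0, 3.0])
print(softmax(scores))        # ~[0.09, 0.24, 0.67]
print(softmax(scores * 10))   # ~[0.00, 0.00, 1.00]  -> more polarized and confident
print(softmax(scores / 10))   # ~[0.30, 0.33, 0.37]  -> close to uniform, less confident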
Sections 13-14 mainly discuss One-Hot encoding. The softmax function gives us a sequence of probability values, so how do we represent the actual classification? A sequence in which the position of the correct class is 1 and every other position is 0, for example [0, 0, 1, 0] for the third of four classes, is called One-Hot encoding. An encoding like this pins the classification down.
Sections 15-16 cover cross-entropy. Softmax produces a probability sequence and One-Hot encoding is a definite classification, so how do we measure the distance from a probability sequence to a particular classification? We use cross-entropy as that distance.
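As a small illustration of my own (not from the course), using the softmax function above: the cross-entropy from a softmax output S to a one-hot label L is D(S, L) = -Σ_i L_i · log(S_i).

def cross_entropy(probs, one_hot_label):
    # D(S, L) = -sum(L * log(S)); only the term for the true class survives
    return -np.sum(one_hot_label * np.log(probs))

probs = softmax(np.array([2.0, 1.0, 0.1]))  # roughly [0.66, 0.24, 0.10]
label = np.array([1, 0, 0])                 # one-hot label: the correct class is the first
print(cross_entropy(probs, label))          # about 0.42 -- small, so the prediction is close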
Sections 17-20 explain how to use this classifier. Section 18 specifically discusses why special initial data is needed.
sum = 1000000000
for i in range(1000000):
    sum += 0.000001
sum -= 1000000000
print(sum)
The result of running this code is not 1. If we start sum from a small number such as 1 instead of 1000000000, the error becomes much smaller. For this reason, we want the input data to have a mean of 0 and equal variance in every direction. For example, for a grayscale image with pixel values from 0 to 255, we subtract 128 and then divide by 128, so that every value falls between -1 and 1. Data prepared this way is better suited for training.
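A minimal sketch of that grayscale normalization (my own illustration; the assignment code later uses (pixel - 255/2) / 255, which centers the values the same way):

import numpy as np

pixels = np.array([0, 128, 255], dtype=np.float32)
normalized = (pixels - 128.0) / 128.0
print(normalized)  # [-1.0, 0.0, ~0.99] -- zero-centered values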
With that, we can start training. To recap the video: X is the matrix of training data and W is a random weight matrix; for numerical reasons the random values are drawn from a normal distribution with mean 0 and very small variance. We compute each example’s probability sequence and its cross-entropy distance to the target, then average that distance over all targets. Our goal is to make this average distance smaller, so we adjust the weight matrix W in the direction of gradient descent, optimizing the bias b along with it, and repeat until we reach a local optimum.
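A minimal numpy sketch of that loop, as I understand it (my own illustration, not the course’s code; X is the flattened training data and y the integer labels):

import numpy as np

def train_softmax_classifier(X, y, num_classes, learning_rate=0.5, steps=1000):
    n_samples, n_features = X.shape
    W = np.random.normal(0, 0.01, (n_features, num_classes))  # small random weights, mean 0
    b = np.zeros(num_classes)
    Y = np.eye(num_classes)[y]                                 # one-hot targets
    for _ in range(steps):
        logits = X.dot(W) + b
        logits -= logits.max(axis=1, keepdims=True)            # for numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)              # softmax probabilities
        loss = -np.mean(np.sum(Y * np.log(probs), axis=1))     # average cross-entropy distance
        grad = (probs - Y) / n_samples                         # gradient of the loss w.r.t. the logits
        W -= learning_rate * X.T.dot(grad)                     # gradient descent step on W
        b -= learning_rate * grad.sum(axis=0)                  # and on the bias b
    return W, b, loss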
- Installing Docker
https://www.docker-cn.com/community-edition#/download
Configure the official Chinese mirror.
- Installing Jupyter Notebook
$ pip3 install jupyter
$ jupyter notebook
You can now use the jupyter notebook command to open a Jupyter editor.
- Setting up TensorFlow environment
$ docker run -it -p 8888:8888 tensorflow/tensorflow
Running this command will automatically download the TensorFlow image, provided that the repository mirror is set to a Chinese mirror; otherwise, the download will be very slow. After running the command, you’ll be prompted to open a webpage. When you open this URL, you’ll see the TensorFlow Jupyter editing environment, assuming Jupyter Notebook is installed correctly.
- Mounting Docker’s file directory
We need to import the official assignments. Close the container, then reopen it with
-v host_directory:container_directory
to mount a host directory:
$ docker run -v /Users/hahaha/tensorflow/:/notebooks -it -p 8888:8888 tensorflow/tensorflow
Where /Users/hahaha/tensorflow/ is a folder on my Mac, and notebooks is the default Jupyter editing directory in TensorFlow.
Paste the first assignment file, 1_notmnist.ipynb, into the mounted directory on the host. This file can be found here: 1_notmnist.ipynb
Assignment Code Segment One
First, run the import statements in the first code segment. There should be no errors. If you see red error output, it means these imports were not successful.
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
# print function
import matplotlib.pyplot as plt
# plotting tool
import numpy as np
# matrix calculations
import os
# file paths
import sys
# standard output (used to print download progress)
import tarfile
# decompression
from IPython.display import display, Image
# display images
from scipy import ndimage
# image processing
from sklearn.linear_model import LogisticRegression
# logistic regression module for linear models
from six.moves.urllib.request import urlretrieve
# url handling
from six.moves import cPickle as pickle
# data processing
# Config the matplotlib backend as plotting inline in IPython
%matplotlib inline
# matplotlib is the most popular Python plotting library; it can output charts
# in many image formats and display them interactively through various GUI toolkits.
# The %matplotlib magic command controls how matplotlib charts are displayed;
# "inline" embeds the charts directly in the Notebook.
Assignment Code Segment Two
Next is the second code segment, which will download letter sets for training and testing, approximately 300MB in size. After successful download, you can see these two files in the mounted directory.
url = 'https://commondatastorage.googleapis.com/books1000/'
last_percent_reported = None
data_root = '.' # Change me to store data elsewhere
def download_progress_hook(count, blockSize, totalSize):
"""A hook to report the progress of a download. This is mostly intended for users with
slow internet connections. Reports every 5% change in download progress.
"""
# Hook function to display download progress in real-time
global last_percent_reported
percent = int(count * blockSize * 100 / totalSize)
if last_percent_reported != percent:
if percent % 5 == 0:
sys.stdout.write("%s%%" % percent)
sys.stdout.flush()
else:
sys.stdout.write(".")
sys.stdout.flush()
last_percent_reported = percent
def maybe_download(filename, expected_bytes, force=False):
"""Download a file if not present, and make sure it's the right size."""
dest_filename = os.path.join(data_root, filename)
# data_root is the current directory, add the filename to it, set as the location to save the file
if force or not os.path.exists(dest_filename):
# force is to force download, ignoring already downloaded files
print('Attempting to download:', filename)
filename, _ = urlretrieve(url + filename, dest_filename, reporthook=download_progress_hook)
# Use urlretrieve to download the file, with the hook attached
print('\nDownload Complete!')
statinfo = os.stat(dest_filename)
# Get information about the downloaded file
if statinfo.st_size == expected_bytes:
# Correct size
print('Found and verified', dest_filename)
else:
# Wrong size, prompt user to use a browser to download
raise Exception(
'Failed to verify ' + dest_filename + '. Can you get to it with a browser?')
return dest_filename
train_filename = maybe_download('notMNIST_large.tar.gz', 247336696)
test_filename = maybe_download('notMNIST_small.tar.gz', 8458043)
Assignment Code Segment Three
Extracting the downloaded archives
num_classes = 10
# Total number of classes (the letters A through J)
np.random.seed(133)
# Initialize random seed
def maybe_extract(filename, force=False):
# Extract the archive, unless it has already been extracted
root = os.path.splitext(os.path.splitext(filename)[0])[0] # remove .tar.gz
# splitext(filename)[0] removes one suffix, used twice to remove both suffixes, i.e., remove the .tar.gz suffix
if os.path.isdir(root) and not force:
# You may override by setting force=True.
# If already extracted, don't extract again
print('%s already present - Skipping extraction of %s.' % (root, filename))
else:
print('Extracting data for %s. This may take a while. Please wait.' % root)
tar = tarfile.open(filename)
sys.stdout.flush()
tar.extractall(data_root)
tar.close()
# Extract to the current directory
data_folders = [
os.path.join(root, d) for d in sorted(os.listdir(root))
if os.path.isdir(os.path.join(root, d))]
if len(data_folders) != num_classes:
raise Exception(
'Expected %d folders, one per class. Found %d instead.' % (
num_classes, len(data_folders)))
print(data_folders)
# Check if the number of extracted directories matches expectations, and print the extracted directories
return data_folders
train_folders = maybe_extract(train_filename)
test_folders = maybe_extract(test_filename)
Question One
Write code to display a sample of the extracted images
- Reference answer
import random
import matplotlib.image as mpimg
def plot_samples(data_folders, sample_size, title=None):
fig = plt.figure()
# Create empty figure
if title: fig.suptitle(title, fontsize=16, fontweight='bold')
# Add title
for folder in data_folders:
# Loop through each letter
image_files = os.listdir(folder)
image_sample = random.sample(image_files, sample_size)
# Randomly select a certain number of images from that letter
for image in image_sample:
image_file = os.path.join(folder, image)
ax = fig.add_subplot(len(data_folders), sample_size, sample_size * data_folders.index(folder) +
image_sample.index(image) + 1)
# Create a subplot
image = mpimg.imread(image_file)
# Load subplot image
ax.imshow(image)
# Display subplot image
ax.set_axis_off()
# Turn off subplot coordinate lines
fig.set_size_inches(18.5, 10.5)
# Set the display size of the image
plt.show()
plot_samples(train_folders, 20, 'Train')
plot_samples(test_folders, 20, 'Test')
Running results:
As we can see, some of the training data has issues.
## Assignment Code Segment Four
After this, we need to normalize the data: shift each pixel from the 0~255 range to a small zero-centered range (the code below maps it to roughly -0.5~0.5) and persist the result to a file.
image_size = 28  # Pixel width and height.
pixel_depth = 255.0  # Number of levels per pixel.
# Image width, height and pixel depth

def load_letter(folder, min_num_images):
  """Load the data for a single letter label."""
  # Process the files in one letter's folder
  image_files = os.listdir(folder)
  # List all files in that directory
  dataset = np.ndarray(shape=(len(image_files), image_size, image_size),
                       dtype=np.float32)
  # Create a dataset with length equal to the number of files, width and height of 28
  print(folder)
  # Print the directory
  num_images = 0
  # Initialize num_images
  for image in image_files:
    # Process each file
    image_file = os.path.join(folder, image)
    # Get the complete file path
    try:
      image_data = (ndimage.imread(image_file).astype(float) -
                    pixel_depth / 2) / pixel_depth
      # Read in the image and normalize it
      if image_data.shape != (image_size, image_size):
        # Check image width and height
        raise Exception('Unexpected image shape: %s' % str(image_data.shape))
      dataset[num_images, :, :] = image_data
      # Store it in the dataset
      num_images = num_images + 1
      # Increment the image count
    except IOError as e:
      # If the file can't be read, skip it
      print('Could not read:', image_file, ':', e, '- it\'s ok, skipping.')

  dataset = dataset[0:num_images, :, :]
  if num_images < min_num_images:
    # If fewer files were read than the minimum required, raise an error
    raise Exception('Many fewer images than expected: %d < %d' %
                    (num_images, min_num_images))

  print('Full dataset tensor:', dataset.shape)
  # Display the file count, image width and height
  print('Mean:', np.mean(dataset))
  # Mean value
  print('Standard deviation:', np.std(dataset))
  # Standard deviation
  return dataset
def maybe_pickle(data_folders, min_num_images_per_class, force=False):
  dataset_names = []
  for folder in data_folders:
    # Process each letter folder
    set_filename = folder + '.pickle'
    # Set the output file
    dataset_names.append(set_filename)
    # Record the processed folders
    if os.path.exists(set_filename) and not force:
      # You may override by setting force=True.
      # Skip folders whose processed file already exists
      print('%s already present - Skipping pickling.' % set_filename)
    else:
      print('Pickling %s.' % set_filename)
      dataset = load_letter(folder, min_num_images_per_class)
      # Normalize all images in this folder
      try:
        with open(set_filename, 'wb') as f:
          pickle.dump(dataset, f, pickle.HIGHEST_PROTOCOL)
          # Persist the data: save it to disk instead of keeping it in memory
      except Exception as e:
        print('Unable to save data to', set_filename, ':', e)
  return dataset_names

train_datasets = maybe_pickle(train_folders, 45000)
test_datasets = maybe_pickle(test_folders, 1800)
## Question Two
Display processed images
- Reference answer
def plot_samples_2(data_folders, sample_size, title=None):
    fig = plt.figure()
    # Create an empty figure
    if title: fig.suptitle(title, fontsize=16, fontweight='bold')
    # Add title
    for folder in data_folders:
        # Loop through each letter's pickle file
        with open(folder, 'rb') as pk_f:
            data = pickle.load(pk_f)
        for index, image in enumerate(data):
            if index < sample_size:
                # Take the first sample_size images of that letter
                ax = fig.add_subplot(len(data_folders), sample_size,
                                     sample_size * data_folders.index(folder) + index + 1)
                # Create a subplot
                ax.imshow(image)
                # Display the image in the subplot
                ax.set_axis_off()
                # Turn off the subplot's coordinate axes
    fig.set_size_inches(18.5, 10.5)
    # Set the display size of the figure
    plt.show()

plot_samples_2(train_datasets, 20, 'Train')
plot_samples_2(test_datasets, 20, 'Test')


## Question Three
Check if the number of files under each letter is similar.
- Reference answer
file_path = 'notMNIST_large/{0}.pickle'
for ele in 'ABCDEFGHIJ':
    # Loop through each letter
    with open(file_path.format(ele), 'rb') as pk_f:
        dat = pickle.load(pk_f)
        # Load the persisted file for this letter
        print('number of pictures in {}.pickle = '.format(ele), dat.shape[0])
        # Print the image count
Results show that the numbers are basically consistent.

## Code Segment—Data Splitting
The data cannot all be loaded into memory at once. This code segment merges the per-letter pickles and splits them into training and validation sets of the desired sizes.
def make_arrays(nb_rows, img_size):
  if nb_rows:
    dataset = np.ndarray((nb_rows, img_size, img_size), dtype=np.float32)
    # Create an empty dataset: nb_rows images of img_size x img_size, as 32-bit floats
    labels = np.ndarray(nb_rows, dtype=np.int32)
    # Create the labels: nb_rows 32-bit integers
  else:
    dataset, labels = None, None
  # Return the created arrays
  return dataset, labels
def merge_datasets(pickle_files, train_size, valid_size=0):
  num_classes = len(pickle_files)
  # Number of classes to process
  valid_dataset, valid_labels = make_arrays(valid_size, image_size)
  # Build the validation set, with valid_size entries
  train_dataset, train_labels = make_arrays(train_size, image_size)
  # Build the training set, with train_size entries
  vsize_per_class = valid_size // num_classes
  tsize_per_class = train_size // num_classes
  # Each class's share of the validation and training sets

  start_v, start_t = 0, 0
  # Start indices: start_v for the validation data, start_t for the training data
  end_v, end_t = vsize_per_class, tsize_per_class
  # End indices: end_v for the validation data, end_t for the training data
  end_l = vsize_per_class + tsize_per_class
  # end_l is how much is taken from each letter set: the per-class validation
  # share plus the per-class training share
  for label, pickle_file in enumerate(pickle_files):
    # Loop through each pickle_file
    try:
      with open(pickle_file, 'rb') as f:
        # Open this persistence file
        letter_set = pickle.load(f)
        # Load the dataset
        # let's shuffle the letters to have random validation and training set
        np.random.shuffle(letter_set)
        # Shuffle the dataset
        if valid_dataset is not None:
          # Only fill the validation set if one was requested (valid_size > 0)
          valid_letter = letter_set[:vsize_per_class, :, :]
          # Take vsize_per_class images from the shuffled data
          # (numpy slicing: http://brieflyx.me/2015/python-module/numpy-array-split/)
          valid_dataset[start_v:end_v, :, :] = valid_letter
          # Put this data into valid_dataset
          valid_labels[start_v:end_v] = label
          # The label is one of 0~9
          start_v += vsize_per_class
          end_v += vsize_per_class
          # Update the indices; at the end of the loop valid_dataset holds
          # valid_size images and valid_labels the label at each position

        train_letter = letter_set[vsize_per_class:end_l, :, :]
        # The remaining shuffled images after the validation part,
        # of length end_l - vsize_per_class = tsize_per_class
        train_dataset[start_t:end_t, :, :] = train_letter
        # At the end of the loop train_dataset holds train_size images
        train_labels[start_t:end_t] = label
        start_t += tsize_per_class
        end_t += tsize_per_class
        # Update the indices
    except Exception as e:
      print('Unable to process data from', pickle_file, ':', e)
      raise

  return valid_dataset, valid_labels, train_dataset, train_labels
train_size = 200000
valid_size = 10000
test_size = 10000

valid_dataset, valid_labels, train_dataset, train_labels = merge_datasets(
  train_datasets, train_size, valid_size)
_, _, test_dataset, test_labels = merge_datasets(test_datasets, test_size)

print('Training:', train_dataset.shape, train_labels.shape)
print('Validation:', valid_dataset.shape, valid_labels.shape)
print('Testing:', test_dataset.shape, test_labels.shape)
## Code Segment—Shuffling Data
Introduction to the permutation function: http://www.jianshu.com/p/f0eb10acaa2d
def randomize(dataset, labels):
  # labels.shape[0] is the number of labels
  permutation = np.random.permutation(labels.shape[0])
  # A random permutation of that many indices
  print(labels.shape[0])
  shuffled_dataset = dataset[permutation, :, :]
  # Shuffle the data
  shuffled_labels = labels[permutation]
  # Shuffle the labels with the same permutation
  return shuffled_dataset, shuffled_labels

train_dataset, train_labels = randomize(train_dataset, train_labels)
test_dataset, test_labels = randomize(test_dataset, test_labels)
valid_dataset, valid_labels = randomize(valid_dataset, valid_labels)
## Question Four
Verify if the shuffled data is correct
- Reference answer
import random

def plot_sample_3(dataset, labels, title):
    fig = plt.figure()
    plt.suptitle(title, fontsize=16, fontweight='bold')
    # Set the title style
    items = random.sample(range(len(labels)), 200)
    # Randomly sample 200 indices from the label range
    for i, item in enumerate(items):
        # For each sampled index
        plt.subplot(10, 20, i + 1)
        # Draw a subplot
        plt.axis('off')
        # Turn off the coordinate axes
        plt.title(chr(ord('A') + labels[item]))
        # Title the subplot with the letter for this label
        plt.imshow(dataset[item])
        # Display the image at the corresponding position
    fig.set_size_inches(18.5, 10.5)
    plt.show()
    # Display the figure

plot_sample_3(train_dataset, train_labels, 'train dataset shuffled')
plot_sample_3(valid_dataset, valid_labels, 'valid dataset shuffled')
plot_sample_3(test_dataset, test_labels, 'test dataset shuffled')

The two similar figures are omitted.
## Code Segment—Saving Data
pickle_file = os.path.join(data_root, 'notMNIST.pickle')
# Output file path

try:
  f = open(pickle_file, 'wb')
  # Open the output file
  save = {
    'train_dataset': train_dataset, 'train_labels': train_labels,
    'valid_dataset': valid_dataset, 'valid_labels': valid_labels,
    'test_dataset': test_dataset, 'test_labels': test_labels,
    }
  # A dictionary mapping strings to ndarrays
  pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
  f.close()
except Exception as e:
  print('Unable to save data to', pickle_file, ':', e)
  raise
## Code Segment—Displaying Saved Data Size
statinfo = os.stat(pickle_file)
print('Compressed pickle size:', statinfo.st_size)
## Question Five
The question, as stated in the notebook:
By construction, this dataset may contain a lot of overlapping samples, including in the validation and test sets. Overlap between training and test can skew the results if you expect to use your model in an environment where there is never an overlap, but in practice this doesn't usually matter. Measure how much overlap there is between training, validation, and test samples.
Optional question:
What about the duplicates between datasets? (For instance, the same letter images)
Create a sanitized validation and test set, and compare your accuracy on those versus your accuracy on the original sets.
The basic idea is that training data should not overlap with test data; otherwise the measured accuracy is misleading.
Reference code:
- Just check the number of duplicate images
import hashlib
pickle_file = os.path.join('.', 'notMNIST.pickle')
try:
    with open(pickle_file, 'rb') as f:
        data = pickle.load(f)
except Exception as e:
    print('Unable to open data from', pickle_file, ':', e)
    raise
# Since the data was saved to disk, it can be reloaded here even after the kernel
# crashes, without rerunning the earlier code. If there's an error, search for
# the exception message online.

def calcOverlap(sourceSet, targetSet, description):
    sourceSetMd5 = np.array([hashlib.md5(img).hexdigest() for img in sourceSet])
    # Build an MD5 table for the source set
    targetSetMd5 = np.array([hashlib.md5(img).hexdigest() for img in targetSet])
    # Build an MD5 table for the target set
    overlap = np.intersect1d(sourceSetMd5, targetSetMd5, assume_unique=False)
    # Intersect the two tables to find the duplicated hashes
    print(description)
    print("overlap", overlap.shape[0], "from", sourceSetMd5.shape[0], "to", targetSetMd5.shape[0])
    print("rate", overlap.shape[0]*100.0/sourceSetMd5.shape[0], "% and",
          overlap.shape[0]*100.0/targetSetMd5.shape[0], "%")
    # Print the overlap count and rate

calcOverlap(data['train_dataset'], data['valid_dataset'], "train_dataset & valid_dataset")
calcOverlap(data['train_dataset'], data['test_dataset'], "train_dataset & test_dataset")
calcOverlap(data['test_dataset'], data['valid_dataset'], "test_dataset & valid_dataset")

- Remove duplicate image resources
To be updated
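Until then, one possible approach of my own (a sketch, not a verified solution) is to reuse the MD5 idea above: hash every image and drop the validation/test images whose hash also appears in the training set, then rerun the classifier on the sanitized sets to compare accuracy.

def sanitize(dataset, labels, reference_dataset):
    # Keep only the images whose MD5 hash does not appear in reference_dataset
    reference_hashes = set(hashlib.md5(img).hexdigest() for img in reference_dataset)
    keep = np.array([hashlib.md5(img).hexdigest() not in reference_hashes
                     for img in dataset])
    return dataset[keep], labels[keep]

# Hypothetical usage:
# clean_test_dataset, clean_test_labels = sanitize(
#     data['test_dataset'], data['test_labels'], data['train_dataset'])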
## Question Six
Use logistic regression to train the model and test it
- Reference code
import random

def disp_sample_dataset(dataset, labels, trueLabels, title=None):
    # Display a sample of the prediction results
    fig = plt.figure()
    if title: fig.suptitle(title, fontsize=16, fontweight='bold')
    # Set the title style
    items = random.sample(range(len(labels)), 200)
    # Randomly select a series of images
    for i, item in enumerate(items):
        plt.subplot(10, 20, i + 1)
        # Create a subplot
        plt.axis('off')
        # Turn off the coordinate lines
        lab = str(chr(ord('A') + labels[item]))
        trueLab = str(chr(ord('A') + trueLabels[item]))
        if lab == trueLab:
            plt.title(lab)
        else:
            plt.title(lab + " but " + trueLab)
        # Title: the predicted letter, plus the true letter when the prediction is wrong
        plt.imshow(dataset[item])
        # Display this image
    fig.set_size_inches(18.5, 10.5)
    plt.show()

def train_and_predict(train_dataset, train_labels, test_dataset, test_labels, sample_size):
    regr = LogisticRegression()
    # Create the classifier
    X_train = train_dataset[:sample_size].reshape(sample_size, 784)
    # Take sample_size training images and flatten each 28x28 image into a 784-dimensional vector
    y_train = train_labels[:sample_size]
    # The corresponding training labels
    regr.fit(X_train, y_train)
    # Train the model
    X_test = test_dataset.reshape(test_dataset.shape[0], 28 * 28)
    # Flatten the test data in the same way
    y_test = test_labels
    # The true labels of the test data
    pred_labels = regr.predict(X_test)
    # The predicted labels
    print('Accuracy:', regr.score(X_test, y_test), 'when sample_size=', sample_size)
    disp_sample_dataset(test_dataset, pred_labels, test_labels, 'sample_size=' + str(sample_size))

train_and_predict(data['train_dataset'], data['train_labels'],
                  data['test_dataset'], data['test_labels'], 1000)

## Model Performance
Sections 22~27 discuss how to judge model performance. We usually hope the model's accuracy can reach 100%, which is obviously impossible, and in chasing accuracy on the training data the model may overfit. To guard against this, we should follow two rules:
- Don't use all of the training data at once; split it into parts and train on a portion at a time.
- A change to the model's parameters counts as effective only when it causes 30 or more cases to flip from wrong to correct.

## Stochastic Gradient Descent
Sections 29~31 explain what stochastic gradient descent is.
During training, to move the model in the optimal direction we need to compute the gradient (derivative) of the loss at the current point.
1. Computing the gradient over all of the data is expensive, so we randomly select a subset of the samples and use their gradient as a stand-in for the true gradient; this is stochastic gradient descent.
2. To reduce the noise introduced by the random selection, we add momentum, which averages recent gradients.
3. To keep the model stable in the later stages of training, we gradually reduce the learning step size.
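A minimal sketch of those three ideas together (my own illustration; grad(W, batch_X, batch_y) stands in for whatever function computes the gradient on a mini-batch):

import numpy as np

def sgd_with_momentum(W, X, y, grad, learning_rate=0.1, decay=0.99,
                      momentum=0.9, batch_size=128, steps=1000):
    velocity = np.zeros_like(W)
    for step in range(steps):
        idx = np.random.choice(len(X), batch_size)   # 1. a random mini-batch
        g = grad(W, X[idx], y[idx])                  # its gradient estimates the true gradient
        velocity = momentum * velocity + g           # 2. momentum smooths out the randomness
        W = W - learning_rate * velocity
        learning_rate *= decay                       # 3. shrink the step size over time
    return W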
End of Course One
> Reference for assignment code
> http://www.hankcs.com/ml/notmnist.html