Occasional blog posts from a random systems engineer

AI/ML Development Intro: Part 1

· Read in about 14 min · (2803 Words)

What is this?

I work in close proximity to a lot of AI/ML/big data engineers and have friends of the same variety. I somewhat understand the basic concepts of a neural network and have some knowledge of vectors etc., but I’m tired of being the one left confused when having conversations… So, my plan is to get stuck in a little.. at least to train something, and to work my way into a problem far enough that I get confused.. and then end up with more of an understanding/appreciation as a result.

I had a quick look at some tutorials, the MNIST handwritten digit recognition being a common one. BUT, I didn’t like the amount of hand-holding, e.g.:

So I’m going to work through several different tutorials:

The goal is to write down my version, with no stones left unturned.

Wording

I found I used AI, ML, and other phrases arbitrarily to just talk about the whole thing. But, breaking this down so I’m a bit more precise:

  • AI
  • ML

Definitions

I’d heard a lot about various tools, but didn’t know exactly what they were, so:

  • Jupyter (well, not new, but will be part of the stack)
    • An interactive environment where you can write Python code, run it piece by piece, see outputs immediately, and add notes or visualizations.
    • Notebook: interactive, cell-based interface
    • Lab: more full-featured, like an IDE for notebooks
    • Basically treated as you would your laptop/local IDE: used for development and development-time building/training, but not used for production (training OR inference)
  • TensorFlow/PyTorch: Both ML frameworks - TensorFlow written by Google, PyTorch by Facebook. If I have time, I will try to develop using both of these, starting with TensorFlow
  • Airflow: A workflow orchestration tool for complex tasks. E.g. data preprocessing, training, evaluation, deployment
    • DAG: Directed Acyclic Graph, basically steps in a multi-stage job that depend on one another, with tasks as “nodes” and edges meaning dependencies.
  • MLFlow: A tool for experiment tracking, model versioning, and deployment.
    • This can host models in its built-in registry
    • Can serve models (production-ready)

My Stack

I wanted a little bit of each of these layers, so I decided to run Jupyter Lab and Airflow inside Docker.

I wanted to expose Airflow to Jupyter, so my DAGs could be written/triggered from the notebook.

Setup

I’m always conscious about installing anything on my MacBook.. I can count the number of applications installed via Homebrew on one hand. I generally use devcontainers for absolutely everything - but this doesn’t really work for interacting with the MacBook’s Metal. After some thought (and discussion), I decided to run common AI/ML tooling on top of Docker on another host.

This will be more similar to “real life” (or at least from my point of view within BigCorp) since everything is hosted in the cloud.

Setting up a basic docker-compose looks like:

version: "3.9"

services:
  postgres:
    image: postgres:15
    container_name: airflow_postgres
    restart: always
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"
  airflow:
    image: apache/airflow:2.8.3-python3.11
    container_name: airflow
    restart: always
    environment:
      # Set UID to align with Jupyter
      AIRFLOW_UID: 50000
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
      AIRFLOW__CORE__LOAD_EXAMPLES: False
      AIRFLOW__CORE__DAGS_FOLDER: /opt/airflow/workspace/dags
      # AIRFLOW__WEBSERVER__AUTH_BACKEND: "airflow.api.auth.backend.basic_auth"
      AIRFLOW__WEBSERVER__SECRET_KEY: "supersecret"
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
    user: "${AIRFLOW_UID:-50000}"
    volumes:
      - ./workspace:/opt/airflow/workspace  # Shared DAGs/workspace
      - airflow_logs:/opt/airflow/logs  # Airflow logs persisted
    ports:
      - "8080:8080"  # Airflow web UI
    command: >
      bash -c "airflow db init &&
               airflow scheduler & 
               airflow webserver"      
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           capabilities: [gpu]  # Optional GPU passthrough

  jupyter:
    image: jupyter/datascience-notebook:latest
    container_name: jupyterlab
    restart: always
    environment:
      # Align UID with Airflow, just to help with file permissions
      NB_UID: 50000
      JUPYTER_ENABLE_LAB: "yes"
    volumes:
      - ./workspace:/home/jovyan/work  # Shared workspace
    ports:
      - "8887:8888"
    command: start-notebook.sh --NotebookApp.token='' --NotebookApp.password=''
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           capabilities: [gpu]  # Optional GPU passthrough

volumes:
  airflow_logs:
  postgres_data:

Getting started

The first thing we need to do is look at the problem we’re trying to solve. I’ll do this only briefly, because I need to learn the workflow before I can understand how to analyse the problem properly (meaning this would be more of a phase in “day two” projects).

But, a brief idea is that we have hand-drawn images, which we will manipulate a bit (downsample to a lower resolution, presumably convert to 1-bit colour pixels). I imagine each of the pixels will end up being an “input” of the model. We’ll have some hidden-layer magic and then the output will be a translation of the interpreted number.

Therefore we can provide the model with an image and it will tell us the number.

Data

The raw source of the data is here: http://yann.lecun.com/exdb/mnist/, but appears to now be empty. From archive.org, we can see 4 files:

  • train-images-idx3-ubyte.gz: training set images (9912422 bytes)
  • train-labels-idx1-ubyte.gz: training set labels (28881 bytes)
  • t10k-images-idx3-ubyte.gz: test set images (1648877 bytes)
  • t10k-labels-idx1-ubyte.gz: test set labels (4542 bytes)

These files are stored in IDX format:

  • Magic number (4 bytes, MSB first / big-endian)
    • First 2 bytes: always 0
    • Third byte: data type
      • 0x08 -> unsigned byte
      • 0x09 -> signed byte
      • 0x0B -> short (2 bytes)
      • 0x0C -> int (4 bytes)
      • 0x0D -> float (4 bytes)
      • 0x0E -> double (8 bytes)
    • Fourth byte: number of dimensions (1 for vectors, 2 for matrices, etc.)
  • Dimensions:
    • After the magic number, you have 4-byte integers for the size of each dimension, MSB first (big-endian).
  • Data
    • Raw bytes of your array, row-major order, last dimension changes fastest (like C arrays).
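As a quick sanity check on that layout, we can build a header by hand with struct and pull it apart again (a toy sketch, not reading a real file):

```python
import struct

# A hand-built IDX header for a 2x3 matrix of unsigned bytes:
# two zero bytes, dtype code 0x08, 2 dimensions, then one
# big-endian 4-byte integer per dimension.
header = struct.pack(">BBBB", 0, 0, 0x08, 2) + struct.pack(">II", 2, 3)

zero1, zero2, dtype_code, ndims = struct.unpack(">BBBB", header[:4])
dims = struct.unpack(f">{ndims}I", header[4:4 + 4 * ndims])

print(dtype_code, ndims, dims)  # 8 2 (2, 3)
```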

The two types of files are described as:

TRAINING SET LABEL FILE (train-labels-idx1-ubyte):
[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000801(2049) magic number (MSB first)
0004     32 bit integer  60000            number of items
0008     unsigned byte   ??               label
0009     unsigned byte   ??               label
........
xxxx     unsigned byte   ??               label

The labels values are 0 to 9.
TRAINING SET IMAGE FILE (train-images-idx3-ubyte):
[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000803(2051) magic number
0004     32 bit integer  60000            number of images
0008     32 bit integer  28               number of rows
0012     32 bit integer  28               number of columns
0016     unsigned byte   ??               pixel
0017     unsigned byte   ??               pixel
........
xxxx     unsigned byte   ??               pixel

Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).

Meaning that for image files, we can read the header, obtain the number of images, the rows and columns and then data for each pixel. For label files, we extract the number of labels and then the data for each label.

Some code to do this would look like:

import numpy as np
import struct

def load_mnist_images(filename):
    with open(filename, 'rb') as f:
        magic, num, rows, cols = struct.unpack(">IIII", f.read(16))
        data = np.frombuffer(f.read(), dtype=np.uint8)
        return data.reshape(num, rows, cols)

def load_mnist_labels(filename):
    with open(filename, 'rb') as f:
        magic, num = struct.unpack(">II", f.read(8))
        return np.frombuffer(f.read(), dtype=np.uint8)

images = load_mnist_images("train-images-idx3-ubyte")
labels = load_mnist_labels("train-labels-idx1-ubyte")

I was able to find a copy of the files here: https://github.com/hamlinzheng/mnist/tree/master/dataset

So, within Jupyter, I started a terminal and cloned this into the home directory (outside of work).

Data storage formats

At least for large datasets, extracting and converting this data into a format that’s more readily readable by the ML libraries is beneficial; it will make all development (and training runs) more efficient.

There’s a couple of different formats:

  • Raw IDX files

    • Minimal binary format with header + data
    • Pros: official format, simple
    • Cons: slow to load, requires parsing, not framework-friendly
  • NumPy (.npy / .npz)

    • Store arrays directly after parsing IDX
    • Pros: super easy to load (np.load), fast for small datasets
    • Cons: single file per array, not optimized for very large datasets
  • PyTorch (.pt / .pth)

    • Save tensors + labels together
    • Pros: native for PyTorch, fast loading, integrates with training scripts
    • Cons: PyTorch-specific
  • HDF5 (.h5)

    • Hierarchical storage for multiple datasets
    • Pros: scalable, random access (load subset), multi-language support
    • Cons: extra dependency (h5py), slightly more complex API
  • TFRecord (TensorFlow)

    • Framework-native binary format for TensorFlow pipelines
    • Pros: streaming-friendly, GPU-optimized, production-ready
    • Cons: harder to inspect manually, TensorFlow-specific
  • WebDataset / LMDB

    • Optimized for sharded, large-scale datasets
    • Pros: supports distributed training, efficient streaming
    • Cons: setup complexity, not beginner-friendly
  • Parquet / Arrow

    • Columnar format for analytics, often combined with metadata
    • Pros: fast for analytics, good for tabular + image data
    • Cons: not ideal for raw image arrays (needs flattening)

After looking through these, HDF5 felt like the best middle ground: used in production environments, but not tied to a particular framework (especially since we’ll be switching later!).
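Just to convince myself about the random-access claim, a quick sketch (demo.h5 is a throwaway filename I made up):

```python
import h5py
import numpy as np

# Write a small dataset, then read back only the first three rows.
# Slicing the h5py dataset only pulls that subset from disk, which
# is the property that made HDF5 attractive here.
data = np.arange(100, dtype=np.uint8).reshape(10, 10)

with h5py.File("demo.h5", "w") as f:
    f.create_dataset("images", data=data, compression="gzip")

with h5py.File("demo.h5", "r") as f:
    first_three = f["images"][:3]  # only this slice is read

print(first_three.shape)  # (3, 10)
```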

Data ingestion

So, to tie this together, we’ll create a notebook that imports the dataset files, inspects them and saves them as HDF5. We’ll then create a DAG to download the dataset and convert it.

We create a simple notebook to interpret and dump the data:

import os
import gzip

import numpy as np
import struct
import h5py

def load_mnist_images(filename):
    with gzip.open(filename, 'rb') as f:
        _, num, rows, cols = struct.unpack(">IIII", f.read(16))
        data = np.frombuffer(f.read(), dtype=np.uint8)
        return data.reshape(num, rows, cols)

def load_mnist_labels(filename):
    with gzip.open(filename, 'rb') as f:
        _, num = struct.unpack(">II", f.read(8))
        return np.frombuffer(f.read(), dtype=np.uint8)

# Load images
images = load_mnist_images(os.path.join(SOURCE_DATA_DIRECTORY, "train-images-idx3-ubyte.gz"))

# Load labels
labels = load_mnist_labels(os.path.join(SOURCE_DATA_DIRECTORY, "train-labels-idx1-ubyte.gz"))

# Save as HDF5
with h5py.File("mnist.h5", "w") as f:
    f.create_dataset("images", data=images, compression="gzip")
    f.create_dataset("labels", data=labels, compression="gzip")

Let’s break this down:

magic, num, rows, cols = struct.unpack(">IIII", f.read(16)): struct.unpack will take some binary data and interpret it as typed values. We’re providing a format of >IIII, meaning “big-endian” (as per the data format spec) followed by four unsigned integers (reference), and passing it the first 16 bytes read from the file handle. We then simply unpack the returned tuple into the magic number (which we don’t use further) and the number of images, rows and columns.
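A quick way to see >IIII in action without touching the real file is to pack the header values from the spec ourselves and unpack them again:

```python
import struct

# The same 16 bytes as they appear at the start of the training
# image file, using the values from the spec above.
header = struct.pack(">IIII", 2051, 60000, 28, 28)

magic, num, rows, cols = struct.unpack(">IIII", header)
print(magic, num, rows, cols)  # 2051 60000 28 28
```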

np.frombuffer(f.read(), dtype=np.uint8): np.frombuffer will take binary data and return an ‘ndarray’, which is a “multi-dimensional, homogeneous array of fixed-size items”. At this point, we’ve passed it a load of data and given it a data type, so realistically all it has done is split it into a big array of uint8s.

You can see that for labels, this is where we stop, because labels is just a flat 1d array. But for the image data, we run through data.reshape(num, rows, cols), which gives the ndarray the structure of the data: we create dimensions for the number of images and the rows and columns of pixels, and all of the data is now indexable via these new dimensions.
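A toy version of the whole frombuffer + reshape dance, small enough to eyeball:

```python
import numpy as np

# Two "images" of 2x2 pixels, laid out as raw bytes exactly as
# they would sit in the file (row-major, last dimension fastest).
raw = bytes([0, 255, 128, 64, 10, 20, 30, 40])

flat = np.frombuffer(raw, dtype=np.uint8)  # one long 1-d array
imgs = flat.reshape(2, 2, 2)               # (num, rows, cols)

print(imgs[1])  # [[10 20]
                #  [30 40]]
```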

Next, let’s take a look at what our data looks like…

This seems to be where Jupyter notebooks sort of shine… now that we have images and labels, we should take a look to see what they’re actually made up of. We can add a very basic:

print(images)
print(labels)

and modify this to interact with them however we wish without having to reread/process the files.

![](/images/ai1/print_labels.png {width='300'}) ![](/images/ai1/print_images.png {width='300'})

So, giving that a go, let’s take a quick look at the labels:

import matplotlib.pyplot as plt

plt.figure(figsize=(8,5))
plt.hist(labels, bins=30, color='skyblue', edgecolor='black')
plt.title("Distribution of Label data")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()

Nice and easy: ![](/images/ai1/label_distribution.png {width='400'})

We can see a relatively even distribution for the data for each number.
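If we want exact counts rather than eyeballing a histogram, np.bincount does the job (toy labels shown here; the real labels array can be passed straight in):

```python
import numpy as np

# np.bincount counts occurrences of each value 0..9, giving exact
# per-class totals to complement the histogram above.
toy_labels = np.array([0, 1, 1, 2, 9, 9, 9], dtype=np.uint8)
counts = np.bincount(toy_labels, minlength=10)

print(counts)  # [1 2 1 0 0 0 0 0 0 3]
```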

How about inspecting some of the images:

import matplotlib.pyplot as plt

idx = 0  # first image
image = images[idx]
label = labels[idx]

# Display the image
plt.figure(figsize=(4,4))
plt.imshow(image, cmap='gray')  # grayscale colormap
plt.title(f"Label: {label}")
plt.axis('off')  # turn off axis
plt.show()

![](/images/ai1/image_preview.png {width='400'})

Data ingestion pipeline

At this point, I’m not entirely sure if using Airflow here will be overkill… but I want to understand how this fits into a more “real” setup, not just a notebook, so let’s give it a go.

We’ll start with a small DAG pipeline that can process what we have done so far:

  • Obtain data
  • Convert it into a manageable format

We’ll create a slightly more dynamic script for performing the conversion:

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
import os
import urllib.request
import subprocess
import struct
import numpy as np
import gzip
import h5py

RAW_DIR = "/opt/airflow/dags/data/raw"
PROCESSED_DIR = "/opt/airflow/dags/data/processed"

MNIST_URLS = {
    "train-images-idx3-ubyte.gz": "https://github.com/hamlinzheng/mnist/raw/refs/heads/master/dataset/train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz": "https://github.com/hamlinzheng/mnist/raw/refs/heads/master/dataset/train-labels-idx1-ubyte.gz",
}


def download_data():
    os.makedirs(RAW_DIR, exist_ok=True)

    for filename, url in MNIST_URLS.items():
        filepath = os.path.join(RAW_DIR, filename)

        if not os.path.exists(filepath):
            print(f"Downloading {filename}...")
            urllib.request.urlretrieve(url, filepath)
        else:
            print(f"{filename} already exists, skipping.")


def load_images(path):
    with gzip.open(path, 'rb') as f:
        _, num, rows, cols = struct.unpack(">IIII", f.read(16))
        data = np.frombuffer(f.read(), dtype=np.uint8)
        return data.reshape(num, rows, cols)


def load_labels(path):
    with gzip.open(path, 'rb') as f:
        _, num = struct.unpack(">II", f.read(8))
        return np.frombuffer(f.read(), dtype=np.uint8)


def convert_data():
    os.makedirs(PROCESSED_DIR, exist_ok=True)

    images = load_images(os.path.join(RAW_DIR, "train-images-idx3-ubyte.gz"))
    labels = load_labels(os.path.join(RAW_DIR, "train-labels-idx1-ubyte.gz"))

    output_path = os.path.join(PROCESSED_DIR, "mnist.h5")

    with h5py.File(output_path, "w") as f:
        f.create_dataset("images", data=images, compression="gzip")
        f.create_dataset("labels", data=labels, compression="gzip")

    print(f"Saved dataset to {output_path}")


with DAG(
    dag_id="mnist_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
    tags=["ml", "mnist"],
) as dag:

    download_task = PythonOperator(
        task_id="download_mnist",
        python_callable=download_data,
    )

    convert_task = PythonOperator(
        task_id="convert_to_hdf5",
        python_callable=convert_data,
    )

    download_task >> convert_task

Then execute using the following:

import requests
import json

AIRFLOW_URL = "http://airflow:8080/api/v1"
DAG_ID = "mnist_pipeline"
USERNAME = "airflow"
PASSWORD = "airflow"

payload = {
    "conf": {}
}

response = requests.post(
    f"{AIRFLOW_URL}/dags/{DAG_ID}/dagRuns",
    auth=(USERNAME, PASSWORD),
    headers={"Content-Type": "application/json"},
    data=json.dumps(payload)
)

print(response.status_code, response.json())
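It would also be nice to check on the run afterwards. Airflow’s stable REST API exposes GET /dags/{dag_id}/dagRuns/{dag_run_id}, whose response includes the run state; a hypothetical little helper for building that URL (the run id comes back in the trigger response above) could look like:

```python
# Hypothetical helper for polling a triggered run. The JSON response
# from GET /dags/{dag_id}/dagRuns/{dag_run_id} includes a "state"
# field (queued/running/success/failed).
AIRFLOW_URL = "http://airflow:8080/api/v1"
DAG_ID = "mnist_pipeline"

def dag_run_status_url(run_id):
    return f"{AIRFLOW_URL}/dags/{DAG_ID}/dagRuns/{run_id}"

url = dag_run_status_url("manual__2024-01-01T00:00:00+00:00")
print(url)
```

Fetch it with requests.get(url, auth=(USERNAME, PASSWORD)) and inspect response.json()["state"].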

Now we have data in ./data/processed!

Notes

Whilst trying to run this, I saw three issues (all fixed in the above docker-compose):

  • Authentication wasn’t working - a user needed to be created - apparently the default airflow:airflow didn’t work
  • Mapping the dags directory directly to the workspace wasn’t ideal, as it mixed things up a lot. Instead, mapping the workspace to a workspace directory inside Airflow and then setting the env variable to look for a dags subdirectory helped organise things a lot. Additionally, mapping exactly the same container path in both containers meant that any absolute paths were identical in both :sigh:
  • Linking the containers to allow DNS resolution was much easier :D

But also: the Airflow image didn’t contain the packages I needed (h5py) and errored during startup because of the DAG code. To combat this, a simple build of a custom Docker image that just performed a pip install appeared to help… but I ran into:

airflow           | ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

Whilst unfortunate, installing h5py with --no-deps did fix it. I wouldn’t recommend it though.. probably the safest approach would have been to dump the installed packages, add h5py and install the lot - at least that way pre-installed packages would have been taken into consideration as dependencies rather than steamrolled over. So, a little Dockerfile.airflow later:

FROM apache/airflow:2.8.3-python3.11

# Install h5py
RUN pip install --no-deps h5py

and a little:

@@ -14,7 +14,9 @@
     ports:
       - "5432:5432"
   airflow:
-    image: apache/airflow:2.8.3-python3.11
+    build:
+      context: .
+      dockerfile: Dockerfile.airflow
     container_name: airflow
     restart: always
     environment:

and all starts with no errors.

And, honestly, I couldn’t be bothered to get authentication working, so just run:

docker exec airflow airflow users create --role Admin --username airflow --email airflow --firstname airflow --lastname airflow --password airflow

Blame PEBKAC or airflow docs, but :shrug: it works.

Neural Network

Next we’ll take a look at the basis of a neural network - my basic understanding is currently just:

Input -> Hidden Layers -> Output

We have our inputs (rows * columns * colour depth) of pixels to provide the image to the model.
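So for MNIST that input count works out as:

```python
# One input neuron per pixel: 28x28 grayscale, a single channel.
rows, cols, channels = 28, 28, 1
n_inputs = rows * cols * channels

print(n_inputs)  # 784
```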

Let’s first read in our data using the new format and validate it:

import h5py
import numpy as np

with h5py.File("./data/processed/mnist.h5", "r") as f:
    images = f["images"][:]
    labels = f["labels"][:]

![](/images/ai1/verify_h5_data.png {width='300'})

To be able to work with the default Jupyter image, I had to install TensorFlow. The options are either to install in a terminal or to build a custom image that installs it on top. On top of this, due to cross-dependencies, I had various issues, so pinning TensorFlow and NumPy worked for me:

pip install tensorflow==2.13 h5py==3.10 numpy==1.24.3

Let’s first gear our data to be suitable for the inputs - we no longer care about the 2-dimensional layout of the images, so we’ll have a set of 1-dimensional pixel values. Since neural-network inputs are typically floats between 0 and 1, we’ll need to convert the pixel values into that range:

x_shaped = images.reshape(images.shape[0], -1)

X = x_shaped.astype("float32") / 255.0
y = labels.astype("int64")

The thing to note here is that images.reshape takes the outer dimension of our data (the number of images, so 60000) and then -1 tells it to “go figure out” the rest, effectively flattening the x/y dimensions into a single dimension of pixels. Realistically, no different to using x_shaped = images.reshape(images.shape[0], 28*28). We then convert all of the values to floats (for the input data) and divide by the maximum pixel value (255, since the colour depth is 1 byte), translating 0-255 -> 0-1.
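To convince myself that the -1 really is equivalent to spelling out 28*28 (using a small stand-in array, since the real one is 60000x28x28):

```python
import numpy as np

# Toy stand-in for `images`: 2 images of 28x28 uint8 pixels.
images = np.arange(2 * 28 * 28).astype(np.uint8).reshape(2, 28, 28)

a = images.reshape(images.shape[0], -1)       # let numpy infer 784
b = images.reshape(images.shape[0], 28 * 28)  # spell it out

print(a.shape, np.array_equal(a, b))  # (2, 784) True
```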

More Definitions

  • Inputs

  • Hidden Layers

  • Outputs

  • Bias

  • Weight