1. Why TinyML Matters
There are over 250 billion microcontrollers deployed worldwide — in appliances, vehicles, industrial sensors, wearables, and agricultural equipment. Until recently, virtually none of them could run machine learning. TinyML changes that.
By pushing ML inference to the smallest, cheapest, lowest-power devices, TinyML enables intelligence where cloud connectivity is impossible, latency is unacceptable, bandwidth is too expensive, or privacy demands that data never leave the device. A $2 microcontroller can now detect keywords, classify sounds, recognise gestures, spot anomalies, and make real-time decisions — entirely offline, running on a coin-cell battery for years.
2. What Is TinyML
TinyML refers to running machine learning inference on microcontrollers (MCUs) and ultra-low-power devices with severe resource constraints:
- Memory: 16 KB – 2 MB of SRAM (vs gigabytes in phones or servers)
- Storage: 64 KB – 2 MB of flash (vs gigabytes or terabytes)
- Compute: 10–400 MHz clock, no GPU, limited FPU
- Power: microwatts to milliwatts (battery or energy harvesting)
The core challenge: compress neural networks that normally require megabytes or gigabytes of memory into models that fit in kilobytes while maintaining acceptable accuracy.
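That gap can be made concrete with quick arithmetic. A minimal sketch (the parameter counts are illustrative, and `model_size_bytes` is a hypothetical helper, not a library function):

```python
# Back-of-the-envelope weight-storage arithmetic (illustrative numbers).
def model_size_bytes(num_params: int, bits_per_weight: int) -> int:
    """Approximate weight storage, ignoring model metadata overhead."""
    return num_params * bits_per_weight // 8

mobile_cnn_params = 4_200_000   # a typical mobile-class CNN
tiny_kws_params = 6_500         # a keyword spotter of the size built later in this guide

print(model_size_bytes(mobile_cnn_params, 32))  # roughly 16.8 MB: far too big for an MCU
print(model_size_bytes(tiny_kws_params, 32))    # roughly 26 KB as float32
print(model_size_bytes(tiny_kws_params, 8))     # roughly 6.5 KB as INT8: fits in kilobytes
```

The arithmetic shows why TinyML relies on both tiny architectures and reduced precision: either one alone is not enough to go from megabytes to kilobytes.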
2.1 TinyML vs Traditional ML
| Dimension | Cloud ML | Edge ML (Phone / SBC) | TinyML (MCU) |
|---|---|---|---|
| Memory | Gigabytes – Terabytes | 2–16 GB | 16 KB – 2 MB |
| Compute | GPUs / TPUs | CPU + GPU / NPU | CPU only (10–400 MHz) |
| Power | Kilowatts | 1–10 W | Microwatts – Milliwatts |
| Connectivity | Always online | Usually online | Often offline |
| Latency | 100–500 ms (network) | 10–50 ms | 1–10 ms (on-chip) |
| Model Size | Gigabytes | 10–500 MB | 10 KB – 500 KB |
| Cost per Unit | $1,000+/month | $50–500 | $1–10 |
3. How On-Device Inference Works
3.1 The Workflow
- Train: Train a model on a powerful machine (GPU server) using standard frameworks (TensorFlow, PyTorch).
- Optimise: Apply quantisation, pruning, and architecture search to shrink the model to target device constraints.
- Convert: Export to a microcontroller-compatible format (TFLite flatbuffer, ONNX, or C array).
- Compile: Embed the model binary into firmware alongside an inference runtime (TF Lite Micro, TVM).
- Run: The MCU runs the model in a fixed-size memory arena, reading sensor data as input and producing predictions as output.
3.2 Memory Layout on MCU
```c
// Typical memory allocation for a TinyML application
// Device: ARM Cortex-M4 with 256 KB SRAM, 1 MB Flash

// Flash (read-only):
//   Model weights (flatbuffer)    ~120 KB
//   Firmware code                  ~80 KB
//   Lookup tables / constants      ~20 KB
//   Total flash used:             ~220 KB / 1 MB

// SRAM (read-write):
//   Tensor arena (activations)     ~40 KB
//   Input buffer (audio window)    ~16 KB
//   Output buffer                   ~1 KB
//   Stack + heap                   ~10 KB
//   Peripheral DMA buffers          ~8 KB
//   Total SRAM used:               ~75 KB / 256 KB
```
3.3 Inference Loop
On an MCU, inference runs in a tight loop: read sensor → preprocess (e.g., compute MFCCs for audio) → invoke model → postprocess → act on result. The entire cycle typically takes 1–50 ms depending on model complexity and MCU clock speed.
4. Hardware Platforms — Comparison
| Platform | Core | RAM | Flash | Clock | ML Accelerator | Price |
|---|---|---|---|---|---|---|
| Arduino Nano 33 BLE Sense | Cortex-M4F | 256 KB | 1 MB | 64 MHz | None | ~$33 |
| ESP32-S3 | Xtensa LX7 (dual) | 512 KB | 8 MB (ext.) | 240 MHz | Vector instructions | ~$5 |
| STM32H747 | Cortex-M7 + M4 | 1 MB | 2 MB | 480 MHz | None (DSP) | ~$20 |
| Raspberry Pi Pico 2 | Cortex-M33 (dual) | 520 KB | 4 MB | 150 MHz | None | ~$5 |
| MAX78000 | Cortex-M4F + RISC-V | 512 KB | 512 KB | 100 MHz | CNN accelerator (64 cores) | ~$10 |
| Nordic nRF5340 | Cortex-M33 + M33 | 512 + 64 KB | 1 MB + 256 KB | 128+64 MHz | None | ~$8 |
| GAP9 (GreenWaves) | RISC-V (9 cores) | 1.5 MB | 2 MB | 400 MHz | Hardware loops, SIMD | ~$12 |
5. Frameworks & Tools
| Framework | Runtime | Training | Deployment | Best For |
|---|---|---|---|---|
| TF Lite Micro | C++ (bare-metal) | TensorFlow / Keras | Flatbuffer → C array | Broadest device support, Google ecosystem |
| Edge Impulse | C++ (auto-generated) | Web-based (AutoML) | One-click OTA | Rapid prototyping, no ML expertise needed |
| CMSIS-NN | C (Arm optimised) | Any (manual integration) | Library linking | Max performance on Cortex-M |
| MicroTVM (Apache TVM) | C (generated) | Any framework | Ahead-of-time compiled | Custom hardware targets, max optimisation |
| ONNX Micro Runtime | C (lightweight) | PyTorch / ONNX | ONNX model loading | PyTorch users, ONNX ecosystem |
| STM32Cube.AI | C (STM32 optimised) | Keras / TFLite / ONNX | STM32 project generation | STM32 developers, IDE integration |
6. Model Optimisation Deep Dive
6.1 Quantisation
The single most impactful technique for TinyML. Converts model weights and activations from 32-bit floating point to 8-bit integers (or lower), reducing model size by 4× and speeding up inference 2–4× on integer-only MCU hardware.
- Post-Training Quantisation (PTQ): Apply quantisation after training with a small calibration dataset. No retraining needed. Typical accuracy loss: 0.5–2%.
- Quantisation-Aware Training (QAT): Simulate quantisation during training so the model learns to compensate. Better accuracy at the cost of longer training. Essential for very small models.
- Mixed-Precision: Quantise different layers to different bit widths (INT8 for most, FP16 or INT16 for sensitive layers) to balance size and accuracy.
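The INT8 mapping itself is a simple affine transform: `q = round(x / scale) + zero_point`. A minimal numpy sketch, with scale and zero-point derived from the observed data range (roughly what PTQ calibration estimates per tensor; constant inputs are not handled):

```python
import numpy as np

def quantise_int8(x: np.ndarray):
    """Affine INT8 quantisation: q = round(x / scale) + zero_point."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0          # map the range onto 256 int8 levels
    zero_point = int(round(-128 - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantise(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

x = np.linspace(-1.0, 1.0, 11, dtype=np.float32)
q, s, z = quantise_int8(x)
x_hat = dequantise(q, s, z)
print(np.max(np.abs(x - x_hat)))  # worst-case rounding error, approximately scale / 2
```

The worst-case error is about half a quantisation step, which is why accuracy loss stays small as long as the value range is well covered by the calibration data.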
6.2 Pruning
Remove weights (or entire neurons/filters) that contribute least to the model's output:
- Unstructured pruning: Zero out individual weights. Reduces theoretical compute but requires sparse matrix support (limited on MCUs).
- Structured pruning: Remove entire filters or channels. Directly reduces model size and compute without sparse support. Better for MCUs.
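Structured pruning can be sketched in plain numpy: rank convolution filters by L1 norm (a common saliency score) and drop the weakest. This is a toy illustration of the selection step only, not a full training-aware pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy conv layer: 16 filters, each 3x3 over 8 input channels.
filters = rng.normal(size=(16, 8, 3, 3))

# Score each filter by the L1 norm of its weights.
l1 = np.abs(filters).reshape(16, -1).sum(axis=1)
keep = np.argsort(l1)[8:]             # keep the 8 strongest filters (50% pruning)
pruned = filters[np.sort(keep)]

print(filters.shape, "->", pruned.shape)  # (16, 8, 3, 3) -> (8, 8, 3, 3)
```

Note that removing a filter also removes the corresponding input channel of the next layer, which is why structured pruning shrinks real memory and compute on MCUs without any sparse-matrix support.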
6.3 Knowledge Distillation
Train a small "student" model to mimic the predictions of a larger "teacher" model. The student learns from the teacher's soft probability distributions, capturing knowledge that direct training on hard labels may miss. This can improve small-model accuracy by 2–5%.
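The distillation objective is usually a temperature-softened cross-entropy (KL divergence) between teacher and student outputs. A numpy sketch of that loss term alone; the logits and the temperature `T = 4.0` are illustrative:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, T)    # soft targets from the teacher
    q = softmax(student_logits, T)
    # The T*T factor keeps gradient magnitudes comparable across temperatures.
    return float(np.sum(p * (np.log(p) - np.log(q))) * T * T)

teacher = np.array([4.0, 1.0, 0.5, 0.2, 0.1])  # confident but informative
student = np.array([2.0, 1.5, 0.4, 0.3, 0.2])
print(distillation_loss(student, teacher))
```

In practice this term is combined with the ordinary hard-label loss; the high temperature exposes the teacher's relative rankings of the wrong classes, which is where the extra signal comes from.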
6.4 Neural Architecture Search (NAS)
Automatically search for model architectures that meet specific constraints (memory, latency, accuracy). MCUNet (MIT) and MicroNets use hardware-aware NAS to find architectures specifically optimised for target MCUs.
6.5 Optimisation Impact
| Technique | Size Reduction | Speed Improvement | Accuracy Impact | Complexity |
|---|---|---|---|---|
| INT8 Quantisation (PTQ) | 4× | 2–4× | -0.5 to -2% | Low |
| INT8 QAT | 4× | 2–4× | -0.1 to -0.5% | Medium |
| Structured Pruning (50%) | 2× | 1.5–2× | -1 to -3% | Medium |
| Knowledge Distillation | — | — | +2 to +5% | Medium |
| NAS (MCUNet) | Custom | Custom | Best for budget | High |
| Pruning + QAT combined | 8–10× | 4–8× | -1 to -3% | High |
7. Practical Code — Train & Quantise a Model
Train a keyword-spotting model on the Speech Commands dataset, then quantise it for deployment on a microcontroller.
7.1 Install Dependencies
```bash
pip install tensorflow numpy
```
7.2 Train a Tiny Keyword Spotter
```python
import tensorflow as tf
import numpy as np

# --- 1. Load the Speech Commands dataset ---
# Subset: "yes", "no", "up", "down", plus background noise
CLASSES = ["yes", "no", "up", "down", "_silence_"]

# In production, load real audio data and extract MFCCs.
# For demonstration, simulate MFCC features:
# each sample = 49 time frames x 10 MFCC coefficients.
NUM_SAMPLES = 5000
X_train = np.random.randn(NUM_SAMPLES, 49, 10, 1).astype(np.float32)
y_train = np.random.randint(0, len(CLASSES), NUM_SAMPLES)

# --- 2. Build a tiny CNN ---
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, (3, 3), activation="relu",
                           input_shape=(49, 10, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(len(CLASSES), activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# Total params: ~6,500 — tiny enough for most MCUs

# --- 3. Train ---
model.fit(X_train, y_train, epochs=10, batch_size=32,
          validation_split=0.2)
```
7.3 Quantise to INT8
```python
# --- 4. Post-Training Quantisation ---
def representative_dataset():
    """Provide calibration data for quantisation."""
    for i in range(100):
        yield [X_train[i:i+1]]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS_INT8
]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()

# Save the quantised model
with open("keyword_model_int8.tflite", "wb") as f:
    f.write(tflite_model)

original_size = model.count_params() * 4  # float32 = 4 bytes per weight
quantised_size = len(tflite_model)
print(f"Original:  ~{original_size:,} bytes")
print(f"Quantised: {quantised_size:,} bytes")
print(f"Reduction: {original_size / quantised_size:.1f}x")
```
7.4 Convert to C Array
```python
# --- 5. Export as C header for firmware ---
def tflite_to_c_array(model_bytes, var_name="keyword_model"):
    """Convert a TFLite model to a C header file."""
    hex_lines = []
    for i in range(0, len(model_bytes), 12):
        chunk = model_bytes[i:i+12]
        hex_lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk))
    c_array = ",\n".join(hex_lines)
    header = (
        f"// Auto-generated — do not edit\n"
        f"#ifndef {var_name.upper()}_H\n"
        f"#define {var_name.upper()}_H\n\n"
        f"alignas(16) const unsigned char {var_name}[] = {{\n"
        f"{c_array}\n}};\n"
        f"const unsigned int {var_name}_len = {len(model_bytes)};\n\n"
        f"#endif  // {var_name.upper()}_H\n"
    )
    return header

with open("keyword_model.h", "w") as f:
    f.write(tflite_to_c_array(tflite_model))
print("C header written: keyword_model.h")
```
8. Practical Code — Deploy on Microcontroller
Embed the quantised model in firmware using TF Lite Micro (Arduino / PlatformIO compatible).
```cpp
// keyword_detector.ino — Arduino sketch for keyword spotting
#include <TensorFlowLite.h>
#include "tensorflow/lite/micro/all_ops_resolver.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/schema/schema_generated.h"
#include "keyword_model.h"  // Generated C array

// --- Configuration ---
constexpr int kTensorArenaSize = 24 * 1024;  // 24 KB for activations
static uint8_t tensor_arena[kTensorArenaSize] __attribute__((aligned(16)));

static tflite::AllOpsResolver resolver;
static const tflite::Model* model = nullptr;
static tflite::MicroInterpreter* interpreter = nullptr;
static TfLiteTensor* input = nullptr;
static TfLiteTensor* output = nullptr;

const char* CLASSES[] = {"yes", "no", "up", "down", "silence"};
constexpr int NUM_CLASSES = 5;

void setup() {
  Serial.begin(115200);
  while (!Serial) {}

  // Load model
  model = tflite::GetModel(keyword_model);
  if (model->version() != TFLITE_SCHEMA_VERSION) {
    Serial.println("Model schema mismatch!");
    return;
  }

  // Create interpreter
  static tflite::MicroInterpreter static_interpreter(
      model, resolver, tensor_arena, kTensorArenaSize);
  interpreter = &static_interpreter;

  // Allocate tensors
  if (interpreter->AllocateTensors() != kTfLiteOk) {
    Serial.println("AllocateTensors() failed!");
    return;
  }
  input = interpreter->input(0);
  output = interpreter->output(0);

  Serial.println("TinyML keyword detector ready.");
  Serial.print("Input shape: ");
  Serial.print(input->dims->data[1]);    // time frames
  Serial.print(" x ");
  Serial.println(input->dims->data[2]);  // MFCC bins
  Serial.print("Arena used: ");
  Serial.print(interpreter->arena_used_bytes());
  Serial.println(" bytes");
}

void loop() {
  // 1. Capture audio and compute MFCCs
  //    (use a PDM mic + MFCC library — device-specific)
  // fill_mfcc_buffer(input->data.int8);

  // 2. Run inference
  unsigned long t0 = micros();
  if (interpreter->Invoke() != kTfLiteOk) {
    Serial.println("Invoke failed!");
    return;
  }
  unsigned long inference_us = micros() - t0;

  // 3. Find top prediction
  int8_t max_score = -128;
  int max_idx = 0;
  for (int i = 0; i < NUM_CLASSES; i++) {
    if (output->data.int8[i] > max_score) {
      max_score = output->data.int8[i];
      max_idx = i;
    }
  }

  // 4. Report result
  float confidence = (max_score - output->params.zero_point)
                     * output->params.scale;
  Serial.print("Detected: ");
  Serial.print(CLASSES[max_idx]);
  Serial.print(" (");
  Serial.print(confidence, 3);
  Serial.print(") in ");
  Serial.print(inference_us);
  Serial.println(" us");

  delay(1000);  // Run every second
}
```
9. Real-World Use Cases
9.1 Keyword Spotting & Wake Words
The most common TinyML application. Devices listen for specific words ("Hey Siri", "OK Google") on-chip without sending audio to the cloud. Only when a keyword is detected does the device activate cloud processing — preserving privacy and saving bandwidth.
9.2 Predictive Maintenance
Sensors attached to motors, bearings, and pumps collect vibration and acoustic data. A TinyML model detects anomalous patterns (early bearing wear, imbalance, cavitation) and alerts maintenance teams before failures occur — saving downtime and replacement costs.
9.3 Wearables & Health
Activity recognition (walking, running, sleeping), fall detection, heart rhythm anomaly detection, and stress monitoring — all running on wristband-class hardware with days of battery life. Data stays on the wearable; only alerts are sent to phones.
9.4 Agriculture
Solar-powered sensors in fields classify pest sounds, detect crop diseases from camera images, and monitor soil conditions. Connectivity is often unavailable in remote farmland, making on-device inference the only viable approach.
9.5 Smart Home & Appliances
Washing machines that detect load type and adjust cycles, refrigerators that track food inventory, and HVAC systems that learn occupancy patterns — all using embedded TinyML without cloud dependency.
9.6 Wildlife Conservation
Audio classifiers deployed in forests identify bird species, detect chainsaw sounds (illegal logging), or monitor animal populations — running on battery-powered devices for months in remote locations.
10. Power & Memory Budget Planning
10.1 Power Budget Example
| Component | Active Power | Duty Cycle | Avg. Power |
|---|---|---|---|
| MCU (Cortex-M4 @ 64 MHz) | 10 mW | 5% (inference only) | 0.5 mW |
| Microphone (PDM) | 0.5 mW | 100% (always listening) | 0.5 mW |
| BLE radio (transmit alerts) | 15 mW | 0.1% (rare) | 0.015 mW |
| Deep sleep (MCU idle) | 0.01 mW | 95% | 0.0095 mW |
| Total average | — | — | ~1.0 mW |
With a 250 mAh coin-cell battery (CR2032) at 3 V = 750 mWh capacity, this system runs for ~750 hours (31 days) on a single battery — with continuous keyword detection.
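The duty-cycle arithmetic behind those figures can be checked in a few lines (numbers taken from the table above):

```python
# Duty-cycle weighted average power, using the figures from the budget table.
budget = [
    ("MCU inference", 10.0,  0.05),   # active power in mW, fraction of time active
    ("PDM mic",        0.5,  1.00),
    ("BLE radio",     15.0,  0.001),
    ("Deep sleep",     0.01, 0.95),
]
avg_mw = sum(power * duty for _, power, duty in budget)

battery_mwh = 250 / 1000 * 3 * 1000   # 250 mAh at 3 V = 750 mWh
hours = battery_mwh / avg_mw

print(f"Average power: {avg_mw:.3f} mW")   # ~1.0 mW
print(f"Runtime: {hours:.0f} h ({hours / 24:.1f} days)")
```

Note how the always-on microphone and the sleeping MCU dominate the budget equally, while the radio is negligible; this is typical for always-listening designs and is why mic selection matters as much as MCU selection.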
10.2 Memory Budget Checklist
- Model weights in flash: aim for < 50% of flash capacity (leave room for firmware updates)
- Tensor arena in SRAM: size with `interpreter->arena_used_bytes()` and add a 10% margin
- Input buffers: audio windows, sensor FIFO buffers — size these first; they are non-negotiable
- Stack + heap: reserve 8–16 KB minimum for system operations
- DMA buffers: peripheral-specific, often fixed by hardware constraints
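The checklist above can be turned into a quick design-time budget script. A sketch using the section 3.2 example figures (adjust the numbers for your part):

```python
# SRAM budget check for a 256 KB part, mirroring the layout in section 3.2.
SRAM_KB = 256
budget_kb = {
    "tensor_arena": 40,
    "input_buffer": 16,
    "output_buffer": 1,
    "stack_heap": 10,
    "dma_buffers": 8,
}

used = sum(budget_kb.values())
print(f"SRAM used: {used} KB / {SRAM_KB} KB ({100 * used / SRAM_KB:.0f}%)")

# Fail early if the budget leaves no headroom for growth or fragmentation.
assert used < SRAM_KB * 0.8, "leave headroom for fragmentation and growth"
```

Running a check like this in CI catches budget regressions (for example, a model update that grows the tensor arena) before firmware ever reaches hardware.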
11. Deployment Pipeline
- Data collection: Gather representative sensor data from the target environment. Edge Impulse provides data collection SDKs for many devices.
- Feature engineering: Extract features appropriate for the task (MFCCs for audio, FFT for vibration, statistical features for accelerometer).
- Model training: Train on GPU server using TensorFlow, PyTorch, or Edge Impulse Studio.
- Optimisation: Quantise (PTQ or QAT), prune if needed, verify accuracy on test set.
- Conversion: Export to TFLite → C array, or use STM32Cube.AI / Edge Impulse deployment.
- Firmware integration: Embed model and runtime in MCU firmware. Set up sensor → preprocess → inference → action loop.
- On-device testing: Measure latency, memory usage, power consumption, and accuracy on real hardware.
- OTA updates: Design a mechanism to update models wirelessly for continuous improvement.
12. TinyML vs Edge AI vs Cloud AI
| Factor | TinyML (MCU) | Edge AI (Phone / SBC) | Cloud AI |
|---|---|---|---|
| Model complexity | Simple (CNNs, tiny RNNs) | Medium (MobileNet, BERT-tiny) | Unlimited (GPT-4, Gemini) |
| Latency | 1–10 ms | 10–100 ms | 100–1000 ms |
| Privacy | Excellent (data stays on chip) | Good (on-device processing) | Depends on provider |
| Connectivity | None required | Intermittent OK | Always required |
| Power | Microwatts–Milliwatts | Watts | Kilowatts |
| Unit cost | $1–10 | $50–500 | Per-request pricing |
| Best for | Always-on sensing, battery devices | Mobile apps, on-premise | Complex reasoning, generation |
13. Limitations & Challenges
- Model capacity: Tiny models cannot match the accuracy of large models on complex tasks. TinyML excels at narrow tasks (detection, classification) not general reasoning.
- Training is off-device: MCUs run inference only. Training (or even fine-tuning) on-device is impractical with current hardware — though research on on-device learning is active.
- Tooling maturity: The TinyML toolchain is less mature than cloud ML. Debugging on MCUs is harder, and framework support varies across chip vendors.
- Fragmented ecosystem: No single framework runs optimally on all MCUs. Each vendor has preferred tools (STM32Cube.AI, ESP-NN, CMSIS-NN), creating lock-in risk.
- Model updates: Pushing updated models to deployed MCUs requires OTA infrastructure that many embedded systems lack.
- Security: Model weights stored in flash can be extracted via physical access (JTAG, flash dumping). IP protection on MCUs is limited compared to servers.
14. Future Directions
- On-device training: Research into training (or at least fine-tuning) models directly on MCUs using techniques like forward-mode differentiation and sparse updates.
- Dedicated ML accelerators: Chips like MAX78000 and GAP9 include hardware specifically designed for neural network inference, delivering 10–100× efficiency gains over general-purpose MCUs.
- Sub-milliwatt inference: Analogue computing, in-memory compute, and neuromorphic chips promise inference at microwatt power levels.
- Foundation models for TinyML: Tiny versions of foundation models that can perform multiple tasks on a single MCU, enabling more flexible on-device intelligence.
- Federated learning on MCUs: Distributing model improvement across thousands of deployed devices without centralising data.
15. Frequently Asked Questions
What is the smallest useful TinyML model?
A keyword spotter (e.g., "yes"/"no" detection) can work with a model under 20 KB using a tiny CNN or DS-CNN (depthwise separable CNN). The smallest practical models run on MCUs with just 32 KB of SRAM.
Can I train a model directly on a microcontroller?
Not practically with current hardware. MCUs lack the memory and compute for backpropagation. Training is done on powerful machines; only inference runs on the MCU. Research into on-device training exists but is pre-production.
Which framework should I start with?
If you are new to TinyML, start with Edge Impulse — it handles the full pipeline from data collection to deployment with a web UI. For more control, use TensorFlow Lite Micro directly. If you are on STM32, STM32Cube.AI integrates well with the STM32 IDE.
How much accuracy do I lose from quantisation?
Typically 0.5–2% with post-training quantisation (INT8). With quantisation-aware training, the loss is often below 0.5%. For very small models, QAT is essential because every bit of accuracy matters.
Can TinyML models process images?
Yes, but at low resolution. Person detection (96×96 pixels, binary classification) is a standard TinyML benchmark and runs on devices like the Arduino Nano 33 BLE. Higher resolutions require more capable MCUs (ESP32-S3, STM32H747) or dedicated accelerators.
What about security for deployed models?
Model weights in flash can be extracted via physical access. Use MCUs with secure boot, encrypted flash, and debug port lockout. For high-value IP, consider chips with hardware security modules (TrustZone, secure enclaves).
How do I update a deployed TinyML model?
Over-the-air (OTA) updates via BLE or Wi-Fi. Design firmware with a dual-bank flash layout (active + update partition) so the device can fall back to the previous model if an update fails. Edge Impulse and Arduino IoT Cloud support OTA for TinyML models.
16. Glossary
- TinyML
- Machine learning inference on microcontrollers and ultra-low-power devices with kilobytes of memory.
- Microcontroller (MCU)
- A compact integrated circuit with CPU, memory (SRAM + Flash), and peripherals on a single chip, designed for embedded applications.
- Quantisation
- Converting model parameters from high-precision (FP32) to lower-precision (INT8, INT4) representations to reduce size and improve speed.
- Tensor Arena
- A pre-allocated block of SRAM used by the TF Lite Micro runtime to store intermediate activations during inference.
- MFCC (Mel-Frequency Cepstral Coefficients)
- A representation of audio that captures perceptually relevant frequency information, commonly used as input features for speech/audio ML models.
- CMSIS-NN
- Arm's optimised library of neural network functions for Cortex-M processors, providing hand-tuned implementations of common operations.
- OTA (Over-the-Air)
- Wirelessly updating firmware or model weights on deployed devices without physical access.
- Knowledge Distillation
- Training a small model to mimic a larger, more accurate model by learning from its soft prediction probabilities.
- NAS (Neural Architecture Search)
- Automated search for optimal model architectures that meet specific hardware constraints (memory, latency, power).
- Depthwise Separable Convolution
- A factored convolution that applies a single filter per input channel, then combines channels with 1×1 convolutions. Reduces parameters and compute by 8–9× vs standard convolutions.
- Flatbuffer
- An efficient serialisation format used by TensorFlow Lite to store model architecture and weights in a compact, memory-mappable binary format.
17. References & Further Reading
- TensorFlow Lite for Microcontrollers — Official Guide
- Edge Impulse — End-to-End TinyML Platform
- Arm CMSIS-NN & Cortex-M Documentation
- Apache TVM — Model Compiler for Custom Hardware
- TinyML Foundation — Community & Resources
- Lin et al. — MCUNet: Tiny Deep Learning on IoT Devices (NeurIPS 2020)
- Banbury et al. — MLPerf Tiny Benchmark (2021)
- Arduino Nano 33 BLE Sense — Official Product Page
Start building: grab an Arduino Nano 33 BLE Sense or an ESP32-S3, install Edge Impulse CLI, collect 2 minutes of audio data, train a keyword spotter, and deploy it to the board. You will have a working TinyML device in under an hour — and a concrete understanding of what fits (and what does not) in kilobytes of memory.