TensorFlow.js — Running AI Directly in the Browser

May 6, 2026 · 8 min read

Founder of @TautornTech

Machine learning has long seemed distant from frontend work: Python, powerful servers, expensive GPUs. That has changed. Today you can run a trained model directly in the browser, with no backend, using plain JavaScript and the user's own GPU.

This article explains how this works in practice, using TensorFlow.js as a foundation — and how I applied it in a project called SeeFood, a classifier that solves humanity's greatest problem: knowing whether what you're about to eat is a hot dog or not.

What is TensorFlow.js?

TensorFlow.js is the JavaScript version of TensorFlow — Google's open-source machine learning library. With it, you can:

Run pre-trained models directly in the browser, without a server;
Train models from scratch using JavaScript;
Do transfer learning: take an existing pre-trained model and fine-tune it for your use case;
Convert models trained in Python to run in the browser.

The appeal is clear: inference (the process of making a prediction) happens on the user's machine, using local resources. Once the model has been downloaded, no input data is sent to a server and predictions involve no network round trip. Privacy by default.

Tensors: the foundation of everything

Before talking about models, you need to understand the central concept: the tensor.

A tensor is a multidimensional array: it generalizes a scalar (0D), a vector (1D), and a matrix (2D) to any number of dimensions.

import * as tf from '@tensorflow/tfjs'

// Scalar (0D)
const a = tf.scalar(5)

// Vector (1D)
const b = tf.tensor1d([1, 2, 3])

// Matrix (2D)
const c = tf.tensor2d([[1, 2], [3, 4]])

// 4D tensor — image format: [batch, height, width, channels]
const image = tf.zeros([1, 224, 224, 3])

In the context of images, a single 224×224px color photo has shape [224, 224, 3]. Models take batches, so it is passed in as a 4D tensor with shape [1, 224, 224, 3]: 1 image, 224 pixels of height, 224 of width, 3 channels (R, G, B).

All processing inside a neural network — from input to output — is done with operations on tensors.
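To make the shape notation concrete, here is a minimal sketch in plain JavaScript (no TensorFlow.js needed) of how a shape maps onto a flat buffer of numbers. The helpers `sizeFromShape` and `flatIndex` are illustrative names, not part of the TensorFlow.js API, and the row-major layout is an assumption that matches how tensor values come back as a flat typed array.

```javascript
// Number of values a tensor of the given shape holds.
function sizeFromShape(shape) {
  return shape.reduce((total, dim) => total * dim, 1)
}

// Position of one element inside the flat, row-major buffer.
function flatIndex(shape, indices) {
  let index = 0
  for (let axis = 0; axis < shape.length; axis++) {
    index = index * shape[axis] + indices[axis]
  }
  return index
}

const imageShape = [1, 224, 224, 3]
console.log(sizeFromShape(imageShape))           // 150528
console.log(flatIndex(imageShape, [0, 0, 1, 0])) // 3 (red channel of the second pixel)
```

So a single 224×224 RGB image is simply 150,528 numbers, and the shape tells every operation how to walk through them.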

How does an AI model work?

A neural network is a sequence of layers that transform an input tensor into an output tensor. Each layer applies a mathematical transformation using its weights.
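As a rough illustration of "a layer applies a transformation using its weights", here is a sketch of one fully connected layer in plain JavaScript. The function name `denseLayer`, the ReLU activation, and the weight values are all assumptions chosen for the example; real layers operate on tensors and have far more weights.

```javascript
// One fully connected layer:
// output[j] = relu(sum over i of weights[j][i] * input[i], plus biases[j])
function denseLayer(input, weights, biases) {
  return weights.map((row, j) => {
    const weightedSum = row.reduce((sum, w, i) => sum + w * input[i], biases[j])
    return Math.max(0, weightedSum) // ReLU activation
  })
}

const input = [2, 3]
const weights = [[1, -1], [0.5, 0.5]] // hypothetical values, one row per output unit
const biases = [0, -1]

console.log(denseLayer(input, weights, biases)) // [ 0, 1.5 ]
```

Stacking several such functions, each feeding its output to the next, is all a "sequence of layers" means.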

What are weights?

Weights are the model's internal parameters — numbers that were adjusted during training so the model learns to recognize patterns.

Imagine a simple edge detector: the weights define which pixel pattern activates that detector. Training is the process of finding the best values for these weights, using a dataset and an algorithm called backpropagation with gradient descent.

At the end of training, these weights are saved to a file. When you load a pre-trained model, you're loading exactly those weights — the "knowledge" that the model accumulated.
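To give a feel for what "finding the best values" means, here is a toy sketch of gradient descent with a single weight, in plain JavaScript. The function name `trainSingleWeight`, the dataset, and the learning rate are made up for illustration. With only one weight the gradient can be written by hand; backpropagation is the procedure that computes the same kind of gradient through many layers.

```javascript
// Fit y = w * x to the data by gradient descent on the mean squared error.
function trainSingleWeight(xs, ys, learningRate, steps) {
  let w = 0 // the single weight, starting from an arbitrary value
  for (let step = 0; step < steps; step++) {
    // d/dw mean((w*x - y)^2) = mean(2 * (w*x - y) * x)
    let gradient = 0
    for (let i = 0; i < xs.length; i++) {
      gradient += 2 * (w * xs[i] - ys[i]) * xs[i]
    }
    gradient /= xs.length
    w -= learningRate * gradient // move against the gradient
  }
  return w
}

// Toy dataset generated from y = 2x, so the ideal weight is 2.
const learned = trainSingleWeight([1, 2, 3, 4], [2, 4, 6, 8], 0.01, 200)
console.log(learned.toFixed(4)) // 2.0000
```

A pre-trained model is the result of this same loop run over millions of weights and images, with the final values written to disk.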

Simplified flow example:

Image (224x224x3)
      ↓
Convolutional Layer (detects edges, textures...)
      ↓
Pooling (reduces dimensionality)
      ↓
... more layers ...
      ↓
Dense Layer (classification)
      ↓
Softmax → [prob_class_0, prob_class_1, ..., prob_class_999]

At the end, for image classification, the model returns an array of probabilities — one per class. The highest probability indicates the prediction.
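The last step of that flow can be sketched in a few lines of plain JavaScript. `softmax` and `argMax` here are hand-written illustrations with hypothetical input scores, not the TensorFlow.js implementations.

```javascript
// Turn raw scores (logits) into probabilities that sum to 1.
function softmax(logits) {
  const maxLogit = Math.max(...logits) // subtract the max for numerical stability
  const exps = logits.map((value) => Math.exp(value - maxLogit))
  const total = exps.reduce((sum, value) => sum + value, 0)
  return exps.map((value) => value / total)
}

// Index of the largest probability, i.e. the predicted class.
function argMax(values) {
  let best = 0
  for (let i = 1; i < values.length; i++) {
    if (values[i] > values[best]) best = i
  }
  return best
}

const probabilities = softmax([1.0, 3.0, 0.5]) // hypothetical scores for 3 classes
console.log(argMax(probabilities)) // 1
```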

WebGL: the GPU serving JavaScript

Here's the secret that makes all of this viable in the browser: WebGL.

In the browser, TensorFlow.js uses WebGL as its default backend when it is available, executing mathematical operations on the device's GPU. This is fundamental because tensor operations (matrix multiplication, convolutions) are parallelizable by nature, which is exactly what GPUs do best.
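To see why these operations parallelize so well, consider matrix multiplication written so that each output cell is its own small function. This is a plain JavaScript sketch with illustrative names (`outputCell`, `matMul`); it runs on the CPU, but the per-cell function is roughly the kind of work a shader performs for every output position.

```javascript
// Compute a single output cell. It reads from a and b but never from
// another output cell, which is what lets a GPU compute all cells at once.
function outputCell(a, b, row, col) {
  let sum = 0
  for (let k = 0; k < b.length; k++) {
    sum += a[row][k] * b[k][col]
  }
  return sum
}

// On the CPU we visit the cells one by one; a shader would run
// the same per-cell function for every (row, col) in parallel.
function matMul(a, b) {
  return a.map((_, row) => b[0].map((_, col) => outputCell(a, b, row, col)))
}

console.log(matMul([[1, 2], [3, 4]], [[5, 6], [7, 8]])) // [ [ 19, 22 ], [ 43, 50 ] ]
```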

TensorFlow.js
      ↓
WebGL Backend
      ↓
GPU (via GLSL shaders)

In practice, when you run a model in the browser, TensorFlow.js compiles operations into WebGL shaders and executes everything on the user's GPU. The result is delivered back to JavaScript as a tensor.

TensorFlow.js has other available backends — cpu (pure JavaScript), wasm (WebAssembly), and webgpu (the future, for more modern GPUs) — but WebGL is the most broadly supported today.

To see which backend is active:

import * as tf from '@tensorflow/tfjs'

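// Backends initialize asynchronously; wait before querying
await tf.ready()
console.log(tf.getBackend()) // 'webgl' when WebGL is available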

Pre-trained models: no need to train from scratch

Training an image classification model from scratch requires millions of images and days of processing on powerful GPUs. For most use cases, it doesn't make sense.

The alternative is to use a pre-trained model — a model that has already been trained on a huge dataset and whose weights are ready to use.

The most widely used dataset for this is ImageNet, in its ILSVRC form: about 1.2 million training images across 1000 classes. Models trained on ImageNet already "understand" edges, textures, shapes, and complex visual patterns.

One of the most commonly used models for browser applications is MobileNet.

Why MobileNet?

MobileNet was designed specifically to be efficient on resource-constrained devices — mobile, IoT, browser. It uses depthwise separable convolutions instead of standard convolutions, drastically reducing the number of operations without losing much in accuracy.
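The saving is easy to check with arithmetic. The sketch below counts the weights of a standard convolution and of its depthwise separable replacement; the function names and the example layer size (3×3 kernel, 512 channels in and out) are chosen for illustration, and biases are ignored.

```javascript
// Weights in a standard k x k convolution (biases ignored).
function standardConvParams(kernelSize, inChannels, outChannels) {
  return kernelSize * kernelSize * inChannels * outChannels
}

// Depthwise separable: one k x k filter per input channel (depthwise),
// followed by a 1 x 1 convolution that mixes channels (pointwise).
function separableConvParams(kernelSize, inChannels, outChannels) {
  const depthwise = kernelSize * kernelSize * inChannels
  const pointwise = inChannels * outChannels
  return depthwise + pointwise
}

// Hypothetical layer: 3 x 3 kernel, 512 input and 512 output channels.
const standard = standardConvParams(3, 512, 512)   // 2359296
const separable = separableConvParams(3, 512, 512) // 266752
console.log((standard / separable).toFixed(1))     // 8.8
```

For this layer the separable version needs almost nine times fewer weights, and the number of multiplications shrinks by the same factor.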

MobileNet V1 with 224×224 input has:

~4.2 million parameters (weights)
13 separable convolution blocks
Output: vector of 1000 probabilities (one per ImageNet class)

For reference, VGG-16, a classic model, has about 138 million parameters, more than 30 times as many. MobileNet is small enough to ship to a browser; VGG-16 is not a practical download.
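Parameter counts translate almost directly into download size. As a back-of-the-envelope sketch, assuming every weight is stored as a 4-byte float32 with no quantization or compression (the helper name `weightMegabytes` is made up for this example):

```javascript
const BYTES_PER_FLOAT32 = 4

// Approximate size of the weight files for a model stored as float32.
function weightMegabytes(parameterCount) {
  return (parameterCount * BYTES_PER_FLOAT32) / 1e6
}

console.log(weightMegabytes(4.2e6)) // 16.8 (MobileNet V1, ~4.2M parameters)
console.log(weightMegabytes(138e6)) // 552  (VGG-16, ~138M parameters)
```

Around 17MB is a heavy but workable first-visit download; more than half a gigabyte is not.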

SeeFood: all of this in practice

SeeFood is a classifier that runs 100% in the browser. No server, no API, no infrastructure cost. You point the camera (or upload a photo) and the model responds: hot dog or not.

The reference is obvious — anyone who watched Silicon Valley knows what this is about. But behind the joke there's a real, working implementation.

Tech stack used

TensorFlow.js v4.10.0 — loaded via CDN
MobileNet 1.0 — pre-trained model on ImageNet
WebGL — execution backend
getUserMedia API — device camera access
Canvas API — frame capture and processing

How the model was loaded

The model is served alongside the application as a JSON file plus binary weight files:

/model/
  model.json          ← architecture + reference to weights
  group1-shard1of1    ← weights (binary tensor)
  group2-shard1of1
  ...
  group55-shard1of1   ← 55 weight groups in total

Loading in the browser:

const model = await tf.loadLayersModel('/model/model.json')

This single await downloads the architecture and all 55 weight files. After that, the model is in memory, ready to do inference locally.
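Right after loading is also a natural moment to run one throwaway prediction, since on WebGL the first inference pays for shader compilation. The following is a minimal sketch, not the code SeeFood uses: `warmUp` is a hypothetical helper, and it assumes a single-output model that takes a [1, 224, 224, 3] input. The `tf` namespace and the loaded `model` are passed in as arguments so the function does not depend on how they were imported.

```javascript
// Run one throwaway prediction so backend setup (such as shader
// compilation on WebGL) happens before the first real image arrives.
function warmUp(tf, model, inputShape = [1, 224, 224, 3]) {
  tf.tidy(() => {
    const dummyInput = tf.zeros(inputShape)
    // dataSync() forces the computation to finish before we return.
    model.predict(dummyInput).dataSync()
  })
}

// Possible usage, once the model has loaded:
//   const model = await tf.loadLayersModel('/model/model.json')
//   warmUp(tf, model)
```

Because everything happens inside tf.tidy(), the dummy input and its output are released as soon as the warm-up finishes.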

Image preprocessing

MobileNet expects a tensor with shape [1, 224, 224, 3] and values normalized between -1 and 1. An image captured from the camera or an upload has to be converted to that format before it enters the model:

function preprocessImage(imageElement) {
  return tf.tidy(() => {
    return tf.browser
      .fromPixels(imageElement)   // converts to tensor [H, W, 3]
      .resizeBilinear([224, 224]) // resizes to expected size
      .toFloat()
      .div(127.5)                 // normalizes to [0, 2]
      .sub(1)                     // normalizes to [-1, 1]
      .expandDims(0)              // adds batch dimension → [1, 224, 224, 3]
  })
}

tf.tidy() matters here: it disposes of the intermediate tensors created inside the callback and keeps only the one that is returned. Without it, every call leaks GPU memory.
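The two normalization steps in preprocessImage are easy to verify by hand. The sketch below repeats the same arithmetic on plain numbers, and also shows the RGBA-to-RGB step that a 3-channel input implies when pixels come from a canvas. Both helper names are illustrative and none of this replaces the TensorFlow.js version above.

```javascript
// Same arithmetic as .div(127.5).sub(1), applied to one channel value.
function normalizeChannel(value) {
  return value / 127.5 - 1
}

// Canvas pixel data is RGBA; keep R, G, B and drop alpha.
function rgbaToNormalizedRgb(rgba) {
  const out = new Float32Array((rgba.length / 4) * 3)
  for (let src = 0, dst = 0; src < rgba.length; src += 4) {
    out[dst++] = normalizeChannel(rgba[src])
    out[dst++] = normalizeChannel(rgba[src + 1])
    out[dst++] = normalizeChannel(rgba[src + 2])
  }
  return out
}

console.log(normalizeChannel(0))     // -1
console.log(normalizeChannel(127.5)) // 0
console.log(normalizeChannel(255))   // 1
console.log(rgbaToNormalizedRgb([0, 255, 0, 255])) // Float32Array(3) [ -1, 1, -1 ]
```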

The inference

const tensor = preprocessImage(imageElement)
const predictions = model.predict(tensor)
const data = await predictions.data() // Float32Array with 1000 values
tf.dispose([tensor, predictions])     // free the input and output tensors

The result is a Float32Array with 1000 numbers — the probabilities for each of the 1000 ImageNet classes.
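When debugging, it helps to look at more than one class. Here is a small sketch that picks the highest-scoring entries out of such an array. `topK` is a hand-written helper, not a TensorFlow.js call, and `fakeOutput` is a stand-in for the real `data` array with made-up values.

```javascript
// Return the k highest-probability classes as { index, probability } pairs.
function topK(probabilities, k) {
  return Array.from(probabilities, (probability, index) => ({ index, probability }))
    .sort((a, b) => b.probability - a.probability)
    .slice(0, k)
}

// Stand-in for the model output: 1000 values, mostly zero.
const fakeOutput = new Float32Array(1000)
fakeOutput[934] = 0.5
fakeOutput[933] = 0.25
fakeOutput[0] = 0.125

console.log(topK(fakeOutput, 3).map((entry) => entry.index)) // [ 934, 933, 0 ]
```

Mapping each index to a readable name requires the ImageNet label list, which is not bundled with the weights.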

Detecting the hot dog

In the 1000-class ImageNet label list that MobileNet was trained on, index 934 (zero-based) is the hot dog class. Simple as that:

const HOT_DOG_CLASS_INDEX = 934
const HOT_DOG_THRESHOLD = 0.15

const hotdogProbability = data[HOT_DOG_CLASS_INDEX]
const isHotDog = hotdogProbability > HOT_DOG_THRESHOLD

console.log(isHotDog ? '✓ Hot Dog' : '✗ Not Hot Dog')

The threshold of 0.15 was chosen empirically: the model rarely reaches 90%+ confidence for hot dogs in real images. Set the threshold too high and real hot dogs are missed; set it too low and almost anything is flagged as one.
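One way to choose such a threshold less blindly is to score a small hand-labeled set and count both kinds of mistakes at several candidate values. The sketch below does that in plain JavaScript. The scores are invented for illustration and are not measurements from SeeFood, and `countErrors` is a hypothetical helper.

```javascript
// Count the two kinds of mistakes a given threshold produces.
function countErrors(samples, threshold) {
  let falsePositives = 0 // not a hot dog, but flagged as one
  let falseNegatives = 0 // a hot dog, but missed
  for (const { score, isHotDog } of samples) {
    const predicted = score > threshold
    if (predicted && !isHotDog) falsePositives++
    if (!predicted && isHotDog) falseNegatives++
  }
  return { falsePositives, falseNegatives }
}

// Hypothetical scores for class 934 on a small hand-labeled set.
const samples = [
  { score: 0.62, isHotDog: true },
  { score: 0.21, isHotDog: true },
  { score: 0.17, isHotDog: true },
  { score: 0.08, isHotDog: false },
  { score: 0.02, isHotDog: false },
  { score: 0.001, isHotDog: false },
]

for (const threshold of [0.01, 0.15, 0.9]) {
  console.log(threshold, countErrors(samples, threshold))
}
// 0.01 { falsePositives: 2, falseNegatives: 0 }
// 0.15 { falsePositives: 0, falseNegatives: 0 }
// 0.9 { falsePositives: 0, falseNegatives: 3 }
```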

Why this matters for frontend devs

It might seem like running AI in the browser is an exotic niche, but the practical applications are many:

Facial/gesture recognition for touchless interfaces
Client-side content moderation before upload
Accessibility — scene reading, text recognition in images
Real-time camera filters (like those on Instagram)
Document analysis without sending sensitive data to a server

And there's a very strong privacy argument: if inference happens in the browser, the raw data never leaves the user's machine. For sensitive domains (health, finance, personal documents), this can be decisive.

Performance considerations

Some practical points before going ahead and putting models in production:

Model size: MobileNet's weights total ~16MB, which must be downloaded on the first visit. Consider lazy loading and proper HTTP caching.

First-inference latency: Even with WebGL, the first inference is slower because TensorFlow.js has to compile its shaders for the GPU. Subsequent inferences are much faster.

Memory management: Tensors on the WebGL backend are not garbage-collected; release them with tensor.dispose() or wrap the work in tf.tidy(). Tensor leaks are a real problem.

Mobile devices: Phone GPUs are less powerful. Test on real devices before assuming performance will be good.

// Monitor memory usage
console.log(tf.memory())
// { numTensors: 12, numDataBuffers: 12, numBytes: 1234567 }
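The same counter can be turned into a quick leak check. The sketch below wraps a synchronous function and reports how many tensors it left behind. `countLeakedTensors` is a hypothetical helper; it takes the `tf` namespace as an argument and relies only on `tf.memory().numTensors`.

```javascript
// Run `fn` (synchronously) and report how many tensors it left allocated.
function countLeakedTensors(tf, fn) {
  const before = tf.memory().numTensors
  fn()
  return tf.memory().numTensors - before
}

// Possible usage with the preprocessing function from earlier:
//   const leaked = countLeakedTensors(tf, () => {
//     preprocessImage(imageElement) // result is never disposed
//   })
//   // leaked is 1: tf.tidy() kept the returned tensor alive
```

A result that keeps growing across repeated calls is the signal that something is missing a dispose().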

Conclusion

TensorFlow.js shows that machine learning in the browser has gone from a curiosity to a real option. With pre-trained models like MobileNet, WebGL acceleration, and a well-designed JavaScript API, you can build AI applications without an inference server and without giving up usable performance.

SeeFood is a small example with an obviously ridiculous goal. But the same pipeline (load the model, preprocess the input, run inference, interpret the output) is what runs under the hood of much more serious things.

It's worth exploring. The barrier to entry is much lower than it seems.

Tip

Want to better understand how to optimize frontend applications for resource-intensive usage scenarios? There's a dedicated chapter on performance in my book.

https://www.tautorn.com.br/react-beyond
