Tensors, shapes & broadcasting

Every AI computation is a flow of tensors through operations. If you're fluent in shapes and broadcasting, you can read most model code and debug the error message that eats a large share of beginners' time: shapes (a,b) and (c,d) not aligned.

What is a tensor?

A tensor is just an n-dimensional array of numbers. The number of dimensions is its rank (NumPy calls it ndim). That's the whole definition — in deep learning, "tensor" is simply the word for the multi-dimensional arrays that flow through a model.

Rank	Name	Example	Shape
0	scalar	`5.0`	`()`
1	vector	`[1, 2, 3]`	`(3,)`
2	matrix	a table / image channel	`(rows, cols)`
3	3-tensor	an RGB image	`(height, width, 3)`
4	4-tensor	a batch of images	`(batch, H, W, 3)`

import numpy as np

s = np.array(5.0)
v = np.array([1., 2., 3.])
M = np.array([[1., 2.], [3., 4.]])
T = np.zeros((2, 3, 4))
print("ranks:", s.ndim, v.ndim, M.ndim, T.ndim)
print("shapes:", s.shape, v.shape, M.shape, T.shape)

Output:

ranks: 0 1 2 3
shapes: () (3,) (2, 2) (2, 3, 4)

Don't be confused: "tensor" (ML) vs "tensor" (physics/math). In physics a tensor is an object with strict transformation rules. In ML, "tensor" just means "n-dimensional array." When a PyTorch person says tensor, they mean the array. Don't overthink it.

Shape is the thing you reason about

Almost every bug is a shape bug. Two habits save you:

Say the shape out loud at each line: "X is (batch, features)."
Know what each axis means — by deep convention, axis 0 is the batch / sample axis (one row per example). X[i] is the i-th example.
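Here is a minimal sketch of both habits in code. The batch size (32) and feature count (10) are made-up numbers for illustration, and the assert is just one possible way to write the shape down where Python can check it.

```python
import numpy as np

# Hypothetical batch: 32 examples, each with 10 features.
X = np.zeros((32, 10))          # X is (batch, features)

first_example = X[0]            # indexing axis 0 picks one example
print(first_example.shape)      # (10,)

# A cheap guard that turns a silent shape bug into an immediate error.
assert X.ndim == 2 and X.shape[1] == 10, f"expected (batch, 10), got {X.shape}"
```

Writing the expected shape in a comment or an assert costs one line and tends to catch the bug at the place it was introduced rather than three layers later.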

Reshaping moves the same numbers into a new shape

a = np.arange(12)          # [0 1 2 ... 11], shape (12,)
print(a.reshape(3, 4))     # same 12 numbers, now 3 rows of 4
print(a.reshape(3, 4).reshape(-1).shape)   # -1 = "infer this axis" -> flat again

Output:

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
(12,)

-1 means "you figure out this dimension so the total count matches." You'll see x.reshape(batch, -1) constantly to flatten everything-but-the-batch.
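As a small sketch of that flatten idiom, assume a hypothetical batch of 8 RGB images of 4x4 pixels (the sizes are invented for the example):

```python
import numpy as np

# Hypothetical batch of 8 RGB images, 4x4 pixels each.
images = np.zeros((8, 4, 4, 3))           # (batch, H, W, channels)

batch = images.shape[0]
flat = images.reshape(batch, -1)          # -1 is inferred as 4 * 4 * 3 = 48
print(flat.shape)                         # (8, 48)

# The total element count must be preserved, so a mismatched target fails.
try:
    images.reshape(batch, 50)
except ValueError as err:
    print("reshape failed:", err)
```

Only one axis can be -1 per call, since NumPy needs every other dimension to work out the missing one.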

Broadcasting: the rule that removes loops

Broadcasting lets NumPy combine arrays of different shapes by virtually stretching the smaller one. It's how you add a bias to every row, or scale every column, without a loop. The rule: compare shapes from the right; dimensions are compatible if they're equal or one of them is 1.
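You can check the rule without building any arrays. The sketch below assumes NumPy 1.20 or newer, where np.broadcast_shapes is available; the shapes themselves are arbitrary examples.

```python
import numpy as np

# Shapes are aligned from the right; missing leading axes count as size 1.
row_case = np.broadcast_shapes((2, 3), (3,))      # 3 matches 3
col_case = np.broadcast_shapes((2, 3), (2, 1))    # 1 stretches to 3
outer_case = np.broadcast_shapes((4, 1), (1, 5))  # both sides stretch
print(row_case, col_case, outer_case)             # (2, 3) (2, 3) (4, 5)

# (2, 3) vs (2,): the trailing axes are 3 and 2, and neither is 1.
try:
    np.broadcast_shapes((2, 3), (2,))
    compatible = True
except ValueError as err:
    compatible = False
    print("not broadcastable:", err)
```

The last case is the classic trap: a length-2 vector does not line up with the rows of a (2, 3) array, because comparison starts at the rightmost axis.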

Add a scalar to everything

X = np.array([[1., 2., 3.],
              [4., 5., 6.]])
print(X + 10)             # the 10 is stretched to every element

Output:

[[11. 12. 13.]
 [14. 15. 16.]]

Add a per-column vector (shape `(3,)`) to every row

print(X + np.array([100., 200., 300.]))    # (2,3) + (3,) -> stretched down rows

Output:

[[101. 202. 303.]
 [104. 205. 306.]]

Add a per-row vector — you must make it a column `(2,1)`

print(X + np.array([[10.], [20.]]))         # (2,3) + (2,1) -> stretched across cols

Output:

[[11. 12. 13.]
 [24. 25. 26.]]

That (2,1) trick, turning a vector into an explicit column, is how you control which axis broadcasts. For an existing 1-D array v, writing v[:, None] does the same thing. We used it in the recipe book to compute all-pairs distances.
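A short sketch of the same idea with indexing instead of a hand-written column. The all-pairs part is a simplified 1-D illustration with made-up points p and q, not necessarily the exact recipe referred to above.

```python
import numpy as np

row_offsets = np.array([10., 20.])        # shape (2,)
col = row_offsets[:, None]                # shape (2, 1): an explicit column
print(col.shape)                          # (2, 1)

X = np.array([[1., 2., 3.],
              [4., 5., 6.]])
shifted = X + col                         # same result as adding [[10.], [20.]]
print(shifted)

# All-pairs absolute differences between 1-D points: (3, 1) - (1, 2) -> (3, 2)
p = np.array([0., 1., 3.])
q = np.array([1., 5.])
dist = np.abs(p[:, None] - q[None, :])
print(dist)                               # dist[i, j] = |p[i] - q[j]|
```

Placing None on different sides is what decides which array stretches along which axis, so the result has one row per point in p and one column per point in q.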

`axis`: the other thing everyone trips on

Reductions like mean, sum, max take an axis. The mental model: axis=k is the axis that disappears.

X = np.array([[1., 2., 3.],
              [4., 5., 6.]])
print("axis=0 (down columns):", X.mean(axis=0))   # collapses the 2 rows -> 3 numbers
print("axis=1 (across rows)  :", X.mean(axis=1))   # collapses the 3 cols -> 2 numbers

Output:

axis=0 (down columns): [2.5 3.5 4.5]
axis=1 (across rows)  : [2. 5.]

Don't be confused: axis=0 does not mean "one result per row." axis=0 is the axis that indexes rows, so reducing over it collapses the rows and leaves one number per column. "Per-column statistic" (like feature means for standardization) = axis=0. "Per-row statistic" (like normalizing each sample) = axis=1. Read it as "the axis I sum over is the axis that vanishes."
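The two statistics named above can be sketched like this. The use of keepdims=True is one common way to keep the reduced axis around as size 1 so the result broadcasts back; it is a choice made for this example, not the only option.

```python
import numpy as np

X = np.array([[1., 2., 3.],
              [4., 5., 6.]])

# Per-column statistic: reduce over axis 0, result has shape (3,)
col_mean = X.mean(axis=0)
centered = X - col_mean                   # (2,3) - (3,) broadcasts down rows

# Per-row statistic: reduce over axis 1; keepdims=True keeps shape (2,1)
row_sum = X.sum(axis=1, keepdims=True)
row_normalized = X / row_sum              # (2,3) / (2,1) broadcasts across cols

print(centered.mean(axis=0))              # every column now has mean 0
print(row_normalized.sum(axis=1))         # every row now sums to 1
```

Without keepdims, row_sum would have shape (2,), which does not broadcast against (2, 3) and would raise an error here.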

Matrix multiplication: the shape contract

A @ B requires the inner dimensions to match: (m, k) @ (k, n) -> (m, n). The shared k is summed away.

A = np.ones((2, 3))
B = np.ones((3, 5))
print((A @ B).shape)        # (2,3) @ (3,5) -> (2,5)

Output:

(2, 5)

A neural network layer is this: outputs = inputs @ weights + bias, where inputs is (batch, in_features) and weights is (in_features, out_features). When you see the dreaded "shapes not aligned" error, write out both shapes and compare the inner dimensions: they will not match.
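A minimal sketch of that layer in NumPy, with invented sizes (batch of 4, 3 input features, 2 output features) and random weights purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, in_features, out_features = 4, 3, 2   # hypothetical sizes

inputs = rng.normal(size=(batch, in_features))
weights = rng.normal(size=(in_features, out_features))
bias = np.zeros(out_features)                # (out_features,) broadcasts over batch

outputs = inputs @ weights + bias            # (4,3) @ (3,2) + (2,) -> (4,2)
print(outputs.shape)

# Mismatched inner dimensions: (4, 3) @ (2, 3) fails because 3 != 2.
try:
    inputs @ weights.T
    failed = False
except ValueError as err:
    failed = True
    print("matmul failed:", err)
```

Note that the bias add is broadcasting again: one bias per output feature, stretched over every example in the batch.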

Dtype: the silent gotcha

Tensors have a dtype (float32, float64, int64…). Deep learning typically runs in float32 (half the memory of float64, plenty of precision), and modern training often uses float16/bfloat16 for speed. Integer arrays can't hold gradients, and storing a float into one silently truncates it, which is a frequent surprise. Plain division with / is safe in NumPy because it promotes to float:

print(np.array([1, 2, 3]) / 2)         # NumPy promotes to float -> fine
print((np.array([1, 2, 3]) * 1.0).dtype)   # multiplying by a float also promotes

Output:

[0.5 1.  1.5]
float64
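Where integer dtypes do tend to bite is floor division and assignment. The sketch below uses a made-up counts array to show both, plus the explicit cast that avoids the problem.

```python
import numpy as np

counts = np.array([1, 2, 3])         # integer dtype inferred from the literals
print(counts.dtype.kind)             # 'i' (exact width is platform dependent)

halved = counts // 2                 # floor division stays integer
print(halved)                        # [0 1 1]

counts[0] = 2.9                      # assignment into an int array truncates
print(counts)                        # [2 2 3]

as_float = counts.astype(np.float32) # cast first when fractions are needed
as_float[0] = 2.9
print(as_float.dtype)                # float32
```

The truncation happens without a warning, so it is worth checking the dtype of any array you are about to write fractional values into.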

Don't be confused: float32 vs float64. NumPy defaults to float64; PyTorch defaults to float32. If you move data between them and get a dtype error, cast explicitly with .astype(np.float32) or .float(). Models almost never need float64.
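On the NumPy side, the explicit cast looks like the sketch below. It stays in NumPy only, so the PyTorch .float() call is not shown; the array values are arbitrary.

```python
import numpy as np

x64 = np.array([1.0, 2.0, 3.0])      # NumPy default: float64
x32 = x64.astype(np.float32)         # explicit cast to float32

print(x64.dtype, x64.nbytes)         # float64 24
print(x32.dtype, x32.nbytes)         # float32 12

# Mixing the two silently promotes back to float64.
mixed = x32 + x64
print(mixed.dtype)                   # float64
```

Because mixed arithmetic promotes upward, a single stray float64 array can quietly double the memory of everything computed from it, which is one reason to cast once at the boundary.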

The takeaway

Tensors are n-D arrays; rank counts the dimensions; shape is what you reason about; axis 0 is the batch; broadcasting stretches size-1 dimensions so you never loop; matmul contracts the shared inner dimension. With this, you can read model code. Next: how raw data becomes the tensors in the first place — features. 👉

AI Foundations in Depth