Linear¶
The Linear layer implements an affine transformation between finite-dimensional real vector spaces and is a fundamental building block of deep learning architectures. Formally, it applies a linear map from an input feature space to an output representation space, optionally followed by a translation given by a bias term. The transformation is applied independently to each element of a batch.
Mathematical definition¶
Let $\mathbf{X} \in \mathbb{R}^{N \times d}$ be an input tensor representing a batch of $N$ samples, where each sample is a vector in a $d$-dimensional feature space. The Linear layer defines the affine transformation
$$\mathbf{Y} = \mathbf{X}\mathbf{W} + \mathbf{b}$$
where the involved quantities have the following dimensions:
- $ \mathbf{X} \in \mathbb{R}^{N \times d} $ : input batch matrix
- $ \mathbf{W} \in \mathbb{R}^{d \times m} $ : weight matrix (trainable parameters)
- $ \mathbf{b} \in \mathbb{R}^{1 \times m} $ : bias vector associated with the output neurons
- $ \mathbf{Y} \in \mathbb{R}^{N \times m} $ : output tensor
- $ m $ : number of neurons, i.e., the dimensionality of the output space
From a dimensional analysis standpoint, the matrix product
$$ \mathbf{X}\mathbf{W} : \mathbb{R}^{N \times d} \times \mathbb{R}^{d \times m} \;\longrightarrow\; \mathbb{R}^{N \times m} $$
is well-defined. The bias term $\mathbf{b}$ is then broadcast across the rows of the product, so that each component $b_j$ is added to all entries of the $j$-th output column. Explicitly,
$$ Y_{ij} = (\mathbf{X}\mathbf{W})_{ij} + b_j, \quad i = 1,\dots,N,\; j = 1,\dots,m. $$
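The shape algebra and the broadcasting of $\mathbf{b}$ can be checked concretely with plain NumPy (a sketch independent of sorix, using small arbitrary dimensions):

```python
import numpy as np

N, d, m = 4, 3, 2          # batch size, input features, output neurons
rng = np.random.default_rng(0)
X = rng.standard_normal((N, d))  # input batch, X ∈ ℝ^(N × d)
W = rng.standard_normal((d, m))  # weight matrix, W ∈ ℝ^(d × m)
b = rng.standard_normal((1, m))  # bias row vector, b ∈ ℝ^(1 × m)

Y = X @ W + b              # broadcasting adds b_j to every row of XW
assert Y.shape == (N, m)

# entry-wise definition: Y[i, j] = sum_k X[i, k] * W[k, j] + b[0, j]
i, j = 1, 0
assert np.isclose(Y[i, j], X[i] @ W[:, j] + b[0, j])
```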
Interpretation as a linear mapping¶
At the level of individual samples, for each $i \in \{1, \dots, N\}$, the transformation can be written as
$$ \mathbf{y}_i = \mathbf{x}_i \mathbf{W} + \mathbf{b}, \quad \mathbf{x}_i \in \mathbb{R}^{1 \times d},\; \mathbf{y}_i \in \mathbb{R}^{1 \times m}. $$
Thus, each output vector $\mathbf{y}_i$ is obtained as a linear combination of the input features, defined by the columns of $\mathbf{W}$, followed by a translation in the output space determined by the bias vector $\mathbf{b}$.
Functional view¶
The Linear layer realizes the mapping
$$ \text{Linear}:\; \mathbb{R}^{N \times d} \;\longrightarrow\; \mathbb{R}^{N \times m}, $$
where the same affine transformation is applied independently to each sample in the batch. This operator forms the mathematical foundation upon which more complex nonlinear models are constructed when composed with activation and normalization layers.
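For instance, composing the layer with an elementwise nonlinearity already yields a one-hidden-layer network. A minimal plain-NumPy sketch (dimensions chosen arbitrarily, ReLU as the activation):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, h, m = 8, 4, 16, 2
X = rng.standard_normal((N, d))

W1, b1 = rng.standard_normal((d, h)), np.zeros((1, h))
W2, b2 = rng.standard_normal((h, m)), np.zeros((1, m))

H = np.maximum(X @ W1 + b1, 0.0)   # first Linear layer followed by ReLU
Y = H @ W2 + b2                    # second Linear layer
assert Y.shape == (N, m)
```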
Parameterization and gradients¶
The parameters $\mathbf{W}$ and $\mathbf{b}$ are represented as tensor objects with requires_grad=True, so their gradients are computed by reverse-mode automatic differentiation. During backpropagation, the gradients with respect to the parameters and the input $\mathbf{X}$ are:
Gradient w.r.t. Weights: $$\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \mathbf{X}^T \frac{\partial \mathcal{L}}{\partial \mathbf{Y}}$$
Gradient w.r.t. Bias: $$\frac{\partial \mathcal{L}}{\partial \mathbf{b}} = \sum_{i=1}^N \left( \frac{\partial \mathcal{L}}{\partial \mathbf{Y}} \right)_{i, \cdot}$$
Gradient w.r.t. Input: $$\frac{\partial \mathcal{L}}{\partial \mathbf{X}} = \frac{\partial \mathcal{L}}{\partial \mathbf{Y}} \mathbf{W}^T$$
where $\mathcal{L}$ denotes the global loss function and $\frac{\partial \mathcal{L}}{\partial \mathbf{Y}}$ is the gradient propagated from the subsequent layer.
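These three formulas can be verified numerically in plain NumPy. The sketch below uses the simple scalar loss $\mathcal{L} = \tfrac{1}{2}\sum_{ij} Y_{ij}^2$ (so that $\partial \mathcal{L} / \partial \mathbf{Y} = \mathbf{Y}$) and checks one weight entry against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m = 5, 4, 3
X = rng.standard_normal((N, d))
W = rng.standard_normal((d, m))
b = rng.standard_normal((1, m))

def loss(X, W, b):
    Y = X @ W + b
    return 0.5 * np.sum(Y**2)   # simple scalar loss, so dL/dY = Y

dL_dY = X @ W + b                          # gradient propagated from the loss
dL_dW = X.T @ dL_dY                        # X^T (dL/dY)
dL_db = dL_dY.sum(axis=0, keepdims=True)   # sum of dL/dY over the batch rows
dL_dX = dL_dY @ W.T                        # (dL/dY) W^T

# finite-difference check of a single weight entry
eps = 1e-6
Wp = W.copy()
Wp[0, 0] += eps
num = (loss(X, Wp, b) - loss(X, W, b)) / eps
assert np.isclose(num, dL_dW[0, 0], atol=1e-4)
```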
Parameter initialization¶
The weight matrix $\mathbf{W}$ is initialized from a zero-mean normal distribution with a standard deviation determined by the chosen initialization scheme:
He initialization (recommended for ReLU-like activations): $$ \sigma = \sqrt{\frac{2}{d}}. $$
Xavier initialization (suitable for symmetric activations such as $\tanh$): $$ \sigma = \sqrt{\frac{2}{d + m}}. $$
Formally, $$ W_{ij} \sim \mathcal{N}(0, \sigma^2). $$
When present, the bias vector $\mathbf{b}$ is initialized to zero.
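A plain-NumPy sketch of these initialization rules (the function name `init_weights` is illustrative, not part of sorix's API):

```python
import numpy as np

def init_weights(d, m, scheme="he", rng=None):
    """Sample W ∈ ℝ^(d × m) from N(0, sigma^2), with sigma set by the scheme."""
    rng = rng or np.random.default_rng()
    if scheme == "he":
        sigma = np.sqrt(2.0 / d)        # fan-in based, for ReLU-like activations
    elif scheme == "xavier":
        sigma = np.sqrt(2.0 / (d + m))  # fan-in + fan-out, for tanh-like activations
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return rng.standard_normal((d, m)) * sigma

W = init_weights(256, 128, scheme="he")
b = np.zeros((1, 128))                  # bias starts at zero
```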
Forward computation¶
Given an input tensor $\mathbf{X}$, the forward evaluation of the layer is performed through the matrix operation
$$ \text{Linear}(\mathbf{X}) = \mathbf{X}\mathbf{W} + \mathbf{b}. $$
In the implementation, this computation is exposed via the __call__ method, enabling a concise and functional syntax consistent with the rest of the framework.
Multi-device support¶
The Linear layer is device-aware. Parameters and computations may reside on either CPU or GPU, using NumPy or CuPy as the numerical backend, respectively. The to(device) method ensures consistent parameter transfer across devices while preserving the mathematical semantics of the transformation.
Parameter interface¶
The trainable parameters of the layer are exposed through the parameters() method, which returns the set
$$ \{\mathbf{W}, \mathbf{b}\}, $$
or only $\mathbf{W}$ when the bias term is disabled. This abstraction allows direct integration with gradient-based optimization algorithms.
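As an illustration of that integration, a vanilla SGD step over such a parameter set might look as follows. The `Param` class and its `.data`/`.grad` attributes are a minimal hypothetical stand-in for sorix tensors, whose exact attribute names are not specified in this section:

```python
import numpy as np

class Param:
    """Hypothetical stand-in for a trainable tensor (sorix's actual
    tensor attributes may differ)."""
    def __init__(self, data):
        self.data = data
        self.grad = np.zeros_like(data)

def sgd_step(params, lr=0.01):
    # vanilla gradient descent: theta <- theta - lr * dL/dtheta
    for p in params:
        p.data -= lr * p.grad

W = Param(np.ones((3, 2)))
W.grad = np.full((3, 2), 0.5)
sgd_step([W], lr=0.1)       # each entry becomes 1 - 0.1 * 0.5 = 0.95
```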
Statistical interpretation¶
From a statistical perspective, the Linear layer can be interpreted as a multivariate linear regression model, where each output neuron computes a linear combination of the input features. In this context, each column of the weight matrix $\mathbf{W}$, together with the corresponding component of the bias vector $\mathbf{b}$, defines an affine hyperplane that approximates the relationship between the input features and one output variable.
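This correspondence can be made concrete: on noiseless affine data, minimizing the squared error of a single Linear layer recovers exactly the ordinary least-squares solution. A NumPy sketch (with hypothetical true parameters chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 200, 3
X = rng.standard_normal((N, d))
true_W = np.array([[2.0], [-1.0], [0.5]])  # illustrative ground-truth weights
y = X @ true_W + 0.3                       # noiseless affine relationship

# augment X with a column of ones so the bias is absorbed into the weights
Xa = np.hstack([X, np.ones((N, 1))])
theta, *_ = np.linalg.lstsq(Xa, y, rcond=None)

assert np.allclose(theta[:d], true_W, atol=1e-8)  # recovered weights
assert np.allclose(theta[d], 0.3, atol=1e-8)      # recovered bias
```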
# Uncomment the next line and run this cell to install sorix
#!pip install 'sorix @ git+https://github.com/Mitchell-Mirano/sorix.git@main'
from sorix import tensor
from sorix.nn import Linear
import numpy as np
# create random input data
samples = 10
features = 3
neurons = 2
# X ∈ ℝ^(samples × features)
X = tensor(np.random.randn(samples, features))
X
tensor([[ 1.68897847, 0.95780477, -0.67256397],
[-0.24741235, 0.8279418 , -0.12785988],
[-0.34724227, -0.28500933, -0.10984726],
[ 0.57447891, -1.63470577, -0.23525714],
[ 0.84202 , 1.62048221, -0.97327998],
[-2.04520371, -0.64436421, 1.39538598],
[ 1.56181071, -1.56024376, 0.94887986],
[ 1.65149527, 0.01713004, 0.36482856],
[ 1.19733922, 0.59846951, -2.17169579],
[ 0.56463719, 1.87145863, -1.40925879]], dtype=sorix.float64)
# instantiate a Linear layer: ℝ^(samples × features) → ℝ^(samples × neurons)
linear = Linear(features, neurons)
# weight matrix W ∈ ℝ^(features × neurons)
print(linear.W)
# bias vector b ∈ ℝ^(1 × neurons)
print(linear.b)
tensor([[ 0.331668 , 1.1764342 ],
[-0.14172798, 0.6686574 ],
[-0.6454266 , 0.41196334]], requires_grad=True)
tensor([[0., 0.]], requires_grad=True)
# forward pass:
# Y ∈ ℝ^(samples × neurons) = X @ W + b
Y = linear(X)
print(Y)
tensor([[ 0.85852301, 2.35034354],
[-0.11687712, 0.20987151],
[-0.003877 , -0.64433431],
[ 0.57406103, -0.51413885],
[ 0.67778417, 1.67317287],
[-1.48762335, -2.26205854],
[ 0.12670055, 1.18500262],
[ 0.30985026, 2.10462557],
[ 1.71396938, 0.91410278],
[ 0.83160709, 1.33506022]], dtype=sorix.float64, requires_grad=True)