CS231n Homework 1
KNN
Inline Question 1
Notice the structured patterns in the distance matrix, where some rows or columns are visibly brighter.
(Note that with the default color scheme black indicates low distances while white indicates high distances.)
- What in the data is the cause behind the distinctly bright rows?
- What causes the columns?
Your Answer:
The bright rows indicate large distances between a test sample and all training samples, which usually means the test sample is very different from the training set (e.g., an outlier or unusual example).
The bright columns indicate large distances between a particular training sample and all test samples, which usually means that training sample is an outlier or otherwise atypical.
Inline Question 2
We can also use other distance metrics such as L1 distance. For pixel values $p_{ij}^{(k)}$ at location $(i,j)$ of some image $I_k$,
the mean $\mu$ across all pixels over all images is \(\mu=\frac{1}{nhw}\sum_{k=1}^n\sum_{i=1}^{h}\sum_{j=1}^{w}p_{ij}^{(k)}\) And the pixel-wise mean $\mu_{ij}$ across all images is \(\mu_{ij}=\frac{1}{n}\sum_{k=1}^np_{ij}^{(k)}.\) The general standard deviation $\sigma$ and pixel-wise standard deviation $\sigma_{ij}$ is defined similarly.
Which of the following preprocessing steps will not change the performance of a Nearest Neighbor classifier that uses L1 distance? Select all that apply. To clarify, both training and test examples are preprocessed in the same way.
- Subtracting the mean $\mu$
- Subtracting the per pixel mean $\mu_{ij}$
- Subtracting the mean $\mu$ and dividing by the standard deviation $\sigma$.
- Subtracting the pixel-wise mean $\mu_{ij}$ and dividing by the pixel-wise standard deviation $\sigma_{ij}$.
- Rotating the coordinate axes of the data, which means rotating all the images by the same angle. Empty regions in the image caused by rotation are padded with a same pixel value and no interpolation is performed.
Your Answer:
- (1) and (2)
Your Explanation:
Subtracting the same constant from every image, whether the single global mean $\mu$ or the shared per-pixel mean $\mu_{ij}$, cancels inside every pairwise difference, $|(x_i - c) - (y_i - c)| = |x_i - y_i|$, so all L1 distances, and therefore the nearest-neighbor predictions, are unchanged.
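A quick numerical check makes this concrete. The sketch below uses toy random data (the array names and shapes are made up for illustration) and verifies that subtracting the global mean or the per-pixel mean leaves every pairwise L1 distance unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.random((5, 3072))   # toy flattened "training images"
X_test = rng.random((2, 3072))    # toy flattened "test images"

def l1_dists(A, B):
    # pairwise L1 distances, shape (len(A), len(B))
    return np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)

mu = np.concatenate([X_train, X_test]).mean()            # global mean (1)
mu_ij = np.concatenate([X_train, X_test]).mean(axis=0)   # per-pixel mean (2)

base = l1_dists(X_test, X_train)
print(np.allclose(base, l1_dists(X_test - mu, X_train - mu)))        # True
print(np.allclose(base, l1_dists(X_test - mu_ij, X_train - mu_ij)))  # True
```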
Inline Question 3
Which of the following statements about $k$-Nearest Neighbor ($k$-NN) are true in a classification setting, and for all $k$? Select all that apply.
- The decision boundary of the k-NN classifier is linear.
- The training error of a 1-NN will always be lower than or equal to that of 5-NN.
- The test error of a 1-NN will always be lower than that of a 5-NN.
- The time needed to classify a test example with the k-NN classifier grows with the size of the training set.
- None of the above.
$\color{blue}{\textit Your Answer:}$ 2, 4
$\color{blue}{\textit Your Explanation:}$
- False. The decision boundary of k-NN is generally highly non-linear (especially in high dimensions or with complex class distributions); it only looks approximately linear for very special data.
- True. On the training set, each training point's nearest neighbor is itself (distance 0), so 1-NN attains zero training error (assuming there are no duplicate points with conflicting labels), which is at most the training error of 5-NN.
- False. 1-NN tends to overfit and has high variance, while 5-NN is smoother and often generalizes better, so 1-NN is not always better on test error.
- True. See the complexity analysis of the distance-computation code below.
Additional complexity analysis of the code
```python
def compute_distances_two_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a nested loop over both the training data and the
    test data.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data.

    Returns:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      is the Euclidean distance between the ith test point and the jth
      training point.
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            dists[i][j] = np.sqrt(np.sum((X[i] - self.X_train[j]) ** 2))
    return dists

def compute_distances_one_loop(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a single loop over the test data.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    #######################################################################
    # Compute the l2 distance between the ith test point and all training
    # points, and store the result in dists[i, :].
    # Do not use np.linalg.norm().
    #######################################################################
    for i in range(num_test):
        diff = self.X_train - X[i]                       # broadcast over all training rows
        dists[i, :] = np.sqrt(np.sum(diff ** 2, axis=1))
    return dists

def compute_distances_no_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using no explicit loops.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    #########################################################################
    # Compute the l2 distance between all test points and all training
    # points without using any explicit loops, and store the result in
    # dists.
    #
    # You should implement this function using only basic array operations;
    # in particular you should not use functions from scipy,
    # nor use np.linalg.norm().
    #
    # HINT: Try to formulate the l2 distance using matrix multiplication
    # and two broadcast sums.
    #########################################################################
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, expanded with broadcasting
    test_square = np.sum(X ** 2, axis=1).reshape(num_test, 1)   # (num_test, 1)
    train_square = np.sum(self.X_train ** 2, axis=1)            # (num_train,)
    cross = X @ self.X_train.T                                  # (num_test, num_train)
    dists = np.sqrt(np.maximum(test_square + train_square - 2 * cross, 0.0))
    return dists
```
With N test points, M training points, and D dimensions:
- two_loops: time complexity Θ(N·M·D).
- one_loop: time complexity Θ(N·M·D); asymptotically the same as the two-loop version, but usually faster because the inner loop runs inside NumPy's C implementation, cutting Python-level loop overhead.
- no_loops: time complexity Θ(N·M·D); still the same asymptotic cost, but typically the fastest, since the bulk of the work is handed to the BLAS matrix-multiplication routines, which are vectorized and cache-friendly, with almost no Python loop overhead.
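A rough timing comparison can confirm this in practice; here `classifier` is assumed to be the fitted KNearestNeighbor object from the notebook and `X_test` the flattened test matrix (both names are assumptions about the usual notebook setup, not shown above):

```python
import time

def time_it(fn, *args):
    # wall-clock time of a single call
    start = time.time()
    fn(*args)
    return time.time() - start

print('two loops: %.2fs' % time_it(classifier.compute_distances_two_loops, X_test))
print('one loop : %.2fs' % time_it(classifier.compute_distances_one_loop, X_test))
print('no loops : %.2fs' % time_it(classifier.compute_distances_no_loops, X_test))
```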
Softmax
Inline Question 1
Why do we expect our loss to be close to -log(0.1)? Explain briefly.
$\color{blue}{\textit Your Answer:}$ At initialization the model has no preference for any class, so the predicted probabilities should be roughly uniform: each of the n = 10 classes gets probability about 1/n = 0.1, and the cross-entropy loss per example is therefore about $-\log(0.1)$.
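A one-line sanity check of that number (a minimal sketch, independent of the actual model):

```python
import numpy as np

# with near-zero initial weights the scores are ~equal, so softmax gives
# ~uniform probabilities over the 10 CIFAR-10 classes
num_classes = 10
probs = np.full(num_classes, 1.0 / num_classes)
loss_per_example = -np.log(probs[0])   # cross-entropy of the correct class
print(loss_per_example, -np.log(0.1))  # both ~2.3026
```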
Inline Question 2
Although gradient checking is reliable for the softmax loss, it is possible that, for the SVM loss, once in a while a dimension in the gradcheck will not match exactly. What could such a discrepancy be caused by? Is it a reason for concern? What is a simple example in one dimension where an SVM loss gradient check could fail? How would changing the margin affect the frequency of this happening?
$\color{blue}{\textit Your Answer:}$
$\color{green}{\textit Discrepancy\ Reason:}$
The multiclass SVM loss has “kinks”: each term is a $\max(0, \cdot)$, so it is not differentiable at the point where the margin term is exactly zero, \(s_j - s_{y_i} + \Delta = 0\). A numerical gradient check uses finite differences, which can “step” from one side of the kink to the other, so the numerical gradient may not match the analytic gradient.
$\color{green}{\textit Is\ it\ a\ reason\ for\ concern?}$:
Usually no. At the kink the gradient is not uniquely defined (there’s a set of valid subgradients), so a mismatch in one dimension occasionally is expected. If mismatches are frequent or huge everywhere, then it’s concerning.
$\color{green}{\textit Simple\ Example}$:
Consider a 1D hinge loss: \(f(w) = \max(0, w)\) At $w=0$ it’s not differentiable.
Analytic “gradient” (subgradient choice) might return $0$ (many implementations do).
Numerical gradient with symmetric finite differences: \(\frac{f(h) - f(-h)}{2h} = \frac{h - 0}{2h} = \frac{1}{2}\)
which doesn’t match 0.
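A tiny numerical sketch of exactly this 1-D case (the step size `h` is an arbitrary illustrative choice):

```python
import numpy as np

f = lambda w: np.maximum(0.0, w)             # 1-D hinge

w, h = 0.0, 1e-5
numeric = (f(w + h) - f(w - h)) / (2 * h)    # centered difference -> 0.5
analytic = 0.0 if w <= 0 else 1.0            # a common subgradient choice
print(numeric, analytic)                     # 0.5 vs 0.0: a spurious "mismatch"
```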
That’s exactly the kind of discrepancy you can see in the SVM loss when some class satisfies \(s_j - s_{y_i} + \Delta \approx 0\).
$\color{green}{\textit How\ does\ the\ margin\ \Delta\ affect\ how\ often\ this\ happens?}$:
The kink happens when $s_j - s_{y_i} + \Delta = 0$. If you increase $\Delta$, you shift the hinge boundary, so more constraints are near/over the boundary, meaning you’re more likely to have some terms very close to 0 during training → gradcheck mismatches become more frequent.
If you decrease $\Delta$ (toward 0), fewer terms sit near the boundary → less frequent kink-crossing in finite differences → fewer mismatches.
(Still, exact equality is measure-zero with real numbers; the practical issue is being very close to 0, where finite differences can cross the kink.)
Inline Question 3
Describe what your visualized Softmax classifier weights look like, and offer a brief explanation for why they look the way they do.
$\color{blue}{\textit Your Answer:}$ When visualized, the weights for each class look a lot like the average image of that class. During training the score is the inner product of the weights with the training pixels, and the objective pushes the correct class's score up while pushing the other classes' scores down. Multiplying the data by column $i$ of the weight matrix gives the scores for class $i$, so the weights tend to become large for pixels that are commonly active in images of class $i$ and close to zero for pixels that are not. As a result, the visualized weights resemble a template of the corresponding class.
Inline Question 4 - True or False
Suppose the overall training loss is defined as the sum of the per-datapoint loss over all training examples. It is possible to add a new datapoint to a training set that would change the softmax loss, but leave the SVM loss unchanged.
$\color{blue}{\textit Your Answer:}$ True
$\color{blue}{\textit Your Explanation:}$ Softmax loss is never exactly zero for a finite score gap, because it depends on the full probability distribution: \(L_i^{\text{softmax}} = -\log \frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\)
So even if the correct class already has the highest score by a lot, adding that datapoint still contributes a small (but positive) softmax loss.
SVM hinge loss does become exactly zero once the example is correctly classified with margin: \(L_i^{\text{svm}}=\sum_{j\ne y_i}\max(0, s_j - s_{y_i} + \Delta)\)
If the new datapoint satisfies $s_{y_i} \ge s_j + \Delta$ for all $j\ne y_i$, then every term inside the max is negative, so the SVM loss contribution is 0. Adding such a datapoint leaves the total SVM loss unchanged, while the softmax loss still increases by a tiny amount.
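A small made-up numerical example (the scores below are invented purely for illustration) shows the two losses side by side for such a datapoint:

```python
import numpy as np

# scores for a new datapoint whose correct class (index 0) wins by a wide margin
scores = np.array([10.0, 1.0, 0.5])
y, delta = 0, 1.0

svm_loss = np.sum(np.maximum(0.0, np.delete(scores, y) - scores[y] + delta))
softmax_loss = -np.log(np.exp(scores[y]) / np.sum(np.exp(scores)))

print(svm_loss)      # 0.0: adding this point leaves the total SVM loss unchanged
print(softmax_loss)  # ~2e-4: small but strictly positive
```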
Summary
- How is a model built end to end?
- How is the data split by batch_size? How is cross-validation implemented? Grid search over hyperparameters.
- How to visualize what the learned weight matrix looks like for each class:
```python
w = best_softmax.W[:-1, :]  # strip out the bias row; shape (3072, num_classes)
w = w.reshape(32, 32, 3, 10)
w_min, w_max = np.min(w), np.max(w)
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
for i in range(10):
    plt.subplot(2, 5, i + 1)
    # Rescale the weights to be between 0 and 255
    wimg = 255.0 * (w[:, :, :, i].squeeze() - w_min) / (w_max - w_min)
    plt.imshow(wimg.astype('uint8'))
    plt.axis('off')
    plt.title(classes[i])
```
First of all, why the shape (32, 32, 3, 10)? Once you understand the training pipeline it becomes clear which parameters correspond to what: each of the 10 columns of W holds 3072 pixel weights (plus a bias) for one class, so after stripping the bias row each column can be reshaped back into a 32×32×3 image, one per class.
Two-Layer Neural Network
Inline Question 1
We’ve only asked you to implement ReLU, but there are a number of different activation functions that one could use in neural networks, each with its pros and cons. In particular, an issue commonly seen with activation functions is getting zero (or close to zero) gradient flow during backpropagation. Which of the following activation functions have this problem? If you consider these functions in the one dimensional case, what types of input would lead to this behaviour?
- Sigmoid
- ReLU
- Leaky ReLU
$\color{blue}{\textit Your Answer:}$ 1 and 2
- Sigmoid: \(\sigma(x)=\frac{1}{1+e^{-x}}, \quad \sigma'(x)=\sigma(x)(1-\sigma(x))\)
  When the input is a very large positive or very large negative number, the sigmoid saturates and $\sigma'(x)\approx 0$, so almost no gradient flows back.
- ReLU:
  Inputs with $x \le 0$ (especially units stuck in the negative half for a long time) give a gradient of exactly 0 (a "dead" ReLU).
- Leaky ReLU:
  Derivative: \(f'(x)= \begin{cases} 1,& x>0\\ \alpha,& x\le 0 \end{cases}\) The gradient never becomes 0 just because $x\le 0$.
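The saturation behavior can be checked numerically; the sketch below (with an assumed Leaky ReLU slope $\alpha = 0.01$) evaluates the three derivatives at a few sample inputs:

```python
import numpy as np

x = np.array([-20.0, -2.0, 0.0, 2.0, 20.0])

sigmoid = 1.0 / (1.0 + np.exp(-x))
sigmoid_grad = sigmoid * (1.0 - sigmoid)      # ~0 for large |x| (saturation)

relu_grad = (x > 0).astype(float)             # exactly 0 for x <= 0
leaky_grad = np.where(x > 0, 1.0, 0.01)       # alpha = 0.01: never exactly 0

print(sigmoid_grad)  # [~2e-9, 0.105, 0.25, 0.105, ~2e-9]
print(relu_grad)     # [0. 0. 0. 1. 1.]
print(leaky_grad)    # [0.01 0.01 0.01 1.   1.  ]
```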
Debug the training
Things to check and tune during training:
- Plot the loss curve and the train/validation accuracy curves (a plotting sketch follows this list).
- A loss that decreases roughly linearly usually suggests the learning rate is set too low.
- Almost no gap between training and validation accuracy suggests the model capacity is low; it is worth increasing the model size.
- A clear gap between training and validation accuracy means the model is too large and is overfitting.
- Visualize the weights: on visual data, the first-layer weights typically show some visible structure.
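A minimal plotting sketch for the first bullet, assuming `stats` is the dictionary returned by the training loop with keys 'loss_history', 'train_acc_history', and 'val_acc_history' (these key names are assumptions based on the usual assignment code):

```python
import matplotlib.pyplot as plt

plt.subplot(2, 1, 1)
plt.plot(stats['loss_history'])
plt.title('Loss history')
plt.xlabel('Iteration')

plt.subplot(2, 1, 2)
plt.plot(stats['train_acc_history'], label='train')
plt.plot(stats['val_acc_history'], label='val')
plt.title('Classification accuracy history')
plt.xlabel('Epoch')
plt.legend()
plt.tight_layout()
plt.show()
```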
Inline Question 2
Now that you have trained a Neural Network classifier, you may find that your testing accuracy is much lower than the training accuracy. In what ways can we decrease this gap? Select all that apply.
- Train on a larger dataset.
- Add more hidden units.
- Increase the regularization strength.
- None of the above.
$\color{blue}{\textit Your Answer:}$ 1 and 3
$\color{blue}{\textit Your Explanation:}$
A training accuracy much higher than the test accuracy usually means the model is **overfitting**; to shrink the gap we need to reduce overfitting:
(1) Training on a larger dataset improves generalization. (3) Increasing the regularization strength constrains model complexity, keeps the weights from growing too large, and reduces overfitting, which improves test performance and shrinks the gap. (2) Adding more hidden units increases model capacity and tends to make overfitting worse.
Feature Extraction
HSV color space (Hue, Saturation, Value)
HSV is a representation that splits a color into "hue + saturation + value", which is closer to how people intuitively think about color.
- H (Hue): which color it is, e.g., red / yellow / green / blue.
  - Often used for color-related features because, compared with RGB, it separates "what color" from "how bright" more cleanly. A histogram is typically used to visualize an image's color distribution.
- S (Saturation): how pure the color is; the lower it is, the grayer and paler the color.
- V (Value): how bright the color is; the higher, the brighter.
HOG (Histogram of Oriented Gradients): capturing the local distribution of edges
Core idea
An object's shape and texture are determined mainly by its edges, and edges can be described by gradient orientation.
Split the image into small cells, build a histogram of edge orientations inside each cell, then normalize to resist illumination changes.
This emphasizes local shape and de-emphasizes color and overall brightness.
Gradient computation
How it is computed, and why it relates to edges (a numpy sketch follows the list below):
For a grayscale image $I(x,y)$, the gradient is approximated by \(G_x = I(x+1,y) - I(x-1,y), \quad G_y = I(x,y+1) - I(x,y-1)\) Gradient magnitude (edge strength): \(m = \sqrt{G_x^2 + G_y^2}\) Gradient orientation (edge direction): \(\theta = \mathrm{arctan2}(G_y, G_x)\)
Intuitively:
- Flat regions: little change → small $m$
- At an edge: large change → large $m$
- The edge direction is given by $\theta$
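A small numpy sketch of these formulas (the synthetic image and the boundary handling are illustrative choices, not part of the assignment code):

```python
import numpy as np

def image_gradients(gray):
    """Central-difference gradients of a 2-D grayscale image (a simple sketch)."""
    Gx = np.zeros_like(gray, dtype=float)
    Gy = np.zeros_like(gray, dtype=float)
    Gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]   # I(x+1,y) - I(x-1,y)
    Gy[1:-1, :] = gray[2:, :] - gray[:-2, :]   # I(x,y+1) - I(x,y-1)
    magnitude = np.sqrt(Gx ** 2 + Gy ** 2)     # edge strength m
    orientation = np.arctan2(Gy, Gx)           # edge direction theta
    return magnitude, orientation

# a vertical edge: left half dark, right half bright
img = np.zeros((8, 8))
img[:, 4:] = 1.0
m, theta = image_gradients(img)
print(m.max())   # 1.0; the response is concentrated in the columns next to the boundary
```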
The standard HOG pipeline
- Preprocessing
  - Convert to grayscale (or take the luminance channel) and apply gamma / contrast normalization
- Compute the per-pixel gradients $G_x, G_y$
  - Simple differences or Sobel filters
  - This gives $m, \theta$ for every pixel
- Divide the image into cells
  - Cells too small: lots of detail, high dimensionality, sensitive to noise
  - Cells too large: detail is lost, discriminative power drops
- Build an orientation histogram within each cell
  - Too few bins: coarse orientation information
  - Too many bins: more sensitive, higher dimensionality, needs more data
  - Common choice: 9 bins covering 0–180° (unsigned orientation)
    - Because opposite edge directions are usually equivalent (e.g., 30° and 210°)
- Voting
  Each pixel votes for the bin containing its orientation; three refinements are common (a small sketch follows this list):
  - **Vote weight = gradient magnitude** (stronger edges contribute more)
  - Orientation interpolation: when the orientation falls between two bins, the vote is split proportionally between them
  - Spatial interpolation: pixels near a cell boundary can also split their vote among neighboring cells (the original Dalal-Triggs paper uses trilinear interpolation)
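A minimal sketch of magnitude-weighted voting with orientation interpolation for a single cell (the bin layout and helper name are illustrative assumptions; spatial interpolation across cells is omitted):

```python
import numpy as np

def cell_histogram(mag, ang_deg, nbins=9):
    """Magnitude-weighted orientation histogram for one cell, with linear
    interpolation of each vote between the two nearest bins (unsigned, 0-180)."""
    bin_width = 180.0 / nbins
    ang = np.mod(ang_deg, 180.0)              # fold into unsigned orientation
    hist = np.zeros(nbins)
    pos = ang / bin_width - 0.5               # position relative to bin centers
    lo = np.floor(pos).astype(int)
    frac = pos - lo
    for l, f, m in zip(lo.ravel(), frac.ravel(), mag.ravel()):
        hist[l % nbins] += m * (1.0 - f)      # vote for the lower bin
        hist[(l + 1) % nbins] += m * f        # and the neighboring bin
    return hist

mag = np.ones((2, 2))
ang = np.array([[10.0, 25.0], [90.0, 170.0]])
# the 25-degree vote is split 0.25 / 0.75 between the bins centered at 10 and 30 degrees
print(cell_histogram(mag, ang))
```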
- Group cells into blocks and normalize (the most important source of robustness)
  Cell histograms alone are not enough, because a brightness change scales every gradient up or down. So adjacent cells are grouped into a block and the block vector is normalized (a sketch of L2-Hys follows this list):
  - Common block size: 2×2 cells
  - Let the block vector be $v$
  - L2-norm (most common): \(v \leftarrow \frac{v}{\sqrt{\|v\|_2^2 + \epsilon^2}}\)
  - L1-norm: \(v \leftarrow \frac{v}{\|v\|_1 + \epsilon}\)
  - L2-Hys:
    - L2-normalize first
    - then clip each element (e.g., at 0.2)
    - then L2-normalize again; this is more stable against strong local edges
  - Note: blocks usually slide with a stride of one cell, so each cell appears in multiple blocks, which improves stability.
The normalized vectors of all blocks in the image (or detection window) are concatenated to form the final feature.
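A minimal sketch of L2-Hys normalization under these definitions (the clipping threshold 0.2 and $\epsilon$ are illustrative choices):

```python
import numpy as np

def l2_hys(block, eps=1e-5, clip=0.2):
    """L2-Hys normalization of a concatenated block vector (sketch)."""
    v = block / np.sqrt(np.sum(block ** 2) + eps ** 2)  # first L2 normalization
    v = np.minimum(v, clip)                             # clip (histogram entries are nonnegative)
    return v / np.sqrt(np.sum(v ** 2) + eps ** 2)       # renormalize

# e.g. a 2x2-cell block of 9-bin histograms, flattened to length 36
block = np.random.rand(36) * 10.0
print(np.linalg.norm(l2_hys(block)))   # ~1, and robust to one dominant edge
```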
Fully Connected Neural Networks
Inline Question 1:
Did you notice anything about the comparative difficulty of training the three-layer network vs. training the five-layer network? In particular, based on your experience, which network seemed more sensitive to the initialization scale? Why do you think that is the case?
Answer: The five-layer network. The deeper network is harder to train and more sensitive to the initialization scale: the more layers there are, the more easily activations and gradients vanish or explode, because the effect of the weight scale is multiplied through every layer.
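A toy forward pass illustrates the sensitivity (a minimal sketch; the layer width, scales, and batch size are made up and this is not the assignment's FullyConnectedNet). The activation scale changes roughly geometrically with depth, so a 5-layer stack drifts much further than a 3-layer one when the weight scale is slightly off:

```python
import numpy as np

# std of activations is multiplied by roughly weight_scale * sqrt(fan_in) / sqrt(2)
# per affine+ReLU layer, so a scale that is a bit too small/large shrinks/explodes
# activations faster the deeper the network is
rng = np.random.default_rng(0)
x = rng.standard_normal((100, 512))

for weight_scale in [1e-3, 1e-2, 1e-1]:
    h = x
    for layer in range(5):
        W = weight_scale * rng.standard_normal((512, 512))
        h = np.maximum(0.0, h @ W)          # affine + ReLU
        print(f'scale={weight_scale:g} layer={layer + 1} std={h.std():.2e}')
    print()
```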
Inline Question 2:
AdaGrad, like Adam, is a per-parameter optimization method that uses the following update rule:
```python
cache += dw**2                                      # accumulate squared gradients (only ever grows)
w += - learning_rate * dw / (np.sqrt(cache) + eps)  # effective step shrinks as cache grows
```
John notices that when he was training a network with AdaGrad that the updates became very small, and that his network was learning slowly. Using your knowledge of the AdaGrad update rule, why do you think the updates would become very small? Would Adam have the same issue?
You can read off directly from the AdaGrad update rule why learning slows down.
Why do the AdaGrad updates become very small?
AdaGrad accumulates the squared historical gradient of every parameter:
`cache += dw**2` only ever increases (it is monotonically non-decreasing). The update is: \(\Delta w = -\eta \frac{dw}{\sqrt{\text{cache}}+\epsilon}\)
As training proceeds, `cache` keeps growing, the denominator $\sqrt{\text{cache}}$ grows with it, and the effective learning rate $\eta /(\sqrt{\text{cache}}+\epsilon)$ keeps shrinking.
So even if the current gradient `dw` is not small, dividing it by an ever-larger denominator makes the update tiny → the parameters barely move → learning slows down, especially late in training or along directions whose gradient stays nonzero for a long time.
In short: AdaGrad's accumulated squared gradients make the learning rate decay monotonically, so the updates eventually approach 0.
Would Adam have the same issue? Generally not as severely.
Although Adam also scales by a second-moment estimate (squared gradients), it uses an exponential moving average rather than an unbounded sum:
- AdaGrad: `cache` is the sum of all past squared gradients (it keeps growing, with no upper bound in theory).
- Adam: the second moment $v_t$ is a moving average of squared gradients (it "forgets" gradients from long ago and settles at a stable scale), roughly: \(v_t = \beta_2 v_{t-1} + (1-\beta_2)\, dw^2\)
So Adam's denominator usually does not grow without bound, the effective step size does not decay all the way toward 0, and training does not grind to a halt late on because of this mechanism.
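A small sketch makes the contrast concrete: under a constant gradient, AdaGrad's accumulated cache grows without bound while Adam's (bias-corrected) second moment saturates. The values of $\beta_2$, $\epsilon$, and the step count below are illustrative choices:

```python
import numpy as np

steps, beta2, eps = 1000, 0.999, 1e-8
cache, v = 0.0, 0.0
dw = 1.0  # constant gradient of 1.0 for every step

for t in range(1, steps + 1):
    cache += dw ** 2                        # AdaGrad: unbounded accumulation
    v = beta2 * v + (1 - beta2) * dw ** 2   # Adam: exponential moving average

print(1.0 / (np.sqrt(cache) + eps))                      # ~0.03 and still shrinking
print(1.0 / (np.sqrt(v / (1 - beta2 ** steps)) + eps))   # ~1.0 (bias-corrected, stable)
```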
One more point worth noting: AdaGrad is actually well suited to sparse features / sparse gradients (e.g., word embeddings or one-hot features in NLP), because frequently updated parameters quickly get their step size shrunk while rarely updated parameters keep relatively large steps. In deep networks with dense gradients, however, you get exactly the behavior described above: the updates keep shrinking and learning slows down.