Key Findings

The Core Problem: Video Domain Gap

The similarity analysis revealed a critical limitation that makes reliable angle prediction infeasible with the current data. Ideally, intra-class similarity across videos should be at least 0.7 for robust detection.

Similarity Matrix Results

Metric                                   ResNet-50   ConvNeXt-Base
Intra-class similarity (same video)      ~0.7-0.9    ~0.7-0.9
Intra-class similarity (cross-video)     ~0.3-0.5    ~0.3-0.5
Inter-class similarity                   ~0.3-0.4    ~0.3-0.4
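
A minimal sketch of how numbers like these can be computed, assuming cropped frames plus per-sample class and source-video labels are available (the `images`, `classes`, and `videos` variables are placeholders, and cosine similarity over ImageNet-pretrained features is assumed):

```python
import numpy as np
import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights

# Pretrained backbone with the classification head removed.
weights = ResNet50_Weights.IMAGENET1K_V2
model = resnet50(weights=weights)
model.fc = torch.nn.Identity()          # output: 2048-d pooled features
model.eval()
preprocess = weights.transforms()

@torch.no_grad()
def embed(images):
    """List of PIL images -> (N, 2048) L2-normalized feature matrix."""
    batch = torch.stack([preprocess(img) for img in images])
    return F.normalize(model(batch), dim=1)

feats = embed(images)                    # `images`: cropped frames (placeholder)
S = (feats @ feats.T).numpy()            # (N, N) cosine similarity matrix

# Group pairwise similarities by label structure.
classes, videos = np.array(classes), np.array(videos)   # per-sample labels
same_class = classes[:, None] == classes[None, :]
same_video = videos[:, None] == videos[None, :]
off_diag = ~np.eye(len(S), dtype=bool)

print("intra-class, same video :", S[same_class & same_video & off_diag].mean())
print("intra-class, cross-video:", S[same_class & ~same_video].mean())
print("inter-class             :", S[~same_class].mean())
```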

The Problem

Cross-video similarity within the same class (~0.3-0.5) is nearly as low as inter-class similarity (~0.3-0.4).

This means the models can't tell the difference between:

  • A tube from video 1 vs a tube from video 2 (same class, different video)
  • A tube vs a pipette (different classes)

Visual Evidence

The per-class similarity matrices show a clear block-diagonal structure:

  • Samples from the same video cluster together (high similarity)
  • Samples from different videos are dissimilar (low similarity)
  • The video boundary is clearly visible as a sharp drop in similarity

This pattern appears for both ResNet-50 and ConvNeXt-Base, confirming it's not model-specific (see Analysis Notebook for implementation).
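
The block structure is easiest to see when samples are sorted by source video before plotting. A sketch reusing `S` and `videos` from the snippet above (the boundary marker assumes exactly 2 videos):

```python
import matplotlib.pyplot as plt
import numpy as np

# Sort samples so frames from the same video are contiguous; the video
# boundary then appears as a sharp edge between similarity blocks.
order = np.argsort(videos, kind="stable")
plt.imshow(S[np.ix_(order, order)], vmin=0, vmax=1, cmap="viridis")

# Mark the boundary between the two videos.
boundary = np.searchsorted(videos[order], videos[order][-1])
plt.axhline(boundary - 0.5, color="red", lw=1)
plt.axvline(boundary - 0.5, color="red", lw=1)
plt.colorbar(label="cosine similarity")
plt.title("Similarity matrix, samples grouped by video")
plt.show()
```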


Why This Happens

1. Video-Specific Context Dominates

The features capture:

  • Background patterns and colors
  • Lighting conditions
  • Camera angle and perspective
  • Tube rack layout

These video-specific characteristics dominate the feature representation, overwhelming the subtle geometric differences we actually care about (pipette-tube angle).

This is further compounded by the fact that the pipette and tube are both semi-transparent, leaving them with few distinctive appearance cues of their own.

2. Only 2 Videos = 2 Domains

With only 2 videos, we effectively have only 2 "domains." Any learned representation will overfit to distinguishing these domains rather than learning generalizable angle features.
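
One way to quantify this domain dominance is to check how easily the source video can be read off the features. A sketch with scikit-learn, reusing `feats` and `videos` from the snippet above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# If a simple linear probe can predict the *source video* from the
# features, video identity is linearly encoded and will dominate any
# downstream fit. Accuracy near 1.0 indicates exactly that.
X = feats.numpy()
acc = cross_val_score(LogisticRegression(max_iter=1000), X, videos, cv=5).mean()
print(f"video-identity probe accuracy: {acc:.2f}")
```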

Implications

Training a Custom Model: Not Viable

With ~100 frames from only 2 videos and a few photos:

  • Insufficient diversity — A model would memorize video-specific patterns
  • No generalization — Would fail on any new video (like the holdout set)
  • Overfitting guaranteed — 2 videos can't represent the distribution of all possible setups

I tried this using both pretrained conv-nets and SAM-like models; in both cases the result was overfitting and a lack of generalization.
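
For reference, a hedged sketch of the evaluation protocol that exposes this: hold out one entire video rather than random frames (`targets` is a placeholder for whatever label was fit, e.g. binned angle):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

X = feats.numpy()
targets = np.array(targets)   # placeholder: binned-angle or class labels

# A large train/val gap here is the overfitting described above; a random
# frame-level split would leak video identity and look deceptively good.
for tr, va in LeaveOneGroupOut().split(X, targets, groups=videos):
    clf = LogisticRegression(max_iter=1000).fit(X[tr], targets[tr])
    print(f"held-out video {videos[va][0]}: "
          f"train {clf.score(X[tr], targets[tr]):.2f}, "
          f"val {clf.score(X[va], targets[va]):.2f}")
```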

Conclusion

Summary

The provided dataset is insufficient for reliable angle prediction due to:

  1. Too little data — generalizable feature extraction for semi-transparent objects will require substantially more data
  2. Low cross-video feature similarity — pretrained models are unable to extract relevant features, even from cropped image sets

Alternatives

Alternative solutions can be found here.