# Key Findings

## The Core Problem: Video Domain Gap
The similarity analysis revealed a critical limitation that makes reliable angle prediction infeasible with the current data. Ideally, intra-class similarity across videos should be at least 0.7 for robust detection.
### Similarity Matrix Results
| Metric | ResNet-50 | ConvNeXt-Base |
|---|---|---|
| Intra-class similarity (same video) | ~0.7-0.9 | ~0.7-0.9 |
| Intra-class similarity (cross-video) | ~0.3-0.5 | ~0.3-0.5 |
| Inter-class similarity | ~0.3-0.4 | ~0.3-0.4 |
**The Problem**
Cross-video similarity within the same class (~0.3-0.5) is nearly as low as inter-class similarity (~0.3-0.4).
This means the models can't tell the difference between:
- A tube from video 1 vs a tube from video 2 (same class, different video)
- A tube vs a pipette (different classes)
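The numbers in the table reduce to masked means over a pairwise cosine-similarity matrix. The sketch below shows one way to compute them, assuming an `(N, D)` feature matrix plus per-sample class and video label arrays (`feats`, `classes`, `videos` are hypothetical names, not the analysis notebook's actual variables):

```python
import numpy as np

def similarity_stats(feats, classes, videos):
    """Mean cosine similarity in the three regimes from the table above:
    same class/same video, same class/different video, different class."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T                                  # pairwise cosine similarities
    same_cls = classes[:, None] == classes[None, :]
    same_vid = videos[:, None] == videos[None, :]
    off_diag = ~np.eye(len(feats), dtype=bool)     # ignore self-similarity
    return {
        "intra_same_video":  sim[same_cls & same_vid & off_diag].mean(),
        "intra_cross_video": sim[same_cls & ~same_vid].mean(),
        "inter_class":       sim[~same_cls].mean(),
    }
```

When `intra_cross_video` sits in the same range as `inter_class`, as measured here, nearest-neighbour matching across videos is effectively guessing.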
### Visual Evidence
The per-class similarity matrices show a clear block diagonal structure:
- Samples from the same video cluster together (high similarity)
- Samples from different videos are dissimilar (low similarity)
- The video boundary is clearly visible as a sharp drop in similarity
This pattern appears for both ResNet-50 and ConvNeXt-Base, confirming it's not model-specific (see Analysis Notebook for implementation).
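The block-diagonal structure is easy to surface from the raw features. This sketch, assuming the same kind of hypothetical `(N, D)` feature matrix and per-frame video labels (names are illustrative, not from the notebook), sorts samples by source video and quantifies the drop at the video boundary:

```python
import numpy as np

def video_sorted_similarity(feats, videos):
    """Cosine-similarity matrix with samples grouped by source video, so
    video-dominated features show up as bright diagonal blocks."""
    order = np.argsort(videos, kind="stable")
    f = feats[order]
    f = f / np.linalg.norm(f, axis=1, keepdims=True)
    return f @ f.T, videos[order]

def block_contrast(sim, sorted_videos):
    """Mean within-video similarity minus mean cross-video similarity;
    a large positive gap is the sharp drop at the video boundary."""
    same = sorted_videos[:, None] == sorted_videos[None, :]
    off_diag = ~np.eye(len(sim), dtype=bool)
    return sim[same & off_diag].mean() - sim[~same].mean()
```

Passing the first return value to `matplotlib`'s `imshow` reproduces the block-diagonal picture described above.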
## Why This Happens

### 1. Video-Specific Context Dominates
The features capture:
- Background patterns and colors
- Lighting conditions
- Camera angle and perspective
- Tube rack layout
These video-specific characteristics dominate the feature representation, overwhelming the subtle geometric differences we actually care about (pipette-tube angle).
Compounding this, both the pipette and the tube are semi-transparent, so their appearance changes with whatever background shows through them.
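A crude probe of the claim that video context dominates (a sketch for illustration, not part of the original analysis) is to subtract each video's mean feature vector and check whether cross-video, same-class similarity recovers. `feats` and `videos` are hypothetical arrays as elsewhere on this page:

```python
import numpy as np

def remove_video_mean(feats, videos):
    """Subtract each video's mean feature vector. If cross-video, same-class
    similarity rises sharply afterwards, the raw features were dominated by
    shared video context rather than object geometry."""
    out = feats.astype(float).copy()
    for v in np.unique(videos):
        mask = videos == v
        out[mask] -= out[mask].mean(axis=0)
    return out
```

Note that this centering cannot be applied to a new, unseen video at inference time, so it diagnoses the problem without solving it.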
### 2. Only 2 Videos = 2 Domains
With only 2 videos, we effectively have only 2 "domains." Any learned representation will overfit to distinguishing these domains rather than learning generalizable angle features.
## Implications

### Training a Custom Model: Not Viable
With ~100 frames from only 2 videos and a few photos:
- Insufficient diversity — A model would memorize video-specific patterns
- No generalization — Would fail on any new video (like the holdout set)
- Overfitting guaranteed — 2 videos can't represent the distribution of all possible setups
I tried this with both pretrained conv-nets and SAM-like models; in both cases the result was overfitting and a lack of generalization.
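The failure mode can be illustrated with a stand-in far simpler than the conv-nets and SAM-like models actually tried: a nearest-centroid classifier evaluated leave-one-video-out. This is a hedged sketch (array names are hypothetical), not the evaluation code used:

```python
import numpy as np

def leave_one_video_out(feats, classes, videos, holdout):
    """Fit class centroids on every video except `holdout`, then report
    (train_accuracy, holdout_accuracy) for a nearest-centroid classifier."""
    train = videos != holdout
    labels = np.unique(classes)
    centroids = np.stack(
        [feats[train & (classes == c)].mean(axis=0) for c in labels]
    )

    def predict(x):
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        return labels[dists.argmin(axis=1)]

    train_acc = (predict(feats[train]) == classes[train]).mean()
    test_acc = (predict(feats[~train]) == classes[~train]).mean()
    return train_acc, test_acc
```

With video-dominated features, train accuracy is near-perfect while accuracy on the held-out video hovers near chance, which is exactly the overfitting pattern described above.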
## Conclusion
**Summary**
The provided dataset is insufficient for reliable angle prediction due to:
- Too little data — generalizable feature extraction for semi-transparent objects would require far more data
- Low cross-video feature similarity — Pretrained models are unable to extract relevant features, even from cropped image sets
**Alternatives**
Alternative solutions can be found here