Key Findings

The Core Problem: Video Domain Gap

The similarity analysis revealed a critical limitation that makes reliable angle prediction infeasible with the current data. Ideally, intra-class similarity across videos should be at least 0.7 for robust detection.

Similarity Matrix Results

Metric                                   ResNet-50   ConvNeXt-Base
Intra-class similarity (same video)      ~0.7-0.9    ~0.7-0.9
Intra-class similarity (cross-video)     ~0.3-0.5    ~0.3-0.5
Inter-class similarity                   ~0.3-0.4    ~0.3-0.4
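
A minimal sketch of how numbers like these can be computed, assuming cropped frames plus per-sample class and source-video labels are available (the `images`, `classes`, and `videos` variables are placeholders, and cosine similarity over ImageNet-pretrained features is assumed):

```python
import numpy as np
import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights

# Pretrained backbone with the classification head removed.
weights = ResNet50_Weights.IMAGENET1K_V2
model = resnet50(weights=weights)
model.fc = torch.nn.Identity()          # output: 2048-d pooled features
model.eval()
preprocess = weights.transforms()

@torch.no_grad()
def embed(images):
    """List of PIL images -> (N, 2048) L2-normalized feature matrix."""
    batch = torch.stack([preprocess(img) for img in images])
    return F.normalize(model(batch), dim=1)

feats = embed(images)                    # `images`: cropped frames (placeholder)
S = (feats @ feats.T).numpy()            # (N, N) cosine similarity matrix

# Group pairwise similarities by label structure.
classes, videos = np.array(classes), np.array(videos)   # per-sample labels
same_class = classes[:, None] == classes[None, :]
same_video = videos[:, None] == videos[None, :]
off_diag = ~np.eye(len(S), dtype=bool)

print("intra-class, same video :", S[same_class & same_video & off_diag].mean())
print("intra-class, cross-video:", S[same_class & ~same_video].mean())
print("inter-class             :", S[~same_class].mean())
```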

The Problem

Cross-video similarity within the same class (~0.3-0.5) is nearly as low as inter-class similarity (~0.3-0.4).

This means the models can't tell the difference between:

  • A tube from video 1 vs a tube from video 2 (same class, different video)
  • A tube vs a pipette (different classes)

Visual Evidence

The per-class similarity matrices show a clear block-diagonal structure:

  • Samples from the same video cluster together (high similarity)
  • Samples from different videos are dissimilar (low similarity)
  • The video boundary is clearly visible as a sharp drop in similarity

This pattern appears for both ResNet-50 and ConvNeXt-Base, confirming it's not model-specific (see Analysis Notebook for implementation).
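
The block structure is easiest to see when samples are sorted by source video before plotting. A sketch reusing `S` and `videos` from the snippet above (the boundary marker assumes exactly 2 videos):

```python
import matplotlib.pyplot as plt
import numpy as np

# Sort samples so frames from the same video are contiguous; the video
# boundary then appears as a sharp edge between similarity blocks.
order = np.argsort(videos, kind="stable")
plt.imshow(S[np.ix_(order, order)], vmin=0, vmax=1, cmap="viridis")

# Mark the boundary between the two videos.
boundary = np.searchsorted(videos[order], videos[order][-1])
plt.axhline(boundary - 0.5, color="red", lw=1)
plt.axvline(boundary - 0.5, color="red", lw=1)
plt.colorbar(label="cosine similarity")
plt.title("Similarity matrix, samples grouped by video")
plt.show()
```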


Why This Happens

1. Video-Specific Context Dominates

The features capture:

  • Background patterns and colors
  • Lighting conditions
  • Camera angle and perspective
  • Tube rack layout

These video-specific characteristics dominate the feature representation, overwhelming the subtle geometric differences we actually care about (pipette-tube angle).

This is further compounded by the fact that the pipette and tube are both semi-transparent, leaving them with few distinctive appearance cues of their own.

2. Only 2 Videos = 2 Domains

With only 2 videos, we effectively have only 2 "domains." Any learned representation will overfit to distinguishing these domains rather than learning generalizable angle features.
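
One way to quantify this domain dominance is to check how easily the source video can be read off the features. A sketch with scikit-learn, reusing `feats` and `videos` from the snippet above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# If a simple linear probe can predict the *source video* from the
# features, video identity is linearly encoded and will dominate any
# downstream fit. Accuracy near 1.0 indicates exactly that.
X = feats.numpy()
acc = cross_val_score(LogisticRegression(max_iter=1000), X, videos, cv=5).mean()
print(f"video-identity probe accuracy: {acc:.2f}")
```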

Implications

Training a Custom Model: Not Viable

With ~100 frames from only 2 videos and a few photos:

  • Insufficient diversity — A model would memorize video-specific patterns
  • No generalization — Would fail on any new video (like the holdout set)
  • Overfitting guaranteed — 2 videos can't represent the distribution of all possible setups

I tried this using both pretrained conv-nets and SAM-like models; in both cases the result was overfitting and a lack of generalization.
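
For reference, a hedged sketch of the evaluation protocol that exposes this: hold out one entire video rather than random frames (`targets` is a placeholder for whatever label was fit, e.g. binned angle):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

X = feats.numpy()
targets = np.array(targets)   # placeholder: binned-angle or class labels

# A large train/val gap here is the overfitting described above; a random
# frame-level split would leak video identity and look deceptively good.
for tr, va in LeaveOneGroupOut().split(X, targets, groups=videos):
    clf = LogisticRegression(max_iter=1000).fit(X[tr], targets[tr])
    print(f"held-out video {videos[va][0]}: "
          f"train {clf.score(X[tr], targets[tr]):.2f}, "
          f"val {clf.score(X[va], targets[va]):.2f}")
```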

Conclusion

Summary

The provided dataset is insufficient for reliable angle prediction due to:

  1. Too little data — generalizable feature extraction for semi-transparent objects will require substantially more data
  2. Low cross-video feature similarity — pretrained models are unable to extract relevant features, even from cropped image sets

Alternatives

Alternative solutions can be found here.