I am interested in contributing to Keras by implementing DINOv3 (Distillation with No labels v3), a state-of-the-art self-supervised Vision Transformer model, as an example/tutorial. Before proceeding, I would like to confirm whether this aligns with the project's goals, and whether there are existing implementations or guidelines I should be aware of.
**Why DINOv3?**
- **State-of-the-art performance:** DINOv3 achieves top-tier results on a wide range of vision tasks without requiring labeled data, making it a valuable addition to the Keras examples.
- **Versatility:** It serves as a strong backbone for tasks such as image classification, segmentation, and object detection.
- **Alignment with Keras 3:** Given Keras 3's multi-backend support (TensorFlow, JAX, PyTorch), implementing DINOv3 would showcase the framework's flexibility.
**Implementation Plan:**
- **Model architecture:** Implement a Vision Transformer (ViT) backbone trained with the DINOv3 self-distillation objective.
- **Training:** Use standard datasets, e.g. CIFAR-10 for a quick, reproducible demo, or ImageNet for a more realistic setting.
- **Backend compatibility:** Ensure the implementation runs on the TensorFlow, JAX, and PyTorch backends.
- **Documentation:** Provide clear usage instructions, including training and evaluation scripts.
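To make the plan concrete, the backbone piece could be sketched in backend-agnostic Keras 3 code. This is a minimal toy ViT, not the real DINOv3: the sizes (`image_size=32`, `dim=64`, etc.), the `PositionalEmbedding` helper, and the `build_vit_backbone` function are all hypothetical choices for illustration, and the DINOv3 distillation heads and loss are omitted. It uses only `keras.layers`, so the same code should run unchanged under any of the three backends:

```python
import keras
from keras import layers


class PositionalEmbedding(layers.Layer):
    """Adds a learnable positional embedding to a patch-token sequence."""

    def __init__(self, num_patches, dim, **kwargs):
        super().__init__(**kwargs)
        self.pos = self.add_weight(
            shape=(1, num_patches, dim), initializer="zeros", name="pos_embed"
        )

    def call(self, x):
        return x + self.pos


def build_vit_backbone(image_size=32, patch_size=4, dim=64,
                       depth=2, num_heads=4, mlp_dim=128):
    # Hypothetical sizes chosen for a CIFAR-10-scale demo, not DINOv3's.
    num_patches = (image_size // patch_size) ** 2
    inputs = keras.Input(shape=(image_size, image_size, 3))
    # Patchify with a strided convolution, then flatten to a token sequence.
    x = layers.Conv2D(dim, patch_size, strides=patch_size)(inputs)
    x = layers.Reshape((num_patches, dim))(x)
    x = PositionalEmbedding(num_patches, dim)(x)
    for _ in range(depth):
        # Pre-norm transformer encoder block with residual connections.
        h = layers.LayerNormalization()(x)
        h = layers.MultiHeadAttention(num_heads=num_heads,
                                      key_dim=dim // num_heads)(h, h)
        x = layers.Add()([x, h])
        h = layers.LayerNormalization()(x)
        h = layers.Dense(mlp_dim, activation="gelu")(h)
        h = layers.Dense(dim)(h)
        x = layers.Add()([x, h])
    x = layers.LayerNormalization()(x)
    # Pool tokens into a single feature vector for downstream tasks.
    outputs = layers.GlobalAveragePooling1D()(x)
    return keras.Model(inputs, outputs, name="vit_backbone")


model = build_vit_backbone()
print(model.output_shape)  # (None, 64)
```

Switching backends would then just be a matter of setting `KERAS_BACKEND` (e.g. `KERAS_BACKEND=jax`) before importing Keras; the self-distillation training loop would be built on top of this backbone in the actual example.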