EAAI Journal 2025 Journal Article
Unlocking language boundaries: AraCLIP - transforming Arabic language and image understanding through cross-lingual models
- Muhammad Al-Barham
- Imad Afyouni
- Khalid Almubarak
- Ayad Turky
- Ibrahim Abaker Targio Hashem
- Ali Bou Nassif
- Ismail Shahin
- Ashraf Elnagar
In the domain of image retrieval, the integration of text and images has been transformative, facilitating models that transcend language barriers. This paper introduces Arabic Contrastive Language-Image Pre-training (AraCLIP), an extension of the Contrastive Language-Image Pre-training (CLIP) model tailored for Arabic image retrieval. AraCLIP builds on the CLIP architecture and uses Knowledge Distillation to transfer cross-modal knowledge from a pre-trained English model to an Arabic counterpart. Unlike existing multilingual models, which lack Arabic contextual nuances, AraCLIP addresses biases in image retrieval tasks. Our methodology involves dataset preparation, incorporating a synthetic dataset of about 12.5M samples, which were translated using a unique neural machine translation model for accurate English-to-Arabic translation. The training phase utilizes Knowledge Distillation, treating an English model as the teacher and a pre-trained Arabic model as the student. AraCLIP focuses on optimizing computational efficiency, maximizing cosine similarity, and minimizing Mean Squared Error. Our best model surpasses the state-of-the-art multilingual model by approximately 10% across various evaluation metrics, including Mean Reciprocal Rank (MRR) and Recall. In addition, it demonstrates competitive performance in ImageNet-based zero-shot classification tasks. We have also released several datasets to support text- and image-related tasks (https://huggingface.co/Arabic-Clip). AraCLIP, along with the published resources, offers enhanced capabilities for Arabic image retrieval and opens avenues for diverse applications.
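The quantities named in the abstract can be illustrated with a minimal sketch. It assumes the student (Arabic) embedding is compared directly against the teacher (English) embedding of the same caption, and that retrieval quality is summarised from the 1-based rank of the correct image per query. The function names and toy values are hypothetical and are not taken from the paper or its released code.

```python
import math

def mse_loss(student, teacher):
    """Mean squared error between two equal-length embedding vectors."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank, where rank is the 1-based position of the correct image."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks, k):
    """Fraction of queries whose correct image appears in the top k results."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Toy embeddings (assumed values): teacher = English text encoder output,
# student = Arabic text encoder output for the translated caption.
teacher_emb = [1.0, 0.0, 0.0]
student_emb = [0.5, 0.5, 0.0]

loss = mse_loss(student_emb, teacher_emb)          # distillation signal to minimise
sim = cosine_similarity(student_emb, teacher_emb)  # alignment to maximise

# Toy retrieval outcome: rank of the correct image for three queries.
ranks = [1, 2, 4]
mrr = mean_reciprocal_rank(ranks)
recall_at_2 = recall_at_k(ranks, 2)

print(f"MSE={loss:.4f} cosine={sim:.4f} MRR={mrr:.4f} R@2={recall_at_2:.4f}")
```

In this toy setting, a lower MSE and a higher cosine similarity both indicate that the student embedding lies closer to the teacher's, which is the property that lets the Arabic text encoder reuse the teacher's image-text alignment.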