Today, the exponential rise of large models developed by academic and
industrial institutions with the help of massive computing resources raises the
question of whether someone without access to such resources can make a
valuable scientific contribution. To explore this, we tried to solve the
challenging task of multilingual image retrieval having a limited budget of
$1,000. As a result, we present NLLB-CLIP - a CLIP model with a text encoder
from the NLLB model. To train the model, we used an automatically created
dataset of 106,246 good-quality images with captions in 201 languages derived
from the LAION COCO dataset. We trained multiple models using image and text
encoders of various sizes and kept different parts of the model frozen during
the training. We thoroughly analyzed the trained models using existing
evaluation datasets and newly created XTD200 and Flickr30k-200 datasets. We
show that NLLB-CLIP is comparable in quality to state-of-the-art models and
significantly outperforms them on low-resource languages.