Our goal was to develop an image classification solution that could scale with minimal effort, maximizing cost savings for our client. After a round of R&D, we concluded that OpenAI's CLIP model would be a perfect fit.
Trained on 400 million image-text pairs, an enormous corpus of labeled data, CLIP can understand the semantic meaning of images, providing a powerful bridge between computer vision and NLP. It is a zero-shot model, meaning no retraining is needed for it to perform image classification in domains it was not explicitly trained on.
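To illustrate how zero-shot classification with CLIP works in practice, here is a minimal sketch using the open-source `clip` package released alongside the model; the labels and image path are illustrative, not taken from our production system:

```python
# A minimal zero-shot classification sketch with OpenAI's open-source
# "clip" package; labels and image path are illustrative examples.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Candidate classes expressed as natural-language prompts.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text = clip.tokenize(labels).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    # Encode both modalities into the shared embedding space.
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize, then score each label by cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print({label: p.item() for label, p in zip(labels, probs[0])})
```

Because the classes are just text prompts, swapping in a new domain is a matter of changing the label strings, which is exactly what makes the model attractive for a low-maintenance solution.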
In our project, we first focused on classifying images of Nazi-era symbols, as required by the client. We created a dataset of thousands of relevant images and encoded both the images and their describing texts to evaluate the CLIP model. We then evaluated the results using metrics such as precision and recall, and adjusted the classification thresholds to tune the model's behavior.
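The threshold tuning step can be sketched as follows, assuming we have CLIP similarity scores and binary ground-truth labels for the evaluation set; the numbers below are made-up placeholders, and picking the F1-maximizing threshold is one reasonable strategy, not necessarily the exact criterion we used:

```python
# Sketch of threshold selection from a precision/recall curve,
# assuming `scores` holds CLIP image-text similarity scores for a
# labeled evaluation set and `y_true` the binary ground truth.
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])              # illustrative labels
scores = np.array([0.31, 0.18, 0.27, 0.29,
                   0.22, 0.15, 0.33, 0.20])               # illustrative scores

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Choose the threshold that maximizes F1 on the evaluation set.
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
best = np.argmax(f1[:-1])  # the final precision/recall pair has no threshold
print(f"threshold={thresholds[best]:.3f}, "
      f"precision={precision[best]:.3f}, recall={recall[best]:.3f}")
```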
The solution was implemented as a microservice using open-source libraries. To make the model classify images in new domains, such as weapons or drugs, the client only needs to repeat the threshold tuning for the new labels and update the microservice's configuration.
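As a rough sketch of what such a configurable microservice can look like, here is a minimal FastAPI endpoint whose prompts and thresholds come from a config file; the endpoint path, config schema, and file names are our assumptions for illustration, not the client's actual service:

```python
# Sketch of a configurable CLIP classification microservice (FastAPI).
# The route, config schema, and "domains.json" file are hypothetical.
import io
import json

import clip
import torch
from fastapi import FastAPI, UploadFile
from PIL import Image

app = FastAPI()
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Per-domain prompts and thresholds, e.g.
# {"weapons": {"prompts": ["a photo of a gun"], "threshold": 0.25}}
with open("domains.json") as f:
    DOMAINS = json.load(f)

@app.post("/classify/{domain}")
async def classify(domain: str, file: UploadFile):
    cfg = DOMAINS[domain]
    image = preprocess(
        Image.open(io.BytesIO(await file.read()))
    ).unsqueeze(0).to(device)
    text = clip.tokenize(cfg["prompts"]).to(device)
    with torch.no_grad():
        img_f = model.encode_image(image)
        txt_f = model.encode_text(text)
        img_f /= img_f.norm(dim=-1, keepdim=True)
        txt_f /= txt_f.norm(dim=-1, keepdim=True)
        scores = (img_f @ txt_f.T)[0]
    # Flag the image if any prompt's similarity exceeds the domain threshold.
    flagged = bool((scores > cfg["threshold"]).any())
    return {"domain": domain, "flagged": flagged, "scores": scores.tolist()}
```

With this design, adding a domain means adding an entry to the config with new prompts and a tuned threshold; no model code changes or redeployments of new weights are required.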