Multi-modal learning is about gathering and integrating information from different data modalities such as tabular, text, image and video. Industrial applications are growing, especially in the e-commerce and energy sectors, where multi-modal prediction is often the only way to solve the business problem, so the need to implement it well keeps growing. This article covers tabular, text and image data and explains the methodology in detail.
There are two primary challenges. The first is how to interpret text and images so that the model can learn from their variation and predict successfully. The second is how to combine and connect information from the different modalities.
Interpreting Text & Image
Many algorithms are available for interpreting text through keyword extraction. Each extracted keyword carries a score that indicates its significance to the overall document. Some algorithms also accept seed keywords as parameters; a simple use case is amplifying the scores of keywords that are close to the seed keywords. Algorithms historically well suited to keyword extraction include:
Bag of words
Each of these has drawbacks that depend on the use case, in terms of simplicity, accuracy and time complexity. For further details on keyword extraction, refer to https://medium.com/@publiciscommerce/keyword-extraction-in-e-commerce-ce9bea81b471
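As a rough illustration of keyword scoring with seed keywords, here is a minimal bag-of-words sketch in plain Python. The stop-word list, the `extract_keywords` helper and the `seed_boost` parameter are hypothetical names made up for this example. The boost is also a simplification: it only applies to exact matches with a seed keyword, whereas real algorithms typically boost keywords that are semantically close to the seeds.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "for", "with", "of", "in", "is", "to"}

def extract_keywords(text, seed_keywords=(), seed_boost=2.0, top_k=5):
    """Score words by relative frequency and boost exact seed matches."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if t not in STOP_WORDS]
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    seeds = {s.lower() for s in seed_keywords}
    scores = {
        word: (count / total) * (seed_boost if word in seeds else 1.0)
        for word, count in counts.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

doc = ("Wireless headphones with noise cancelling. "
       "Headphones for travel and for the office.")
keywords = extract_keywords(doc, seed_keywords=["travel"])
plain_keywords = extract_keywords(doc)
```

With the seed keyword, "travel" is lifted to the same score as the most frequent word, "headphones"; without it, "travel" scores the same as every other word that appears once.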
Interpreting an image means efficiently extracting, identifying and tagging its visual content: objects, labels, area, dominant colors, text inside the image (Optical Character Recognition) and the respective bounding boxes. Several powerful computer vision APIs make this information easy to extract. A few of the best such APIs are:
Google Cloud Vision API
Microsoft Azure Computer Vision API
IBM Watson Visual Recognition
Each of these APIs can be used by following the documentation provided by the respective vendor.
Connecting Information from all modalities
One easy and obvious way to predict from these three modalities is to build three separate models. Good accuracy in each separate model gives confidence in its predictions and the corresponding recommendations. The drawback is that relationships among features from different modalities are lost with this approach.
The second approach is to use traditional ML models such as CatBoost, XGBoost or Random Forest and convert the data from every modality to tabular form. Text can be converted into keyword-based columns, with either one-hot encoding or the keyword scores as values. For image insights, most of the features can be encoded directly, and text inside the image can be handled the same way as the text modality. This yields a single reliable model for prediction. Its output can be interpreted with explainable AI techniques such as SHAP, which helps in understanding how each feature from the different modalities influences a specific prediction. The drawback is that generating recommendations becomes almost impossible given the time complexity of so many encoded features.
However, a more optimal and scientifically viable approach is to build a single deep learning model in which all feature types share the same embedding space. The model then learns from all the features together, and prediction improves without us having to manually encode most of the features coming from the text and image modalities. There are different methods for converting high-dimensional text and image data to embeddings.
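The conversion of text keywords into tabular columns can be sketched as follows. This is a minimal plain-Python illustration, not a full pipeline: the `build_vocabulary` and `to_tabular_row` helpers, the `kw_` column prefix and the sample products are all hypothetical. The resulting rows could then be passed to a model such as CatBoost or XGBoost, which is not shown here.

```python
def build_vocabulary(keyword_lists):
    """Collect every keyword seen across all documents, in sorted order."""
    return sorted({kw for kws in keyword_lists for kw, _ in kws})

def to_tabular_row(tabular, keywords, vocabulary, use_scores=False):
    """Merge tabular features with one column per keyword.

    use_scores=False gives one-hot columns (1 if the keyword is present);
    use_scores=True stores the keyword score instead (0.0 if absent).
    """
    scores = dict(keywords)
    row = dict(tabular)
    for kw in vocabulary:
        if use_scores:
            row[f"kw_{kw}"] = scores.get(kw, 0.0)
        else:
            row[f"kw_{kw}"] = 1 if kw in scores else 0
    return row

# Hypothetical sample data: tabular features plus (keyword, score) pairs.
products = [
    {"tabular": {"price": 59.0, "in_stock": 1},
     "keywords": [("headphones", 0.4), ("wireless", 0.2)]},
    {"tabular": {"price": 15.5, "in_stock": 0},
     "keywords": [("cable", 0.5), ("wireless", 0.1)]},
]

vocabulary = build_vocabulary(p["keywords"] for p in products)
one_hot_rows = [to_tabular_row(p["tabular"], p["keywords"], vocabulary)
                for p in products]
score_rows = [to_tabular_row(p["tabular"], p["keywords"], vocabulary,
                             use_scores=True)
              for p in products]
```

Note how the number of columns grows with the vocabulary, which is the source of the time-complexity drawback mentioned above.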
Getting Text & Image Embeddings
Text can be turned into model input using either a tokenization or an embedding approach. Tokenization can be performed with many of the pretrained models available on Hugging Face. A document is first tokenized into input_ids, token_type_ids and attention_mask arrays. A pretrained model loaded through TFAutoModel can then be used for text classification or regression. However, to use it along with tabular data, the shape of the tokenized features has to be in sync with the tabular features for the neural network to train properly. Additional custom layers also have to be added on top of the TFAutoModel output to handle both modalities together. Models that can be used for tokenization include:
BERT (base, large, word)
RoBERTa (base, large)
DistilBERT (base, large)
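To make the three tokenizer outputs concrete, here is a toy whitespace tokenizer in plain Python. It is only a stand-in for illustration: real pretrained tokenizers use subword vocabularies and special tokens, and the `build_vocab` and `encode` helpers and the `max_length` value below are hypothetical. The sketch pads or truncates every text to the same length so that all rows share one shape.

```python
PAD_ID, UNK_ID = 0, 1  # Hypothetical ids for padding and unknown words.

def build_vocab(texts):
    """Assign an integer id to every whitespace-separated word."""
    vocab = {"[PAD]": PAD_ID, "[UNK]": UNK_ID}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab, max_length):
    """Return fixed-length input_ids, token_type_ids and attention_mask."""
    ids = [vocab.get(w, UNK_ID) for w in text.lower().split()][:max_length]
    n_real = len(ids)
    n_pad = max_length - n_real
    return {
        "input_ids": ids + [PAD_ID] * n_pad,
        # Single-segment input, so every position belongs to segment 0.
        "token_type_ids": [0] * max_length,
        # 1 marks a real token, 0 marks padding the model should ignore.
        "attention_mask": [1] * n_real + [0] * n_pad,
    }

texts = ["red cotton shirt", "blue shirt"]
vocab = build_vocab(texts)
encoded = [encode(t, vocab, max_length=4) for t in texts]
```

Both texts end up with arrays of length 4 even though they contain a different number of words; the attention mask records which positions are padding.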
The drawback of tokenization is that the size of the tokenized features varies with the length of the input text, which makes it difficult to get a standardized shape for all features before model training. Padding makes it possible, but the time complexity becomes too high. Therefore, converting text directly into embeddings of a fixed length is the better approach. Models that help with this include:
BERT base nli mean tokens
MiniLM L6 v2
These differ based on use case, time complexity and accuracy.
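One reason such models return a fixed-length vector is pooling: the per-token vectors are averaged into a single vector whose size does not depend on the number of tokens. The sketch below shows masked mean pooling in plain Python under that assumption. The `mean_pool` helper and the two-dimensional toy vectors are hypothetical; real models produce vectors with hundreds of dimensions.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, ignoring padded positions."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    n_real = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            n_real += 1
            for i, value in enumerate(vector):
                total[i] += value
    # Guard against division by zero for an all-padding input.
    return [t / max(n_real, 1) for t in total]

# Two real tokens followed by one padded position.
short_text = mean_pool([[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]], [1, 1, 0])
# Three real tokens, no padding.
long_text = mean_pool([[1.0, 0.0], [1.0, 2.0], [1.0, 4.0]], [1, 1, 1])
```

Both results have the same length (two values here) regardless of how many tokens the text contained, which is what allows them to sit next to tabular features.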
For images, similar models convert any image to an embedding. However, to use text and image embeddings together, they should be in a similar space and in sync. One way to achieve that is to use a model that provides both text and image encoders, such as CLIP, so that the two embeddings are automatically in sync. Sometimes that is not possible: for example, every text embedding model has a limit on the maximum input text length it accepts. In that case we might end up using a different text embedding model with a larger input limit and a separate model for image embeddings. Neural network libraries such as PyTorch or Keras can then be used to bring both embedding outputs to the same shape and size.
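Bringing two embeddings of different sizes to the same shape usually means passing each through its own linear layer with a common output size. The sketch below imitates that with plain Python lists. In practice this would be a trainable layer in PyTorch or Keras whose weights are learned together with the rest of the network; here the weights are random and untrained, and the `make_projection` and `project` helpers and the sizes 384, 512 and 128 are illustrative assumptions.

```python
import random

def make_projection(in_dim, out_dim, seed=0):
    """Create a random (untrained) weight matrix of shape out_dim x in_dim."""
    rng = random.Random(seed)
    scale = 1.0 / in_dim ** 0.5
    return [[rng.uniform(-scale, scale) for _ in range(in_dim)]
            for _ in range(out_dim)]

def project(vector, weights):
    """Apply the linear map: one dot product per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

# Placeholder embeddings with different sizes (illustrative values only).
text_embedding = [0.1] * 384
image_embedding = [0.2] * 512

shared_dim = 128
text_projection = make_projection(384, shared_dim, seed=1)
image_projection = make_projection(512, shared_dim, seed=2)

text_vector = project(text_embedding, text_projection)
image_vector = project(image_embedding, image_projection)
```

After projection both vectors have `shared_dim` values. Matching shapes alone do not make the two spaces semantically aligned; that alignment only emerges once the projection weights are trained on the downstream task.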
These embeddings, along with the tabular features, can now be used directly to train a custom ANN or any other neural network (depending on the use case). The resulting prediction is reliable and captures relationships among features from different modalities. Because we have both embeddings and ordinary features, the data can easily become too large even for CUDA-enabled GPU training to handle at once, so it is worth converting the input data to the TensorFlow dataset format so that training proceeds in batches. Also, if the features are not in sync, the conversion to a dataset will itself throw an error before model training starts, so it doubles as a precautionary check.
In conclusion, there are many use cases where the input for prediction comes from multiple modalities. Converting the high-dimensional data from these modalities to embeddings, and feeding those embeddings together with the tabular features, in the same embedding space, into a customized neural model is the key to building a multi-modal prediction engine.
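The idea of combining the modalities, checking that they are in sync and feeding them in batches can be sketched in plain Python. This is a stand-in for illustration and does not use the TensorFlow dataset API; the `fuse` and `batches` helpers and the sample values are hypothetical.

```python
def fuse(tabular, text_vectors, image_vectors):
    """Concatenate the three modalities row by row, failing early on mismatch."""
    if not (len(tabular) == len(text_vectors) == len(image_vectors)):
        raise ValueError("modalities have different numbers of rows")
    rows = [list(t) + list(x) + list(i)
            for t, x, i in zip(tabular, text_vectors, image_vectors)]
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        raise ValueError(f"inconsistent feature widths: {sorted(widths)}")
    return rows

def batches(rows, labels, batch_size):
    """Yield (features, labels) slices so only one batch is handled at a time."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size], labels[start:start + batch_size]

# Hypothetical sample: 2 tabular features, 3-value text and image embeddings.
tabular = [[59.0, 1.0], [15.5, 0.0], [120.0, 1.0]]
text_vectors = [[0.1, 0.2, 0.3]] * 3
image_vectors = [[0.5, 0.6, 0.7]] * 3
labels = [1, 0, 1]

fused = fuse(tabular, text_vectors, image_vectors)
batch_list = list(batches(fused, labels, batch_size=2))
```

Each fused row has 2 + 3 + 3 = 8 values, and the three rows are split into one batch of two and one batch of one. If one modality is missing a row, `fuse` raises a `ValueError` before any training code runs, mirroring the early check described above.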