A journey toward scaling the training of HuggingFace models on large datasets through tokenizers and the Trainer API.
There are a lot of example notebooks available for the different NLP tasks that can be accomplished with the mighty HuggingFace library. But when I tried to apply one of these tasks to a custom problem and dataset of my own, I ran into a major issue with memory usage.
The examples presented by HuggingFace follow the pipeline of first applying a tokenizer and then a model. However, tokenizing your whole dataset up front can be heavy on memory, and you might not even…