A journey to scaling the training of HuggingFace models for large data through tokenizers and Trainer API.

Photo by Bernard Hermant on Unsplash

There are a lot of example notebooks available for the different NLP tasks that can be accomplished with the mighty HuggingFace library. But when I tried to apply one of these tasks to a custom problem and dataset of my own, I ran into a major issue with memory usage.

The examples presented by HuggingFace follow the pipeline of applying a tokenizer first and then a model. However, tokenizing your whole dataset up front can be heavy on memory, and you might not even…
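The preview cuts off here, but the memory problem it describes can be sidestepped by tokenizing lazily instead of all at once. Below is a minimal sketch of that idea: a map-style dataset that tokenizes one example on demand, in the `__len__`/`__getitem__` shape the PyTorch `DataLoader` (and hence the HuggingFace `Trainer`) expects. The `toy_tokenize` function is a hypothetical stand-in for a real HuggingFace tokenizer call such as `tokenizer(text, truncation=True)`.

```python
class LazyTokenizedDataset:
    """Map-style dataset that tokenizes one example at a time,
    instead of materializing token ids for the whole corpus up front."""

    def __init__(self, texts, tokenize_fn):
        self.texts = texts            # only the raw strings stay in memory
        self.tokenize_fn = tokenize_fn

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Tokenization happens here, per example, on demand.
        return self.tokenize_fn(self.texts[idx])


def toy_tokenize(text):
    # Stand-in for a real tokenizer; a HuggingFace tokenizer would
    # return a dict with input_ids / attention_mask instead.
    return {"input_ids": [len(tok) for tok in text.split()]}


corpus = ["hello world", "scaling tokenization lazily"]
ds = LazyTokenizedDataset(corpus, toy_tokenize)
item = ds[1]  # tokenized only when accessed
```

Because nothing is tokenized until an index is accessed, peak memory stays proportional to the raw text plus one batch of token ids, rather than the entire encoded corpus.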

Fatih Kılıç

Machine Learning Engineer & Enthusiast
