NLP, ML frameworks, and cloud infrastructure have evolved considerably in 2022. This post sets out an opinionated technology stack template for common NLP tasks.
- Part 1. Fine-tuning and training (YOU ARE HERE).
- Part 2. Deployment and inference.
Part 1 (this post) focuses on the stack for training and fine-tuning Large Language Models (LLMs) within the currently most performant transfer-learning paradigm, built on the transformer architecture.
The required features are based on personal experience and a common stack shared across multiple clients and academic research. Of course, different use cases will require different tech, and the cost-benefit calculation is made case by case. You might not even need deep learning or transformers at all. The proposed stack is likely over-specced for many NLP tasks and can/should be trimmed down as needed.
- Typed Python stack with test-driven development and linting. For hardcore modeling purposes, typed Python is arguably not very useful (everything is a `torch.Tensor` or a well-defined module within `torch.nn`). But for all the scaffolding and data pipelining, type hinting assures safety and correctness. Easy speed gains from `mypyc` compilation are in their infancy and will probably not be prime-time ready in 2023, but who knows?
- Train-time efficiency: make maximal use of plug-and-play solutions to reduce the cost of training PyTorch models, namely Automatic Mixed Precision and model compilation.
- Checkpointing and model training monitoring: good MLOps, maintaining visibility into the model lifecycle and data artifacts, is key to success in ML projects.
- Multi-cloud capable: this example will rely on AWS services (S3 and SageMaker). Cloud services change depending on your use case or client, so provider-specific dependencies should not be tightly integrated. As a principle, you also want to minimize vendor lock-in and non-transferable "VendorOps". We prefer mocked, local-first development here.
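To make the "typed scaffolding" point concrete, here is a minimal sketch of the kind of data-pipelining helper where type hints pay off; all names (`Example`, `batch_texts`) are hypothetical illustrations, not part of the template repository:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """A single labeled text example (hypothetical pipeline type)."""
    text: str
    label: int


def batch_texts(examples: list[Example], batch_size: int) -> list[list[str]]:
    """Split example texts into fixed-size batches.

    Fully annotated, so mypy catches misuse (e.g. passing raw strings
    instead of Example objects) before anything reaches the GPU.
    """
    texts = [ex.text for ex in examples]
    return [texts[i : i + batch_size] for i in range(0, len(texts), batch_size)]
```

Nothing here touches `torch`; this is exactly the glue layer where static typing, rather than the model code itself, earns its keep.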
- Generic Python development stack: Typed Python in test-driven development with decent linting.
CPython 3.11+: Python's latest major version shows a 10%-60% performance increase in common text-file I/O tasks thanks to the Faster CPython initiative. Pandas support for 3.11 is maturing but not yet benchmarked; performance improvements for common DataFrame operations will likely be modest.
`poetry` for package/venv management: in late 2021, Poetry added the ability to target multiple environments (e.g. dev, prod) with named dependency groups, eliminating its biggest drawback for full environment management. I used to recommend pyenv+pipenv, but no longer; Poetry has become a best-practices Swiss Army knife of Python packaging and venv'ing.
`mypy` for static type-checking, with its local caching daemon for frictionless iterative development.
`pytest` for the test suite, with plug-ins as needed.
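As a sketch of how the testing layer looks in practice, a pytest module is just plain functions with assertions; the tokenizer below is a toy stand-in, not part of the template:

```python
# test_tokenize.py -- illustrative pytest module (helper names are hypothetical)


def whitespace_tokenize(text: str) -> list[str]:
    """Toy tokenizer used only to illustrate the test setup."""
    return text.split()


def test_whitespace_tokenize() -> None:
    assert whitespace_tokenize("hello world") == ["hello", "world"]


def test_empty_string() -> None:
    # Edge case: no tokens in an empty input.
    assert whitespace_tokenize("") == []
```

Running `pytest` from the project root discovers and executes any `test_*` functions; combined with `mypy`, the annotations above are checked statically as well.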
PyTorch 2.0: the killer feature added in the new version of PyTorch is model compilation using TorchDynamo: a 1.3-2x training speedup by simply adding `model = torch.compile(model)` to your existing code.
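A minimal sketch of how compilation and Automatic Mixed Precision slot into an ordinary training step (the model and data here are placeholders; the `backend="eager"` argument is only to keep the sketch runnable without a C toolchain, the default inductor backend is what delivers the speedup):

```python
import torch

model = torch.nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# PyTorch 2.0 model compilation; one line on top of existing code.
compiled = torch.compile(model, backend="eager")

use_cuda = torch.cuda.is_available()
# AMP gradient scaler; a transparent no-op when disabled (CPU).
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

x = torch.randn(8, 128)
y = torch.randint(0, 2, (8,))

# Autocast runs eligible ops in lower precision; enabled only on GPU here.
with torch.autocast("cuda" if use_cuda else "cpu", enabled=use_cuda):
    loss = torch.nn.functional.cross_entropy(compiled(x), y)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```

The point is that both features are additive: the loop body is unchanged apart from the `torch.compile` call, the autocast context, and the scaler wrapping the backward/step.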
huggingface/transformers: we rely on the Hugging Face ecosystem for ease of use and fast time-to-MVP with state-of-the-art models. While Hugging Face offers batteries-included cloud services, we will only use its local modeling, training, inference, and deployment functions to avoid vendor lock-in.
`huggingface/accelerate` for significant training-time memory and speed gains.
A source-code template is available as a cookiecutter git repository. A base Docker image on Docker Hub and its buildfile are included.