| | .. _training_api: |
| |
|
| | Training API (experimental) |
| | =========================== |
| |
|
| | Kornia provides a Training API with the specific purpose to train and fine-tune the |
| | supported deep learning algorithms within the library. |
| |
|
| | .. sidebar:: **Deep Alchemy** |
| |
|
| | .. image:: https://github.com/kornia/data/raw/main/pixie_alchemist.png |
| | :width: 100% |
| | :align: center |
| |
|
| | A seemingly magical process of transformation, creation, or combination of data to usable deep learning models. |
| |
|
| |
|
| | .. important:: |
| | In order to use our Training API you must: ``pip install kornia[x]`` |
| |
|
| | Why a Training API ? |
| | -------------------- |
| |
|
| | Kornia includes deep learning models that eventually need to be updated through fine-tuning. |
| | Our aim is to have an API flexible enough to be used across our vision models and enable us to |
| | override methods or dynamically pass callbacks to ease the process of debugging and experimentations. |
| |
|
| | .. admonition:: **Disclaimer** |
| | :class: seealso |
| |
|
| | We do not pretend to be a general purpose training library but instead we allow Kornia users to |
| | experiment with the training of our models. |
| |
|
| | Design Principles |
| | ----------------- |
| |
|
| | - `kornia` golden rule is to not have heavy dependencies. |
| | - Our models are simple enough so that a light training API can fulfill our needs. |
| | - Flexible and full control to the training/validation loops and customize the pipeline. |
| | - Decouple the model definition from the training pipeline. |
| | - Use plane PyTorch abstractions and recipes to write your own routines. |
| | - Implement `accelerate <https://github.com/huggingface/accelerate/>`_ library to scale the problem. |
| |
|
| | Trainer Usage |
| | ------------- |
| |
|
| | The entry point to start traning with Kornia is through the :py:class:`~kornia.x.Trainer` class. |
| |
|
| | The main API is a self contained module that heavily relies on `accelerate <https://github.com/huggingface/accelerate/>`_ |
| | to easily scale the training over multi-GPUs/TPU/fp16 `(see more) <https://github.com/huggingface/accelerate#supported-integrations/>`_ |
| | by following standard PyTorch recipes. Our API expects to consume standard PyTorch components and you decide if `kornia` makes the magic |
| | for you. |
| |
|
| | 1. Define your model |
| |
|
| | .. code:: python |
| |
|
| | model = nn.Sequential( |
| | kornia.contrib.VisionTransformer(image_size=32, patch_size=16), |
| | kornia.contrib.ClassificationHead(num_classes=10), |
| | ) |
| |
|
| | 2. Create the datasets and dataloaders for training and validation |
| |
|
| | .. code:: python |
| |
|
| | # datasets |
| | train_dataset = torchvision.datasets.CIFAR10( |
| | root=config.data_path, train=True, download=True, transform=T.ToTensor()) |
| |
|
| | valid_dataset = torchvision.datasets.CIFAR10( |
| | root=config.data_path, train=False, download=True, transform=T.ToTensor()) |
| |
|
| | # dataloaders |
| | train_dataloader = torch.utils.data.DataLoader( |
| | train_dataset, batch_size=config.batch_size, shuffle=True) |
| |
|
| | valid_daloader = torch.utils.data.DataLoader( |
| | valid_dataset, batch_size=config.batch_size, shuffle=True) |
| |
|
| | 3. Create your loss function, optimizer and scheduler |
| |
|
| | .. code:: python |
| |
|
| | # loss function |
| | criterion = nn.CrossEntropyLoss() |
| |
|
| | # optimizer and scheduler |
| | optimizer = torch.optim.AdamW(model.parameters(), lr=config.lr) |
| | scheduler = torch.optim.lr_scheduler.CosineAnnealingLR( |
| | optimizer, config.num_epochs * len(train_dataloader) |
| | ) |
| |
|
| | 4. Create the Trainer and execute the training pipeline |
| |
|
| | .. code:: python |
| |
|
| | trainer = kornia.train.Trainer( |
| | model, train_dataloader, valid_daloader, criterion, optimizer, scheduler, config, |
| | ) |
| | trainer.fit() # execute your training ! |
| |
|
| |
|
| | Customize [callbacks] |
| | --------------------- |
| | |
| | At this point you might think - *Is this API generic enough ?* |
| | |
| | Of course not ! What is next ? Let's have fun and **customize**. |
| | |
| | The :py:class:`~kornia.x.Trainer` internals are clearly defined such in a way so that e.g you can |
| | subclass and just override the :py:func:`~kornia.x.Trainer.evaluate` method and adjust |
| | according to your needs. We provide predefined classes for generic problems such as |
| | :py:class:`~kornia.x.ImageClassifierTrainer`, :py:class:`~kornia.x.SemanticSegmentationTrainer`. |
| | |
| | .. note:: |
| | More trainers will come as soon as we include more models. |
| | |
| | You can easily customize by creating your own class, or even through ``callbacks`` as follows: |
| | |
| | .. code:: python |
| | |
| | @torch.no_grad() |
| | def my_evaluate(self) -> dict: |
| | self.model.eval() |
| | for sample_id, sample in enumerate(self.valid_dataloader): |
| | source, target = sample # this might change with new pytorch ataset structure |
| | |
| | # perform the preprocess and augmentations in batch |
| | img = self.preprocess(source) |
| | # Forward |
| | out = self.model(img) |
| | # Loss computation |
| | val_loss = self.criterion(out, target) |
| | |
| | # measure accuracy and record loss |
| | acc1, acc5 = accuracy(out.detach(), target, topk=(1, 5)) |
| | |
| | # create the trainer and pass the evaluate method as follows |
| | trainer = K.train.Trainer(..., callbacks={"evaluate", my_evaluate}) |
| | |
| | **Still not convinced ?** |
| | |
| | You can even override the whole :py:func:`~kornia.x.ImageClassifierTrainer.fit()` |
| | method and implement your custom for loops and the trainer will setup for you using the Accelerator all |
| | the data to the device and the rest of the story is just PyTorch :) |
| | |
| | .. code:: python |
| | |
| | def my_fit(self, ): # this is a custom pytorch training loop |
| | self.model.train() |
| | for epoch in range(self.num_epochs): |
| | for source, targets in self.train_dataloader: |
| | self.optimizer.zero_grad() |
| | |
| | output = self.model(source) |
| | loss = self.criterion(output, targets) |
| | |
| | self.backward(loss) |
| | self.optimizer.step() |
| | |
| | stats = self.evaluate() # do whatever you want with validation |
| | |
| | # create the trainer and pass the evaluate method as follows |
| | trainer = K.train.Trainer(..., callbacks={"fit", my_fit}) |
| | |
| | .. note:: |
| | The following hooks are available to override: ``preprocess``, ``augmentations``, ``evaluate``, ``fit``, |
| | ``on_checkpoint``, ``on_epoch_end``, ``on_before_model`` |
| | |
| | |
| | Preprocess and augmentations |
| | ---------------------------- |
| |
|
| | Taking a pre-trained model from an external source and assume that fine-tuning with your |
| | data by just changing few things in your model is usually a bad assumption in practice. |
| |
|
| | Fine-tuning a model need a lot tricks which usually means designing a good augmentation |
| | or preprocess strategy before you execute the training pipeline. For this reason, we enable |
| | through callbacks to pass pointers to the ``proprocess`` and ``augmentation`` functions to make easy |
| | the debugging and experimentation experience. |
| |
|
| | .. code:: python |
| |
|
| | def preprocess(x): |
| | return x.float() / 255. |
| |
|
| | augmentations = nn.Sequential( |
| | K.augmentation.RandomHorizontalFlip(p=0.75), |
| | K.augmentation.RandomVerticalFlip(p=0.75), |
| | K.augmentation.RandomAffine(degrees=10.), |
| | K.augmentation.PatchSequential( |
| | K.augmentation.ColorJitter(0.1, 0.1, 0.1, 0.1, p=0.8), |
| | grid_size=(2, 2), # cifar-10 is 32x32 and vit is patch 16 |
| | patchwise_apply=False, |
| | ), |
| | ) |
| |
|
| | # create the trainer and pass the augmentation or preprocess |
| | trainer = K.train.ImageClassifierTrainer(..., |
| | callbacks={"preprocess", preprocess, "augmentations": augmentations}) |
| |
|
| | Callbacks utilities |
| | ------------------- |
| |
|
| | We also provide utilities to save checkpoints of the model or early stop the training. You can use |
| | as follows passing as ``callbacks`` the classes :py:class:`~kornia.x.ModelCheckpoint` and |
| | :py:class:`~kornia.x.EarlyStopping`. |
| |
|
| | .. code:: python |
| |
|
| | model_checkpoint = ModelCheckpoint( |
| | filepath="./outputs", monitor="top5", |
| | ) |
| |
|
| | early_stop = EarlyStopping(monitor="top5") |
| |
|
| | trainer = K.train.ImageClassifierTrainer(..., |
| | callbacks={"on_checkpoint", model_checkpoint, "on_epoch_end": early_stop}) |
| |
|
| | Hyperparameter sweeps |
| | --------------------- |
| |
|
| | Use `hydra <https://hydra.cc>`_ to implement an easy search strategy for your hyper-parameters as follows: |
| |
|
| | .. note:: |
| |
|
| | Checkout the toy example in `here <https://github.com/kornia/kornia/tree/master/examples/train/image_classifier>`__ |
| |
|
| | .. code:: python |
| |
|
| | python ./train/image_classifier/main.py num_epochs=50 batch_size=32 |
| |
|
| | .. code:: python |
| |
|
| | python ./train/image_classifier/main.py --multirun lr=1e-3,1e-4 |
| |
|
| | Distributed Training |
| | -------------------- |
| |
|
| | Kornia :py:class:`~kornia.x.Trainer` heavily relies on `accelerate <https://github.com/huggingface/accelerate/>`_ to |
| | decouple the process of running your training scripts in a distributed environment. |
| |
|
| | .. note:: |
| |
|
| | We haven't tested yet all the possibilities for distributed training. |
| | Expect some adventures or `join us <https://join.slack.com/t/kornia/shared_invite/zt-csobk21g-CnydWe5fmvkcktIeRFGCEQ>`_ and help to iterate :) |
| |
|
| | The below recipes are taken from the `accelerate` library in `here <https://github.com/huggingface/accelerate/tree/main/examples#simple-vision-example>`__: |
| |
|
| | - single CPU: |
| |
|
| | * from a server without GPU |
| |
|
| | .. code:: bash |
| |
|
| | python ./train/image_classifier/main.py |
| |
|
| | * from any server by passing `cpu=True` to the `Accelerator`. |
| |
|
| | .. code:: bash |
| |
|
| | python ./train/image_classifier/main.py --data_path path_to_data --cpu |
| |
|
| | * from any server with Accelerate launcher |
| |
|
| | .. code:: bash |
| |
|
| | accelerate launch --cpu ./train/image_classifier/main.py --data_path path_to_data |
| |
|
| | - single GPU: |
| |
|
| | .. code:: bash |
| |
|
| | python ./train/image_classifier/main.py # from a server with a GPU |
| |
|
| | - with fp16 (mixed-precision) |
| |
|
| | * from any server by passing `fp16=True` to the `Accelerator`. |
| |
|
| | .. code:: bash |
| |
|
| | python ./train/image_classifier/main.py --data_path path_to_data --fp16 |
| |
|
| | * from any server with Accelerate launcher |
| |
|
| | .. code:: bash |
| |
|
| | accelerate launch --fp16 ./train/image_classifier/main.py --data_path path_to_data |
| |
|
| | - multi GPUs (using PyTorch distributed mode) |
| |
|
| | * With Accelerate config and launcher |
| |
|
| | .. code:: bash |
| |
|
| | accelerate config # This will create a config file on your server |
| | accelerate launch ./train/image_classifier/main.py --data_path path_to_data # This will run the script on your server |
| |
|
| | * With traditional PyTorch launcher |
| |
|
| | .. code:: bash |
| |
|
| | python -m torch.distributed.launch --nproc_per_node 2 --use_env ./train/image_classifier/main.py --data_path path_to_data |
| |
|
| | - multi GPUs, multi node (several machines, using PyTorch distributed mode) |
| |
|
| | * With Accelerate config and launcher, on each machine: |
| |
|
| | .. code:: bash |
| |
|
| | accelerate config # This will create a config file on each server |
| | accelerate launch ./train/image_classifier/main.py --data_path path_to_data # This will run the script on each server |
| |
|
| | * With PyTorch launcher only |
| |
|
| | .. code:: bash |
| |
|
| | python -m torch.distributed.launch --nproc_per_node 2 \ |
| | --use_env \ |
| | --node_rank 0 \ |
| | --master_addr master_node_ip_address \ |
| | ./train/image_classifier/main.py --data_path path_to_data # On the first server |
| |
|
| | python -m torch.distributed.launch --nproc_per_node 2 \ |
| | --use_env \ |
| | --node_rank 1 \ |
| | --master_addr master_node_ip_address \ |
| | ./train/image_classifier/main.py --data_path path_to_data # On the second server |
| |
|
| | - (multi) TPUs |
| |
|
| | * With Accelerate config and launcher |
| |
|
| | .. code:: bash |
| |
|
| | accelerate config # This will create a config file on your TPU server |
| | accelerate launch ./train/image_classifier/main.py --data_path path_to_data # This will run the script on each server |
| |
|
| | * In PyTorch: |
| | Add an `xmp.spawn` line in your script as you usually do. |
| |
|