Building a simple object detection training image

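Following the introduction to building YMIR images, a training image loads the dataset, hyperparameters, task information, and pretrained weights from the /in directory, and writes the model weights, progress file, and training log to the /out directory.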

Example of image input and output

.
├── in
│   ├── annotations [257 entries exceeds filelimit, not opening dir]
│   ├── assets -> /home/ymir/ymir/ymir-workplace/sandbox/0001/training_asset_cache
│   ├── config.yaml
│   ├── env.yaml
│   ├── models
│   ├── train-index.tsv
│   └── val-index.tsv
├── out
│   ├── models [29 entries exceeds filelimit, not opening dir]
│   ├── monitor.txt
│   ├── tensorboard -> /home/ymir/ymir/ymir-workplace/ymir-tensorboard-logs/0001/t00000010000028774b61663839849
│   └── ymir-executor-out.log
└── task_config.yaml
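
The two index files, train-index.tsv and val-index.tsv, pair each image with its annotation file, one pair per line; the training code in app/start.py splits each line into an asset path and an annotation path. A minimal sketch of reading such a file follows. It assumes tab-separated columns (suggested by the .tsv extension, but still an assumption), and the helper name read_index and the example paths are hypothetical:

```python
import os
import tempfile
from typing import List, Tuple


def read_index(index_file: str) -> List[Tuple[str, str]]:
    """Return (asset_path, annotation_path) pairs from an index file.

    Assumes one tab-separated pair per line; blank lines are skipped.
    """
    pairs = []
    with open(index_file, 'r') as fp:
        for line in fp:
            line = line.strip()
            if not line:
                continue
            asset_path, annotation_path = line.split('\t')
            pairs.append((asset_path, annotation_path))
    return pairs


if __name__ == '__main__':
    # Build a tiny index file to show the expected shape of the data.
    with tempfile.TemporaryDirectory() as tmp:
        index_file = os.path.join(tmp, 'train-index.tsv')
        with open(index_file, 'w') as fp:
            fp.write('/in/assets/a.jpg\t/in/annotations/a.txt\n')
        print(read_index(index_file))
```

Inside a real task the path of the training index file comes from cfg.ymir.input.training_index_file rather than a temporary directory.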

Working directory

cd det-demo-tmi

Provide the hyperparameter template file

The presence of /img-man/training-template.yaml in the image indicates that the image supports training.

img-man/training-template.yaml

Set the data format export_format to det-ark:raw, the object detection annotation format. For details, see the YMIR image dataset format.

# training template for your executor app
# after the image is built, it should be at /img-man/training-template.yaml
# keys: gpu_id, task_id, pretrained_model_params, class_names should be preserved

# gpu_id: '0'
# task_id: 'default-training-task'
# pretrained_model_params: []
# class_names: []
export_format: 'det-ark:raw'

# just for testing, remove these keys in your own docker image
expected_map: 0.983  # expected map for training task
idle_seconds: 60  # idle seconds for each task

Dockerfile

# create the /img-man directory in the image
RUN mkdir -p /img-man
# copy all yaml files from the host's img-man directory to /img-man in the image
COPY img-man/*.yaml /img-man/

Provide the image manifest file

An object_type of 2 indicates that the image supports object detection.

img-man/manifest.yaml

# 2 for object detection
"object_type": 2

The Dockerfile instruction COPY img-man/*.yaml /img-man/ copies manifest.yaml into the image's /img-man directory along with training-template.yaml.

Provide the default startup script

Dockerfile

# generate the startup script /usr/bin/start.sh
RUN echo "python /app/start.py" > /usr/bin/start.sh
# set /usr/bin/start.sh as the image's default startup command
CMD bash /usr/bin/start.sh

Implement the basic functionality

app/start.py

sample function of training, which shows:
1. how to get config file
2. how to read training and validation datasets
3. how to write logs
4. how to write training result

Source code in det-demo-tmi/app/start.py

def _run_training(cfg: edict) -> None:
    """
    sample function of training, which shows:
    1. how to get config file
    2. how to read training and validation datasets
    3. how to write logs
    4. how to write training result
    """
    # hyperparameters for training are read from `cfg.param`
    gpu_id: str = cfg.param.get('gpu_id')
    class_names: List[str] = cfg.param.get('class_names')
    expected_mAP: float = cfg.param.get('expected_map')
    idle_seconds: float = cfg.param.get('idle_seconds')
    trigger_crash: bool = cfg.param.get('trigger_crash')
    # use `logging` or `print` to write log to console
    #   notice that logging.basicConfig is invoked at executor.env
    logging.info(f'gpu device: {gpu_id}')
    logging.info(f'dataset class names: {class_names}')
    logging.info(f"training config: {cfg.param}")

    # count for image and annotation file
    with open(cfg.ymir.input.training_index_file, 'r') as fp:
        lines = fp.readlines()

    valid_image_count = 0
    valid_ann_count = 0

    N = len(lines)
    monitor_gap = max(1, N // 100)
    for idx, line in enumerate(lines):
        asset_path, annotation_path = line.strip().split()
        if os.path.isfile(asset_path):
            valid_image_count += 1

        if os.path.isfile(annotation_path):
            valid_ann_count += 1

        # use `monitor.write_monitor_logger` to write the task progress percent to monitor.txt
        if idx % monitor_gap == 0:
            monitor.write_monitor_logger(percent=0.2 * idx / N)

    logging.info(f'total image-ann pair: {N}')
    logging.info(f'valid images: {valid_image_count}')
    logging.info(f'valid annotations: {valid_ann_count}')

    # use `monitor.write_monitor_logger` to write the task progress percent to monitor.txt
    monitor.write_monitor_logger(percent=0.2)

    # suppose we have a long time training, and have saved the final model
    models_dir = cfg.ymir.output.models_dir
    os.makedirs(models_dir, exist_ok=True)
    with open(os.path.join(models_dir, 'epoch10.pt'), 'w') as f:
        f.write('fake model weight')
    with open(os.path.join(models_dir, 'config.py'), 'w') as f:
        f.write('fake model config file')
    # use `rw.write_model_stage` to save training result
    rw.write_model_stage(stage_name='epoch10',
                         files=['epoch10.pt', 'config.py'],
                         evaluation_result=dict(mAP=random.random() / 2))

    _dummy_work(idle_seconds=idle_seconds, trigger_crash=trigger_crash)

    write_tensorboard_log(cfg.ymir.output.tensorboard_dir)

    with open(os.path.join(models_dir, 'epoch20.pt'), 'w') as f:
        f.write('fake model weight')
    with open(os.path.join(models_dir, 'config.py'), 'w') as f:
        f.write('fake model config file')
    rw.write_model_stage(stage_name='epoch20',
                         files=['epoch20.pt', 'config.py'],
                         evaluation_result=dict(mAP=expected_mAP))

    # when the task is done, report 100% progress
    logging.info('training done')
    monitor.write_monitor_logger(percent=1.0)
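
The function reads its hyperparameters with cfg.param.get, which returns None for any key missing from the template; trigger_crash, for example, does not appear in the training-template.yaml shown earlier. A minimal sketch of reading the same keys with explicit fallbacks follows, using a plain dict in place of cfg.param. The helper name read_hyperparams and the default values are assumptions made for illustration and are not part of the YMIR API:

```python
from typing import Any, Dict, Mapping

# Hypothetical fallbacks for illustration; choose values that suit your image.
DEFAULTS: Dict[str, Any] = {
    'gpu_id': '0',
    'class_names': [],
    'expected_map': 0.6,
    'idle_seconds': 60,
    'trigger_crash': False,
}


def read_hyperparams(param: Mapping[str, Any]) -> Dict[str, Any]:
    """Return the known hyperparameters, filling missing keys from DEFAULTS."""
    return {key: param.get(key, default) for key, default in DEFAULTS.items()}


if __name__ == '__main__':
    # Only two keys are supplied; the rest fall back to DEFAULTS.
    print(read_hyperparams({'expected_map': 0.983, 'idle_seconds': 60}))
```

Explicit defaults keep a missing key from surfacing later as an unexpected None, for instance when idle_seconds is passed to a sleep call.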

Writing progress

if idx % monitor_gap == 0:
    monitor.write_monitor_logger(percent=0.2 * idx / N)

monitor.write_monitor_logger(percent=0.2)

monitor.write_monitor_logger(percent=1.0)
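
The three calls above split the progress into phases: scanning the index file moves the value from 0 towards 0.2, and the finished task reports 1.0. A minimal sketch of the values reported during the scan follows. The function scan_percents is hypothetical and collects the values in a list instead of calling monitor.write_monitor_logger, so it runs without the YMIR SDK:

```python
from typing import List


def scan_percents(n: int, phase_end: float = 0.2) -> List[float]:
    """Return the percents reported while scanning n index lines.

    Mirrors the loop in _run_training: report roughly every 1% of the
    lines, scaled so that the scan phase ends at phase_end.
    """
    monitor_gap = max(1, n // 100)
    percents = [phase_end * idx / n for idx in range(n) if idx % monitor_gap == 0]
    percents.append(phase_end)  # the explicit write after the loop
    return percents


if __name__ == '__main__':
    print(scan_percents(4))
```

With monitor_gap set to about 1% of the lines, a large dataset produces on the order of a hundred writes rather than one write per image.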

Writing result files

# use `rw.write_model_stage` to save training result
rw.write_model_stage(stage_name='epoch10',
                     files=['epoch10.pt', 'config.py'],
                     evaluation_result=dict(mAP=random.random() / 2))

rw.write_model_stage(stage_name='epoch20',
                     files=['epoch20.pt', 'config.py'],
                     evaluation_result=dict(mAP=expected_mAP))
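
In both calls, files lists names that the sample code created inside the models directory before registering the stage. A minimal sketch of that first step follows, using a hypothetical helper save_placeholder_stage that writes the placeholder files and returns the names to pass as files. The call to rw.write_model_stage itself is left out so the sketch runs without the YMIR SDK:

```python
import os
import tempfile
from typing import List


def save_placeholder_stage(models_dir: str, stage_name: str) -> List[str]:
    """Write a fake weight file and config file; return their names."""
    os.makedirs(models_dir, exist_ok=True)
    weight_name = f'{stage_name}.pt'
    config_name = 'config.py'
    with open(os.path.join(models_dir, weight_name), 'w') as f:
        f.write('fake model weight')
    with open(os.path.join(models_dir, config_name), 'w') as f:
        f.write('fake model config file')
    return [weight_name, config_name]


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        files = save_placeholder_stage(os.path.join(tmp, 'models'), 'epoch10')
        print(files)
```

In a real image the returned list would be passed as the files argument, together with the stage name and an evaluation result, as in the two calls above.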

Writing TensorBoard logs

write_tensorboard_log(cfg.ymir.output.tensorboard_dir)

Build the image demo/det:training

# a Dockerfile for a sample training / mining / infer executor

FROM python:3.8.13-alpine

RUN sed -i 's/dl-cdn.alpinelinux.org/mirrors.tuna.tsinghua.edu.cn/g' /etc/apk/repositories
# Add bash
RUN apk add bash
# Required to build numpy wheel
RUN apk add g++ git make

COPY requirements.txt ./
RUN pip3 install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

WORKDIR /app
# copy user code to WORKDIR
COPY ./app/start.py /app/

# copy user config template and manifest.yaml to /img-man
RUN mkdir -p /img-man
COPY img-man/*.yaml /img-man/

# view https://github.com/protocolbuffers/protobuf/issues/10051 for detail
ENV PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python

# entry point for your app
# the whole docker image will be started with `nvidia-docker run <other options> <docker-image-name>`
# and this command will run automatically

RUN echo "python /app/start.py" > /usr/bin/start.sh
CMD bash /usr/bin/start.sh

docker build -t demo/det:training -f Dockerfile .