Building a simple semantic segmentation training image

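Following the introduction to building YMIR images, the image loads the dataset, hyperparameters, task information, and pretrained weights from the /in directory, and writes model weights, a progress file, and training logs to the /out directory.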

Example image input and output

.
├── in
│   ├── annotations
│   │   └── coco-annotations.json
│   ├── assets -> /home/ymir/ymir/ymir-workplace/sandbox/0001/asset_cache
│   ├── config.yaml
│   ├── env.yaml
│   ├── models
│   │   ├── best_mIoU_iter_180.pth
│   │   └── fast_scnn_lr0.12_8x4_160k_cityscapes.py
│   ├── train-index.tsv
│   └── val-index.tsv
├── out
│   ├── models
│   │   ├── 20221103_082913.log
│   │   ├── 20221103_082913.log.json
│   │   ├── fast_scnn_lr0.12_8x4_160k_cityscapes.py
│   │   ├── iter_10000.pth
│   │   ├── iter_12000.pth
│   │   ├── iter_14000.pth
│   │   ├── iter_16000.pth
│   │   ├── iter_18000.pth
│   │   ├── iter_20000.pth
│   │   ├── latest.pth -> iter_20000.pth
│   │   └── result.yaml
│   ├── monitor.txt
│   ├── tensorboard -> /home/ymir/ymir/ymir-workplace/ymir-tensorboard-logs/0001/t00000010000043b47591667304420
│   └── ymir-executor-out.log
└── task_config.yaml
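
Before training starts, it can help to confirm that the expected inputs are actually mounted. The following is a minimal sketch, not part of any YMIR SDK: `find_missing_inputs` and `EXPECTED_INPUTS` are hypothetical names, and the list of paths is simply copied from the example tree above.

```python
import os
from typing import List

# Entries copied from the example tree; this list is an assumption made for
# illustration, not an official YMIR specification.
EXPECTED_INPUTS = [
    "in/annotations",
    "in/assets",
    "in/config.yaml",
    "in/env.yaml",
    "in/models",
    "in/train-index.tsv",
    "in/val-index.tsv",
]


def find_missing_inputs(root: str = "/") -> List[str]:
    """Return the expected input entries that do not exist under `root`.

    os.path.exists follows symbolic links, so a dangling `assets` link
    is reported as missing.
    """
    return [p for p in EXPECTED_INPUTS if not os.path.exists(os.path.join(root, p))]
```

Calling `find_missing_inputs()` at the top of the startup script and logging the result is one way to fail early with a readable message instead of crashing later in the data loader.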

Working directory

cd seg-semantic-demo-tmi

Provide the hyperparameter template file

The presence of /img-man/training-template.yaml in the image indicates that the image supports training.

img-man/training-template.yaml

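Set the data format export_format to seg-coco:raw, the annotation format for semantic/instance segmentation. For details, see the YMIR image dataset format documentation.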

# training template for your executor app
# after building the image, it should be at /img-man/training-template.yaml
# keys gpu_id, task_id, pretrained_model_params, class_names, gpu_count should be preserved

# gpu_id: '0'
# gpu_count: 1
# task_id: 'default-training-task'
# pretrained_model_params: []
# class_names: []

# format of annotations and images that ymir should provide to this docker container
#   annotation format: must be seg-coco
#   image format: must be raw
export_format: 'seg-coco:raw'

# for testing only; remove these keys in your own docker image
expected_miou: 0.983  # expected mIoU for training task
idle_seconds: 3  # idle seconds for each task
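
The export_format value combines an annotation format and an image format, separated by a colon. As a minimal sketch of how training code might validate it, the helpers below split the string and check it against the value this template requires. `parse_export_format` and `check_segmentation_format` are hypothetical names, not functions from a YMIR library.

```python
from typing import Tuple


def parse_export_format(export_format: str) -> Tuple[str, str]:
    """Split 'annotation_format:image_format' into its two parts."""
    annotation_format, sep, image_format = export_format.partition(":")
    if not sep or not annotation_format or not image_format:
        raise ValueError(f"unexpected export_format: {export_format!r}")
    return annotation_format, image_format


def check_segmentation_format(export_format: str) -> None:
    """Raise ValueError unless the format is the one this template requires."""
    annotation_format, image_format = parse_export_format(export_format)
    if (annotation_format, image_format) != ("seg-coco", "raw"):
        raise ValueError(f"expected 'seg-coco:raw', got {export_format!r}")
```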

Dockerfile

# create the /img-man directory in the image
RUN mkdir -p /img-man
# copy all yaml files from the host's img-man directory to /img-man in the image
COPY img-man/*.yaml /img-man/

Provide the image manifest file

An object_type of 3 indicates that the image supports semantic segmentation.

img-man/manifest.yaml

# 3 for semantic segmentation
"object_type": 3

The Dockerfile instruction COPY img-man/*.yaml /img-man/ copies manifest.yaml into the image's /img-man directory along with training-template.yaml.
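
To sanity-check a built image, one could inspect the /img-man directory from inside the container. The sketch below is only an illustration: `supports_training` and `read_object_type` are hypothetical helpers, and a regular expression stands in for a real YAML parser, so it only handles a simple top-level object_type line like the one shown above.

```python
import os
import re
from typing import Optional

SEMANTIC_SEGMENTATION = 3  # object_type value for semantic segmentation


def supports_training(img_man_dir: str = "/img-man") -> bool:
    """True if training-template.yaml is present in the given directory."""
    return os.path.isfile(os.path.join(img_man_dir, "training-template.yaml"))


def read_object_type(img_man_dir: str = "/img-man") -> Optional[int]:
    """Return object_type from manifest.yaml, or None if it cannot be found.

    Simplification: matches a single `object_type: <int>` line with an
    optionally quoted key; nested or multi-document YAML is not handled.
    """
    manifest = os.path.join(img_man_dir, "manifest.yaml")
    if not os.path.isfile(manifest):
        return None
    pattern = re.compile(r'^\s*["\']?object_type["\']?\s*:\s*(\d+)\s*(#.*)?$')
    with open(manifest, "r") as fp:
        for line in fp:
            match = pattern.match(line)
            if match:
                return int(match.group(1))
    return None
```

A check such as `read_object_type() == SEMANTIC_SEGMENTATION and supports_training()` then confirms that both files were copied into the image.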

Provide the default startup script

Dockerfile

# generate the startup script /usr/bin/start.sh
RUN echo "python /app/start.py" > /usr/bin/start.sh
# set /usr/bin/start.sh as the image's default startup command
CMD bash /usr/bin/start.sh

Implement the basic functionality

app/start.py

Sample training function, which shows:

- how to get config file
- how to read training and validation datasets
- how to write logs
- how to write training result

Source code in seg-semantic-demo-tmi/app/start.py

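import logging
import os
import random
from typing import List

from easydict import EasyDict as edict
# monitor and rw are assumed to come from the ymir_exc package (ymir-executor-sdk)
from ymir_exc import monitor
from ymir_exc import result_writer as rw


def _run_training(cfg: edict) -> None: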
    """sample function of training

    which shows:
    - how to get config file
    - how to read training and validation datasets
    - how to write logs
    - how to write training result
    """
    # use `env.get_executor_config` to get config file for training
    gpu_id: str = cfg.param.get('gpu_id')
    class_names: List[str] = cfg.param.get('class_names')
    expected_miou: float = cfg.param.get('expected_miou', 0.6)
    idle_seconds: float = cfg.param.get('idle_seconds', 60)
    trigger_crash: bool = cfg.param.get('trigger_crash', False)
    # use `logging` or `print` to write log to console
    #   notice that logging.basicConfig is invoked at executor.env
    logging.info(f'gpu device: {gpu_id}')
    logging.info(f'dataset class names: {class_names}')
    logging.info(f"training config: {cfg.param}")

    # count for image and annotation file
    with open(cfg.ymir.input.training_index_file, 'r') as fp:
        lines = fp.readlines()

    valid_image_count = 0
    valid_ann_count = 0

    N = len(lines)
    monitor_gap = max(1, N // 100)
    for idx, line in enumerate(lines):
        asset_path, annotation_path = line.strip().split()
        if os.path.isfile(asset_path):
            valid_image_count += 1

        if os.path.isfile(annotation_path):
            valid_ann_count += 1

        # use `monitor.write_monitor_logger` to write the task progress percentage to monitor.txt
        if idx % monitor_gap == 0:
            monitor.write_monitor_logger(percent=0.2 * idx / N)

    logging.info(f'total image-ann pair: {N}')
    logging.info(f'valid images: {valid_image_count}')
    logging.info(f'valid annotations: {valid_ann_count}')

    # use `monitor.write_monitor_logger` to write the task progress percentage to monitor.txt
    monitor.write_monitor_logger(percent=0.2)

    # suppose we have a long time training, and have saved the final model
    # model output dir: os.path.join(cfg.ymir.output.models_dir, your_stage_name)
    stage_dir = os.path.join(cfg.ymir.output.models_dir, 'epoch10')
    os.makedirs(stage_dir, exist_ok=True)
    with open(os.path.join(stage_dir, 'epoch10.pt'), 'w') as f:
        f.write('fake model weight')
    with open(os.path.join(stage_dir, 'config.py'), 'w') as f:
        f.write('fake model config file')
    # use `rw.write_model_stage` to save training result
    rw.write_model_stage(stage_name='epoch10',
                         files=['epoch10.pt', 'config.py'],
                         evaluation_result=dict(mIoU=random.random() / 2))

    _dummy_work(idle_seconds=idle_seconds, trigger_crash=trigger_crash)

    write_tensorboard_log(cfg.ymir.output.tensorboard_dir)

    stage_dir = os.path.join(cfg.ymir.output.models_dir, 'epoch20')
    os.makedirs(stage_dir, exist_ok=True)
    with open(os.path.join(stage_dir, 'epoch20.pt'), 'w') as f:
        f.write('fake model weight')
    with open(os.path.join(stage_dir, 'config.py'), 'w') as f:
        f.write('fake model config file')
    rw.write_model_stage(stage_name='epoch20',
                         files=['epoch20.pt', 'config.py'],
                         evaluation_result=dict(mIoU=expected_miou))

    # if task done, write 100% percent log
    logging.info('training done')
    monitor.write_monitor_logger(percent=1.0)
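
The index-reading loop in the function above unpacks each line directly, which raises an error on a blank or malformed line. As a hedged sketch of a slightly more defensive variant, the helpers below parse an index file into pairs and count the paths that exist. `read_index_file` and `count_existing` are hypothetical names, and the two-column, whitespace-separated layout is assumed from the sample code rather than from a format specification.

```python
import os
from typing import List, Tuple


def read_index_file(index_file: str) -> List[Tuple[str, str]]:
    """Parse an index file into (asset_path, annotation_path) pairs.

    Blank lines are skipped; any other line that does not have exactly
    two whitespace-separated columns raises ValueError.
    """
    pairs = []
    with open(index_file, "r") as fp:
        for line_no, line in enumerate(fp, start=1):
            fields = line.split()
            if not fields:
                continue
            if len(fields) != 2:
                raise ValueError(
                    f"{index_file}:{line_no}: expected 2 columns, got {len(fields)}")
            pairs.append((fields[0], fields[1]))
    return pairs


def count_existing(pairs: List[Tuple[str, str]]) -> Tuple[int, int]:
    """Count how many asset paths and annotation paths point at real files."""
    images = sum(os.path.isfile(asset) for asset, _ in pairs)
    annotations = sum(os.path.isfile(ann) for _, ann in pairs)
    return images, annotations
```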

Write progress

if idx % monitor_gap == 0:
    monitor.write_monitor_logger(percent=0.2 * idx / N)

monitor.write_monitor_logger(percent=0.2)

monitor.write_monitor_logger(percent=1.0)
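
The sample reserves the first 20% of the progress bar for checking the dataset and jumps to 100% when training finishes. A real training loop would usually report intermediate values as well. The helper below is a minimal sketch of one way to map progress within a phase onto the overall range; `overall_percent` is a hypothetical name, and the 0.2 split is taken from the sample rather than being a YMIR requirement.

```python
def overall_percent(phase_start: float, phase_end: float, done: int, total: int) -> float:
    """Map progress inside one phase onto the overall [0, 1] task progress.

    `phase_start` and `phase_end` bound the share of the whole task that the
    phase occupies; `done / total` is the progress within that phase.
    """
    if total <= 0:
        # nothing to do in this phase, so treat it as finished
        return phase_end
    fraction = min(max(done / total, 0.0), 1.0)
    return phase_start + (phase_end - phase_start) * fraction
```

With this helper, `overall_percent(0.0, 0.2, idx, N)` reproduces the dataset-check values above, and a call such as `overall_percent(0.2, 1.0, iteration, max_iterations)` could feed `monitor.write_monitor_logger` during training.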

Write the result file

rw.write_model_stage(stage_name='epoch20',
                     files=['epoch20.pt', 'config.py'],
                     evaluation_result=dict(mIoU=expected_miou))
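
In the sample, the files named in `files` are first written into a sub-directory of the models directory named after the stage. The following sketch isolates that step so the layout is easier to see. `save_stage_files` is a hypothetical helper, and it uses text content only because the sample writes fake weights; real weights would be saved by the training framework.

```python
import os
from typing import Dict, List


def save_stage_files(models_dir: str, stage_name: str, contents: Dict[str, str]) -> List[str]:
    """Write each file into models_dir/stage_name and return the file names."""
    stage_dir = os.path.join(models_dir, stage_name)
    os.makedirs(stage_dir, exist_ok=True)
    for file_name, text in contents.items():
        with open(os.path.join(stage_dir, file_name), "w") as fp:
            fp.write(text)
    # file names relative to the stage directory, in insertion order
    return list(contents)
```

The returned list could then be passed as the `files` argument of `rw.write_model_stage`, together with the same `stage_name`, as in the sample above.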

Write TensorBoard logs

write_tensorboard_log(cfg.ymir.output.tensorboard_dir)

Build the image demo/semantic_seg:training

# a Dockerfile for a sample training / mining / infer executor

# FROM ubuntu:20.04
FROM python:3.8.16

ENV LANG=C.UTF-8

# Change mirror
RUN sed -i 's#http://archive.ubuntu.com#http://mirrors.ustc.edu.cn#g' /etc/apt/sources.list \
    && sed -i 's#http://security.ubuntu.com#http://mirrors.ustc.edu.cn#g' /etc/apt/sources.list

# Set timezone
RUN ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime \
    && echo 'Asia/Shanghai' >/etc/timezone

# Install linux package
RUN apt-get update && apt-get install -y gnupg2 git libglib2.0-0 \
    libgl1-mesa-glx libsm6 libxext6 libxrender-dev \
    build-essential ninja-build \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt /app/
RUN pip3 install -r /app/requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

WORKDIR /app
# copy user code to WORKDIR
COPY ./app/*.py /app/

# copy user config template and manifest.yaml to /img-man
RUN mkdir -p /img-man
COPY img-man/*.yaml /img-man/

# view https://github.com/protocolbuffers/protobuf/issues/10051 for detail
ENV PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python

# entry point for your app
# the whole docker image will be started with `nvidia-docker run <other options> <docker-image-name>`
# and this command will run automatically

RUN echo "python /app/start.py" > /usr/bin/start.sh
CMD bash /usr/bin/start.sh

docker build -t demo/semantic_seg:training -f Dockerfile .