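Build a simple object detection training image#
Following the YMIR image-building introduction, the image loads the dataset, hyperparameters, task information, and pretrained weights from the /in directory, and writes model weights, a progress file, and training logs to the /out directory.
Example image input and output#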
.
├── in
│ ├── annotations [257 entries exceeds filelimit, not opening dir]
│ ├── assets -> /home/ymir/ymir/ymir-workplace/sandbox/0001/training_asset_cache
│ ├── config.yaml
│ ├── env.yaml
│ ├── models
│ ├── train-index.tsv
│ └── val-index.tsv
├── out
│ ├── models [29 entries exceeds filelimit, not opening dir]
│ ├── monitor.txt
│ ├── tensorboard -> /home/ymir/ymir/ymir-workplace/ymir-tensorboard-logs/0001/t00000010000028774b61663839849
│ └── ymir-executor-out.log
└── task_config.yaml
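To make the layout above more concrete, here is a minimal sketch of how a script could iterate over /in/train-index.tsv without any SDK helpers. It assumes each line holds an asset path and an annotation path separated by a tab; that column layout and the helper name `read_index` are assumptions for illustration, not something this page specifies.

```python
import csv
from pathlib import Path
from typing import Iterator, Tuple


def read_index(index_file: str) -> Iterator[Tuple[str, str]]:
    """Yield (asset_path, annotation_path) pairs from a tab-separated index file.

    Assumption: one sample per line, with at least two tab-separated columns.
    """
    with Path(index_file).open(newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            yield row[0], row[1]


if __name__ == "__main__":
    index = Path("/in/train-index.tsv")
    if index.exists():  # only present inside a running training container
        for asset, annotation in read_index(str(index)):
            print(asset, annotation)
```

The same function can be pointed at /in/val-index.tsv to walk the validation split.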
Working directory#
cd det-demo-tmi
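Provide the hyperparameter template file#
The image contains /img-man/training-template.yaml, which indicates that the image supports training.
Set the data format export_format to det-ark:raw, the object detection annotation format. For details, see the YMIR image dataset format documentation.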
# training template for your executor app
# after building the image, it should be at /img-man/training-template.yaml
# key: gpu_id, task_id, pretrained_model_params, class_names should be preserved
# gpu_id: '0'
# task_id: 'default-training-task'
# pretrained_model_params: []
# class_names: []
export_format: 'det-ark:raw'
# just for test, remove this key in your own docker image
expected_map: 0.983 # expected map for training task
idle_seconds: 60 # idle seconds for each task
# create the /img-man directory in the image
RUN mkdir -p /img-man
# copy every yaml file from the host's img-man directory into /img-man in the image
COPY img-man/*.yaml /img-man/
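As a rough illustration of why the four commented-out keys are left to the platform, the sketch below overlays values supplied at run time onto the template's own hyperparameters. The function name `merge_config`, the rule that run-time values win, and the sample class name are all assumptions made for this example; they are not taken from the YMIR source.

```python
from typing import Any, Dict

# Keys that the template comments list as reserved for the platform.
RESERVED_KEYS = ("gpu_id", "task_id", "pretrained_model_params", "class_names")


def merge_config(template: Dict[str, Any], runtime: Dict[str, Any]) -> Dict[str, Any]:
    """Return the template values overlaid with run-time values.

    Assumption: run-time values take precedence, and every reserved key
    must be present in the merged result.
    """
    merged = {**template, **runtime}
    missing = [key for key in RESERVED_KEYS if key not in merged]
    if missing:
        raise KeyError(f"missing reserved keys: {missing}")
    return merged


if __name__ == "__main__":
    template = {"export_format": "det-ark:raw", "expected_map": 0.983, "idle_seconds": 60}
    # Hypothetical run-time values, for illustration only.
    runtime = {"gpu_id": "0", "task_id": "default-training-task",
               "pretrained_model_params": [], "class_names": ["dog"]}
    print(merge_config(template, runtime))
```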
Provide the image manifest file#
An object_type of 2 indicates that the image supports object detection.
# 2 for object detection
"object_type": 2
- Dockerfile
COPY img-man/*.yaml /img-man/ copies manifest.yaml into the image's /img-man directory along with training-template.yaml.
Provide the default startup script#
- Dockerfile
# generate the startup script /usr/bin/start.sh
RUN echo "python /app/start.py" > /usr/bin/start.sh
# make /usr/bin/start.sh the image's default startup command
CMD bash /usr/bin/start.sh
Implement the basic functionality#
A sample training function, which shows:
1. how to get the config file
2. how to read the training and validation datasets
3. how to write logs
4. how to write the training result
Source code in det-demo-tmi/app/start.py
Write progress#
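# while reading the dataset, report progress every monitor_gap items
if idx % monitor_gap == 0:
    monitor.write_monitor_logger(percent=0.2 * idx / N)
# once the dataset has been read
monitor.write_monitor_logger(percent=0.2)
# when the task is done
monitor.write_monitor_logger(percent=1.0)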
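The snippets above give dataset reading the first 20% of the progress bar and leave the rest for training. A minimal sketch of that mapping, assuming a hypothetical helper `scaled_percent` that is not part of the demo code, could look like this:

```python
def scaled_percent(start: float, end: float, done: int, total: int) -> float:
    """Map `done` out of `total` steps onto the [start, end] slice of overall progress."""
    if total <= 0:
        return end  # nothing to do, so the slice is already complete
    fraction = min(max(done / total, 0.0), 1.0)  # clamp to [0, 1]
    return start + (end - start) * fraction


if __name__ == "__main__":
    N, monitor_gap = 100, 25  # illustrative values
    for idx in range(N):
        if idx % monitor_gap == 0:
            # same value as 0.2 * idx / N in the snippet above
            print(f"reading dataset: {scaled_percent(0.0, 0.2, idx, N):.2f}")
    print(f"training finished: {scaled_percent(0.2, 1.0, 10, 10):.2f}")
```

The returned value is what would be passed as `percent` to `monitor.write_monitor_logger`.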
Write result files#
# use `rw.write_model_stage` to save training result
rw.write_model_stage(stage_name='epoch10',
                     files=['epoch10.pt', 'config.py'],
                     evaluation_result=dict(mAP=random.random() / 2))
rw.write_model_stage(stage_name='epoch20',
                     files=['epoch20.pt', 'config.py'],
                     evaluation_result=dict(mAP=expected_mAP))
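Each call above records a stage name, its files, and an evaluation result. As a local bookkeeping sketch, a training script could mirror those three fields and pick the stage with the highest mAP. The `Stage` type, the `best_stage` helper, and the sample mAP values are hypothetical and independent of how YMIR itself ranks stages.

```python
from typing import Dict, List, NamedTuple


class Stage(NamedTuple):
    """Mirror of the three arguments passed to write_model_stage above."""
    stage_name: str
    files: List[str]
    evaluation_result: Dict[str, float]


def best_stage(stages: List[Stage], metric: str = "mAP") -> Stage:
    """Return the stage with the highest value for `metric`."""
    if not stages:
        raise ValueError("no stages recorded")
    # Stages that lack the metric sort below every stage that has it.
    return max(stages, key=lambda s: s.evaluation_result.get(metric, float("-inf")))


if __name__ == "__main__":
    # Illustrative mAP values, not real training results.
    stages = [
        Stage("epoch10", ["epoch10.pt", "config.py"], {"mAP": 0.41}),
        Stage("epoch20", ["epoch20.pt", "config.py"], {"mAP": 0.983}),
    ]
    print(best_stage(stages).stage_name)
```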
Write TensorBoard logs#
write_tensorboard_log(cfg.ymir.output.tensorboard_dir)
Build the image demo/det:training#
# a Dockerfile for a sample training / mining / infer executor
FROM python:3.8.13-alpine
RUN sed -i 's/dl-cdn.alpinelinux.org/mirrors.tuna.tsinghua.edu.cn/g' /etc/apk/repositories
# Add bash
RUN apk add bash
# Required to build numpy wheel
RUN apk add g++ git make
COPY requirements.txt ./
RUN pip3 install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
WORKDIR /app
# copy user code to WORKDIR
COPY ./app/start.py /app/
# copy user config template and manifest.yaml to /img-man
RUN mkdir -p /img-man
COPY img-man/*.yaml /img-man/
# view https://github.com/protocolbuffers/protobuf/issues/10051 for detail
ENV PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
# entry point for your app
# the whole docker image will be started with `nvidia-docker run <other options> <docker-image-name>`
# and this command will run automatically
RUN echo "python /app/start.py" > /usr/bin/start.sh
CMD bash /usr/bin/start.sh
docker build -t demo/det:training -f Dockerfile .