One way to debug DeepSpeed in VS Code (no changes to the script required)
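The current DeepSpeed launch script is:
# RTX 4000 series GPUs do not support faster communication bandwidth via P2P or IB, so the following two environment variables must be set
# Disable NCCL P2P communication to avoid possible compatibility problems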
export NCCL_P2P_DISABLE="1"
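# Disable NCCL IB communication to suit RTX 4000 series GPUs
export NCCL_IB_DISABLE="1"
# Set the Hugging Face mirror endpoint to make downloading models and other resources easier
export HF_ENDPOINT=https://hf-mirror.com
# Run the simple_LLaVA_run.py script with the deepspeed launcher
# --include localhost:0,1 runs the job on local GPUs 0 and 1
# Note: localhost is the local machine; 0 and 1 are the GPU indices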
deepspeed --include localhost:0,1 simple_LLaVA_run.py \
    --deepspeed ds_zero2_no_offload.json \
    --model_name_or_path /home/louis/LK/study/transformers/lk_study/llava_study/my_llava_model/model_01 \
    --train_type use_lora \
    --data_path /home/louis/LK/study/transformers/lk_study/llava_study/train_llava/data \
    --remove_unused_columns false \
    --bf16 true \
    --fp16 false \
    --dataloader_pin_memory True \
    --dataloader_num_workers 10 \
    --dataloader_persistent_workers True \
    --output_dir output_model_user_lora_simple_train \
    --num_train_epochs 10 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "epoch" \
    --save_total_limit 3 \
    --report_to "tensorboard" \
    --learning_rate 4e-4 \
    --logging_steps 10
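Turning this script into a VS Code launch configuration is mostly mechanical: every token of the command becomes one string in an "args" list, and every export becomes a key in an "env" object. The sketch below shows that conversion on a shortened copy of the command. The helper names `shell_command_to_args` and `exports_to_env` are made up for this illustration and are not part of DeepSpeed or VS Code. The `--num_gpus 1` prefix is an assumption that you want a single-GPU debug run instead of `--include localhost:0,1`.

```python
import json
import shlex


def shell_command_to_args(command: str) -> list[str]:
    """Split a shell command into a list with one element per token."""
    # Join backslash-continued lines first, as the shell would.
    joined = command.replace("\\\n", " ")
    return shlex.split(joined)


def exports_to_env(script: str) -> dict[str, str]:
    """Collect `export KEY=VALUE` lines into a dict, skipping comments."""
    env = {}
    for line in script.splitlines():
        tokens = shlex.split(line, comments=True)
        if len(tokens) == 2 and tokens[0] == "export" and "=" in tokens[1]:
            key, value = tokens[1].split("=", 1)
            env[key] = value
    return env


# Shortened copies of the script above, for illustration only.
script = r'''
# Disable NCCL P2P and IB communication
export NCCL_P2P_DISABLE="1"
export NCCL_IB_DISABLE="1"
export HF_ENDPOINT=https://hf-mirror.com
'''
command = r'''simple_LLaVA_run.py \
    --deepspeed ds_zero2_no_offload.json \
    --train_type use_lora \
    --evaluation_strategy "no" \
    --learning_rate 4e-4'''

fragment = {
    # Assumption: debug on one GPU instead of `--include localhost:0,1`.
    "args": ["--num_gpus", "1", *shell_command_to_args(command)],
    "env": exports_to_env(script),
}
print(json.dumps(fragment, indent=4))
```

The printed fragment can be pasted into a launch configuration by hand. Relative paths such as `simple_LLaVA_run.py` are resolved against the debugger's working directory, so absolute paths are the safer choice there.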
To debug the Python program launched by this deepspeed command in VS Code, one approach is:
1) Click the "Run and Debug" button in the sidebar
Then click the "settings" (gear) icon, and the "launch.json" file will open.
2) Add content to launch.json
In "launch.json", add the following entry to the "configurations" array:
{
    "name": "DeepSpeed single-GPU debug",
    "type": "debugpy",
    "request": "launch",
    "program": "/home/louis/anaconda3/envs/unsloth_env_py311_torch240/bin/deepspeed", // replace with the actual path to your deepspeed launcher
    "console": "integratedTerminal",
    "justMyCode": true,
    "args": [
        "--num_gpus", "1",
        "/home/louis/LK/study/transformers/lk_study/llava_study/simple_LLaVA_run.py",
        "--deepspeed", "/home/louis/LK/study/transformers/lk_study/llava_study/ds_zero2_no_offload.json",
        "--model_name_or_path", "/home/louis/LK/study/transformers/lk_study/llava_study/my_llava_model/model_01",
        "--train_type", "use_lora",
        "--data_path", "/home/louis/LK/study/transformers/lk_study/llava_study/train_llava/data",
        "--remove_unused_columns", "false",
        "--bf16", "true",
        "--fp16", "false",
        "--dataloader_pin_memory", "True",
        "--dataloader_num_workers", "10",
        "--dataloader_persistent_workers", "True",
        "--output_dir", "output_model_user_lora_simple_train",
        "--num_train_epochs", "10",
        "--per_device_train_batch_size", "1",
        "--per_device_eval_batch_size", "1",
        "--gradient_accumulation_steps", "8",
        "--evaluation_strategy", "no",
        "--save_strategy", "epoch",
        "--save_total_limit", "3",
        "--report_to", "tensorboard",
        "--learning_rate", "4e-4",
        "--logging_steps", "10"
    ],
    "env": {
        "NCCL_P2P_DISABLE": "1",
        "NCCL_IB_DISABLE": "1",
        "HF_ENDPOINT": "https://hf-mirror.com",
        "CUDA_VISIBLE_DEVICES": "0", // key setting: force single-GPU debugging
        "PYTHONUNBUFFERED": "1", // make sure log output appears immediately
        "CUDA_LAUNCH_BLOCKING": "1" // run CUDA operations synchronously
    }
}
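Save the file.
3) Debug
Click the drop-down arrow in the debug panel, select the DeepSpeed configuration you want to debug, then click the green triangle on its left to start debugging the program.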
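The "program" path in the configuration has to point at the deepspeed launcher of the environment you actually train in. As a minimal sketch, assuming the snippet is run with that environment's Python interpreter, the path could be looked up like this (`find_deepspeed_launcher` is a hypothetical helper name, not a DeepSpeed API):

```python
import shutil
import sys
from pathlib import Path
from typing import Optional


def find_deepspeed_launcher() -> Optional[str]:
    """Return the path of the `deepspeed` launcher, or None if not found."""
    # Look next to the running interpreter first (e.g. <env>/bin/deepspeed),
    # then fall back to a normal PATH search.
    env_bin = Path(sys.executable).parent
    return shutil.which("deepspeed", path=str(env_bin)) or shutil.which("deepspeed")


print(find_deepspeed_launcher())
```

If this prints `None`, deepspeed is probably not installed in the interpreter's environment, and the debug configuration would fail to start for the same reason.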
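Once a breakpoint inside simple_LLaVA_run.py is hit, it can help to confirm what environment the training process really received. The sketch below could be pasted into the Debug Console for that purpose. The first six names are the ones set in the "env" block of the configuration. `RANK`, `LOCAL_RANK` and `WORLD_SIZE` are included on the assumption that the launcher exports them to the training process, and `snapshot_env` is a hypothetical helper name.

```python
import os

# Variables from the launch.json "env" block, plus the distributed-training
# variables that the launcher is assumed to export to the child process.
WATCHED = (
    "NCCL_P2P_DISABLE",
    "NCCL_IB_DISABLE",
    "HF_ENDPOINT",
    "CUDA_VISIBLE_DEVICES",
    "PYTHONUNBUFFERED",
    "CUDA_LAUNCH_BLOCKING",
    "RANK",
    "LOCAL_RANK",
    "WORLD_SIZE",
)


def snapshot_env(environ=os.environ) -> dict:
    """Map each watched variable to its value, or "<unset>" if it is missing."""
    return {name: environ.get(name, "<unset>") for name in WATCHED}


for name, value in snapshot_env().items():
    print(f"{name}={value}")
```

Any variable reported as `<unset>` or with an unexpected value is a hint that the configuration being run is not the one you edited.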