
One way to debug DeepSpeed in VS Code (without modifying the script)

article 2025/8/14 20:25:53

The current DeepSpeed launch script is:

# RTX 4000 series GPUs do not support faster communication bandwidth via P2P or IB, so the following two environment variables must be set
# Disable NCCL P2P communication to avoid possible compatibility problems
export NCCL_P2P_DISABLE="1"
# Disable NCCL IB communication to match what RTX 4000 series GPUs support
export NCCL_IB_DISABLE="1"

# Set the Hugging Face mirror endpoint to make downloading models and other resources easier
export HF_ENDPOINT=https://hf-mirror.com

# Run simple_LLaVA_run.py with the deepspeed launcher
# --include localhost:0,1 runs the job on local GPUs 0 and 1
# Note: localhost is the local machine; 0 and 1 are the GPU indices
deepspeed --include localhost:0,1 simple_LLaVA_run.py \
    --deepspeed ds_zero2_no_offload.json \
    --model_name_or_path /home/louis/LK/study/transformers/lk_study/llava_study/my_llava_model/model_01 \
    --train_type use_lora \
    --data_path /home/louis/LK/study/transformers/lk_study/llava_study/train_llava/data \
    --remove_unused_columns false \
    --bf16 true \
    --fp16 false \
    --dataloader_pin_memory True \
    --dataloader_num_workers 10 \
    --dataloader_persistent_workers True \
    --output_dir output_model_user_lora_simple_train \
    --num_train_epochs 10 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "epoch" \
    --save_total_limit 3 \
    --report_to "tensorboard" \
    --learning_rate 4e-4 \
    --logging_steps 10

To debug the Python program that this deepspeed command runs from VS Code, one approach is:

1) Click the "Run and Debug" button in the sidebar

Then click the "Settings" (gear) icon, and the launch.json file will open.

2) Add a configuration to launch.json

In launch.json, add the following entry to the "configurations" array:

{
    "name": "DeepSpeed single-GPU debug",
    "type": "debugpy",
    "request": "launch",
    "program": "/home/louis/anaconda3/envs/unsloth_env_py311_torch240/bin/deepspeed",  // replace with the actual path of the deepspeed executable in your environment
    "console": "integratedTerminal",
    "justMyCode": true,
    "args": [
        "--num_gpus", "1",
        "/home/louis/LK/study/transformers/lk_study/llava_study/simple_LLaVA_run.py",
        "--deepspeed", "/home/louis/LK/study/transformers/lk_study/llava_study/ds_zero2_no_offload.json",
        "--model_name_or_path", "/home/louis/LK/study/transformers/lk_study/llava_study/my_llava_model/model_01",
        "--train_type", "use_lora",
        "--data_path", "/home/louis/LK/study/transformers/lk_study/llava_study/train_llava/data",
        "--remove_unused_columns", "false",
        "--bf16", "true",
        "--fp16", "false",
        "--dataloader_pin_memory", "True",
        "--dataloader_num_workers", "10",
        "--dataloader_persistent_workers", "True",
        "--output_dir", "output_model_user_lora_simple_train",
        "--num_train_epochs", "10",
        "--per_device_train_batch_size", "1",
        "--per_device_eval_batch_size", "1",
        "--gradient_accumulation_steps", "8",
        "--evaluation_strategy", "no",
        "--save_strategy", "epoch",
        "--save_total_limit", "3",
        "--report_to", "tensorboard",
        "--learning_rate", "4e-4",
        "--logging_steps", "10"
    ],
    "env": {
        "NCCL_P2P_DISABLE": "1",
        "NCCL_IB_DISABLE": "1",
        "HF_ENDPOINT": "https://hf-mirror.com",
        "CUDA_VISIBLE_DEVICES": "0",  // key setting: force single-GPU debugging
        "PYTHONUNBUFFERED": "1",      // make sure logs are printed immediately
        "CUDA_LAUNCH_BLOCKING": "1"   // run CUDA operations synchronously
    }
}
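Typing the "args" array by hand is error-prone, so here is a minimal Python sketch that derives it from the shell command. The function name build_debug_config and the shortened COMMAND string are made up for illustration and are not part of the original setup. The sketch assumes the command has already been joined onto one line and that the GPU selection is written as "--include <spec>" with a space.

```python
import json
import shlex
import shutil

# Shortened version of the deepspeed command (line continuations already joined).
COMMAND = (
    "deepspeed --include localhost:0,1 simple_LLaVA_run.py "
    "--deepspeed ds_zero2_no_offload.json "
    "--train_type use_lora "
    '--evaluation_strategy "no" '
    "--learning_rate 4e-4"
)


def build_debug_config(command, name="DeepSpeed single-GPU debug"):
    """Turn a deepspeed shell command into a launch.json configuration dict."""
    tokens = shlex.split(command)  # shell-style split; also strips the quotes around "no"
    launcher, rest = tokens[0], tokens[1:]

    # Drop "--include <spec>"; the debug session uses "--num_gpus 1" instead.
    args = []
    skip_next = False
    for token in rest:
        if skip_next:
            skip_next = False
            continue
        if token == "--include":
            skip_next = True
            continue
        args.append(token)

    return {
        "name": name,
        "type": "debugpy",
        "request": "launch",
        # Absolute path of the launcher if it is on PATH, else the bare name.
        "program": shutil.which(launcher) or launcher,
        "console": "integratedTerminal",
        "justMyCode": True,
        "args": ["--num_gpus", "1", *args],
        "env": {
            "NCCL_P2P_DISABLE": "1",
            "NCCL_IB_DISABLE": "1",
            "HF_ENDPOINT": "https://hf-mirror.com",
            "CUDA_VISIBLE_DEVICES": "0",
            "PYTHONUNBUFFERED": "1",
            "CUDA_LAUNCH_BLOCKING": "1",
        },
    }


config = build_debug_config(COMMAND)
print(json.dumps(config, indent=4))
```

Run it inside the same conda environment used for training so that shutil.which finds the right deepspeed executable. Relative paths such as simple_LLaVA_run.py are left unchanged, so either replace them with absolute paths, as in the configuration above, or add a "cwd" entry pointing at the project directory.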