I set os.environ["CUDA_VISIBLE_DEVICES"] = '2,3', but I get 🚫 RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu. The error says GPU 0 is being used, and on top of that it reports something on the CPU as well. After looking around: https://github.com/microsoft/DeepSpeed/issues/3070 [BUG] cannot set gpu 2,3 to train with deepspeed and trainer in huggingface · Issue #3070 · microsoft/DeepSpee..
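A small sketch of how the device numbering works, in case it helps read the error. With CUDA_VISIBLE_DEVICES set to '2,3', CUDA renumbers the visible GPUs from zero, so `cuda:0` in the message refers to physical GPU 2. Under that reading the part that matters is "found one of them on device: cpu". The helper name `logical_to_physical` is made up for illustration, and this is not necessarily the fix discussed in the linked issue.

```python
import os

# Set this before the first `import torch` (or anything else that
# initializes CUDA); changing it afterwards has no effect on the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"


def logical_to_physical():
    """Map the logical device names seen by the process to physical GPU ids.

    Hypothetical helper: it only parses the environment variable and does
    not query the driver.
    """
    raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    ids = [d.strip() for d in raw.split(",") if d.strip()]
    return {f"cuda:{i}": phys for i, phys in enumerate(ids)}


mapping = logical_to_physical()
print(mapping)
```

If the model is wrapped for multi-GPU use, moving it to `cuda:0` first (for example with `model.to("cuda:0")`) is the usual way to satisfy the "parameters and buffers on device cuda:0" requirement.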
Python and Torch coding odds and ends
RuntimeError: CUDA error: device-side assert triggered. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. This error shows up when running code on the GPU, and apparently it appears when the code itself is wrong.. so it is simply an implementation bug,,, https://builtin.com/software-engineerin..
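One common cause of this assert, as an assumption rather than something the excerpt states, is an index that falls outside the valid range, such as a class label greater than or equal to the number of classes. A rough sketch of checking that on the CPU before the batch ever reaches the GPU, where the error message is readable. The function name `check_class_indices` is hypothetical.

```python
def check_class_indices(labels, num_classes):
    """Raise ValueError if any label is outside [0, num_classes).

    `labels` is any iterable of ints (e.g. `tensor.tolist()`); doing the
    check in plain Python reports the exact offending positions instead
    of an asynchronous CUDA assert.
    """
    bad = [(i, y) for i, y in enumerate(labels) if not 0 <= y < num_classes]
    if bad:
        raise ValueError(
            f"{len(bad)} label(s) out of range for {num_classes} classes, "
            f"first few (position, value): {bad[:5]}"
        )


# Labels 0..2 are fine for a 3-class problem.
check_class_indices([0, 2, 1, 1], num_classes=3)

# A label of 3 is one past the end and would be rejected.
try:
    check_class_indices([0, 3, 1], num_classes=3)
    out_of_range_detected = False
except ValueError as err:
    out_of_range_detected = True
    print(err)
```

Setting CUDA_LAUNCH_BLOCKING=1, as the message itself suggests, or running one batch on the CPU are other ways to get a stack trace that points at the real line.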
File "/root/.venv/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 172, in _create_c10d_store return TCPStore( RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use). [202..
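Port 29500 in the message is already taken, typically by another (or a leftover) distributed job on the same machine. A minimal sketch of one workaround, assuming the job reads the rendezvous port from the MASTER_PORT environment variable: ask the OS for an unused port and export it before initializing the process group. The helper name `find_free_port` is made up for illustration.

```python
import os
import socket


def find_free_port():
    """Ask the OS for a currently unused TCP port and return its number.

    Binding to port 0 lets the kernel pick a free port. The socket is
    closed on return, so another process could grab the port in the
    short gap before it is reused.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


port = find_free_port()
# Assumption: the launcher/init method in use honors MASTER_PORT.
os.environ["MASTER_PORT"] = str(port)
print(f"MASTER_PORT={port}")
```

When a launcher starts the job, passing the port through the launcher's own option is usually the more reliable route. Killing the stale process that still holds 29500 also works.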
Exception has occurred: OSError. You are trying to access a gated repo. Make sure to request access at https://huggingface.co/LDCC/LDCC-Instruct-Llama-2-ko-13B-v1.4 and pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=`. requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/LDCC/LDCC-Instruct-Ll..
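The message itself names the two options: `huggingface-cli login`, or passing `token=`. A hedged sketch of the second one, assuming access to the repo has already been granted, the token is kept in an environment variable called HF_TOKEN, and the installed transformers version accepts the `token` argument. The helper names `read_hf_token` and `load_gated_tokenizer` are hypothetical.

```python
import os

# Repo id taken from the error message above.
REPO_ID = "LDCC/LDCC-Instruct-Llama-2-ko-13B-v1.4"


def read_hf_token(env_var="HF_TOKEN"):
    """Return the access token from the environment, or None if unset/empty."""
    return os.environ.get(env_var) or None


def load_gated_tokenizer(repo_id=REPO_ID):
    """Load a tokenizer from a gated repo using the token from the environment."""
    # Imported lazily so read_hf_token() works without transformers installed.
    from transformers import AutoTokenizer

    token = read_hf_token()
    if token is None:
        raise RuntimeError("Set HF_TOKEN or run `huggingface-cli login` first.")
    return AutoTokenizer.from_pretrained(repo_id, token=token)
```

A 401 that persists after this usually means the access request for the repo has not been approved for the account that owns the token.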