jaxlib.xla_extension.XlaRuntimeError: INTERNAL: cuSolver internal error #44

@tekntrash

Description

(openvla-env) root@ubuntu:~/ksim-gym# python -m train
INFO 2025-05-20 13:08:38 [xax.task.mixins.compile] Setting JAX logging level to INFO
INFO 2025-05-20 13:08:38 [xax.task.mixins.compile] Setting JAX compilation cache directory to /root/.cache/jax/jaxcache
INFO 2025-05-20 13:08:38 [xax.task.mixins.compile] Configuring JAX compilation cache parameters
INFO:2025-05-20 13:08:38,816:jax._src.xla_bridge:867: Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
INFO 2025-05-20 13:08:38 [jax._src.xla_bridge] Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
2025-05-20 13:08:38.897413: I external/xla/xla/service/service.cc:152] XLA service 0xb17d160 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2025-05-20 13:08:38.897471: I external/xla/xla/service/service.cc:160] StreamExecutor device (0): Orin, Compute Capability 8.7
2025-05-20 13:08:38.906223: I external/xla/xla/pjrt/pjrt_c_api_client.cc:130] PjRtCApiClient created.
STATUS 2025-05-20 13:08:50 [xax.task.mixins.artifacts] /root/ksim-gym/humanoid_walking_task/run_1
STATUS 2025-05-20 13:08:50 [xax.task.mixins.train] /root/ksim-gym/train.py
STATUS 2025-05-20 13:08:50 [xax.task.mixins.train] humanoid_walking_task
STATUS 2025-05-20 13:08:50 [xax.task.mixins.train] JAX devices: [CudaDevice(id=0)]
INFO 2025-05-20 13:08:51 [xax.task.mixins.train] Starting a new training run
PING 2025-05-20 13:08:53 [ksim.task.rl] Model size: 1,090,861 parameters
PING 2025-05-20 13:08:53 [ksim.task.rl] Optimizer size: 2,181,722 parameters

Status
✦ JAX devices: [CudaDevice(id=0)]
✦ humanoid_walking_task
✦ /root/ksim-gym/train.py
✦ /root/ksim-gym/humanoid_walking_task/run_1

Pings
✦ Optimizer size: 2,181,722 parameters
✦ Model size: 1,090,861 parameters
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/root/ksim-gym/train.py", line 658, in <module>
    HumanoidWalkingTask.launch(
  File "/root/openvla-env/lib/python3.12/site-packages/xax/task/mixins/runnable.py", line 51, in launch
    launcher.launch(cls, *cfgs, use_cli=use_cli)
  File "/root/openvla-env/lib/python3.12/site-packages/xax/task/launchers/cli.py", line 40, in launch
    SingleProcessLauncher().launch(task, *cfgs, use_cli=use_cli_next)
  File "/root/openvla-env/lib/python3.12/site-packages/xax/task/launchers/single_process.py", line 30, in launch
    run_single_process_training(task, *cfgs, use_cli=use_cli)
  File "/root/openvla-env/lib/python3.12/site-packages/xax/task/launchers/single_process.py", line 20, in run_single_process_training
    task_obj.run()
  File "/root/openvla-env/lib/python3.12/site-packages/ksim/task/rl.py", line 1009, in run
    self.run_training()
  File "/root/openvla-env/lib/python3.12/site-packages/ksim/task/rl.py", line 2042, in run_training
    constants, carry, state = self.initialize_rl_training(mj_model, rng)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/openvla-env/lib/python3.12/site-packages/ksim/task/rl.py", line 1990, in initialize_rl_training
    env_states=self._get_env_state(
               ^^^^^^^^^^^^^^^^^^^^
  File "/root/openvla-env/lib/python3.12/site-packages/ksim/task/rl.py", line 1801, in _get_env_state
    randomization_dict, physics_state = randomization_fn(
                                        ^^^^^^^^^^^^^^^^^
  File "/root/openvla-env/lib/python3.12/site-packages/ksim/task/rl.py", line 322, in apply_randomizations
    physics_state = engine.reset(physics_model, curriculum_level, reset_rng)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/openvla-env/lib/python3.12/site-packages/xax/utils/jax.py", line 139, in wrapped
    res = jitted_fn(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: cuSolver internal error

For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.
(openvla-env) root@ubuntu:~/ksim-gym# python -m train
INFO 2025-05-20 13:14:09 [xax.task.mixins.compile] Setting JAX logging level to INFO
INFO 2025-05-20 13:14:09 [xax.task.mixins.compile] Setting JAX compilation cache directory to /root/.cache/jax/jaxcache
INFO 2025-05-20 13:14:09 [xax.task.mixins.compile] Configuring JAX compilation cache parameters
INFO:2025-05-20 13:14:09,150:jax._src.xla_bridge:867: Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
INFO 2025-05-20 13:14:09 [jax._src.xla_bridge] Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
2025-05-20 13:14:09.219131: I external/xla/xla/service/service.cc:152] XLA service 0x118922f0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2025-05-20 13:14:09.219184: I external/xla/xla/service/service.cc:160] StreamExecutor device (0): Orin, Compute Capability 8.7
2025-05-20 13:14:09.228180: I external/xla/xla/pjrt/pjrt_c_api_client.cc:130] PjRtCApiClient created.
STATUS 2025-05-20 13:14:20 [xax.task.mixins.artifacts] /root/ksim-gym/humanoid_walking_task/run_2
STATUS 2025-05-20 13:14:20 [xax.task.mixins.train] /root/ksim-gym/train.py
STATUS 2025-05-20 13:14:20 [xax.task.mixins.train] humanoid_walking_task
STATUS 2025-05-20 13:14:20 [xax.task.mixins.train] JAX devices: [CudaDevice(id=0)]
INFO 2025-05-20 13:14:21 [xax.task.mixins.train] Starting a new training run
PING 2025-05-20 13:14:23 [ksim.task.rl] Model size: 1,090,861 parameters
PING 2025-05-20 13:14:23 [ksim.task.rl] Optimizer size: 2,181,722 parameters

Status
✦ JAX devices: [CudaDevice(id=0)]
✦ humanoid_walking_task
✦ /root/ksim-gym/train.py
✦ /root/ksim-gym/humanoid_walking_task/run_2

Pings
✦ Optimizer size: 2,181,722 parameters
✦ Model size: 1,090,861 parameters
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/root/ksim-gym/train.py", line 658, in <module>
    HumanoidWalkingTask.launch(
  File "/root/openvla-env/lib/python3.12/site-packages/xax/task/mixins/runnable.py", line 51, in launch
    launcher.launch(cls, *cfgs, use_cli=use_cli)
  File "/root/openvla-env/lib/python3.12/site-packages/xax/task/launchers/cli.py", line 40, in launch
    SingleProcessLauncher().launch(task, *cfgs, use_cli=use_cli_next)
  File "/root/openvla-env/lib/python3.12/site-packages/xax/task/launchers/single_process.py", line 30, in launch
    run_single_process_training(task, *cfgs, use_cli=use_cli)
  File "/root/openvla-env/lib/python3.12/site-packages/xax/task/launchers/single_process.py", line 20, in run_single_process_training
    task_obj.run()
  File "/root/openvla-env/lib/python3.12/site-packages/ksim/task/rl.py", line 1009, in run
    self.run_training()
  File "/root/openvla-env/lib/python3.12/site-packages/ksim/task/rl.py", line 2042, in run_training
    constants, carry, state = self.initialize_rl_training(mj_model, rng)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/openvla-env/lib/python3.12/site-packages/ksim/task/rl.py", line 1990, in initialize_rl_training
    env_states=self._get_env_state(
               ^^^^^^^^^^^^^^^^^^^^
  File "/root/openvla-env/lib/python3.12/site-packages/ksim/task/rl.py", line 1801, in _get_env_state
    randomization_dict, physics_state = randomization_fn(
                                        ^^^^^^^^^^^^^^^^^
  File "/root/openvla-env/lib/python3.12/site-packages/ksim/task/rl.py", line 322, in apply_randomizations
    physics_state = engine.reset(physics_model, curriculum_level, reset_rng)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/openvla-env/lib/python3.12/site-packages/xax/utils/jax.py", line 139, in wrapped
    res = jitted_fn(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: cuSolver internal error

For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.
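
To check whether the failure is specific to ksim or affects JAX's GPU solver path in general, a small standalone probe may help. This is only a sketch under assumptions, not something taken from the logs: it sets `JAX_TRACEBACK_FILTERING=off` (the variable named in the message above) before importing JAX, then runs a small Cholesky factorization, which JAX on CUDA is generally expected to route through cuSolver. The helper name `probe_cusolver` is hypothetical and not part of ksim or xax.

```python
import os

# Must be set before JAX is imported; the log message above names this variable.
os.environ.setdefault("JAX_TRACEBACK_FILTERING", "off")


def probe_cusolver(n: int = 4) -> dict:
    """Run a small Cholesky factorization and report what happened.

    Hypothetical helper for illustration only.
    """
    try:
        import jax
        import jax.numpy as jnp
    except ImportError:
        # Lets the script run (and say so) on machines without JAX.
        return {"status": "jax-not-installed"}

    report = {"jax": jax.__version__}
    try:
        report["backend"] = jax.default_backend()
        # 2*I + ones is symmetric positive definite, so Cholesky is well defined.
        a = 2.0 * jnp.eye(n) + jnp.ones((n, n))
        chol = jnp.linalg.cholesky(a)
        # Force execution so a runtime error surfaces inside this try block.
        chol.block_until_ready()
        report["status"] = "ok"
    except Exception as exc:  # e.g. XlaRuntimeError raised by the GPU solver
        report["status"] = "failed"
        report["error"] = f"{type(exc).__name__}: {exc}"
    return report


if __name__ == "__main__":
    print(probe_cusolver())
```

If the probe also reports `failed` with a cuSolver error on the Orin, the problem more likely sits in the JAX/CUDA installation than in the training code. If it reports `ok`, the failure may be specific to the solver calls made inside `engine.reset`.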
