Dear Ayaka,
I ran podrun successfully on a v3-32 pod. But with a matrix dimension of 30000 or more the script did not run (i.e. it did not scale up).
Command used:
./podrun -ic -- ~/venv/bin/python f32simple.py
import jax.numpy as jnp
from jax import random
import jax
import time
from jax.lib import xla_bridge
print(xla_bridge.get_backend().platform)
N=25000 # Dim Mx
for i in range(6):
    x = random.normal(random.PRNGKey(0), (N,N))
    t2 = time.time()
    y = jnp.matmul(x, x)
    print(y[1][1]) # Equivalent to y.block_until_ready()
    t1 = time.time()
    print("Mx size = ", N, "\ttime elapsed", t1-t2 , "Mx in GB = %d GB" % ((y.size * y.itemsize)/1.0e+9))
    N=N+2500
It works for N = 25000 and 27500 but not for 30000, just as on a single v3-8 device.
In other words, how can I successfully evaluate this larger matrix using podrun?
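For context on the sizes involved, here is a small arithmetic sketch. It only assumes that a dense float32 matrix stores 4 bytes per element; the helper name `f32_matrix_bytes` is made up for illustration. Note that the script's `%d GB` format truncates, so the "2 GB" and "3 GB" in the log correspond to 2.5 GB and 3.025 GB. The figures below cover array storage only; any temporary buffers the compiled program needs come on top and are not known from this post.

```python
def f32_matrix_bytes(n: int) -> int:
    """Bytes needed to store one dense n x n float32 matrix (4 bytes per element)."""
    return 4 * n * n


for n in (25000, 27500, 30000):
    one = f32_matrix_bytes(n)
    # x and y have the same shape, so holding both doubles the footprint.
    print(
        f"N={n}: one matrix = {one / 1e9:.3f} GB ({one / 2**30:.2f} GiB), "
        f"x and y together = {2 * one / 1e9:.3f} GB"
    )
```

At N = 30000 a single matrix already takes 3.6 GB, so the input and the result together need 7.2 GB if they sit on one device.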
Error is:
/home/gpkmohan_mbcet/f32simple.py:7: DeprecationWarning: jax.lib.xla_bridge.get_backend is deprecated; use jax.extend.backend.get_backend.
print(xla_bridge.get_backend().platform)
/home/gpkmohan_mbcet/f32simple.py:7: DeprecationWarning: jax.lib.xla_bridge.get_backend is deprecated; use jax.extend.backend.get_backend.
print(xla_bridge.get_backend().platform)
/home/gpkmohan_mbcet/f32simple.py:7: DeprecationWarning: jax.lib.xla_bridge.get_backend is deprecated; use jax.extend.backend.get_backend.
print(xla_bridge.get_backend().platform)
/home/gpkmohan_mbcet/f32simple.py:7: DeprecationWarning: jax.lib.xla_bridge.get_backend is deprecated; use jax.extend.backend.get_backend.
print(xla_bridge.get_backend().platform)
tpu
-19.041164
Mx size = 25000 time elapsed 2.4406259059906006 Mx in GB = 2 GB
295.55646
Mx size = 27500 time elapsed 2.626249074935913 Mx in GB = 3 GB
Traceback (most recent call last):
  File "/home/gpkmohan_mbcet/f32simple.py", line 12, in <module>
    x = random.normal(random.PRNGKey(0), (N,N))
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gpkmohan_mbcet/venv/lib/python3.12/site-packages/jax/_src/random.py", line 700, in normal
    return _normal(key, shape, dtype)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Error loading program: Attempting to reserve 6.72G at the bottom of memory. That was not possible. There are 6.48G free, 0B reserved, and 6.48G reservable.
For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.
tpu
-19.041164
Mx size = 25000 time elapsed 2.452914237976074 Mx in GB = 2 GB
295.55646
Mx size = 27500 time elapsed 2.6292035579681396 Mx in GB = 3 GB
Traceback (most recent call last):
  File "/home/gpkmohan_mbcet/f32simple.py", line 12, in <module>
    x = random.normal(random.PRNGKey(0), (N,N))
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gpkmohan_mbcet/venv/lib/python3.12/site-packages/jax/_src/random.py", line 700, in normal
    return _normal(key, shape, dtype)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Error loading program: Attempting to reserve 6.72G at the bottom of memory. That was not possible. There are 6.48G free, 0B reserved, and 6.48G reservable.
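The script above never asks JAX to split `x` or `y` across devices, and every host reports the same per-device memory limit, which suggests each array is being placed whole on a single core. One possible direction, shown here only as a minimal sketch, is to shard the rows of the matrix over a device mesh with `jax.sharding`. The names `mesh`, `row_sharded` and `sharded_square` are hypothetical, and this has not been run on a v3-32. It assumes N is divisible by the device count. On a multi-host pod every process must run the same program (possibly after `jax.distributed.initialize()`), reading individual elements back needs extra care, and whether the random generation itself is partitioned may depend on the JAX version and the `jax_threefry_partitionable` setting.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# One-dimensional mesh over every device JAX can see (hypothetical axis name "dev").
mesh = Mesh(np.array(jax.devices()), ("dev",))
# Split rows across the "dev" axis and keep each row whole.
row_sharded = NamedSharding(mesh, P("dev", None))


def sharded_square(n, seed=0):
    """Create an n x n float32 matrix and multiply it by itself, rows sharded."""
    key = jax.random.PRNGKey(seed)
    # out_shardings asks the compiler to lay the result out according to row_sharded.
    make = jax.jit(lambda k: jax.random.normal(k, (n, n)), out_shardings=row_sharded)
    x = make(key)
    y = jax.jit(lambda a, b: jnp.matmul(a, b), out_shardings=row_sharded)(x, x)
    return x, y


# Small size for a quick check; n must be divisible by the number of devices.
n = 8 * jax.device_count()
x, y = sharded_square(n)
print(y.shape, y.sharding)
```

With this layout each device stores roughly 1/D of the rows of `x` and `y` for D devices, instead of the full matrices.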