Hello, We have dagster deployed in Kubernetes. It ...
# ask-community
j
Hello, We have dagster deployed in Kubernetes. It was working fine since August 2022, and then it stopped working in February. I've been trying to troubleshoot for two days. I have found that the code works fine when I run it locally. Something isn't working in Kubernetes and the dagster jobs won't run there anymore. I don't know how to troubleshoot and would be grateful for your help. Here is one error message from the dagster-daemon log, and the same message is found in the dagster-dagit log. I don't understand it and am not even sure it's relevant. I've spent hours and hours trying to troubleshoot this but I'm like a cave dweller trying to tinker with an electric car when it comes to figuring stuff out in the cloud. Actually, I suppose I'm not that bad, but it's certainly not my cup of tea and I'm definitely stuck on this issue.
2023-03-01 21:43:57.251 GMT
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:

{
insertId: "kteiqgwfiefvsjr3"
labels: {6}
logName: <MY_PROJECT>
receiveTimestamp: "2023-03-01T21:43:58.049338769Z"
resource: {2}
severity: "INFO"
textPayload: "grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:"
timestamp: "2023-03-01T21:43:57.251681282Z"
}

2023-03-01 21:43:57.251 GMT
status = StatusCode.UNAVAILABLE

2023-03-01 21:43:57.251 GMT
details = "failed to connect to all addresses"

{
insertId: "224vy9b8py9nlnq2"
labels: {6}
logName: <MY_PROJECT>
receiveTimestamp: "2023-03-01T21:43:58.049338769Z"
resource: {2}
severity: "INFO"
textPayload: "	details = "failed to connect to all addresses""
timestamp: "2023-03-01T21:43:57.251741084Z"
d
Hi Jonah, is this using the Dagster helm chart?
What that error is telling me is that dagit and your daemon are having trouble reaching your user code pods: https://docs.dagster.io/deployment/guides/kubernetes/deploying-with-helm#deployment-architecture What I would probably do first is see what's going on with those pods in kubectl by running something like this:
kubectl get pods
and looking for a pod with the same name of one of your code locations, then doing something like
kubectl describe pod <that pod name>
or
kubectl logs <that pod name>
and see if an error jumps out that might explain why it's not able to start up
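If there are a lot of pods, a small script can narrow down which one to describe first. This is only a sketch: the pod names and statuses in SAMPLE are hypothetical, and it assumes the default `kubectl get pods` column layout (NAME, READY, STATUS, ...). It flags every pod whose STATUS is something other than Running or Completed.

```python
from typing import List, Tuple


def find_unhealthy_pods(kubectl_output: str) -> List[Tuple[str, str]]:
    """Return (name, status) for pods whose STATUS is not Running or Completed."""
    unhealthy = []
    lines = kubectl_output.strip().splitlines()
    for line in lines[1:]:  # the first line is the column header
        columns = line.split()
        if len(columns) < 3:
            continue  # skip blank or malformed lines
        name, status = columns[0], columns[2]
        if status not in ("Running", "Completed"):
            unhealthy.append((name, status))
    return unhealthy


# Hypothetical output of `kubectl get pods`; real pod names will differ.
SAMPLE = """\
NAME                                             READY   STATUS             RESTARTS   AGE
dagster-daemon-7c9d8f6b5-abcde                   1/1     Running            0          3d
dagster-dagit-5f7b9c6d4-fghij                    1/1     Running            0          3d
user-code-dagster-user-deployments-example-xyz   0/1     CrashLoopBackOff   12         3d
"""

if __name__ == "__main__":
    for name, status in find_unhealthy_pods(SAMPLE):
        print(f"{name}: {status}")
```

A pod can also show Running while its READY column is 0/1, which this check would miss, so it is worth glancing at that column too.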
j
Yes, @daniel, it is using the Dagster helm chart. Thank you for your advice. I will give it a try.
One of the pods has an error.
kubectl logs user-code-dagster-user-deployments<location>
Traceback (most recent call last):
File "/usr/local/bin/dagster", line 5, in <module>
from dagster.cli import main
File "/usr/local/lib/python3.9/site-packages/dagster/__init__.py", line 239, in <module>
from dagster.core.storage.event_log import (
File "/usr/local/lib/python3.9/site-packages/dagster/core/storage/event_log/__init__.py", line 9, in <module>
from .polling_event_watcher import SqlPollingEventWatcher
File "/usr/local/lib/python3.9/site-packages/dagster/core/storage/event_log/polling_event_watcher.py", line 8, in <module>
from .sql_event_log import SqlEventLogStorage
File "/usr/local/lib/python3.9/site-packages/dagster/core/storage/event_log/sql_event_log.py", line 38, in <module>
from .schema import AssetKeyTable, SecondaryIndexMigrationTable, SqlEventLogStorageTable
File "/usr/local/lib/python3.9/site-packages/dagster/core/storage/event_log/schema.py", line 3, in <module>
from ..sql import MySQLCompatabilityTypes, get_current_timestamp
File "/usr/local/lib/python3.9/site-packages/dagster/core/storage/sql.py", line 6, in <module>
from alembic.command import downgrade, stamp, upgrade
File "/usr/local/lib/python3.9/site-packages/alembic/__init__.py", line 3, in <module>
from . import context # noqa
File "/usr/local/lib/python3.9/site-packages/alembic/context.py", line 1, in <module>
from .runtime.environment import EnvironmentContext
File "/usr/local/lib/python3.9/site-packages/alembic/runtime/environment.py", line 1, in <module>
from .migration import MigrationContext
File "/usr/local/lib/python3.9/site-packages/alembic/runtime/migration.py", line 15, in <module>
from .. import ddl
File "/usr/local/lib/python3.9/site-packages/alembic/ddl/__init__.py", line 1, in <module>
from . import mssql # noqa
File "/usr/local/lib/python3.9/site-packages/alembic/ddl/mssql.py", line 8, in <module>
from .base import AddColumn
File "/usr/local/lib/python3.9/site-packages/alembic/ddl/base.py", line 11, in <module>
from ..util.sqla_compat import _columns_for_constraint # noqa
File "/usr/local/lib/python3.9/site-packages/alembic/util/__init__.py", line 14, in <module>
from .messaging import err # noqa
File "/usr/local/lib/python3.9/site-packages/alembic/util/messaging.py", line 8, in <module>
from . import sqla_compat
File "/usr/local/lib/python3.9/site-packages/alembic/util/sqla_compat.py", line 16, in <module>
from sqlalchemy.sql.expression import _BindParamClause
ImportError: cannot import name '_BindParamClause' from 'sqlalchemy.sql.expression' (/usr/local/lib/python3.9/site-packages/sqlalchemy/sql/expression.py)
I bold-faced the part that seems relevant, but I don't know what to do about this error.
d
OK luckily I do! I think this came from sqlalchemy releasing a 2.0 version that was not backwards compatible. I'd expect one of the following to fix the problem:
• Rebuild the image with an additional sqlalchemy<2 pin when installing dagster
• Upgrade dagster to the latest version, where that pin has been applied automatically already
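To confirm which sqlalchemy version actually ended up in the image, something like the following could be run inside the container. It is a minimal sketch using only the standard library; the helper names `major_version` and `sqlalchemy_needs_pin` are made up for illustration, and it assumes the version string starts with a numeric major component.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string: str) -> int:
    """Return the leading numeric component of a version such as '1.4.46'."""
    return int(version_string.split(".")[0])


def sqlalchemy_needs_pin(installed: str) -> bool:
    """True when the installed sqlalchemy is 2.x or newer, i.e. the sqlalchemy<2 pin is missing."""
    return major_version(installed) >= 2


if __name__ == "__main__":
    try:
        installed = version("sqlalchemy")
    except PackageNotFoundError:
        print("sqlalchemy is not installed in this environment")
    else:
        if sqlalchemy_needs_pin(installed):
            print(f"sqlalchemy {installed} found: rebuild with a sqlalchemy<2 pin")
        else:
            print(f"sqlalchemy {installed} found: already below 2.0")
```

If it reports a 2.x version, adding sqlalchemy<2 to the install step of the image and rebuilding is the first option listed above.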
j
I need to confess that I didn't create the deployment. The guy who did is gone. With that said, I'm trying to upgrade dagster to version 1.1.20 and getting an error when I push the Docker image to the cloud.
ERROR: Could not find a version that satisfies the requirement dagster-postgres==1.1.20 (from versions: 0.5.9rc0, 0.5.9, 0.6.0
...
ERROR: No matching distribution found for dagster-postgres==1.1.20
If you have any thoughts on this error, greatly appreciate it!
d
Ah I think you want dagster-postgres==0.17.20 (dagster is 1.1.x, libraries are 0.17.x)
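As a rough sketch of that numbering, the helper below turns a dagster 1.x version into the matching library version. The function name and the offset of 16 are assumptions inferred from the single 1.1.x / 0.17.x pairing above, so check the result against the versions actually published before pinning it.

```python
def library_version_for(dagster_version: str, minor_offset: int = 16) -> str:
    """Map a dagster 1.x version to the assumed 0.x version of its libraries.

    The default offset is inferred from dagster 1.1.x pairing with 0.17.x
    libraries; it is an assumption, not a documented rule.
    """
    major, minor, patch = dagster_version.split(".")
    if major != "1":
        raise ValueError(f"only dagster 1.x versions are handled, got {dagster_version}")
    return f"0.{int(minor) + minor_offset}.{patch}"


if __name__ == "__main__":
    core = "1.1.20"
    print(f"dagster=={core}")
    print(f"dagster-postgres=={library_version_for(core)}")
```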
j
Hello @daniel, thanks for your help. I did finally fix the issue with sqlalchemy and got Dagster working. However, I then ran into a new problem with the dagster service account that never existed before. The service account is in the correct format (I deleted it and recreated it too), so I suspect some other issue is manifesting as this error. I did finally get in touch with the engineer who set this up, and we will see what we can do. Thank you for your patience and help. Your insights and recommended fixes have all been spot on.
google.auth.exceptions.RefreshError: ('Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true from the Google Compute Engine metadata service. Status: 400 Response:\nb"Annotated service account must be in format of \'[SA-NAME]@[PROJECT-ID].iam.gserviceaccount.com\', \'[SA_NAME]@appspot.gserviceaccount.com\' or \'[SA_NAME]@developer.gserviceaccount.com\'\\n"', <google_auth_httplib2._Response object at 0x7f0a84e40b20>)