Hello We have dagster deployed in Kubernetes It was working dagster #ask-community

Jonah Liebert

03/01/2023, 9:55 PM

Hello, We have dagster deployed in Kubernetes. It was working fine since August 2022, and then it stopped working in February. I've been trying to troubleshoot for two days. I have found that the code works fine when I run it locally. Something isn't working in Kubernetes and the dagster jobs won't run there anymore. I don't know how to troubleshoot and would be grateful for your help. Here is one error message from the dagster-daemon log, and the same message is found in the dagster-dagit log. I don't understand it and am not even sure it's relevant. I've spent hours and hours trying to troubleshoot this but I'm like a cave dweller trying to tinker with an electric car when it comes to figuring stuff out in the cloud. Actually, I suppose I'm not that bad, but it's certainly not my cup of tea and I'm definitely stuck on this issue.

2023-03-01 21:43:57.251 GMT
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:

{
insertId: "kteiqgwfiefvsjr3"
labels: {6}
logName: <MY_PROJECT>
receiveTimestamp: "2023-03-01T21:43:58.049338769Z"
resource: {2}
severity: "INFO"
textPayload: "grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:"
timestamp: "2023-03-01T21:43:57.251681282Z"
}

2023-03-01 21:43:57.251 GMT
status = StatusCode.UNAVAILABLE

2023-03-01 21:43:57.251 GMT
details = "failed to connect to all addresses"

{
insertId: "224vy9b8py9nlnq2"
labels: {6}
logName: <MY_PROJECT>
receiveTimestamp: "2023-03-01T21:43:58.049338769Z"
resource: {2}
severity: "INFO"
textPayload: "	details = "failed to connect to all addresses""
timestamp: "2023-03-01T21:43:57.251741084Z"

daniel

03/01/2023, 9:59 PM

Hi Jonah, is this using the Dagster helm chart?

daniel

03/01/2023, 10:01 PM

What that error is telling me is that dagit and your daemon are having trouble reaching your user code pods (see https://docs.dagster.io/deployment/guides/kubernetes/deploying-with-helm#deployment-architecture). What I would probably do first is see what's going on with those pods in kubectl by running something like this:

kubectl get pods

and looking for a pod with the same name as one of your code locations, then doing something like

kubectl describe pod <that pod name>

kubectl logs <that pod name>

and see if an error jumps out that might explain why it's not able to start up
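As a rough sketch of the triage described above, the Python snippet below scans the text output of `kubectl get pods` and flags pods that are not fully ready, then prints the follow-up commands to run. The helper name `find_unhealthy_pods` and the sample pod names are hypothetical, and the parsing assumes the default column layout (NAME, READY, STATUS, RESTARTS, AGE).

```python
def find_unhealthy_pods(kubectl_output: str) -> list[tuple[str, str]]:
    """Return (pod name, status) for pods that are not Running with all containers ready.

    Assumes the default `kubectl get pods` columns: NAME READY STATUS RESTARTS AGE.
    """
    unhealthy = []
    lines = kubectl_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        if status == "Completed":
            continue  # finished job pods are not a problem
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            unhealthy.append((name, status))
    return unhealthy


# Hypothetical sample output; real pod names will differ.
SAMPLE = """\
NAME                                      READY   STATUS             RESTARTS   AGE
dagster-dagit-7c9f8d6b5-abcde             1/1     Running            0          3d
dagster-daemon-5d8c7b9f4-fghij            1/1     Running            0          3d
user-code-example-6b7c8d9e0-klmno         0/1     CrashLoopBackOff   42         3d
"""

if __name__ == "__main__":
    for name, status in find_unhealthy_pods(SAMPLE):
        print(f"{name}: {status}")
        print(f"  next: kubectl describe pod {name}")
        print(f"  next: kubectl logs {name}")
```

In practice the input would be the captured output of `kubectl get pods` in the namespace where the Helm chart was installed.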

Jonah Liebert

03/01/2023, 10:04 PM

Yes, @daniel, it is using the Dagster Helm chart. Thank you for your advice. I will give it a try.

Jonah Liebert

03/01/2023, 10:28 PM

One of the pods has an error.

kubectl logs user-code-dagster-user-deployments<location>

Traceback (most recent call last):
  File "/usr/local/bin/dagster", line 5, in <module>
    from dagster.cli import main
  File "/usr/local/lib/python3.9/site-packages/dagster/__init__.py", line 239, in <module>
    from dagster.core.storage.event_log import (
  File "/usr/local/lib/python3.9/site-packages/dagster/core/storage/event_log/__init__.py", line 9, in <module>
    from .polling_event_watcher import SqlPollingEventWatcher
  File "/usr/local/lib/python3.9/site-packages/dagster/core/storage/event_log/polling_event_watcher.py", line 8, in <module>
    from .sql_event_log import SqlEventLogStorage
  File "/usr/local/lib/python3.9/site-packages/dagster/core/storage/event_log/sql_event_log.py", line 38, in <module>
    from .schema import AssetKeyTable, SecondaryIndexMigrationTable, SqlEventLogStorageTable
  File "/usr/local/lib/python3.9/site-packages/dagster/core/storage/event_log/schema.py", line 3, in <module>
    from ..sql import MySQLCompatabilityTypes, get_current_timestamp
  File "/usr/local/lib/python3.9/site-packages/dagster/core/storage/sql.py", line 6, in <module>
    from alembic.command import downgrade, stamp, upgrade
  File "/usr/local/lib/python3.9/site-packages/alembic/__init__.py", line 3, in <module>
    from . import context # noqa
  File "/usr/local/lib/python3.9/site-packages/alembic/context.py", line 1, in <module>
    from .runtime.environment import EnvironmentContext
  File "/usr/local/lib/python3.9/site-packages/alembic/runtime/environment.py", line 1, in <module>
    from .migration import MigrationContext
  File "/usr/local/lib/python3.9/site-packages/alembic/runtime/migration.py", line 15, in <module>
    from .. import ddl
  File "/usr/local/lib/python3.9/site-packages/alembic/ddl/__init__.py", line 1, in <module>
    from . import mssql # noqa
  File "/usr/local/lib/python3.9/site-packages/alembic/ddl/mssql.py", line 8, in <module>
    from .base import AddColumn
  File "/usr/local/lib/python3.9/site-packages/alembic/ddl/base.py", line 11, in <module>
    from ..util.sqla_compat import _columns_for_constraint # noqa
  File "/usr/local/lib/python3.9/site-packages/alembic/util/__init__.py", line 14, in <module>
    from .messaging import err # noqa
  File "/usr/local/lib/python3.9/site-packages/alembic/util/messaging.py", line 8, in <module>
    from . import sqla_compat
  File "/usr/local/lib/python3.9/site-packages/alembic/util/sqla_compat.py", line 16, in <module>
    from sqlalchemy.sql.expression import _BindParamClause
ImportError: cannot import name '_BindParamClause' from 'sqlalchemy.sql.expression' (/usr/local/lib/python3.9/site-packages/sqlalchemy/sql/expression.py)

I bold-faced the part that seems relevant, but I don't know what to do about this error.

daniel

03/01/2023, 10:32 PM

OK luckily I do! I think this came from sqlalchemy releasing a 2.0 version that was not backwards compatible. I'd expect one of the following to fix the problem:

• Rebuild the image with an additional sqlalchemy<2 pin when installing dagster

• Upgrade dagster to the latest version, where that pin has been applied automatically already
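One way to confirm whether an image is affected is to check the installed SQLAlchemy major version from inside the container. The following is a minimal Python sketch, not part of Dagster; the helper names (`major_version`, `sqlalchemy_needs_pin`, `installed_sqlalchemy_version`) are hypothetical, and it assumes the image's Python is 3.8 or newer so that `importlib.metadata` is available.

```python
from importlib import metadata
from typing import Optional


def major_version(version: str) -> int:
    """Extract the leading integer of a version string such as '1.4.46' or '2.0.0b1'."""
    return int(version.split(".")[0])


def sqlalchemy_needs_pin(installed_version: str) -> bool:
    """True when the installed SQLAlchemy is 2.x or newer, i.e. outside a `sqlalchemy<2` pin."""
    return major_version(installed_version) >= 2


def installed_sqlalchemy_version() -> Optional[str]:
    """Return the installed SQLAlchemy version, or None if it is not installed."""
    try:
        return metadata.version("sqlalchemy")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_sqlalchemy_version()
    if version is None:
        print("sqlalchemy is not installed in this environment")
    elif sqlalchemy_needs_pin(version):
        print(f"sqlalchemy {version} found; rebuild the image with 'sqlalchemy<2'")
    else:
        print(f"sqlalchemy {version} satisfies 'sqlalchemy<2'")
```

If the check reports a 2.x version, adding `"sqlalchemy<2"` to the same pip install step that installs dagster applies the pin from the first option.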

Jonah Liebert

03/02/2023, 6:48 PM

I need to confess that I didn't create the deployment. The guy who did is gone. With that said, I'm trying to upgrade dagster to version 1.1.20 and getting an error when I push the Docker image to the cloud.

ERROR: Could not find a version that satisfies the requirement dagster-postgres==1.1.20 (from versions: 0.5.9rc0, 0.5.9, 0.6.0

...

ERROR: No matching distribution found for dagster-postgres==1.1.20

If you have any thoughts on this error, I would greatly appreciate it!

daniel

03/02/2023, 6:48 PM

Ah I think you want dagster-postgres==0.17.20 (dagster is 1.1.x, libraries are 0.17.x)
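The version pairing can be sketched as a small helper. This is only an illustration: the function name `library_version_for` is hypothetical, and the fixed offset of 16 between the core minor version and the library minor version is an assumption extrapolated from the single pairing given here (1.1.x with 0.17.x), so it is worth confirming against the versions actually published on PyPI.

```python
def library_version_for(dagster_version: str) -> str:
    """Map a Dagster 1.x core version to the matching 0.x library version.

    Assumption: library minor = core minor + 16, generalized from
    "dagster is 1.1.x, libraries are 0.17.x".
    """
    major, minor, patch = dagster_version.split(".", 2)
    if major != "1":
        raise ValueError(f"only Dagster 1.x is handled, got {dagster_version}")
    return f"0.{int(minor) + 16}.{patch}"


if __name__ == "__main__":
    core = "1.1.20"
    lib = library_version_for(core)
    # Requirement lines for the image build, using the versions from this thread.
    print(f"dagster=={core}")
    print(f"dagster-postgres=={lib}")
```

For the versions in this thread, the helper yields `dagster-postgres==0.17.20` alongside `dagster==1.1.20`.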

Jonah Liebert

03/04/2023, 6:33 PM

Hello @daniel, thanks for your help. I did finally fix the issue with sqlalchemy and got Dagster working. However, I then ran into a new problem with the dagster service account that never existed before. The service account is in the correct format (I deleted it and recreated it too), so I suspect there is some other issue that is manifesting as this error. However, I did finally get in touch with the engineer who set this up and will see what we can do. Thank you for your patience and help. Your insights and recommended fixes have all been spot on.

google.auth.exceptions.RefreshError: ('Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true from the Google Compute Engine metadata service. Status: 400 Response:\nb"Annotated service account must be in format of \'[SA-NAME]@[PROJECT-ID].iam.gserviceaccount.com\', \'[SA_NAME]@appspot.gserviceaccount.com\' or \'[SA_NAME]@developer.gserviceaccount.com\'\\n"', <google_auth_httplib2._Response object at 0x7f0a84e40b20>)
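Since the metadata service is rejecting the annotated value itself, one quick sanity check is to test the exact string (including any stray quotes or whitespace) against the three formats named in the error. The sketch below is hypothetical and deliberately loose: the helper name `looks_like_service_account`, the example account names, and the character classes in the patterns are assumptions, not Google's actual validation rules.

```python
import re

# Loose patterns for the three formats named in the error message.
# The allowed characters are an assumption, not Google's exact rules.
_SA_PATTERNS = [
    re.compile(r"[a-z0-9-]+@[a-z0-9.-]+\.iam\.gserviceaccount\.com"),
    re.compile(r"[a-z0-9-]+@appspot\.gserviceaccount\.com"),
    re.compile(r"[a-z0-9-]+@developer\.gserviceaccount\.com"),
]


def looks_like_service_account(value: str) -> bool:
    """True when the whole string matches one of the three expected formats."""
    return any(pattern.fullmatch(value) for pattern in _SA_PATTERNS)


if __name__ == "__main__":
    # Hypothetical values; substitute the annotation exactly as it is stored.
    candidates = [
        "dagster@my-project.iam.gserviceaccount.com",
        "dagster@my-project.iam.gserviceaccount.com ",    # trailing space
        '"dagster@my-project.iam.gserviceaccount.com"',   # literal quotes
    ]
    for value in candidates:
        print(repr(value), looks_like_service_account(value))
```

A value that passes this check can still be rejected for other reasons, which would be consistent with the suspicion above that the format message is masking a different problem.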
