# ask-community
m
Hi, we have some issues with the latest version of Dagster running on our ECS Cluster. I already updated the permissions to give dagster access to the secretsmanager, but now the daemon and dagit tasks won't start properly with the following messages:
```
2021-12-07 17:08:59 - SensorDaemon - ERROR - Sensor daemon caught an error for sensor Generic_Usage_Scenario_Sensor : grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "
{
    "created": "@1638896939.637237528",
    "description": "Failed to pick subchannel",
    "file": "src/core/ext/filters/client_channel/client_channel.cc",
    "file_line": 3093,
    "referenced_errors": [
        {
            "created": "@1638896939.637236219",
            "description": "failed to connect to all addresses",
            "file": "src/core/lib/transport/error_utils.cc",
            "file_line": 163,
            "grpc_status": 14
        }
    ]
}
"
>
Stack Trace:
  File "/usr/local/lib/python3.9/site-packages/dagster/daemon/sensor.py", line 191, in execute_sensor_iteration
    repo_location = workspace.get_location(origin)
  File "/usr/local/lib/python3.9/site-packages/dagster/core/workspace/dynamic_workspace.py", line 36, in get_location
    location = existing_location if existing_location else origin.create_location()
  File "/usr/local/lib/python3.9/site-packages/dagster/core/host_representation/origin.py", line 271, in create_location
    return GrpcServerRepositoryLocation(self)
  File "/usr/local/lib/python3.9/site-packages/dagster/core/host_representation/repository_location.py", line 495, in __init__
    list_repositories_response = sync_list_repositories_grpc(self.client)
  File "/usr/local/lib/python3.9/site-packages/dagster/api/list_repositories.py", line 14, in sync_list_repositories_grpc
    deserialize_json_to_dagster_namedtuple(api_client.list_repositories()),
  File "/usr/local/lib/python3.9/site-packages/dagster/grpc/client.py", line 163, in list_repositories
    res = self._query("ListRepositories", api_pb2.ListRepositoriesRequest)
  File "/usr/local/lib/python3.9/site-packages/dagster/grpc/client.py", line 110, in _query
    response = getattr(stub, method)(request_type(**kwargs), timeout=timeout)
  File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
```
And for the dagit service:
```
/usr/local/lib/python3.9/site-packages/dagster/core/workspace/context.py:538: UserWarning: Error loading repository location analytics-opal-pipelines:grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "
{
    "created": "@1638894475.880671826",
    "description": "Failed to pick subchannel",
    "file": "src/core/ext/filters/client_channel/client_channel.cc",
    "file_line": 3093,
    "referenced_errors": [
        {
            "created": "@1638894475.880670730",
            "description": "failed to connect to all addresses",
            "file": "src/core/lib/transport/error_utils.cc",
            "file_line": 163,
            "grpc_status": 14
        }
    ]
}
"
>
Stack Trace:
  File "/usr/local/lib/python3.9/site-packages/dagster/core/workspace/context.py", line 535, in _load_location
    location = self._create_location_from_origin(origin)
  File "/usr/local/lib/python3.9/site-packages/dagster/core/workspace/context.py", line 454, in _create_location_from_origin
    return origin.create_location()
  File "/usr/local/lib/python3.9/site-packages/dagster/core/host_representation/origin.py", line 271, in create_location
    return GrpcServerRepositoryLocation(self)
  File "/usr/local/lib/python3.9/site-packages/dagster/core/host_representation/repository_location.py", line 495, in __init__
    list_repositories_response = sync_list_repositories_grpc(self.client)
  File "/usr/local/lib/python3.9/site-packages/dagster/api/list_repositories.py", line 14, in sync_list_repositories_grpc
    deserialize_json_to_dagster_namedtuple(api_client.list_repositories()),
  File "/usr/local/lib/python3.9/site-packages/dagster/grpc/client.py", line 163, in list_repositories
    res = self._query("ListRepositories", api_pb2.ListRepositoriesRequest)
  File "/usr/local/lib/python3.9/site-packages/dagster/grpc/client.py", line 110, in _query
    response = getattr(stub, method)(request_type(**kwargs), timeout=timeout)
  File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
  warnings.warn(
Loading repository...
Serving on http://0.0.0.0:3000 in process 1
/usr/local/lib/python3.9/site-packages/dagster/core/workspace/context.py:538: UserWarning: Error loading repository location analytics-***-pipelines:grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
```
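For anyone hitting the same error: `StatusCode.UNAVAILABLE` with "failed to connect to all addresses" means the gRPC channel never connected at all, so the first thing to confirm is whether the user-code server is even listening. A minimal reachability probe sketch follows; the host name and port are assumptions based on the deploy_ecs example (which serves user code on port 4000), so adjust them to your task definitions:

```python
# Reachability probe for the repository location's gRPC server.
# Host and port are assumptions from the deploy_ecs example, which
# runs `dagster api grpc ... -p 4000`; adjust to your deployment.
import grpc

USER_CODE_HOST = "user_code"  # hypothetical service discovery name
USER_CODE_PORT = 4000

channel = grpc.insecure_channel(f"{USER_CODE_HOST}:{USER_CODE_PORT}")
try:
    # Blocks until the channel connects or the timeout elapses.
    grpc.channel_ready_future(channel).result(timeout=5)
    print("gRPC server is reachable")
except grpc.FutureTimeoutError:
    print("failed to connect: server down, or a networking/security-group issue")
finally:
    channel.close()
```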
m
@jordan
j
latest as in built from master, or latest as in 0.13.10?
m
0.13.10
j
are you in us-east-1 by chance?
m
no, eu-central-1
j
are there any logs in the repository location’s ecs service?
m
yes, i see a version mismatch warning 0.13.8 -> 0.13.10. Let me check that. I thought that i pinned everything to 0.13.*
j
Hm. Either way - I think the older repository location should still be able to work with the newer dagit/daemon unless a backwards compatibility bug was introduced. If you can though, let’s see what happens if all of the versions match.
m
```
/usr/local/lib/python3.9/site-packages/dagster/core/utils.py:78: UserWarning: Found version mismatch between `dagster` (0.13.8) and `dagster-aws` (0.13.10)
  warnings.warn(message)
```
j
To me, it doesn’t look like dagit or daemon are actually having trouble coming up - it looks like they’re having trouble communicating with the repository location’s grpc server. So either that server isn’t coming up correctly or something is off with the networking that’s preventing the other tasks from talking to it. Which should at least narrow down where we can look to fix the problem.
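One way to act on that narrowing: Dagster's own gRPC client can ping the user-code server directly, which separates "server never came up" from "network path blocked". This is a sketch assuming the 0.13.x module path (`dagster.grpc.client`, an internal rather than stable public API) and a hypothetical `user_code` host:

```python
# Sketch using Dagster's internal gRPC client (module path as of the
# 0.13.x line shown in the tracebacks above). Host/port are assumed.
from dagster.grpc.client import DagsterGrpcClient

client = DagsterGrpcClient(host="user_code", port=4000)
# ping() round-trips an echo string through the server; an exception
# here with StatusCode.UNAVAILABLE reproduces the daemon/dagit error.
print(client.ping("hello"))
```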
m
all 3 services are running in the same ECS cluster and are part of the same CF stack
j
Using the example here? https://github.com/dagster-io/dagster/blob/master/examples/deploy_ecs/README.md Or your own cloudformation stack?
m
based on that one, yes.
j
https://dagster.slack.com/archives/C01U954MEER/p1638898104477400?thread_ts=1638897222.475100&cid=C01U954MEER I stand corrected - it’s true that the repository location can have a different version of dagster than your dagit/daemon services. But your dagster-aws version needs to match your dagster version within that image. Can you try rebuilding your images to get dagster-aws in sync with dagster and redeploying?
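Before redeploying, the rebuilt image can be sanity-checked with a stdlib-only version audit (for example via `docker run --rm <image> python /path/to/check.py`). The package list here is an assumption; adjust it to whatever your image actually installs:

```python
# Stdlib-only audit that every installed dagster package shares one
# version (Python 3.8+). Package list is an assumption; extend as needed.
from importlib.metadata import PackageNotFoundError, version

packages = ["dagster", "dagster-aws", "dagster-graphql", "dagit"]
versions = {}
for pkg in packages:
    try:
        versions[pkg] = version(pkg)
    except PackageNotFoundError:
        versions[pkg] = "not installed"

print(versions)
installed = {v for v in versions.values() if v != "not installed"}
if len(installed) > 1:
    print("version mismatch: rebuild with matching pins before deploying")
```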
m
sure, right now i pinned everything to 0.13.9 to see if I can get it to run again
👍 1
a
did you see anything in the user_code service logs?
m
only the version mismatch, as discussed above
a
another possibility is desync between workspace.yaml and docker-compose.yml, if you’ve changed either
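A quick way to check for that kind of desync is to read workspace.yaml and try a plain TCP connect to every grpc_server entry it declares. A sketch assuming the standard `load_from` / `grpc_server` layout and that pyyaml is available:

```python
# Check that every grpc_server entry in workspace.yaml is reachable via
# plain TCP. Assumes the standard `load_from: - grpc_server: {host, port}`
# layout; requires pyyaml.
import socket

import yaml

with open("workspace.yaml") as f:
    workspace = yaml.safe_load(f)

for entry in workspace.get("load_from", []):
    grpc_server = entry.get("grpc_server")
    if not grpc_server:
        continue  # skip python_file / python_package entries, etc.
    host, port = grpc_server["host"], int(grpc_server["port"])
    try:
        with socket.create_connection((host, port), timeout=5):
            print(f"{host}:{port} reachable")
    except OSError as exc:
        print(f"{host}:{port} NOT reachable: {exc}")
```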
m
we only used docker-compose initially to create the CF template for us; since then we have been using CF exclusively. I have not touched the workspace.yaml since then (we initially deployed around version 0.12.4)
ok pinning it to 0.13.9 works at least. i will try to pin it to 0.13.10 tomorrow. I'll keep you guys updated.
🙌 2
a
huh - was the version mismatch warning the only thing in the logs? I’m surprised there wasn’t an error
m
in the user_code service logs, yes.
ack 2
I unpinned dagster now (0.13.*) and it uses 0.13.10 everywhere. Now it seems to work. I have no idea what went wrong earlier. AWS was behaving weirdly though. The user_code image was loaded very late, which might have caused that incompatible version issue (maybe an old image was used). Could be a side effect of the AWS problems in us-east-1
🤔 2
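If a stale image is the suspicion, the digest each running task actually pulled can be confirmed with boto3. A sketch with a placeholder cluster name; it needs credentials allowing `ecs:ListTasks` and `ecs:DescribeTasks`:

```python
# Confirm which image digest each running task actually pulled, to rule
# out a stale user_code image. Cluster name is a placeholder.
import boto3

CLUSTER = "my-dagster-cluster"  # placeholder

ecs = boto3.client("ecs")
task_arns = ecs.list_tasks(cluster=CLUSTER)["taskArns"]
if not task_arns:
    print("no running tasks")
else:
    for task in ecs.describe_tasks(cluster=CLUSTER, tasks=task_arns)["tasks"]:
        for container in task["containers"]:
            print(container["name"], container.get("image"), container.get("imageDigest"))
```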
j
i could believe that was an aws outage issue