# ask-community
a
Hi guys, I'm running Dagster locally for now on a VM as a systemd service. For some reason, it seems to work for a couple of hours and then every sensor call fails with
dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server. gRPC Error code: UNKNOWN

  File "/home/usr/.local/lib/python3.8/site-packages/dagster/_grpc/client.py", line 453, in start_run
    res = self._query(
  File "/home/usr/.local/lib/python3.8/site-packages/dagster/_grpc/client.py", line 157, in _query
    self._raise_grpc_exception(
  File "/home/usr/.local/lib/python3.8/site-packages/dagster/_grpc/client.py", line 140, in _raise_grpc_exception
    raise DagsterUserCodeUnreachableError(

The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNKNOWN
	details = "Exception calling application: [Errno 5] Input/output error"
	debug_error_string = "{"created":"@1680457553.144815667","description":"Error received from peer unix:/tmp/tmpqx4b0zzk","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Exception calling application: [Errno 5] Input/output error","grpc_status":2}"
>

  File "/home/usr/.local/lib/python3.8/site-packages/dagster/_grpc/client.py", line 155, in _query
    return self._get_response(method, request=request_type(**kwargs), timeout=timeout)
  File "/home/usr/.local/lib/python3.8/site-packages/dagster/_grpc/client.py", line 130, in _get_response
    return getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
  File "/home/usr/.local/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/usr/.local/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
I've tried changing the Python version, cleaning my venv, and trying to debug the Python install on my VM. Our transform stage works flawlessly on my local dev machine, but when deployed to the VM I seem to get this all the time. Any ideas as to what could be going wrong here? Has anyone seen something similar?
d
Hi Abhinav - what version of Dagster is this? It looks like the code server that Dagster spins up to run your code may have gone down. Some questions: Are there any logs in the process that might help explain what happened / why it crashed? If you reload the code location from the Code Locations tab in Dagit, does the problem go away? An additional option for more reliability here would be to run the code server in a Docker container, like in the example here: https://docs.dagster.io/deployment/guides/docker#deploying-dagster-to-docker - this lets you run the server in an entirely separate container that can be set up to automatically restart whenever it goes down
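(For context, the linked Docker guide boils down to something like the sketch below: the gRPC code server runs in its own container that Docker restarts if it crashes. Image names, the definitions file path, service names, and ports here are placeholders, not the guide's exact example.)
# docker-compose.yml (sketch; image names, file paths, and ports are placeholders)
version: "3.7"
services:
  user_code:
    # serves the Dagster definitions over gRPC and is restarted automatically if it crashes
    image: my_user_code_image
    command: ["dagster", "api", "grpc", "-h", "0.0.0.0", "-p", "4000", "-f", "definitions.py"]
    restart: always
  dagit:
    # webserver; assumes its workspace.yaml points at user_code:4000
    image: my_dagster_image
    command: ["dagit", "-h", "0.0.0.0", "-p", "3000", "-w", "workspace.yaml"]
    ports:
      - "3000:3000"
  daemon:
    # runs sensors/schedules and launches queued runs; shares the same workspace.yaml
    image: my_dagster_image
    command: ["dagster-daemon", "run"]
    restart: on-failure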
a
Hi Daniel, appreciate the quick response! Occasionally I get this warning that I think might be the culprit
venv/lib/python3.8/site-packages/dagster/_grpc/server.py:1293: UserWarning: GrpcServerProcess is being destroyed without signalling to server that it should shut down. This may result in server processes living longer than they need to. To fix this, wrap the GrpcServerProcess in a contextmanager or call shutdown_server on it
The version of dagster I'm running is
dagster --version
dagster, version 1.2.4
I'll try reloading the code location. I'm very new to Dagster, so I appreciate the suggestions, and I'll try deploying Dagster in Docker as well. Thank you!
d
That does look relevant, yeah - how frequently do you see that warning?
Are there any other errors or warnings in the logs just before that message?
a
I often see it quite a while into execution, usually after failed software-defined jobs, I think? There are no logs that I see as particularly relevant before then. On my VM, I see it immediately; on my local machine it seems to take a while for the service to degrade
But the pattern is definitely sporadic
d
Is there any chance your VM could be hitting a memory limit and processes could be getting killed?
a
No, my VM memory is massive (300 GB, 32 vCPUs) and utilization is nowhere near peak for memory or CPU. I use dagster.yaml to limit sensor concurrency to a max of 4 parallel workers. From my monitoring, it doesn't take more than 8 vCPUs at once
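(For reference, a cap like that usually lives in the run_coordinator block of dagster.yaml. A minimal sketch, assuming the limit being described is the run queue rather than anything sensor-specific:)
# dagster.yaml (sketch): cap the run queue at 4 concurrent runs
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 4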
d
Got it, makes sense
a
Oh, actually I think I tracked down when it happens
It seems to happen when my sensor returns an empty result
d
huh, every time?
a
From what I can tell? I'm not sure. My sensor code is very simple and follows your example pretty closely. I do use a resource in my sensor, like so:
from dagster import (
    JobDefinition,
    RunRequest,
    SensorDefinition,
    SensorEvaluationContext,
    build_resources,
    sensor,
)
from dagster_aws.s3.sensor import get_s3_keys

# s3_prod_resource and MY_DIRECTORY are defined elsewhere in this project

def make_s3_files_updated_sensor(job: JobDefinition) -> SensorDefinition:
    """Returns a sensor that launches the given job on s3 updates to a provided directory."""

    @sensor(name=f"{job.name}_on_files_updated", minimum_interval_seconds=300, job=job)
    def s3_files_updated_sensor(context: SensorEvaluationContext):
        since_key = context.cursor or None
        BUCKET_NAME = "<redacted>"
        with build_resources({"s3": s3_prod_resource}) as resources:
            new_s3_keys = get_s3_keys(
                BUCKET_NAME, prefix=MY_DIRECTORY, since_key=since_key, s3_session=resources.s3
            )
        context.log.info(f"new_s3_keys: {len(new_s3_keys)}")
        for key in filter(lambda k: k.endswith(".hyper"), new_s3_keys):
            yield RunRequest(
                run_key=key, run_config={"ops": {"upload_to_db": {"config": {"filename": key}}}}
            )
            context.update_cursor(key)

    return s3_files_updated_sensor
d
What's making you draw that connection? Just the times matching up? Or is there some other error/warning output in the logs related to an empty result?
a
Just the times matching up. I got the warning on my local machine after our incoming S3 stream was exhausted. I stopped it and turned it on on my VM, and immediately saw it give me the same warning
d
When you say it happens on your VM immediately - you mean that as soon as you deploy, you start getting the DagsterUserCodeUnreachableError?
my best guess so far is still that something in either the job or the sensor is somehow causing the whole process to crash (which isolating things in Docker can definitely help with) - but usually there would be some kind of telltale log line when the process crashes
a
Should I catch every error? This is only for the warning, btw; our queue is empty, so we aren't running any jobs. Scrolling up, I do see
upload_to_db - STEP_FAILURE - Execution of step "upload_to_db" failed.

dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "upload_to_db"::

botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found
This is because I retried a job but forgot it had already left our job queue.
d
Your ops raising an exception wouldn't cause a problem like this, no
Ah I think I see what's causing that warning, and unfortunately I don't think it's related
I'd be curious if this problem goes away for you if you try dagster 1.2.3 rather than 1.2.4 though
a
Great! I'll try that and report back. Thanks so much for your time. This is really just for our early evaluation; we plan on moving forward with Dagster and deploying it on our internal k8s cluster, with dagit + daemon and the Dask-k8s (or celery-k8s) executors, so we plan to migrate off this setup very soon.
d
No problem - this has been helpful. I'm actually thinking that this part of your error report may be relevant:
details = "Exception calling application: [Errno 5] Input/output error"
I don't see that in general when the code server becomes unavailable. You mentioned the VM has plenty of resources, but I wonder if it could be running out of disk or hitting some other I/O-related limit?
a
I *think* I resolved the problem. I believe it was caused by a conflicting grpcio-health-checking dependency. When inspecting my pip env again, I came across the following error
ERROR: grpcio-health-checking 1.53.0 has requirement grpcio>=1.53.0, but you'll have grpcio 1.47.5 which is incompatible.
From there I rm -rf'd my pip env and did the following
python -m venv venv && source venv/bin/activate && pip install -U pip wheel setuptools && pip install dagster dagit dagster-slack dagster-aws dagster-postgres
The grpcio dep seems to have been the offender
d
Huh... that does sound like a good fix, but I'm having trouble understanding how it would cause sporadic crashes
I would expect that to prevent you from installing the library at all if it were a problem
a
I concur; this is just my current hypothesis. I'll report back if I see the error pop up, but this seemed to quell the warnings I was getting.
The above did not work; I am still seeing the same warning as before.
d
I just sent out a fix for the warning - I don’t think it’s related to the problem that you’re seeing
u
Hello, I was wondering if there is a fix for this? I am seeing the same error message. I have Dagster deployed in Docker. When I run a job through Dagit or through a Dagit sensor test it works fine, but when the daemon runs it, it always fails with DagsterUserCodeUnreachableError: Could not reach user code server. gRPC Error code: UNKNOWN. Under the exception details it says [Errno 5] Input/output error, grpc_status: 2. Please let me know how I can resolve this issue
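(One thing worth double-checking in a Docker deployment like this: the daemon container has to load the same workspace.yaml as Dagit and be able to resolve the code server's host on the shared network. A minimal sketch, with a hypothetical service name:)
# workspace.yaml (sketch): both dagit and dagster-daemon read this, so the host
# must be reachable from the daemon's container as well (a service name, not localhost)
load_from:
  - grpc_server:
      host: user_code          # hypothetical docker-compose service name
      port: 4000
      location_name: "my_code_location"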
d
Would it be possible to make a new post for this issue with all the details?
u
Sure