# ask-community
a
Hi guys, I'm running Dagster locally for now on a VM as a systemd service. For some reason, it seems to work for a couple of hours and then every sensor call fails with
dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server. gRPC Error code: UNKNOWN

  File "/home/usr/.local/lib/python3.8/site-packages/dagster/_grpc/client.py", line 453, in start_run
    res = self._query(
  File "/home/usr/.local/lib/python3.8/site-packages/dagster/_grpc/client.py", line 157, in _query
    self._raise_grpc_exception(
  File "/home/usr/.local/lib/python3.8/site-packages/dagster/_grpc/client.py", line 140, in _raise_grpc_exception
    raise DagsterUserCodeUnreachableError(

The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNKNOWN
	details = "Exception calling application: [Errno 5] Input/output error"
	debug_error_string = "{"created":"@1680457553.144815667","description":"Error received from peer unix:/tmp/tmpqx4b0zzk","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Exception calling application: [Errno 5] Input/output error","grpc_status":2}"
>

  File "/home/usr/.local/lib/python3.8/site-packages/dagster/_grpc/client.py", line 155, in _query
    return self._get_response(method, request=request_type(**kwargs), timeout=timeout)
  File "/home/usr/.local/lib/python3.8/site-packages/dagster/_grpc/client.py", line 130, in _get_response
    return getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
  File "/home/usr/.local/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/usr/.local/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
I've tried changing the Python version, cleaning my venv, and trying to debug the Python install on my VM. Our transform stage works flawlessly on my local dev machine, but when deployed to the VM I seem to get this all the time. Any ideas as to what could be going wrong here? Has anyone seen something similar?
d
Hi Abhinav - what version of Dagster is this? It looks like the code server that Dagster spins up to run your code may have gone down. Some questions: Are there any logs in the process that might help explain what happened / why it crashed? If you reload the code location from the Code Locations tab in Dagit, does the problem go away? An additional option for more reliability here would be to run the code server in a Docker container, like in the example here: https://docs.dagster.io/deployment/guides/docker#deploying-dagster-to-docker - this lets you run the server in an entirely separate container that can be set up to automatically restart whenever it goes down
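(For context, the linked Docker guide boils down to something like the sketch below: the gRPC code server runs in its own container that Docker restarts if it crashes. Image names, the definitions file path, service names, and ports here are placeholders, not the guide's exact example.)
# docker-compose.yml (sketch; image names, file paths, and ports are placeholders)
version: "3.7"
services:
  user_code:
    # serves the Dagster definitions over gRPC and is restarted automatically if it crashes
    image: my_user_code_image
    command: ["dagster", "api", "grpc", "-h", "0.0.0.0", "-p", "4000", "-f", "definitions.py"]
    restart: always
  dagit:
    # webserver; assumes its workspace.yaml points at user_code:4000
    image: my_dagster_image
    command: ["dagit", "-h", "0.0.0.0", "-p", "3000", "-w", "workspace.yaml"]
    ports:
      - "3000:3000"
  daemon:
    # runs sensors/schedules and launches queued runs; shares the same workspace.yaml
    image: my_dagster_image
    command: ["dagster-daemon", "run"]
    restart: on-failure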
a
Hi Daniel, appreciate the quick response! Occasionally I get this warning that I think might be the culprit
venv/lib/python3.8/site-packages/dagster/_grpc/server.py:1293: UserWarning: GrpcServerProcess is being destroyed without signalling to server that it should shut down. This may result in server processes living longer than they need to. To fix this, wrap the GrpcServerProcess in a contextmanager or call shutdown_server on it
The version of dagster I'm running is
dagster --version
dagster, version 1.2.4
I'll try reloading the code location. I'm very new to Dagster, so I appreciate the suggestions, and I'll try deploying Dagster in Docker as well. Thank you!
d
That does look relevant, yeah - how frequently do you see that warning?
Are there any other errors or warnings in the logs just before that message?
a
I often see it quite a while into execution, usually after failed software-defined jobs, I think? There are no logs that I see as particularly relevant before then. On my VM, I see it immediately; on my local machine it seems to take a while for the service to degrade
But the pattern is definitely sporadic
d
Is there any chance your VM could be hitting a memory limit and processes could be getting killed?
a
No, my VM memory is massive (300 GB, 32 vCPUs) and utilization is nowhere near peak for memory or CPU. I use dagster.yaml to limit sensor concurrency to a max of 4 parallel workers. From my monitoring, it doesn't take more than 8 vCPUs at once
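(For reference, a cap like that usually lives in the run_coordinator block of dagster.yaml. A minimal sketch, assuming the limit being described is the run queue rather than anything sensor-specific:)
# dagster.yaml (sketch): cap the run queue at 4 concurrent runs
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 4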
d
Got it, makes sense
a
Oh, actually I think I tracked down when it happens
It seems to happen when my sensor returns an empty result
d
huh, every time?
a
From what I can tell? I'm not sure. My sensor code is very simple and follows your example pretty closely. I do use a resource in my sensor, like so:
from dagster import (
    JobDefinition,
    RunRequest,
    SensorDefinition,
    SensorEvaluationContext,
    build_resources,
    sensor,
)
from dagster_aws.s3.sensor import get_s3_keys

# s3_prod_resource and MY_DIRECTORY are defined elsewhere in this project

def make_s3_files_updated_sensor(job: JobDefinition) -> SensorDefinition:
    """Returns a sensor that launches the given job on s3 updates to a provided directory."""

    @sensor(name=f"{job.name}_on_files_updated", minimum_interval_seconds=300, job=job)
    def s3_files_updated_sensor(context: SensorEvaluationContext):
        since_key = context.cursor or None
        BUCKET_NAME = "<redacted>"
        with build_resources({"s3": s3_prod_resource}) as resources:
            new_s3_keys = get_s3_keys(
                BUCKET_NAME, prefix=MY_DIRECTORY, since_key=since_key, s3_session=resources.s3
            )
        context.log.info(f"new_s3_keys: {len(new_s3_keys)}")
        for key in filter(lambda k: k.endswith(".hyper"), new_s3_keys):
            yield RunRequest(
                run_key=key, run_config={"ops": {"upload_to_db": {"config": {"filename": key}}}}
            )
            context.update_cursor(key)

    return s3_files_updated_sensor
d
What's making you draw that connection? Just the times matching up? Or is there some other error/warning output in the logs related to an empty result?
a
Just the times matching up. I got the warning on my local machine after our incoming S3 stream was exhausted. I stopped it and turned it on on my VM, and immediately saw it give me the same warning
d
When you say it happens on your VM immediately - you mean that as soon as you deploy, you start getting the DagsterUserCodeUnreachableError?
my best guess so far is still that something in either the job or the sensor is somehow causing the whole process to crash (which isolating things in Docker can definitely help with) - but usually there would be some kind of telltale log line when the process crashes
a
Should I catch every error? This is only for the warning, btw; our queue is empty, so we aren't running any jobs. Scrolling up, I do see
upload_to_db - STEP_FAILURE - Execution of step "upload_to_db" failed.

dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "upload_to_db"::

botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found
This is because I retried a job but forgot it had already left our job queue.
d
Your ops raising an exception wouldn't cause a problem like this, no
Ah I think I see what's causing that warning, and unfortunately I don't think it's related
I'd be curious if this problem goes away for you if you try dagster 1.2.3 rather than 1.2.4 though
a
Great! I'll try that and report back. Thanks so much for your time. This is really just for our early evaluation; we plan on moving forward with Dagster and deploying it on our internal k8s cluster, with dagit + daemon and the Dask-k8s (or celery-k8s) executors, so we plan to migrate off this setup very soon.
d
No problem - this has been helpful. I'm actually thinking that this part of your error report may be relevant:
details = "Exception calling application: [Errno 5] Input/output error"
I don't see that in general when the code server becomes unavailable. You mentioned the VM has plenty of resources, but I wonder if it could be running out of disk or hitting some other I/O-related limit?
a
I *think* I resolved the problem. I believe it was caused by a conflicting grpcio-health-checking dependency. When inspecting my pip env again, I came across the following error
ERROR: grpcio-health-checking 1.53.0 has requirement grpcio>=1.53.0, but you'll have grpcio 1.47.5 which is incompatible.
From there I rm -rf'd my pip env and did the following
python -m venv venv && source venv/bin/activate && pip install -U pip wheel setuptools && pip install dagster dagit dagster-slack dagster-aws dagster-postgres
The grpcio dep seems to have been the offender
d
Huh... that does sound like a good fix, but I'm having trouble understanding how it would cause sporadic crashes
I would expect that to prevent you from installing the library at all if it were a problem
a
I concur; this is just my current hypothesis. I'll report back if I see the error pop up, but this seemed to quell the warnings I was getting.
The above did not work; I am still seeing the same warning as before.
d
I just sent out a fix for the warning - I don’t think it’s related to the problem that you’re seeing
u
Hello, I was wondering if there is a fix for this? I am seeing the same error message. I have Dagster deployed in Docker. When I run a job through Dagit or through a Dagit sensor test it works fine, but when the daemon runs it, it always fails with DagsterUserCodeUnreachableError: Could not reach user code server. gRPC Error code: UNKNOWN. Under the exception details it says [Errno 5] Input/output error, grpc_status: 2. Please let me know how I can resolve this issue
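(One thing worth double-checking in a Docker deployment like this: the daemon container has to load the same workspace.yaml as Dagit and be able to resolve the code server's host on the shared network. A minimal sketch, with a hypothetical service name:)
# workspace.yaml (sketch): both dagit and dagster-daemon read this, so the host
# must be reachable from the daemon's container as well (a service name, not localhost)
load_from:
  - grpc_server:
      host: user_code          # hypothetical docker-compose service name
      port: 4000
      location_name: "my_code_location"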
d
Would it be possible to make a new post for this issue with all the details?
u
Sure