Hey All, moving from docker deployment to PEX depl...
# ask-community
z
Hey all, I’m moving from Docker deployment to PEX deployment for serverless. However, it’s blowing up 😧 with a weird error. Any help would be appreciated. I’ve already tried tweaking it quite a bit, but I’m not great at devops 😅
github_event = github_context.get_github_event(project_dir)
  File "/home/runner/.pex/unzipped_pexes/2e536a00a742406eeee720e381d89289fe105add/builder/github_context.py", line 92, in get_github_event
    return GithubEvent(project_dir)
  File "/home/runner/.pex/unzipped_pexes/2e536a00a742406eeee720e381d89289fe105add/builder/github_context.py", line 59, in __init__
    self.timestamp = float(git_metadata["timestamp"])
ValueError: could not convert string to float: ''
Error: Failed to deploy Python Executable. Try disabling fast deploys by setting `ENABLE_FAST_DEPLOYS: 'false'` in your .github/workflows/*yml.
Error: Process completed with exit code 1.
build yaml:
name: Serverless Branch Deployments
on:
  pull_request:
    types: [opened, synchronize, reopened, closed]
    paths-ignore:
      - 'infrastructure/**'
concurrency:
  # Cancel in-progress runs on same branch
  group: ${{ github.ref }}
  cancel-in-progress: true

env:
  DAGSTER_CLOUD_URL: ${{ secrets.DAGSTER_CLOUD_URL }}
  ENABLE_FAST_DEPLOYS: "true"

jobs:
  parse_workspace:
    runs-on: ubuntu-latest
    outputs:
      build_info: ${{ steps.parse-workspace.outputs.build_info }}
      secrets_set: ${{ steps.parse-workspace.outputs.secrets_set }}
    steps:
      - uses: actions/checkout@v3
      - name: Parse cloud workspace
        id: parse-workspace
        uses: dagster-io/dagster-cloud-action/actions/utils/parse_workspace@v0.1
        with:
          dagster_cloud_file: dagster_cloud.yaml
      - name: Install Poetry
        run: pipx install poetry
      - uses: actions/setup-python@v4
        with:
          python-version: '3.8'
          cache: 'poetry'
      - run: poetry install
      - name: Run tests
        run: poetry run pytest --durations=5

  dagster_cloud_build_push:
    runs-on: ubuntu-latest
    needs: parse_workspace
    name: Dagster Serverless Deploy
    strategy:
      fail-fast: false
      matrix:
        location: ${{ fromJSON(needs.parse_workspace.outputs.build_info) }}
    steps:
      - name: Checkout
        uses: actions/checkout@v3
        with:
          ref: ${{ github.sha }}
      - name: Build and deploy Python executable
        if: env.ENABLE_FAST_DEPLOYS == 'true'
        uses: dagster-io/dagster-cloud-action/actions/build_deploy_python_executable@pex-v0.1
        with:
          dagster_cloud_file: "$GITHUB_WORKSPACE/dagster_cloud.yaml"
          build_output_dir: "$GITHUB_WORKSPACE/build"
          python_version: "3.8"
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
Full error message:
Running ['/home/runner/work/_actions/dagster-io/dagster-cloud-action/pex-v0.1/generated/gha/builder.pex', '-m', 'builder.deploy', '/home/runner/work/company-dagster/company-dagster/dagster_cloud.yaml', '/home/runner/work/company-dagster/company-dagster/build', '--python-version=3.8', '--upload-pex', '--update-code-location', '--deps-cache-from=org/company-dagster/main', '--no-build-sdists']
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/runpy.py:127: RuntimeWarning: 'builder.deploy' found in sys.modules after import of package 'builder', but prior to execution of 'builder.deploy'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
ERROR:root:git command failed: b''
b'fatal: bad object b3faf5142242a23a953bc672f51b6d1ca2ea2cdb\n'
ERROR:root:git command failed: b''
b'fatal: bad object b3faf5142242a23a953bc672f51b6d1ca2ea2cdb\n'
ERROR:root:git command failed: b''
b'fatal: bad object b3faf5142242a23a953bc672f51b6d1ca2ea2cdb\n'
ERROR:root:git command failed: b''
b'fatal: bad object b3faf5142242a23a953bc672f51b6d1ca2ea2cdb\n'
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/runner/.pex/unzipped_pexes/2e536a00a742406eeee720e381d89289fe105add/__main__.py", line 103, in <module>
    bootstrap_pex(__entry_point__, execute=__execute__, venv_dir=__venv_dir__)
  File "/home/runner/.pex/unzipped_pexes/2e536a00a742406eeee720e381d89289fe105add/.bootstrap/pex/pex_bootstrapper.py", line 599, in bootstrap_pex
    pex.PEX(entry_point).execute()
  File "/home/runner/.pex/unzipped_pexes/2e536a00a742406eeee720e381d89289fe105add/.bootstrap/pex/pex.py", line 551, in execute
    sys.exit(self._wrap_coverage(self._wrap_profiling, self._execute))
  File "/home/runner/.pex/unzipped_pexes/2e536a00a742406eeee720e381d89289fe105add/.bootstrap/pex/pex.py", line 458, in _wrap_coverage
    return runner(*args)
  File "/home/runner/.pex/unzipped_pexes/2e536a00a742406eeee720e381d89289fe105add/.bootstrap/pex/pex.py", line 489, in _wrap_profiling
    return runner(*args)
  File "/home/runner/.pex/unzipped_pexes/2e536a00a742406eeee720e381d89289fe105add/.bootstrap/pex/pex.py", line 572, in _execute
    return self.execute_interpreter()
  File "/home/runner/.pex/unzipped_pexes/2e536a00a742406eeee720e381d89289fe105add/.bootstrap/pex/pex.py", line 657, in execute_interpreter
    return self.execute_module(module)
  File "/home/runner/.pex/unzipped_pexes/2e536a00a742406eeee720e381d89289fe105add/.bootstrap/pex/pex.py", line 783, in execute_module
    runpy.run_module(module_name, run_name="__main__", alter_sys=True)
  File "/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/runpy.py", line 207, in run_module
    return _run_module_code(code, init_globals, run_name, mod_spec)
  File "/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/runner/.pex/unzipped_pexes/2e536a00a742406eeee720e381d89289fe105add/builder/deploy.py", line 530, in <module>
    cli()
  File "/home/runner/.pex/installed_wheels/78086359bc4a576338dbcaacad4a42784cdd0755b6327b984812fe0913265abf/click-8.1.3-py3-none-any.whl/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/runner/.pex/installed_wheels/78086359bc4a576338dbcaacad4a42784cdd0755b6327b984812fe0913265abf/click-8.1.3-py3-none-any.whl/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/runner/.pex/installed_wheels/78086359bc4a576338dbcaacad4a42784cdd0755b6327b984812fe0913265abf/click-8.1.3-py3-none-any.whl/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/runner/.pex/installed_wheels/78086359bc4a576338dbcaacad4a42784cdd0755b6327b984812fe0913265abf/click-8.1.3-py3-none-any.whl/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/runner/.pex/unzipped_pexes/2e536a00a742406eeee720e381d89289fe105add/builder/deploy.py", line 313, in cli
    deploy_main(
  File "/home/runner/.pex/unzipped_pexes/2e536a00a742406eeee720e381d89289fe105add/builder/deploy.py", line 361, in deploy_main
    load_github_event(os.path.dirname(dagster_cloud_file))
  File "/home/runner/.pex/unzipped_pexes/2e536a00a742406eeee720e381d89289fe105add/builder/deploy.py", line 219, in load_github_event
    github_event = github_context.get_github_event(project_dir)
  File "/home/runner/.pex/unzipped_pexes/2e536a00a742406eeee720e381d89289fe105add/builder/github_context.py", line 92, in get_github_event
    return GithubEvent(project_dir)
  File "/home/runner/.pex/unzipped_pexes/2e536a00a742406eeee720e381d89289fe105add/builder/github_context.py", line 59, in __init__
    self.timestamp = float(git_metadata["timestamp"])
ValueError: could not convert string to float: ''
I’ve tried:
• Removing the poetry install & pytest steps
• Changing the dagster_cloud_file around
• Swapping Python versions
• Digging around the dagster actions repo, but I wasn’t able to find anything obvious.
Could there be something weird going on since I used to use the non-pex deploy? Thanks again in advance!
s
Hi Zach, the non-pex deploy should not change how the pex deploy works. One issue I see is that the deploy script is unable to run `git` to get the timestamp of the latest commit in the checked out directory:
ERROR:root:git command failed: b''
b'fatal: bad object b3faf5142242a23a953bc672f51b6d1ca2ea2cdb\n'
The git command that is failing is `git -C {project_dir} log -1 --format=%cd --date=unix`. I'm not sure why this fails though. I looked at your workflow definition and it looks slightly different than our quickstart repo. Specifically, the `actions/checkout` for the fast deploys uses a `project-repo/` subdirectory in our example. I suspect we might depend on this subdirectory even though we shouldn't. Could you please try the following:
1. Set `runs-on: ubuntu-20.04` instead of `ubuntu-latest`.
2. If that doesn't help, add `path: project-repo` to your `actions/checkout` step and also adjust the dagster_cloud_file to include it.
Let me know if neither of these work.
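To make the failure mode concrete: when the `git log` command fails, its stdout is empty, and `float('')` raises the `ValueError` seen in the traceback. The following is a minimal sketch, assuming the git command quoted in this message; `parse_git_timestamp` and `get_commit_timestamp` are hypothetical helper names, not the builder's actual code.

```python
import subprocess
from typing import Optional


def parse_git_timestamp(raw: str) -> Optional[float]:
    """Convert the output of `git log -1 --format=%cd --date=unix` to a float.

    Returns None instead of raising when git printed nothing, which is
    what happens when the commit object is missing from the checkout.
    """
    text = raw.strip()
    if not text:
        return None
    return float(text)


def get_commit_timestamp(project_dir: str) -> Optional[float]:
    """Run the git command from the thread in project_dir (hypothetical helper)."""
    result = subprocess.run(
        ["git", "-C", project_dir, "log", "-1", "--format=%cd", "--date=unix"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        # A message such as "fatal: bad object ..." goes to stderr; stdout stays empty.
        return None
    return parse_git_timestamp(result.stdout)
```

With a guard like this, a missing commit would surface as a `None` timestamp that the caller can report clearly, rather than as a conversion error several frames away from the real cause.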
z
Thanks for the response Shalabh, sadly I’m still having some issues:
• Changing to ubuntu-20.04 did not seem to change anything.
• Checking out to `project-repo` and using this seems to have worked. It’s also possible that I was using `github.sha` instead of `github.head_ref`? I swapped to sha originally while using the docker deployment method, since the workflow we have for PRs often leads to head_refs being moved/deleted!
However, now the pex builds start but fail. Error here:
INFO:root:Running ['/home/runner/work/_actions/dagster-io/dagster-cloud-action/pex-v0.1/generated/gha/builder.pex', '/home/runner/work/_actions/dagster-io/dagster-cloud-action/pex-v0.1/src/create_or_update_comment.py'] in '/home/runner/work/company-dagster/company-dagster/project-repo'

Error: Some locations failed to load after being synced by the agent:

Error loading my_code_location: {'__typename': 'PythonError', 'message': 'dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server\n', 'stack': [' File "/dagster-cloud/dagster_cloud/workspace/user_code_launcher/user_code_launcher.py", line 1166, in _reconcile\n new_dagster_servers[to_update_key] = self._start_new_dagster_server(\n', ' File "/dagster-cloud/dagster_cloud/workspace/user_code_launcher/user_code_launcher.py", line 1440, in _start_new_dagster_server\n self._create_pex_server(deployment_name, location_name, desired_entry, multipex_server)\n', ' File "/dagster-cloud/dagster_cloud/workspace/user_code_launcher/user_code_launcher.py", line 1418, in _create_pex_server\n multipex_client.create_pex_server(\n', ' File "/dagster-cloud/dagster_cloud/pex/grpc/client.py", line 37, in create_pex_server\n res = self._query(\n', ' File "/dagster-cloud/dagster_cloud/pex/grpc/client.py", line 91, in _query\n raise DagsterUserCodeUnreachableError("Could not reach user code server") from e\n']}

ERROR:root:Error updating code location 'my_code_location'
company Traceback (most recent call last):

File "/home/runner/.pex/unzipped_pexes/2e536a00a742406eeee720e381d89289fe105add/builder/deploy.py", line 508, in run_code_location_update

code_location.wait_for_load(

File "/home/runner/.pex/unzipped_pexes/2e536a00a742406eeee720e381d89289fe105add/builder/code_location.py", line 38, in wait_for_load

workspace.wait_for_load(

File "/home/runner/.pex/installed_wheels/ddcba3add552a0aa584b2d01012f144f5aa648f8d1df736ef9205acc482bd9a7/dagster_cloud_cli-1.1.9-py3-none-any.whl/dagster_cloud_cli/commands/workspace/__init__.py", line 183, in wait_for_load

raise ui.error(error_string)

click.exceptions.Exit

INFO:root:Running ['/home/runner/work/_actions/dagster-io/dagster-cloud-action/pex-v0.1/generated/gha/builder.pex', '/home/runner/work/_actions/dagster-io/dagster-cloud-action/pex-v0.1/src/create_or_update_comment.py'] in '/home/runner/work/company-dagster/company-dagster/project-repo'
s
Hi Zach,
"It’s also possible that I was using `github.sha` instead of `github.head_ref`?"
Yes, it is possible: the checkout action does a shallow clone, so the latest commit may not have been available. I suggest you leave `ubuntu-20.04` in there because the workflows are less complex for that version. Still, this is progress: it appears the code got built and uploaded but didn't work as expected in dagster cloud. I have a few questions:
• Are you deploying this in a PR?
• If possible, can you share your requirements.txt and setup.py? (I am only interested in the dependencies list.)
I will also look at our server logs to see why this might be failing.
Oddly I don't see any errors in the server logs yet and the tasks seem to be up, but they took longer than expected to spin up. You could try clicking Redeploy for these code locations in dagster cloud. Let's try this before troubleshooting the dependencies.
z
• Yes, this is in a branch PR.
• We don’t have a setup.py; our workflow for the time being is poetry -> export to requirements.txt. I’ll share this in the next message due to the length.
• Strangely, I was able to get them to load in the branch deployment after reloading the code location. I got another error message that may be useful:
dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server
  File "/dagster-cloud/dagster_cloud/workspace/user_code_launcher/user_code_launcher.py", line 1166, in _reconcile
    new_dagster_servers[to_update_key] = self._start_new_dagster_server(
  File "/dagster-cloud/dagster_cloud/workspace/user_code_launcher/user_code_launcher.py", line 1440, in _start_new_dagster_server
    self._create_pex_server(deployment_name, location_name, desired_entry, multipex_server)
  File "/dagster-cloud/dagster_cloud/workspace/user_code_launcher/user_code_launcher.py", line 1418, in _create_pex_server
    multipex_client.create_pex_server(
  File "/dagster-cloud/dagster_cloud/pex/grpc/client.py", line 37, in create_pex_server
    res = self._query(
  File "/dagster-cloud/dagster_cloud/pex/grpc/client.py", line 91, in _query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e
The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with: status = StatusCode.DEADLINE_EXCEEDED details = "Deadline Exceeded" debug_error_string = "{"created":"@1675266007.505295068","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}" >
  File "/dagster-cloud/dagster_cloud/pex/grpc/client.py", line 88, in _query
    response = getattr(stub, method)(request_type(**kwargs), timeout=timeout)
  File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
I just re-ran the entire GHA workflow and it successfully finished 🎉
s
"I just re-ran the entire GHA workflow and it successfully finished"
Excellent! I think the earlier failure was due to the AWS task spin-up taking longer than expected. Once you see a successful deployment you can ignore the earlier `Could not reach user code server` error. You could try committing a code change to see fast re-deploys in action. Let us know if you encounter other issues with the pex deploys.
z
Will do, thanks!
👋 I think I just found a bug / feature need when I tried to deploy this into prod. Weirdly, it didn’t happen in the branch deployment; I’m trying to sort out why it passed in branch but not in main. While trying to deploy to main we got an error:
File "/home/runner/work/company-dagster/company-dagster/build/.pex/venvs/b18ced93454e9fc173e35b9c997a5357c6ff5a51/d28e4f77ea23cfee2f7fe16fae92e311e742545c/lib/python3.8/site-packages/pip/_vendor/packaging/markers.py", line 215, in _get_env
      raise UndefinedEnvironmentName(
  pip._vendor.packaging.markers.UndefinedEnvironmentName: 'python_full_version' does not exist in evaluation environment.
  Failed to resolve for platform manylinux2014_x86_64-cp-38-cp38. Resolve requires evaluation of unknown environment marker: 'python_full_version' does not exist in evaluation environment.
This seems to be a known issue with pex. A quote from their documentation reads:
"Constraints: when `--platform` is used the environment marker `python_full_version` will not be available if `PYVER` is not given as a three component dotted version since `python_full_version` is meant to have 3 digits (e.g., `3.8.10`). If a `python_full_version` environment marker is encountered during a resolve, an `UndefinedEnvironmentName` exception will be raised. To remedy this, either specify the full version in the platform (e.g, `linux_x86_64-cp-3.8.10-cp38`) or use `--complete-platform` instead."
However, as seen above, our requirements.txt makes heavy use of `python_full_version`. Ideally, this would continue to be supported.
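The constraint in the quoted documentation can be illustrated with a small check. This is only a sketch of the rule as the quote states it, not pex's own parsing code; `has_full_python_version` is a hypothetical helper, and it assumes the `<platform>-<impl>-<PYVER>-<abi>` layout seen in the two platform strings that appear in this thread.

```python
def has_full_python_version(platform_tag: str) -> bool:
    """Return True if the PYVER field of a --platform string is a
    three-component dotted version such as 3.8.10.

    Assumes the layout <platform>-<impl>-<PYVER>-<abi>, so PYVER is the
    second-to-last dash-separated field.
    """
    fields = platform_tag.split("-")
    if len(fields) < 4:
        raise ValueError(f"unexpected platform string: {platform_tag!r}")
    pyver = fields[-2]
    parts = pyver.split(".")
    return len(parts) == 3 and all(part.isdigit() for part in parts)


# Platform from the error message: PYVER is "38", so python_full_version
# cannot be evaluated during the resolve.
print(has_full_python_version("manylinux2014_x86_64-cp-38-cp38"))

# Example from the quoted documentation: PYVER is "3.8.10".
print(has_full_python_version("linux_x86_64-cp-3.8.10-cp38"))
```

The first call reports that the platform used in the failing build carries only a two-digit `PYVER`, which matches the `UndefinedEnvironmentName` error above.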
s
Hi Zach - noted the feature request and thanks for the very informative report. I will look into changing the `--platform` - we're extra careful with that because it may break builds. I'm also surprised this worked in your branch. Did you have a similar `requirements.txt` in your branch when you deployed? (Can poetry export with just `python_version`?)
z
Yeah, makes sense. The requirements.txt should be the same. I’m working on getting a workaround deployed that just uses an extra pre-commit step on requirements.txt to strip the extra information so we just have package[features]<compare><version>. Poetry does not seem to have a way to edit export behavior, except for adding or removing hashes (which were already removed). The pre-commit hook looks like this if anyone else has this issue as well:
-   repo: local
    hooks:
    # Needed due to a current issue in the pex command used by dagster.
    - id: strip_extra_info_from_requirements
      name: Strip extra info from requirements.txt
      entry: .strip_extra_info_from_reqs.sh
      language: script
      pass_filenames: false
.strip_extra_info_from_reqs.sh:
#!/bin/bash
# Get a new requirements file
poetry export --without-hashes --format=requirements.txt -o requirements.txt

# Remove Windows dependencies; we don't support Windows.
grep -v "Windows" requirements.txt > requirements2.txt && mv requirements2.txt requirements.txt

# Remove 'extra info' (environment markers): keep only the text before the first ';' or space
grep -Eo '^([^;\ ]+)' requirements.txt > requirements2.txt && mv requirements2.txt requirements.txt
(EDIT: Had to add the step that removes Windows dependencies.) I still need to deploy all of this to make sure it works. It’s probably not the best solution, but it will get our pipeline working again.
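For anyone who prefers to do the same stripping in Python, here is a rough equivalent of the two grep steps in the script. It is a sketch rather than a drop-in replacement for the hook; `strip_markers` and the sample requirement lines are made up for illustration.

```python
import re
from typing import Iterable, List

# Leading run of characters up to the first ';', backslash, or space,
# mirroring the grep -Eo pattern used in the shell script.
_REQUIREMENT_PREFIX = re.compile(r"^[^;\\ ]+")


def strip_markers(lines: Iterable[str]) -> List[str]:
    """Drop Windows-only requirements and remove environment markers."""
    stripped = []
    for line in lines:
        if "Windows" in line:
            continue  # same effect as grep -v "Windows"
        match = _REQUIREMENT_PREFIX.match(line.rstrip("\n"))
        if match:
            stripped.append(match.group(0))
    return stripped


# Hypothetical lines in the style that `poetry export` produces.
sample = [
    'requests[security]==2.28.2 ; python_full_version >= "3.8.0"\n',
    'colorama==0.4.6 ; platform_system == "Windows"\n',
]
print(strip_markers(sample))
```

Like the shell version, this discards every marker, so it is only safe when all deployments target a single platform and Python version.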
s
Did you switch to `ubuntu-20.04` in your main branch?
z
hmm, seems like I did forget to edit the deploy.yml along with the branch_deployments.yaml for that change, let me see if that changes anything
Okay, after way too much work with github actions, I’ve finally got this part working. It seems like now, after adding the strip function, the only remaining issue is that I get the `dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server` error that results in failing to update code locations. Manually reloading these code locations in dagster cloud, then manually rerunning the github actions deploy, fixes the issue. I’m hoping this happens less in the future, but I’ll keep an eye on it over the next few days.
s
Thanks for the feedback. It appears our timeouts might be too tight, since reloading the code locations works (the tasks are up by that time). I have some idea why pex failed earlier due to `python_full_version`. It might be due to `ubuntu-20.04` and `ubuntu-22.04` (aka `ubuntu-latest`) having slightly different workflows. Basically, on `ubuntu-20.04` pex can fall back to the local interpreter for dependency resolution since we use `--resolve-local-platforms`, and that might already work with `python_full_version`.
z
Ahh that makes sense. We don’t really even need python_full_version as far as I can tell, and in this case it was just poetry giving us a requirements file that was more flexible / permissive in versions than we needed. Is there any sort of rough timeline for a fix that we can expect?
s
I tested using `python_full_version` and was able to reproduce the issue on `ubuntu-latest`. Switching to `ubuntu-20.04` fixed the issue, so I'd say it already works. If you use `ubuntu-20.04` you can remove the script to strip those markers. This also explains why the branch worked. GitHub changed their default runner from 20.04 to 22.04 sometime in December 2022. 22.04 provides a newer Python that cannot build source dependencies for our target container, so we don't use the local python interpreter to resolve dependencies. For our workflows `ubuntu-20.04` works better and we have switched our default quickstarts to 20.04 now, but your workflows were probably cloned before that change.
z
Apologies, I meant for the timeout issue when updating code locations. I’ve also since moved all of our workflows to use `ubuntu-20.04`.
s
Ah I see. Are you still seeing timeouts? I expect the first time switching to PEX may take longer for the tasks to spin up and that needs to be adjusted for in the timeout, but subsequent code updates should be fast (I hope). If you still see timeouts, please let me know.
z
Yes, I notice that the builds themselves are much faster on second launch (e.g., building pex files, uploading pex files, and updating the branch deployment all happen in just a minute or two). However, it then goes to “update code locations” and stalls for a while at “waiting for agent to sync changes to <code_location>. This can take a few minutes”. It then loops over that for a few minutes or so before erroring out with the timeout above.
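The behaviour described here (poll until the agent has synced, give up after a deadline) follows a common pattern. The sketch below is generic and is not the dagster-cloud implementation; `poll_until_loaded`, `check_loaded`, and the default numbers are all hypothetical.

```python
import time
from typing import Callable


def poll_until_loaded(
    check_loaded: Callable[[], bool],
    timeout_s: float = 300.0,
    poll_interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check_loaded() until it returns True or timeout_s elapses.

    clock and sleep are injectable so the loop can be exercised in tests
    without actually waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check_loaded():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

Under this pattern, a slow first spin-up fails once the deadline passes even though the server comes up shortly afterwards, which would be consistent with a manual reload and re-run succeeding.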
s
I'll look into this. Can you please share a link to the branch deployment you are seeing this in? Are you aware of something expensive that is computed during imports?
z
Thanks in advance Shalabh! I’ll DM you the link. As for whether there’s anything expensive computed during imports: maybe? We have a few constants that initialize with info from AWS Systems Manager, etc. We also have a “Common” folder that contains lots of functions and classes that we use elsewhere (e.g., custom IO managers, resources, and other tooling). Here’s a snippet:
import datetime
from dataclasses import dataclass
from pathlib import Path

from dagster import (
    MonthlyPartitionsDefinition,
    PartitionedConfig,
    ResourceDefinition,
    make_values_resource,
)
from dagster_aws.s3 import s3_resource
from dagster_aws.s3.io_manager import s3_pickle_io_manager
from dagster_aws.secretsmanager import secretsmanager_resource
from dagster_databricks import databricks_pyspark_step_launcher
from dagster_pyspark import pyspark_resource
from pyarrow.fs import LocalFileSystem, SubTreeFileSystem

from common.io_managers import parquet_pyarrow_iomanager
from common.io_managers.delta_spark import branching_delta_io_manager
from common.resources import ssh_resource_from_secretsmanager  # type: ignore[attr-defined]
from common.utils.aws import SSMParameterBuilder

monthly_partition = MonthlyPartitionsDefinition("202208", fmt="%Y%m", end_offset=1, day_offset=8)
region = "us-east-1"


@dataclass
class DDBParameters(SSMParameterBuilder["DDBParameters"]):
    KEYPAIR_NAME: str
    PROD_BUCKET: str
    PROD_DB_NAME: str
    STAGING_BUCKET: str
    STAGING_DB_NAME: str
    TEMP_BUCKET: str


ssm_param_path = "/DATAENG/PROJECT_NAME"
parameters = DDBParameters.from_ssm_parameters(ssm_param_path, region)
If it would help I’d be willing to screen share some of this with you or give more specific code snippets. We aren’t super in love with the current set up of our pipelines, but they were mainly done by me when I was very new to the project 😅
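If a module-level lookup such as the `from_ssm_parameters` call in the snippet turns out to be slow, one option is to defer it until the values are first used, so that importing the module stays cheap. This is a sketch under assumptions: `LazyParameters` is a hypothetical wrapper, and `fetch_from_ssm` is a stand-in for the real SSM call (the actual `SSMParameterBuilder` is not shown in the thread) that returns placeholder values.

```python
from typing import Callable, Dict, Optional


class LazyParameters:
    """Defer an expensive lookup until the values are first needed."""

    def __init__(self, fetch: Callable[[], Dict[str, str]]) -> None:
        self._fetch = fetch
        self._cached: Optional[Dict[str, str]] = None

    def get(self) -> Dict[str, str]:
        if self._cached is None:
            self._cached = self._fetch()  # runs once, on first access
        return self._cached


def fetch_from_ssm() -> Dict[str, str]:
    """Hypothetical stand-in for the SSM call; returns placeholder values."""
    return {"PROD_BUCKET": "placeholder-bucket"}


# Creating the wrapper at import time is cheap; no lookup happens yet.
parameters = LazyParameters(fetch_from_ssm)
```

Callers would then use `parameters.get()["PROD_BUCKET"]` inside resources or ops instead of reading a module-level constant, which moves the network call out of the import path.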
s
Hi Zach - following up on this timeout error, it looks like we were running into some rate limits with how many branch deployments are being updated concurrently. If you have any code locations you are not actively using in your branch deployments, can you try removing them? We will also look at adjusting the limits and timeouts at our end.
z
Afternoon Shalabh, thank you so much for following up 🙂
I’m not quite sure how we would go about removing code locations, and in fact we plan on increasing the number of code locations in the future (currently 4, and in the future easily 10-12) to help separate different requirements and data products. That being said, we don’t need all of them to be updated in a branch deployment at once. Is there a way to configure the workflow queue to only update changed code locations? This could help us lower this to a place where most branches only need to update one code location at a time. Looking at the action, this doesn’t seem to be a parameter of the github action.
Finally, we use stacked PRs (this is something that the dagster team has actually written blogs on). This means that we often have 3/4 branches open at a time per engineer with small changes. This alone may be what’s causing it to be slow (e.g., 5 PRs stacked with 4 code locations = 20 code locations to update?). I’ll try using workflow queues to reduce this for now. However, I am a bit worried about how early we hit rate limits here.
Apologies in advance for some potentially difficult questions 😅. We are a team of only 4 engineers at the moment and have started migrating our legacy data assets to be orchestrated with dagster. With the move from `repositories` to `Definitions`, it seems like it’s “best practice” to use code locations for this sort of logical separation (AFAIK, and as the docs suggest, you can only have one `definitions` per code location). Long term, will dagster serverless & pex deployments be able to keep up with this change? Is this a serverless-specific issue, and would it possibly be resolved by moving to a hybrid deployment? We expect to 4x our number of engineers and data assets/code locations over the next year or so and need to know how we should plan some of our workflows so that they are effective and scalable.
d
Hey Zach - we'll take a closer look at this and I'm sure we can sort out the performance issues here. One thing in your most recent post though - I actually don't think you should move 1 code location with multiple repositories into N code locations, each with 1 Definitions object - if repositories are working okay for you right now, I would just continue using them until Definitions is better set up to allow for that use case of a single Python environment/image with multiple grouped sets of definitions. repository isn't deprecated or anything so there's no risk of it going away any time soon - and even without this specific perf issue you're running into that I'm sure we can resolve, there's just a lot more overhead with building a separate image for each repository, for example, if you don't need to.
(If you have different Python environments though or want to be able to deploy different sets of code independently, then different code locations are 100% the way to go there)
Can I also confirm where exactly you are seeing these timeout errors when they happen? Is it always within the github action itself failing to deploy, or is it also later within dagit when you go to try to use the branch deployment?
last question - am I correct in taking away from this thread that you only ran into these timeout issues with ENABLE_FAST_DEPLOYS on (PEX deployment)? "I'm not sure" is an acceptable answer
z
Thanks Daniel, I’ll see if I can maybe refactor down to one code location in this case. I was under the incorrect impression that repositories were or would soon be deprecated. The failure always occurs in github actions in the “Update Code Locations” step of the branch and prod deployments when fast deploy is enabled. After they’ve failed, going to dagster-cloud shows the error message:
dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server
  File "/dagster-cloud/dagster_cloud/workspace/user_code_launcher/user_code_launcher.py", line 1167, in _reconcile
    new_dagster_servers[to_update_key] = self._start_new_dagster_server(
  File "/dagster-cloud/dagster_cloud/workspace/user_code_launcher/user_code_launcher.py", line 1441, in _start_new_dagster_server
    self._create_pex_server(deployment_name, location_name, desired_entry, multipex_server)
  File "/dagster-cloud/dagster_cloud/workspace/user_code_launcher/user_code_launcher.py", line 1419, in _create_pex_server
    multipex_client.create_pex_server(
  File "/dagster-cloud/dagster_cloud/pex/grpc/client.py", line 37, in create_pex_server
    res = self._query(
  File "/dagster-cloud/dagster_cloud/pex/grpc/client.py", line 91, in _query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e
The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with: status = StatusCode.DEADLINE_EXCEEDED details = "Deadline Exceeded" debug_error_string = "{"created":"@1675802607.414275462","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}" >
  File "/dagster-cloud/dagster_cloud/pex/grpc/client.py", line 88, in _query
    response = getattr(stub, method)(request_type(**kwargs), timeout=timeout)
  File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
Clicking “Reload” on the error message triggers an update of the code location, and this usually works (once it failed when I clicked update on all of them). Since changing to the workflow queue (i.e., only 4 code locations should ever be updated at a time now), the issue remains. This issue only appears with `ENABLE_FAST_DEPLOYS=true`. For the past few months we’ve been using the docker deploy with serverless without issue, but originally changed to PEX with the hope that it would reduce our github minutes usage 🙂.
I’d also just like to add that I really appreciate @Shalabh Chaturvedi’s work & willingness to help here 🙂
d
Strong agree there 🙂 We have very plausible fixes in the works for both of these issues (the github action timing out and that error message afterwards). We'll test them out over the next day or two and report back when we believe they're fixed. In the meantime, we're pretty sure both problems are indeed related to ENABLE_FAST_DEPLOYS, so disabling that in the short term could work to unblock you until the fixes are ready.
The fixes for both of these issues should now be live if you're on the latest dagster release in your github repo (1.1.19) - let us know if you're still seeing either of the timeouts you described here
z
I’m still getting an error, but it seems to be related to a different issue: `no module named github` in the pex builder.
d
Hmm @Shalabh Chaturvedi was that one on your radar? I recall a similar issue that we fixed related to the github3 package being missing?
z
Full error fwiw:
ERROR:root:Could not update PR comment: b''
  b'Traceback (most recent call last):\n  File "/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/runpy.py", line 194, in _run_module_as_main\n    return _run_code(code, main_globals, None,\n  File "/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/runpy.py", line 87, in _run_code\n    exec(code, run_globals)\n  File "/home/runner/.pex/unzipped_pexes/6ec7c7737a35d90dd0d16f9fa346443030502884/__main__.py", line 103, in <module>\n    bootstrap_pex(__entry_point__, execute=__execute__, venv_dir=__venv_dir__)\n  File "/home/runner/.pex/unzipped_pexes/6ec7c7737a35d90dd0d16f9fa346443030502884/.bootstrap/pex/pex_bootstrapper.py", line 608, in bootstrap_pex\n    pex.PEX(entry_point).execute()\n  File "/home/runner/.pex/unzipped_pexes/6ec7c7737a35d90dd0d16f9fa346443030502884/.bootstrap/pex/pex.py", line 560, in execute\n    sys.exit(self._wrap_coverage(self._wrap_profiling, self._execute))\n  File "/home/runner/.pex/unzipped_pexes/6ec7c7737a35d90dd0d16f9fa346443030502884/.bootstrap/pex/pex.py", line 467, in _wrap_coverage\n    return runner(*args)\n  File "/home/runner/.pex/unzipped_pexes/6ec7c7737a35d90dd0d16f9fa346443030502884/.bootstrap/pex/pex.py", line 498, in _wrap_profiling\n    return runner(*args)\n  File "/home/runner/.pex/unzipped_pexes/6ec7c7737a35d90dd0d16f9fa346443030502884/.bootstrap/pex/pex.py", line 581, in _execute\n    return self.execute_interpreter()\n  File "/home/runner/.pex/unzipped_pexes/6ec7c7737a35d90dd0d16f9fa346443030502884/.bootstrap/pex/pex.py", line 681, in execute_interpreter\n    return self.execute_content(arg, content)\n  File "/home/runner/.pex/unzipped_pexes/6ec7c7737a35d90dd0d16f9fa346443030502884/.bootstrap/pex/pex.py", line 774, in execute_content\n    return cls.execute_ast(name, program, argv0=argv0)\n  File "/home/runner/.pex/unzipped_pexes/6ec7c7737a35d90dd0d16f9fa346443030502884/.bootstrap/pex/pex.py", line 792, in execute_ast\n    exec_function(program, globals_map)\n  File 
"/home/runner/.pex/installed_wheels/da7b3711d724baa7fbaf88a524c5a0e90bee8fa0db1bc973a649a1905a457ef9/pex-2.1.122-py2.py3-none-any.whl/pex/compatibility.py", line 109, in exec_function\n    exec (ast, globals_map, locals_map)\n  File "/home/runner/work/_actions/dagster-io/dagster-cloud-action/pex-v0.1/src/create_or_update_comment.py", line 2, in <module>\n    from github import Github\nModuleNotFoundError: No module named \'github\'\n'
s
I do see this error in branch deployment actions logs. The PR comment will be missing but otherwise the deployment should go through and the action should succeed. Are you seeing the actions fail?
z
Yep, the action ultimately fails:
Failed to deploy Python Executable. Try disabling fast deploys by setting `ENABLE_FAST_DEPLOYS: 'false'` in your .github/workflows/*yml.
That being said, it does seem to be deploying to the branch correctly
s
I believe the above exception is logged but ignored - can you post the rest of the logs? Possibly there is another issue.
z
Okay, looked into it a bit. Apparently some change caused one of our code locations not to load at all 😞 It seemed to be related to the changing behavior of an experimental feature. Loader seems to be working great now! Great job folks 🎉