Alexander Whillas
11/01/2022, 9:31 PM
Mikhail Knyazev
11/02/2022, 9:41 AM
Prag
11/02/2022, 10:57 PM
ImportError: dlopen [...] lib/python3.10/site-packages/grpc/_cython/cygrpc.cpython-310-darwin.so' (mach-o file, but is an incompatible architecture (have (x86_64), need (arm64e)))
Prag
11/02/2022, 11:26 PM
Zach
11/03/2022, 2:52 PM
Andy Chen
11/03/2022, 2:58 PM
Alexander Whillas
11/03/2022, 8:16 PM
Qwame
11/03/2022, 8:53 PM
When using load_assets_from_dbt_manifest to select nodes from dbt, exposures are excluded. In dagster_dbt.asset_defs here, only nodes, sources and metrics are selected and then passed to the dbt core's selector here. The thing is that the selector function's _is_graph_member expects the manifest to have exposures; otherwise, it will try to find the exposure keys in the nodes, which of course will lead to a KeyError:
def _is_graph_member(self, unique_id: UniqueId) -> bool:
    if unique_id in self.manifest.sources:
        source = self.manifest.sources[unique_id]
        return source.config.enabled
    elif unique_id in self.manifest.exposures:
        return True
    elif unique_id in self.manifest.metrics:
        metric = self.manifest.metrics[unique_id]
        return metric.config.enabled
    node = self.manifest.nodes[unique_id]
    return not node.empty and node.config.enabled
I think adding exposures to this list here will make load_assets_from_dbt_manifest not fail when a project uses exposures.
manifest = Manifest(
    # dbt expects dataclasses that can be accessed with dot notation, not bare dictionaries
    nodes={unique_id: _DictShim(info) for unique_id, info in manifest_json["nodes"].items()},  # type: ignore
    sources={
        unique_id: _DictShim(info) for unique_id, info in manifest_json["sources"].items()  # type: ignore
    },
    metrics={
        unique_id: _DictShim(info) for unique_id, info in manifest_json["metrics"].items()  # type: ignore
    },
    exposures={
        unique_id: _DictShim(info) for unique_id, info in manifest_json["exposures"].items()  # type: ignore
    },  # This argument is missing
)
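To illustrate the failure mode described above without importing dbt, here is a minimal, self-contained sketch. FakeManifest, FakeNode, FakeConfig, is_graph_member and check are hypothetical stand-ins that only mimic the shape of the quoted _is_graph_member logic, and the unique IDs are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FakeConfig:
    enabled: bool = True


@dataclass
class FakeNode:
    empty: bool = False
    config: FakeConfig = field(default_factory=FakeConfig)


@dataclass
class FakeManifest:
    # Hypothetical stand-in for the dbt Manifest: each section is a plain dict.
    nodes: Dict[str, FakeNode] = field(default_factory=dict)
    sources: Dict[str, FakeNode] = field(default_factory=dict)
    metrics: Dict[str, FakeNode] = field(default_factory=dict)
    exposures: Dict[str, FakeNode] = field(default_factory=dict)


def is_graph_member(manifest: FakeManifest, unique_id: str) -> bool:
    # Same branching as the quoted method, reduced to plain dicts.
    if unique_id in manifest.sources:
        return manifest.sources[unique_id].config.enabled
    elif unique_id in manifest.exposures:
        return True
    elif unique_id in manifest.metrics:
        return manifest.metrics[unique_id].config.enabled
    # Falls through to nodes: raises KeyError if the ID is in none of the dicts.
    node = manifest.nodes[unique_id]
    return not node.empty and node.config.enabled


def check(manifest: FakeManifest, unique_id: str) -> str:
    """Return the membership result as text, or "KeyError" if the lookup fails."""
    try:
        return str(is_graph_member(manifest, unique_id))
    except KeyError:
        return "KeyError"


exposure_id = "exposure.my_project.weekly_dashboard"  # made-up ID
model_id = "model.my_project.orders"  # made-up ID

# Manifest built without exposures, as in the report above.
without_exposures = FakeManifest(nodes={model_id: FakeNode()})
# Manifest built with the exposures section passed through.
with_exposures = FakeManifest(
    nodes={model_id: FakeNode()},
    exposures={exposure_id: FakeNode()},
)
```

With these stand-ins, check(without_exposures, exposure_id) hits the nodes lookup and fails, while check(with_exposures, exposure_id) takes the exposures branch.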
I can submit a PR if needed.
won
11/05/2022, 1:02 PM
/overview/schedules is still good for that.
which one of them is active?
maybe instead of count of all schedules in repo, show `active`/`all` info?
Manish Khatri
11/06/2022, 1:54 PM
dagit UI
dagit==1.0.16
dagster==1.0.16
The visual DAG doesn’t match the declared code when it comes to the visual aspect of start_after. It seems the default value of ins= within the package-defined @op takes precedence over what is overridden in the defined @job.
from dagster import job
from dagster_dbt import dbt_run_op  # dagster bundled library
from dagster_stitch import stitch_sync_op  # custom
from dagster_tableau import tableau_refresh_op  # custom

@job(
    description="Ingest data, DBT transform, do Tableau Refresh.",
    resource_defs={
        "stitch": stitch_resource_all,
        "dbt": dbt_all_models_config,
        "tableau": tableau_resource_all,
    },
)
def workflow_mvp():
    tableau_refresh_op_mvp(
        start_after=dbt_run_op(
            start_after=[
                stitch_sync_op1(),
                stitch_sync_op2(),
            ]
        )
    )
In this instance I would expect the visual DAG to show:
1. dbt_run_op should start_after sync_op1 and sync_op2.
2. tableau_refresh should start_after dbt_run_op.
It should be noted that the “Info” pane on the right of the Graph for the respective ops shows the correct values/relationships.
I believe the visual DAG drawn is using the ins={"start_after": In(Nothing)} default declared in the @op in the provided dagster-dbt library (https://github.com/dagster-io/dagster/blob/master/python_modules/libraries/dagster-dbt/dagster_dbt/ops.py#L8) and not the overriding data provided in the @job definition. This makes the visual DAG an incorrect reflection of what we have declared in our code, which I would say is a bug :sadpanda:
Zach P
11/07/2022, 2:00 PM
Chris Histe
11/07/2022, 10:56 PM
Explode graphs button in dagit for an AssetGroup that was created from a @graph using AssetsDefinition.from_graph. It’s only possible for jobs as of now.
Zach
11/08/2022, 1:49 AM
Nicolas Parot Alvarez
11/08/2022, 4:30 PM
AssetSensor on the update of the asset table.
Zachary Bluhm
11/10/2022, 2:10 PM
nickvazz
11/11/2022, 9:11 PM
runs UI based off of the root_run_id tag?
Dagster Jarred
11/11/2022, 9:33 PM
David Diagama
11/12/2022, 3:29 PM
Matt Clarke
11/14/2022, 10:54 PM
Son Giang
11/16/2022, 10:15 AM
celery_worker and celery_executor with default_run_launcher. One problem is that the User Code GRPC Server usually runs out of memory. As far as I can tell, the User Code GRPC Server has to load all the pipelines submitted from the daemon, but somehow this is very memory intensive. I tried running 3 jobs, each with a graph of about 4 nodes x 5 dynamic partition mappings, and it cost about 1.1 GB of memory (the memory when initializing the API Server is 250 MB), which means around 300 MB is allocated for each pipeline.
Also, when the User Code GRPC Server runs out of memory, none of the submitted tasks can get to the queues, causing all the running/starting jobs to hang; the only solution is to terminate and restart everything. I wonder if we can handle the OOM gracefully to address this problem?
David Jayatillake
11/16/2022, 10:51 AM
Charlie Bini
11/17/2022, 5:37 PM
dagster._core.scheduler.scheduler.DagsterSchedulerError: Unable to reach the user code server for schedule datagun_clever_schedule. Schedule will resume execution once the server is available.
  File "/dagster/dagster/_scheduler/scheduler.py", line 505, in launch_scheduled_runs_for_schedule_iterator
    raise DagsterSchedulerError(
Muhammad Jarir Kanji
11/19/2022, 3:49 PM
Daniel Galea
11/21/2022, 9:18 PM
Arthur
11/22/2022, 7:21 PM
Nicolas Parot Alvarez
11/24/2022, 3:09 PM
H value, that would be nice to have for Dagster schedule cron definitions.
To allow periodically scheduled tasks to produce even load on the system, the symbol H (for “hash”) should be used wherever possible. For example, using 0 0 * * * for a dozen daily jobs will cause a large spike at midnight. In contrast, using H H * * * would still execute each job once a day, but not all at the same time, better using limited resources.
The H symbol can be used with a range. For example, H H(0-7) * * * means some time between 12:00 AM (midnight) to 7:59 AM. You can also use step intervals with H, with or without ranges.
The H symbol can be thought of as a random value over a range, but it actually is a hash of the job name, not a random function, so that the value remains stable for any given project.
https://www.jenkins.io/doc/book/pipeline/syntax/#cron-syntax
The idea of using a hash of the job name is very simple and ingenious, but not necessarily great at spreading the load. In a second stage, maybe Dagster could have something more advanced that would actually look for the times of least activity inside the allowed time range. This ideal schedule time could be set by the first run and kept for subsequent runs, or it could be recomputed every time.
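A minimal sketch of the hash idea, assuming a stable hash of the job name picks the minute and hour inside the allowed range. stable_hash and hashed_cron are hypothetical helper names for illustration; this is not an existing Dagster or Jenkins API, and the hashing Jenkins actually uses may differ.

```python
import hashlib
from typing import Tuple


def stable_hash(name: str, salt: str) -> int:
    """Return a deterministic integer for a name.

    hashlib is used instead of the built-in hash(), which is randomized
    per process for strings and would not stay stable between runs.
    """
    digest = hashlib.sha256(f"{salt}:{name}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hashed_cron(job_name: str, hour_range: Tuple[int, int] = (0, 23)) -> str:
    """Build a daily cron string whose minute and hour derive from the job name.

    hour_range is inclusive, mirroring H(0-7) in the Jenkins syntax.
    """
    low, high = hour_range
    if not (0 <= low <= high <= 23):
        raise ValueError("hour_range must satisfy 0 <= low <= high <= 23")
    # Different salts keep the minute and hour from being correlated.
    minute = stable_hash(job_name, "minute") % 60
    hour = low + stable_hash(job_name, "hour") % (high - low + 1)
    return f"{minute} {hour} * * *"
```

Calling hashed_cron("some_job", (0, 7)) returns the same plain cron expression on every run, so the result could be used wherever a cron string is accepted. As noted above, a hash only spreads jobs pseudo-randomly; it does not look for the least busy time slot.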
Casper Weiss Bang
11/28/2022, 9:54 AM
schrockn
11/28/2022, 8:45 PM
Zach P
11/29/2022, 11:42 PM
Alexander Whillas
11/30/2022, 12:17 AM