Szymon Piskorz
03/06/2023, 10:54 AM

Andrea Giardini
03/06/2023, 10:58 AM
kubectl describe pod $podname ?

Andrea Giardini
03/06/2023, 10:58 AM
kubectl describe job $jobname

Szymon Piskorz
03/06/2023, 10:58 AM

Szymon Piskorz
03/06/2023, 10:59 AM
jovyan@piskorzs-v2-0:~/work/GitRepos/platform$ kubectl get pods -n dagster
NAME                                                              READY   STATUS    RESTARTS   AGE
dagster-daemon-7b576b46f9-r4ppg                                   1/1     Running   0          38h
dagster-dagit-5b6b8946bc-5mwlr                                    1/1     Running   0          38h
dagster-dagster-user-deployments-k8s-dagster-poc-simon-65fjdpnd   1/1     Running   0          38h
dagster-dagster-user-deployments-k8s-example-user-code-3-6lb78c   1/1     Running   0          38h
dagster-postgresql-0                                              1/1     Running   0          38h

Szymon Piskorz
03/06/2023, 10:59 AM

Andrea Giardini
03/06/2023, 10:59 AM

Szymon Piskorz
03/06/2023, 10:59 AM

Szymon Piskorz
03/06/2023, 10:59 AM
jovyan@piskorzs-v2-0:~/work/GitRepos/platform$ kubectl get jobs -n dagster
No resources found in dagster namespace.

Andrea Giardini
03/06/2023, 10:59 AM

Szymon Piskorz
03/06/2023, 11:01 AM

Andrea Giardini
03/06/2023, 11:01 AM

Andrea Giardini
03/06/2023, 11:01 AM

Szymon Piskorz
03/06/2023, 11:02 AM

Andrea Giardini
03/06/2023, 11:03 AM

Szymon Piskorz
03/06/2023, 11:03 AM
jovyan@piskorzs-v2-0:~/work/GitRepos/platform$ kubectl logs -n dagster dagster-daemon-7b576b46f9-r4ppg
Telemetry:
  As an open source project, we collect usage statistics to inform development priorities. For more
  information, read https://docs.dagster.io/install#telemetry.
  We will not see or store solid definitions, pipeline definitions, modes, resources, context, or
  any data that is processed within solids and pipelines.
  To opt-out, add the following to $DAGSTER_HOME/dagster.yaml, creating that file if necessary:
    telemetry:
      enabled: false
Welcome to Dagster!
If you have any questions or would like to engage with the Dagster team, please join us on Slack
(https://bit.ly/39dvSsF).
2023-03-06 05:45:32 +0000 - dagster.daemon - INFO - Instance is configured with the following daemons: ['BackfillDaemon', 'SchedulerDaemon', 'SensorDaemon']
2023-03-06 05:45:32 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
...
2023-03-06 10:56:53 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
2023-03-06 10:57:53 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
2023-03-06 10:58:53 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
2023-03-06 10:59:54 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
2023-03-06 11:00:54 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
2023-03-06 11:01:55 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
2023-03-06 11:02:56 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.

Andrea Giardini
03/06/2023, 11:04 AM
k get job right after?

Andrea Giardini
03/06/2023, 11:04 AM

Szymon Piskorz
03/06/2023, 11:05 AM

Szymon Piskorz
03/06/2023, 11:07 AM
jovyan@piskorzs-v2-0:~/work/GitRepos/platform$ kubectl get job --all-namespaces
NAMESPACE       NAME                             COMPLETIONS   DURATION   AGE
ingress-nginx   ingress-nginx-admission-create   1/1           5s         53d
ingress-nginx   ingress-nginx-admission-patch    1/1           6s         53d

Szymon Piskorz
03/06/2023, 11:07 AM

Szymon Piskorz
03/06/2023, 11:10 AM
jovyan@piskorzs-v2-0:~/work/GitRepos/platform$ kubectl get sa -n dagster dagster -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::083749379286:role/gtm-core-eks-uat-euc1-cfn-eksclustergtmsadagsterda-F9H0IFEFTLSB
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"ServiceAccount","metadata":{"annotations":{"eks.amazonaws.com/role-arn":"arn:aws:iam::083749379286:role/gtm-core-eks-uat-euc1-cfn-eksclustergtmsadagsterda-F9H0IFEFTLSB"},"labels":{"app.kubernetes.io/name":"dagster","aws.cdk.eks/prune-c8c325efef07e37e6691673585c3559fcc4effbb9a":"","git-commit-sha":"2c2c774"},"name":"dagster","namespace":"dagster"}}
  creationTimestamp: "2023-03-01T14:00:59Z"
  labels:
    app.kubernetes.io/name: dagster
    aws.cdk.eks/prune-c8c325efef07e37e6691673585c3559fcc4effbb9a: ""
    git-commit-sha: 2c2c774
  name: dagster
  namespace: dagster
  resourceVersion: "29874861"
  uid: 3dbd6b3e-c724-4b75-b112-07f832b649c7
secrets:
- name: dagster-token-skh82

Szymon Piskorz
03/06/2023, 11:29 AM
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/instance: dagster
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: dagster
    app.kubernetes.io/version: 1.1.11
    helm.sh/chart: dagster-1.1.11
  name: dagster

The only real difference I see is that some of the labels are not there. Do you have a mechanism that might rely on those labels and could potentially fail the job run?

Szymon Piskorz
03/06/2023, 12:05 PM

Szymon Piskorz
03/06/2023, 2:14 PM
1.1.11.
The job and the pod both delete themselves after an unsuccessful schedule, which eliminates any sign of the error on both the kubectl and dagit sides. The only way to troubleshoot it was to have a watch command scan kubectl get jobs -n dagster every 0.01 seconds and describe anything it found. Later we discovered that those were also available in kubectl get events -n dagster:
jovyan@piskorzs-v2-0:~/work/GitRepos/platform/k8s/eks/assets/k8s_manifests/dagster$ kubectl get events -n dagster
LAST SEEN   TYPE      REASON                   OBJECT                                                       MESSAGE
3m26s       Warning   FailedScheduling         pod/dagster-run-03a9455d-9a35-4579-bf43-32f13b644a4a-lv79g   0/10 nodes are available: 1 node(s) had taint {hcp-linking-pipelines: }, that the pod didn't tolerate, 3 node(s) had taint {deployments-control-plane: }, that the pod didn't tolerate, 6 node(s) had taint {kf-notebooks: }, that the pod didn't tolerate.
3m26s       Normal    SuccessfulCreate         job/dagster-run-03a9455d-9a35-4579-bf43-32f13b644a4a         Created pod: dagster-run-03a9455d-9a35-4579-bf43-32f13b644a4a-lv79g
6s          Warning   FailedScheduling         pod/dagster-run-28b3240c-3f83-4c86-b004-4f4b68df2ab9-qrd6x   0/10 nodes are available: 1 node(s) had taint {hcp-linking-pipelines: }, that the pod didn't tolerate, 3 node(s) had taint {deployments-control-plane: }, that the pod didn't tolerate, 6 node(s) had taint {kf-notebooks: }, that the pod didn't tolerate.
6s          Normal    SuccessfulCreate         job/dagster-run-28b3240c-3f83-4c86-b004-4f4b68df2ab9         Created pod: dagster-run-28b3240c-3f83-4c86-b004-4f4b68df2ab9-qrd6x
4m10s       Warning   FailedScheduling         pod/dagster-run-9f566c73-519f-4b55-ae52-130ac9df35a8-qpzqz   0/10 nodes are available: 1 node(s) had taint {hcp-linking-pipelines: }, that the pod didn't tolerate, 3 node(s) had taint {deployments-control-plane: }, that the pod didn't tolerate, 6 node(s) had taint {kf-notebooks: }, that the pod didn't tolerate.
4m10s       Normal    SuccessfulCreate         job/dagster-run-9f566c73-519f-4b55-ae52-130ac9df35a8         Created pod: dagster-run-9f566c73-519f-4b55-ae52-130ac9df35a8-qpzqz
43m         Normal    SuccessfullyReconciled   targetgroupbinding/k8s-dagster-dagsterd-ad19985f97           Successfully reconciled
I don't think the deletion of the pod and the job is the desired behaviour; the K8s version of Dagster should use the backoff mechanism instead, because there is no good way of telling what happened to a pod after it has been deleted.
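
[For reference, a minimal sketch of the polling approach described above. The kubectl invocations match the thread; the script itself is illustrative and assumes kubectl is on PATH. In practice, watch and sleep granularity bottoms out around 0.1 s.]

import subprocess
import time

# Poll for short-lived dagster run jobs and describe them before they are
# deleted, so the failure reason is preserved somewhere.
while True:
    jobs = subprocess.run(
        ["kubectl", "get", "jobs", "-n", "dagster", "-o", "name"],
        capture_output=True, text=True,
    ).stdout.split()
    for job_name in jobs:
        # Dump each job's description (spec, status, events) to stdout.
        result = subprocess.run(
            ["kubectl", "describe", "-n", "dagster", job_name],
            capture_output=True, text=True,
        )
        print(result.stdout)
    time.sleep(0.1)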

Szymon Piskorz
03/06/2023, 2:19 PM
1.1.11: we did not want to introduce additional complexity with a version bump until we found the cause.
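
[Given the FailedScheduling events above, where every node is tainted and the run pods carry no matching toleration, one documented way to attach tolerations and a node selector to the pods Dagster launches for runs is the dagster-k8s/config tag. The job and op below are illustrative, not from the thread.]

from dagster import job, op

@op
def do_work():
    return "ok"

# The dagster-k8s/config tag customizes the Kubernetes objects created for a
# run; pod_spec_config accepts snake_case V1PodSpec fields such as
# tolerations and node_selector.
@job(
    tags={
        "dagster-k8s/config": {
            "pod_spec_config": {
                "tolerations": [
                    {
                        "key": "deployments-control-plane",
                        "operator": "Exists",
                        "effect": "NoSchedule",
                    }
                ],
                "node_selector": {"deployments-control-plane": "true"},
            }
        }
    }
)
def poc_job():
    do_work()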

[Messages exchanged between daniel and Szymon Piskorz from 03/06/2023, 3:10 PM to 03/07/2023, 3:18 PM were not captured in this export.]

Szymon Piskorz
03/07/2023, 3:18 PM
describe pod and get events commands:

[Messages from 03/07/2023, 3:18 PM to 3:22 PM were not captured in this export.]

Szymon Piskorz
03/07/2023, 3:22 PM
kubectl logs

[The thread continued between Szymon Piskorz and daniel until 03/07/2023, 3:40 PM; those messages were not captured in this export.]

Szymon Piskorz
03/07/2023, 4:16 PM
dagster-user-deployments:
  deployments:
    - name: "k8s-dagster-poc-simon"
      image:
        repository: "083749379286.dkr.ecr.eu-central-1.amazonaws.com/common/dagster_poc/refactored_project_repo"
        tag: f1ac2a0d5698166ce065cdbb5bfb9b8fdacc4d7a
        pullPolicy: Always
      dagsterApiGrpcArgs:
        - "--python-file"
        - "./repo.py"
      port: 3030
      envSecrets:
        - name: redshift-dbt-secrets-envs
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - eu-central-1a
                      - eu-central-1b
      tolerations:
        - key: "deployments-control-plane"
          operator: "Exists"
          effect: "NoSchedule"
      nodeSelector:
        deployments-control-plane: "true"
      resources:
        limits:
          cpu: 1000m
          memory: 2Gi
        requests:
          cpu: 100m
          memory: 128Mi

Does this mean the pod with the code repo will have those resources, or will every single run have them? The default values file mentions that, if not specified, the K8s scheduler values will be used, but I am still unsure whether that applies to the code repo or to the runs of the actual assets.
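
[If the goal is to give each run pod its own requests and limits, rather than the user-code server, one documented mechanism is again the dagster-k8s/config tag, this time under container_config. Names and numbers below are illustrative, not from the thread.]

from dagster import job, op

@op
def build_assets():
    return "done"

# container_config accepts snake_case V1Container fields; resources follows
# the usual Kubernetes requests/limits structure and applies to the run pod.
@job(
    tags={
        "dagster-k8s/config": {
            "container_config": {
                "resources": {
                    "requests": {"cpu": "250m", "memory": "256Mi"},
                    "limits": {"cpu": "1", "memory": "2Gi"},
                }
            }
        }
    }
)
def assets_job():
    build_assets()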

daniel
03/07/2023, 4:18 PM

Szymon Piskorz
03/07/2023, 4:19 PM

Szymon Piskorz
03/16/2023, 2:21 AM