# ask-community
p
hey, can anyone by any chance suggest something for this?
Copy code
This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (sqlite3.OperationalError) database is locked
...
but we’re using postgres and i don’t see any locks there. one thing to mention: we’re using celery with a big number of parallel tasks
j
How often do you see this?
p
it depends on the number of celery instances. with ~10 instances, 8 workers each, everything is fine; with 30+ instances it crashes every time
j
Strange. This is dagster-celery? Or dagster-celery-k8s?
I’m going to guess that part of your instance isn’t pointing at postgres, and is storing things on the host machines with sqlite (the default)
p
it’s dagster and celery, just installed separately
j
Could you share your dagster.yaml (with secrets removed)?
p
we don’t have it. yeah, you’re right, it’s dagster-k8s. here’s dagster-values:
Copy code
dagster-user-deployments:
  enabled: true
  deployments:
    - name: data-ops
      image:
        repository: us-central1-docker.pkg.dev/-dev/repository-dev/dagster
        tag: latest
        pullPolicy: Always
      dagsterApiGrpcArgs:
        - -m
        - data_ops
      port: 8000
      env:
        MELTANO_ENVIRONMENT: prod
        GOOGLE_APPLICATION_CREDENTIALS: /etc/service-key/credentials
      volumes:
        - name: service-key
          secret:
            secretName: service-key
      volumeMounts:
        - name: service-key
          mountPath: /etc/service-key
          readOnly: true

generateCeleryConfigSecret: false
global:
  celeryConfigSecretName: "dagster-celery-1111-config-secret"


flower:
  enabled: true

  image:
    repository: "<http://docker.io/mher/flower|docker.io/mher/flower>"
    tag: "0.9.5"
    pullPolicy: Always

  service:
    type: ClusterIP
    annotations: {}
    port: 5555

  # Support Node, affinity and tolerations for Flower pod assignment. See:
  # https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#nodeselector
  # https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
  # https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
  annotations: {}
  nodeSelector: {}
  affinity: {}
  tolerations: []
  podSecurityContext: {}
  securityContext: {}

  # Override the default K8s scheduler
  # schedulerName: ~

  resources: {}
  # If you want to specify resources, uncomment the following lines, adjust them as necessary,
  # and remove the curly braces after 'resources:'.
  # limits:
  #   cpu: 100m
  #   memory: 128Mi
  # requests:
  #   cpu: 100m
  #   memory: 128Mi

  # Liveness probe detects when to restart flower.
  # https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes
  livenessProbe:
    tcpSocket:
      port: "flower"
    # initialDelaySeconds: 60
    periodSeconds: 20
    failureThreshold: 3


  startupProbe:
    tcpSocket:
      port: "flower"
    periodSeconds: 10
    failureThreshold: 6

postgresql:
  enabled: false
  postgresqlHost: 11.111.111.111
  postgresqlUsername: dagster
  postgresqlPassword: 1111
  postgresqlDatabase: dagster
  service:
    port: 5432
j
So you’re managing your own celery workers?
p
yes
j
(outside of the dagster helm chart)
What dagster executor are you using?
The default (multiprocess) and just calling celery task submit yourself?
p
yep
j
So the celery workers don’t touch dagster at all? (they don’t have the lib installed, etc.)
And where does the “database is locked” message get logged? The run pods? The webserver?
p
so, i’m not really experienced with it; the current setup was made by someone else and i’m just trying to get it working, but i’ll try to explain. we have a job with an op in it. the op creates a list of celery tasks, each running a method with celery
@app.task(autoretry_for=(,), retry_kwargs={'max_retries': 1, 'countdown': 10})
then we just wait for all the tasks in the list to finish
Copy code
result = group(self.tasks).apply_async()

# Wait for the tasks to complete
while not result.ready():
...
and then the pipeline continues. we have a celery deployment which is managed by KEDA; celery uses the same docker image as we’re using for the dagster deployment
“database is locked” occurs in 2 places: in the celery pod and in the dagster-run job
j
Do the celery tasks invoke dagster?
I’d guess that the celery workers don’t have the dagster configuration to point at postgres, and are instead writing the sqlite locally
p
nope, celery actually runs meltano
is there a way i can check that? whether it has it or not
j
If you look inside the webserver pod there will be a dagster.yaml that configures it (the helm chart sets it). The same file would be needed inside the celery workers.
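For reference, the storage section of that file usually looks roughly like this when the chart is pointed at postgres — the hostname and credentials below are placeholders, and if a worker has no such file under DAGSTER_HOME, dagster falls back to local sqlite:
Copy code
# sketch of a postgres-backed dagster.yaml; values are placeholders, not this cluster's
storage:
  postgres:
    postgres_db:
      hostname: your-postgres-host   # placeholder
      username: dagster
      password:
        env: DAGSTER_PG_PASSWORD     # the chart typically injects the password via this env var
      db_name: dagster
      port: 5432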
Is there more of a stack trace for
Copy code
This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (sqlite3.OperationalError) database is locked
...
?
p
Copy code
[2023-08-09 20:31:02,955: INFO/ForkPoolWorker-9] This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (sqlite3.OperationalError) database is locked
[2023-08-09 20:31:02,955: INFO/ForkPoolWorker-9] [SQL: UPDATE runs SET last_heartbeat_at=? WHERE runs.id = ?]
[2023-08-09 20:31:02,955: INFO/ForkPoolWorker-9] [parameters: ('2023-08-09 20:30:49.066415', 5)]
[2023-08-09 20:31:02,956: INFO/ForkPoolWorker-9] (Background on this error at: https://sqlalche.me/e/14/e3q8) (Background on this error at: https://sqlalche.me/e/14/7s2a)
it actually fails either on the heartbeat update or on the event_logs insert
i’ve increased the lock timeout on the pg instance and the number of locks
j
Yeah it seems like the celery workers are using dagster (or else they wouldn’t be trying to make these queries right?) but are using sqlite instead of postgres
p
yeah, i was also thinking the same. will try adding dagster.yaml there
j
If you set the DAGSTER_HOME env var to a directory inside the celery workers, and copy the dagster.yaml from the webserver to there (you can use the configmap), I’d expect it to work.
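Something roughly like this on the celery worker deployment spec should do it — the configmap and secret names below are just what the chart typically creates (they depend on your release name), so treat them as assumptions and adjust:
Copy code
# sketch for the celery worker Deployment; names are assumptions, not taken from this cluster
env:
  - name: DAGSTER_HOME
    value: /opt/dagster/dagster_home
  - name: DAGSTER_PG_PASSWORD              # lets the mounted dagster.yaml resolve the password
    valueFrom:
      secretKeyRef:
        name: dagster-postgresql-secret    # assumption: default secret name from the chart
        key: postgresql-password
volumeMounts:
  - name: dagster-instance
    mountPath: /opt/dagster/dagster_home/dagster.yaml
    subPath: dagster.yaml
volumes:
  - name: dagster-instance
    configMap:
      name: dagster-instance               # assumption: the dagster.yaml configmap the chart creates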
I’m still a little confused about the setup with the celery tasks, but if it works sounds good haha
p
haha i’m also confused because i don’t know what to do with it all 😄 i will try your solution, thanks. but if not, which celery task setup would you recommend? we expect to have up to 500 celery instances, preferably controlled by some autoscaler
@johann thanks for the help! unfortunately your solution hasn’t helped, and i was also not able to find any mentions of sqlite on the pods throwing that error. but what helped is tuning the db (i actually did that before, but for some reason it hadn’t been saved). increasing these two improved performance, it started processing much faster, and it fixed the issue with locks
Copy code
max_locks_per_transaction
lock_timeout
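for reference, these are postgresql.conf settings (they can also be set with ALTER SYSTEM); the values below are only illustrative, not the exact ones i used:
Copy code
# illustrative postgresql.conf values, not the actual ones from this setup
max_locks_per_transaction = 128   # postgres default is 64; changing it requires a server restart
lock_timeout = '30s'              # postgres default is 0, i.e. statements wait for locks indefinitely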
idk, should i open an issue that it’s showing the wrong db connection in the error text?
j
Tuning Postgres solved this??
p
yep, increasing the number of locks per transaction
sorry, it’s not related to dagster at all. the problem is with meltano — they both have sqlite and “runs” tables with similar structure, so if anyone has a similar issue, check meltano
j
Makes more sense, thanks for the followup