Hi All, I am having some issues with my deployment ...
# deployment-ecs
m
Hi all, I am having some issues with my deployment of dagit on ECS. Right after deployment, or after switching the database to a new Postgres RDS instance, everything runs smoothly and very fast. However, after I run some big pipelines (6K - 10K solids), I get performance issues while interacting with dagit. This happens, for example, when I load the overview of all runs or schedules, but also when I open the overview / playground tab of a specific pipeline. Often these pages no longer load and I see the following error:
[Network error] JSON.parse: unexpected character at line 1 column 1 of the JSON data
When I replace the database, everything is fine again. I noticed that during these errors the Dagit task has a maximum CPU utilisation of 100% and an average of ~70%. Memory is at 80% (max) and 40% (avg). The load balancer also sometimes kills the Dagit task due to a request timeout. I tried giving the dagit task more resources (cpu: 1024 & memory: 2048) and increasing the unhealthy threshold and timeout of the load balancer, but neither solved the problem. The other components (RDS, daemon & pipelines) seem to have enough resources.
I am running Dagster 12.5 and adapted the CloudFormation template generated by Docker Compose for ECS:
• The tasks run in a private subnet.
• A load balancer gives access to dagit.
• A Postgres RDS instance is used instead of the postgres service.
• I increased the CPU (512) and memory (1024) of the dagit task.
• The pipelines run in a custom ECS task.
• I have set "--db-statement-timeout 600000" when starting Dagit.
Does anybody have a suggestion on how I can solve this?
j
Do you have the full stack trace of that error?
m
Where can I find the full stack trace? I can only see the error message in the UI. There are also no logs in CloudWatch / ECS logs (except for the welcome message & telemetry).
j
There might be some more useful information hidden in the request that we can find via the browser’s dev tools. Try poking around in Chrome dev tools -> Network tab -> Find the failed request -> Look at request body/response
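The same check can be done outside the browser by replaying the failed request and looking at the raw status and body. Below is a minimal Python sketch of that idea, using only the standard library. The helper names (`summarize_response`, `post_graphql`) are made up for this example, and the endpoint URL has to be replaced with the real Dagit address. A body that is not JSON (for instance an HTML error page from a proxy) would explain a `JSON.parse` error in the UI.

```python
import json
import time
import urllib.error
import urllib.request


def summarize_response(status, body):
    """Describe a raw HTTP response the way a GraphQL client would see it."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Not JSON at all, e.g. an HTML error page from a proxy or load balancer.
        return {"status": status, "is_json": False, "preview": body[:200]}
    errors = payload.get("errors") if isinstance(payload, dict) else None
    return {"status": status, "is_json": True, "errors": errors}


def post_graphql(url, query, variables=None, timeout=120):
    """POST a GraphQL query and return (summary, elapsed_seconds)."""
    data = json.dumps({"query": query, "variables": variables or {}}).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    started = time.monotonic()
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            status = response.status
            body = response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses still carry a body worth inspecting.
        status = error.code
        body = error.read().decode("utf-8", "replace")
    return summarize_response(status, body), time.monotonic() - started


# Example call (hypothetical host; copy the query text from the dev tools):
# summary, elapsed = post_graphql("http://<dagit-host>/graphql", query_text)
# print(summary, f"{elapsed:.1f} s")
```

The elapsed time is as useful as the status code: a failure that always arrives after the same number of seconds usually points at a fixed timeout somewhere between the client and Dagit.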
m
Thanks! It seems to fail at some graphql queries. The first one is:
query JobMetadataQuery($params: PipelineSelector!, $runsFilter: PipelineRunsFilter) {
  pipelineOrError(params: $params) {
    ... on Pipeline {
      id
      schedules {
        id
        mode
        ...ScheduleSwitchFragment
        __typename
      }
      sensors {
        id
        mode
        ...SensorSwitchFragment
        __typename
      }
      __typename
    }
    __typename
  }
  pipelineRunsOrError(filter: $runsFilter, limit: 5) {
    ... on PipelineRuns {
      results {
        id
        ...RunMetadataFragment
        __typename
      }
      __typename
    }
    __typename
  }
}

fragment ScheduleSwitchFragment on Schedule {
  id
  name
  scheduleState {
    id
    status
    __typename
  }
  __typename
}

fragment SensorSwitchFragment on Sensor {
  id
  jobOriginId
  name
  sensorState {
    id
    status
    __typename
  }
  __typename
}

fragment RunMetadataFragment on PipelineRun {
  id
  status
  assets {
    id
    key {
      path
      __typename
    }
    __typename
  }
  ...RunTimeFragment
  __typename
}

fragment RunTimeFragment on PipelineRun {
  id
  status
  stats {
    ... on PipelineRunStatsSnapshot {
      id
      enqueuedTime
      launchTime
      startTime
      endTime
      __typename
    }
    ... on PythonError {
      ...PythonErrorFragment
      __typename
    }
    __typename
  }
  __typename
}

fragment PythonErrorFragment on PythonError {
  __typename
  message
  stack
  cause {
    message
    stack
    __typename
  }
}
After exactly 1 minute of waiting it returns a 504 gateway timeout.
query PipelineExplorerRootQuery($pipelineSelector: PipelineSelector, $snapshotId: String, $rootHandleID: String!, $requestScopeHandleID: String) {
  pipelineSnapshotOrError(
    snapshotId: $snapshotId
    activePipelineSelector: $pipelineSelector
  ) {
    ... on PipelineSnapshot {
      id
      name
      ...PipelineExplorerFragment
      solidHandle(handleID: $rootHandleID) {
        ...PipelineExplorerSolidHandleFragment
        __typename
      }
      solidHandles(parentHandleID: $requestScopeHandleID) {
        handleID
        solid {
          name
          __typename
        }
        ...PipelineExplorerSolidHandleFragment
        __typename
      }
      __typename
    }
    ... on PipelineNotFoundError {
      message
      __typename
    }
    ... on PipelineSnapshotNotFoundError {
      message
      __typename
    }
    ... on PythonError {
      message
      __typename
    }
    __typename
  }
}

fragment PipelineExplorerFragment on IPipelineSnapshot {
  name
  description
  ...SidebarTabbedContainerPipelineFragment
  __typename
}

fragment SidebarTabbedContainerPipelineFragment on IPipelineSnapshot {
  name
  ...SidebarPipelineInfoFragment
  __typename
}

fragment SidebarPipelineInfoFragment on IPipelineSnapshot {
  name
  description
  modes {
    id
    ...SidebarModeInfoFragment
    __typename
  }
  __typename
}

fragment SidebarModeInfoFragment on Mode {
  id
  name
  description
  resources {
    name
    description
    configField {
      configType {
        ...ConfigTypeSchemaFragment
        recursiveConfigTypes {
          ...ConfigTypeSchemaFragment
          __typename
        }
        __typename
      }
      __typename
    }
    __typename
  }
  loggers {
    name
    description
    configField {
      configType {
        ...ConfigTypeSchemaFragment
        recursiveConfigTypes {
          ...ConfigTypeSchemaFragment
          __typename
        }
        __typename
      }
      __typename
    }
    __typename
  }
  __typename
}

fragment ConfigTypeSchemaFragment on ConfigType {
  ... on EnumConfigType {
    givenName
    __typename
  }
  ... on RegularConfigType {
    givenName
    __typename
  }
  key
  description
  isSelector
  typeParamKeys
  ... on CompositeConfigType {
    fields {
      name
      description
      isRequired
      configTypeKey
      __typename
    }
    __typename
  }
  ... on ScalarUnionConfigType {
    scalarTypeKey
    nonScalarTypeKey
    __typename
  }
  __typename
}

fragment PipelineExplorerSolidHandleFragment on SolidHandle {
  handleID
  solid {
    name
    ...PipelineGraphSolidFragment
    __typename
  }
  __typename
}

fragment PipelineGraphSolidFragment on Solid {
  name
  ...SolidNodeInvocationFragment
  definition {
    name
    ...SolidNodeDefinitionFragment
    __typename
  }
  __typename
}

fragment SolidNodeInvocationFragment on Solid {
  name
  isDynamicMapped
  inputs {
    definition {
      name
      __typename
    }
    isDynamicCollect
    dependsOn {
      definition {
        name
        type {
          displayName
          __typename
        }
        __typename
      }
      solid {
        name
        __typename
      }
      __typename
    }
    __typename
  }
  outputs {
    definition {
      name
      __typename
    }
    dependedBy {
      solid {
        name
        __typename
      }
      definition {
        name
        type {
          displayName
          __typename
        }
        __typename
      }
      __typename
    }
    __typename
  }
  __typename
}

fragment SolidNodeDefinitionFragment on ISolidDefinition {
  __typename
  name
  metadata {
    key
    value
    __typename
  }
  inputDefinitions {
    name
    type {
      displayName
      __typename
    }
    __typename
  }
  outputDefinitions {
    name
    isDynamic
    type {
      displayName
      __typename
    }
    __typename
  }
  ... on SolidDefinition {
    configField {
      configType {
        key
        description
        __typename
      }
      __typename
    }
    __typename
  }
  ... on CompositeSolidDefinition {
    inputMappings {
      definition {
        name
        __typename
      }
      mappedInput {
        definition {
          name
          __typename
        }
        solid {
          name
          __typename
        }
        __typename
      }
      __typename
    }
    outputMappings {
      definition {
        name
        __typename
      }
      mappedOutput {
        definition {
          name
          __typename
        }
        solid {
          name
          __typename
        }
        __typename
      }
      __typename
    }
    __typename
  }
}
The PipelineExplorerRootQuery above also fails, as does this one:
query InstanceWarningQuery {
  instance {
    ...InstanceHealthFragment
    __typename
  }
}

fragment InstanceHealthFragment on Instance {
  daemonHealth {
    id
    ...DaemonHealthFragment
    __typename
  }
  __typename
}

fragment DaemonHealthFragment on DaemonHealth {
  id
  allDaemonStatuses {
    id
    daemonType
    required
    healthy
    lastHeartbeatErrors {
      __typename
      ...PythonErrorFragment
    }
    lastHeartbeatTime
    __typename
  }
  __typename
}

fragment PythonErrorFragment on PythonError {
  __typename
  message
  stack
  cause {
    message
    stack
    __typename
  }
}
@jordan, do you have any advice on how to proceed? Should I somehow allow dagit to wait longer than 1 minute for the GraphQL response? Or is 1 minute already very long, which would point to a malfunctioning dagit service or RDS?
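One way to reason about the 1-minute question is to put the known timeouts side by side: a slow request is cut off by whichever layer in the path gives up first. The sketch below uses only the two numbers mentioned in the thread. It assumes `--db-statement-timeout` is given in milliseconds, and the layer names are labels chosen for this example.

```python
def tightest_timeout(timeouts_seconds):
    """Return the (layer, seconds) pair with the smallest timeout."""
    return min(timeouts_seconds.items(), key=lambda item: item[1])


# Values from the thread; the layer names are labels chosen for this sketch.
timeouts = {
    # --db-statement-timeout 600000, assumed to be in milliseconds.
    "postgres statement timeout": 600000 / 1000,
    # The 504 arrives after exactly 1 minute, so something in front of Dagit
    # gives up at 60 seconds.
    "observed 504 cutoff": 60,
}

layer, seconds = tightest_timeout(timeouts)
print(f"{layer}: {seconds} s")  # prints "observed 504 cutoff: 60 s"
```

Under that assumption the database is allowed 600 seconds per statement, ten times longer than the point at which the request is actually dropped, so raising the statement timeout further would not change the 504.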
j
I suspect this is more related to the size of the pipelines than anything particular to ECS. Even with your extended postgres timeout, are you sure the postgres query isn’t timing out?
I also imagine the graphql response could be quite large - perhaps continuing to increase the memory on the dagit task might help?
m
Hi! Thanks for your help. It turned out to be the load balancer. It was a combination of the load balancer killing the connection after 60 seconds of idle time and health checks that were too tight.
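For readers hitting the same problem, here is a hedged sketch of how those two settings could be changed with boto3. It assumes the load balancer is an Application Load Balancer (the 504 responses suggest one, but the thread does not say). The function names and the numbers (300, 30, 10, 5) are illustrative and not taken from the thread. The same values can also be set in the CloudFormation template instead.

```python
def idle_timeout_attributes(seconds):
    """Build the Attributes list for elbv2 modify_load_balancer_attributes."""
    return [{"Key": "idle_timeout.timeout_seconds", "Value": str(seconds)}]


def relaxed_health_check(interval=30, timeout=10, unhealthy_threshold=5):
    """Build keyword arguments for elbv2 modify_target_group."""
    return {
        "HealthCheckIntervalSeconds": interval,
        "HealthCheckTimeoutSeconds": timeout,
        "UnhealthyThresholdCount": unhealthy_threshold,
    }


def apply_settings(load_balancer_arn, target_group_arn, idle_seconds=300):
    """Apply both changes with boto3 (needs AWS credentials and permissions)."""
    import boto3  # imported lazily so the builders above work without boto3

    elbv2 = boto3.client("elbv2")
    # Let slow GraphQL responses take longer than the 60-second idle cutoff.
    elbv2.modify_load_balancer_attributes(
        LoadBalancerArn=load_balancer_arn,
        Attributes=idle_timeout_attributes(idle_seconds),
    )
    # Tolerate more failed or slow health checks before the task is replaced.
    elbv2.modify_target_group(
        TargetGroupArn=target_group_arn, **relaxed_health_check()
    )
```

Keeping the builders separate from `apply_settings` makes it possible to inspect the values before anything is sent to AWS.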