s
@max still getting resource unavailable
Copy code
compute_logs:
  module: dagster.core.storage.local_compute_log_manager
  class: NoOpComputeLogManager
  config:
    base_dir: /tmp/datafarm
a
hmm, based on the stack trace it looks like the config update wasn't respected. how is the instance grabbed in datafarm/utils/pipeline_partition_runner.py and what is the lifecycle of the process calling it?
s
instance = DagsterInstance.get() with DAGSTER_HOME set
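(As a rough sketch of that load path, assuming the 0.7.x-era API and not the exact code from this thread: DagsterInstance.get() builds the instance from $DAGSTER_HOME/dagster.yaml, so the compute_logs block above only applies to processes that had DAGSTER_HOME exported before the call; printing the manager type is one way to confirm the config was actually picked up.)
Copy code
# Illustrative check: confirm which compute log manager the instance
# actually loaded from $DAGSTER_HOME/dagster.yaml.
import os

from dagster import DagsterInstance

assert os.getenv('DAGSTER_HOME'), 'DAGSTER_HOME must point at the dir containing dagster.yaml'

instance = DagsterInstance.get()
# If the yaml above was respected, this should print NoOpComputeLogManager.
print(type(instance.compute_log_manager).__name__)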
a
messages are pickled to disk and the pipeline is called
how is the pipeline called?
like is it happening in a new process, thread, in line?
s
single thread/in line
a
hmm ok - and did the process that spawns the thread restart when you deployed the new config?
if you could share a code snippet for how the thread for pipeline execution is spun up, that would be useful
s
like this?
Copy code
ingestion_pipeline = ExecutionTargetHandle.for_pipeline_module(
    'datafarm.pipelines',
    pipeline_config.name,
).build_pipeline_definition()

environment_dict = pipeline_config.pipeline_partition.environment_dict_for_partition(partition)
tags = pipeline_config.pipeline_partition.tags_for_partition(partition)
instance = DagsterInstance.get()

logger.info(f'{pipeline_config.short_name} Started')

execute_pipeline(
    ingestion_pipeline,
    environment_dict=environment_dict,
    tags=tags,
    mode=EP_ENV,
    instance=instance,
)
a
i mean the code that sets up the thread where i assume this code above is invoked
for context - we had an issue like this internally due to a dictionary of thread objects that we were failing to delete, and since we held that reference all of the files in the thread were held open indefinitely
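(for illustration only, with hypothetical names - the pattern described above looks roughly like this: as long as the dict keeps a reference to each finished Thread, anything the thread still references, including open files, stays alive)
Copy code
import threading


class Runner:
    def __init__(self):
        # The leak: finished Thread objects (and everything they reference)
        # are kept in this dict forever unless something deletes them.
        self._threads = {}

    def start_job(self, job_id, target):
        thread = threading.Thread(target=target, name=f'job-{job_id}')
        self._threads[job_id] = thread
        thread.start()

    def reap(self, job_id):
        # The missing step in the bug described above: drop the reference
        # once the thread has finished so its resources can be released.
        thread = self._threads.get(job_id)
        if thread is not None and not thread.is_alive():
            thread.join()
            del self._threads[job_id]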
s
Copy code
while self._keep_running:
    with self._statsd.timer('extract_kafka_messages'):
        start = time.time()
        kafka_messages = list()
        while (
                len(kafka_messages) <= max_messages_per_run
                and
                time.time() - start < consumer_poll_duration_ms / 1000
                and
                self._keep_running
        ):
            message_set = consumer.poll(
                timeout_ms=poll_timeout_ms,
            )
            for _, messages in message_set.items():
                if not self._keep_running:
                    # dont get more messages if
                    # shutdown signal received
                    break

                for message in messages:
                    kafka_messages.append(message)

    if not self._keep_running:
        # dont run pipeline if shutdown signal received
        break

    if kafka_messages:
        self.run_pipeline(kafka_messages)
        with self._statsd.timer('commit_kafka_messages'):
            consumer.commit()
Copy code
def run_pipeline(self, kafka_messages):
    with open(self._save_path, 'wb') as f:
        pickle.dump(kafka_messages, f)

    PipelinePartitionRunner.run_pipeline(
        partition=self._partition,
        pipeline_config=self._pipeline_config,
        logger=self._logger,
        sentry=self._sentry,
        statsd=self._statsd,
    )
I don't use any threads
the process is invoked via python and then runs in line
m
and the instance is retrieved when you call PipelinePartitionRunner.run_pipeline?
s
yes
should i only be instantiating that once?
a
shouldn't matter. it is odd that it didn't seem to pick up the new config
the process is invoked via python
what is the deploy setup? do you have ssh access to the machine?
s
yup
a process starts the consumer script and it's long-lived
a
can you verify that only one copy of this process is running on the machine (assuming that's expected) and then look at the open file descriptors for the pid
ls /proc/$pid/fd
s
Copy code
ls /proc/112763/fd
0    104  110  117  123  13   136  142  149  155  161  168  174  180  187  193  2    205  211  218  224  24  30  37  43  5   56  62  69  75  81  88  94
1    105  111  118  124  130  137  143  15   156  162  169  175  181  188  194  20   206  212  219  225  25  31  38  44  50  57  63  7   76  82  89  95
10   106  112  119  125  131  138  144  150  157  163  17   176  182  189  195  200  207  213  22   226  26  32  39  45  51  58  64  70  77  83  9   96
100  107  113  12   126  132  139  145  151  158  164  170  177  183  19   196  201  208  214  220  227  27  33  4   46  52  59  65  71  78  84  90  97
101  108  114  120  127  133  14   146  152  159  165  171  178  184  190  197  202  209  215  221  228  28  34  40  47  53  6   66  72  79  85  91  98
102  109  115  121  128  134  140  147  153  16   166  172  179  185  191  198  203  21   216  222  229  29  35  41  48  54  60  67  73  8   86  92  99
103  11   116  122  129  135  141  148  154  160  167  173  18   186  192  199  204  210  217  223  23   3   36  42  49  55  61  68  74  80  87  93
a
lol that's not useful - uhhh
lsof -p 112763
s
pretty big
a
that seems reasonable - no red flags there
well there does seem to be a lot of open sockets
batch22sj.prod.easypo.net:50112->klogs1sj.prod.easypo.net:40172
s
one thing to note, yesterday i was getting both the BlockingIO and the can't start new thread error
now i'm only getting the BlockingIO error
a
but 455 shouldn't get us close to the limit of 1024
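(the arithmetic here, sketched for a Linux box: count the entries under /proc/<pid>/fd - the same listing as above - and compare with the process's soft RLIMIT_NOFILE, which is commonly 1024)
Copy code
import os
import resource

# Open descriptors for this process vs. its soft/hard fd limits.
open_fds = len(os.listdir('/proc/self/fd'))
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f'{open_fds} open fds, soft limit {soft}, hard limit {hard}')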
s
speak of the devil and it shall appear, just got the can't start new thread error
a
what's the trace for the new thread error?
never mind, found it in the other message
I still don’t understand why the NoOpComputeLogManager doesn’t seem to be getting used
worth double checking $DAGSTER_HOME - you should have storage and history directories in there full of stuff if it's working as expected
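(the same check, sketched in Python; if local_artifact_storage points somewhere other than $DAGSTER_HOME in dagster.yaml, check that directory instead)
Copy code
import os

# Look for the directories a working instance is expected to create.
root = os.environ['DAGSTER_HOME']
for name in ('storage', 'history'):
    path = os.path.join(root, name)
    print(path, 'exists' if os.path.isdir(path) else 'MISSING')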
s
Copy code
echo $DAGSTER_HOME
/srv/datafarm
and dagster.yaml is in /srv/datafarm
would the storage and history be the local_artifact_storage?
a
what else do you have configured in the dagster.yaml?
s
a
ah, you redirect to /tmp/datafarm, but that directory is full of stuff
s
i have to cause we're only allowed to write to /tmp in prod
a
ya that's cool
alright what else can we check
pstree
s
but there's no storage or history in /tmp/datafarm
a
pstree -p 112763
pstree 112763
s
looks like that one got restarted
Copy code
pstree -p 92629
ingest(92629)─┬─stdin2epilog(92635)
              ├─stdin2epilog(92636)
              └─{ingest}(92658)
a
pstree -s python
s
what is -s trying to do?
Copy code
pstree -s python
pstree: invalid option -- 's'
Usage: pstree [ -a ] [ -c ] [ -h | -H PID ] [ -l ] [ -n ] [ -p ] [ -u ]
              [ -A | -G | -U ] [ PID | USER ]
       pstree -V
Display a tree of processes.

    -a     show command line arguments
    -A     use ASCII line drawing characters
    -c     don't compact identical subtrees
    -h     highlight current process and its ancestors
    -H PID highlight this process and its ancestors
    -G     use VT100 line drawing characters
    -l     don't truncate long lines
    -n     sort output by PID
    -p     show PIDs; implies -c
    -u     show uid transitions
    -U     use UTF-8 (Unicode) line drawing characters
    -V     display version information
    -Z     show SELinux security contexts
    PID    start at this PID; default is 1 (init)
    USER   show only trees rooted at processes of this user
a
man, alright, i'm running out of ideas
i guess if you are only allowed to write to /tmp, there could be other strict restrictions on the number of threads per process or the number of file descriptors
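(a sketch of how those limits could be inspected from inside the consumer process on Linux - "can't start new thread" typically means the kernel refused a new task, which is governed by RLIMIT_NPROC or a cgroup pid limit rather than anything Dagster-specific)
Copy code
import resource
import threading

# Soft/hard caps that could explain the errors above, plus the current
# Python thread count for reference.
nofile = resource.getrlimit(resource.RLIMIT_NOFILE)   # open file descriptors
nproc = resource.getrlimit(resource.RLIMIT_NPROC)     # processes/threads per user
print(f'RLIMIT_NOFILE={nofile} RLIMIT_NPROC={nproc} '
      f'live_python_threads={threading.active_count()}')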
given the code you’ve shared - it looks like the whole program will stop when execute_pipeline throws these errors?
s
yes, they bubble to the top
why is the NoOpComputeLogManager trying to start new threads/subprocesses?
a
it shouldn’t be
i have no idea how you are getting to dagster/core/storage/compute_log_manager.py, line 57
it should bail at line 52
s
and we know the dagster.yaml file is being picked up properly as it is writing the runs to /tmp/datafarm
a
ya that and i assume you are seeing new runs and stuff in the database you are pointing at
s
yup, it's chugging along great for the most part. guess i'll just pass on these two exceptions and retry ¯\_(ツ)_/¯
a
cc @prha
i am guessing you will get stuck with repeat failures if you try that
what would be good is to try to capture some of the information like the above when the exception happens
open file descriptors and threads - even though the local compute log manager shouldn't be on, I also see no reason for us to be exhausting these resources unless we are leaking something
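(one way to wire that capture in, sketched with hypothetical names around the execute_pipeline call shown earlier - the lsof subprocess, logger, and exception types are assumptions, not the exact code used here; requires Python 3.7+ for capture_output)
Copy code
import os
import subprocess
import threading

def snapshot_process_state():
    # Grab the info suggested above at the moment the exception happens.
    pid = os.getpid()
    state = {
        'open_fds': len(os.listdir(f'/proc/{pid}/fd')),
        'live_threads': threading.active_count(),
    }
    try:
        # Best effort: lsof may not be installed or may time out.
        state['lsof'] = subprocess.run(
            ['lsof', '-p', str(pid)],
            capture_output=True, text=True, timeout=30,
        ).stdout
    except (OSError, subprocess.TimeoutExpired):
        state['lsof'] = '<lsof unavailable>'
    return state

# Hypothetical usage around the earlier snippet:
# try:
#     execute_pipeline(ingestion_pipeline, environment_dict=environment_dict,
#                      tags=tags, mode=EP_ENV, instance=instance)
# except (BlockingIOError, RuntimeError):
#     logger.error('pipeline failed: %s', snapshot_process_state())
#     raise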
s
kk, i'll add the lsof output to the exceptions logged to sentry
i can sidestep all of this if i need to by running in ephemeral mode (not passing a dagster instance), right?
a
we were able to repro the NoOpComputeLogManager bug and I think you would still hit that in “ephemeral” mode since that's the one it uses
on track to have a fix out today in 0.7.16
s
sweeeet
a
still want to know why we're exhausting resources… BUT this should get you back in a good state
s
yeah, i added an lsof subprocess call on those two exceptions
a
awesome send over any info if you hit it
s
will do
a
the bug only affects loading from the config path, so you could do an ephemeral instance if that is helpful for the next handful of hours
s
this is all i got from lsof
p
@Sam Rausser Just released 0.7.16, which should resolve the issue of NoOpComputeLogManager behaving like the default compute log manager…. The config is slightly changed, as you no longer need to specify a base_dir in the config.
s
awesome, i'll give it a spin
what did i do wrong?