Brian Pohl (02/07/2023, 12:33 AM)
print statements at the end of my script never ran. It just stopped somewhere in the middle. Sure enough, not all of my files were loaded, confirming that we never reached the end of the script. I will paste the last few lines of the logs in the thread 🧵
I had an issue like this recently with an unrelated Python script. I think memory leaks cause the Python kernel to crash; when this happens, the Python process ends without raising any error message, and the exit code for the process is 247.

daniel (02/07/2023, 3:04 AM)

Brian Pohl (02/07/2023, 10:21 PM)
k8s_job_executor.
I just updated my original message: I managed to run an experiment that caused my Python kernel to crash the same way it did not long ago, and I found that the exit code is 247. If that were to happen in my Kubernetes pod, would Dagster pick it up as a failure?
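One possible reading of that 247, offered as a guess rather than a confirmed diagnosis: when an exit status is truncated to an unsigned byte, -9 (death by SIGKILL, the signal the Linux OOM killer sends) becomes 256 - 9 = 247, which would fit the memory-leak theory. A minimal sketch reproducing the number on a POSIX system:

```python
import subprocess
import sys

# Spawn a child process that gets SIGKILLed, the way the kernel OOM
# killer would kill it. This is an illustration, not Dagster's code.
child = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)

# subprocess reports death-by-signal as a negative returncode;
# truncated to an unsigned byte, -9 shows up as 256 - 9 = 247.
print(child.returncode)        # -9
print(child.returncode % 256)  # 247
```

A process killed this way gets no chance to run exception handlers or print a traceback, which matches logs that simply stop.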
here's a screenshot of my terminal at the end of my experiment. Again, the script I was running here is not related to Dagster or to the script that failed, but I suspect the same thing happened: the script just stops running with no error reported.

daniel (02/21/2023, 10:34 PM)

Brian Pohl (02/21/2023, 10:34 PM)

daniel (02/21/2023, 10:35 PM)

Brian Pohl (02/21/2023, 10:36 PM)

daniel (02/21/2023, 10:37 PM)

Brian Pohl (02/21/2023, 10:40 PM)

daniel (02/21/2023, 10:40 PM)

Brian Pohl (02/21/2023, 10:40 PM)
> could this have happened while a resource was cleaning itself up or something?
hmmm, like what? can you give an example of that?

daniel (02/21/2023, 10:41 PM)

Brian Pohl (02/21/2023, 10:42 PM)

daniel (02/21/2023, 10:43 PM)
@resource
@contextmanager
def db_connection():
    try:
        db_conn = get_db_connection()
        yield db_conn
    finally:
        ...  # crash here
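daniel's point can be demonstrated with a plain `contextlib.contextmanager`, no Dagster needed; the names below are illustrative:

```python
from contextlib import contextmanager

events = []

@contextmanager
def db_connection():
    events.append("open")
    try:
        yield "conn"            # the caller's body runs while we are suspended here
    finally:
        events.append("close")  # teardown runs only after the body finishes

with db_connection() as conn:
    events.append("op body")

print(events)  # ['open', 'op body', 'close']
```

The teardown in `finally` runs strictly after the body completes, so a crash during resource cleanup would happen after the op's own work had already finished.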
Brian Pohl (02/21/2023, 10:46 PM)
dbt_cli_resource, snowflake_resource, s3_resource, and s3_pickle_io_manager.

Regarding it continuing after the finish: I wish it had. But we noticed that not all the files were loaded into our Typesense cluster, and sure enough, when I pulled the logs, the data stopped loading right where the logs had suddenly stopped.

daniel (02/21/2023, 10:48 PM)

Brian Pohl (02/21/2023, 10:49 PM)
def ingest_data(export_dict, host, collection_name, api_key):
    '''Downloads data from S3 based on the configurations in export_dict, then iterates
    through each row of the downloaded data, writing the rows as documents into the
    specified Typesense collection.'''
    s3, response = get_files(export_dict)
    client = setup_typesense_client(host, api_key)
    start = time.time()
    if 'Contents' in response.keys():
        for item in response['Contents']:
            print('downloading file: ' + item['Key'])
            # overwrites address.csv with each downloaded file
            s3.download_file(EXPORT_BUCKET, item['Key'], 'tmp/address.csv')
            print('finished downloading file')
            with open('tmp/address.csv') as f:
                reader = csv.DictReader(f)
                records = []
                i = 0
                # iterate through the CSV, create documents, and write them into Typesense in batches of 100,000
                for row in reader:
                    record = create_document(row)
                    i += 1
                    records.append(record)
                    if len(records) == 100000:
                        # write this batch of 100,000
                        print("Sending", len(records))
                        client.collections[collection_name].documents.import_(records, {'action': 'upsert'})
                        records = []
                # write any leftovers that didn't make it up to 100,000
                print("Sending", len(records))
                client.collections[collection_name].documents.import_(records, {'action': 'upsert'})
    print('done in: ' + str(time.time() - start))
@op(required_resource_keys={'input_values', 's3'})
def typesense_address_handler(context, logged_export_timestamp):
    # there is a lot of variable declaration stuff, and then:
    # Download exported S3 data and load it into Typesense.
    ingest_data(address_export_dict, ts_host, ts_collection, ts_key)
the Sending and downloading file print statements are happening, but it never prints done in:
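The batch-and-flush pattern inside ingest_data can be isolated into a small helper to make it easier to reason about; names and the tiny batch size here are illustrative, not from Brian's code:

```python
def batched_import(rows, send, batch_size=3):
    """Accumulate rows and call send() on each full batch, then flush leftovers."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            send(batch)
            batch = []
    if batch:
        # flush any leftovers smaller than batch_size
        send(batch)

sent = []
batched_import(range(7), sent.append)
print(sent)  # [[0, 1, 2], [3, 4, 5], [6]]
```

The leftover flush always runs when the loop completes, so a run that prints Sending but never done in: is consistent with the process dying mid-loop rather than with a logic error in the batching.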
daniel (02/21/2023, 10:50 PM)

Brian Pohl (02/21/2023, 10:51 PM)