# introductions
Hi! I work at Hugging Face, and we are on our way to enhancing the Hub dataset pages. We already show some details about the datasets, such as the first 100 rows, and we want to precompute more insights to display on the dataset page. The pre-processing steps are currently managed with ad-hoc code. We are investigating Dagster to manage the jobs/dependencies/storage/etc., which would help us focus more on the data processing itself. The project is open source, btw. Dagster is a very nice tool, and after reading the docs and doing some tests, I still have conceptual questions, so... possibly I'll be looking for help. Thanks for your patience!
👋 9
I wouldn't count them out -- personally, I think assets are the most compelling reason to use Dagster, and operating from an asset-first mindset keeps things much simpler in the long run. I started off going bananas on dynamic graphs and ops, and have moved to using assets for nearly everything. You can still use assets to generate changing outputs -- for example, you could have an `email_publisher` asset that gets materialized with different configurations. But there are certainly trade-offs.
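To make the "same asset, different configurations" idea concrete, here is a minimal sketch (mine, not from the thread), assuming a recent Dagster release where assets can take a Pydantic-style `Config` object; the `email_publisher` fields are purely hypothetical:

```python
# Minimal sketch: one asset definition, different outputs per materialization,
# depending on the run config supplied at launch time.
from dagster import Config, asset


class EmailPublisherConfig(Config):
    # Hypothetical fields, just for illustration.
    audience: str = "all-users"
    template: str = "weekly-digest"


@asset
def email_publisher(config: EmailPublisherConfig) -> str:
    # The output changes with the config passed when the asset is materialized.
    return f"sent '{config.template}' to {config.audience}"
```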
Oh OK, thanks. I’ll ask my question in #dagster-support with that in mind then
Hey Sylvain, I'm the lead eng on Dagster and a fan of Hugging Face. Let me know if it would be helpful to chat.
❤️ 1
Sure! I would love to. I'm not a data engineer and many concepts are new to me. I asked my first question in #dagster-support, but overall I'm still wondering how well Dagster (or other similar tools like Airflow) is suited to my problem, and if so, how to organize my code once it depends on Dagster (for example, how to trigger jobs: via the GraphQL API, directly in Python, etc.).
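As a rough illustration of the "trigger jobs directly in Python" option mentioned above, here is a minimal sketch (not from the conversation), assuming Dagster's Python API; the `dataset_insights` asset is a hypothetical stand-in for a Hub pre-processing step:

```python
# Minimal sketch of launching work in-process from Python.
from dagster import asset, materialize


@asset
def dataset_insights():
    # Hypothetical placeholder: compute statistics to show on a dataset page.
    return {"num_rows": 100}


if __name__ == "__main__":
    # materialize() executes the listed assets in-process; launching runs via
    # the web UI, schedules/sensors, or the GraphQL API are alternatives.
    result = materialize([dataset_insights])
    assert result.success
```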