
Ah, it seems like you're imagining Hamilton to be used to structure sort of "small" pieces, which might then be orchestrated into a Big Picture thing by another tool, something in the Airflow/Dagster/Argo/Flyte class. Or perhaps a paid service offered by DAGWorks in the future...

One, that's reasonable, and as you say there are code organization and testing benefits. I would emphasize that that's the recommended pattern. I would also work to establish, and document, the details of how folks should go about doing that, and provide solid examples. (BTW your "air quality analysis" example is quite good, being far from trivial yet still example-sized in complexity.)

Two, ehhhhhhh I'm a little skeptical of most teams' ability to factor their projects that well. Folks will want to re-use outputs that are seen as useful, especially if they are expensive to compute. This causes DAG scope to grow and grow and grow. DBT in particular is vulnerable to this, and I have been told of 1000-model projects, which is just yuck. This isn't a problem you have to solve right now, but it's worth thinking about. As a motivating example, what if someone wanted to take the p-value output by the air quality example, and use that as an input into [some other thing]? What would be the "right" way to express that?



> Ah, it seems like you're imagining Hamilton to be used to structure sort of "small" pieces, which might then be orchestrated into a Big Picture thing by another tool, something in the Airflow/Dagster/Argo/Flyte class. Or perhaps a paid service offered by DAGWorks in the future...

Yep. Hamilton is good at modeling the "micro". You can also express the "macro" via Hamilton, and then later determine how to "cut" it up for execution on airflow/dagster/etc.

> One, that's reasonable, and as you say there are code organization and testing benefits. I would emphasize that that's the recommended pattern. I would also work to establish, and document, the details of how folks should go about doing that, and provide solid examples. (BTW your "air quality analysis" example is quite good, being far from trivial yet still example-sized in complexity.)

Yep, thanks! Documentation is something we're slowly chipping away at, and that's a good point regarding the example. I think I'll take that phrasing "far from trivial yet still example-sized in complexity" as a goal for future examples.

> Two, ehhhhhhh I'm a little skeptical of most teams' ability to factor their projects that well.

Agreed. Though we hope the focus on "naming" and forced "python module curation" help nudge people into better patterns than just appending to that SQL/Pandas script :)

> Folks will want to re-use outputs that are seen as useful, especially if they are expensive to compute. This causes DAG scope to grow and grow and grow. DBT in particular is vulnerable to this, and I have been told of 1000-model projects, which is just yuck. This isn't a problem you have to solve right now, but it's worth thinking about.

Yep. Agreed. I think Hamilton's model scales a bit better than DBT's - one team at Stitch Fix manages over 4000 feature transforms in a single code base. Some of that comes from the fact that you can think in columns, tables, or arbitrary objects with Hamilton, and you have some extra flexibility with materialization (e.g. if you don't need a column, it isn't computed). But as you point out, for expensively computed things, you likely don't want to re-materialize them. Right now you can get at this manually: ask Hamilton what's required to compute a result, and if you have it cached/stored, retrieve it and pass it in as an override. We could also do more framework-y things, like global caching or connecting with data stores, to prevent unneeded re-computation...

> As a motivating example, what if someone wanted to take the p-value output by the air quality example, and use that as an input into [some other thing]? What would be the "right" way to express that?

The Hamilton way would be to express that dependency as a function in all cases. But, yes, do you recompute, or do you share the result (assuming I understood your point here)? Good question, and it's something we've been thinking about, and would love more design partnership on ;) -- since I think the answer changes a lot depending on the size of the company and the size of the data. There are nice things about not having to share intermediate data, but there are downsides too. I'm bullish, though, that with Hamilton we have the choice to go either way. The Hamilton DAG logically doesn't change; it's really a question of how computation/dependencies are satisfied.
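As a sketch of "express that dependency as a function": the downstream consumer simply declares a parameter named after the upstream output. The names below (`p_value`, `alert_needed`, the 0.03 stand-in value) are hypothetical, not taken from the air quality example.

```python
# Hedged sketch of the dependency-as-function idea. In this style, the
# downstream node consumes p_value purely via its parameter name.

def p_value() -> float:
    # stand-in for the statistical test's output (invented value)
    return 0.03

def alert_needed(p_value: float, significance: float = 0.05) -> bool:
    # a hypothetical "[some other thing]" that depends on p_value
    return p_value < significance

# Whether p_value is recomputed here or loaded from a store, the
# function graph is unchanged -- only how the input is satisfied.
print(alert_needed(p_value()))  # True
```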

@Elijah anything I missed?


Think you got it! Re factoring projects well, it's interesting, but I think there are some good strategies here. What's worked for us is working backwards -- starting with the artifact you want and progressively defining how you get there until you reach the data you need to load.

Thanks btw for all the feedback! This is great.



