How relevant are Airflow and similar to those of us who aren't operating at unic...

colinwilson · on March 1, 2017

I would say it can be used at its simplest as a replacement for cron. It supports running programs on a schedule and you can set concurrency rules, SLAs, and triggers around even single commands or programs. You also get nice graphs of task run time and email alerts if jobs take longer than the SLA you've set.

sidlls · on March 1, 2017

My opinion is: not necessarily, and probably unlikely. Airflow and Luigi are overkill unless you have a certain level of complexity in your system.

For perspective, the company I work at has tried both (as in we built products using each, and the one with Luigi is still in use). We operate on data in the < 10TB space used primarily for machine learning applications. Luigi and Airflow both introduced complexity that simply wasn't useful relative to our data flow. They both ended up getting in the way more than they helped and introduced developer overhead that wasn't justifiable.

However, they are both very nice tools, and it's easy to see how they can help reduce overall complexity when you have very large numbers of distributed, mostly-static task graphs. If that's how you consume/transform your files and data, either tool might be worth looking into.

caravel · on March 1, 2017

Many [most] of the companies using Airflow are small-ish scale, with maybe fewer than half a dozen people writing jobs that need to be scheduled. Airflow will bring clarity even to modest efforts, and will allow you to scale the processes as needed.

In terms of architecture, it's pretty straightforward to set up Airflow on one box and run all the services there until you have to grow out of it and scale out to having multiple workers.

gdulli · on March 1, 2017

Airflow is more about scaling your number of processes, not the size of your data. It's relevant if you want to have a higher-level system for managing processes independent of what those processes are.

amalag · on March 1, 2017

I have used Talend in the past, sounds like a fit. But this seems to fit a different need around job management.

caravel · on March 1, 2017

I'd argue for something programmatic over drag and drop for reasons described in this other article I wrote: https://medium.freecodecamp.com/the-rise-of-the-data-enginee...

gregn610 · on March 1, 2017

Talend makes my teeth grind. I don't understand why an ETL tool uses a strongly typed language for a foundation. The number of fun productive hours I've spent swapping chars to varchars, int to decimal & vice versa. In 2017 computers can read a registration plate from a blurry photograph and spot a criminal in a stadium, but a user puts an apostrophe in a CSV file and schmoo leaks everywhere.

busterarm · on March 1, 2017

I've been building ETL tools in my day-to-day work for the past 18 months and weak types are a disaster. In all of my tools, as soon as I extract data, the first thing I do is establish its type. I have great, reusable tools for type conversion that are set by rules for load.

Any CSV you're working with should be properly escaped anyway or you're bound for a world of pain.
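One way to picture "establish its type right after extract" is a small table of per-column conversion rules. This is only a sketch under assumptions: the column names, the `RULES` table and the `establish_types` helper are hypothetical and are not taken from any tool mentioned in this thread.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical per-column rules: column name -> converter applied right after extract.
RULES = {
    "id": int,
    "amount": Decimal,
    "signup": lambda s: datetime.strptime(s, "%Y-%m-%d").date(),
}


def establish_types(row, rules=RULES):
    """Convert every raw string in `row` using its rule; fail loudly on bad input."""
    typed = {}
    for column, convert in rules.items():
        raw = row[column].strip()
        try:
            typed[column] = convert(raw)
        except (ValueError, ArithmeticError) as exc:
            # Decimal raises InvalidOperation, which is an ArithmeticError.
            raise ValueError(f"column {column!r}: cannot convert {raw!r}") from exc
    return typed
```

Keeping the rules in data rather than in code is what would make such a converter reusable across feeds: a new feed would only need a new rules table.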

frugalmail · on March 1, 2017

Maybe we have drastically different use cases, but dynamic and weak typing are a disaster in the data space. Not sure why people build production systems in such languages.

If it matters anywhere, it matters in the data space. We don't see the disconnect between decimal and int, but when you're expecting a character and you get varchar (not sure about the apostrophe case, but I suspect you're talking about quotes and embedded commas), or the number of fields or composition of fields changes (e.g. col1:string "jack, dorsey" col2:int 156 and the parser sees col1:string "jack" col2: "dorsey, 156"), you want to know that it is broken ASAP.
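The "jack, dorsey" case can be reproduced with Python's standard `csv` module. This is a minimal sketch: the sample line and the `check_row` helper are made up for illustration, and the check is deliberately simple (field count plus an integer second column).

```python
import csv
import io

line = '"jack, dorsey",156\n'

# Naive splitting ignores the quotes, so the name is cut in two and fields shift.
naive = line.strip().split(",")

# The csv module honours the quotes and keeps the name in one field.
parsed = next(csv.reader(io.StringIO(line)))


def check_row(fields, expected_len=2):
    """Fail fast if the field count is wrong or the second column is not an int."""
    if len(fields) != expected_len:
        raise ValueError(f"expected {expected_len} fields, got {len(fields)}: {fields}")
    return fields[0], int(fields[1])
```

Calling `check_row(parsed)` succeeds, while `check_row(naive)` raises immediately because the naive split produced three fields. That is the "know it is broken ASAP" behaviour, as opposed to loading a shifted row.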

gregn610 · on March 1, 2017

If you enjoy making sure your peas don't touch the carrots, then sure, strong typing is great in the data space. But when you get woken up because a spreadsheet comes through as 11.0 rather than 11 or you have to type

    Double.parseDouble(x) == Double.parseDouble(y)
    /* instead of Python's */
    x == y

27 times to get a feed file parsed, then I'd say that the tools are lacking basic usability. And they look especially primitive in light of the handwriting-reading, photo-tagging, Go-playing, Super Mario-winning possibilities of ML.
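For the 11.0 versus 11 case, note that Python's `x == y` only treats the two as equal once both values are numbers. Raw strings read from a feed still compare as text. A small sketch, where `same_number` is a hypothetical helper and `Decimal` is just one reasonable choice for the comparison:

```python
from decimal import Decimal

# As text, the two spellings differ; as numbers, they are equal.
text_equal = "11.0" == "11"
number_equal = 11.0 == 11


def same_number(x, y):
    """Compare two feed values numerically, whether they arrive as str, int or float."""
    return Decimal(str(x)) == Decimal(str(y))
```

With this helper, `same_number("11.0", 11)` is true regardless of how the spreadsheet exported the cell, so the comparison is written once instead of at every call site.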

pas · on March 2, 2017

That means you don't have a clean "domain model" (or business logic layer) and a data filtering layer that creates the domain objects. You should apply business logic on the pure objects/entities, and you should make sure that the filtering handles the representational problems (parsing, data integrity, partial data, and so on).

And ideally, if something makes it through the data filtering layer into the logic layer and does not make sense there, then that should be handled. And that's where strong types help. They force you to handle these cases, even if that means logging/alerting/ignoring, but at least you'll have to make a decision when you write the logic, instead of at 3 AM.
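The split between a filtering layer and a logic layer could look roughly like this. It is a sketch under assumptions: the `Payment` entity, its fields, and both function names are invented for illustration. The filtering function deals with representation problems and returns either a clean object or `None`, so the caller has to decide what to do with rejects.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation
from typing import Optional


@dataclass(frozen=True)
class Payment:
    """Hypothetical domain entity; business logic only ever sees these."""
    account: str
    amount: Decimal


def parse_payment(raw: dict) -> Optional[Payment]:
    """Filtering layer: return a clean Payment, or None if the record is unusable."""
    account = (raw.get("account") or "").strip()
    try:
        amount = Decimal(str(raw.get("amount", "")).strip())
    except InvalidOperation:
        return None
    if not account or not amount.is_finite():
        return None
    return Payment(account, amount)


def total_by_account(records):
    """Logic layer: sums clean payments and counts rejects instead of hiding them."""
    totals, rejected = {}, 0
    for raw in records:
        payment = parse_payment(raw)
        if payment is None:
            rejected += 1  # a real pipeline would log or alert here
            continue
        totals[payment.account] = totals.get(payment.account, Decimal(0)) + payment.amount
    return totals, rejected
```

Because `parse_payment` is annotated as returning `Optional[Payment]`, a type checker can flag any caller that forgets the `None` case, which is the "forced to make a decision" effect described above.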

vira · on March 1, 2017

Strong typing helps. Keep in mind enterprise ETL tools are designed to move data from OLTP to OLAP databases, which are often strongly typed as well.

gregn610 · on March 1, 2017

When a tool insists that you cast char fields to varchars before you can test two fields for equality, or keeps changing all your decimals to floats, how is that helping? I'm saying that if the underlying language were loosely typed, those kinds of productivity saps & bug fountains would not happen. In the few instances where you care about type, a loosely typed language can usually offer something.

Last time I checked, enterprise ETL tools were sold as capable of a lot more than simple OLTP to OLAP. I find the reality provided is somewhat underwhelming. Given that Facebook can tell the difference between a photo of Dave and one of Jim, why do I have to manually provide a mask for every single date field flowing through an enterprise?

busterarm · on March 1, 2017

I don't use enterprise ETL tools. I pretty much write my own every time and occasionally I'll supplement that with things like Kiba.

frugalmail · on March 1, 2017

We're part of a large enterprise but we use Azkaban and Scala for our data movement. And strong typing is a must in our group.