Beam is a job orchestration and data processing framework from the Apache Software Foundation; from a bit of reading, it seems similar to Airflow.
Looking for any good book recommendations on the Apache Beam Python SDK. I know I can read the documentation, but it is quite scattered and requires a lot of navigation to build a foundation or mental model.
The books I did come across focus on the Java SDK, which is not what I want. Hence I'm wondering if anyone can recommend a book focusing on the Python API.
Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing
This is the only good book that will allow you to understand how Beam works. It’s written by one of the engineers working on Beam.
Also, the Python SDK is not great if you need pipelines that scale well and work with non-Google sources and sinks like Kafka, Postgres, or ClickHouse. We tried it, but it's expensive to run and not very reliable. Haven't tried the Java SDK, though… maybe it's better.
At the company I work for, we switched to Apache Flink on Java. It works better and is very reliable and consistent.
I am not aware of a particular book that covers this, but I can say that "Building Machine Learning Pipelines" by Hapke and Nelson has some sections on Apache Beam (chapters 2 and 11).
It’s basically a book on the Google Machine Learning ecosystem (TFX, Kubeflow), which is based on Beam.
Funnily enough, I was going through that book, and it is the one that motivated me to get a deeper understanding of Beam itself.
TBH, I don't know how much time I would invest in learning it. I think it's an amazing piece of engineering, but Google is losing ground pretty much everywhere. People used to speak of the "three major cloud providers: AWS, Azure, and GCP"; these days it's only AWS and Azure. TensorFlow is losing to PyTorch. I don't know anyone using TFX; people use MLflow, Metaflow, Great Expectations/pandera, … Kubernetes has a reputation for being too complex, so people would rather use ECS. Same goes for Kubeflow; people use Airflow or Step Functions. And people don't care about unified batch/stream processing; they'd rather exchange data on S3 or use Kinesis. Spark is dead anyway; people would rather use Dask…
Beam is kinda outdated, but look at the resources in r/dataengineering
Why would you say it’s outdated? I don’t see any other technology that has a similar scope as Beam.
Correction: I meant to say niche. As a DE, it's not one of the tools I think about in the space. I've heard of it and played around with it, but it's definitely not something you often hear being used.
And its Python API is very, very immature and poorly documented.
Okay, that I would agree with.