Beam is a job orchestration and data processing framework from the Apache Software Foundation. From a bit of reading, it seems similar to Airflow.

I'm looking for good book recommendations on the Apache Beam Python SDK. I know I can read the documentation, but it seems quite scattered and takes a lot of navigation to build a foundation or mental model.

The books I did come across focus on the Java SDK, which is not what I want. Hence I'm wondering if anyone can recommend a book focusing on the Python API.

  • marsupiq@fediverser.communick.dev
    1 year ago

    I am not aware of a particular book that covers this, but I can say that “Building Machine Learning Pipelines” by Hapke/Nelson has some sections on Apache Beam (chapters 2 and 11).

    It’s basically a book on the Google machine learning ecosystem (TFX, Kubeflow), parts of which are built on Beam.

      • marsupiq@fediverser.communick.dev
        1 year ago

        TBH I don’t know how much time I would invest in learning it. I think it’s an amazing piece of engineering, but Google is pretty much losing ground everywhere. I noticed people used to speak of the “three major cloud providers: AWS, Azure and GCP”; these days it’s only AWS and Azure. And TensorFlow is losing to PyTorch. I don’t know anyone using TFX; people use MLflow, Metaflow, Great Expectations/pandera, … Kubernetes has the reputation of being too complex, so people would rather use ECS. The same goes for Kubeflow: people use Airflow or Step Functions. And people don’t care about having unified batch/stream processing; they would rather exchange data on S3 or use Kinesis. Spark is dead anyway, people would rather use Dask…