Docker and Python

Making them play nicely and securely for Data Science and Machine Learning

Tania Allard

July 23, 2020

Transcript

  1. DOCKER AND PYTHON
    Making them play nicely and securely for Data Science and Machine Learning
    TANIA ALLARD, PHD
    Sr. Developer Advocate @Microsoft. @ixek | https://bit.ly/europython-ml-docker

  2. @ixek
    @trallard
    trallard.dev

  3. THESE SLIDES
    https://bit.ly/europython-ml-docker

  4. WHAT YOU’LL LEARN TODAY
    - Why use Docker?
    - Docker for Data Science and Machine Learning
    - Security and performance
    - Do not reinvent the wheel, automate
    - Tips and tricks for using Docker

  5. WHY DOCKER?

  6. DEV LIFE WITHOUT DOCKER OR CONTAINERS
    Your application
    How are your users or colleagues meant to know what dependencies they need?
    ImportError: no module named x, y, z
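
The dependency question above can be made concrete with a small check of which modules are importable in the current environment. This is a minimal sketch: the `missing_modules` helper and the module names in the example list are illustrative assumptions, not something from the talk.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Hypothetical dependency list for an imaginary project.
    required = ["json", "pandas", "requests"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```

A check like this only reports the problem; it does not tell colleagues which versions to install, which is the gap containers aim to close.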

  7. WHAT IS DOCKER?
    A tool that helps you to create, deploy and run your applications or
    projects by using containers.
    This is a container

  8. HOW DO CONTAINERS HELP ME?
    They provide a solution to the
    problem of how to get software to
    run reliably when moved from one
    computing environment to another
    Your laptop
    Test environment
    Staging environment
    Production environment

  9. DEV LIFE WITH CONTAINERS
    Your application
    Libraries, dependencies,
    runtime environment,
    configuration files

  10. THAT SOUNDS A LOT LIKE A VIRTUAL MACHINE
    Each app is containerised
    At the app level: each runs as an isolated process
    [Diagram: INFRASTRUCTURE, HOST OPERATING SYSTEM, DOCKER, five APPs]

  11. THAT SOUNDS A LOT LIKE A VIRTUAL MACHINE
    CONTAINERS
    [Diagram: INFRASTRUCTURE, HOST OPERATING SYSTEM, DOCKER, five APPs]
    VIRTUAL MACHINES
    At the hardware level: full OS + app + binaries + libraries
    [Diagram: INFRASTRUCTURE, HYPERVISOR, VIRTUAL MACHINEs each with a GUEST OS and an APP]

  12. IMAGE VS CONTAINER
    - Image: archive with all the data needed to run the app
    - When you run an image it creates a container
    [Diagram: a Docker image (tags: latest, 1.0.2) becomes a container via $ docker run]

  13. COMMON PAIN POINTS IN DS AND ML
    - Complex setups / dependencies
    - Reliance on data / databases
    - Fast-evolving projects (iterative R&D process)
    - Docker is complex and it can take a lot of time to upskill
    - Are containers secure enough for my data / model / algorithm?

  14. DOCKER FOR DATA SCIENCE AND MACHINE LEARNING

  15. HOW IS IT DIFFERENT FROM WEB APPS FOR EXAMPLE?
    https://twitter.com/dstufft/status/1095164069802397696

  16. HOW IS IT DIFFERENT FROM WEB APPS FOR EXAMPLE?
    - Not every deliverable is an app
    - Not every deliverable is a model either
    - Heavily relies on data
    - Mixture of wheels and compiled packages
    - Security access levels - for data and software
    - Mixture of stakeholders: data scientists, software engineers, ML engineers

  17. BUILDING DOCKER IMAGES
    Dockerfiles are used to create Docker images by providing a set of instructions to install software, configure your image or copy files

  18. DISSECTING DOCKER IMAGES
    [Annotated Dockerfile: Base image, Main instructions, Entry command]

  19. DISSECTING DOCKER IMAGES
    Each instruction creates a layer (like an onion)
    [Diagram: layers INSTALL PANDAS, INSTALL REQUESTS and INSTALL FLASK on top of the BASE IMAGE]

  20. CHOOSING THE BEST BASE IMAGE
    https://github.com/docker-library/docs/tree/master/python
    If building from scratch use the official Python images
    https://hub.docker.com/_/python

  21. THE JUPYTER DOCKER STACK
    Need Conda, notebooks and the scientific Python ecosystem?
    Try Jupyter Docker stacks
    https://jupyter-docker-stacks.readthedocs.io/
    [Diagram: image hierarchy: ubuntu@SHA, base-notebook, minimal-notebook, then scipy-notebook and r-notebook, then tensorflow-notebook, datascience-notebook and pyspark-notebook, then all-spark-notebook]

  22. BEST PRACTICES
    - Always know what you are expecting
    - Provide context with LABELS
    - Split complex RUN statements and sort them
    - Prefer COPY (over ADD) to add files
    https://docs.docker.com/develop/develop-images/dockerfile_best-practices/

  23. SPEED UP YOUR BUILD
    - Leverage build cache
    - Install only necessary packages
    https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
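
Why instruction order matters for the build cache can be shown with a toy model. The sketch below is not Docker's implementation: it assumes, purely for illustration, that each layer's cache key depends on its own instruction and on its parent's key, so a change in one instruction invalidates every layer after it. The function names and the instruction strings (including the `(v1)` / `(v2)` markers standing in for file contents) are hypothetical.

```python
import hashlib


def layer_keys(instructions):
    """Return one cache key per instruction, each chained to its parent's key."""
    keys = []
    parent = ""
    for instruction in instructions:
        digest = hashlib.sha256((parent + "\n" + instruction).encode()).hexdigest()
        keys.append(digest)
        parent = digest
    return keys


def reusable_layers(old, new):
    """Count the leading layers of the new build whose keys match the old build."""
    count = 0
    for old_key, new_key in zip(layer_keys(old), layer_keys(new)):
        if old_key != new_key:
            break
        count += 1
    return count


if __name__ == "__main__":
    # Dependencies are installed before the frequently changing source is copied.
    before = [
        "FROM python:3.8-slim",
        "COPY requirements.txt (v1)",
        "RUN pip install -r requirements.txt",
        "COPY src (v1)",
    ]
    # Only the source code changed between the two builds.
    after = before[:3] + ["COPY src (v2)"]
    print(reusable_layers(before, after))  # 3
```

In this model, copying the source before installing dependencies would leave only the base layer reusable after every code change, which is why rarely changing steps usually go first.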

  24. SPEED UP YOUR BUILD AND PROOF
    - Leverage build cache
    - Install only necessary packages
    - Explicitly ignore files
    https://docs.docker.com/develop/develop-images/dockerfile_best-practices/

  25. MOUNT VOLUMES TO ACCESS DATA
    - You can use bind mounts to directories (unless you are using a database)
    - Avoid issues by creating a non-root user
    https://docs.docker.com/develop/develop-images/dockerfile_best-practices/

  26. SECURITY AND PERFORMANCE

  27. MINIMISE PRIVILEGE - FAVOUR LESS PRIVILEGED USER
    Lock down your container:
    - Run as non-root user (Docker containers run as root by default)
    - Minimise capabilities
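
A Python entry point can also report whether it was started with root privileges. This is a minimal sketch, assuming a POSIX system where `os.geteuid` is available (as in Linux containers); the `is_root` helper is hypothetical and complements, rather than replaces, configuring a non-root user for the image.

```python
import os


def is_root(euid):
    """Return True when the effective user id belongs to root."""
    return euid == 0


if __name__ == "__main__":
    # os.geteuid is only available on POSIX systems such as Linux containers.
    if is_root(os.geteuid()):
        print("Warning: running as root; prefer a non-root user.")
    else:
        print("Running as an unprivileged user.")
```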

  28. DON’T LEAK SENSITIVE INFORMATION
    Remember Docker images are like onions. If you copy keys in an intermediate layer they are cached.
    Keep them out of your Dockerfile.
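
One common way to keep keys out of the image is to hand them to the container when it starts and read them at runtime. The sketch below illustrates that idea under stated assumptions: the variable name `API_KEY` and the `read_secret` helper are hypothetical, and the talk does not prescribe this particular approach.

```python
import os


def read_secret(name, environ=None):
    """Return the secret stored in the given environment variable.

    Raises RuntimeError when the variable is unset or empty, so a missing
    secret fails loudly instead of falling back to a value baked into the image.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name, "")
    if not value:
        raise RuntimeError(f"secret {name!r} is not set")
    return value


if __name__ == "__main__":
    # Demonstration with an explicit mapping instead of the real environment.
    demo_environ = {"API_KEY": "not-a-real-key"}
    key = read_secret("API_KEY", demo_environ)
    print("secret loaded, length:", len(key))
```

With Docker, such a variable can be supplied when the container is started, for example through the `--env` option of `docker run`, so the value never appears in a Dockerfile instruction or an image layer.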

  29. USE MULTI STAGE BUILDS
    - Fetch and manage secrets in an intermediate layer
    - Not all your dependencies will have been packed as wheels, so you might need a compiler: build a compile and a runtime image
    - Smaller images overall

  30. USE MULTI STAGE BUILDS
    [Diagram: a compile-image and a runtime-image; the virtual environment is copied from the compile-image into the runtime-image]
    $ docker build --pull --rm -f "Dockerfile" \
    -t trallard:data-scratch-1.0 "."

  31. USE MULTI STAGE BUILDS
    [Diagram: the runtime-image is the FINAL IMAGE, trallard:data-scratch-1.0]

  32. AUTOMATE

  33. PROJECT TEMPLATES
    Need a standard project template?
    Use Cookiecutter Data Science: https://drivendata.github.io/cookiecutter-data-science/
    Or Cookiecutter Docker Science: https://github.com/docker-science/cookiecutter-docker-science

  34. DO NOT REINVENT THE WHEEL
    Leverage existing tools like repo2docker.
    Already configured and optimised for Data Science / Scientific computing.
    https://repo2docker.readthedocs.io/en/latest
    $ conda install jupyter repo2docker
    $ jupyter-repo2docker "."

  35. DO NOT REINVENT THE WHEEL
    Leverage existing tools like repo2docker.
    Already configured and optimised for Data Science / Scientific computing.
    https://repo2docker.readthedocs.io/en/latest

  36. DELEGATE TO YOUR CONTINUOUS INTEGRATION TOOL
    Set up continuous integration (Travis, GitHub Actions, whatever you prefer).
    And delegate your build - also build often.
    https://repo2docker.readthedocs.io/en/latest

  37. THIS WORKFLOW
    - Code in version control
    - Trigger on tag / also scheduled trigger
    - Build image
    - Push image

  38. TOP TIPS

  39. TOP TIPS
    1. Rebuild your images frequently - get security updates for system packages
    2. Never work as root / minimise the privileges
    3. You do not want to use Alpine Linux (go for buster, stretch or the Jupyter stack)
    4. Always know what you are expecting: pin / version EVERYTHING (use pip-tools, conda, poetry or pipenv)
    5. Leverage build cache
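
The "pin / version EVERYTHING" tip can be checked mechanically. Below is a minimal sketch that flags requirement lines without an exact `==` pin. It is a simplification under stated assumptions: it only understands plain `name==version` lines (optionally with extras), skips pip option lines, and would flag lines carrying environment markers or hashes. The helper name and the sample requirements are hypothetical.

```python
import re

# Matches "name==version", optionally with extras such as "name[extra]==version".
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9._,-]+\])?==[^=\s]+$")


def unpinned_requirements(lines):
    """Return the requirement lines that are not pinned to an exact version."""
    unpinned = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned


if __name__ == "__main__":
    # Hypothetical requirements.txt content.
    sample = [
        "pandas==1.0.5",
        "requests>=2.0",
        "flask  # web framework",
        "numpy==1.19.0",
    ]
    print(unpinned_requirements(sample))  # ['requests>=2.0', 'flask']
```

Tools such as pip-tools, named in the tip above, generate fully pinned files, so a check like this is mainly useful as a guard in continuous integration.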

  40. TOP TIPS
    6. Use one Dockerfile per project
    7. Use multi-stage builds - need to compile code? Need to reduce your image size?
    8. Make your images identifiable (test, production, R&D) - also be careful when accessing databases and using ENV variables / build variables
    9. Do not reinvent the wheel! Use repo2docker
    10. Automate - no need to build and push manually
    11. Use a linter

  41. THANK YOU
    @ixek
    @trallard
    trallard.dev
