Wednesday, June 7, 2023

Build a Python ecosystem for efficient and reliable development

Tl;dr: This blog post describes how we developed an efficient and reliable Python ecosystem using Pants, an open source build system, and how it addresses the challenge of managing Python applications at scale at Coinbase.

by Coinbase Computing Platform team

Python is one of the most frequently used programming languages for data scientists, machine learning practitioners, and blockchain researchers at Coinbase. Over the past few years, we have witnessed the growth of Python applications designed to solve many challenging problems in the cryptocurrency world, such as Airflow data pipelines, blockchain analytics tools, and machine learning applications. According to our internal data, the number of Python applications has nearly doubled since Q3 2022, and there are approximately 1,500 data processing pipelines and services developed in Python today. At the time of this writing, the total number of builds is about 500 per week. We expect wider adoption as more Python-centric frameworks (such as Ray, Modin, and Dask) are adopted into our data ecosystem.

Engineering success comes largely from choosing the right tools. Building a large-scale Python ecosystem to support our growing engineering needs poses several challenges, including the need for a reliable build system, flexible dependency management, fast software releases, and consistent code quality checks. These challenges can be addressed by integrating Pants, a build system developed by Toolchain Labs, into the Coinbase build infrastructure. We chose it as our Python build system for the following reasons:

  • Pants is ergonomic and user friendly.
  • Pants understands many build-related commands, such as “test”, “lint”, “fmt”, “typecheck”, and “package”.
  • Pants was designed with real-world Python use as a first-class use case, including the handling of third-party dependencies. In fact, parts of Pants itself are written in Python (the rest is written in Rust).
  • Pants requires less metadata and BUILD file boilerplate than other tools, thanks to dependency inference, sensible defaults, and automatic generation of BUILD files. Bazel, by comparison, requires a lot of handwritten BUILD boilerplate.
  • Pants is easily extensible, with a powerful plugin API that uses idiomatic Python 3 async code, so plugin authors get a natural control flow.
  • Pants has true OSS governance in which any organization can play an equal role.
  • Pants has a gentle learning curve and much less friction than other tools: a one-click installation experience for the tool, simple configuration files, and moderate maintenance costs.
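The goals named in the list above are run from the command line. The following is a minimal sketch of how such invocations might be composed; the helper `pants_command`, the `./pants` launcher path, and the `project1::` target are assumptions for illustration, and the goal names are simply those quoted above (some Pants releases name them differently).

```python
from typing import List, Sequence

# Goal names as quoted in the list above (assumption: they match the
# Pants version in use; check the documentation for your release).
GOALS = ("test", "lint", "fmt", "typecheck", "package")


def pants_command(
    goal: str,
    targets: Sequence[str] = ("::",),  # "::" selects every target in the repo
    launcher: str = "./pants",  # hypothetical launcher script location
) -> List[str]:
    """Build the argument list for one Pants goal without running it."""
    if goal not in GOALS:
        raise ValueError(f"unknown goal: {goal}")
    return [launcher, goal, *targets]


# Print the commands a developer might run before opening a pull request.
for goal in ("fmt", "lint", "test"):
    print(" ".join(pants_command(goal)))

# Restrict a goal to a single (hypothetical) project.
print(" ".join(pants_command("test", ["project1::"])))
```

The function only builds the argument list, so it can be handed to `subprocess.run` in a real script or inspected in a dry run.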

Python is one of the most popular programming languages for machine learning and data science applications. However, before we adopted Pants, a Python-first build system, our internal investment in the Python ecosystem was low compared to Golang and Ruby, Coinbase’s primary choices for writing services and web applications.

According to Coinbase’s monorepo usage statistics, Python accounts for only 4% of usage today because of the lack of build system support. Before 2021, most Python projects lived in multiple repositories without a unified build infrastructure, which led to the following problems:

  • Challenges of code sharing: The process for engineers to update shared libraries is complicated. Code changes are published to the internal PyPI server before they have been proven stable. Upgrading a library to a new version without adequate testing can break dependent projects that consume the library without a pinned version.
  • Lack of a streamlined release process: Code changes often require complex cross-repository updates and releases. There is no automated workflow to perform integration and staging tests for related changes. The lack of coherent observability and reliability imposes a huge engineering overhead.
  • Inconsistent development experience: The development experience varies widely because each repository has its own way of setting up virtual environments, checking code quality, building, deploying, and so on.
We decided to build PyNest, a new Python “monorepo” for Coinbase’s data organization. We do not intend to use PyNest as a single repository for the entire company; it is meant for projects within the data organization, for the following reasons:

  • Building a company-wide monorepo requires a team of elite engineers. We don’t have enough staff to replicate the monorepo success stories of Facebook, Twitter, and Google.
  • Python is mainly used within the data organization in the company. It is important to set the right scope so that we can focus on data priorities and not be distracted by ad hoc requirements. The PyNest build infrastructure can be reused by other teams to speed up work on their own Python repositories.
  • It is preferable to consolidate interdependent projects (see the dependency graph for ML Platform projects) into a single repository to prevent inadvertent circular dependencies.

    Figure 1. Dependency graph for Machine Learning Platform (MLP) projects.

  • While a monorepo promises a new world of productivity, it has proven not to be a long-term solution for Coinbase. The Golang monorepo is a lesson: after a year of use, problems emerged such as a sprawling codebase, failed IDE integrations, slow CI/CD, and outdated dependencies.
  • Open source projects should be kept in separate repositories.

The diagram below shows Coinbase’s repository architecture, where the green blocks represent the new Python ecosystem we have built. Interoperability between repositories is achieved through the service layer, including code artifacts and the schema registry.

    Figure 2. Coinbase’s repository architecture

    # Third-party dependencies
    ├── 3rdparty
    │   ├── dependency1
    │   │   ├── BUILD
    │   │   ├── requirements.txt
    │   │   └── resolve1.lock  # lockfile
    │   └── dependency2
    │       ├── BUILD
    │       ├── requirements.txt
    │       └── resolve2.lock
    # Shared libraries
    ├── lib
    # Top-level project folders
    ├── project1  # project name
    │   ├── src
    │   │   └── python
    │   │       ├── databricks
    │   │       │   ├── BUILD
    │   │       │   ├── OWNERS
    │   │       │   ├── gateway.py
    │   │       │   …
    │   │       └── notebook
    │   │           ├── BUILD
    │   │           ├── OWNERS
    │   │           ├── etl_job.py
    │   │           …
    │   └── test
    │       └── python
    │           ├── databricks
    │           │   ├── BUILD
    │           │   ├── gateway_test.py
    │           │   …
    │           └── notebook
    │               ├── BUILD
    │               ├── etl_job_test.py
    │               …
    ├── project2
    # Dockerfiles
    ├── dockerfiles
    # Tools for linting, formatting, etc.
    ├── tools
    # Buildkite CI workflow
    ├── .buildkite
    │   ├── pipeline.yml
    │   └── hooks
    # Pants scripts and configuration
    ├── pants
    ├── pants.toml
    └── pants.ci.toml

    Figure 3. PyNest repository structure

    Below is a list of the main elements of the repository, with an explanation of each.

    1. Third Party

    Third-party dependencies are placed in this folder. Pants parses each requirements.txt file and automatically generates a “python_requirement” target for every dependency. Pants’ multiple lockfiles feature supports multiple versions of the same dependency, which allows projects to have conflicting versions in their direct or transitive dependencies. Pants generates lockfiles to pin every dependency and ensure reproducible builds. Multiple lockfiles are explained further in the Dependency Management section.
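To make the idea concrete, here is a minimal sketch of what “one target per requirement, grouped by lockfile” means. This is not how Pants is implemented: the parser below is a deliberately simplified illustration, the `Requirement` type and `parse_requirements` helper are hypothetical, and the package names and versions are made up. Only the resolve names `resolve1` and `resolve2` echo the lockfiles in the repository layout.

```python
import re
from typing import List, NamedTuple


class Requirement(NamedTuple):
    name: str
    specifier: str  # e.g. "==1.5.3"; empty when the requirement is unpinned


# Simplified pattern: a distribution name followed by an optional specifier.
_REQ_RE = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)\s*(.*)$")


def parse_requirements(text: str) -> List[Requirement]:
    """Turn requirements.txt content into one entry per dependency."""
    requirements = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        match = _REQ_RE.match(line)
        if match:
            name, specifier = match.group(1).lower(), match.group(2).strip()
            requirements.append(Requirement(name, specifier))
    return requirements


# Two hypothetical resolves that pin conflicting versions of one library.
resolves = {
    "resolve1": parse_requirements("pandas==1.3.5\nrequests>=2.28\n"),
    "resolve2": parse_requirements("pandas==1.5.3  # newer pin\n"),
}

for resolve_name, requirements in resolves.items():
    for req in requirements:
        print(f"{resolve_name}: {req.name}{req.specifier}")
```

Because each resolve has its own lockfile, a project tied to `resolve1` and a project tied to `resolve2` can depend on different versions of the same library within one repository.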

    2. Library

    Shared libraries are accessible to all projects. Projects within PyNest can import the source code directly. Projects outside of PyNest can access a library by pip-installing its wheel file from the internal PyPI server.

    3. Project Folder

    Individual projects live in this folder. The folder path format is “{project_name}/{src or test}/python/{namespace}”. The source root is configured as “src/python” or “test/python”, and the namespace beneath it is used to isolate modules.
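The path convention above determines how a file maps to an importable module. The sketch below illustrates that mapping under the stated layout; the `locate_module` helper and `ModuleLocation` type are hypothetical names introduced for illustration, and this is not the logic Pants itself uses to resolve source roots.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class ModuleLocation(NamedTuple):
    project: str
    kind: str    # "src" or "test"
    module: str  # dotted import path relative to the source root


def locate_module(path: str) -> ModuleLocation:
    """Split a path of the form {project}/{src|test}/python/{namespace}/..."""
    parts = PurePosixPath(path).parts
    if len(parts) < 4 or parts[1] not in ("src", "test") or parts[2] != "python":
        raise ValueError(f"path does not follow the layout: {path}")
    project, kind = parts[0], parts[1]
    # Everything below the source root ("src/python" or "test/python")
    # forms the import path, starting with the namespace.
    relative = PurePosixPath(*parts[3:]).with_suffix("")
    return ModuleLocation(project, kind, ".".join(relative.parts))


print(locate_module("project1/src/python/notebook/etl_job.py"))
print(locate_module("project1/test/python/notebook/etl_job_test.py"))
```

Since the source root ends at “python”, the import path starts with the namespace, so `etl_job.py` is imported as `notebook.etl_job` rather than through the project folder name.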
    4. Code Owner File

    Code owner files (OWNERS) are added to folders to define the individuals or teams responsible for the code in that folder tree. The CI workflow invokes a script that compiles all OWNERS files into a CODEOWNERS file under “.github/”. The code owner approval rule requires every pull request to be approved by at least one group of code owners before it can be merged.
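A minimal sketch of such a compilation step is shown below. The actual script is not described in this post, so everything here is an assumption: the `compile_codeowners` helper is hypothetical, each OWNERS file is assumed to list one owner handle per line, and the team names are invented. The output follows GitHub’s CODEOWNERS convention of a path pattern followed by its owners.

```python
import tempfile
from pathlib import Path


def compile_codeowners(repo_root: Path) -> str:
    """Collect every OWNERS file under repo_root into CODEOWNERS content."""
    lines = []
    # Sort by folder so parents precede children: GitHub gives precedence
    # to the last matching pattern, letting deeper folders override.
    owners_files = sorted(
        (p for p in repo_root.rglob("OWNERS") if p.is_file()),
        key=lambda p: p.parent.parts,
    )
    for owners_file in owners_files:
        owners = [
            line.strip()
            for line in owners_file.read_text().splitlines()
            if line.strip() and not line.lstrip().startswith("#")
        ]
        if not owners:
            continue
        folder = owners_file.parent.relative_to(repo_root).as_posix()
        pattern = "*" if folder == "." else f"/{folder}/"
        lines.append(f"{pattern} {' '.join(owners)}")
    return "\n".join(lines) + "\n"


# Demonstrate on a throwaway directory tree with made-up owners.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "project1" / "src").mkdir(parents=True)
    (root / "project1" / "OWNERS").write_text("@data-platform\n")
    (root / "project1" / "src" / "OWNERS").write_text("@ml-team\n@alice\n")
    codeowners = compile_codeowners(root)

print(codeowners)
```

In a CI step, the returned text would be written to the CODEOWNERS file under “.github/” so that the approval rule always reflects the OWNERS files in the tree.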

    5. Tools

    The Tools folder contains configuration files for code quality tools such as flake8, black, isort, and mypy. Pants references these files to configure its linters and formatters.

    6. Buildkite Workflow

    Coinbase uses Buildkite as its CI platform. Buildkite workflows and hook definitions are defined in this folder. A CI workflow defines steps such as:

  • Check whether the dependency lock files need to be updated.
  • Run linters and code quality tools.
  • Build the source code and Docker images.
  • Run unit and integration tests.
  • Generate a code coverage report.

    7. Dockerfiles
    Dockerfiles are defined in this folder. Docker images are built by the CI workflow and deployed by Codeflow, Coinbase’s on-premise deployment platform.

    8. Pants

    This folder contains Pants scripts and configuration files (pants.toml, pants.ci.toml).

    This article describes how we built PyNest using the Pants build system. In our next blog post, we will explain dependency management and CI/CD.
