
Python for Data Science in 2026: Libraries That Actually Matter Now

The Rust-in-Python shift is real. Here's what's actually worth learning, what's losing ground, and the 2026 stack that production teams run.

Meritshot · 6 min read

The Python data science ecosystem has always moved fast. But 2025 and 2026 have produced something more fundamental than speed of change: a structural shift in what the stack looks like, why it looks that way, and what you need to know to be productive in production.

The headline: Rust is inside your Python tools now, whether you know it or not. This isn't a language migration. Python remains the interface. But many of the engines doing the work (the data manipulation layers, the linters and formatters, the package installers) have been rewritten in Rust or rebuilt on other compiled cores such as DuckDB's C++ engine, and the performance difference is substantial enough to change what's feasible.


The Rust-in-Python Pattern

The pattern is consistent across the ecosystem: take a Python tool that has performance limits, rewrite the performance-critical parts in Rust while keeping the Python API, and ship something that is 10–100x faster for large workloads.

This has happened with:

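  • Data manipulation: Polars provides a DataFrame library built on a Rust engine, taking over workloads that used to go to Pandas
  • SQL query: DuckDB runs analytical queries against DataFrames, files, and databases at columnar query engine speeds (its engine is C++ rather than Rust, but it follows the same compiled-core, Python-API pattern)
  • Package management: uv replaces pip and venv for installing packages and creating environments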
  • Code formatting and linting: Ruff replaced Black + isort + Flake8
  • Type checking: ty (from Astral, the uv team) is a Rust-based Python type checker

The practical consequence: tools that were slow enough to require architectural workarounds are now fast enough to use directly. DuckDB can query a 10GB Parquet file in-process in seconds. Polars can filter a 100M-row DataFrame without memory bloat. uv installs a complex environment in under ten seconds.

The Hybrid Stack: DuckDB + Polars + Pandas

Pandas is not dead. It remains the right choice for:

  • Small-to-medium datasets where its API familiarity is worth more than the performance difference
  • Integration with libraries that require Pandas DataFrames specifically
  • Quick exploratory work where execution speed doesn't matter

But at scale, the modern stack looks different:

DuckDB handles SQL-first operations: joins, aggregations, window functions, and reading Parquet/CSV/JSON files directly, without first loading the whole file into a DataFrame. Its integration with Polars and Arrow means data can often move between them without copying. For analytical work on large datasets, DuckDB is often the right first step.

Polars handles DataFrame-style manipulation: filtering, transformation, feature engineering. It's faster than Pandas on virtually every operation above a million rows, uses less memory, and supports lazy evaluation that optimizes the full operation chain before executing.

Pandas handles last-mile compatibility: when a library requires a Pandas DataFrame, convert at the boundary. The conversion goes through Apache Arrow and is usually cheap relative to the work done upstream, although it can copy data depending on column types.

[Figure: data processing performance comparison of Pandas, Polars, and DuckDB]

The Tooling Revolution: uv, Ruff, ty

These three tools are infrastructure changes that affect every Python project, not just data science:

uv replaces pip + venv + virtualenv + conda (for most use cases). Written in Rust by the Astral team. Installs packages 10–100x faster than pip in Astral's own benchmarks. Manages virtual environments, lockfiles, and Python versions. Creates reproducible environments from a uv.lock file. For new projects, there is little reason left to use pip directly.

Ruff replaces Black + isort + Flake8 + most pylint checks. A single tool, written in Rust, that lints and formats Python code faster than running Black alone. For teams that had separate pre-commit hooks for Black, isort, and a linter, Ruff collapses all of them into one faster tool.

ty (from Astral) is a Rust-based type checker currently in active development. It aims to run substantially faster than mypy and pyright on large codebases. It's not yet production-standard for all use cases, but it's the direction the ecosystem is moving.

ML Libraries: The 2026 Map

PyTorch is the unambiguous leader for deep learning in 2026. TensorFlow is still present in legacy systems but has lost its place in the research-to-production pipeline. New projects start in PyTorch. This is settled.

scikit-learn remains essential for tabular data. Random forests, gradient boosting (scikit-learn's own implementations, or the separate XGBoost, LightGBM, and CatBoost libraries, which offer scikit-learn-compatible estimators), and classical ML algorithms still win on tabular tasks where deep learning doesn't add value. The data scale at which deep learning beats classical ML on structured tabular data is higher than most teams realize.

HuggingFace Transformers is the standard for working with pretrained language and vision models. Fine-tuning, inference, model loading: all of it goes through Transformers for the majority of production teams.

Pydantic v2, whose validation core is written in Rust, has become the standard for data validation. It's used not just for API schemas but for validating ML pipeline data at boundaries: dataset schemas, model output schemas, configuration contracts.
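As a sketch of validating at a pipeline boundary, the hypothetical `Prediction` schema below describes a model output record. The field names and bounds are assumptions chosen for illustration.

```python
from pydantic import BaseModel, Field, ValidationError


class Prediction(BaseModel):
    """Hypothetical contract for one model output record."""

    item_id: int
    label: str
    score: float = Field(ge=0.0, le=1.0)  # probabilities must stay in [0, 1]


# A well-formed record passes; the numeric string is coerced to an int.
good = Prediction.model_validate({"item_id": "42", "label": "spam", "score": 0.93})

# An out-of-range score is rejected at the boundary instead of leaking downstream.
try:
    Prediction.model_validate({"item_id": 7, "label": "ham", "score": 1.7})
    errors = []
except ValidationError as exc:
    errors = exc.errors()

print(good)
print([e["loc"] for e in errors])
```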

Pandera extends Pydantic-style validation to DataFrames. If your ML pipeline's quality issues are partly caused by unexpected data shapes entering a stage, Pandera is the direct fix.

The LLM Era Libraries

vLLM for production LLM inference when you're self-hosting models. Continuous batching, PagedAttention, high throughput. If you're running open-source models at scale, vLLM is the production standard.

smolagents (from HuggingFace) for building agents. Lightweight, composable, doesn't require a framework to function. Preferred over LangChain for teams that have experienced LangChain's abstraction costs.

Instructor / Outlines for structured output from LLMs. When you need a model to return JSON, these libraries enforce the schema and handle retry logic. Cleaner than prompting for JSON and hoping.
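The validate-and-retry loop these libraries automate can be sketched with Pydantic alone. This is not Instructor's or Outlines' API: `fake_llm`, `Ticket`, and `extract_ticket` are hypothetical names, and the model call is replaced by canned replies so the example runs offline.

```python
from pydantic import BaseModel, ValidationError


class Ticket(BaseModel):
    """Hypothetical schema the model's JSON reply must satisfy."""

    category: str
    priority: int


# Stand-in for a real model call: the first reply violates the schema,
# the second one is valid.
_canned_replies = iter(
    [
        '{"category": "billing", "priority": "high"}',
        '{"category": "billing", "priority": 2}',
    ]
)


def fake_llm(prompt: str) -> str:
    return next(_canned_replies)


def extract_ticket(prompt: str, max_attempts: int = 3) -> Ticket:
    last_error = None
    for _ in range(max_attempts):
        raw = fake_llm(prompt)
        try:
            # Parses the JSON and checks it against the schema in one step.
            return Ticket.model_validate_json(raw)
        except ValidationError as exc:
            last_error = exc
            # Feed the validation error back so the next attempt can correct it.
            prompt = f"{prompt}\n\nPrevious reply was invalid: {exc}\nReturn valid JSON only."
    raise RuntimeError("no schema-valid reply after retries") from last_error


ticket = extract_ticket("Classify this message: I was charged twice.")
print(ticket)
```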

FastAPI remains the standard for ML model serving. No serious challenge has emerged.

What's Losing Ground

  • TensorFlow: Legacy only. New projects should not start here.
  • Conda: uv handles environment management better for most pure Python data science work. Conda remains relevant for complex binary dependencies, but is no longer the default choice.
  • pip + Poetry: uv replaces both for new projects.
  • Black + isort + Flake8 as separate tools: Ruff is the consolidation layer.
  • Pandas at scale: Polars + DuckDB for anything above a few million rows.

The Seven Anti-Patterns

  1. Running Pandas operations on datasets that Polars would handle in a fraction of the time
  2. Using pip in new projects when uv exists
  3. Starting a new deep learning project in TensorFlow because old tutorials use it
  4. Installing Black, isort, and Flake8 separately when Ruff covers all three
  5. Calling an LLM and parsing JSON from a text response when Instructor exists
  6. Using a 70B model for a task a 7B model handles correctly (with the right prompting and structure)
  7. Building a custom agent framework when smolagents or a similar lightweight library already does what you need

The 2026 Python stack is genuinely better than it was two years ago. Faster, more composable, more production-ready. The teams that update their mental model of what the stack looks like will spend less time fighting tooling and more time solving actual problems.


Meritshot's Data Science curriculum is updated with 2026-grade tooling — uv, Polars, DuckDB, Ruff, and the LLM-era libraries — so learners build on the stack production teams actually run.
