DummyXarray¶

A lightweight xarray-like object for building dataset metadata specifications before creating actual xarray datasets.

Overview¶

dummyxarray lets you define the structure of a dataset, including its dimensions, coordinates, variables, and metadata, before you create the xarray.Dataset that holds real data. This is particularly useful for:

Dataset Planning: Define your dataset structure before generating data
Template Generation: Create reusable dataset templates
CF Compliance: Check that metadata follows CF conventions with automatic validation
Zarr Workflow: Define chunking and compression strategies upfront
Metadata Validation: Catch dimension mismatches early
Reproducible Workflows: Track and replay all operations
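
To make "catch dimension mismatches early" concrete, here is a minimal sketch of the kind of check a metadata-first workflow allows. It is plain Python and does not use dummyxarray; the helper name `check_variable` and its arguments are hypothetical and only illustrate the idea.

```python
def check_variable(name, var_dims, shape, dims):
    """Return error strings for one variable (hypothetical helper).

    dims     -- declared dimensions, e.g. {"time": 12}
    var_dims -- dimension names the variable claims to use
    shape    -- shape of the data, or None if no data is attached yet
    """
    errors = []
    # Every dimension a variable uses must have been declared first.
    for d in var_dims:
        if d not in dims:
            errors.append(f"{name}: unknown dimension {d!r}")
    if shape is not None:
        if len(shape) != len(var_dims):
            errors.append(
                f"{name}: {len(var_dims)} dims but data has {len(shape)} axes"
            )
        else:
            # Compare sizes only for dimensions that are declared.
            for d, n in zip(var_dims, shape):
                if d in dims and dims[d] != n:
                    errors.append(f"{name}: {d!r} has size {dims[d]}, data has {n}")
    return errors


dims = {"time": 12, "lat": 180, "lon": 360}
ok = check_variable("temperature", ("time", "lat", "lon"), None, dims)
bad = check_variable("pressure", ("time", "lev"), (10, 5), dims)
```

Because the check needs only names and sizes, it can run before any array is allocated.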

Key Features¶

Core Functionality¶

✅ Metadata-first design - Define structure before data
✅ xarray compatibility - Convert to/from xarray.Dataset
✅ Automatic dimension inference - Infer from data shape
✅ xarray-style attribute access - ds.time, ds.temperature
✅ Rich repr - Interactive exploration in notebooks

CF Compliance (Phase 1)¶

✅ Axis detection - Automatic X/Y/Z/T axis inference
✅ CF validation - Check for CF convention compliance
✅ Standard names - Support for CF standard_name vocabulary
✅ Dimension ordering - Validate T, Z, Y, X ordering
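
As a rough illustration of axis detection and ordering checks, the sketch below guesses a CF axis letter from a coordinate's units or name and tests whether a variable's axes follow the T, Z, Y, X order that CF recommends. This is an assumption about how such heuristics can work, not dummyxarray's actual implementation; `guess_axis` and `is_cf_ordered` are hypothetical names, and the list of vertical coordinate names is a small made-up sample.

```python
AXIS_ORDER = ["T", "Z", "Y", "X"]


def guess_axis(name, units):
    """Guess a CF axis letter from units first, then from the coordinate name."""
    units = units.lower()
    if " since " in units:  # e.g. "days since 2000-01-01"
        return "T"
    if units in ("degrees_north", "degree_north"):
        return "Y"
    if units in ("degrees_east", "degree_east"):
        return "X"
    # Illustrative sample of vertical coordinate names, not an exhaustive list.
    if name.lower() in ("lev", "level", "height", "depth", "plev"):
        return "Z"
    return None


def is_cf_ordered(axes):
    """True if the recognised axes appear in T, Z, Y, X order."""
    ranks = [AXIS_ORDER.index(a) for a in axes if a in AXIS_ORDER]
    return ranks == sorted(ranks)


coords = {
    "time": "days since 2000-01-01",
    "lat": "degrees_north",
    "lon": "degrees_east",
}
axes = {name: guess_axis(name, units) for name, units in coords.items()}
```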

History & Provenance¶

✅ Operation tracking - Record all dataset modifications
✅ History export - Export as Python, JSON, or YAML
✅ History visualization - Text, DOT, or Mermaid diagrams
✅ Provenance tracking - Track what changed (added/removed/modified)
✅ History replay - Recreate datasets from operation history
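
The record-and-replay idea behind these features can be sketched in a few lines of plain Python. The `TrackedSpec` class below is a toy written for this page, not part of dummyxarray; its method names mirror the Quick Example only for readability, and the history format is an assumption.

```python
import json


class TrackedSpec:
    """Toy metadata container that logs every mutating call (hypothetical)."""

    def __init__(self):
        self.dims = {}
        self.attrs = {}
        self.history = []

    def _record(self, func, kwargs):
        # Store the method name and its keyword arguments as plain data.
        self.history.append({"func": func, "args": kwargs})

    def add_dim(self, name, size):
        self.dims[name] = size
        self._record("add_dim", {"name": name, "size": size})

    def assign_attrs(self, **attrs):
        self.attrs.update(attrs)
        self._record("assign_attrs", dict(attrs))

    @classmethod
    def replay(cls, history):
        """Rebuild a spec by calling each recorded method in order."""
        spec = cls()
        for op in history:
            getattr(spec, op["func"])(**op["args"])
        return spec


spec = TrackedSpec()
spec.assign_attrs(title="Climate Model Output")
spec.add_dim("time", 12)

# The history is plain data, so it survives a JSON round trip.
exported = json.dumps(spec.history)
clone = TrackedSpec.replay(json.loads(exported))
```

Keeping each history entry as a method name plus keyword arguments is what makes export to JSON or YAML and later replay straightforward in this sketch.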

Data Generation & I/O¶

✅ Smart data generation - Populate with realistic random data
✅ Multiple formats - Export to YAML, JSON, Zarr, NetCDF
✅ Template support - Save/load dataset specifications
✅ Encoding support - dtype, chunks, compression settings
✅ Intake catalogs - Export and import Intake catalog YAML files

Multi-File Dataset Support (Phase 2)¶

✅ Multi-file datasets - Open multiple NetCDF files as one dataset
✅ Automatic frequency inference - Detect time frequency from coordinates
✅ Time-based grouping - Group datasets by decades, years, months
✅ File tracking - Track which files contain which data ranges
✅ Metadata-only - No data loading, only metadata operations
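
Frequency inference and time-based grouping need only the time coordinate values and each file's time range, which is why they can work without loading data. The sketch below shows one simple way to do both with the standard library. The function names, the two frequency labels, and the file names are hypothetical, and irregular spacings (monthly steps, for example) are deliberately left unhandled.

```python
from collections import defaultdict
from datetime import datetime


def infer_frequency(times):
    """Return a coarse label for evenly spaced timestamps, else None."""
    deltas = {b - a for a, b in zip(times, times[1:])}
    if len(deltas) != 1:  # fewer than two points, or uneven spacing
        return None
    seconds = int(deltas.pop().total_seconds())
    return {3600: "hourly", 86400: "daily"}.get(seconds)


def group_by_decade(file_years):
    """Map each decade start year to the files whose range overlaps it."""
    groups = defaultdict(list)
    for path, (start, end) in file_years.items():
        for decade in range(start // 10 * 10, end // 10 * 10 + 1, 10):
            groups[decade].append(path)
    return dict(groups)


times = [datetime(2000, 1, day) for day in (1, 2, 3)]
freq = infer_frequency(times)

# Hypothetical file names with the first and last year each file covers.
files = {"tas_1995-2004.nc": (1995, 2004), "tas_2005-2014.nc": (2005, 2014)}
groups = group_by_decade(files)
```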

Architecture¶

✅ Modular design - Mixin-based architecture for maintainability
✅ Well-tested - 188 tests with comprehensive coverage
✅ Type-safe - Clear API with validation

Quick Example¶

from dummyxarray import DummyDataset

# Create a CF-compliant dataset
ds = DummyDataset()
ds.assign_attrs(Conventions="CF-1.8", title="Climate Model Output")

# Add dimensions and coordinates
ds.add_dim("time", 12)
ds.add_dim("lat", 180)
ds.add_dim("lon", 360)

ds.add_coord("time", dims=["time"], attrs={"units": "days since 2000-01-01"})
ds.add_coord("lat", dims=["lat"], attrs={"units": "degrees_north"})
ds.add_coord("lon", dims=["lon"], attrs={"units": "degrees_east"})

# Add variable with encoding
ds.add_variable(
    "temperature",
    dims=["time", "lat", "lon"],
    attrs={"standard_name": "air_temperature", "units": "K"},
    encoding={"dtype": "float32", "chunks": (6, 32, 64)}
)

# Infer CF axis attributes (X, Y, Z, T)
ds.infer_axis()
ds.set_axis_attributes()

# Validate CF compliance
result = ds.validate_cf()
print(f"Warnings: {len(result['warnings'])}")

# Populate with realistic data
ds.populate_with_random_data(seed=42)

# Export or convert
ds.save_yaml("template.yaml")
xr_dataset = ds.to_xarray()
ds.to_zarr("output.zarr")
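
To see what the encoding in the example above implies, the chunk layout can be worked out by hand. The short calculation below uses only the shape, chunk sizes, and dtype from the example and assumes uncompressed float32 values of 4 bytes each; it does not call dummyxarray or Zarr.

```python
import math

shape = (12, 180, 360)   # time, lat, lon from the example
chunks = (6, 32, 64)     # encoding["chunks"] from the example
itemsize = 4             # bytes per float32 value

# Chunks along each dimension; the last chunk may be only partly filled.
chunks_per_dim = tuple(math.ceil(s / c) for s, c in zip(shape, chunks))
n_chunks = math.prod(chunks_per_dim)

# Uncompressed size of one full chunk and of the whole variable, in bytes.
chunk_bytes = math.prod(chunks) * itemsize
total_bytes = math.prod(shape) * itemsize
```

Here 180 and 360 are not multiples of 32 and 64, so the last chunk along lat and lon is only partly filled. Settling such trade-offs up front is the point of defining encoding in the specification.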

Documentation¶

Getting Started¶

Installation Guide - Set up dummyxarray
Quick Start - Hands-on introduction

User Guide¶

Basic Usage - Core concepts and workflows
CF Compliance - Working with CF conventions
CF Standards - CF standard names and vocabulary
Multi-File Datasets - Work with multiple NetCDF files
History Tracking - Track and replay operations
Validation - Validate dataset structure
Encoding - Configure chunking and compression
YAML Export - Save and load specifications
Intake Catalogs - Export and import Intake catalogs
STAC Catalogs - STAC Item and Collection support
Spatial Metadata - Geospatial extent and validation
Geospatial Workflows - Real-world STAC examples
ncdump Import - Import from ncdump output

API Reference¶

DummyDataset - Main dataset class
DummyArray - Array class for variables and coordinates

Project Architecture¶

Design Overview - Mixin-based architecture
Testing - Test structure and fixtures

Project Status¶

Phase 1 Complete: CF compliance, history tracking, and modular architecture
Phase 2 Complete: Multi-file datasets, time-based grouping, CF standards
Phase 3 Complete: STAC catalog support, spatial metadata, geospatial workflows
Future: CMIP table integration and spatial grouping

Contributions and feedback are welcome!