Intake Catalogs¶
dummyxarray provides comprehensive support for Intake catalogs, allowing you to both export dataset specifications to Intake catalog format and import existing Intake catalogs back into DummyDataset objects. This enables complete round-trip compatibility with the Intake data cataloging ecosystem.
Overview¶
Intake is a data cataloging system that provides a unified interface for discovering and accessing data. dummyxarray's Intake catalog support allows you to:
- Export DummyDataset structures to Intake catalog YAML files
- Import Intake catalogs to recreate DummyDataset objects
- Preserve complete metadata including dimensions, coordinates, variables, and encoding
- Integrate with the broader Intake ecosystem for data discovery and sharing
Exporting to Intake Catalogs¶
Basic Export¶
from dummyxarray import DummyDataset
# Create a dataset
ds = DummyDataset()
ds.add_dim("time", 12)
ds.add_dim("lat", 180)
ds.add_dim("lon", 360)
ds.add_coord("time", dims=["time"], attrs={"units": "days since 2000-01-01"})
ds.add_variable(
    "temperature",
    dims=["time", "lat", "lon"],
    attrs={"units": "K", "standard_name": "air_temperature"},
    encoding={"dtype": "float32", "chunks": [6, 32, 64]}
)
# Generate catalog YAML string
catalog_yaml = ds.to_intake_catalog()
print(catalog_yaml)
Customized Export¶
# Export with custom parameters
catalog_yaml = ds.to_intake_catalog(
    name="climate_data",
    description="Climate model output with temperature and precipitation",
    driver="zarr",
    data_path="data/climate_model_output.zarr",
    chunks={"time": 6}  # Additional driver arguments
)
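If the extra keyword arguments (like chunks above) land in the source's args section, as the comment suggests, a quick yaml.safe_load check can confirm it. This sketch assumes that pass-through behavior:
import yaml

parsed = yaml.safe_load(catalog_yaml)
args = parsed["sources"]["climate_data"]["args"]
print(args["urlpath"])     # data/climate_model_output.zarr
print(args.get("chunks"))  # {'time': 6}, if extra kwargs are passed through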
Save to File¶
# Save catalog directly to file
ds.save_intake_catalog(
    "catalog.yaml",
    name="climate_data",
    description="Climate model output",
    driver="zarr",
    data_path="data/climate.zarr"
)
Catalog Structure¶
The generated Intake catalog has the following structure:
metadata:
  version: 1
  description: Intake catalog for climate_data
  dataset_attrs:
    title: Climate Model Output
    institution: Example Climate Center
    Conventions: CF-1.8
sources:
  climate_data:
    description: Climate model output with temperature and precipitation
    driver: zarr
    args:
      urlpath: data/climate_model_output.zarr
    metadata:
      dimensions:
        time: 12
        lat: 180
        lon: 360
      coordinates:
        time:
          dims: [time]
          attrs:
            units: days since 2000-01-01
      variables:
        temperature:
          dims: [time, lat, lon]
          attrs:
            units: K
            standard_name: air_temperature
          encoding:
            dtype: float32
            chunks: [6, 32, 64]
Importing from Intake Catalogs¶
Load from File¶
# Load from catalog file
loaded_ds = DummyDataset.from_intake_catalog("catalog.yaml", "climate_data")
# Or use the convenience method
loaded_ds = DummyDataset.load_intake_catalog("catalog.yaml", "climate_data")
Load from Dictionary¶
import yaml
# Load catalog YAML and parse to dictionary
with open("catalog.yaml") as f:
    catalog_dict = yaml.safe_load(f)
# Create DummyDataset from dictionary
loaded_ds = DummyDataset.from_intake_catalog(catalog_dict, "climate_data")
Automatic Source Selection¶
# If the catalog contains only one source, you can omit the source name
single_source_ds = DummyDataset.from_intake_catalog("single_source_catalog.yaml")
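The same applies when passing a dictionary. A minimal sketch of a one-source catalog, loaded without naming the source:
single_source_catalog = {
    "metadata": {"version": 1},
    "sources": {
        "temperature": {
            "driver": "zarr",
            "args": {"urlpath": "data/temperature.zarr"},
            "metadata": {"dimensions": {"time": 12}}
        }
    }
}

# Only one source, so the source name can be omitted
ds = DummyDataset.from_intake_catalog(single_source_catalog)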
Round-Trip Workflow¶
A complete round-trip workflow looks like this:
from dummyxarray import DummyDataset
import tempfile
import yaml
# 1. Create original dataset
original_ds = DummyDataset()
original_ds.assign_attrs(
    title="Climate Model Output",
    institution="Example Climate Center",
    Conventions="CF-1.8"
)
original_ds.add_dim("time", 12)
original_ds.add_dim("lat", 180)
original_ds.add_dim("lon", 360)
original_ds.add_coord("time", dims=["time"], attrs={"units": "days since 2000-01-01"})
original_ds.add_variable(
    "temperature",
    dims=["time", "lat", "lon"],
    attrs={"units": "K"},
    encoding={"dtype": "float32"}
)
# 2. Export to catalog
catalog_yaml = original_ds.to_intake_catalog(
    name="climate_data",
    description="Climate model output",
    driver="zarr"
)
# 3. Save to temporary file
with tempfile.NamedTemporaryFile(mode='w', suffix='.yaml', delete=False) as f:
    catalog_path = f.name
    f.write(catalog_yaml)
# 4. Load from catalog
restored_ds = DummyDataset.from_intake_catalog(catalog_path, "climate_data")
# 5. Verify round-trip integrity
assert restored_ds.dims == original_ds.dims
assert set(restored_ds.variables.keys()) == set(original_ds.variables.keys())
assert restored_ds.attrs["title"] == original_ds.attrs["title"]
print("Round-trip successful!")
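Because the temporary file was created with delete=False, it is not removed automatically; clean it up when you are done:
import os

os.unlink(catalog_path)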
Advanced Features¶
Multiple Sources in Catalog¶
When working with catalogs containing multiple data sources:
# Catalog with multiple sources
multi_source_catalog = {
    "metadata": {"version": 1},
    "sources": {
        "temperature": {
            "driver": "zarr",
            "args": {"urlpath": "data/temperature.zarr"},
            "metadata": {"dimensions": {"time": 12, "lat": 180, "lon": 360}}
        },
        "precipitation": {
            "driver": "zarr",
            "args": {"urlpath": "data/precipitation.zarr"},
            "metadata": {"dimensions": {"time": 12, "lat": 180, "lon": 360}}
        }
    }
}
# Must specify which source to load
temp_ds = DummyDataset.from_intake_catalog(multi_source_catalog, "temperature")
precip_ds = DummyDataset.from_intake_catalog(multi_source_catalog, "precipitation")
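Since each source maps to its own DummyDataset, you can also load every source in one pass. A small sketch using the catalog's sources mapping:
# Load each source into its own DummyDataset
datasets = {
    name: DummyDataset.from_intake_catalog(multi_source_catalog, name)
    for name in multi_source_catalog["sources"]
}
print(list(datasets))  # ['temperature', 'precipitation']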
Driver Configuration¶
Different data formats and drivers:
# NetCDF driver
ds.to_intake_catalog(
    name="netcdf_data",
    driver="netcdf",
    data_path="data/output.nc",
    engine="netcdf4"
)
# Xarray driver with custom arguments
ds.to_intake_catalog(
    name="xarray_data",
    driver="xarray",
    data_path="data/*.nc",
    combine="by_coords",
    parallel=True
)
Metadata Preservation¶
All dataset metadata is preserved in the catalog:
# Dataset attributes become catalog metadata
ds.assign_attrs(
    title="My Dataset",
    institution="My Organization",
    project="Climate Research",
    version="1.0"
)
# After round-trip, attributes are preserved
loaded_ds = DummyDataset.from_intake_catalog("catalog.yaml", "my_data")
assert loaded_ds.attrs["title"] == "My Dataset"
assert loaded_ds.attrs["institution"] == "My Organization"
# Catalog-specific attributes are also added
assert loaded_ds.attrs["intake_catalog_source"] == "my_data"
assert loaded_ds.attrs["intake_driver"] == "zarr"
Error Handling¶
The import functionality includes comprehensive error handling:
try:
    # File not found
    ds = DummyDataset.from_intake_catalog("nonexistent.yaml")
except FileNotFoundError as e:
    print(f"Catalog file not found: {e}")

try:
    # Invalid catalog format
    ds = DummyDataset.from_intake_catalog({"invalid": "structure"})
except ValueError as e:
    print(f"Invalid catalog: {e}")

try:
    # Source not found in multi-source catalog
    ds = DummyDataset.from_intake_catalog(multi_source_catalog, "nonexistent_source")
except ValueError as e:
    print(f"Source not found: {e}")
Integration with Intake Ecosystem¶
The generated catalogs are fully compatible with the Intake ecosystem:
import intake
# Load catalog with Intake
catalog = intake.open_catalog("catalog.yaml")
# Access data source
data_source = catalog.climate_data
# Get metadata
print(data_source.description)
print(data_source.metadata)
# Load actual data (when available)
# ds = data_source.read()
Best Practices¶
- Descriptive Names: Use meaningful source names that reflect the data content
- Complete Metadata: Include comprehensive dataset attributes for better discoverability
- Consistent Paths: Use relative paths with the {{ CATALOG_DIR }} template for portability (see the sketch after this list)
- Driver Selection: Choose appropriate drivers for your data format and access patterns
- Version Control: Track catalog files alongside your code for reproducibility
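A sketch of the {{ CATALOG_DIR }} pattern from the Consistent Paths item, assuming data_path is written verbatim into the source's urlpath; Intake expands the template relative to the catalog file's location:
# Hypothetical portable catalog: the data path resolves relative to
# wherever catalog.yaml ends up living
ds.save_intake_catalog(
    "catalog.yaml",
    name="climate_data",
    driver="zarr",
    data_path="{{ CATALOG_DIR }}/data/climate.zarr"
)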
Examples¶
See the Intake Catalog Example for a complete working demonstration of round-trip catalog functionality.