Encoding¶
Configure how your data is stored with encoding parameters.
What is Encoding?¶
Encoding parameters control how data is written to disk formats like Zarr or NetCDF. This includes:
- Data type (dtype)
- Chunking strategy
- Compression
- Fill values
- Scale factors
Basic Encoding¶
from dummyxarray import DummyDataset
import numpy as np
ds = DummyDataset()
ds.add_dim("time", 100)
data = np.random.rand(100)
ds.add_variable(
"temperature",
dims=["time"],
data=data,
encoding={
"dtype": "float32", # Store as 32-bit float
}
)
Chunking¶
Chunking is crucial for performance with large datasets:
ds = DummyDataset()
ds.add_dim("time", 365)
ds.add_dim("lat", 180)
ds.add_dim("lon", 360)
data = np.random.rand(365, 180, 360)
ds.add_variable(
"temperature",
dims=["time", "lat", "lon"],
data=data,
encoding={
"dtype": "float32",
"chunks": (30, 90, 180), # Chunk size for each dimension
}
)
Compression¶
Add compression to reduce file size:
import zarr
ds.add_variable(
"temperature",
dims=["time", "lat", "lon"],
data=data,
encoding={
"dtype": "float32",
"chunks": (30, 90, 180),
"compressor": zarr.Blosc(cname='zstd', clevel=3, shuffle=2)
}
)
Fill Values¶
Specify fill values for missing data:
ds.add_variable(
"temperature",
dims=["time"],
data=data,
encoding={
"dtype": "float32",
"_FillValue": -9999.0
}
)
Scale and Offset¶
Use scale_factor and add_offset for packed data:
ds.add_variable(
"temperature",
dims=["time"],
data=data,
encoding={
"dtype": "int16",
"scale_factor": 0.01,
"add_offset": 273.15,
"_FillValue": -32768
}
)
Encoding for Coordinates¶
Coordinates can also have encoding:
time_data = np.arange(365)
ds.add_coord(
"time",
dims=["time"],
data=time_data,
attrs={"units": "days since 2020-01-01"},
encoding={
"dtype": "int32",
"calendar": "standard"
}
)
Best Practices¶
Chunking Strategy¶
- Time series data: Large chunks in time, smaller in space
- Spatial data: Smaller chunks in time, larger in space
- General rule: Aim for chunk sizes of 10-100 MB
# For time series analysis
encoding={"chunks": (365, 45, 90)} # Full year, quarter lat/lon
# For spatial analysis
encoding={"chunks": (1, 180, 360)} # Single time, full spatial
Data Types¶
- Use
float32instead offloat64when precision allows - Use
int16orint32for integer data - Consider packed integers with scale/offset for large datasets
Compression¶
zstd(level 3): Good balance of speed and compressionlz4: Fastest, moderate compressionblosc: Good for scientific data
Applying Encoding¶
Encoding is automatically applied when converting to xarray or writing to Zarr: