40 changes: 19 additions & 21 deletions README.md
@@ -12,38 +12,37 @@
pyfive : A pure Python HDF5 file reader
=======================================

pyfive is an open source library for reading HDF5 files written using
``pyfive`` is an open source library for reading HDF5 files written using
pure Python (no C extensions). The package is still in development and not all
features of HDF5 files are supported.

pyfive aims to support the same API as [`h5py`](https://github.com/h5py/h5py)
for reading files. Cases where a file uses a feature that is supported by `h5py`
but not pyfive are considered bug and should be reported in our [Issues](https://github.com/NCAS-CMS/pyfive/issues).
Writing HDF5 is not a goal of pyfive and portions of the API which apply only to writing will not be
implemented.
``pyfive`` aims to support the same API as [`h5py`](https://github.com/h5py/h5py) for reading files.
Cases where a file uses a feature that is supported by ``h5py`` but not ``pyfive`` are considered bugs
and should be reported in our [Issues](https://github.com/NCAS-CMS/pyfive/issues).
Writing HDF5 output is not a goal of ``pyfive`` and portions of the API which apply only to writing will not be implemented.

Dependencies
============

pyfive is tested to work with Python 3.10 to 3.13. It may also work
with other Python versions.
``pyfive`` is tested against Python versions 3.10 to 3.14.
It may also work with other Python versions.

The only dependencies to run the software besides Python is NumPy.
The only dependency to run the software besides Python is ``numpy``.

Install
=======

pyfive can be installed using pip using the command::
``pyfive`` can be installed with ``pip`` using the command::

pip install pyfive

conda package are also available from conda-forge which can be installed::
``conda`` packages are also available from conda-forge::

conda install -c conda-forge pyfive

To install from source in your home directory use::

python setup.py install --user
pip install --user ./pyfive

The library can also be imported directly from the source directory.

@@ -54,21 +53,20 @@ Development
git
---

You can check out the latest pyfive souces with the command::
You can check out the latest ``pyfive`` sources with the command::

git clone https://github.com/NCAS-CMS/pyfive.git

testing
-------

pyfive comes with a test suite in the ``tests`` directory. These tests can be
exercised using the commands ``pytest`` from the root directory assuming the
``pytest`` package is installed.
``pyfive`` comes with a test suite in the ``tests`` directory.
These tests can be exercised using the ``pytest`` command from the root directory (requires installation of the ``pytest`` package).

Conda-feedstock
===============
Conda-forge feedstock
=====================

Package repository at [conda feedstock](https://github.com/conda-forge/pyfive-feedstock)
Package repository: [conda-forge feedstock](https://github.com/conda-forge/pyfive-feedstock)

Codecov
=======
@@ -78,6 +76,6 @@ Test coverage assessment is done using [codecov](https://app.codecov.io/gh/NCAS
Documentation
=============

Build locally with Sphinx::
Build locally with Sphinx:

sphinx-build -Ea doc doc/build
$ sphinx-build -Ea doc doc/build
3 changes: 2 additions & 1 deletion doc/_sidebar.rst.inc
@@ -8,7 +8,8 @@
Introduction <introduction>
Getting started <quickstart/index>
API Reference <api_reference>
The p5dump utility <p5dump>
Additional API Features <additional>
Optimising Data Access Speed <optimising>
The p5dump utility <p5dump>
Understanding Cloud Optimisation <cloud>
Change Log <changelog>
11 changes: 6 additions & 5 deletions doc/additional.rst
@@ -8,8 +8,10 @@ Modifications to the File API

When accessing a file, there are two additional modifications to the standard ``h5py`` API that can be used to optimise
performance. A new method (``get_lazy_view``) and an additional keyword argument on ``visititems`` (``noindex``) are provided
to support access to all dataset metadata without loading chunk indices. (Loading chunk indices at dataset
instantiation is mostly a useful optimisation, but not if you have no intent of accessing the data itself.)
to support access to all dataset metadata without loading chunk indices.

.. note::
Loading chunk indices at dataset instantiation is mostly a useful optimisation, but not if you have no intention of accessing the data itself.

The ``Group`` API is fully documented in the autogenerated API reference, but the additional methods and keyword arguments are highlighted here.
These methods are also available on the ``File`` class, since ``File`` is a subclass of ``Group``.
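
For illustration, a minimal sketch of how these two features might be used together; the file name and the printed
attributes are invented, and the exact signatures and return types should be checked in the API reference::

    import pyfive

    with pyfive.File("example.h5") as f:
        # Walk every group and dataset without loading any chunk indices.
        f.visititems(lambda name, obj: print(name, dict(obj.attrs)), noindex=True)

        # A lazy view of the file: dataset metadata is available immediately,
        # and chunk indices are only read if the data itself is accessed.
        view = f.get_lazy_view()
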
@@ -21,10 +23,9 @@ These methods are also available on the ``File`` class, since ``File`` is a subcl
Modifications to the DatasetID API
----------------------------------

When accessing datasets, additional functionality is exposed via the ``pyfive.h5d.DatasetID`` class, which
is the class which implements the low-level data access methods for datasets (aka "variables").
When accessing datasets, additional functionality is exposed via the ``pyfive.h5d.DatasetID`` class, which implements the low-level data access methods for datasets (also known as *variables*).

The DatasetID API is fully documented in the autogenerated API reference, but the additional methods and attributes are highlighted here:
The ``DatasetID`` API is fully documented in the autogenerated API reference, but additional methods and attributes are highlighted here:

.. autoattribute:: pyfive.h5d.DatasetID.first_chunk
.. autoattribute:: pyfive.h5d.DatasetID.btree_range
3 changes: 2 additions & 1 deletion doc/api_reference.rst
@@ -26,6 +26,7 @@ Dataset

DatasetID
----------

.. autoclass:: pyfive.h5d.DatasetID
:members:
:noindex:
Expand All @@ -41,7 +42,7 @@ Datatype
The h5t module
--------------

Partial implementation of some of the lower level h5py API, needed
Partial implementation of some of the lower level ``h5py`` API, needed
to support enumerations, variable length strings, and opaque datatypes.
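
As a hedged illustration (the file and dataset names below are invented, and the behaviour is assumed to mirror the
``h5py`` helper of the same name, returning the member mapping for an enumerated dtype and ``None`` otherwise)::

    import pyfive
    from pyfive.h5t import check_enum_dtype

    with pyfive.File("example.h5") as f:
        dt = f["flags"].dtype
        # e.g. {'land': 0, 'sea': 1} for an enumerated type, or None otherwise
        print(check_enum_dtype(dt))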

.. autofunction:: pyfive.h5t.check_enum_dtype
17 changes: 7 additions & 10 deletions doc/cloud.rst
@@ -1,7 +1,7 @@
Cloud Optimisation
******************

While `pyfive` can only read HDF5 files, it includes some features to help users understand whether it might
While ``pyfive`` can only read HDF5 files, it includes some features to help users understand whether it might
be worth rewriting files to make them cloud optimised (as defined by Stern et al., 2022 [#]_).

To be cloud optimised an HDF5 file needs to have a contiguous index for each
@@ -21,8 +21,7 @@ Metadata can be repacked to the front of the file and variables can be rechunked
which is effectively the same process undertaken when HDF5 data is reformatted to other cloud optimised formats.

The HDF5 library provides a tool (`h5repack <https://support.hdfgroup.org/documentation/hdf5/latest/_h5_t_o_o_l__r_p__u_g.html>`_)
which can do this, provided it is driven with suitable information
about required chunk shape and the expected size of metadata fields.
which can do this, provided it is driven with suitable information about required chunk shape and the expected size of metadata fields.
``pyfive`` provides both a method to query whether such repacking is necessary, and a means of extracting the necessary parameters.

In the following example we compare and contrast the unpacked and repacked version of a particularly pathological
@@ -50,12 +49,11 @@ If we look at some of the output of `p5dump -s` on this file
uas:_first_chunk = 36520 ;


we can immediately see that this will be a problematic file! The b-tree index is clearly interleaved with the data
We can immediately see that this will be a problematic file! The `b-tree` index is clearly interleaved with the data
(compare the first chunk address with last index addresses of the two variables), and with a chunk dimension of ``(1,)``,
any effort to use the time dimension to locate data of interest will involve a ludicrous number of one-number reads
(all underlying libraries read the data one chunk at a time).
It would feel like waiting for the heat death of the universe if one
was to attempt to manipulate this data stored on an object store!
It would feel like waiting for the heat death of the universe if one was to attempt to manipulate this data stored on an object store!

It is relatively easy (albeit slow) to use
`h5repack <https://support.hdfgroup.org/documentation/hdf5/latest/_h5_t_o_o_l__r_p__u_g.html>`_
@@ -83,12 +81,11 @@ Now data follows indexes, the time dimension is one chunk, and there is a more s
While this file would probably benefit from splitting into smaller files, now it has a contiguous set of indexes
it is possible to exploit this data via S3.

All the metadata shown in this dump output arises from `pyfive` extensions to the `pyfive.h5t.DatasetID` class.
`pyfive` also provides a simple flag: `consolidated_metadata` for a `File` instance, which can take values of
All the metadata shown in this dump output arises from ``pyfive`` extensions to the ``pyfive.h5d.DatasetID`` class.
``pyfive`` also provides a simple flag: ``consolidated_metadata`` for a ``File`` instance, which can take values of
``True`` or ``False`` for any given file, which simplifies at least the "is the index packed at the front of the file?"
part of the optimisation question - though inspection of chunking is a key part of the workflow necessary to
determine whether or not a file really is optimised for cloud usage.
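
As a minimal sketch of checking this flag for a given file (the file name is invented)::

    import pyfive

    with pyfive.File("repacked_example.nc") as f:
        # True when the metadata (attributes and chunk indexes) is packed at
        # the front of the file, False otherwise.
        print(f.consolidated_metadata)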


.. [#] Stern et.al. (2022): *Pangeo Forge: Crowdsourcing Analysis-Ready, Cloud Optimized Data Production*, https://dx.doi.org/10.3389/fclim.2021.782909.
.. [#] Hassel and Cimadevilla Alvarez (2025): *Cmip7repack: Repack CMIP7 netCDF-4 Datasets*, https://dx.doi.org/10.5281/zenodo.17550920.
.. [#] Hassell and Cimadevilla Alvarez (2025): *Cmip7repack: Repack CMIP7 netCDF-4 Datasets*, https://dx.doi.org/10.5281/zenodo.17550920.
2 changes: 1 addition & 1 deletion doc/conf.py
@@ -67,6 +67,7 @@
'autosummary': True,
}

# FIXME: These libraries are not found in the documentation
autodoc_mock_imports = [
'cartopy',
'cf_units',
@@ -164,7 +165,6 @@

# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
# FIXME add a logo
html_logo = "figures/Pyfive-logo.png"

# The name of an image file (within the static path) to use as favicon of the
32 changes: 16 additions & 16 deletions doc/introduction.rst
@@ -5,41 +5,41 @@ About Pyfive
============

``pyfive`` provides a pure Python HDF5 reader which has been designed to be a thread-safe drop-in replacement
for `h5py <https://github.com/h5py/h5py>`_ with no dependencies on the HDF C library. It aims to support the same API as
for reading files. Cases where access to a file uses a feature that is supported by the high-level ``h5py`` interface but not ``pyfive`` are considered bugs and
for `h5py <https://github.com/h5py/h5py>`_ with no dependencies on the HDF5 C library. It aims to support the same API as ``h5py`` for reading files.
Cases where access to a file uses a feature that is supported by the high-level ``h5py`` interface but not ``pyfive`` are considered bugs and
should be reported in our `Issues <https://github.com/NCAS-CMS/pyfive/issues>`_.
Writing HDF5 is not a goal of pyfive and portions of the ``h5py`` API which apply only to writing will not be
implemented.

Writing HDF5 output is not a goal of ``pyfive`` and portions of the ``h5py`` API which apply only to writing will not be implemented.

.. note::
While ``pyfive`` is designed to be a drop-in replacement for ``h5py``, the reverse may not be possible. It is possible to do things with ``pyfive``
that will not work with ``h5py``, and ``pyfive`` definitely includes *extensions* to the ``h5py`` API. This documentation makes clear which parts of
the API are extensions and where behaviour differs *by design* from ``h5py``.
While ``pyfive`` is designed to be a drop-in replacement for ``h5py``, the reverse may not be possible.
It is possible to perform actions with ``pyfive`` that are not supported by ``h5py`` as ``pyfive`` extends the ``h5py`` API beyond its initial specifications.
This documentation makes clear which parts of the API are extensions and where behaviour differs *by design* from ``h5py``.

The motivation for ``pyfive`` development were many, but recent developments prioritised thread-safety, lazy loading, and
The motivations for ``pyfive`` development were many, but recent developments prioritised thread-safety, lazy loading, and
performance at scale in a cloud environment, both standalone
and as a backend for other software such as `cf-python <https://ncas-cms.github.io/cf-python/>`_, `xarray <https://docs.xarray.dev/en/stable/>`_, and `h5netcdf <https://h5netcdf.org/index.html>`_.
and as a backend for other software such as `cf-python <https://ncas-cms.github.io/cf-python/>`_, `xarray <https://docs.xarray.dev/en/stable/>`_, and `h5netcdf <https://h5netcdf.org/index.html>`_.

As well as the high-level ``h5py`` API we have implemented a version of the ``h5d.DatasetID`` class, which now
holds all the code which is used for data access (as opposed to attribute access). We have also implemented
holds all the code which is used for data access (as opposed to attribute access). We have also implemented
extra methods (beyond the ``h5py`` API) to expose the chunk index directly (as well as via an iterator) and
to access chunk info using the ``zarr`` indexing scheme rather than the ``h5py`` indexing scheme. This is useful for avoiding
the need for *a priori* use of ``kerchunk`` to make a ``zarr`` index for a file.
to access chunk info using the ``zarr`` indexing scheme rather than the ``h5py`` indexing scheme.
This is useful for avoiding the need for *a priori* use of ``kerchunk`` to make a ``zarr`` index for a file.
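
A hedged sketch of what this looks like in practice; the file and variable names are invented, ``iter_chunks`` follows
the ``h5py`` pattern, and the ``pyfive``-specific chunk-index methods live on the dataset's ``id`` (see the
``DatasetID`` entries in the API reference for their exact names)::

    import pyfive

    with pyfive.File("example.h5") as f:
        v = f["tas"]
        # Iterate over the chunk index without reading any data values.
        for chunk_slices in v.iter_chunks():
            print(chunk_slices)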

The code also includes an implementation of what we have called pseudochunking which is used for accessing
a contiguous array which is larger than memory via S3. In essence all this does is declare default chunks
aligned with the array order on disk and use them for data access.

There are optimisations to support cloud usage, the most important of which is that
once a variable is instantiated (i.e. for an open ``pyfive.File`` instance ``f``, when you do ``v=f['variable_name']``)
the attributes and b-tree (chunk index) are read, and it is then possible to close the parent file (``f``),
the attributes and ``b-tree`` (chunk index) are read, and it is then possible to close the parent file (``f``),
but continue to use ``v``.
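
A minimal sketch of that pattern (the file and variable names are invented)::

    import pyfive

    f = pyfive.File("example.h5")
    v = f["variable_name"]   # attributes and b-tree (chunk index) are read here
    f.close()                # the parent file can now be closed ...
    print(v[0:10])           # ... while the variable remains usable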

The package includes a script ``p5dump`` which can be used to dump the contents of an HDF5 file to the terminal.
The package also includes a command line tool (``p5dump``) which can be used to dump the contents of an HDF5 file to the terminal.

.. note::

We have test coverage that shows that the usage of ``v`` in this way is thread-safe - the test which demonstrates this is slow,
We have test coverage that shows that the usage of ``v`` in this way is thread-safe - the test which demonstrates this is slow,
but it needs to be, since shorter tests did not always exercise expected failure modes.

The pyfive test suite includes all the components necessary for testing pyfive accessing data via both POSIX and S3.
The ``pyfive`` test suite includes all the components necessary for testing ``pyfive`` accessing data via both POSIX and S3.
21 changes: 10 additions & 11 deletions doc/optimising.rst
@@ -9,22 +9,22 @@
The data storage complexities arise from two main factors: the use of chunking, and the way attributes are stored in the files.

**Chunking**: HDF5 files can store data in chunks, which allows for more efficient access to large datasets.
However, this also means that the library needs to maintain an index (a "b-tree") which relates the position in coordinate space to where each chunk is stored in the file.
There is a b-tree index for each chunked variable, and this index can be scattered across the file, which can introduce overheads when accessing the data.
However, this also means that the library needs to maintain an index (`b-tree`) which relates the position in coordinate space to where each chunk is stored in the file.
There is a `b-tree` index for each chunked variable, and this index can be scattered across the file, which can introduce overheads when accessing the data.

**Attributes**: HDF5 files can store attributes (metadata) associated with datasets and groups, and these attributes are stored in a separate section of the file.
**Attributes**: HDF5 files can store attributes (`metadata`) associated with datasets and groups, and these attributes are stored in a separate section of the file.
Again, these can be scattered across the files.


Optimising the files themselves
-------------------------------

Optimal access to data occurs when the data is chunked in a way that matches the access patterns of your application, and when the
b-tree indexes and attributes are stored contiguously in the file.
`b-tree` indexes and attributes are stored contiguously in the file.

Users of ``pyfive`` will always confront data files which have been created by other software, but if possible, it is worth exploring whether
the `h5repack <https://docs.h5py.org/en/stable/special.html#h5repack>`_ tool can
be used to make a copy of the file which is optimised for access by using sensible chunks and to store the attributes and b-tree indexes contiguously.
be used to make a copy of the file which is optimised for access by using sensible chunks and to store the attributes and `b-tree` indexes contiguously.
If that is possible, then all access will benefit from fewer calls to storage to get the necessary metadata, and the data access will be faster.


@@ -84,8 +84,7 @@ For example, you can use the `concurrent.futures` module to read data from multi

print("Results:", results)


You can do the same thing to parallelise manipulations within the variables, by for example using, ``Dask``, but that is beyond the scope of this document.
You can do the same thing to parallelise manipulations within the variables, for example by using ``dask``, but that is beyond the scope of this document.


Using pyfive with S3
@@ -101,8 +100,6 @@ file, which for HDF5 will be stored as one object, look like it is on a file sys
memory so repeated reads can be more efficient. The optimal caching strategy is dependent on the file layout
and the expected access pattern, so ``s3fs`` provides a lot of flexibility as to how to configure that caching strategy.



For ``pyfive`` the three most important variables to consider altering are the
``default_block_size`` number, the ``default_cache_type`` option and the ``default_fill_cache`` boolean.

@@ -121,7 +118,9 @@ For ``pyfive`` the three most important variables to consider altering are the
This is a boolean which determines whether ``s3fs`` will persistently cache the data that it reads.
If this is set to ``True``, then the blocks are cached persistently in memory, but if set to ``False``, then it only makes sense in conjunction with ``default_cache_type`` set to ``readahead`` or ``bytes`` to support streaming access to the data.
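
Putting the three options together, a hedged sketch of opening a file on S3 (the bucket, object, and variable names are
invented; ``anon=True`` assumes public data; and it is assumed that ``pyfive.File`` accepts the open file-like object
returned by ``s3fs``)::

    import pyfive
    import s3fs

    fs = s3fs.S3FileSystem(
        anon=True,
        default_block_size=8 * 1024 * 1024,   # bytes fetched per request
        default_cache_type="readahead",       # stream blocks ahead of the read position
        default_fill_cache=False,             # do not keep every block in memory
    )
    with fs.open("my-bucket/example.nc", "rb") as s3file:
        with pyfive.File(s3file) as f:
            v = f["variable_name"]
            print(v.shape)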

Note that even with these strategies, it is possible that the file layout itself is such that access will be slow.
See the next section for more details of how to optimise your hDF5 files for cloud acccess.
.. note::

Even with these strategies, it is possible that the file layout itself is such that access will be slow.
See the next section for more details of how to optimise your HDF5 files for cloud access.

