From 01a717532e81b184c7d376df65e7a4c5f572a9a7 Mon Sep 17 00:00:00 2001 From: Trevor James Smith <10819524+Zeitsperre@users.noreply.github.com> Date: Wed, 7 Jan 2026 15:19:11 -0500 Subject: [PATCH 1/3] documentation consistency fixes Signed-off-by: Trevor James Smith <10819524+Zeitsperre@users.noreply.github.com> --- README.md | 35 +++++++++---------- doc/_sidebar.rst.inc | 3 +- doc/additional.rst | 11 +++--- doc/api_reference.rst | 3 +- doc/cloud.rst | 15 ++++----- doc/conf.py | 2 +- doc/introduction.rst | 30 ++++++++--------- doc/optimising.rst | 6 ++-- doc/p5dump.rst | 13 ++++---- doc/quickstart/enums.rst | 59 +++++++++++++++++---------------- doc/quickstart/installation.rst | 24 +++++++------- doc/quickstart/opaque.rst | 21 ++++-------- doc/quickstart/usage.rst | 38 ++++++++------------- 13 files changed, 121 insertions(+), 139 deletions(-) diff --git a/README.md b/README.md index ed6727e3..741be9bc 100644 --- a/README.md +++ b/README.md @@ -12,38 +12,36 @@ pyfive : A pure Python HDF5 file reader ======================================= -pyfive is an open source library for reading HDF5 files written using +``pyfive`` is an open source library for reading HDF5 files written using pure Python (no C extensions). The package is still in development and not all features of HDF5 files are supported. -pyfive aims to support the same API as [`h5py`](https://github.com/h5py/h5py) -for reading files. Cases where a file uses a feature that is supported by `h5py` -but not pyfive are considered bug and should be reported in our [Issues](https://github.com/NCAS-CMS/pyfive/issues). -Writing HDF5 is not a goal of pyfive and portions of the API which apply only to writing will not be -implemented. +``pyfive`` aims to support the same API as [`h5py`](https://github.com/h5py/h5py) +for reading files. Cases where a file uses a feature that is supported by ``h5py`` +but not ``pyfive`` are considered bug and should be reported in our [Issues](https://github.com/NCAS-CMS/pyfive/issues). +Writing HDF5 is not a goal of ``pyfive`` and portions of the API which apply only to writing will not be implemented. Dependencies ============ -pyfive is tested to work with Python 3.10 to 3.13. It may also work -with other Python versions. +``pyfive`` is tested against Python versions 3.10 to 3.13. It may also work with other Python versions. -The only dependencies to run the software besides Python is NumPy. +The only dependencies to run the software besides Python is ``numpy``. Install ======= -pyfive can be installed using pip using the command:: +pyfive can be installed using ``pip`` using the command:: pip install pyfive -conda package are also available from conda-forge which can be installed:: +``conda`` packages are also available from conda-forge:: conda install -c conda-forge pyfive To install from source in your home directory use:: - python setup.py install --user + pip install --user ./pyfive The library can also be imported directly from the source directory. @@ -54,21 +52,20 @@ Development git --- -You can check out the latest pyfive souces with the command:: +You can check out the latest ``pyfive`` souces with the command:: git clone https://github.com/NCAS-CMS/pyfive.git testing ------- -pyfive comes with a test suite in the ``tests`` directory. These tests can be -exercised using the commands ``pytest`` from the root directory assuming the -``pytest`` package is installed. +``pyfive`` comes with a test suite in the ``tests`` directory. 
+These tests can be exercised using the ``pytest`` command from the root directory (requires installation of the ``pytest`` package). -Conda-feedstock -=============== +Conda-forge feedstock +===================== -Package repository at [conda feedstock](https://github.com/conda-forge/pyfive-feedstock) +Package repository [conda-forge feedstock](https://github.com/conda-forge/pyfive-feedstock) Codecov ======= diff --git a/doc/_sidebar.rst.inc b/doc/_sidebar.rst.inc index 7acc95b9..7cfbb47a 100644 --- a/doc/_sidebar.rst.inc +++ b/doc/_sidebar.rst.inc @@ -8,7 +8,8 @@ Introduction Getting started API Reference + The p5dump utility Additional API Features Optimising Data Access Speed - The p5dump utility + Understanding Cloud Optimisation Change Log diff --git a/doc/additional.rst b/doc/additional.rst index 5bd3492a..bdaee6b6 100644 --- a/doc/additional.rst +++ b/doc/additional.rst @@ -8,8 +8,10 @@ Modifications to the File API When acccessing a file, in addition there are two modifications to the standard ``h5py`` API that can be used to optimise performance. A new method (``get_lazy_view``) and an additional keyword argument on ``visititems`` (noindex) are provided -to support access to all dataset metadata without loading chunk indices. (Loading chunk indices at dataset -instantiation is mostly a useful optimisation, but not if you have no intent of accessing the data itself.) +to support access to all dataset metadata without loading chunk indices. + +.. note:: + Loading chunk indices at dataset instantiation is mostly a useful optimisation, but not if you have no intent of accessing the data itself. The ``Group`` API is fully documented in the autogenerated API reference, but the additional methods and keyword arguments are highlighted here. These methods are also avilable on the ``File`` class, since ``File`` is a subclass of ``Group``. @@ -21,10 +23,9 @@ These methods are also avilable on the ``File`` class, since ``File`` is a subcl Modifications to the DatasetID API ---------------------------------- -When accessing datasets, additional functionality is exposed via the ``pyfive.h5d.DatasetID`` class, which -is the class which implements the low-level data access methods for datasets (aka "variables"). +When accessing datasets, additional functionality is exposed via the ``pyfive.h5d.DatasetID`` class, which implements the low-level data access methods for datasets (`Variables`). -The DatasetID API is fully documented in the autogenerated API reference, but the additional methods and attributes are highlighted here: +The ``DatasetID`` API is fully documented in the autogenerated API reference, but additional methods and attributes are highlighted here: .. autoattribute:: pyfive.h5d.DatasetID.first_chunk .. autoattribute:: pyfive.h5d.DatasetID.btree_range diff --git a/doc/api_reference.rst b/doc/api_reference.rst index cc6f2368..2246ed6c 100644 --- a/doc/api_reference.rst +++ b/doc/api_reference.rst @@ -26,6 +26,7 @@ Dataset DatasetID ---------- + .. autoclass:: pyfive.h5d.DatasetID :members: :noindex: @@ -41,7 +42,7 @@ Datatype The h5t module -------------- -Partial implementation of some of the lower level h5py API, needed +Partial implementation of some of the lower level ``h5py`` API, needed to support enumerations, variable length strings, and opaque datatypes. .. 
autofunction:: pyfive.h5t.check_enum_dtype diff --git a/doc/cloud.rst b/doc/cloud.rst index 0a22ef80..e432e1fe 100644 --- a/doc/cloud.rst +++ b/doc/cloud.rst @@ -1,7 +1,7 @@ Cloud Optimisation ****************** -While `pyfive` can only read HDF5 files, it includes some features to help users understand whether it might +While ``pyfive`` can only read HDF5 files, it includes some features to help users understand whether it might be worth rewriting files to make them cloud optimised (as defined by Stern et.al., 2022 [#]_). To be cloud optimised an HDF5 file needs to have a contiguous index for each @@ -21,8 +21,7 @@ Metadata can be repacked to the front of the file and variables can be rechunked which is effectively the same process undertaken when HDF5 data is reformatted to other cloud optimised formats. The HDF5 library provides a tool (`h5repack `_) -which can do this, provided it is driven with suitable information -about required chunk shape and the expected size of metadata fields. +which can do this, provided it is driven with suitable information about required chunk shape and the expected size of metadata fields. `pyfive` supports both a method to query whether such repacking is necessary, and to extract necessary parameters. In the following example we compare and contrast the unpacked and repacked version of a particularly pathological @@ -50,12 +49,11 @@ If we look at some of the output of `p5dump -s` on this file uas:_first_chunk = 36520 ; -we can immediately see that this will be a problematic file! The b-tree index is clearly interleaved with the data +We can immediately see that this will be a problematic file! The b-tree index is clearly interleaved with the data (compare the first chunk address with last index addresses of the two variables), and with a chunk dimension of ``(1,)``, any effort to use the time-dimension to locate data of interest will involve a ludicrous number of one number reads (all underlying libraries read the data one chunk at a time). -It would feel like waiting for the heat death of the universe if one -was to attempt to manipulate this data stored on an object store! +It would feel like waiting for the heat death of the universe if one were to attempt to manipulate this data stored on an object store! It is relatively easy (albeit slow) to use `h5repack `_ @@ -83,12 +81,11 @@ Now data follows indexes, the time dimension is one chunk, and there is a more s While this file would probably benefit from splitting into smaller files, now it has a contiguous set of indexes it is possible to exploit this data via S3. -All the metadata shown in this dump output arises from `pyfive` extensions to the `pyfive.h5t.DatasetID` class. -`pyfive` also provides a simple flag: `consolidated_metadata` for a `File` instance, which can take values of +All the metadata shown in this dump output arises from ``pyfive`` extensions to the ``pyfive.h5d.DatasetID`` class. +``pyfive`` also provides a simple flag: ``consolidated_metadata`` for a ``File`` instance, which can take values of `True` or `False` for any given file, which simplifies at least the "is the index packed at the front of the file?" part of the optimisation question - though inspection of chunking is a key part of the workflow necessary to determine whether or not a file really is optimised for cloud usage. - .. [#] Stern et.al. (2022): *Pangeo Forge: Crowdsourcing Analysis-Ready, Cloud Optimized Data Production*, https://dx.doi.org/10.3389/fclim.2021.782909. ..
[#] Hassel and Cimadevilla Alvarez (2025): *Cmip7repack: Repack CMIP7 netCDF-4 Datasets*, https://dx.doi.org/10.5281/zenodo.17550920. diff --git a/doc/conf.py b/doc/conf.py index 3e252d1b..5ca00a2c 100644 --- a/doc/conf.py +++ b/doc/conf.py @@ -67,6 +67,7 @@ 'autosummary': True, } +# FIXME: These libraries are not found in the documentation autodoc_mock_imports = [ 'cartopy', 'cf_units', @@ -164,7 +165,6 @@ # The name of an image file (relative to this directory) to place at the top # of the sidebar. -# FIXME add a logo html_logo = "figures/Pyfive-logo.png" # The name of an image file (within the static path) to use as favicon of the diff --git a/doc/introduction.rst b/doc/introduction.rst index ed28bf08..cf82f1bb 100644 --- a/doc/introduction.rst +++ b/doc/introduction.rst @@ -5,26 +5,26 @@ About Pyfive ============ ``pyfive`` provides a pure Python HDF reader which has been designed to be a thread-safe drop in replacement -for `h5py `_ with no dependencies on the HDF C library. It aims to support the same API as -for reading files. Cases where access to a file uses a feature that is supported by the high-level ``h5py`` interface but not ``pyfive`` are considered bugs and +for `h5py `_ with no dependencies on the HDF5 C library. It aims to support the same API as ``h5py`` for reading files. +Cases where access to a file uses a feature that is supported by the high-level ``h5py`` interface but not ``pyfive`` are considered bugs and should be reported in our `Issues `_. -Writing HDF5 is not a goal of pyfive and portions of the ``h5py`` API which apply only to writing will not be -implemented. + +Writing HDF5 output is not a goal of ``pyfive`` and portions of the ``h5py`` API which apply only to writing will not be implemented. .. note:: - While ``pyfive`` is designed to be a drop-in replacement for ``h5py``, the reverse may not be possible. It is possible to do things with ``pyfive`` - that will not work with ``h5py``, and ``pyfive`` definitely includes *extensions* to the ``h5py`` API. This documentation makes clear which parts of - the API are extensions and where behaviour differs *by design* from ``h5py``. + While ``pyfive`` is designed to be a drop-in replacement for ``h5py``, the reverse may not be possible. + It is possible to perform actions with ``pyfive`` that are not supported by ``h5py`` as ``pyfive`` extends the ``h5py`` API beyond its initial specifications. + This documentation makes clear which parts of the API are extensions and where behaviour differs *by design* from ``h5py``. -The motivation for ``pyfive`` development were many, but recent developments prioritised thread-safety, lazy loading, and +The motivations for ``pyfive`` development were many, but recent developments prioritised thread-safety, lazy loading, and performance at scale in a cloud environment both standalone, -and as a backend for other software such as `cf-python `_, `xarray `_, and `h5netcdf `_. +and as a backend for other software such as `cf-python `_, `xarray `_, and `h5netcdf `_. As well as the high-level ``h5py`` API we have implemented a version of the ``h5d.DatasetID`` class, which now -holds all the code which is used for data access (as opposed to attribute access). We have also implemented +holds all the code which is used for data access (as opposed to attribute access). 
We have also implemented extra methods (beyond the ``h5py`` API) to expose the chunk index directly (as well as via an iterator) and -to access chunk info using the ``zarr`` indexing scheme rather than the ``h5py`` indexing scheme. This is useful for avoiding -the need for *a priori* use of ``kerchunk`` to make a ``zarr`` index for a file. +to access chunk info using the ``zarr`` indexing scheme rather than the ``h5py`` indexing scheme. +This is useful for avoiding the need for *a priori* use of ``kerchunk`` to make a ``zarr`` index for a file. The code also includes an implementation of what we have called pseudochunking which is used for accessing a contiguous array which is larger than memory via S3. In essence all this does is declare default chunks @@ -35,11 +35,11 @@ once a variable is instantiated (i.e. for an open ``pyfive.File`` instance ``f`` the attributes and b-tree (chunk index) are read, and it is then possible to close the parent file (``f``), but continue to use (``v``). -The package includes a script ``p5dump`` which can be used to dump the contents of an HDF5 file to the terminal. +The package also includes a command line tool (``p5dump``) which can be used to dump the contents of an HDF5 file to the terminal. .. note:: - We have test coverage that shows that the usage of ``v`` in this way is thread-safe - the test which demonstrates this is slow, + We have test coverage that shows that the usage of ``v`` in this way is thread-safe - the test which demonstrates this is slow, but it needs to be, since shorter tests did not always exercise expected failure modes. -The pyfive test suite includes all the components necessary for testing pyfive accessing data via both POSIX and S3. +The ``pyfive`` test suite includes all the components necessary for testing ``pyfive`` accessing data via both POSIX and `S3`. diff --git a/doc/optimising.rst b/doc/optimising.rst index ff5406ab..c17299c2 100644 --- a/doc/optimising.rst +++ b/doc/optimising.rst @@ -121,7 +121,9 @@ For ``pyfive`` the three most important variables to consider altering are the This is a boolean which determines whether ``s3fs`` will persistently cache the data that it reads. If this is set to ``True``, then the blocks are cached persistently in memory, but if set to ``False``, then it only makes sense in conjunction with ``default_cache_type`` set to ``readahead`` or ``bytes`` to support streaming access to the data. -Note that even with these strategies, it is possible that the file layout itself is such that access will be slow. -See the next section for more details of how to optimise your hDF5 files for cloud acccess. +.. note:: + + Even with these strategies, it is possible that the file layout itself is such that access will be slow. + See the next section for more details of how to optimise your HDF5 files for cloud access. diff --git a/doc/p5dump.rst b/doc/p5dump.rst index 00a58084..e17e51be 100644 --- a/doc/p5dump.rst +++ b/doc/p5dump.rst @@ -3,18 +3,17 @@ p5dump ``pyfive`` includes a command line tool ``p5dump`` which can be used to dump the contents of an HDF5 file to the terminal (e.g ``p5dump myfile.hdf5``). This is similar to the ``ncdump`` tool included with the NetCDF library, or the ``h5dump`` tool included -with the HDF5 library, but like the rest of pyfive, is implemented in pure Python without any dependencies on the -HDF5 C library. +with the HDF5 library, but like the rest of ``pyfive``, is implemented in pure Python without any dependencies on the HDF5 C library.
-It is not identical to either of these tools, though the default output is very close to that of ``ncdump``. -When called with `-s` (e.g ``p5dump -s myfile.hdf5``) the output provides extra information for chunked -datasets, including the locations of the start and end of the chunk index b-tree +``p5dump`` is not identical to either of these tools, though the default output is very close to that of ``ncdump``. +When called with the ``"-s"`` flag (e.g ``p5dump -s myfile.hdf5``) the output provides extra information for chunked +datasets, including the locations of the start and end of the chunk index `b-tree` and the location of the first data chunk for that variable. This extra information is useful for understanding the performance of data access for chunked variables, particularly when accessing data in object stores such as -S3. In general, if one finds that the b-tree index continues past the first data chunk, access +`S3`. In general, if one finds that the `b-tree` index continues past the first data chunk, access performance may be sub-optimal - in this situation, if you have control over the data, you might well consider using the ``h5repack`` tool from the standard HDF5 distribution to make a copy of the file with the -chunk index and attributes stored contiguously. All tools which read HDF5 files will benefit from this. +chunk index and attributes stored contiguously. All tools which read HDF5 files will benefit from this. A ``p5dump`` example: diff --git a/doc/quickstart/enums.rst b/doc/quickstart/enums.rst index feaa9ef1..2b8fd074 100644 --- a/doc/quickstart/enums.rst +++ b/doc/quickstart/enums.rst @@ -1,27 +1,28 @@ Enumerations ------------ -HDF5 has the concept of an enumeration data type, where integer values are stored in an array, but where those integer -values should be interpreted as the indexes to some string values. So, for example, one could have -an enumeration dictionary (`enum_dict`) defined as +HDF5 has the concept of an "`enumeration data type`", where integer values are stored in an array, but where those integer +values should be interpreted as the indexes to some string values. + +For example, one could have an enumeration dictionary (`enum_dict`) defined as: .. code-block:: python - clouds = ['stratus','strato-cumulus','missing','nimbus','cumulus','longcloudname'] - enum_dict = {v:k for k,v in enumerate(clouds)} - enum_dict['missing'] = 255 + clouds = ['stratus','strato-cumulus','missing','nimbus','cumulus','longcloudname'] + enum_dict = {v:k for k,v in enumerate(clouds)} + enum_dict['missing'] = 255 And an array of data which looked something like .. code-block:: python - cloud_cover = [0,3,4,4,4,1,255,1,1] + cloud_cover = [0,3,4,4,4,1,255,1,1] Which one would expect to interpret as .. code-block:: python - actual_cloud_cover = ['stratus','nimbus','cumulus','cumulus','cumulus', + actual_cloud_cover = ['stratus','nimbus','cumulus','cumulus','cumulus', 'stratus','missing','strato-cumulus','strato-cumulus'] These data are stored in HDF5 using a combination of an integer @@ -44,27 +45,29 @@ in the following example: .. code-block:: python - with pyfive.File('myfile.h5') as pfile: + with pyfive.File('myfile.h5') as pfile: - evar = pfile['evar'] - edict = pyfive.check_enum_dtype(evar.dtype) - if edict is None: - pass # not an enumeration - else: - # for some reason HDF5 defines these in what seems to be the wrong way around, - # with the string values as keys to the integer indices. 
- edict_reverse = {v:k for k,v in edict.items()} - # assuming evar data is a one dimensional array of integers - edata = [edict_reverse[k] for k in evar[:]] - -In this instance, `edata` would now be a array of strings indexed from the enumeration dictionary using -the `evar` data as the index values. - -(`h5py` and hence `pyfive` have both used an internal numpy dtype metadata feature to implement enumerations. -Numpy is not clear on the future of this feature, and doesn't promise to transfer metadata with all operations, -so the output of operations on this integer array may lose the direct link to the enumeration via the dtype. -Meanwhile, as well as using the `check_enum_dtype`, you can also get to this dictionary directly yourself, -it's available at ``evar.dtype.metadata['enum']``.) + evar = pfile['evar'] + edict = pyfive.check_enum_dtype(evar.dtype) + if edict is None: + pass # not an enumeration + else: + # for some reason HDF5 defines these in what seems to be the wrong way around, + # with the string values as keys to the integer indices. + edict_reverse = {v:k for k,v in edict.items()} + # assuming evar data is a one dimensional array of integers + edata = [edict_reverse[k] for k in evar[:]] + +In this instance, ``edata`` would now be an array of strings indexed from the enumeration dictionary using +the ``evar`` data as the index values. + +.. note:: + + ``h5py`` and hence ``pyfive`` have both used an internal numpy dtype metadata feature to implement enumerations. + ``numpy`` is not clear on the future of this feature, and doesn't promise to transfer metadata with all operations, + so the output of operations on this integer array may lose the direct link to the enumeration via the dtype. + Meanwhile, as well as using the ``check_enum_dtype`` function, you can also get to this dictionary directly yourself, + it is available at ``evar.dtype.metadata['enum']``. diff --git a/doc/quickstart/installation.rst b/doc/quickstart/installation.rst index 1269e936..20ede9b3 100644 --- a/doc/quickstart/installation.rst +++ b/doc/quickstart/installation.rst @@ -7,10 +7,10 @@ Installation Installation from conda-forge ----------------------------- -``pyfive`` is on conda forge and can be installed with either ``conda`` or ``mamba`` (``mamba`` is now the +``pyfive`` is available on `conda-forge` and can be installed with either ``conda`` or ``mamba`` (``mamba`` is now the defaut solver for ``conda`` so might as well just use ``conda``): -.. code-block:: bash +.. code-block:: console conda install -c conda-forge pyfive @@ -19,7 +19,7 @@ Installation from PyPI ``pyfive`` can be installed from PyPI: -.. code-block:: bash +.. code-block:: console pip install pyfive @@ -27,11 +27,11 @@ Install from source: conda-mamba environment -------------------------------------------- Use a Miniconda/Miniforge3 installer to create an environment using -our conda ``environment.yml`` file; download the latest Miniconda3 for Linux installer from -the `Miniconda project `_, -install it, then create and activate the Pyfive environment: +our conda ``environment.yml`` file; download the latest Miniconda3 installer from +the `Miniconda project `_, +install it, then create and activate the ``pyfive`` environment: -.. code-block:: bash +.. code-block:: console wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh bash Miniconda3-latest-Linux-x86_64.sh ..
note:: - Our dependencies are all from conda-forge, ensuring a smooth and reliable installation process. + Our dependencies are all from `conda-forge`, ensuring a smooth and reliable installation process. Installing Pyfive from source ----------------------------- -The installation then can proceed: installing with ``pip`` and installing ``all`` (ie +The installation then can proceed: installing with ``pip`` and installing ``all`` (i.e. installing the development and test install): -.. code-block:: bash +.. code-block:: console pip install -e . @@ -57,7 +57,7 @@ After installing, you can run tests via ``pytest -n 2``. Supported Python versions ------------------------- -We adhere to `SPEC0 `_ and support the following Python versions: +We adhere to `Scientific Python SPEC-0 `_ and support the following Python versions: * 3.10 * 3.11 @@ -66,4 +66,4 @@ We adhere to `SPEC0 `_ and suppo .. note:: - Pyfive is fully compatible with ``numpy >=2.0.0``. + ``pyfive`` is fully compatible with ``numpy >=2.0.0``. diff --git a/doc/quickstart/opaque.rst b/doc/quickstart/opaque.rst index c8e2455c..1dd17c47 100644 --- a/doc/quickstart/opaque.rst +++ b/doc/quickstart/opaque.rst @@ -1,25 +1,16 @@ Opaque Datasets --------------- -It is possible to create datasets with opaque datatypes in HDF5. These are +It is possible to create datasets with opaque datatypes in HDF5. These are datasets where the data is stored as a sequence of bytes, with no -interpretation of those bytes. This is not a commonly used feature of HDF5, -but it is used in some applications. The `h5py` package supports reading -and writing opaque datatypes, and so `pyfive` also supports reading them. +interpretation of those bytes. This is not a commonly used feature of HDF5, +but it is used in some applications. The ``h5py`` package supports reading +and writing opaque datatypes, and so ``pyfive`` also supports reading them. This implementation has only been tested for opaque datatypes that -were created using `h5py`. +were created using ``h5py``. Such opaque datatypes will be transparently read into the same type of -numpy array as was used to write the data. The users should not +numpy array as was used to write the data. The users should not need to do anything special to read the data - but may need to do something special with the data to interpret it once read. - - - - - - - - - diff --git a/doc/quickstart/usage.rst b/doc/quickstart/usage.rst index b55c3712..34daf169 100644 --- a/doc/quickstart/usage.rst +++ b/doc/quickstart/usage.rst @@ -1,11 +1,11 @@ .. _usage: -******* +***** Usage -******* +***** -In the main, one uses ``pyfive`` exactly as one would use ``h5py`` so the documentation for ``h5py`` is also relevant. However, -``pyfive`` has some additional API features and optimisations which are noted in the section on "Additional API Features". +In the main, one uses ``pyfive`` exactly as one would use ``h5py`` so the documentation for ``h5py`` is also relevant. +However, ``pyfive`` has some additional API features and optimisations which are noted in the section on "Additional API Features". .. note:: @@ -50,8 +50,7 @@ Here is a simple example of how to open an HDF5 file and read its contents using In this example: -* ``pyfive.File`` opens the file, returning a file object that behaves like a - Python dictionary. +* ``pyfive.File`` opens the file, returning a file object that behaves like a Python dictionary. * Groups (``Group`` objects) can be accessed using dictionary-like keys. 
* Datasets (``Dataset`` objects) expose attributes like ``shape`` and ``dtype`` which are loaded when you list them, but the data itself is not loaded from stroage into numpy arrays until you access it. @@ -60,9 +59,9 @@ In this example: .. note:: - If you are used to working with NetCDF4 files (and maybe `netcdf4-python `_) the concept of a ``File`` in ``pyfive`` corresponds to - a NetCDF4 ``Dataset`` (both are read from an actual file), and the ``HDF5``/``pyfive``/``h5py`` concept of a ``Dataset`` corresponds to a NetCDF ``Variable``. - (At least the notion of a group is semantically similar in both cases! ) + If you are used to working with NetCDF4 files (and maybe `netcdf4-python `_), the concept of a ``File`` in ``pyfive`` corresponds to + a NetCDF4 ``Dataset`` (both are read from an actual file), and the ``HDF5``/``pyfive``/``h5py`` concept of a ``Dataset`` corresponds to a NetCDF ``Variable`` + (the notion of a group is semantically similar in both cases!). Working with datasets ===================== @@ -100,9 +99,9 @@ working with large datasets in a parallel environment where you might want to cl Using S3/Object Storage ======================= -``pyfive`` is designed to work seamlessly with both local filesystems and S3-compatible object storage (and probably any remote storage that supports +``pyfive`` is designed to work seamlessly with both local filesystems and `S3`-compatible object storage (and probably any remote storage that supports the `fsspec `_ API). However, there are some additional considerations when working with S3, the -most important of which is the need to use the `s3fs` library to provide a filesystem interface to S3. +most important of which is the need to use the ``s3fs`` library to provide a filesystem interface to S3. Here is a simple example of how to open an HDF5 file stored in S3 and read its contents using ``pyfive``: @@ -133,19 +132,10 @@ Here is a simple example of how to open an HDF5 file stored in S3 and read its c .. note:: - The best `s3fs` parameters to use (`s3params`) will depend on what you are actually doing with the file, as - discussed in the section on "Optimising Access Speed". The parameters above worked well for accessing + The ideal ``s3fs`` parameters to use (``s3params``) will depend on what you are actually doing with the file, as + discussed in the section on "Optimising Access Speed". The parameters shown above work well for accessing small amounts of data from a large file, but you may need to adjust them for your specific use case. +.. FIXME: Check if the following is still accurate This example also shows that while it is possible to close the file access context manager and still access the datasets, -you will need to ensure that the S3 filesystem is still available. ** TBD: CHECK IF THAT IS STILL TRUE** - - - - - - - - - - +you will need to ensure that the S3 filesystem is still available.
From 0c9dea0783f485b57a6b72d07e2384245d398f7c Mon Sep 17 00:00:00 2001 From: Trevor James Smith <10819524+Zeitsperre@users.noreply.github.com> Date: Wed, 7 Jan 2026 16:23:55 -0500 Subject: [PATCH 2/3] small formatting nitpicks Signed-off-by: Trevor James Smith <10819524+Zeitsperre@users.noreply.github.com> --- README.md | 15 ++++++++------- doc/cloud.rst | 2 +- doc/introduction.rst | 2 +- doc/optimising.rst | 15 ++++++--------- doc/p5dump.rst | 4 ++-- doc/quickstart/installation.rst | 1 + 6 files changed, 19 insertions(+), 20 deletions(-) diff --git a/README.md b/README.md index 741be9bc..267ba8ee 100644 --- a/README.md +++ b/README.md @@ -16,15 +16,16 @@ pyfive : A pure Python HDF5 file reader pure Python (no C extensions). The package is still in development and not all features of HDF5 files are supported. -``pyfive`` aims to support the same API as [`h5py`](https://github.com/h5py/h5py) -for reading files. Cases where a file uses a feature that is supported by ``h5py`` -but not ``pyfive`` are considered bug and should be reported in our [Issues](https://github.com/NCAS-CMS/pyfive/issues). -Writing HDF5 is not a goal of ``pyfive`` and portions of the API which apply only to writing will not be implemented. +``pyfive`` aims to support the same API as [`h5py`](https://github.com/h5py/h5py) for reading files. +Cases where a file uses a feature that is supported by ``h5py`` but not ``pyfive`` are considered bugs +and should be reported in our [Issues](https://github.com/NCAS-CMS/pyfive/issues). +Writing HDF5 output is not a goal of ``pyfive`` and portions of the API which apply only to writing will not be implemented. Dependencies ============ -``pyfive`` is tested against Python versions 3.10 to 3.13. It may also work with other Python versions. +``pyfive`` is tested against Python versions 3.10 to 3.14. +It may also work with other Python versions. The only dependencies to run the software besides Python is ``numpy``. @@ -75,6 +76,6 @@ Test coverage assessement is done using [codecov](https://app.codecov.io/gh/NCAS Documentation ============= -Build locally with Sphinx:: +Build locally with Sphinx: - sphinx-build -Ea doc doc/build + $ sphinx-build -Ea doc doc/build diff --git a/doc/cloud.rst b/doc/cloud.rst index e432e1fe..5155f8cf 100644 --- a/doc/cloud.rst +++ b/doc/cloud.rst @@ -49,7 +49,7 @@ If we look at some of the output of `p5dump -s` on this file uas:_first_chunk = 36520 ; -We can immediately see that this will be a problematic file! The b-tree index is clearly interleaved with the data +We can immediately see that this will be a problematic file! The `b-tree` index is clearly interleaved with the data (compare the first chunk address with last index addresses of the two variables), and with a chunk dimension of ``(1,)``, any effort to use the time-dimension to locate data of interest will involve a ludicrous number of one number reads (all underlying libraries read the data one chunk at a time). diff --git a/doc/introduction.rst b/doc/introduction.rst index cf82f1bb..8cfe83be 100644 --- a/doc/introduction.rst +++ b/doc/introduction.rst @@ -32,7 +32,7 @@ aligned with the array order on disk and use them for data access. There are optimisations to support cloud usage, the most important of which is that once a variable is instantiated (i.e. 
for an open ``pyfive.File`` instance ``f``, when you do ``v=f['variable_name']``) -the attributes and b-tree (chunk index) are read, and it is then possible to close the parent file (``f``), +the attributes and `b-tree` (chunk index) are read, and it is then possible to close the parent file (``f``), but continue to use (``v``). The package also includes a command line tool (``p5dump``) which can be used to dump the contents of an HDF5 file to the terminal. diff --git a/doc/optimising.rst b/doc/optimising.rst index c17299c2..c7e5db15 100644 --- a/doc/optimising.rst +++ b/doc/optimising.rst @@ -9,10 +9,10 @@ how the data is stored in the file and how the data access library (in this case The data storage complexities arise from two main factors: the use of chunking, and the way attributes are stored in the files. **Chunking**: HDF5 files can store data in chunks, which allows for more efficient access to large datasets. -However, this also means that the library needs to maintain an index (a "b-tree") which relates the position in coordinate space to where each chunk is stored in the file. -There is a b-tree index for each chunked variable, and this index can be scattered across the file, which can introduce overheads when accessing the data. +However, this also means that the library needs to maintain an index (`b-tree`) which relates the position in coordinate space to where each chunk is stored in the file. +There is a `b-tree` index for each chunked variable, and this index can be scattered across the file, which can introduce overheads when accessing the data. -**Attributes**: HDF5 files can store attributes (metadata) associated with datasets and groups, and these attributes are stored in a separate section of the file. +**Attributes**: HDF5 files can store attributes (`metadata`) associated with datasets and groups, and these attributes are stored in a separate section of the file. Again, these can be scattered across the files. @@ -20,11 +20,11 @@ Optimising the files themselves ------------------------------- Optimal access to data occurs when the data is chunked in a way that matches the access patterns of your application, and when the -b-tree indexes and attributes are stored contiguously in the file. +`b-tree` indexes and attributes are stored contiguously in the file. Users of ``pyfive`` will always confront data files which have been created by other software, but if possible, it is worth exploring whether the `h5repack `_ tool can -be used to make a copy of the file which is optimised for access by using sensible chunks and to store the attributes and b-tree indexes contiguously. +be used to make a copy of the file which is optimised for access by using sensible chunks and to store the attributes and `b-tree` indexes contiguously. If that is possible, then all access will benefit from fewer calls to storage to get the necessary metadata, and the data access will be faster. @@ -84,8 +84,7 @@ For example, you can use the `concurrent.futures` module to read data from multi print("Results:", results) - -You can do the same thing to parallelise manipulations within the variables, by for example using, ``Dask``, but that is beyond the scope of this document. +You can do the same thing to parallelise manipulations within the variables, by, for example, using ``dask``, but that is beyond the scope of this document. Using pyfive with S3 -------------------- @@ -101,8 +100,6 @@ file, which for HDF5 will be stored as one object, look like it is on a file sys memory so repeated reads can be more efficient.
The optimal caching strategy is dependent on the file layout and the expected access pattern, so ``s3fs`` provides a lot of flexibility as to how to configure that caching strategy. - - For ``pyfive`` the three most important variables to consider altering are the ``default_block_size`` number, the ``default_cache_type`` option and the ``default_fill_cache`` boolean. diff --git a/doc/p5dump.rst b/doc/p5dump.rst index e17e51be..7865bf5b 100644 --- a/doc/p5dump.rst +++ b/doc/p5dump.rst @@ -2,11 +2,11 @@ p5dump ****** ``pyfive`` includes a command line tool ``p5dump`` which can be used to dump the contents of an HDF5 file to the -terminal (e.g ``p5dump myfile.hdf5``). This is similar to the ``ncdump`` tool included with the NetCDF library, or the ``h5dump`` tool included +terminal (e.g. ``p5dump myfile.hdf5``). This is similar to the ``ncdump`` tool included with the NetCDF library, or the ``h5dump`` tool included with the HDF5 library, but like the rest of ``pyfive``, is implemented in pure Python without any dependencies on the HDF5 C library. ``p5dump`` is not identical to either of these tools, though the default output is very close to that of ``ncdump``. -When called with the ``"-s"`` flag (e.g ``p5dump -s myfile.hdf5``) the output provides extra information for chunked +When called with the ``"-s"`` flag (e.g. ``p5dump -s myfile.hdf5``) the output provides extra information for chunked datasets, including the locations of the start and end of the chunk index `b-tree` and the location of the first data chunk for that variable. This extra information is useful for understanding the performance of data access for chunked variables, particularly when accessing data in object stores such as diff --git a/doc/quickstart/installation.rst b/doc/quickstart/installation.rst index 20ede9b3..ad3616ef 100644 --- a/doc/quickstart/installation.rst +++ b/doc/quickstart/installation.rst @@ -63,6 +63,7 @@ We adhere to `Scientific Python SPEC-0 Date: Thu, 8 Jan 2026 15:15:34 +0000 Subject: [PATCH 3/3] Apply suggestions from code review Co-authored-by: David Hassell --- doc/cloud.rst | 2 +- doc/p5dump.rst | 2 +- doc/quickstart/enums.rst | 8 ++++---- doc/quickstart/installation.rst | 2 +- 4 files changed, 7 insertions(+), 7 deletions(-) diff --git a/doc/cloud.rst b/doc/cloud.rst index 5155f8cf..84e5e9db 100644 --- a/doc/cloud.rst +++ b/doc/cloud.rst @@ -88,4 +88,4 @@ part of the optimisation question - though inspection of chunking is a key part determine whether or not a file really is optimised for cloud usage. .. [#] Stern et.al. (2022): *Pangeo Forge: Crowdsourcing Analysis-Ready, Cloud Optimized Data Production*, https://dx.doi.org/10.3389/fclim.2021.782909. -.. [#] Hassel and Cimadevilla Alvarez (2025): *Cmip7repack: Repack CMIP7 netCDF-4 Datasets*, https://dx.doi.org/10.5281/zenodo.17550920. +.. [#] Hassell and Cimadevilla Alvarez (2025): *Cmip7repack: Repack CMIP7 netCDF-4 Datasets*, https://dx.doi.org/10.5281/zenodo.17550920. diff --git a/doc/p5dump.rst b/doc/p5dump.rst index 7865bf5b..19e34589 100644 --- a/doc/p5dump.rst +++ b/doc/p5dump.rst @@ -6,7 +6,7 @@ terminal (e.g. ``p5dump myfile.hdf5``). This is similar to the ``ncdump`` tool i with the HDF5 library, but like the rest of ``pyfive``, is implemented in pure Python without any dependencies on the HDF5 C library. ``p5dump`` is not identical to either of these tools, though the default output is very close to that of ``ncdump``. -When called with the ``"-s"`` flag (e.g. 
``p5dump -s myfile.hdf5``) the output provides extra information for chunked +When called with the ``-s`` flag (e.g. ``p5dump -s myfile.hdf5``) the output provides extra information for chunked datasets, including the locations of the start and end of the chunk index `b-tree` and the location of the first data chunk for that variable. This extra information is useful for understanding the performance of data access for chunked variables, particularly when accessing data in object stores such as diff --git a/doc/quickstart/enums.rst b/doc/quickstart/enums.rst index 2b8fd074..5ae8e81d 100644 --- a/doc/quickstart/enums.rst +++ b/doc/quickstart/enums.rst @@ -8,7 +8,7 @@ For example, one could have an enumeration dictionary (`enum_dict`) defined as: .. code-block:: python - clouds = ['stratus','strato-cumulus','missing','nimbus','cumulus','longcloudname'] + clouds = ['stratus', 'strato-cumulus', 'missing', 'nimbus', 'cumulus', 'longcloudname'] enum_dict = {v:k for k,v in enumerate(clouds)} enum_dict['missing'] = 255 @@ -16,14 +16,14 @@ And an array of data which looked something like .. code-block:: python - cloud_cover = [0,3,4,4,4,1,255,1,1] + cloud_cover = [0, 3, 4, 4, 4, 1, 255, 1, 1] Which one would expect to interpret as .. code-block:: python - actual_cloud_cover = ['stratus','nimbus','cumulus','cumulus','cumulus', - 'stratus','missing','strato-cumulus','strato-cumulus'] + actual_cloud_cover = ['stratus', 'nimbus', 'cumulus', 'cumulus', 'cumulus', + 'stratus', 'missing', 'strato-cumulus', 'strato-cumulus'] These data are stored in HDF5 using a combination of an integer valued array and a stored dictionary which is used for the enumeration. diff --git a/doc/quickstart/installation.rst b/doc/quickstart/installation.rst index ad3616ef..4bb76437 100644 --- a/doc/quickstart/installation.rst +++ b/doc/quickstart/installation.rst @@ -67,4 +67,4 @@ We adhere to `Scientific Python SPEC-0 =2.0.0``. + ``pyfive`` is fully compatible with ``numpy >= 2.0.0``.