82 changes: 53 additions & 29 deletions docs/user-guide/custom-components.rst
@@ -36,9 +36,9 @@ for using body mass index (BMI) as a phenotype score.
>>> from gpsea.analysis.pscore import PhenotypeScorer
>>> class BmiScorer(PhenotypeScorer): # ❶
...
... def __init__( # ❷
... def __init__(
... self,
... id2bmi: typing.Mapping[str, float],
... id2bmi: typing.Mapping[str, float], # ❷
... ):
... self._id2bmi = id2bmi
...
@@ -60,36 +60,40 @@ for using body mass index (BMI) as a phenotype score.
... except KeyError:
... return float('nan')

❶ The ``BmiScorer`` must extend :class:`~gpsea.analysis.pscore.PhenotypeScorer`
to be used as a phenotype scorer.
❷ The scorer needs a ``dict`` with `label` → `BMI` for the analyzed individuals.
We assume the user will pre-compute the corresponding ``dict``.
The ``BmiScorer`` must extend :class:`~gpsea.analysis.pscore.PhenotypeScorer`
to be used as a phenotype scorer (❶).
The scorer needs a mapping (e.g. a Python ``dict``) with `label` → `BMI` for the analyzed individuals (❷).
We assume the user will pre-compute the BMI values.

Then, the scorer must expose several properties, including ``name``, ``description``,
and the ``variable_name`` it operates on (❸❹❺).
GPSEA uses the properties to describe the scorer in reports or visualizations.
We should always aim for short and concise descriptions.

Then, the scorer must expose several properties, including ❸ ``name``, ❹ ``description``,
and the ❺ ``variable_name`` it operates on.
The properties provide bookkeeping metadata to use in e.g. visualizations.
Try to choose short and concise names.
The most important part of the scorer is the `score` method (❻).
As stated above, the scorer is expected to compute a numerical value or `NaN`
if the individual should be excluded from the analysis.
In the case of BMI scorer, the BMI is retrieved from the ``id2bmi`` dictionary.
If the BMI is missing, `NaN` is returned and the individual is omitted from the analysis.

The most important part of the scorer is the ❻ `score` method
which retrieves the BMI for an individual or returns `NaN` if the value is not available
and the individual should be omitted from the analysis.
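The documentation assumes that the `label` → `BMI` mapping is pre-computed by the user. A minimal sketch of that step follows, assuming the weight (kg) and height (m) are available per individual label. The helpers `compute_id2bmi` and `lookup_bmi` are hypothetical names used for illustration and are not part of GPSEA. The lookup mirrors the `KeyError` → `NaN` fallback shown in the scorer above.

```python
import typing


def compute_id2bmi(
    measurements: typing.Mapping[str, typing.Tuple[float, float]],
) -> typing.Dict[str, float]:
    """Map an individual's label to BMI = weight [kg] / height [m] ** 2.

    Hypothetical helper: `measurements` maps label -> (weight_kg, height_m).
    Individuals with a non-positive height are skipped, so they end up
    without a BMI value.
    """
    id2bmi = {}
    for label, (weight_kg, height_m) in measurements.items():
        if height_m > 0:
            id2bmi[label] = weight_kg / height_m ** 2
    return id2bmi


def lookup_bmi(id2bmi: typing.Mapping[str, float], label: str) -> float:
    """Return the BMI for `label`, or NaN if the value is not available."""
    try:
        return id2bmi[label]
    except KeyError:
        return float("nan")
```

The resulting mapping can be passed to the `BmiScorer` constructor as `id2bmi`. An individual that is absent from the mapping gets `NaN` and is therefore omitted from the analysis.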

.. _custom-variant-predicate:

*****************
Variant predicate
*****************

The purpose of a :class:`~gpsea.analysis.predicate.VariantPredicate` is to test
if a variant meets a certain criterion and GPSEA ships with an array
of builtin predicates (see :mod:`gpsea.analysis.predicate` module).
However, chances are a custom predicate will be needed in future,
so we show how to how to extend
the :class:`~gpsea.analysis.predicate.VariantPredicate` class
to create one's own predicate.
A :class:`~gpsea.analysis.predicate.VariantPredicate` tests
if a variant meets a certain criterion (e.g. variant is a deletion, variant is annotated wrt. a transcript of interest)
in order to assign the individual harboring the variant into a genotype class.
GPSEA ships with an array of builtin predicates (see :mod:`gpsea.analysis.predicate` module)
that should cover the most commonly needed cases.

However, since it is unlikely that the builtin predicates cover *all* cases,
GPSEA allows to define custom variant predicates. Here we show how to create one.

Specifically, we show how to create a predicate to test if the variant affects a glycine residue
of the transcript of interest.
As an example, we show how to create a predicate for checking if the variant affects a glycine residue
in a transcript of interest.

>>> from gpsea.model import Variant, VariantEffect
>>> from gpsea.analysis.predicate import VariantPredicate
@@ -133,10 +137,30 @@ of the transcript of interest.
... def __str__(self) -> str: # ➓
... return f"AffectsGlycinePredicate(tx_id={self._tx_id})"

❶ The ``AffectsGlycinePredicate`` must extend :class:`~gpsea.analysis.predicate.VariantPredicate`.
❷ We ask the user to provide the transcript accession `str` and we set the target aminoacid code to glycine ``Gly``.
Like in the :ref:`custom-phenotype-scorer` above, ❸❹❺ provide metadata required for the bookkeeping.
The ❻ ``test`` method includes the most interesting part - we retrieve the :class:`~gpsea.model.TranscriptAnnotation`
with the functional annotation data for the transcript of interest, and we test if the HGVS protein indicates
that the reference aminoacid is glycine.
Last, we override ➐ ``__eq__()`` and ❽ ``__hash__()`` (required) as well as ❾ ``__repr__()`` and ➓ ``__str__()`` (recommended).

The ``AffectsGlycinePredicate`` must extend :class:`~gpsea.analysis.predicate.VariantPredicate` to work with GPSEA (❶).
We ask the user to provide the transcript accession `str` and we set the target aminoacid code to glycine ``Gly`` (❷).

.. note::

Clearly, the predicate can be adapted to test for a change of *any* aminoacid
with only a slight rewrite of the predicate's constructor.
We will leave this as an exercise for the interested readers.

Like in the :ref:`custom-phenotype-scorer` above, we provide metadata required for reports and visualizations (❸❹❺).

The ``test`` method includes the most important logic of the predicate (❻).
In this specific case, we retrieve the :class:`~gpsea.model.TranscriptAnnotation`
with the functional annotation data for the transcript of interest,
and we test if the HGVS protein indicates that the reference aminoacid is glycine.

.. note::

We recommend using an Integrated Development Environment (IDE) such as PyCharm or VS Code to design the predicate.
On top of autocompletion and syntax checking features, an IDE simplifies accessing the properties and methods of objects.
In case of :class:`~gpsea.model.Variant`, an IDE will help us discover its ``get_tx_anno_by_tx_id`` method,
realize that it returns either :class:`~gpsea.model.TranscriptAnnotation` or ``None``,
and retrieve the functional annotation of the variant with respect to transcript's protein sequence
from the ``hgvsp`` field.

Last, we override ``__eq__()`` and ``__hash__()`` (required, ➐❽) as well as ``__repr__()`` and ``__str__()`` (recommended, ❾➓).
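The body of the ``test`` method is elided in this diff view, so the sketch below is not the code from the PR. It only illustrates, under stated assumptions, one way to check whether an HGVS protein string names glycine as the reference residue. It assumes the three-letter amino acid notation (e.g. `p.Gly1013Ser`) and uses plain string matching on a value such as the one read from the ``hgvsp`` field. The function name `affects_residue` and the example accessions are hypothetical.

```python
import re
import typing

# Capture the three-letter reference residue that follows "p." or "p.("
# and is itself followed by the residue position.
_REF_AA = re.compile(r"p\.\(?([A-Z][a-z]{2})\d+")


def affects_residue(
    hgvsp: typing.Optional[str],
    aa_code: str = "Gly",
) -> bool:
    """Test if the reference residue in `hgvsp` equals `aa_code`.

    Returns `False` if `hgvsp` is `None` (e.g. no protein annotation)
    or if no reference residue can be parsed.
    Only the first residue of a range (e.g. `p.Gly10_Ala12del`) is checked.
    """
    if hgvsp is None:
        return False
    match = _REF_AA.search(hgvsp)
    return match is not None and match.group(1) == aa_code
```

Exposing `aa_code` as a parameter is also one way to generalize the check to an arbitrary amino acid, in the spirit of the exercise mentioned in the note above.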
2 changes: 2 additions & 0 deletions pyproject.toml
@@ -41,6 +41,8 @@ dependencies = [
"requests>=2.25.0,<3.0",
"scipy>=1.10,<2.0",
"statsmodels>=0.13.0",
# numpy==1.23 was released on June 23rd, 2022.
# numpy==2.0.0 does not break any APIs used by GPSEA.
"numpy>=1.23",
"matplotlib>=3.2.0,<4.0",
"ratelimit>=2.2.1,<3",
2 changes: 1 addition & 1 deletion src/gpsea/analysis/predicate/_predicates.py
@@ -194,7 +194,7 @@ class VariantTranscriptPredicate(VariantPredicate):
is annotated to affect a transcript with `tx_id` accession.

Args:
tx_id (str): the accessiono of the transcript of interest, e.g. `NM_123456.7`
tx_id (str): the accession of the transcript of interest, e.g. `NM_123456.7`
"""

def __init__(self, tx_id: str) -> None:
1 change: 1 addition & 0 deletions src/gpsea/analysis/pscore/_api.py
@@ -356,6 +356,7 @@ def compare_genotype_vs_phenotype_score(

# Sort by PatientCategory.cat_id and unpack.
# For now, we only allow to have up to 2 groups.
# TODO: fails if one of the categories includes no individuals.
x_key, y_key = sorted(data[MonoPhenotypeAnalysisResult.GT_COL].dropna().unique())
x = data.loc[
data[MonoPhenotypeAnalysisResult.GT_COL] == x_key,
Expand Down
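The `TODO` added in this hunk can be illustrated with plain Python. When only one genotype category has individuals, the sorted list of unique values has a single element, and unpacking it into `x_key, y_key` raises `ValueError`. The sketch below is not the fix adopted by GPSEA. It is a hypothetical guard (`split_two_categories`) that makes the failure explicit, using a plain iterable of category ids as a stand-in for the data frame column.

```python
import typing


def split_two_categories(
    observed: typing.Iterable[int],
) -> typing.Tuple[int, int]:
    """Return the two sorted category ids found in `observed`.

    Hypothetical guard: raises a descriptive `ValueError` unless exactly
    two distinct categories are present, instead of relying on the
    generic error raised by tuple unpacking.
    """
    categories = sorted(set(observed))
    if len(categories) != 2:
        raise ValueError(
            f"Expected exactly 2 genotype categories "
            f"but found {len(categories)}: {categories}"
        )
    x_key, y_key = categories
    return x_key, y_key
```

With both categories present, e.g. `split_two_categories([1, 0, 1, 0])`, the function returns the sorted pair of ids. With a single category, the caller gets an error message that names the categories that were actually observed.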