python_ALE: First stab. #48

Open
wwood wants to merge 15 commits into ssolo:master from wwood:python_ALE

Conversation

@wwood

@wwood wwood commented Mar 20, 2026

Opus-4.6's attempt at a straight port to Python. I haven't looked at this code at all, so there are probably errors. There are 15 tests that pass, though, fwiw.

I also added a pixi.toml so that

pixi run build

builds the C++ version from source (can remove the Python-specific tasks of this if helpful)


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8a4d78f85e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +644 to +646
acceptance_prob = math.exp(
min(0, (new_log_lk + new_log_prior) - (current_log_lk + current_log_prior))
) * hastings_ratio

P1 Badge Clamp the full Metropolis-Hastings ratio before acceptance

ALEmcmc_undated computes exp(min(0, ΔlogP)) * hastings_ratio, but the clamp to 1 has to be applied after multiplying by the Hastings term. With the current code, any proposal that improves the posterior but has hastings_ratio < 1 is under-accepted—for example ΔlogP=5, hastings_ratio=0.1 is accepted with probability 0.1 instead of 1. That changes the chain’s stationary distribution and biases all sampled rates/reconciliations.
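A minimal sketch of the fix described above (helper name hypothetical, not the PR's actual function): compute the full Metropolis-Hastings ratio first, then clamp it at 1.

```python
import math

def mh_acceptance_prob(delta_log_post: float, hastings_ratio: float) -> float:
    # Hypothetical helper: clamp the exponent so large improvements cannot
    # overflow, then clamp the *full* product at 1 (not just the posterior
    # term, as the reviewed code did).
    ratio = math.exp(min(700.0, delta_log_post)) * hastings_ratio
    return min(1.0, ratio)
```

With the review's example, delta_log_post=5 and hastings_ratio=0.1, this accepts with probability 1 rather than 0.1.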


Comment on lines +210 to +214
if rm >= -1:
print(f"# rate multiplier for rate {rate_name} on branch with ID {e} set to {rm}")
rate_multipliers.setdefault("rate_multiplier_" + rate_name, {})[e] = rm
else:
print(f"# rate multiplier for rate {rate_name} on branch with ID {e} to be optimized ")

P2 Badge Preserve optimized branch-multiplier requests in ALEml_undated

When users pass rate_multiplier:<rate>:<branch>:<value> with <value> < -1, this branch only logs "to be optimized" and then drops the request. Nothing later records that branch/rate pair for the minimize() call, so the run still optimizes only delta/tau/lambda/O_R and leaves the requested branch-specific multiplier at its default. That silently produces the wrong likelihood surface and output rates for workflows that rely on optimized branch multipliers.


Comment on lines +568 to +570
for rm_name, rm_dict in rate_multipliers.items():
for e, rm_val in rm_dict.items():
model.vector_parameter[rm_name][e] = rm_val

P2 Badge Apply MCMC fixed rate multipliers after model construction

If ALEmcmc_undated is invoked with any rate_multiplier=... option, this loop indexes model.vector_parameter[...] before construct_undated() initializes the per-branch arrays. ExODTModel.__init__ starts vector_parameter empty, and construct_undated() is what creates keys like rate_multiplier_delta, so this path raises KeyError instead of running with the requested fixed multipliers.
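A sketch of the ordering fix, with a minimal hypothetical stand-in for ExODTModel: build the per-branch vectors first, then overwrite the requested entries.

```python
# Hypothetical stand-in for ExODTModel: vector_parameter starts empty,
# and construct_undated() is what creates the per-branch dicts.
class Model:
    def __init__(self):
        self.vector_parameter = {}

    def construct_undated(self, n_branches):
        for name in ("rate_multiplier_delta", "rate_multiplier_tau",
                     "rate_multiplier_lambda"):
            self.vector_parameter[name] = {e: 1.0 for e in range(n_branches)}

model = Model()
model.construct_undated(n_branches=4)  # must run before any indexing
rate_multipliers = {"rate_multiplier_delta": {2: 0.5}}
for rm_name, rm_dict in rate_multipliers.items():
    for e, rm_val in rm_dict.items():
        model.vector_parameter[rm_name][e] = rm_val  # no KeyError now
```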


…p test

- Remove incorrect min(0, ...) clamp from MCMC acceptance probability
  to match C++ behavior (exp(delta_log_posterior) * hastings_ratio)
- Move MCMC rate_multiplier application after construct_undated() so
  per-branch vectors are initialized before indexing
- Warn on unsupported per-branch rate optimization (rm < -1)
- Fix bip_counts test to compare by leaf name sets instead of integer
  IDs, which differ between Python and C++ implementations

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@wwood

wwood commented Mar 20, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 690d721958


Comment on lines +689 to +690
model.MLRec_events.clear()
model.reset_T_to_from()

P1 Badge Preserve MCMC transfer frequencies across posterior samples

When ALEmcmc_undated runs with sample>1, this reset clears the per-sample transfer matrix before each draw, but unlike src/ALEmcmc_undated.cpp:458-466 the Python port never copies those counts into a running aggregate before _mcmc.uTs is written from model.T_to_from at ALE_python/cli.py:742-756. The transfer file therefore reflects only the last sampled reconciliation (and then divides by all samples), so posterior transfer frequencies are severely understated or disappear entirely.
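A sketch of the aggregation the C++ code performs (function and variable names hypothetical): sum each sample's transfer counts into a running total before the per-sample reset, then divide once by the number of samples.

```python
from collections import defaultdict

def aggregate_transfers(per_sample_counts, n_samples):
    # Hypothetical sketch: accumulate each sample's (to, from) transfer
    # counts into a running aggregate, then average over all samples,
    # instead of keeping only the last sample's matrix.
    agg = defaultdict(float)
    for sample in per_sample_counts:
        for branch_pair, count in sample.items():
            agg[branch_pair] += count
    return {pair: total / n_samples for pair, total in agg.items()}

# A transfer seen in both samples keeps full weight; one seen once gets half.
samples = [{(1, 0): 2.0}, {(1, 0): 1.0, (2, 0): 1.0}]
freqs = aggregate_transfers(samples, 2)
```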


Comment on lines +1015 to +1018
f"Total \t{model.MLRec_events.get('D', 0) / samples}\t"
f"{model.MLRec_events.get('T', 0) / samples}\t"
f"{model.MLRec_events.get('L', 0) / samples}\t"
f"{model.MLRec_events.get('S', 0) / samples}\n"

P2 Badge Report ALEevaluate totals from all sampled reconciliations

This Total row is computed from model.MLRec_events, but that dict is cleared before every sample at ALE_python/cli.py:993-995. For outputFiles=y with sample>1, the .uml_rec summary therefore reports only the final sampled reconciliation divided by samples, instead of the mean over all sampled reconciliations, so the duplication/transfer/loss/speciation counts in the output file are wrong.


Comment on lines +375 to +376
s_tree_str = model.string_parameter.get("S_with_ranks", model.string_parameter.get("S_un", Sstring))
fout.write(f"S:\t{s_tree_str}\n")

P2 Badge Emit the ranked species tree in reconciliation outputs

The writers here fall back to S_un because construct_undated() never populates model.string_parameter['S_with_ranks'] like the C++ implementation does (src/undated.cpp:162-185). As a result, ALEml/ALEmcmc/ALEevaluate outputs write the raw input species tree instead of the branch-ID/rank annotated tree, so sampled reconciliations that refer to internal branch numbers cannot be mapped back to species-tree branches, and output_species_tree=y is similarly uninformative.


wwood and others added 2 commits March 20, 2026 17:20
When rate_multiplier:<rate>:<branch>:<value> is passed with value < -1,
the branch multiplier is now included as an extra parameter in the
Nelder-Mead optimizer alongside delta/tau/lambda, matching the C++
behavior. Optimized multipliers are printed after convergence.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Accumulate T_to_from counts across MCMC samples into a running
  aggregate instead of only writing the last sample's transfers
- Accumulate MLRec_events across samples in ALEevaluate so the Total
  row reports mean counts, not just the last sample divided by N
- Generate S_with_ranks in construct_undated() so reconciliation outputs
  include the branch-ID annotated species tree

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@wwood

wwood commented Mar 20, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 28b6755324


print("Reconciliation model initialised, starting DTL rate optimisation..")

# Optimize if not all rates are fixed
if not (delta_fixed and tau_fixed and lambda_fixed and not MLOR):

P1 Badge Include branch-multiplier requests in the optimization gate

If the caller fixes delta, tau, and lambda but asks to optimize only a per-branch multiplier via rate_multiplier=...:-2, this condition evaluates false and skips minimize() even though ml_branch_multipliers is non-empty. The later for s in ml_rm_strings path then hits an UnboundLocalError, and even without that crash the requested branch multiplier would never be optimized.
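A sketch of a gate that also checks the branch-multiplier list (all names hypothetical):

```python
def should_run_optimizer(delta_fixed, tau_fixed, lambda_fixed, mlor,
                         ml_branch_multipliers):
    # Hypothetical sketch: optimize when any global rate is free, when O_R
    # is being maximized, or when any per-branch multiplier was requested
    # for optimization.
    all_global_fixed = delta_fixed and tau_fixed and lambda_fixed and not mlor
    return (not all_global_fixed) or bool(ml_branch_multipliers)
```

With delta/tau/lambda all fixed but one branch multiplier requested, the gate now stays open instead of skipping minimize().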



# D event
uq_resum += self.PD[e] * (self.uq[gp_i][e] * self.uq[gpp_i][e] * 2) * pp + EPSILON
if r * uq_sum < uq_resum or no_T:

P1 Badge Remove the unconditional duplication shortcut in no-transfer sampling

When tau=0 or sample_undated(..., no_T=True) is used, the trailing "or no_T" clause makes the first duplication branch win unconditionally once the code reaches this point. That skips the remaining duplication choices and the later SL/DL cases, so DL-only runs produce systematically distorted sampled reconciliations and event counts in .uml_rec.


Comment on lines +382 to +384
if self.alpha > 0:
return (Bip_count + self.alpha / self.N_Gamma
* self.Bi(self.set_sizes[g_id])) / (

P2 Badge Handle unseen bipartitions before applying alpha smoothing

With alpha > 0, any call that scores an unobserved split (for example p_bip_by_set() on a clade absent from the sample) reaches self.set_sizes[g_id] even though the code just assigned g_id = -10 for that case. The result is a KeyError instead of the intended smoothed prior probability, which makes alpha-corrected scoring unusable on unseen bipartitions.
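A sketch of the smoothed scoring with the guard in place (names loosely follow the snippet; the formula's stub terms are illustrative):

```python
def p_bip_smoothed(set_sizes, g_id, bip_count, alpha, n_gamma, n_obs, bi):
    # Hypothetical sketch: an unseen clade arrives with the sentinel
    # g_id = -10 and bip_count = 0, so .get() with a default size replaces
    # the KeyError on set_sizes[g_id] with the intended smoothed prior.
    size = set_sizes.get(g_id, 1)
    return (bip_count + alpha / n_gamma * bi(size)) / (n_obs + alpha)

set_sizes = {0: 4, 1: 3}            # sizes for observed clades only
bi = lambda n: 1.0                  # stub for the binary-tree-count term
p_seen = p_bip_smoothed(set_sizes, 0, 10.0, 0.5, 2.0, 20.0, bi)
p_unseen = p_bip_smoothed(set_sizes, -10, 0.0, 0.5, 2.0, 20.0, bi)
```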


Comment on lines +805 to +806
if (g_id >= len(self.Dip_counts)
or parts not in self.Dip_counts.get(g_id, {})

P2 Badge Stop calling dict methods on the Dip_counts list

If alpha/beta smoothing makes an unsampled partition selectable in random_tree(), this membership check calls .get() on self.Dip_counts, but Dip_counts is a list. The first such draw raises AttributeError, so corrected random sampling crashes as soon as it needs to pick an unobserved split.
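A sketch of the bounds-safe membership check (names illustrative):

```python
# Dip_counts is a list (indexed by clade id) of dicts keyed by partition,
# mirroring the snippet above; names are illustrative.
Dip_counts = [{("A", "B"): 3}, {}]

def partition_unseen(g_id, parts):
    # Bounds-check the list index instead of calling dict-style .get()
    # on a list, which raises AttributeError.
    return g_id >= len(Dip_counts) or parts not in Dip_counts[g_id]
```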


wwood and others added 2 commits March 20, 2026 17:59
…ts access

- Include ml_branch_multipliers in the optimization gate so per-branch
  multiplier optimization runs even when global rates are fixed
- Remove 'or no_T' shortcut from D event sampling that biased partition
  selection when transfers are disabled
- Use .get() for set_sizes in alpha-smoothed p_bip to avoid KeyError
  on unseen bipartitions
- Fix Dip_counts access in random_tree to use index bounds check
  instead of .get() on a list

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Guard prior_origination against zero/negative values in expovariate
  and _compute_exponential_log_probability to avoid ZeroDivisionError
  and ValueError
- Clamp MH acceptance exponent at 700 to prevent OverflowError on
  large likelihood improvements (auto-accepts as intended)
- Mark optimizer-dependent tests with @pytest.mark.slow; add
  'pixi run test-fast' task that skips them (40s vs 8min)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@wwood

wwood commented Mar 20, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cd641f7ae9


import random
import sys

import numpy as np

P2 Badge Avoid importing NumPy at CLI module import time

Because ALE_python/__main__.py imports cli.main, this top-level import makes every CLI entry point fail with ModuleNotFoundError in a plain Python environment—even ALEobserve, ALEcount, or just python -m ALE_python --help, which otherwise only need the standard library. In this patch pyproject.toml does not declare NumPy/Scipy, so users outside the Pixi environment will hit this immediately.
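A sketch of deferring the heavy import into the one subcommand that needs it (function name hypothetical):

```python
import sys

def ALEml_undated_main(args):
    # Hypothetical sketch: import numpy inside the subcommand rather than
    # at module level, so ALEobserve/ALEcount and --help still run with
    # only the standard library.
    try:
        import numpy  # noqa: F401  (scipy would be deferred the same way)
    except ModuleNotFoundError:
        sys.exit("ALEml_undated requires numpy; install it, e.g. via pixi.")
    # ... optimization code that uses numpy goes here ...
```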


Comment on lines +60 to +63
if "(" in line:
tree_i += 1
if tree_i > burnin and tree_i % every == 0:
trees.append(line)

P2 Badge Handle singleton A; trees when observing ALE files

The C++ observe_ALE_from_file has a fallback for lines like A; and produces a one-leaf ALE object, because singleton gene families are valid inputs. Here we only keep lines containing '(', so a file containing a single-leaf tree falls through to ValueError("No trees found..."). That means ALEobserve cannot preprocess singleton families at all.


pixi.toml Outdated
configure = "cmake -S . -B build"
build = { cmd = "cmake --build build -j4", depends-on = ["configure"] }
python_ale = "python -m ALE_python"
test = "pytest ALE_python/tests -v"

P2 Badge Make the Pixi test task build the C++ binaries first

ALE_python/tests/conftest.py points cpp_bin at build/bin, and multiple tests spawn ALEobserve/ALEml_undated from there. The new test task runs pytest directly, so on a fresh checkout pixi run test fails with FileNotFoundError unless the user already ran pixi run build by hand. Adding a build dependency (or redirecting this task to test-fast) would make the advertised workflow work out of the box.
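A sketch of the suggested change, based on the quoted hunk (the [tasks] table header is assumed; only the test task changes):

```toml
[tasks]
configure = "cmake -S . -B build"
build = { cmd = "cmake --build build -j4", depends-on = ["configure"] }
python_ale = "python -m ALE_python"
test = { cmd = "pytest ALE_python/tests -v", depends-on = ["build"] }
```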


- Move numpy import from module-level into ALEml_undated so CLI
  commands that don't need it (ALEobserve, ALEcount, etc.) work
  without numpy/scipy installed
- Accept singleton trees (e.g. 'A;') in observe_ALE_from_file by
  matching lines with ';' not just '('
- Make 'pixi run test' depend on build so C++ binaries are available
  on fresh checkouts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@wwood

wwood commented Mar 20, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a850dacbfe


Comment on lines +62 to +65
if "(" in line or ";" in line:
tree_i += 1
if tree_i > burnin and tree_i % every == 0:
trees.append(line)

P2 Badge Handle name; single-leaf inputs explicitly

observe_ALE_from_file() now treats any line containing ; as a normal Newick tree and appends it unchanged. The C++ implementation in src/ALE_util.cpp:32-43 special-cases that form, strips the delimiter, and builds a one-leaf ALE directly. With the Python port, a sample file containing A; produces a leaf named A; instead of A, so single-tip families no longer round-trip correctly and later species-name matching can fail.
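A sketch of reading that keeps full Newick lines as-is but cleans singleton lines (function name hypothetical; the C++ version additionally builds the one-leaf ALE directly):

```python
def read_gene_trees(lines, burnin=0, every=1):
    # Hypothetical sketch: accept both '(...);' trees and singleton
    # 'name;' lines, stripping the trailing ';' from singletons so the
    # leaf name round-trips as 'A', not 'A;'.
    trees, tree_i = [], 0
    for raw in lines:
        line = raw.strip()
        if "(" in line or line.endswith(";"):
            tree_i += 1
            if tree_i > burnin and tree_i % every == 0:
                if "(" not in line:
                    line = line.rstrip(";").strip()
                trees.append(line)
    return trees

trees = read_gene_trees(["((A,B),C);", "A;"])
```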


Comment on lines +895 to +897
tree_i += 1
if tree_i > burnin and tree_i % every == 0:
trees.append(line)

P2 Badge Preserve ALEadd's original every= subsampling order

ALEadd increments tree_i before checking every, which changes which trees are merged whenever every > 1. In src/ALEadd.cpp:56-61 the filter is evaluated before the increment, so every=2 burnin=0 keeps trees 1,3,5,...; this port keeps 2,4,6,... instead. That silently changes the clade counts added to an existing .ale, so downstream posterior estimates diverge for any workflow that relies on subsampling.
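A sketch of the C++ ordering (function name hypothetical): evaluate the filter on the current index before incrementing.

```python
def subsample_trees(lines, burnin=0, every=1):
    # Hypothetical sketch matching the C++ order: check the filter on the
    # current index *before* incrementing, so every=2, burnin=0 keeps
    # trees 1, 3, 5, ... rather than 2, 4, 6, ...
    kept, tree_i = [], 0
    for line in lines:
        if tree_i >= burnin and tree_i % every == 0:
            kept.append(line)
        tree_i += 1
    return kept

kept = subsample_trees(["t1", "t2", "t3", "t4", "t5"], every=2)
```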


Comment on lines +322 to +323
bounds = [(1e-10, 10.0)] * (len(x0) - len(ml_branch_multipliers))
bounds += [(1e-7, 10000.0)] * len(ml_branch_multipliers)

P2 Badge Restore the original optimizer bounds for rates and O_R

These bounds cap every optimized scalar at 10.0, but the C++ optimizer allows D/T/L rates up to 100 (src/ALEml_undated.cpp:34-37) and O_R up to 1000 (src/ALEml_undated.cpp:64-67). Any family whose likelihood keeps improving above 10 will now stick on this artificial boundary and report incorrect ML rates and log-likelihoods; MLOR runs are affected most because their search space is reduced by two orders of magnitude.
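A sketch of bounds matching the C++ ranges the review cites (function and parameter names hypothetical):

```python
def build_bounds(n_rates, mlor, n_branch_multipliers):
    # Hypothetical sketch of the C++ optimizer's search space: D/T/L rates
    # up to 100, O_R up to 1000, per-branch multipliers in (1e-7, 1e4).
    bounds = [(1e-10, 100.0)] * n_rates
    if mlor:
        bounds.append((1e-10, 1000.0))       # O_R
    bounds += [(1e-7, 10000.0)] * n_branch_multipliers
    return bounds

bounds = build_bounds(n_rates=3, mlor=True, n_branch_multipliers=1)
```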


…ounds

- Strip ';' from singleton tree lines so leaf name is clean (e.g. 'A'
  not 'A;'), matching C++ ALE_util.cpp behavior
- Fix ALEadd to check burnin/every before incrementing tree_i,
  matching C++ ALEadd.cpp subsampling order (>=burnin, then increment)
- Set optimizer bounds to match C++: rates (1e-10, 100), O_R (1e-10,
  1000), branch multipliers (1e-7, 10000); use 1e-6 lower bound for
  rates when no_T is set

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@wwood

wwood commented Mar 20, 2026

@codex review

@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

wwood and others added 7 commits March 20, 2026 19:53
- Validate singleton tree lines by tokenizing on C++ delimiters
  (comma, colon, space) and taking only the first field, rejecting
  lines like 'not a tree;' that aren't valid single-leaf Newick
- Add build dependency to test-fast so C++ binaries are available
  on fresh checkouts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ately

Match C++ behavior where a singleton line (e.g. 'A;') immediately
builds a one-leaf ApproxPosterior via a minimal .ale state file and
returns, rather than mixing singletons into the normal tree list.
This prevents silent data loss with multiple singletons or KeyErrors
when mixing singletons with normal trees.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…arser

Add argparse with subparsers for all 8 commands, providing --help at
both top-level and per-command. Each subparser documents positional
args, key=value options, defaults, and usage examples. The original
bespoke key=value parsing is preserved for actual execution — argparse
is only used for help/version display and subcommand dispatch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ivision

Add _normalize_argv() preprocessor that converts --key value long options
to key=value form before dispatching to the bespoke parsers. Update all
subcommand epilogs to show both styles. Guard against ZeroDivisionError
when sample=0 in event count output across all commands.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace all bespoke one-line usage strings with the full argparse help
for each subcommand, so users see descriptions, options, and examples.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
