Releases: NaegleLab/KSTAR
v1.1.0: Proteome updates, reducing user burden, and streamlining pregeneration
Primary Goals of this Release
- Update the phosphoproteome and network files to match the most up-to-date SwissProt proteome
- Tweak how pregenerated random experiments are stored and handled during activity calculation, including restructuring the network directory to allow for multiple networks with different names
- Lower the user burden by automatically loading required data and having master functions that don't require multiple lines of code
- Provide better tools for determining thresholds to use prior to activity calculation
Full Changelog: v1.0.4...v1.1.0
Summary of Major Changes
- Phosphoproteome Update
- We updated the previous reference files to have position and peptide information matching the current UniProt (as of November 2025). These files have been uploaded to FigShare and are now the default files loaded with KSTAR.
- We also updated the underlying information in the KSTAR networks to match the current proteome (same weighted network, but updated site positions).
- To make sure the correct reference files are used, we added a unique reference hash stored in a JSON file in the RESOURCE_FILES directory, which must match the hash of the network used. This ensures that the network is built on the same reference phosphoproteome.
- Pregeneration Updates
- Default pregenerated random experiments now exist in the same directory as the corresponding KSTAR networks under the folder 'RANDOM_ACTIVITIES', which will always be used
- Added a global configuration parameter indicating where custom random activities can be saved (as opposed to the default activities shipped with the KSTAR resource files)
- Rather than using the 'directories.txt' file previously used to indicate where network directories are located, we now use a .json file that the user can update using the update_configuration() function. This includes changing how pregenerated experiments are handled.
- Fixed issues with saving and using custom pregenerated random activities, which in v1.0 weren't being recognized
- Lowering the user burden
- Rather than having the user load the networks and create a log file, we now perform these actions automatically when initializing the ExpMapper and KinaseActivity classes. All the user needs to provide is an output directory and the name of the run.
- Provided a master dotplot function which automatically stitches together the different components of the dotplot (clusters, context, evidence size, and the actual dotplot). This does not require the user to create their own subplots. Users can also directly apply this function from the KinaseActivity class.
- The three key master functions (enrichment_analysis(), randomized_analysis(), and MannWhitney_analysis()) are combined into a single master function, run_kstar_analysis()
- Thresholding decisions
- The KinaseActivity.test_threshold() function has been updated to also calculate the similarity of evidence between columns as well as how many data columns are lost at the provided threshold
- To visualize the impact of different thresholds, we've added a new function called KinaseActivity.test_threshold_range(), which produces plots of the evidence size and similarity across multiple different thresholds.
- For ease of selection, we have also added a KinaseActivity.recommend_threshold() function to provide our suggestion about the optimal threshold that provides a good balance between the total number of sites used as evidence and minimizing overlap between sample columns
- Configuration changes
- Added function to see total memory usage by KSTAR's resource files (config.get_package_memory())
- Added function to see the available networks in the default network directory (config.get_available_networks())
- Created a .json file to store desired configuration parameters, including the network directory location, whether to use pregenerated experiments and save random experiments by default, and where to save custom random activities. This file replaces the old 'directories.txt' file.
- Other changes
- Added a new ExperimentMapper.save_experiment() function to the mapper class, which will save the mapped experiment, as well as additional information about the success of mapping
- While still in the testing phase, we have added a new module called dataset_processing() intended to help users process their datasets for use with KSTAR, mainly by formatting peptide sequences and converting between IDs (such as converting gene names to UniProt IDs).
- We added a new class to the plot module, called KSTAR_PDF, which generates a three-page PDF summarizing the results from the KSTAR run. This is intended to be a first pass that users can look at to get a quick idea of what their data looks like.
- FDR calculations are now based on 150 comparisons, rather than 100. The fundamental way this is calculated remains unchanged.
- Removed use of pickles
- Removed dependency on biopython, as previously this was only used to read in fasta files.
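The thresholding trade-off described under "Thresholding decisions" can be pictured with a small sketch: binarize each sample column at a threshold, then report the evidence size per column, the mean pairwise similarity between columns, and any columns left with no evidence. This is not KSTAR's implementation. The function names, the toy data, and the choice of Jaccard similarity are all assumptions made for illustration.

```python
from itertools import combinations

# Toy data (hypothetical): site -> quantification for each sample column.
data = {
    "sample_A": {"s1": 2.5, "s2": 1.2, "s3": 0.4, "s4": 3.1},
    "sample_B": {"s1": 2.1, "s2": 0.3, "s3": 1.8, "s4": 2.9},
}

def evidence_at(column, threshold):
    """Return the set of sites whose value meets the threshold."""
    return {site for site, value in column.items() if value >= threshold}

def jaccard(a, b):
    """Jaccard similarity of two sets; defined here as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def summarize_threshold(data, threshold):
    """Evidence size per column, mean pairwise similarity, and lost columns."""
    evidence = {name: evidence_at(col, threshold) for name, col in data.items()}
    sizes = {name: len(sites) for name, sites in evidence.items()}
    pairs = list(combinations(evidence.values(), 2))
    mean_sim = sum(jaccard(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    lost = [name for name, n in sizes.items() if n == 0]
    return sizes, mean_sim, lost

# Scan a few thresholds to see how evidence size and overlap change.
for threshold in (1.0, 2.0, 3.0):
    print(threshold, summarize_threshold(data, threshold))
```

A threshold that is too strict leaves some columns with no evidence, while one that is too lenient tends to make the columns share most of their sites, which is the balance a threshold recommendation has to strike.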
For more changes, see the full changelog
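As a minimal sketch of the reference-hash idea in this release, the example below hashes a reference file, records that hash in two JSON files, and refuses to proceed if they disagree. The file names, the JSON key, the SHA-256 choice, and both functions are hypothetical and chosen only for illustration; they are not KSTAR's actual file layout or API.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def file_hash(path):
    """SHA-256 hex digest of a file's bytes (the hash choice is an assumption)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def check_reference(resource_json, network_json):
    """Raise if the network records a different reference hash than the resources."""
    ref_hash = json.loads(Path(resource_json).read_text())["reference_hash"]
    net_hash = json.loads(Path(network_json).read_text())["reference_hash"]
    if ref_hash != net_hash:
        raise ValueError("Network was built on a different reference phosphoproteome")
    return True

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    reference = tmp / "reference.csv"  # hypothetical reference file
    reference.write_text("PROTEIN_X,Y100\n")
    digest = file_hash(reference)
    # Hypothetical JSON files recording the hash on each side.
    (tmp / "resource_hash.json").write_text(json.dumps({"reference_hash": digest}))
    (tmp / "network_info.json").write_text(json.dumps({"reference_hash": digest}))
    matches = check_reference(tmp / "resource_hash.json", tmp / "network_info.json")
    print(matches)  # True
```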
v1.0.4: KSTAR update to allow use of pre-generated experiments
- Updated KSTAR to allow use of pre-generated activity lists while maintaining existing functionality.
- Added variables to the config module: USE_PREGEN_DATA, SAVE_NEW_PRECOMPUTE, PREGENERATED_EXPERIMENTS_PATH, NETWORK_HASH_Y, NETWORK_HASH_ST, DIRECTORY_FOR_SAVE_PRECOMPUTE
- Added install_network_files() function to config module. This automatically installs the network files when the user runs config.install_network_files() from the tutorial.
- Added a hash id to the prune module. This gives each network its own unique hash id that is stored in the run_information.txt.
- Added instance variables to the calculate module: min_dataset_size_for_pregenerated, max_diff_from_pregenerated, random_activities_list, compendia_distribution, data_columns_from_scratch, use_pregen_data, save_new_precompute, pregenerated_experiments_path, directory_for_save_precompute, network_hash
- Added new functions to the calculate module:
- calculate_random_enrichment: Generates random experiments matching real data's compendia distribution, calculates kinase activities for each using hypergeometric tests, and aggregates results into a DataFrame.
- calculate_random_activities: Controls random experiment pipeline - decides whether to use pre-generated data or create new experiments, then processes datasets individually or in batch.
- calculate_random_activity_singleExperiment2: Handles a single random experiment in multiprocessing mode - builds experiment matching real data's compendia distribution and calculates its activity.
- add_pregenerated_to_random_enrichment: Combines pre-generated activities with newly calculated ones, ensures proper ordering, and updates the master random_enrichment DataFrame.
- load_pregenerated_random_activities: Finds and loads pre-computed activity files based on dataset characteristics and renames columns to match current experiment.
- save_new_precomputed_random_enrichment: Saves random activity results in an organized directory structure for future reuse.
- network_check_for_pregeneration: Verifies if pre-generated data exists for the current network by checking hash directories and metadata.
- check_file_sizes_for_pregenerated: Locates pre-generated files matching current dataset characteristics and returns their sizes.
- get_compendia_distribution: Calculates percentage of sites in each compendia class (0-2) per dataset.
- get_run_information_content: Reads metadata from RUN_INFORMATION.txt in the appropriate network directory.
- parse_network_information: Extracts structured configuration data from a RUN_INFORMATION.txt file.
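The hypergeometric test mentioned for calculate_random_enrichment can be illustrated with a short, self-contained sketch of a one-sided enrichment p-value. This is a generic version of the test, not KSTAR's code: the function name, argument names, and toy numbers are assumptions.

```python
from math import comb

def hypergeom_enrichment_pvalue(population, successes, draws, observed):
    """P(X >= observed) when drawing `draws` items without replacement
    from `population` items, of which `successes` are hits."""
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / total

# Toy numbers (hypothetical): 1000 sites in the network, 50 linked to a kinase,
# 100 sites observed in the sample, 12 of which are linked to that kinase.
p = hypergeom_enrichment_pvalue(population=1000, successes=50, draws=100, observed=12)
print(f"{p:.3g}")
```

Running the same calculation on random experiments that match the real data's compendia distribution gives a background set of p-values against which the real experiment can be compared.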
v0.5.3: Bug fixes and minor updates for pandas v2
- Fix aggregation so that it does not throw error from non-numeric columns
- Throw error if binarizing data does not output any evidence
- Fixed issue where evidence columns were incorrectly removed if no quantification was greater than 1
- Various updates for pandas v2
- Minor fixes to plotting code
- Remove setuptools as requirement, as it's no longer used
v0.5.0 Addition of new features for post hoc analysis
Updates/changes:
- Renamed modules whose names no longer reflected their use: normalize -> random_experiments, validate -> analysis
- Completely removed normalization functions from the first iterations of KSTAR that are no longer in use
- Added catch to the pruning procedure to ensure that the code is not stopped if a kinase does not have any remaining edges, and instead keeps the kinase with fewer edges and records the error in the log.
New features:
- New functions in the pruning module intended to guide users to the best parameter values for their purposes, and to check whether their parameter values are actually feasible.
- In addition to binarizing experiments by a threshold, you can now instead provide the desired number of phosphorylation sites to use for each sample and KSTAR will grab that number of sites with the greatest abundance (or least if greater = False)
- New function in KinaseActivity class, called test_threshold, intended to make it easier to check how a threshold value impacts the number of sites used across all samples
- Can add the number of phosphorylation sites used for each sample to a dotplot using evidence_size() function in DotPlot class
- Added new submodule in analysis module, called coverage, which is for exploring the coverage (number of sites with connections in network) of the phosphoproteome and phosphoproteomic experiments by KSTAR networks (or other kinase-substrate networks)
- Added new submodule in analysis module, called interactions, which is intended to contain functions for determining what active kinases are interacting with in the sample. Currently, it contains two functions for outputting the phosphorylation sites that contributed most to a kinase's activity prediction, based on the number of different networks in which they are predicted to interact.
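The option to select a fixed number of sites per sample, rather than binarizing by threshold, amounts to ranking sites by abundance and keeping the top (or bottom) n. A minimal sketch follows, assuming a hypothetical helper name and toy data; it is not the KSTAR function itself.

```python
def top_n_sites(column, n, greater=True):
    """Return the n sites with the greatest (or least, if greater=False) abundance.
    Sites with missing values (None) are ignored."""
    measured = {site: value for site, value in column.items() if value is not None}
    ranked = sorted(measured, key=measured.get, reverse=greater)
    return set(ranked[:n])

# Toy sample (hypothetical): site -> abundance, with one missing value.
sample = {"s1": 2.5, "s2": None, "s3": 0.4, "s4": 3.1, "s5": 1.7}
print(sorted(top_n_sites(sample, 2)))                 # ['s1', 's4']
print(sorted(top_n_sites(sample, 2, greater=False)))  # ['s3', 's5']
```

Fixing the number of sites gives every sample the same evidence size, which a single abundance threshold does not guarantee.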
v0.4.2 Bug fixes and improving use of command line for pruning
Updated previous release to fix bugs and expand the number of parameters that can be inputted into the pruning.py script via the command line
v0.4.0 Pruning Generalization and Reducing Memory Burden
In this release, two major updates were made:
- Redundant steps were removed during the random experiment generation and activity calculation steps to reduce memory burden
- Additional parameters were added to the pruning class to allow the user to input different site accession and number columns (if different from those used in NetworKIN). The goal is to make it usable for any kinase-substrate network.
v0.3.2 Streamlining Fixes
Small fixes to errors in pruning.py and other fixes to the previous release. Functionally identical to v0.3.1.
v0.3.1 Streamlining the Pipeline
The primary change of this release was to remove the normalization pipeline, which generated normalized p-values based on the random experiments, and instead focus on Mann Whitney generated p-values (as this works better). Other changes include:
- Added PROCESSES parameter to the pruning functions, as was done with activity calculation
- Updated plotting functions to fix various visualization errors
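For readers unfamiliar with the Mann-Whitney test, the sketch below computes the U statistic and a one-sided p-value using the normal approximation. These notes do not state exactly which values KSTAR compares or which variant of the test it uses, so the function names, the direction of the test, the toy values, and the omission of tie and continuity corrections are all assumptions. In practice a library routine such as scipy.stats.mannwhitneyu would normally be used.

```python
from math import erfc, sqrt

def mann_whitney_u(x, y):
    """U statistic for x: number of (xi, yj) pairs with xi < yj, ties counting 0.5."""
    return sum(1.0 if xi < yj else 0.5 if xi == yj else 0.0 for xi in x for yj in y)

def one_sided_pvalue(x, y):
    """Normal approximation (no tie or continuity correction) to P(U >= observed)
    under the null hypothesis that x and y come from the same distribution."""
    n, m = len(x), len(y)
    u = mann_whitney_u(x, y)
    mean = n * m / 2
    sd = sqrt(n * m * (n + m + 1) / 12)
    z = (u - mean) / sd
    return 0.5 * erfc(z / sqrt(2))  # upper-tail probability of a standard normal

# Toy values (hypothetical): p-values from a real experiment vs. random experiments.
real = [0.001, 0.004, 0.010, 0.020, 0.030]
random_background = [0.20, 0.35, 0.50, 0.60, 0.80]
print(f"{one_sided_pvalue(real, random_background):.3g}")
```

A small result here indicates that the first group's values tend to be smaller than the second group's.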
v0.2.1 Config Update 2
Made the following adjustments to KSTAR configuration:
- create_network_pickles() will only generate new pickles if it does not find them in the network directory
- config.PROCESSES was removed. Instead of setting a config variable, the number of processes to run in parallel is set through function parameters.
v0.2.0 Configuration Update
Configuration files were updated so that the source code does not require editing and all setup can easily be performed within Python