Extensions for DataFrames that make statistical and analysis operations far more comfortable and convenient. Mindhunter turns your DataFrame into a StatFrame, composing its new features over it and supercharging its capabilities without sacrificing compatibility.
Example:

```python
import pandas as pd

from mindhunter import StatFrame
from mindhunter.visualization import StatPlotter

dataset = pd.read_csv('Fish.csv')  # load your data
data = StatFrame(dataset)  # create a StatFrame
data.clean_df()  # clean your data

plottable = StatPlotter(data)  # turn your StatFrame into a StatPlotter
plottable.plot_normal_distr(data_to_test=data.df['width'])  # create a set of normal distribution validation graphs
```
You need uv to build the module.
- Clone the repository.
- Make the build script executable and run it:

```shell
chmod +x ./build.sh
./build.sh
```

The script will clear the cache, then build, install and test the module.
Mindhunter implements a fairly rudimentary testing setup. It looks inside `tests` for fixtures and tests in files starting with `test_`, and uses pytest and faker to create a randomised dataset to test against.
So far, coverage extends to making sure a StatFrame can be created and its data can be obtained. More tests are being developed and are coming soon.
- Your new `StatFrame` can now be used with Mindhunter's new Analyzers, Plotters and Toolkits:
  - `DistributionAnalyzer`: adds normal distribution utilities directly on top of the `DataFrame`.
  - `HypothesisAnalyzer`: adds hypothesis testing, binomial and related functionality.
  - `AnalyticalTools`: provides access to `scipy.stats` methods to generate and convert several values over a given `StatFrame`.
  - `StatPlotter`: adds ready-to-go plotting capabilities for many common values, like z-scores, coefficient of variation, normal distribution and others, using `seaborn` and `matplotlib.pyplot`.
  - `StatVisualizer`: provides easy access to common graphs and visualizations, returning ready-to-go figures just by passing lists or a `StatFrame`.
- `StatFrame` also holds a cache of the most commonly used values and variables, providing easy access to the values of not just a column, but of a whole set. It caches:
  - Central Tendency:
    - mean
    - median
    - mode
  - Spread/Variability:
    - std (standard deviation)
    - variance
    - range
    - iqr (interquartile range)
    - mad (median absolute deviation)
  - Distribution Shape:
    - skewness
    - kurtosis
  - Data Quality:
    - count
    - missing_count
    - missing_pct
  - Extreme Values:
    - min
    - max
    - q1
    - q3
  - Key Ratios:
    - cv (coefficient of variation)
    - sem (standard error of the mean)
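All of these statistics can be derived with pandas and `scipy.stats`. The sketch below shows how such a per-column cache might be computed; the function name and dictionary keys are illustrative, not Mindhunter's actual internals:

```python
import pandas as pd
from scipy import stats


def summarize(col: pd.Series) -> dict:
    """Compute the kind of per-column statistics the cache holds."""
    clean = col.dropna()
    q1, q3 = clean.quantile(0.25), clean.quantile(0.75)
    return {
        # Central Tendency
        "mean": clean.mean(),
        "median": clean.median(),
        "mode": clean.mode().iloc[0],
        # Spread/Variability
        "std": clean.std(),
        "variance": clean.var(),
        "range": clean.max() - clean.min(),
        "iqr": q3 - q1,
        "mad": (clean - clean.median()).abs().median(),
        # Distribution Shape
        "skewness": clean.skew(),
        "kurtosis": clean.kurt(),
        # Data Quality
        "count": int(clean.count()),
        "missing_count": int(col.isna().sum()),
        "missing_pct": col.isna().mean() * 100,
        # Extreme Values
        "min": clean.min(),
        "max": clean.max(),
        "q1": q1,
        "q3": q3,
        # Key Ratios
        "cv": clean.std() / clean.mean(),
        "sem": stats.sem(clean),
    }
```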
- Mindhunter can also automatically clean column names and drop NaNs and duplicates from datasets. It also provides methods to locate, analyze and remove zero values from your dataset.
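To illustrate what zero-value handling like that involves, here is a plain-pandas sketch (not Mindhunter's actual methods):

```python
import pandas as pd

df = pd.DataFrame({"width": [3.5, 0.0, 4.1], "weight": [242.0, 290.0, 0.0]})

# Locate rows containing a zero in any numeric column
zero_mask = (df.select_dtypes(include="number") == 0).any(axis=1)
zero_rows = df[zero_mask]

# Remove those rows from the dataset
cleaned = df[~zero_mask]
```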
I've been studying data analysis and, over the months, I've collected a bunch of little methods and scripts to do my homework. It got to the point where there was an 800+ line cell in each Jupyter Notebook. It became a bit too much.
In short: it uses basic OOP composition, against all advice, passing the StatFrame as an argument. That class holds the DataFrame itself, and all operations go through the StatFrame directly to the DataFrame. Operations act directly on the source, and calling update() re-triggers the caching process.
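That composition pattern can be sketched as follows. This is a simplified illustration of the design, not Mindhunter's actual source; the cached fields shown are a small subset:

```python
import pandas as pd


class StatFrame:
    """Holds the DataFrame itself; all operations act directly on it."""

    def __init__(self, df: pd.DataFrame):
        self.df = df       # the wrapped DataFrame (the source)
        self.cache = {}
        self.update()      # populate the cache on creation

    def update(self):
        """Re-trigger the caching process over the current state of the DataFrame."""
        self.cache = {
            col: {"mean": self.df[col].mean(), "std": self.df[col].std()}
            for col in self.df.select_dtypes(include="number").columns
        }


class StatPlotter:
    """Composition: the StatFrame is passed as an argument, not inherited from."""

    def __init__(self, statframe: StatFrame):
        self.sf = statframe  # all operations go through the StatFrame to the df
```

Because operations mutate the wrapped DataFrame in place, `update()` is what brings the cached values back in sync afterwards.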
This library will be updated fairly regularly as I collect and tidy up more and more little tools and take more advantage of the internal mechanisms. I am much more of a developer than a data analyst, so I need the community's help to know what's needed to keep improving the library. If you have any issue, suggestion or comment, feel free to open a new issue!