Group 42: Nathan Jones (2762057), Ilse Feenstra (2608345) & Yauheniya Makarevich (2772189)
This is the source code accompanying the report entitled "Assessing the Risk of a Stroke for an Unhealthy Lifestyle Using Bayesian Networks".
This project was implemented using Python 3. The list of required packages can be found in requirements.txt. To install them, run the following command from the root directory of this project (preferably in a venv).
pip install -r requirements.txtThe experiment described in the accompanying paper can be reproduced by running the following commands in order (the first step can be omitted if you wish to use the existing test set):
python gen_test_set.pypython experiment.pypython plot_results.pyFor the last two commands, some additional command-line arguments can be used. To see what is available run
--helpafter one of the last two commands.
The unit tests written for the BNReasoner class can be exectued by running:
pytest test.pyThe inferences made on the modelled use case of strokes can be executed by running:
python experiment_stroke.pyBaysNet.py
This class was extended with several other methods for querying the structure of the network (leaf nodes, reachable nodes etc.) as well as methods for randomly sampling network variables (randQe).
BNReasoner.py
Contains all the required algorithms for reasoning over BNs.
test.py
Unit tests for the functionality implemented in theBNReasonerclass.
experiment.py
The driver code for running tests on the test set as well as generating and saving results.
experiment_stroke.py
The inferences made on our modelled use case.
gen_test_set.py
Generates the test set, consisting of random BNs according to the settings provided.
plot_results.py
Plots the results generated byexperiment.py
The BIFXML file format is meant to provide an easy means of exchanging Bayesian networks. It works with standard XML
tags. The detailed description of the format can be found at
http://www.cs.cmu.edu/afs/cs/user/fgcozman/www/Research/InterchangeFormat/. Please read this carefully.
Note that for our purposes it will be enough to only have nodes of type "nature", and we will not need the <Property>
tag.
Be aware of the order of the values in the probability table tag. They should be ordered in a way that corresponds to boolean counting where the "For" variable is the least significant bit followed by the "Given" variables from bottom to top. Here an example of what this is supposed to mean. Let's assume we have the following CPT table in the BIFXML file:
<DEFINITION>
<FOR>dog-out</FOR>
<GIVEN>bowel-problem</GIVEN>
<GIVEN>family-out</GIVEN>
<TABLE>0.99 0.01 0.97 0.03 0.9 0.1 0.3 0.7 </TABLE>
</DEFINITION>
Then the order of the variables in the table (from left to right) is: "bowel-problem", "family-out", "dog-out". Filling in the values, this leads to the following table:
| bowel-problem | family-out | dog-out | p |
|---|---|---|---|
| False | False | False | 0.99 |
| False | False | True | 0.01 |
| False | True | False | 0.97 |
| False | True | True | 0.03 |
| True | False | False | 0.9 |
| True | False | True | 0.1 |
| True | True | False | 0.3 |
| True | True | True | 0.7 |
Among the files of this project you can find the BayesNet class. We provided you with this class, so you don't need to worry much about the data structure in which the BN can be represented. This class provides you with (hopefully) useful functions for BNs such as loading them from a file and retrieving the CPT of a variable. We highly recommend using it. Of course, you are also free to implement your own methods and change the existing ones if they don't fit your purpose. For this class to work you will need to install (either with pip or anaconda) the following packages: networkx, matplotlib, pgmpy, pandas (see also requirements.txt).
Internally, the graphical structure of the BN is represented as a DiGraph object from the networkx package. The CPTs are modelled with DataFrames form the Pandas package. Each CPT is a DataFrame which corresponds to the form as is shown in Table 1.
We want to point out some methods which we deem especially useful here, but all methods come with their documentation in the comments fo the methods.
-
get_compatible_instantiations_table(instantiation, cpt): This method takes an instantiation as a pandas series and a CPT as a pandas DataFrame. It checks which rows of the provided CPT are compatible with the instantiation and returns only those rows. -
reduce_factor(instantiation, cpt): This method takes an instantiation as a pandas series and a CPT as a pandas DataFrame. It returns a new CPT in which all the rows that are incompatible with the instantiation are set to 0. -
get_interaction_graph(): Returns a new, undirected Graph object which corresponds to the interaction graph of the current BN. -
draw_structure(): Plots the graph structure of the current BN.
There are also a few other methods which might turn out useful during the implementation of this project. Note that they are completely optional to use, and it might well be the case that your implementation will work well even without them. We also provide them as some are used int the BayesNet class.
-
dog_out_CPT['p']returns the 'p' column of the dog_out_CPT DataFrame as a pandas Series. (0.99 0.01 0.97 0.03 0.9 0.1 0.3 0.7) -
dog_out_CPT['p'].max()returns the maximum of the probability values. (0.99) -
dog_out_CPT['family-out'] == dog_out_CPT['bowel-problem']creates a pandas Series which is true for every index in which "bowel-problem" and "family-out" is the same in the dog_out_CPT. (True, True, False, False, False, False, True, True). The arguments in the brackets can also be lists of column names in which case multiple columns are compared. -
dog_out_CPT.iloc[2]returns the third row of the DataFrame. (False, True, False, 0.97) -
dog_out_CPT.loc[dog_out_CPT['family-out'] == dog_out_CPT['bowel-problem']]returns all rows of the CPT in which 'bowel-problem' == 'family-out'. (rows 0, 1, 6 and 7) -
dog_out_CPT.loc[dog_out_CPT['family-out'] == dog_out_CPT['bowel-problem'], 'p'] = 0.0sets the 'p' value of the above mentioned rows to 0.0. This usage of loc[row(s), column(s)] corresponds to direct access to a cell or subtable. -
pd.Series({'Winter?': True, 'Sprinkler?': False})creates a pandas Series in which 'Winter?' is set to True and 'Sprinkler?' is set to False. This is a useful format for passing evidence to some methods of this assignment -
dog_out_CPT.iterrows()provides an iterator through all the rows of a DataFrame. This is useful for using in loops. Beware that the returned value is always a tuple of (row_number, row_content).
It is likely, that you will not have to use this package at all. One possible methods that could be useful is
networkx.neighbors(G, var) which provides a list of all neighbors of the variable 'var' in the graph 'G'.