Vaccaro by michellevaccaro · Pull Request #1 · jdaries/de_id

michellevaccaro · 2017-04-07T19:36:26Z

No description provided.

Needed to install: pycountry, pp, pygeoip. Changed the undeclared and unused variable c to cursor in a number of functions in De-identification.py; assumed to be a typo. Removed a bad space at line 515. Added the text file waldoLog.txt, which will be a development log for Jim Waldo.

…e database. A number of new programs to start working on finding the sets of classes that students are taking, and dropping those that will have an equivalence set of fewer than 5. This is an interim check in; there is much left to do but this is a start.

…from the HarvardX tools codebase. The script buildEquivClasses.py will go through a .csv file and, based on a set of quasi-identifiers hard-coded into the script (this has to change), will determine the number of entities in the equivalence classes determined by that set of quasi-identifiers. This leverages some of the code used in testKAnon.py, which builds a dictionary keyed by sets of quasi-identifiers with value the number of entities that have that set of identifiers. Note that this is a first check-in, working only with .csv files. This should be expanded to allow building both the dictionary of concatenated quasi-identifiers and the dictionary of equivalence classes either with .csv files or with the sqlite database.
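The counting described above can be sketched as follows. This is a minimal illustration and not the code in buildEquivClasses.py or testKAnon.py; the column names, the sample rows, and the helper names `count_equivalence_classes` and `class_size_distribution` are assumptions made for the example.

```python
import csv
import io
from collections import Counter

# Hypothetical quasi-identifier columns; the real script hard-codes its own set.
QUASI_IDENTIFIERS = ["country", "yob", "loe"]


def count_equivalence_classes(rows, qi_fields):
    """Map each tuple of quasi-identifier values to the number of rows sharing it."""
    return Counter(tuple(row[field] for field in qi_fields) for row in rows)


def class_size_distribution(class_counts):
    """Map each equivalence-class size to the number of classes of that size."""
    return Counter(class_counts.values())


# Made-up sample data standing in for a real .csv file.
sample_csv = io.StringIO(
    "user_id,country,yob,loe\n"
    "1,US,1990,b\n"
    "2,US,1990,b\n"
    "3,FR,1985,m\n"
)
rows = list(csv.DictReader(sample_csv))
classes = count_equivalence_classes(rows, QUASI_IDENTIFIERS)
sizes = class_size_distribution(classes)
```

Any class whose count is below k marks rows that would need generalization or suppression before the data set could be called k-anonymous.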

…nd therefore should probably be renamed) to delete records (rows) to ensure that a student can't be identified by the set of courses that he or she has taken. The current version deletes courses, preferring to get rid of those with the least activity, although the routine that will randomly pick a course is also available.

…om countries to a geographic area that has at least 50 records in that area. If the country has at least 50 records, the mapping is the identity mapping, but if there are fewer than 50, the mapping will be to a region that will contain at least 50. Note that 50 is an arbitrary number; it can be changed by changing one value in the code.
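A minimal sketch of the mapping described above, under stated assumptions: the `REGION_OF` lookup table and the function name `build_country_mapping` are hypothetical, and unlike the real program this sketch does not verify that the chosen region itself reaches the threshold.

```python
from collections import Counter

MIN_RECORDS = 50  # arbitrary threshold, as the commit message notes

# Hypothetical country-to-region lookup; the real table is not shown here.
REGION_OF = {
    "US": "Northern America",
    "CA": "Northern America",
    "FR": "Western Europe",
}


def build_country_mapping(countries, region_of, min_records=MIN_RECORDS):
    """Return {country: generalized value} for every country seen in the records.

    Countries with at least min_records records map to themselves (the
    identity mapping); rarer countries map to their region.
    """
    counts = Counter(countries)
    mapping = {}
    for country, n in counts.items():
        if n >= min_records:
            mapping[country] = country
        else:
            mapping[country] = region_of.get(country, "Other")
    return mapping


# Made-up record counts for illustration.
records = ["US"] * 60 + ["CA"] * 10 + ["FR"] * 5
mapping = build_country_mapping(records, REGION_OF)
```

Keeping the threshold in a single constant matches the note that the value can be changed in one place.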

…t change the database, but rather to build a table of records to suppress. Note that the current runtime difference between running this while querying the database (to delete the course in which the student did the least) and running it without querying the database (by simply picking a course at random) is immense; using the database requires 1 hour 20 minutes to work on the full data set, while not working with the database runs in 11 seconds. Sigh...

…he chain is far faster than the original mechanisms, and allows building a de-identified set where suppression of records is minimized and generalization is preferred. We now need to know the size of the bins that will allow generalization to be performed to build a k-anonymous data set; the number appears to be surprisingly large. Note that we are currently not generalizing the level of education; this may need to be changed to allow reasonable de-identification bins in the other values.

…ome up with a full suppression list (combining the records that need to be suppressed to keep from allowing identification by the set of classes taken with the set that need to be suppressed because the quasi-identifiers are not k-anonymous). The current implementation is slow; some work needs to be done to see if this can be sped up by not creating as many lists.

…in sizes as arguments. If no numbers are passed in, or if they are not integers, the size of the bin will default to the size set in the code (and, if an invalid entry was passed in, an error message will be printed out). Note that we still don't deal with the case in which the minimum bin size is too large for the number of entries that we have; in such a case the bin should be everything. Updated the pydoc for the functions in numeric_generalization to reflect the new code.
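The fallback behaviour described above might look roughly like this. It is a sketch, not the code in numeric_generalization; the default value and the name `parse_bin_size` are assumptions.

```python
import sys

DEFAULT_BIN_SIZE = 5000  # hypothetical default; the real value is set in the code


def parse_bin_size(arg, default=DEFAULT_BIN_SIZE):
    """Return arg as an integer bin size, falling back to the default.

    A missing argument silently uses the default; a non-integer argument
    prints an error message and then uses the default.
    """
    if arg is None:
        return default
    try:
        return int(arg)
    except ValueError:
        print(f"Invalid bin size {arg!r}; using default {default}", file=sys.stderr)
        return default


missing = parse_bin_size(None)
valid = parse_bin_size("250")
invalid = parse_bin_size("abc")
```

A script would typically pass `sys.argv[1]` when it is present and `None` otherwise.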

…main() routine in buildDeIdCSVwithTrueNumerics so that main could be called with a script that had done these operations. Added runBuildDeId.py, which will run buildDeIdCSVwithTrueNumerics on a directory, generating a set of de-identified csv files.

…ng only the minimal quasi-identifier information for the HarvardX person/course dataset records. These can be selected by amount of participation. The output file has the YoB and LoE fields cleaned up, and the long-tail of the forum data counts can also be collapsed.

…gram on a csv containing only the quasi-identifiers. The current version works up to the point where the collapse function is called; there is a note in to Olivia to find out what happens there. Made a repair to qi_class to make the comparisons correct (type mismatch before). Also changed the last field in the quasi-identifier file to have the right index from the original data file.

…fier pickle files rather than a database. The code now removes the entries for '' from the numerics and creates a separate category for these values, outside of the generalization. These changes also allow us to no longer include de_id_functions.

…e continent and region to build_num_gen_qi_file.py. Corrected indexes to deal with the addition of two new fields in the qi objects in numeric_generalization_v2.py file. Added the continent and region to the qi object in qi_class.py.

… than just index, without paying the price of a dictionary reader. Added output option to generate a csv line rather than printing an explanatory message. Added a set of idFields that will give reports on different sets of quasi-identifiers.

…o the Archive directory. Added a number of files to git control. Wrote simpleCS50deId.py, which will take a log file and replace the actual user name and user_id with a randomly generated number. This is sufficient for use within Harvard, but does not suffice for general sharing under FERPA.
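One way to replace identifiers with random numbers while keeping each user consistent across log lines is sketched below. This is not the code in simpleCS50deId.py; the log format is not shown in the commit message, so the sketch only covers the identifier mapping, and `make_pseudonymizer` is a hypothetical name.

```python
import secrets


def make_pseudonymizer(bits=64):
    """Return a function mapping each real identifier to a stable random number."""
    assigned = {}  # real identifier -> random replacement
    used = set()   # replacements already handed out, to avoid collisions

    def pseudonym(real_id):
        if real_id not in assigned:
            candidate = secrets.randbits(bits)
            while candidate in used:
                candidate = secrets.randbits(bits)
            used.add(candidate)
            assigned[real_id] = candidate
        return assigned[real_id]

    return pseudonym


pseudonym = make_pseudonymizer()
first = pseudonym("alice")   # made-up user name
again = pseudonym("alice")   # same user, same replacement
other = pseudonym("bob")     # different user, different replacement
```

As the commit message notes, this kind of replacement alone does not address the quasi-identifiers, which is why it is not sufficient for general sharing.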

…ment of k = 5; seems to have no effect on the file. Changed the handling of the '' value (that is, no entry), as Python 3 is more careful about mixing types. Rather than special-casing in all of the code, simply turned the value to -1; this means that it sorts correctly. This also means that the code in build_bins can be simplified by taking out the code that special-cases the '' entries. Also changed the exit condition in the build_bins loop to be when sum(denom<bin_size) is less than or equal to zero. Testing this shows that it merges the final outlier into the last bin, which is what we want.
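The sentinel approach for empty entries can be illustrated as follows. The helper name `to_sortable` and the sample values are assumptions for this sketch; the actual conversion in the repository may differ.

```python
def to_sortable(value, missing=-1):
    """Convert a raw field to an int, mapping the empty string to a sentinel."""
    return missing if value == '' else int(value)


# Made-up raw values as they might be read from a csv file.
raw = ['12', '', '3', '', '7']

# In Python 3, sorting a list that mixes ints and '' raises TypeError,
# so every value is converted before sorting.
cleaned = sorted(to_sortable(v) for v in raw)
```

Because -1 is below any real count, the missing entries sort to the front and can be handled as their own group without special cases elsewhere.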

jimwaldo and others added 30 commits December 22, 2014 14:04

incorporated waldo's initial optimizations; olivia added more detaile…

aeaff98

…d comments for functions

Added database file

97d5d5b

Minor bug fixes and enhancements; mostly deal with removing the

ea8d720

instructor and staff entries when building the database rather than doing an explicit delete step

getting rid of the database file, added the database extension to the…

6d5161f

… .gitignore file

Corrected hard reference of 5 to soft reference to k; updated QI indi…

ca89211

…ces to correct fields (in both .ipynb and .py)

Added buildDB.py, which will build the database that then can be used…

e11d70a

… for anonymization without having to rebuild all the time. Changed De-identification.py to use the built database.

Minor changes to De-identification.py to work better with the separat…

40eae68

…e database.

Fixed a bug in buildCourseDict.py that kept all but the first student…

376f938

… in a list that were taking the same set of courses from being processed (not a good idea to change a list while iterating over it). Added lots of documentation. This now seems to work.
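The class of bug mentioned above (changing a list while iterating over it) can be reproduced in a few lines. This is a generic illustration with made-up data, not the code from buildCourseDict.py.

```python
students = ["a", "b", "c", "d"]

# Buggy pattern: removing from the list being iterated skips elements,
# because the iterator's index advances while the list shrinks.
buggy = list(students)
for s in buggy:
    buggy.remove(s)

# Safer pattern: iterate over a copy and mutate the original.
safe = list(students)
for s in list(safe):
    safe.remove(s)
```

After the first loop `buggy` still holds every second element, while `safe` ends up empty as intended.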

Minor cleanup to what is printed. Renamed buildCourseDict.py to cours…

2e93dc1

…eSetDeIdentify.py, which more accurately reflects what the program now does.

Adding a bunch of stuff that wasn't making it to the repository for v…

0a48d4b

…arious foolish reasons.

Added edLevelDistribution.py, that will take a database created by bu…

9099e05

…ildDB.py and print out a distribution of the levels of education (in a human-readable form) for the participants in that database.

Added making the cache much larger to courseSetDeIdentify.py, which c…

33ffaf1

…ut the running time immensely.

Switching back to the mainline

c99ddb3

Changed the order of the command-line arguments in testkAnonDB.py to …

e140ea5

…make it easier to run repeated tests.

Removed un-used imports

11a66e7

adding new version of numeric_generalization, with the bug fix to dea…

c8980bc

…l with the final endpoint. Work was done by Olivia; I'm just checking it in.

Bug fix to buildDeIdentifiedCSV.py; a "continue" was not working as a…

adfe3af

…dvertised, and so records that should have been suppressed were leaking into the CSV file. Also changed the cache size for the db to 300,000 pages (which doesn't seem to help).

Changing from using lists for the suppression records to sets; this s…

76a473f

…peeds things up many factors.
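A sketch of why the switch helps: membership tests against a set take roughly constant time on average, while a list is scanned element by element. The function name `filter_unsuppressed` and the record layout are assumptions for illustration.

```python
def filter_unsuppressed(rows, suppressed_ids):
    """Return the rows whose user_id is not in the suppression collection."""
    suppressed = set(suppressed_ids)  # build the set once, reuse for every lookup
    return [row for row in rows if row["user_id"] not in suppressed]


# Made-up records and suppression list.
rows = [{"user_id": 1}, {"user_id": 2}, {"user_id": 3}]
kept = filter_unsuppressed(rows, [2])
```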

Wrote excludedByParticipation, which tracks the percentage of student…

7ed810b

…s by participation in each course, using the original data set, the de-identified data set, and the set of suppressed records. Allows a testing of the effects in this dimension of de-identification.

First pass at buildDeIdCSVwithTrueNumerics.py

374f4cd

jimwaldo and others added 30 commits January 15, 2016 15:00

Added get_user_class_counts.py, a simple script that will get the uni…

66f72d4

…que user_id and unique user_id, course_id counts for a database or set of databases.

Cleaned up buildFullSuppressionSet so that it would use the generated…

0369163

… bin files rather than looking in to tables in the database. Added runFullSuppressionSet.py which runs a batch of buildFullSuppressionSet for various k values and bin sizes.

Made changes to buildDeIdentifiedCSV.py to allow it to be run efficie…

1e3a816

…ntly by a script, and then changed runBuildDeId.py to use this program (which is better than the buildDeIdCSVwithTrueNumerics.py).

Added instructions for producing the de-identified file to README.md;…

bb21d5d

… this now has a step-by-step recipe for building a de-identified file.

Changed buildcountrygeneralizer.py so that it could be called from a …

d892cac

…batch running script; wrote runbuildCountryGen.py to run this in a directory for bins of 5, 10, 15, 20, and 25k

Changed runBuildDeId.py and runFullSuppressionSet.py to use the right…

367a167

… country generalization tables. These are generated by bin size; currently they will be run for bins of 5, 10, 15, 20, and 25 thousand per bin.

Added scripts to run generating the suppression files for course-set …

079da68

…combinations, to run the generation of the numeric binning, and changed the course-set deidentification program so it could be run by this script.

Updated the documentation to buildcountrygeneralizer.py to reflect th…

0cfaf0e

…e addition of a new command-line option, which is the size of the bin to trigger generalization.

Added suppressAndBuildDeidentifiedCSV_v2.py, which takes in a general…

963d08c

…ized but not suppressed dataset and generates fake data so that suppression does not have to be performed. Was shown to improve bias.

add greedy generalization file that minimizes distortion of the mean …

bfdbe3e

…in creation of bins. still needs to be tested and debugged.

Added cursor to the argument list of main() in courseSetDeIdentify.py

3d77562

Changed syntax of the select statements to reflect that SUM(), in sql…

68206a1

…ite, seemingly cannot take '*' as an argument, and so replaced that with the actual field to be summed.
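The corrected query shape can be sketched with an in-memory database. The table and column names below are hypothetical and are not taken from the repository; the point is only that SUM() is given a named column, while COUNT(*) remains available for counting rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table standing in for the real schema.
cur.execute("CREATE TABLE activity (user_id INTEGER, nforum_posts INTEGER)")
cur.executemany("INSERT INTO activity VALUES (?, ?)", [(1, 4), (2, 0), (3, 7)])

# SUM() takes the field to be summed.
cur.execute("SELECT SUM(nforum_posts) FROM activity")
total_posts = cur.fetchone()[0]

# COUNT(*) is the form that accepts '*'.
cur.execute("SELECT COUNT(*) FROM activity")
row_count = cur.fetchone()[0]

conn.close()
```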

debugged greedy generalization file, numeric_generalization_v2.py

f64c16e

Please enter the commit message for your changes. Lines starting

Minor changes to the README.md file

3a44390

Added the city to the set of quasi-identifiers; this allows mapping f…

e207c7d

…rom the postal code, region, and city to the other geographic quasi-identifiers.

Final additions to documentation, along with changing the default pyt…

e17d352

…hon to anaconda

Added code to build a generalized quasi-identifier file, changed test…

69d7e6a

…KAnon so that it could be run from the command line

Minor changes to buildDeIdentifiedCSV.py, buildFullSuppressionSet.py,…

02e9a18

… and courseSetDeIdentify.py. Created getBinSizes.py that will show the sizes of each of the numeric bins, and graph_utils.py that has a couple of graphing routines for bar charts (we should add more graphing).

Added new files to the git repo. These include Jack's greedy and thri…

3600089

…fty algorithms for non-numeric generalization, and Michelle's files for displaying the distributions of the high school use.

Conversion to Python 3

c19101e

Minor changes to qi_class.py

9e7d320

michellevaccaro wants to merge 73 commits into jdaries:master from harvard:master

3 participants