Vaccaro#1

Open
michellevaccaro wants to merge 73 commits into jdaries:master from harvard:master

Conversation

@michellevaccaro

No description provided.

jimwaldo and others added 30 commits December 22, 2014 14:04
	Needed to install:
		pycountry
		pp
		pygeoip
	
	changed undeclared and unused variable c to cursor in a number of
functions; assumed to be a typo
	
in De-identification.py
	Removed a bad space at line 515
	
Added text file waldoLog.txt, which will be a development log for Jim
Waldo
instructor and staff entries when building the database rather than
doing an explicit delete step
…ces to correct fields (in both .ipynb and .py)
… for anonymization without having to rebuild all the time. Changed De-identification.py to use the built database.
…e database.

A number of new programs to start working on finding the sets of classes that students are taking, and dropping those that will have an equivalence set of fewer than 5.
This is an interim check in; there is much left to do but this is a start.
…from the HarvardX tools codebase. The script buildEquivClasses.py will go through a .csv file and, based on a set of quasi-identifiers hard coded into the script (this has to change), will determine the number of entities in the equivalence classes determined by that set of quasi-identifiers. This leverages some of the code used in testKAnon.py, which builds a dictionary keyed by sets of quasi-identifiers with value the number of entities that have that set of identifiers.

Note that this is a first check-in, working only with .csv files. This should be expanded to allow building both the dictionary of concatenated quasi-identifiers and the dictionary of equivalence classes either with .csv files or with the sqlite database.
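The dictionary described above can be sketched roughly as follows. This is a minimal illustration, not the code in buildEquivClasses.py or testKAnon.py: the function names, the column names (`country`, `yob`, `loe`), and the sample rows are all assumptions made for the example.

```python
import csv
import io
from collections import Counter


def count_equivalence_classes(rows, qi_fields):
    """Map each tuple of quasi-identifier values to the number of rows sharing it."""
    return Counter(tuple(row[field] for field in qi_fields) for row in rows)


def class_size_distribution(class_counts):
    """Map each equivalence-class size to the number of classes of that size."""
    return Counter(class_counts.values())


# Hypothetical sample data standing in for a real .csv file.
sample = io.StringIO(
    "user_id,country,yob,loe\n"
    "1,US,1990,b\n"
    "2,US,1990,b\n"
    "3,FR,1985,m\n"
)
rows = list(csv.DictReader(sample))

# Hypothetical quasi-identifier set; passed in rather than hard coded.
classes = count_equivalence_classes(rows, ["country", "yob", "loe"])
sizes = class_size_distribution(classes)
```

Passing the quasi-identifier list as an argument is one way to avoid hard coding it in the script; any class whose count is below k would then be a candidate for suppression or generalization.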
…nd therefore should probably be renamed) to delete records (rows) to ensure that a student can't be identified by the set of courses that he or she has taken. The current version deletes courses, preferring to get rid of those with the least activity, although the routine that will randomly pick a course is also available.
… in a list that were taking the same set of courses from being processed (not a good idea to change a list while iterating over it). Added lots of documentation. This now seems to work.
…eSetDeIdentify.py, which more accurately reflects what the program now does.
…ildDB.py and print out a distribution of the levels of education (in a human-readable form) for the participants in that database.
…om countries to a geographic area that has at least 50 records in that area. If the country has at least 50 records, the mapping is the identity mapping, but if there are fewer than 50, the mapping will be to a region that will contain at least 50. Note that 50 is an arbitrary number; it can be changed by changing one value in the code.
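A rough sketch of the threshold mapping described above. The function name, the `region_of` lookup table, and the sample data are assumptions for illustration; unlike the program described, this sketch does not verify that the fallback region itself reaches the threshold.

```python
from collections import Counter

MIN_RECORDS = 50  # arbitrary threshold, per the commit message


def build_country_mapping(countries, region_of, min_records=MIN_RECORDS):
    """Map each country to itself if it has enough records, else to its region.

    `countries` holds one country code per record; `region_of` is a
    hypothetical country-to-region lookup supplied by the caller.
    """
    counts = Counter(countries)
    mapping = {}
    for country, n in counts.items():
        if n >= min_records:
            mapping[country] = country  # identity mapping
        else:
            mapping[country] = region_of[country]  # generalize to region
    return mapping


# Hypothetical data, with a small threshold so the example stays short.
records = ["US", "US", "US", "FR", "DE"]
regions = {"US": "North America", "FR": "Europe", "DE": "Europe"}
mapping = build_country_mapping(records, regions, min_records=2)
```

Keeping the threshold as a single parameter matches the note that it can be changed by changing one value.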
…t change the database, but rather to build a table of records to suppress. Note that the current runtime difference between running this while querying the database (to delete the course in which the student did the least) and running it without querying the database (by simply picking a course at random) is immense; using the database requires 1 hour 20 minutes to work on the full data set, while not working with the database runs in 11 seconds. Sigh...
…he chain is far faster than the original mechanisms, and allows building a de-identified set where suppression of records is minimized and generalization is preferred. We now need to know the size of the bins that will allow generalization to be performed to build a k-anonymous data set; the number appears to be surprisingly large.

Note that we are currently not generalizing the level of education; this may need to be changed to allow reasonable de-identification bins in the other values.
…ome up with a full suppression list (combining the records that need to be suppressed to keep from allowing identification by the set of classes taken with the set that need to be suppressed because the quasi-identifiers are not k-anonymous). The current implementation is slow; some work needs to be done to see if this can be sped up by not creating as many lists.
…l with the final endpoint. Work was done by Olivia; I'm just checking it in.
…dvertised, and so records that should have been suppressed were leaking into the CSV file. Also changed the cache size for the db to 300,000 pages (which doesn't seem to help).
…s by participation in each course, using the original data set, the de-identified data set, and the set of suppressed records. Allows testing of the effects in this dimension of de-identification.
jimwaldo and others added 30 commits January 15, 2016 15:00
…in sizes as arguments. If no numbers are passed in,

or if they are not integers, the size of the bin will default to the size set in the code (and, if an invalid entry was
passed in, an error message will be printed out).

Note that we still don't deal with the case in which the minimum bin size is too large for the number of entries that we
have; in such a case the bin should be everything.

Updated the pydoc for the functions in numeric_generalization to reflect the new code.
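The fallback behaviour described in this commit can be sketched as below. The function name and the default value are hypothetical; the actual default is whatever is set in the code.

```python
import sys

DEFAULT_BIN_SIZE = 5000  # hypothetical default; the real value is set in the code


def parse_bin_size(arg, default=DEFAULT_BIN_SIZE):
    """Return arg as an int, or the default if arg is missing or not an integer."""
    if arg is None:
        return default  # nothing passed in: silently use the default
    try:
        return int(arg)
    except ValueError:
        # Invalid entry: report it, then fall back to the default.
        print(f"Invalid bin size {arg!r}; using default {default}", file=sys.stderr)
        return default
```

A caller might use it as `parse_bin_size(sys.argv[1] if len(sys.argv) > 1 else None)`. As noted above, this does nothing about a bin size that exceeds the number of entries.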
…que user_id and unique user_id, course_id counts for a database or set of databases.
… bin files rather than looking into tables in the database. Added runFullSuppressionSet.py which runs a batch of buildFullSuppressionSet for various k values and bin sizes.
…main() routine in buildDeIdCSVwithTrueNumerics so that main could be called with a script that had done these operations. Added runBuildDeId.py, which will run buildDeIdCSVwithTrueNumerics on a directory, generating a set of de-identified csv files.
…ntly by a script, and then changed runBuildDeId.py to use this program (which is better than the buildDeIdCSVwithTrueNumerics.py).
… this now has a step-by-step recipe for building a de-identified file.
…batch running script; wrote runbuildCountryGen.py to run this in a directory for bins of 5, 10, 15, 20, and 25k
… country generalization tables. These are generated by bin size; currently they will be run for bins of 5, 10, 15, 20, and 25 thousand per bin.
…combinations, to run the generation of the numeric binning, and changed the course-set deidentification program so it could be run by this script.
…e addition of a new command-line option, which is the size of the bin to trigger generalization.
…ized but not suppressed dataset and generates fake data so that suppression does not have to be performed. Was shown to improve bias.
…in creation of bins.

still needs to be tested and debugged.
…ite, seemingly cannot take '*' as an argument, and so replaced that with the actual field to be summed.
…ng only the minimal quasi-identifier information for the HarvardX person/course dataset records. These can be selected by amount of participation. The output file has the YoB and LoE fields cleaned up, and the long-tail of the forum data counts can also be collapsed.
…rom the postal code, region, and city to the other geographic quasi-identifiers.
…gram on a csv containing only the quasi-identifiers. The current version works up to the point of when the collapse function is called; there is a note in to Olivia to find out what happens there.

Made a repair to qi_class to make the comparisons correct (type mismatch before). Also changed the last field in the quasi-identifier file to have the right index from the original data file.
…fier pickle files rather than a database. The code now removes the entries for '' from the numerics and creates a separate category for these values, outside of the generalization. These changes also allow us to no longer include de_id_functions.
…KAnon so that it could be run from the command line
…e continent and region to build_num_gen_qi_file.py. Corrected indexes to deal with the addition of two new fields in the qi objects in numeric_generalization_v2.py file. Added the continent and region to the qi object in qi_class.py.
… than just index, without paying the price of a dictionary reader. Added output option to generate a csv line rather than printing an explanatory message. Added a set of idFields that will give reports on different sets of quasi-identifiers.
…o the Archive directory. Added a number of files to git control. Wrote simpleCS50deId.py, which will take a log file and replace the actual user name and user_id with a randomly generated number. This is sufficient for use within Harvard, but does not suffice for general sharing under FERPA.
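The replacement step attributed to simpleCS50deId.py could look roughly like the sketch below. It is an assumption-laden illustration, not that script: the record layout (dicts with `username` and `user_id` keys), the function name, and the choice to reuse one random number for both fields are all invented for the example.

```python
import secrets


def pseudonymize(records, name_field="username", id_field="user_id"):
    """Replace user name and user_id with one random number per distinct user."""
    pseudonyms = {}  # original user_id -> random replacement
    used = set()     # replacements already handed out
    result = []
    for rec in records:
        uid = rec[id_field]
        if uid not in pseudonyms:
            candidate = secrets.randbelow(10**9)
            while candidate in used:  # avoid giving two users the same number
                candidate = secrets.randbelow(10**9)
            used.add(candidate)
            pseudonyms[uid] = candidate
        masked = dict(rec)  # copy so the input records are left untouched
        masked[name_field] = str(pseudonyms[uid])
        masked[id_field] = str(pseudonyms[uid])
        result.append(masked)
    return result


# Hypothetical log entries.
log = [
    {"username": "alice", "user_id": "17", "event": "login"},
    {"username": "bob", "user_id": "42", "event": "submit"},
    {"username": "alice", "user_id": "17", "event": "logout"},
]
masked_log = pseudonymize(log)
```

The same user always receives the same replacement, so activity can still be linked across log lines, which is part of why this kind of substitution alone falls short of general sharing.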
… and courseSetDeIdentify.py. Created getBinSizes.py that will show the sizes of each of the numeric bins, and graph_utils.py that has a couple of graphing routines for bar charts (we should add more graphing).
…fty algorithms for non-numeric generalization, and Michelle's files for displaying the distributions of the high school use.
…ment of k = 5; seems to have no effect on the file.

Changed the handling of the '' value (that is, no entry), as Python 3 is more careful about mixing types. Rather than
special-casing in all of the code, simply turned the value to -1; this means that it sorts correctly. This also means that
the code in build_bins can be simplified by taking out the code that special-cases the '' entries.

Also changed the exit condition in the build_bins loop to be when sum(denom<bin_size) is less than or equal to zero. Testing
this shows that it merges the final outlier into the last bin, which is what we want.
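The two behaviours described in this commit, treating '' as -1 so that it sorts with the other values and folding an undersized final remainder into the last bin, can be sketched as follows. This is a simplified greedy binning written for illustration; it does not reproduce the actual build_bins loop or its sum(denom<bin_size) exit condition, and the helper name `to_numeric` is an assumption.

```python
from collections import Counter


def to_numeric(value):
    """Turn the empty entry '' into -1 so every value is an int and sorts cleanly."""
    return -1 if value == "" else int(value)


def build_bins(values, bin_size):
    """Greedily group sorted values into (low, high, count) bins of >= bin_size.

    A trailing group smaller than bin_size is merged into the previous bin;
    if no bin could be completed, everything ends up in a single bin.
    """
    counts = sorted(Counter(to_numeric(v) for v in values).items())
    bins = []
    current, current_n = [], 0
    for value, n in counts:
        current.append(value)
        current_n += n
        if current_n >= bin_size:
            bins.append((current[0], current[-1], current_n))
            current, current_n = [], 0
    if current:  # undersized remainder
        if bins:
            low, _, n = bins.pop()
            bins.append((low, current[-1], n + current_n))
        else:
            bins.append((current[0], current[-1], current_n))
    return bins


# Hypothetical year-of-birth column with one missing entry.
years = ["", "1990", "1990", "1991", "1991", "1992"]
bins = build_bins(years, bin_size=2)
```

With these inputs the lone 1992 record is absorbed into the 1991 bin rather than forming a bin of one. The single-bin fallback is one possible answer to the earlier note about a minimum bin size larger than the data set.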
3 participants

Comments