Description
Describe the bug
The first names that Faker generates are not tied to the generated sex, so the profiles are often unrealistic. This is a problem when combining name and sex in Faker.profile(). For example, Barbara is typically a female name, whereas Jonathan is typically a male name. A Stack Overflow post suggested this solution for Python Faker:
fake.first_name_male() if gender=="M" else fake.first_name_female()
However, I would prefer a solution in Julia.
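The same idea (draw the sex first, then pick a first name from the matching pool) could be sketched in Julia roughly as follows. This is only an illustration: the name pools and the function `consistent_profile` are hypothetical and are not part of Faker.jl's API.

```julia
using Random

# Hypothetical, tiny name pools for illustration only (not Faker.jl's data).
const MALE_NAMES = ["Jonathan", "Michael", "David"]
const FEMALE_NAMES = ["Barbara", "Linda", "Susan"]

# Draw the sex first, then pick a first name from the pool for that sex,
# so the two fields can never contradict each other.
function consistent_profile(rng::AbstractRNG = Random.default_rng())
    sex = rand(rng, ["M", "F"])
    name = sex == "M" ? rand(rng, MALE_NAMES) : rand(rng, FEMALE_NAMES)
    return Dict("name" => name, "sex" => sex)
end

p = consistent_profile()
println(p)
```

With this ordering, "Barbara" can only appear together with "F" and "Jonathan" only with "M", which is the behavior described under "Expected behavior".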
To Reproduce
Steps to reproduce the behavior:
julia> using Faker
julia> Faker.profile("name", "sex")
Dict{Any, Any} with 2 entries:
"name" => "Barbara"
"sex" => "M"
julia> Faker.profile("name", "sex")
Dict{Any, Any} with 2 entries:
"name" => "Jonathan"
"sex" => "F"
Expected behavior
I expected this:
julia> Faker.profile("name", "sex")
Dict{Any, Any} with 2 entries:
"name" => "Barbara"
"sex" => "F"
julia> Faker.profile("name", "sex")
Dict{Any, Any} with 2 entries:
"name" => "Jonathan"
"sex" => "M"
Screenshots
Not applicable because not all profiles are problematic.
Environment
- Repository: Julia 1.7.2
- OS: Win 10
- IDE: REPL
- Project/Manifest: I am a Julia beginner. Is this what you want?
(@v1.7) pkg> st
Status `C:\Users\aalexandersson\.julia\environments\v1.7\Project.toml`
[a93c6f00] DataFrames v1.3.2
[0efc519c] Faker v0.3.2
[be6f12e9] ODBC v1.0.4
[08abe8d2] PrettyTables v1.3.1
Additional context
The Social Security Administration (SSA) publishes national, state-specific, and territory-specific baby-name data that could perhaps be used:
https://www.ssa.gov/oact/babynames/limits.html
Personally, I need realistic fake datasets to test record linkage for my work at the Florida cancer registry. Faker.profile() returns only one record (observation) per call, and not a file (dataset). Is there an easy way to generate several profiles and save them as a dataset? In my case, I need two datasets, say one with 100,000 records and the other with 1 million records. If I could create and read a dataset with, for example, just three records, then it should be trivial to repeat the procedure for a varying number of observations and datasets.
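To make the request concrete, here is a minimal sketch of writing and reading back a three-record dataset using only Julia's `DelimitedFiles` standard library. The helper `write_profiles` and the stand-in generator `demo_record` are hypothetical names; in practice the generator could be any zero-argument function that returns a Dict with "name" and "sex" keys, for example a wrapper around `Faker.profile("name", "sex")`.

```julia
using DelimitedFiles

# Write `n` records to a comma-separated file with a header row.
# `make_record` is any zero-argument function returning a Dict
# with the keys "name" and "sex".
function write_profiles(path::AbstractString, n::Integer, make_record)
    rows = Matrix{String}(undef, n + 1, 2)
    rows[1, :] = ["name", "sex"]              # header row
    for i in 1:n
        rec = make_record()
        rows[i + 1, 1] = rec["name"]
        rows[i + 1, 2] = rec["sex"]
    end
    writedlm(path, rows, ',')
    return path
end

# Stand-in generator (hypothetical); it always returns the same record.
demo_record() = Dict("name" => "Barbara", "sex" => "F")

# Create a dataset with three records, then read it back.
path = write_profiles(joinpath(mktempdir(), "profiles.csv"), 3, demo_record)
data, header = readdlm(path, ',', String; header = true)
println(size(data))
```

Changing the `3` to `100_000` or `1_000_000` would give the larger datasets, assuming the generator is fast enough for that many calls.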
Edit 1: The SSA data requires a lot of merging. It would be good enough for me to have just one approximate dataset, such as "name_gender.csv" from data.world. The dataset has 95,025 rows and three columns: "name", "gender", and "probability". According to the dataset, the example names Barbara and Jonathan have probabilities 1 and 0.9957, respectively. The dataset can be accessed here: https://data.world/howarder/gender-by-name
There is a Julia package which also might help: NameToGender.jl
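As a rough sketch of how a name/gender/probability table like the one in Edit 1 could drive the sex field: pick a row, then assign the listed gender with the listed probability and the other sex otherwise. The two rows below use the probabilities quoted above; the "F"/"M" coding, the reading of "probability" as the chance that the name belongs to the listed gender, and the function `sample_profile` are all assumptions made for illustration.

```julia
using Random

# Hypothetical in-memory excerpt of a name/gender/probability table.
# The gender coding ("F"/"M") is assumed, not checked against the file.
const NAME_GENDER = [
    (name = "Barbara", gender = "F", probability = 1.0),
    (name = "Jonathan", gender = "M", probability = 0.9957),
]

# Pick a row at random, then assign the listed gender with the listed
# probability and the other sex otherwise.
function sample_profile(rng::AbstractRNG = Random.default_rng())
    row = rand(rng, NAME_GENDER)
    other = row.gender == "F" ? "M" : "F"
    sex = rand(rng) < row.probability ? row.gender : other
    return Dict("name" => row.name, "sex" => sex)
end

println(sample_profile())
```

Under these assumptions, Barbara is always paired with "F", while Jonathan is paired with "M" in roughly 99.57% of draws.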
Edit 2: There is also another Julia package which seems to be even more useful here: GenderInference.jl