First names are not gender-specific #30

@aalexandersson

Describe the bug
Generated first names are not matched to the generated sex, so profiles are often unrealistic. This is a problem when combining name and sex in Faker.profile(). For example, Barbara is typically a female name, whereas Jonathan is typically a male name. A StackOverflow post suggested this solution for the Python Faker:
fake.first_name_male() if gender=="M" else fake.first_name_female()
However, I would prefer a solution in Julia.

To Reproduce
Steps to reproduce the behavior:

julia> using Faker
julia> Faker.profile("name", "sex")
Dict{Any, Any} with 2 entries:
  "name" => "Barbara"
  "sex"  => "M"
julia> Faker.profile("name", "sex")
Dict{Any, Any} with 2 entries:
  "name" => "Jonathan"
  "sex"  => "F"

Expected behavior
I expected this:

julia> Faker.profile("name", "sex")
Dict{Any, Any} with 2 entries:
  "name" => "Barbara"
  "sex"  => "F"
julia> Faker.profile("name", "sex")
Dict{Any, Any} with 2 entries:
  "name" => "Jonathan"
  "sex"  => "M"
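
One way to get this behavior is to draw the sex first and then pick the first name from a list for that sex. The following is only a minimal sketch of that idea, not Faker.jl's implementation: `FEMALE_NAMES`, `MALE_NAMES`, and `consistent_profile` are hypothetical names, and the lists hold just the two example names from this report.

```julia
using Random

# Hypothetical, tiny name lists for illustration only; they are not Faker.jl's data.
const FEMALE_NAMES = ["Barbara"]
const MALE_NAMES = ["Jonathan"]

# Draw the sex first, then pick a first name from the matching list,
# so the two fields can never disagree.
function consistent_profile(rng::AbstractRNG = Random.default_rng())
    sex = rand(rng, ["F", "M"])
    name = sex == "F" ? rand(rng, FEMALE_NAMES) : rand(rng, MALE_NAMES)
    return Dict("name" => name, "sex" => sex)
end

p = consistent_profile()
```

With real name lists in place of the two placeholders, every returned profile would pair a name with the sex it was drawn for.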

Screenshots
Not applicable because not all profiles are problematic.

Environment

  • Repository: Julia 1.7.2
  • OS: Win 10
  • IDE: REPL
  • Project/Manifest: I am a Julia beginner. Is this what you want?
(@v1.7) pkg> st
      Status `C:\Users\aalexandersson\.julia\environments\v1.7\Project.toml`
  [a93c6f00] DataFrames v1.3.2
  [0efc519c] Faker v0.3.2
  [be6f12e9] ODBC v1.0.4
  [08abe8d2] PrettyTables v1.3.1

Additional context
The SSA provides national, state-specific, and territory-specific data that could perhaps be used:
https://www.ssa.gov/oact/babynames/limits.html

Personally, I need realistic fake datasets to test record linkage for my work at the Florida cancer registry. Faker.profile() returns only one record (observation) and does not write it to a file (dataset). Is there an easy way to generate several profiles and save them as a dataset? In my case, I need two datasets: one with 100,000 records and another with 1 million records. If I could create and read a dataset with, for example, just three records, it should be trivial to repeat the procedure for varying numbers of observations and datasets.
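
To make the request concrete, here is a rough sketch of writing several records to a comma-separated file and reading them back, using only Julia's base I/O. The names `make_record`, `write_dataset`, and `read_dataset` are hypothetical; `make_record` is a stand-in that could be replaced by a call such as Faker.profile("name", "sex"). The sketch assumes that no field contains a comma.

```julia
# Hypothetical stand-in for a record generator; in practice this could be
# a call such as Faker.profile("name", "sex") as shown in this issue.
make_record() = rand([Dict("name" => "Barbara", "sex" => "F"),
                      Dict("name" => "Jonathan", "sex" => "M")])

# Write n records to a comma-separated file with a header row.
function write_dataset(path::AbstractString, n::Integer)
    open(path, "w") do io
        println(io, "name,sex")
        for _ in 1:n
            r = make_record()
            println(io, r["name"], ",", r["sex"])
        end
    end
    return path
end

# Read the file back as a vector of (name, sex) named tuples,
# skipping the header row. Assumes fields contain no commas.
function read_dataset(path::AbstractString)
    lines = readlines(path)
    return [(name = String(f[1]), sex = String(f[2]))
            for f in split.(lines[2:end], ",")]
end

path = write_dataset(joinpath(mktempdir(), "profiles.csv"), 3)
records = read_dataset(path)
```

Changing the last argument of `write_dataset` from 3 to 100_000 or 1_000_000 would give the two dataset sizes mentioned above. For data with quoting or embedded commas, a dedicated CSV package would probably be a better fit than this hand-rolled format.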

Edit 1: The SSA data requires a lot of merging. A single approximate dataset such as "name_gender.csv" from data.world would be good enough for me. The dataset has 95,025 rows and three columns: "name", "gender", and "probability". According to the dataset, the example names Barbara and Jonathan have probabilities 1 and 0.9957, respectively. The dataset can be accessed here: https://data.world/howarder/gender-by-name
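
As a sketch of how such a file could feed gender-specific name lists, the code below groups names by gender and keeps only those above a probability threshold. It uses just the two rows quoted above. The "F"/"M" coding of the "gender" column is my assumption, and `names_by_gender` and `min_prob` are hypothetical names.

```julia
# Two rows quoted in this issue, in the "name,gender,probability" layout
# described for name_gender.csv. The "F"/"M" codes are an assumption;
# the real file would be read from disk instead.
sample_csv = """
name,gender,probability
Barbara,F,1
Jonathan,M,0.9957
"""

# Group names by gender, keeping only names whose probability meets a
# threshold (min_prob is a hypothetical parameter).
function names_by_gender(csv_text::AbstractString; min_prob::Float64 = 0.9)
    groups = Dict{String, Vector{String}}()
    for line in split(strip(csv_text), "\n")[2:end]   # skip the header row
        name, gender, prob = split(line, ",")
        if parse(Float64, prob) >= min_prob
            push!(get!(groups, String(gender), String[]), String(name))
        end
    end
    return groups
end

groups = names_by_gender(sample_csv)
```

The resulting lists could then be sampled per sex, so that a name like Barbara is only ever paired with "F".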

There is a Julia package which also might help: NameToGender.jl

Edit 2: There is also another Julia package which seems to be even more useful here: GenderInference.jl
