In the cleaned_hm.csv file, I believe the modified column is the opposite of what it should be. You can see this by example with:
> df.loc[[50, 995],:]
         original_hm        cleaned_hm  modified
50   I went shopping   I went shopping      True
995   I ate chikfila  I ate chik-fil-a     False
I confirmed it by recomputing the column from the two text columns. The existing modified value never matches the recomputed one:
> (df.modified == (df.cleaned_hm != df.original_hm)).sum()
0
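For anyone who wants to try the check without downloading the file, here is a minimal sketch on a toy frame. The two rows are copied from the example above; the frame itself is hand-built for illustration and is not loaded from cleaned_hm.csv, and the names expected, matches and inverted are mine.

```python
import pandas as pd

# Toy frame holding only the two rows shown above (not the real cleaned_hm.csv).
df = pd.DataFrame(
    {
        "original_hm": ["I went shopping", "I ate chikfila"],
        "cleaned_hm": ["I went shopping", "I ate chik-fil-a"],
        "modified": [True, False],
    },
    index=[50, 995],
)

# What I would expect `modified` to mean: cleaning changed the text.
expected = df["cleaned_hm"] != df["original_hm"]

# Rows where the shipped flag agrees with that expectation.
matches = int((df["modified"] == expected).sum())

# Rows where the shipped flag is the exact negation of it.
inverted = int((df["modified"] == ~expected).sum())

print(matches, inverted)
```

If the flag really is inverted, matches should be 0 and inverted should equal the number of rows, which is what I see on the full file for the first count.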
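An inverted flag also seems plausible because modified is currently True for about 98% of rows (98329 of 100535), which would be surprising if it really meant the text had been changed: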
> df.modified.value_counts()
True 98329
False 2206
Name: modified, dtype: int64
Or am I misunderstanding the data?