
Debugging build_distance_objects() #10

Open
paddytobias wants to merge 1 commit into matthewjdenny:master from paddytobias:master


@paddytobias

Dealing with cases where there is no similarity for a document in the dfm and NA is therefore returned by quanteda::textstat_simil(). This fixes it by setting all NAs to 0 as a default. This would be a common problem for sparse DFMs.
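A minimal sketch of the change described above, using base R only. The `cosine_sim()` helper and the toy matrix are hypothetical stand-ins, not the package's code and not the output of quanteda::textstat_simil(); they only reproduce an undefined similarity for an all-zero document and then apply the NA-to-0 default.

```r
# Toy document-feature counts; doc3 is empty (all zeros).
x <- rbind(
  doc1 = c(1, 2, 0),
  doc2 = c(0, 1, 3),
  doc3 = c(0, 0, 0)
)

# Hypothetical helper: pairwise cosine similarity between rows.
cosine_sim <- function(m) {
  norms <- sqrt(rowSums(m^2))
  tcrossprod(m) / tcrossprod(norms)
}

# Every entry involving doc3 is 0/0, which R reports as NaN.
sim <- cosine_sim(x)

# The default proposed here: overwrite undefined entries with 0.
# is.na() is TRUE for NaN as well as NA, so both are caught.
sim_fixed <- sim
sim_fixed[is.na(sim_fixed)] <- 0
```

After the replacement, `sim_fixed` has no missing entries, and every pair involving the empty document is reported as having zero similarity.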

@kbenoit
Contributor

kbenoit commented Mar 19, 2020

A distance of 0 means the objects are equivalent, such as dist(A, A) = 0. For sparse symmetric matrix objects (of which the Matrix package defines several), the margin can be treated in a special way, at least for distance measures for which this basic axiom of distance is true. (In quanteda we removed all those measures for which it was not true.)

A distance of NA means the distance is undefined, for instance dist(docA, docB) where docB is empty (contains zero counts for all features).

There should imo be a difference. Note that proxy and stats::dist() treat these differently. We went with the stats::dist() approach. See quanteda/quanteda#1540, where we discussed this at some length.
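To make the 0-versus-NA distinction concrete, here is a small base R sketch. It assumes a cosine-style distance and uses a hypothetical `cosine_dist()` helper written for illustration; it is not how quanteda, proxy, or stats::dist() compute anything.

```r
x <- rbind(
  docA = c(2, 1, 0),
  docB = c(0, 0, 0)   # empty document: zero counts for all features
)

# Hypothetical helper: cosine distance between two count vectors.
cosine_dist <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Defined: a document compared with itself is (numerically) at distance 0.
d_same <- cosine_dist(x["docA", ], x["docA", ])

# Undefined: the empty document has zero norm, so this is 0/0 (NaN in R).
d_empty <- cosine_dist(x["docA", ], x["docB", ])

# Overwriting the undefined value with 0 makes the two cases look identical.
d_empty_as_zero <- if (is.na(d_empty)) 0 else d_empty
```

Once the undefined value is replaced, `d_empty_as_zero` and `d_same` are both effectively 0, so a downstream consumer can no longer tell "identical documents" apart from "distance could not be computed".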

