The wcc algorithm output contains a large amount of duplicate data #761

@jyswpp

The output file produced by the WCC algorithm contains duplicate rows. I suspect that the algorithm's intermediate results were exported along with the final result.

The WCC SQL is:

CREATE GRAPH cc_graph_test (
  Vertex nodes (
    id bigint ID
  ),
  Edge edges (
    srcId bigint SOURCE ID,
    targetId bigint DESTINATION ID
  )
) WITH (
  storeType = 'memory',
  shardCount = 1
);

INSERT INTO cc_graph_test.nodes(id) VALUES
(1),
(2),
(3),
(4),
(5),
(6);

INSERT INTO cc_graph_test.edges VALUES
(1, 2),
(2, 3),
(4, 5),
(5, 6);

CREATE TABLE IF NOT EXISTS cc_geaflow_test (
  v_id int,
  k_value VARCHAR
) WITH (
  type = 'file',
  geaflow.dsl.table.parallelism = 64,
  geaflow.dsl.source.parallelism = 64,
  geaflow.file.persistent.config.json = '{''}',
  geaflow.dsl.file.path = '
',
  geaflow.dsl.column.separator = '\s'
);

USE GRAPH cc_graph_test;
INSERT INTO cc_geaflow_test(v_id, k_value)
CALL wcc() YIELD (vid, component)
RETURN vid, component;
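
For reference, the result one would expect for this six-vertex graph can be worked out independently of GeaFlow. The following is a minimal Python sketch, not GeaFlow code: `weakly_connected_components` is a hypothetical helper, and labeling each component with its smallest vertex id is an assumption about how `wcc()` names components.

```python
def weakly_connected_components(vertices, edges):
    """Union-find that labels each vertex with the smallest id in its component."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links to the root, halving the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for src, dst in edges:
        a, b = find(src), find(dst)
        if a != b:
            # Attach the larger root under the smaller one so the root
            # is always the minimum id of the merged component.
            parent[max(a, b)] = min(a, b)

    return {v: find(v) for v in vertices}


# Same vertices and edges as the INSERT statements above.
vertices = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 3), (4, 5), (5, 6)]

expected = weakly_connected_components(vertices, edges)
for vid, component in sorted(expected.items()):
    print(vid, component)
```

Under that assumption the graph has two components, {1, 2, 3} and {4, 5, 6}, so a correct export would contain exactly six rows, one per vertex.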

The output is:
1s1
1s1
2s1
1s1
2s1
3s1
1s1
2s1
3s1
4s4
1s1
2s1
3s1
4s4
5s4
1s1
2s1
3s1
4s4
5s4
6s4
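
The duplication in the rows above follows a regular pattern, which the short Python sketch below checks. It is only an analysis of the rows pasted in this issue: it counts duplicates and tests whether the file is the sequence of growing prefixes (first row, then first two rows, and so on). Whether that pattern comes from results being re-emitted per vertex, per iteration, or some other way inside GeaFlow is not established here.

```python
from collections import Counter

# Rows copied verbatim from the output above (21 lines in total).
reported = (
    "1s1 "
    "1s1 2s1 "
    "1s1 2s1 3s1 "
    "1s1 2s1 3s1 4s4 "
    "1s1 2s1 3s1 4s4 5s4 "
    "1s1 2s1 3s1 4s4 5s4 6s4"
).split()

counts = Counter(reported)

# Order-preserving de-duplication: keeps the first occurrence of each row.
unique_rows = list(dict.fromkeys(reported))

# Rebuild the file as prefixes of length 1, 2, ..., n of the unique rows
# and compare with what was actually written.
prefixes = [unique_rows[:n] for n in range(1, len(unique_rows) + 1)]
flattened = [row for prefix in prefixes for row in prefix]
is_cumulative = flattened == reported

print(len(reported), len(unique_rows), is_cumulative)  # prints: 21 6 True
```

The 21 rows reduce to 6 unique rows, and the file is exactly the concatenation of prefixes of length 1 through 6, which is consistent with the suspicion that partial results are written repeatedly rather than only once at the end.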
