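I have data with three levels of labels (each level contains the next, as a hierarchy). Obviously the first-level labels have the largest counts, and the second- and third-level label counts differ from the first level by several orders of magnitude. In addition, every level has both popular and rare labels, and those also differ by several orders of magnitude. How should I handle this situation?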
@JiaWenqi
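First, be clear about what task you want to solve: do you need to predict all labels at every level, or only the labels at the most general level? Depending on the task, this is either an Extreme Multi-Label problem (each level has a very large number of labels) or a Hierarchical Multi-Label problem (the relationships between levels matter).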
Every level has both popular and rare labels. Some initial ideas:
Taken one level at a time, each level is a flat Multi-Label problem. For class imbalance, more approaches are covered in the 2009 IEEE paper "Learning from Imbalanced Data", which should be helpful for your work.
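As one illustration of handling imbalance within a single flat level, here is a minimal sketch of cost-sensitive weighting: each label gets a positive-class weight equal to its negative-to-positive document ratio, which could then be passed to a weighted binary cross-entropy loss. This is an assumption for illustration, not a method proposed in this thread; the function name `positive_class_weights` and the `max_weight` cap are hypothetical.

```python
from collections import Counter


def positive_class_weights(label_sets, max_weight=100.0):
    """Return {label: weight} for one flat level of a multi-label dataset.

    label_sets: one collection of labels per document.
    The weight is (#docs without the label) / (#docs with the label),
    clipped at max_weight (an assumed cap) so very rare labels do not
    dominate the loss.
    """
    n_docs = len(label_sets)
    # Count each label at most once per document.
    doc_counts = Counter(label for labels in label_sets for label in set(labels))
    return {
        label: min(max_weight, (n_docs - count) / count)
        for label, count in doc_counts.items()
    }


if __name__ == "__main__":
    # Toy data: 90 docs tagged "hot", 10 docs tagged "cold".
    data = [{"hot"}] * 90 + [{"cold"}] * 10
    print(positive_class_weights(data))
```

Rare labels receive large weights and popular labels small ones, so errors on rare labels cost more during training.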
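Right. My current approach is to first compute the label distribution over all the data, then group the labels into intervals by their doc count. For the intervals of labels with few docs, I take all of the corresponding docs; for the intervals of labels with many docs, I sample a portion of them. As a result, popular labels end up with counts on the order of tens of thousands or thousands, and rare labels on the order of hundreds (some rare labels have only a little over 100 docs in total). Is this ratio reasonable?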
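The sampling scheme described above can be sketched roughly as follows. This is a simplified variant under stated assumptions, not the poster's actual code: instead of explicit intervals it uses a single hypothetical `cap` parameter, and it keeps each document with probability `min(1, cap / n)`, where `n` is the doc count of that document's rarest label. Documents carrying any rare label are therefore always kept, while documents that carry only popular labels are downsampled.

```python
import random
from collections import Counter


def downsample_by_label_frequency(docs, cap=1000, seed=0):
    """Subsample (doc_id, labels) pairs so popular labels shrink toward `cap`.

    docs: list of (doc_id, labels) tuples; labels is a collection of strings.
    cap:  assumed target doc count for popular labels (hypothetical knob).
    """
    rng = random.Random(seed)
    # Doc count per label, counting each label at most once per document.
    doc_counts = Counter(label for _, labels in docs for label in set(labels))
    kept = []
    for doc_id, labels in docs:
        if not labels:
            continue  # unlabeled documents are skipped in this sketch
        rarest = min(doc_counts[label] for label in labels)
        # Always keep docs with a rare label; otherwise keep with prob cap/rarest.
        if rarest <= cap or rng.random() < cap / rarest:
            kept.append((doc_id, labels))
    return kept


if __name__ == "__main__":
    docs = [(i, ["hot"]) for i in range(5000)]
    docs += [(5000 + i, ["cold"]) for i in range(120)]
    sample = downsample_by_label_frequency(docs, cap=1000)
    print(Counter(label for _, labels in sample for label in labels))
```

Because a document can carry several labels, the resulting per-label counts are only approximately bounded by `cap`; checking the label distribution again after sampling is a sensible sanity step.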
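Personally, I think this depends on your overall data size. If your data volume is under roughly 100,000, then in my experience the ratio you describe should be about right.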