May 31, 2021

MinHash/LSH: computing locality-sensitive hashes to deduplicate documents

The idea behind MinHash/LSH is to start from the term-document matrix and compute a fixed-length hash signature for each document. In effect, the signature is a reduced-dimension embedding of the document, and the fraction of positions on which two signatures agree estimates the Jaccard similarity of the two documents.
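A from-scratch sketch makes the signature idea concrete (illustrative code only, not the snapy implementation; the helper names are my own):

```python
import hashlib

def minhash_signature(text, num_perm=100, n_gram=2):
    """MinHash signature over character n-grams: one seeded hash function
    per "permutation", keeping the minimum hash value of any shingle."""
    shingles = {text[i:i + n_gram] for i in range(len(text) - n_gram + 1)}
    signature = []
    for seed in range(num_perm):
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        ))
    return signature

def estimate_jaccard(sig_a, sig_b):
    # The fraction of matching signature positions estimates the true
    # Jaccard similarity of the two shingle sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

sig_a = minhash_signature("the quick brown fox jumps over the lazy dog")
sig_b = minhash_signature("the quick brown fox jumped over the lazy dog")
sig_c = minhash_signature("an entirely unrelated sentence")
print(estimate_jaccard(sig_a, sig_b))  # high: the sentences differ by one word
print(estimate_jaccard(sig_a, sig_c))  # low: few shared character 2-grams
```

Two documents that share most of their shingles get high signature agreement, so comparing 100-element signatures replaces comparing full documents.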

snapy provides a Python implementation.

Step 1: compute document similarities and filter near-duplicate pairs

from snapy import MinHash, LSH

# `content` is a list of document strings and `labels` a parallel list of
# unique document IDs; both must be defined before this point.
seed = 3

# Create MinHash signatures from character 2-grams
# (100 permutations, 64-bit hash values).
minhash = MinHash(content, n_gram_type='char', n_gram=2, permutations=100, hash_bits=64, seed=seed)

# Create LSH model (the 100 permutations are split into 50 bands).
lsh = LSH(minhash, labels, no_of_bands=50)

# Query near duplicates of the first document,
# keeping matches with Jaccard similarity >= 0.5.
print(lsh.query(labels[0], min_jaccard=0.5))

# Returns an edge list of (label_a, label_b, jaccard) tuples
# for use creating a weighted graph.
edge_list = lsh.edge_list(min_jaccard=0.6, jaccard_weighted=True)
print(len(edge_list))
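Why 50 bands? With b bands of r rows each, LSH flags two documents with Jaccard similarity s as candidates with probability 1 - (1 - s^r)^b; here 100 permutations over 50 bands gives r = 2. A quick illustrative calculation (my own helper, not part of snapy):

```python
def candidate_probability(s, bands=50, rows=2):
    """Probability that two documents with Jaccard similarity s share at
    least one LSH bucket, given `bands` bands of `rows` rows each."""
    return 1 - (1 - s ** rows) ** bands

for s in (0.2, 0.5, 0.8):
    print(f"s = {s}: {candidate_probability(s):.3f}")
```

With only 2 rows per band the curve rises early, so even moderately similar pairs become candidates; the `min_jaccard` threshold in `query` and `edge_list` then prunes the weak ones.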

Step 2: compute the set of non-duplicate documents

from collections import defaultdict

# Group near-duplicate documents into blocks (connected components of the
# edge list). Each edge e is (label_a, label_b[, jaccard]).
block_list = defaultdict(set)
seen_ids = {}
for e in edge_list:
    key0, key1 = seen_ids.get(e[0]), seen_ids.get(e[1])
    if key0 is None and key1 is None:
        # Neither endpoint seen yet: start a new block keyed by the first label.
        block_key = e[0]
    elif key0 is not None and key1 is not None and key0 != key1:
        # Endpoints already sit in two different blocks: merge them into one.
        block_list[key0] |= block_list.pop(key1)
        for label in block_list[key0]:
            seen_ids[label] = key0
        block_key = key0
    else:
        # Exactly one endpoint seen (or both in the same block): reuse its block.
        block_key = key0 if key0 is not None else key1
    block_list[block_key].add(e[0])
    block_list[block_key].add(e[1])
    seen_ids[e[0]] = block_key
    seen_ids[e[1]] = block_key
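The blocks then give the deduplicated set: keep one representative per block, plus every document that never matched an edge. A minimal self-contained sketch (the labels and edge list are toy data of my own; snapy's `edge_list` would produce the real input):

```python
from collections import defaultdict

# Toy edge list in the (label_a, label_b, jaccard) shape that
# lsh.edge_list(..., jaccard_weighted=True) returns.
edge_list = [("d1", "d2", 0.9), ("d2", "d3", 0.8), ("d5", "d6", 0.7)]
all_labels = ["d1", "d2", "d3", "d4", "d5", "d6"]  # d4 has no near duplicate

# Group edges into blocks of near duplicates.
block_list = defaultdict(set)
seen_ids = {}
for a, b, _ in edge_list:
    if a not in seen_ids and b not in seen_ids:
        block_key = a
    else:
        block_key = seen_ids.get(a, seen_ids.get(b))
    block_list[block_key] |= {a, b}
    seen_ids[a] = seen_ids[b] = block_key

# One representative per duplicate block, plus all unmatched documents.
keep = {min(block) for block in block_list.values()}
keep |= {label for label in all_labels if label not in seen_ids}
print(sorted(keep))  # → ['d1', 'd4', 'd5']
```

Here {d1, d2, d3} and {d5, d6} collapse to one document each, while d4 survives untouched.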

Permalink: http://57km.cc/post/minhash/LSH-ji-suan-ju-bu-ha-xi-gei-wen-dang-qu-zhong.html

-- EOF --
