
Similarity between files

An overview of the similarity matching feature provided by Reasonal, and how we approach this problem.

In today’s blog post, we will briefly discuss the problem of content matching, give an overview of our approach, and explain how we try to keep all your content up to date. We also dive a little deeper into a couple of the techniques that inspired our approach, namely MinHashing and Locality Sensitive Hashing.



Technical solutions for similarity matching

In this blog post, we focus on the following two methods for similarity matching:

Unique Token Similarity

In this branch of matching, we attempt to find the unique tokens in your corpus, be they names, dates, numbers, or links, and we then create links between documents that contain these tokens. The set of the rarest unique tokens in each document can be used as an identifier for that document, connecting it with other documents that share the same identity.
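As a rough illustration of this idea (a sketch only, not our production implementation; the rarest_tokens helper, the cutoff of five tokens, and the toy documents are all choices made for this example), you could count how often each token appears across the corpus, keep each document's rarest tokens as its identifier, and link documents whose identifiers overlap:

from collections import Counter
from itertools import combinations

def rarest_tokens(doc_tokens, corpus_counts, k=5):
    # sort a document's distinct tokens by how rarely they appear in the corpus
    return set(sorted(set(doc_tokens), key=lambda t: corpus_counts[t])[:k])

corpus = {
    "doc_A": "quarterly report 2021 acme corporation revenue".split(),
    "doc_B": "acme corporation 2021 revenue summary".split(),
    "doc_C": "weekly grocery list milk eggs bread".split(),
}

# token frequencies across the whole corpus
corpus_counts = Counter(t for tokens in corpus.values() for t in tokens)

identifiers = {name: rarest_tokens(tokens, corpus_counts) for name, tokens in corpus.items()}

# link documents whose identifiers overlap
for (a, ids_a), (b, ids_b) in combinations(identifiers.items(), 2):
    if ids_a & ids_b:
        print(f"link {a} <-> {b} via {ids_a & ids_b}")

In this toy corpus, only doc_A and doc_B share rare tokens, so they are the only pair that gets linked.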

Overlapping Content Similarity

In this version of similarity measuring, we look at the content of each file in a broader fashion. Here we are interested in matching larger blocks of content across documents, since near-identical pieces of content are a good indicator of similarity between documents and can produce valid links.



Difficulties in finding matching content

It is generally important to perform content matching with the following two requirements in mind:

  • You have to make sure that your similarity measures are accurate enough: you want to maximize the number of correct matches (True Positives) while minimizing the number of missed matches (False Negatives), so that you are notified exactly when you need to be.
  • Your algorithms must perform this content matching at a near-instant pace, so that when you edit a document, you are immediately notified of other documents you may want to update, if necessary.

Trying to find matching content across an entire corpus of documents can be a tricky task, and doing so in an efficient and timely manner even more so. In the most trivial approach, you would have to compare the entire content of each pair of documents within your corpus:

# naive approach: compare every document against every remaining document
while corpus:
    doc_A = corpus.pop()
    for doc_B in corpus:
        # perform one of the content comparison methods here
        compare_content(doc_A, doc_B)

The brute-force approach explodes with the number of documents in your corpus: if you have a corpus of size $N$ and an average document length of $M$, you will have $N(N-1)/2$ pairs of documents, with up to $M^2$ comparisons each (depending on the similarity measure used). Instead, you must turn to probabilistic methods to get the number of comparisons down to a manageable value; this is where MinHashing and LSH come in.
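For a sense of scale, here is a quick back-of-the-envelope sketch (the corpus sizes are arbitrary):

# number of document pairs the naive approach has to compare: N(N-1)/2
def pair_count(n):
    return n * (n - 1) // 2

for n in (1_000, 10_000, 100_000):
    print(f"{n:,} documents -> {pair_count(n):,} pairs")

# 1,000 documents -> 499,500 pairs
# 10,000 documents -> 49,995,000 pairs
# 100,000 documents -> 4,999,950,000 pairs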

MinHashing and Locality Sensitive Hashing (LSH)

Through MinHashing, you can turn documents into relatively small sets of numbers that allow you to perform comparison operations much faster. First, you represent your document as a set of “shingles” -- each containing a fixed number of consecutive words; for example, if you have a document with the following content:

“What a wonderful document with ground breaking information.”

Your set of shingles with size 5 would be:

doc_A_shingles = [
	"What a wonderful document with",
	"a wonderful document with ground",
	"wonderful document with ground breaking",
	"document with ground breaking information",
]
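The shingles above were written out by hand; in practice you would generate them with a small helper like the sketch below (the shingle function is just an illustration, and punctuation handling is glossed over so that the output matches the list above):

def shingle(text, size=5):
    # split into words and slide a window of `size` words across the text
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]

# produces the same four shingles as the hand-written list above
doc_A_shingles = shingle("What a wonderful document with ground breaking information")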

These shingles give you a more comprehensive representation of each document than simply taking the set of words it contains. Now, if you wanted to find the similarity between this document and any other document, the basic approach would be to take the set of shingles for each document and compute the Jaccard similarity over these sets:

set_a = set(doc_A_shingles)
set_b = set(doc_B_shingles)

# Jaccard similarity: size of the intersection divided by size of the union
jaccard = len(set_a.intersection(set_b)) / len(set_a.union(set_b))

This, however, is a very inefficient approach that would require you to perform many set comparisons; on sets of strings and over a large corpus, this is not feasible at all. It would be great if you could somehow cut down the number of shingles in each document to a set number, say 128, while simultaneously transforming your shingles into a numerical representation for faster comparison operations.

This can all be done with the aid of MinHashing. In MinHashing, a hash function is applied to each shingle, turning it into an integer, so that in the case of a 500-word document with a shingle size of 5 you will have $500 - 5 + 1 = 496$ integers (one per complete 5-word window) representing the document instead of 496 five-word strings. Then, a single value out of those is selected; this is where the “min” comes in: you simply take the minimum of all the integers.

import hashlib
import struct

def sha_hash(shingle):
    # hash the shingle with SHA-1, then unpack the first 4 bytes of the digest
    # using struct.unpack(); "<I" means little-endian unsigned integer (4 bytes)
    return struct.unpack("<I", hashlib.sha1(shingle.encode("utf-8")).digest()[:4])[0]

doc_A_hashes = map(sha_hash, doc_A_shingles)

minimum_hash = min(doc_A_hashes)

Now you have a value for your first hash, but that is hardly enough to represent an entire document; clearly, a lot of information is lost here. If you want a document representation of size 128, the solution is to do this for another 127(!) hash functions.

In reality, however, you do not need 127 different hashing functions with different hashing algorithms. Instead, you can use a proper hashing algorithm for the first hash, and then, for each of the other hash functions, simply XOR every shingle’s hash value with a different pseudo-random value of the same length and take the minimum again. If your proper hash function produces a 32-bit value, you will only need to generate 127 32-bit values, one for each remaining “hashing function”. Once you have your list of XOR values, all you have to do is repeat the same steps you did for the first hash, so that you end up with 128 integers to represent your document.

import random

def min_hash(content, number_of_hashes):
    # generate the list of pseudo-random 32-bit values (seeded for reproducibility)
    random.seed(1)
    xor_bits_list = [random.getrandbits(32) for _ in range(number_of_hashes - 1)]

    # hash every shingle once using SHA
    hashes = [sha_hash(shingle) for shingle in content]

    # the smallest SHA hash is the first integer in the MinHash
    minhash = [min(hashes)]

    # for each remaining "hash function", XOR every shingle's hash and take the minimum
    for xor_bits in xor_bits_list:
        minhash.append(min(h ^ xor_bits for h in hashes))

    return minhash

doc_A_minhash = min_hash(doc_A_shingles, 128)

Note: A good hash function will not show any obvious pattern in the values it produces. For example, roughly 50% of strings starting with the letter “a” will have a higher hash value than a given string starting with the letter “b”; the hash value is not related to alphabetic ordering. So selecting the shingle with the minimum hash value is a more or less random choice. This is discussed in more depth in this blog post.
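A quick way to see this for yourself, reusing the sha_hash helper defined above (the example words are arbitrary):

for word in ["apple", "apricot", "avocado", "banana", "blueberry", "cherry"]:
    print(word, sha_hash(word))

# each hash behaves like a pseudo-random 32-bit integer, so the printed values
# bear no relation to the alphabetical ordering of the words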

By changing the way you represent your documents, you have already made comparisons across your corpus considerably more efficient. Nevertheless, you still need to go over each document pair and perform comparisons, which is still too costly. What you are looking for now is a way to limit the number of document pairs you compare, so that you only compare documents with a high likelihood of similarity.

To this end, we turn to Locality Sensitive Hashing (LSH). LSH lets you reduce the number of compared document pairs significantly while keeping a high degree of confidence that the comparisons you do perform are the right ones. An LSH index is divided into a set of bands and rows, whose product must equal the number of hashes used in the MinHashing. Each document’s MinHash is then distributed over the LSH bands and rows, with each row holding one hash value. Any two documents will be presented as a possible match if their rows are identical in a certain number of bands (this number is determined by the LSH threshold). Here’s an illustrated example of an LSH index with 4 bands and 2 rows, for MinHashes of size 8:

[Figure: two MinHashes of size 8 distributed over an LSH index with 4 bands and 2 rows]

Here you can see documents A and B with their MinHashes of size 8 (a single digit for each hash, for simplicity), and how they are distributed across the LSH bands and rows. You can also see that the two documents match in bands 2 and 4, meaning that this LSH index would present them as a candidate pair for similarity evaluation. With LSH you are able to significantly reduce the number of documents you compare while maintaining a high level of confidence in the validity of the selected pairs; for a much more comprehensive discussion of how LSH provides candidates with high confidence, please refer to Section 3.4 in Chapter 3 of the Mining of Massive Datasets book.
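To make the banding step concrete, here is a minimal sketch (not our production code) that splits MinHashes into bands and reports documents colliding in at least one band, a simplified version of the candidate selection described above. The example MinHashes are made up so that the two documents collide in bands 2 and 4, mirroring the illustration:

from collections import defaultdict

def band_keys(minhash, bands, rows):
    # split the MinHash into `bands` groups of `rows` consecutive values;
    # each group becomes a hashable key for that band
    assert len(minhash) == bands * rows
    return [tuple(minhash[b * rows:(b + 1) * rows]) for b in range(bands)]

# toy MinHashes of size 8 (made-up single-digit values, as in the illustration)
minhashes = {
    "doc_A": [3, 7, 1, 4, 9, 2, 5, 8],
    "doc_B": [6, 2, 1, 4, 7, 3, 5, 8],
}

# bucket documents per band: identical rows within a band land in the same bucket
buckets = defaultdict(set)
for name, mh in minhashes.items():
    for band_index, key in enumerate(band_keys(mh, bands=4, rows=2)):
        buckets[(band_index, key)].add(name)

# any bucket holding more than one document yields candidate pairs
candidates = {frozenset(docs) for docs in buckets.values() if len(docs) > 1}
print(candidates)  # doc_A and doc_B collide (they share bands 2 and 4)

In practice you would likely reach for an existing implementation, such as the MinHashLSH index in the datasketch library, but the bucketing mechanics are the same.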

Through MinHashing and LSH, your matching algorithm can stay very fast while providing reliable results that you can turn into instant insights in your application.



Summary

We hope that you find some of our insights and how-tos useful, and that these articles help you better understand our services and product. If you are interested in a deeper, more technical explanation of MinHashing and LSH, we strongly encourage you to read through Chapter 3 of “Mining of Massive Datasets” by J. Leskovec, A. Rajaraman, and J. Ullman (2014). doi:10.1017/CBO9781139924801.004

📬 If you found this blog interesting, subscribe to our newsletter, where you'll receive some exclusive content on productivity, AI in work tools, and news about Reasonal.

Author

Ghaith Bilbeisi
