Myers Diff for Tab Management: Part 2

2024-09-16

In Part 1 of this article, I discussed why I was using Myers diffing algorithm for an project I'm working on. The project is to create a chrome extension tab manager based on Git version control and UNIX commands. Now, I will be breaking down the algorithm to make it easy to understand.

The original paper was written by Eugene W. Myers. It can be found here

Introduction

Before we get into the algorithm, let me re-introduce the problem quickly. Skip ahead if you've read part 1.

Imagine I have two strings of text A and B. We can pick random strings to assign these variables for the sake of illustration:

A = "CAT" B = "BAT"

The "shortest edit script" is the minimum number of insertions and deletions you would need to derive B from A.

The shortest edit script to turn A into B would look like this:

DELETE C
INSERT B
KEEP A
KEEP T

This problem appears, for example in UNIX's diff utility. In order to compare two files and see their differences, we must first calculate the shortest edit script to ensure that all the lines match.

In my version controlled tab manager, I use it to store the minimum number of changes needed to arrive from one "state" to another. A state in this context, simply consists of an array of tab URLs and metadata from the current window. Each state represents a version of a repository (workspace).

Instead of storing each line, I do not store the KEEPs, but rather store the indices of each addition and deletion. Then, I can apply the additions and deletions on one state, to get the other.

This theoretically requires less memory than storing all the information from each state. It is especially helpful because you can store an original state, as well as edit scripts for each subsequent state. The very last state can be derived by applying these edit scripts one after the other, while only needing to store one original state, and the necessary edits. Rather than repeatedly storing the tabs from previous states.

The LCS (Longest Common Subsequence) is the longest sequence of characters that appear in two strings in the same order, but not necessarily contiguously. In the above example, the LCS of A and B is "AT". If we were to compare A to "COAT", the longest subsequence would be "CAT".

The shortest edit script is the one that has the longest common subsequence, as it would require the least amount of edits to derive the second string.

A Graph Problem

Myers diff solves a graph problem in order to get the shortest edit script.

Imagine a grid: Grid

This grid will be used to calculate the shortest edit script for the inputs A = "CAT" and B = "COAL".

We can represent an insertion of the next character from B to original string A using a downwards move. This is because the y axis represents the characters in B. By making a path down the y axis, we are incrementing the index in B to represent that the character has been consumed for an insertion.

We can represent a deletion of the next character from A as a right-move on the grid. The x axis represents the characters in A. Making a move to the right signifies that we have skips a character from A, and will do so because it is not present in B.

A keep can be represented as a diagonal move in the grid (consuming from both strings). The light blue dotted lines in the image represents possible diagonals. Anytime there is a diagonal, we assume that its cost is 0, since it requires no changes and is part of the LCS. The character from B is already present in A (it matches), therefore, we don't need to make any changes.

We can start at the coordinate pair (x,y) = (0,0). Our goal is to find the shortest path to the point at the bottom right, (x,y) = (3,4). This is because (3, 4) consumes all characters from both strings.

Now that we've defined a start point, an end point, as well as the significance of each traversal direction, we can define a method of finding the shortest path from (0,0) to (3,4).

We can find this shortest path using a BFS (Breadth first search) from (0,0) to (3, 4). In this BFS, we can represent the current iteration we are at with the variable D. If there is a match between two characters, we should traverse this path before exploring the other unvisited neighbors. That way, diagonals are explored before we explore the other unvisited neighbors. We must traverse diagonals to their end before incrementing D because diagonals represent matches, and therefore require no edits.

Step 1: Step 1

Step 2: Step 2

Step 3: Step 3

Step 4: Step 4

By now, we have reached the bottom right of the graph. A shortest path obtained through the BFS traversal should be (from (0, 0)):

diagonal
down
diagonal
down
right

This can be translated to: (the cursor indicates the change affected by each step)

D=1:
KEEP "C" (both sequences) -> C AT
                             ^
D=1:
INSERT "O" (from B) -> CO AT
                        ^
D=2:
KEEP "A" (both sequences) -> COA T
                               ^
D=2:
INSERT "L" (from B) -> COAL T
                          ^
D=3:
DELETE "T" (from A) -> COAL
                           ^

This script can transform "CAT" into "COAL". It inserts "O" after "C". It also replaces "T" with "L" by using a deletion followed by an insertion. This results in D=3 edits.

The standard Myers diff algorithm runs in O((N+M)D) time complexity, where N and M are the lengths of the sequences, and D is the length of the shortest edit script.

However, in the worst case it is O((N+M)^2) time complexity when the sequences are completely different. This worst case is unlikely in typical use cases because we usually compare similar states with minimal changes.

The space complexity is obviously O(MN) because we must store the graph. Although, in the paper, Myers writes about a modified version which uses linear space.

In my tab manager, instead of comparing characters in strings, I compare tab objects by their URLs and metadata. The same algorithm applies - we're just finding the shortest sequence of tab additions and removals to transform one workspace state into another.

Please tip here