## CodeFights Solves It: findSubstrings

The Interview Practice Task of the Week was findSubstrings. This is an Uber interview question, and their technical interviews are notoriously lengthy and difficult. And the solution implements tries, a data structure that you’ll see a lot in interview questions. It’s a double whammy!

You know the drill by now: If you haven’t solved this challenge yet, hop on over to CodeFights and do it! Then come back here and we’ll walk through solving it together. We’ll start with a naive implementation and then work on optimizing it.

…Done? Okay!

## Technical interview question

The goal of this Uber interview question is to find the longest substring that appears in a given list `parts`

in a set of `words`

. You’ll return the list of words, with the substrings enclosed in square brackets. If there is a tie for the longest substring, you’ll mark the one that appears earliest in the input string.

On a word-by-word basis, if our list is `parts = ["An","a","men", "ing", "ie","mel","me"]`

then we would indicate substrings in the following way:

- “word” becomes “word” (there is no substring of “word” in
`parts`

) - “anticipating” becomes “anticipat[ing]” (the substrings “a” and “ing” both appear, but “ing” is longer)
- “interest” becomes “interest”
- “metro” becomes “[me]tro”
- “melrose” becomes “[mel]rose”
- “melting” becomes “[mel]ting” (“mel” and “ing” are the same length, but “mel” appears first in the word)
- “ingoltsmelt” becomes “[ing]oltsmelt”

Our function, `findSubstrings`

, should take a list of strings `words`

and another list of strings `parts`

, and return the list of words with the longest substring indicated. For example, with `words = ["word", "anticipating", "ingolt", "melting"]`

, and `parts`

defined as before we would get:

1 2 |
['word', 'anticipat[ing]', '[ing]olt', '[mel]ting'] |

## Naive solution

Let’s try approaching this problem directly. This would mean going through our list of words, one at a time, and then checking each substring in `parts`

to see if it appears in our word. We will keep track of the longest part that occurs earliest in the string.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
def findSubstrings(words, parts): # We will keep all the final strings here output = [] for w in words: # for each word, reset the longest fragment found, and what the match was longest = 0 match = "" position = len(w) # look at each part for p in parts: # is the part in there? if p in w: # make sure this part is strictly longer than the part already found, # OR if it is a tie, make sure that it appears earlier in the word w if len(p) > longest or (len(p) == longest and w.index(p) < position): longest = len(p) match = p position = w.index(p) # if we found something, put in the square brackets around the match, and # copy the word into our output list. If there is no match, copy the # original word to the output list if longest > 0: loc = w.index(match) output.append( w[:loc] + "[" + match + "]" + w[loc+len(match):]) else: output.append(w) return output |

We can try running this function on our example, and it works. So we have a running implementation.

But is it good enough to pass the CodeFights time tests, or will it take too long to execute? We submit it and… it **passes**!

(I’m going to add an important caveat here and say that it passes *at the time when I wrote this*. Our engineers add new tests to challenges fairly frequently, based on user feedback and challenge results, so there’s a chance that when you read this my naive solution might not pass anymore. But the explanation below of its run time is still really useful, so don’t just skip to the optimized solution!)

### What’s the run time?

We should take a moment to think about the running time of this code. If we call *W* the number of words, and *P* the number of parts, then we can tell that the running time is at least *O(WP)* from our nested loop. With a high-level language like Python, we also want to look for hidden loops in function calls. Inside the `p`

loop, we call `if p in w`

and `w.index(p)`

multiple times. These functions are scanning the word `w`

to look for `p`

, and will have a running time of *O(len(W))*. We could speed up the function a little bit by calling `w.index(p)`

once and saving the result, but we still have a running time that looks like *O(WPN)*, where *N* is the length of a typical word in `w`

.

So how bad is *O(WPN)*? From the constraints page:

* We have at most 5000 words, so in the worst-case scenario `W = 5000`

;

* Each word is at most 30 characters, so the worst case is `N = 30`

;

* We have at most 5000 substrings in `parts`

, so the worst-case is `P = 5000`

.

Putting this together, this gives us **750 million loops**. This means each loop would have to run in about 5 picoseconds to be done in 4 seconds! Put another way, we would have to have PetaHz processors for the worst case scenario.

Can we argue that we solved the challenge anyway, and that the CodeFights test showed us that the cases we will be seeing aren’t the worst possible cases? Yes, you can absolutely argue that! When companies are hiring, they want people who can write code quickly to solve *today’s* problems. A working solution today is better than a perfect solution in a month!

But you should make sure you demonstrate to the interviewer that you know how to think about algorithmic complexity. Microsoft famously had an **exponential** algorithm for determining which software patches were needed for Windows XP, but it was working fast enough for 10 years(!), and no one realized that the algorithm was exponential.

So you can tell the interviewer that you know this is an algorithm with three multiplicative factors (**length of a typical word**, **number of words**, and **number of substrings**). You should explain that you pass the tests, but that your algorithm won’t handle the worst case in reasonable time. Go on to explain that this isn’t an exponential algorithm, so getting more powerful computers or waiting a little longer isn’t unreasonable.

An interviewer may ask you to write a faster algorithm anyway, so if you **really** want to impress her you can preemptively mention that you think there are better algorithms that you would use if there was a good business case for spending the extra time on this. Remember that premature optimization is the root of all evil. (Maybe that’s a little over-dramatic… but it’s still bad!)

## Giving it our best Trie

Instead of the naive implementation above, we will use a special type of tree data structure called a **trie** to store the substrings in `parts`

. Like all trees in programming, a trie has a root node. Every other node keeps track of a letter, and getting to this point in the trie completes a substring from `parts`

. Taking our example of `parts = ["An","a","men", "ing", "ie","mel","me"]`

from earlier, we can think about the trie pictorially as:

Note there are 7 red letters in the picture of the trie, which correspond to the 7 words we have in `parts`

. Every leaf is a red letter, but the `e`

below the `m`

is also red to indicate that `me`

is in `parts`

.

To check if a string `s`

is in `parts`

, we start at the root and move to the first level in the trie that matches the first letter of `s`

. If we are trying to match `me`

, for example, we would start by moving from the root to `m`

, and then from `m`

to `e`

. Since we end up on a red node, we know that our word is on the list.

To see if `in`

is a member of `parts`

, we would start at the root and follow the path to `i`

, then to `n`

. Since `n`

is not red, we haven’t completed a string from `parts`

, so we would conclude that `in`

is **not** a member of `parts`

.

If we want to see whether `dog`

was in `parts`

, we would see there is no path from the root to a `d`

, so we could report right away that `dog`

isn’t in `parts`

. We will be scanning a word character-by-character, and instead of having to search through all of `parts`

, we only have to search through the subset of `parts`

that matches what we have already seen.

You have seen this data structure in action before – in fact, every time you see auto-complete working every time you use a browser or text!

### Implementing it in code

Now we have to translate our picture into code. We will make a class `TrieNode`

to store all the information in a node:

1 2 3 4 5 6 7 8 9 |
class TrieNode: def __init__(self, letter): self.letter = letter self.terminal = False # here we store the children nodes of this node # the letter of the children are the key, and the # TrieNode object is the value self.children = {} |

To create a trie with the words “cat”, “cam”, and “cha” and “chat”, we could write the following (tedious) code:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
# This is terrible coding style. # Definitely don't do this in your interview! root = TrieNode("") # level 1 nodes c = TrieNode("c") root.children['c'] = c # level 2 nodes a1 = TrieNode("a") c.children['a'] = a1 h = TrieNode("h") c.children['h'] = c # level 3 nodes t1 = TrieNode("t") # m = TrieNode("m") # these words are terminal! a2 = TrieNode("a") # t1.terminal = True m.terminal = True a2.terminal = True a1.children['t'] = t1 m.children['m'] = m h.children['a'] = a2 # level 4 node t2 = TrieNode("t") # also a terminal node t2.terminal = True a2.children['t'] = t2 |

or, pictorially, we have

Let’s make a function that adds a word to the trie so that we don’t have to do it manually. Our method will be to start at the root, reusing existing nodes along our path (or making new nodes as needed). The code for this is:

1 2 3 4 5 6 7 8 9 10 11 12 |
def addFragmentToTrie(root, fragment): currentNode = root for letter in fragment: if letter not in currentNode.children: # create a new node currentNode.children[letter] = TrieNode(letter) currentNode = currentNode.children[letter] # we are at the final node. # Mark it as terminal currentNode.terminal = True |

Using this function, we can condense the code to create our trie to:

1 2 3 4 5 |
# This is much better, and produces the same trie root = TrieNode("") for p in ["cat","cam","cha","chat"]: addFragmentToTrie(root, p) |

## Using the trie

I am going to move most of the hard work – that is, finding the longest substring and enclosing it in square brackets – to a function `findSubstringInWord(word, root)`

. The function `findSubstrings(words, parts)`

will be responsible for making the trie and collecting all the words together at the end.

Here is our strategy for `findSubstringInWord`

:

- Initialize our
`lenLongSubstr`

to`0`

. - Go through
`word`

one character at a time, using that character as a starting point for matching substrings in`parts`

. - For each character, use the trie to find the longest substring from
`parts`

starting from this character. If it is longer than the longest one we’ve seen so far, record the current position in the string and the current length. - Use the length of the
`longestSeenSubstring`

and its starting position to put the square brackets in the right place.

Note how going through the string in order, and only updating the position when we see a *strictly* longer string, automatically encodes our tie-breaking condition (yay!). The trickiest part is finding the the longest substring from `parts`

starting at a given location, because we need to know the last **terminal** node we passed though before failing to find a match on the trie. It would be a lot easier if we just had to keep track of the last node we were on before we failed to match.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
def findSubstringInWord(w, root): lenLongSubstr, longestPos = 0,0 for start_pos in range(len(w)): # reset to the beginning of the trie currNode = root for position in range(start_pos, len(w)): letter = w[position] if letter not in currNode.children: # we have run out of branches in trie, # so no use looking further down the string # from this starting position. Go back to # the previous loop break currNode = currNode.children[letter] length = position - start_pos + 1 if currNode.terminal and length > lenLongSubstr: lenLongSubstr = length longestPos = start_pos # now we have found where the longest segment starts (longestPos) # and how long it is (longestSeenSubstring). We now have to place # the square brackets if lenLongSubstr == 0: return w end = longestPos + lenLongSubstr return w[:longestPos] + "[" + w[longestPos: end] + "]" + w[end:] |

The function `findSubstrings`

is simple by comparison! Note that it uses our earlier method of constructing the trie directly from `parts`

.

1 2 3 4 5 6 7 8 |
# Not much to this function - it just ties everything else together def findSubstrings(words, parts): # build the trie root = TrieNode('') for p in parts: addFragmentToTrie(root, p) return [findSubstringInWord(w, root) for w in words] |

## Running time for the trie

Our algorithm still looped through each word. The constraint that we haven’t used yet is the length of the each element in `parts`

. The worst case scenario is having to check 5 levels at each position at each word. In the worst case, we have:

- The number of words,
`W = 5000`

; - The number of letters in a word,
`N = 30`

; - The number of levels to check in the trie,
`P = 5`

.

This is smaller than the naive implementation by a factor of 1000.

#### The final code!

Our code was spread over a few functions. Collecting it all together, we have:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 |
class TrieNode: def __init__(self, letter): self.letter = letter self.terminal = False # here we store the children nodes of this node # the letter of the children are the key, and the # TrieNode object is the value self.children = {} def addFragmentToTrie(root, fragment): currentNode = root for letter in fragment: if letter not in currentNode.children: # create a new node currentNode.children[letter] = TrieNode(letter) currentNode = currentNode.children[letter] # we are at the final node. # Mark it as terminal currentNode.terminal = True def findSubstringInWord(w, root): lenLongSubstr, longestPos = 0,0 for start_pos in range(len(w)): # reset to the beginning of the trie currNode = root for position in range(start_pos, len(w)): letter = w[position] if letter not in currNode.children: # we have run out of branches in trie, # so no use looking further down the string # from this starting position. Go back to # previous loop break currNode = currNode.children[letter] length = position - start_pos + 1 if currNode.terminal and length > lenLongSubstr: lenLongSubstr = length longestPos = start_pos # now we have found where the longest segment starts (longestPos) # and how long it is (longestSeenSubstring). # We now have to place the square brackets if lenLongSubstr == 0: return w end = longestPos + lenLongSubstr return w[:longestPos] + "[" + w[longestPos: end] + "]" + w[end:] # Not much to this function # It just ties everything else together def findSubstrings(words, parts): # build the trie root = TrieNode('') for p in parts: addFragmentToTrie(root, p) return [findSubstringInWord(w, root) for w in words] |

We did it! High fives for everyone!

## Tell us…

How did you solve this Uber interview question? Have you ever seen one like it in a coding interview? Let us know on the CodeFights user forum!