
In our CodeFights Explainer series, we break down complex computer science topics so that they’re easy to understand, complete with in-depth examples.

CodeFights Explainer: Trees

Trees. They’re considered to be a computer science basic, but if it’s been a while since you’ve dealt with them – or if you’ve never encountered them before at all – they can be daunting. But tree questions are extremely common in technical interviews, so they’re bound to come up while you’re job hunting. And once you’re on the job, trees are a wildly useful data structure. So let’s get you up to speed on trees!

A tree is a data structure that is particularly useful for representing data that’s sorted or in hierarchical order, and where we want to be able to retrieve data quickly. I’m going to dive into the idea of trees by introducing one of the most common use cases for them: the binary search tree.

Example: An address book app

Suppose we’re designing an address book application. You type in the name of the person you’re looking for, and if they are in your address book the program retrieves some information about them, like their address, email, and telephone number. If they aren’t in your contact book, you’ll get a “User not Found” message. For the moment, let’s just look at the list of names of people in our sorted address book:
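
Amanda, Amy, Colin, Flaka, Gustav, Madison, Naomi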

Suppose you want to know if David is in your contact list. You could scan through the names, one by one, keeping track of whether or not you have seen David yet. If you haven’t seen his name by the time you reach the end of the list, then you can confidently report that he isn’t one of your contacts.

Better yet, since you started with a sorted list, you can stop as soon as you see any name that would appear after David in an alphabetically ordered list. In this particular example, once you reach the 4th element in the list, Flaka, you know that David isn’t on this list.

Even with the improvement of the “stop once you reach an element bigger than the one you are looking for” approach, this algorithm is still O(N), where N is the number of items in our list. If you were checking to see if Zoe was on your list, then you would have to check Every. Single. User. Not cool.

A better approach would be to start in the middle, with Flaka. If you are looking for a name that appears earlier in the alphabet than Flaka (like David) then you know it is in the left half of the list (or it is not there at all). If you are looking for Zoe, you know that it appears on the right half of the list. For either name, you only have to search one half of the list. Let’s focus on the steps you would use to search for David:

  1. Start halfway through the list. The list of contacts has 7 items, so start at the 4th item, Flaka.
  2. Since David appears earlier in the alphabet than Flaka (i.e. as strings, David < Flaka), then you know that if David is in the list at all he is first, second, or third. Pick the middle element again, in this case Amy.
  3. Since Amy < David, if David is in this list he has to be the third element. But the third element is Colin, so David isn’t in this list.
Visualizing the search, list vs. binary search: we look at 3 nodes, making two jumps.

For David, we see that it takes 3 searches to find that he is not on the list, rather than 4. For Zoe, the difference is more impressive, because we can also exclude her name in 3 searches rather than 7! Here’s what searching for Zoe in your contact list would look like:

List vs. binary search, part 2: searching for Zoe.

At each step, we are halving the number of names you have to search through, so the number of searches you have to do to determine if a name is on the list is O(log N), which is a vast improvement over our original O(N)!
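
In code, the halving strategy looks something like this. (A quick Python sketch; the function name in_contacts is just for illustration.)

```python
def in_contacts(names, target):
    """Binary search over a sorted list: repeatedly halve the range we look in."""
    lo, hi = 0, len(names) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if names[mid] == target:
            return True
        elif names[mid] < target:
            lo = mid + 1      # target sorts after the middle name: search the right half
        else:
            hi = mid - 1      # target sorts before the middle name: search the left half
    return False

names = ["Amanda", "Amy", "Colin", "Flaka", "Gustav", "Madison", "Naomi"]
print(in_contacts(names, "David"))  # False, after looking at only 3 names
print(in_contacts(names, "Zoe"))    # False, again after only 3 names
```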

Using a tree

Instead of thinking about the list of contacts as a flat list, let’s try to visualize it the way that our sorting algorithm goes through it:

Visualizing the contacts as a tree.

Regardless of what name n we search for, we will always start at Flaka. If n is Flaka, then we’re done! If we are looking for a name that appears earlier in the alphabet than Flaka, we move to the left; otherwise, we move to the right. We keep moving down the diagram, looking for n, using the convention that if n is smaller than the current name we move to the left, and if it is larger we move to the right. If we reach the names at the bottom of the diagram and still haven’t found n, we can conclude that it isn’t in our contacts.

A tree is a way of implementing this structure directly, rather than imagining it on top of a flat list.

Each box in the diagram is called a node. Nodes keep track of the following information:

Properties of a node

  • children (pointers to other nodes): The children of a node are the nodes it points to, shown connected to it via arrows in the diagram. The node Flaka in our example has two children: Amy and Madison.
  • value (data): This is the data that is actually stored in the node. We are just storing names at the moment, but a more useful address book would store a reference to a person object so that we could also get the email address, phone number, etc.

Continuing the family-tree analogy, we also have the notion of a parent node. Node X is a parent of node Y if node Y is a child of node X. We’ve already stated that the Flaka node has two children: Amy and Madison. This means that the parent node of Amy is Flaka.

There are also special types of nodes in a tree:

The root node is the only node on the tree that doesn’t have a parent, so in this tree the root is Flaka.

Then there are leaf nodes, which are the nodes that don’t have any children. In this case, the leaf nodes are Amanda, Colin, Gustav and Naomi.

Each node also has a level, which is how far the node is from the root node. So Flaka (the root) is at level 1, Amy and Madison are at level 2, and the leaf nodes are all at level 3 in our example.
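
Before we get to the formal rules, here is a minimal Python sketch of a node and of the search procedure described above, built from the contacts in our example. (The class and function names here are just for illustration.)

```python
class Node:
    """A node in our contacts tree: a value plus pointers to its children."""
    def __init__(self, value, left=None, right=None):
        self.value = value    # the data stored in the node (just a name, for now)
        self.left = left      # child holding names earlier in the alphabet
        self.right = right    # child holding names later in the alphabet

# The tree from the diagram: Flaka is the root; Amanda, Colin, Gustav, and Naomi are leaves.
flaka = Node("Flaka",
             left=Node("Amy", left=Node("Amanda"), right=Node("Colin")),
             right=Node("Madison", left=Node("Gustav"), right=Node("Naomi")))

def contains(node, name):
    """Walk down from the root: go left for smaller names, right for larger ones."""
    while node is not None:
        if name == node.value:
            return True
        node = node.left if name < node.value else node.right
    return False

print(contains(flaka, "Gustav"))  # True
print(contains(flaka, "David"))   # False, after looking at just three nodes
```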

For a collection of nodes to be a non-empty tree (as used in computer programming), they need to satisfy 4 conditions:

  1. There needs to be exactly one root node (i.e. one node with no parent).
  2. Each node that is not the root node has only one parent node (i.e. no node is a child of two or more nodes).
  3. You need to be able to reach each node by starting at the root and following a sequence of “child” pointers.
  4. A node at level x cannot have a child at a level less than x.
    (In fact, all children of a node at level x have a level of x+1, because children of a node at level x are one extra step from the root.)

When we draw a tree, we typically draw level 1 (the root) at the top of the page, and the levels increase as we move down toward the leaves. For some reason, computer scientists have decided to draw all of their trees “upside down”, with the root at the top and the leaves at the bottom! Also note that we are jumping between two metaphors when labeling things for trees: we use terms inspired by family trees, such as “parent” and “child”, as well as terms inspired by botanical trees, such as “root” and “leaves”.

To get an idea of what some of the restrictions mean, let’s take a moment to look at two versions of our address book that are not trees. The problem with the data structure on the top is that Colin is a child of Amy and Madison, which breaks rule 2. The problem with the graph on the bottom is that Madison (level 2) is a child of Amanda (level 3).

Two arrangements of the contacts that are not trees.

These rules guarantee that we cannot have cycles while moving along the arrows: either we have no child nodes to visit (i.e. we are at a leaf), or we are guaranteed to be moving to a higher level. Since we are moving along child links, all the nodes we have already seen are at a lower level than the current one.

All of this is pretty cool, but it seems like we’ve gone to a lot of work just to reproduce the O(log N) search of my contacts that I could have done with a flat (ordered) list! But one of the big advantages that the tree has over a flat list is that it is more flexible. Continuing with the address book example, suppose I contact Amanda and Gustav a lot more often than I call any of the other people. That means that every time I open my address book, my phone usually has to look at three nodes in the tree. We can rearrange the tree a little to get the following:

A tree with the nodes rearranged to make the average search quicker.

In the longest searches, my app needs to do four steps now instead of three. But the most frequently searched-for names are now close to the top with the goal of reducing the average search time. Of course, we can also design a (bad) tree that is just a linked list:

A tree that’s acting like a linked list.

The process of finding the best organization of the tree for the expected use of the data is called balancing the tree.

Other uses of trees

We have used trees to order data by value (i.e. sort it) as a way of replacing a list. A different way of using trees is to manage objects that are clustered or grouped by some attribute. Consider the organization of CodeFights Cafe. We have an owner, a head of accounting (Alice), and a store manager (Katherine). We also have some employees who either belong to the accounting department or are managed by Katherine. We might structure a tree for CodeFights Cafe in the following way:

The organizational structure of CodeFights Cafe.

This tree arranges the nodes by responsibility, which means that any person in this chart manages their descendants. So Katherine can tell her assistant manager Fred what to do, as well as give instructions to Claire and Jim. But she can’t give instructions to Bob, for example: even though he sits further down the chart, he’s part of a different department.

It’s clear how this structure is a useful way for us to visualize how the Cafe is organized. It also makes certain types of operations easy to implement recursively. Suppose we wanted to send a message to the entire accounting department. We might have a function tell(node, msg) that uses the information in node to get the contact information for the person and email them the message msg. Suppose each node n has its children stored in a list n.children. We could send a message to n and all of its descendants (all the nodes we can reach from n by following child pointers) with the following function:
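
(A sketch in Python; the helper tell(node, msg) is assumed to do the actual emailing, and each node’s children live in the list node.children, as described above.)

```python
def tellAllDescendants(node, msg):
    """Send msg to the person at this node and to everyone below them in the tree."""
    tell(node, msg)                      # message this person first...
    for child in node.children:          # ...then relay it down each subtree
        tellAllDescendants(child, msg)
```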

If we want to message the entire company “Happy New Year!”, we can simply call tellAllDescendants( owner_node, "Happy New Year"). If we want to tell the service side that we need to bake more cookies, we can call tellAllDescendants( katherine_node, "Pls bake more cookies") and the message will be relayed to Katherine, Fred, Claire, and Jim.

This same organizational structure can answer other questions for us too, such as quickly finding out what the labor costs are in each department. Suppose we want to know the total cost of the salaries on the service side of the business (that is, Katherine’s salary, plus that of all the employees under her). If each node n stored that person’s salary as n.salary, we could write this function:
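
(Another Python sketch, assuming each node stores the person’s salary as node.salary and its children in node.children.)

```python
def sumSalaries(node):
    """Total salary of the person at this node plus all of their descendants."""
    total = node.salary
    for child in node.children:
        total += sumSalaries(child)      # each descendant is reached through exactly one parent
    return total
```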

Note that simply adding node.salary to the salaries of node’s immediate children doesn’t work. When applied to the node for Katherine, that would add her salary to Fred’s, but it would miss the salaries of Claire and Jim. To find the salary for the service side of the business, we can now call sumSalaries(katherine_node). As a bonus, if we wanted to determine the entire salary cost for the business, we could call sumSalaries(owner_node). The fact that every employee node has only one parent ensures that each node is counted exactly once. (You can see that if Claire worked in the accounting department as well, her salary would be counted twice by this code.)

Our sumSalaries code is an example of a depth-first recursive algorithm. When calling sumSalaries(katherine_node), here is what our algorithm is doing:

  • Start processing sumSalaries on the node for Katherine.
    • Before returning, we call sumSalaries on the node for Fred (Katherine’s child).
      • Before returning, we call sumSalaries on the node for Claire (Fred’s child).
        • Claire’s node has no children, so it returns.
      • Before returning, we call sumSalaries on the node for Jim (Fred’s child).
        • Jim’s node has no children, so it returns.
    • We have now summed the salaries of all of Fred’s node’s children, so the call to sumSalaries on the node for Fred returns.
  • We have now finished summing the salaries of all of Katherine’s node’s children, so the call to sumSalaries on the node for Katherine returns.

We had to reach a leaf node before we started having functions returning values. When using depth-first recursive algorithms, we have to be careful that we don’t run out of room on the stack! In this case, we had three function calls “stacked” until we finally started returning answers.

We can also use depth-first recursion to search for a particular node (but we’ll discuss a better approach to this problem below). So far, I have been taking it for granted that I have a variable katherine_node lying around. What if I only have the root, owner_node, and I want to find the total salary of the service side of the business? Suppose nodes stored the department’s name as node.department. One way we could do this is to do a depth-first search, stopping the first time we find a department with the right name:
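
(A Python sketch; the function name findDepartmentDFS is just for illustration.)

```python
def findDepartmentDFS(node, department):
    """Depth-first search: return the first node whose department matches, or None."""
    if node.department == department:
        return node                      # stop as soon as we find a match
    for child in node.children:
        found = findDepartmentDFS(child, department)
        if found is not None:
            return found
    return None

# Hypothetical usage, assuming the service side is labelled "service":
# sumSalaries(findDepartmentDFS(owner_node, "service"))
```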

Using depth-first search (DFS) doesn’t make much sense in this example. If we start searching by moving from the owner node to the node for Alice, DFS will keep searching through the entire accounting department before giving up. Since we are looking for the highest-ranking person in the service department, it would make more sense to search the tree level by level. This is called breadth-first search (BFS), and is implemented below:
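
(A Python sketch using collections.deque as the queue; findDepartmentBFS is again just an illustrative name.)

```python
from collections import deque

def findDepartmentBFS(root, department):
    """Breadth-first search: visit nodes level by level, returning the first match or None."""
    queue = deque([root])
    while queue:
        node = queue.popleft()           # take the next node in level order
        if node.department == department:
            return node
        queue.extend(node.children)      # its children wait at the back of the queue
    return None
```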

The question of which type of search strategy, DFS or BFS, is better depends on the structure of the tree, and what information you are looking for. (I’m going to cover this issue in depth in a future article – stay tuned!)

Summary

We’ve seen that when we have ordered data, trees can give us an O(log N) way of searching for it. We’ve also seen that trees are more flexible than simple arrays, because we can balance the tree.

Trees are particularly useful when:

  1. You have data that is sorted in some way, and you want to do a lot of searches on it.
  2. You have data that is “grouped” in some way, such as the organizational tree for CodeFights Cafe. File systems are a common hierarchy, where the nodes are either files or directories, and node X is a child of node Y if node X is contained in Y. In this example, all ordinary files are leaves, every parent is a directory, and / is the root. A web-based example is HTML’s document object model, which is tree-based. Front end developers even talk about “child elements” and “parent elements”!

We’ve seen that with hierarchical data, we can still search through it. We can iterate through the entire tree to get information from each node (commonly by going either depth-first or breadth-first), while ensuring that we visit each node exactly once. This process is called tree traversal. We looked closely at a related topic, which was searching in a hierarchical tree. This is like tree traversal, except we can stop whenever we find what we are looking for, so we no longer guarantee that every node will be looked at.

Ready to tackle some trees?

As I mentioned earlier, tree questions are extremely common in technical interviews. You’ll definitely want to practice them while you’re looking for jobs and preparing for interviews! But you don’t have to hunt for good questions. CodeFights Interview Practice has you covered! When you’ve solved all of these questions, take a look through the solutions that other CodeFighters have posted. It’s always really instructive to see how other users have solved interview questions – you’ll definitely learn something new!

Tell us…

Have you encountered any tree questions in interviews? How did you tackle them? Head on over to the CodeFights forum and let us know!

CodeFights Explainer: Prefix Sums

Prefix sums, or cumulative sums, allow us to quickly find the sum of any contiguous slice of an array.

As a quick example, suppose you copy down some driving directions. They tell you to drive down Penny Lane for 3 miles, then make a left onto Abbey Road, which you travel down for 5 miles before continuing on Lake Shore Drive for 4 miles, and so on. We convert these directions into the number of miles driven on each road:
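
distances = [3, 5, 4, 6, 2]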

The prefix sum for the distances is:
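
prefix_distances = [3, 8, 12, 18, 20]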

This tells us that it took 3 miles to get to the end of the first direction (Penny Lane), and 8 miles in total to get to the end of the second direction (Abbey Road). If we want to know how long it took to get from the end of Abbey Road (mile 8 of our trip) to the end of Sunset Blvd (mile 18), we do one subtraction on prefix_distances rather than two additions on distances:
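
prefix_distances[3] - prefix_distances[1] = 18 - 8 = 10 miles (versus adding distances[2] + distances[3] = 4 + 6)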

A useful example of prefix sums would be calculating a moving average of an array, which is designed to remove periodic fluctuations in data. For example, if we knew the amount of money brought in by sales of cookies at the CodeFights Cafe per day, we might get something like:
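
cookies = [4, 5, 8, 10, 12, 0, 0, 5, ...]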

We might guess there is fluctuation based on the day of the week. (Maybe it’s in an area that slows down over the weekend.) The seven day moving average would contain the average of the first seven values (4,5,8,10,12,0,0), while the next value would be the overlapping average of the next seven values (5,8,10,12,0,0,5). The moving average is:
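
moving_ave ≈ [5.57, 5.71, ...]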

In order to calculate the moving_ave we can start with the prefix sums:
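
prefix_cookies = [4, 9, 17, 27, 39, 39, 39, 44, ...]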

To get the number of cookies sold in the first seven days, we can look at prefix_cookies[6]. To get the total number of cookies sold from day 2 to day 8, we can calculate prefix_cookies[7] - prefix_cookies[0]. Once we have the prefix sums, calculating the total number of cookies sold in any seven-day period becomes trivial. Once we know the number of cookies sold in a week, we can divide by seven to find the average.

The nice thing about the prefix sums approach to moving averages is that the method is agnostic about the period used to average over. If CodeFights Cafe found that there was a monthly cycle to their cookie sales, it would be easy to use the same array prefix_cookies to average over 30 days instead.
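
As a quick sketch of the whole pipeline in Python (using only the eight daily totals quoted above; the helper names are just for illustration):

```python
cookies = [4, 5, 8, 10, 12, 0, 0, 5]        # the first eight daily totals quoted above

# prefix_cookies[i] is the total number of cookies sold on days 0..i.
prefix_cookies = []
running = 0
for sold in cookies:
    running += sold
    prefix_cookies.append(running)

def window_sum(prefix, start, end):
    """Sum of the original values from index start to index end, inclusive."""
    return prefix[end] - (prefix[start - 1] if start > 0 else 0)

# A seven-day moving average is one subtraction and one division per window.
window = 7
moving_ave = [window_sum(prefix_cookies, i, i + window - 1) / window
              for i in range(len(cookies) - window + 1)]

print(prefix_cookies)   # [4, 9, 17, 27, 39, 39, 39, 44]
print(moving_ave)       # [5.571..., 5.714...]
```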

Other applications of prefix sums are:

  • Calculating the average value of blocks of pixels (useful in noise removal).

  • Calculating whether one point is visible from another, given an array of heights (called the line of sight problem).

  • Finding cumulative distributions (for example, working out what percentage of income the top 1% of earners make from an array of incomes).

Maybe more impressive is the fact that prefix sums can be performed in parallel. This leads to some really useful algorithmic tricks.

Parallelizing prefix sums

If we are trying to calculate the prefix sum of an array L, we can give a linear-time algorithm. To calculate prefix[n], simply take:
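
prefix[n] = prefix[n-1] + L[n], with prefix[0] = L[0]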

This algorithm seems simple and fast, but it is also clear that because prefix[n] depends on prefix[n-1] the problem seems embarrassingly serial. That is, it doesn’t seem like we can take a huge array and split it in two, and have each computer calculate half the prefix sum, and then easily join the results together in a way that saves time.

The amazing thing is that there is a parallel algorithm for prefix sum! The method involves passing through the array twice:

  1. Bottom-up
    Takes the array and builds a binary tree from the array, making pairwise sums along the way.
  2. Top-down
    Takes the array of sums, and determines the prefix sums.

Let’s look at how this works with our array of distances, [3,5,4,6,2].

First we build a tree, where each element of the array is a leaf at the bottom. Each node keeps track of the index range it covers (to help us put results from different processes together), the sum of all the elements in that index range, and one other attribute that we’ll ignore for now.

Prefix Sums Figure 1
Notice that the sum stored in each node can be obtained by adding the sums of its two children. This means that we can split the job across different machines, and then combine the results at the end. What we end up with at the top node (the root node) is the sum of all the elements in the array.

To get the prefix sums, we will define left for a node with index range [a,b) to be the sum of all the elements of the array with an index less than a. In other words, this is the sum of all the elements that appear to the left of the first element included in this node. We start at the root node and make our way down the tree.

Prefix Sums Figure 2
To convince ourselves that we can really construct the left attribute this way, we will concentrate on the red boxed square. Moving from the parent (index = [0,4)) to the left child is easy: since the left child has the same lower limit, it just copies the parent’s left value (0 in this case, because these are the elements that start at the beginning of the array). Moving to the right child (index = [2,4)) takes a little more thought. The right child knows that the indices in [2,4) add up to 10 by looking at its own sum attribute. Knowing that the sum in the parent is 18, we can deduce that the sum of all elements with indices less than 2 must be 18 - 10 = 8. (In general, the right child’s left value is the parent’s left value plus the parent’s sum minus the right child’s sum.)

To get the prefix sums, we take the left value of every leaf except the first (whose left value is trivially zero), together with the sum of the entire array.

Prefix Sums Figure 3
So our prefix_distances are found to be:
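
prefix_distances = [3, 8, 12, 18, 20]

To see the two passes end to end, here is a small Python sketch. (It runs serially here, though each subtree could be handled by a different worker; the split point is a simple midpoint, which differs slightly from the pairwise grouping in the figures but computes the same result.)

```python
def build(values, a, b):
    """Bottom-up pass: build a tree of sums over the index range [a, b)."""
    if b - a == 1:
        return {"range": (a, b), "sum": values[a], "left": 0, "children": []}
    mid = (a + b + 1) // 2                       # each half could go to a different worker
    lo, hi = build(values, a, mid), build(values, mid, b)
    return {"range": (a, b), "sum": lo["sum"] + hi["sum"], "left": 0, "children": [lo, hi]}

def fill_left(node, left):
    """Top-down pass: left = sum of all array elements before this node's range."""
    node["left"] = left
    if node["children"]:
        lo, hi = node["children"]
        fill_left(lo, left)                      # left child starts at the same index
        fill_left(hi, left + lo["sum"])          # right child: its sibling's elements come first

def prefix_sums(values):
    root = build(values, 0, len(values))
    fill_left(root, 0)
    result = []
    def collect(node):                           # read off the leaves from left to right
        if not node["children"]:
            result.append(node["left"] + node["sum"])
        else:
            for child in node["children"]:
                collect(child)
    collect(root)
    return result

print(prefix_sums([3, 5, 4, 6, 2]))              # [3, 8, 12, 18, 20]
```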

Key points

  • Prefix sums are interesting in their own right: by precomputing results, they allow you to rapidly calculate the sum of any contiguous slice of an array.

  • The parallel algorithm demonstrates a common pattern in computer science: breaking a seemingly serial task into one that can be parallelized. It can be generalized: build up a binary tree, then move down the tree from the top, separating off the contribution from the “left hand side” of the tree.

Tell us…

Have you ever encountered a problem in an interview that you solved using prefix sums? Or better yet, encountered one in real life? Let us know over on the CodeFights forum!