Tag: database

Introducing the CodeFights Arcade Database World

Introducing the CodeFights Arcade Database World

The ability to access, understand, and manipulate data is extremely important in the current engineering job market. Employers are increasingly listing SQL as a “required skill” in job listings, whether the job is a back-end engineering job or not.

Data scientists, data analysts, researchers, and many others also need to be able to access information that’s stored in relational databases. No matter what your job title is, the ability to dive deep into an organized collection of data to answer questions, uncover patterns, and discover information is incredibly useful!

But unless you already work with a relational database, whether at a job or at school, how do you practice writing efficient and accurate queries?

Database World

At CodeFights, we wanted to give people a fun, easy way to practice writing queries and manipulating data. That’s why we created Database World, our latest addition to the CodeFights Arcade!

Database World has over 80+ tasks spread out over 16 levels. The questions start out easy… But they ramp up in difficulty as you progress. This means you’ll never get bored as you practice!

Database World is a unique way of learning database skills that’s both educational and entertaining. While you write more and more queries, you’ll start to know what to look for and how to pinpoint the information you need. You’ll get faster and have less hesitation. In other words, writing amazing SQL queries will become second nature for you. You’re going to be able to leverage this into being highly competitive in the job market!

Back up – what’s the Arcade?

There are five different worlds in the CodeFights Arcade. Each world has tons of levels and tasks to complete that help you master a particular concept.

  • Start tackling tasks and get familiar with Arcade in the The Intro.
  • Get up to speed on programming fundamentals in The Core.
  • Python World focuses on helping you master programming in Python.
  • Solve graph theory questions in Graphs World.
  • And now, practice your SQL and database skills with Database World!

Get started!

Are you ready to become a database guru? If so, get started on Database World with projectList, the first task in the introductory level!

 

CodeFights Solves It: hostnamesOrdering

CodeFights Solves It: hostnamesOrdering

The Challenge of the Week that ended on March 21 was hostnamesOrdering. Only 72 CodeFighters submitted solutions – this was a tricky one! In this breakdown, I’m going to walk you through my thought process in writing a solution.

In this database challenge, the task is to take a table of numeric ids and hostnames, and return a sorted version of the table. The challenge is that we have to order the hostnames by the reverse hostnameThe hostname is a string of domains separated by periods, such as codefights.com or www.github.com. The domains don’t include any protocols (such as http://), nor do they include any paths (such as codefights.com/forums). The general form of a hostname is domain1.domain2. … .domainN. The reversed hostname is domainN. … .domain2.domain1. For example:

Hostname...has the reverse hostname
www.codefights.comcom.codefights.www
dev.mysql.comcom.mysql.dev
web.mit.eduedu.mit.web
codepen.ioio.codepen

Note that the list above is ordered by the reverse hostname. We are told that there are at most 3 domains in any given hostname that appears in this challenge.

Getting started

Let’s start with an example of the table hostnames:

idhostname
1workbench.mysql.com
2codefights.slack.com
3codefights.com
4snarknews.info
5sololearn.com
6dev.mysql.com
7codepen.io
8com

Let’s suppose for a moment that we have a way of generating the reversed hostnames, which will actually be the hardest part of the challenge.

idhostnamereversed
1workbench.mysql.comcom.mysql.workbench
2codefights.slack.comcom.slack.codefights
3codefights.comcom.codefights
4snarknews.infoinfo.snarknews
5sololearn.comcom.sololearn
6dev.mysql.comcom.mysql.dev
7codepen.ioio.codepen
8comcom

The query we want to return is:

which would return:

idhostname
8com
3codefights.com
6dev.mysql.com
1workbench.mysql.com
2codefights.slack.com
5solonews.com
4snarknews.info
7codepen.io

Reversing the domain name

We won’t actually construct an explicit reversed column. Instead, we will generate the pieces of the domain we want on the fly. To do this, we will be using the SUBSTRING_INDEX function. From the documentation, we see that SUBSTRING_INDEX( string, delim, count) returns a substring.

If count is positive, SUBSTRING_INDEX( string, delim, count) returns the substring from the beginning to the countth occurrence of delim. For example:

If count is negative, SUBSTRING_INDEX finds the |count|th occurrence of delim from the right, and returns the substring from there to the end of the string. For example:

Using the same example we’ve been working with, here is a query that splits the hostname into three pieces. Using SUBSTRING_INDEX, we can extract the top level domain, the midlevel domain, and the lowest level of domain just by changing the index value.

This would work if every domain has exactly three pieces. Unfortunately, some domains only have one or two pieces. What is actually returned is:

idhostnametopmid
3codefights.comcomcodefights.com
8comcomcom
6dev.mysql.comcommysql.com
1workbench.mysql.comcommysql.com
2codefights.slack.comcomslack.com
5solonews.comcomsolonews.com
4snarknews.infoinfosnarknews.info
7codepen.ioiocodepen.io

The first two rows are in the wrong order. They match at the top level (i.e., they are both .com domains). The com domain has no lower level, but the SUBSTRING_INDEX just returns the entire string com again. Since codefights.com is earlier in the alphabet than com, the table puts codefights.com first. What should happen is that codefights should be compared to the empty string, and com should appear first.

One trick to fix this problem is to prepend the hostnames with ‘…’, guaranteeing that there are three periods in each domain. So our query is now:

Writing down just the first four rows, this query returns:

idhostnametopmid
8comcom.com
3codefights.comcomcodefights.com
6dev.mysql.comcommysql.com
1workbench.mysql.comcommysql.com
…….……..…….……..

This has everything in the right order!

A problem and a workaround

In principle, we just have to eliminate the top and middle column, so we are only reporting the id and hostname. Before making this change, we can try running this in CodeFights. When we do it, we run into a problem: the CodeFights MySQL engine returns workbench.mysql.com before dev.mysql.com!

This seems to be a bug; running MySQL on a local machine returns dev.mysql.com before workbench.mysql.com. However, ordering by the hostname on CodeFights returns the opposite. When using ORDER BY top, mid, hostname, workbench.mysql.com appears first. This seems wrong, since both these hostnames have identical top and mids, so the ordering should be determined by hostname.

Part of being a programmer is working around issues in any software! The problem we are encountering comes from doing multiple orderings. Our workaround will be to concatenate the top, middle, and hostname into one string. We will separate the top, middle, and hostname by a space, because spaces cannot occur in a domain and spaces appear earlier in ASCII than any letter.

So our workaround is:

This orders dev.mysql.com and workbench.mysql.com correctly.

Eliminating columns

Now we need to eliminate the top and mid columns. We could do a subquery, but instead we will actually construct the top and the middle strings as part of the ORDER BY clause.

Our solution ends up looking like this:

This solution only selects the desired columns. Notice that in the ORDER BY we have a single string, made from the “top” level domain, a space, the “middle level” domain, a space, and then the hostname. This gets around the implementation bug, and gets us a solution with 188 characters.

On a local MySQL database, we don’t need the final concatenation, and we could use multiple ordering to get:

Going further

As a code golfing exercise, we have some compact solutions. As a coding exercise, you might be interested in trying to solve the general case where we don’t have a limit on the maximum number of domains in a hostname!

This takes us back to our original solution:

where the interesting exercise is to create a custom function reverse_hostname. If you get stuck, this StackOverflow post might be useful.

Tell us…

How did you tackle this challenge? What do you think about the approach I took in solving it? Let us hear from you in the CodeFights forum!

CodeFights Solves It: pastEvents

CodeFights Solves It: pastEvents

A little while back, our Challenge  of the Week was a tricky database challenge called pastEvents. 281 of our awesome CodeFighters solved it! I’ve created a walkthrough of the problem and a few different tactics for finding a solution. Here’s the problem, in case you need a refresher:

During the most recent social event you suddenly realized that you had forgotten your USB drive on one of the previous events. You are pretty sure that you had your flash drive with you just last week, which means that it was probably lost during one of the events of the last 7 days. So, you would like to find all the events you attended during this period.

And the output should be:

A table of all names and dates of events within 7 days of the most recent event, not including the most recent event, order from most to least recent.

The problem gives us a database with the table events, as well as a sample dataset:

idnameevent_date
1TGIF2016-11-07
2TGIF2016-11-28
3Weekly meeting2016-12-01
4Weekly meeting2016-12-02
5Client meeting2015-12-03

How would we construct the solution by hand?

  1. We would look for the most recent date, in this case the event with id 4 is the most recent with an event_date of 2016-12-02.
  2. We would then SELECT all dates that occur within 7 days before 2016-12-02. This would select the events with ids 2 and 3.
  3. Finally we would sort the returned events based on event_date with ORDER BY.

With this strategy in mind, let’s start putting the pieces together!

Step 1: Find the most recent event

For this, we need to find the maximum value of the event_date column:

The time taken for this query will scale linearly with the number of events we have stored, which is about the best we can hope for. After all, to find the maximum we have to look at the date of every single record!

Here is a different approach that also selects the most recent date:

This approach is to take all the dates, order them from latest to earliest, and then select the first one off the list. This approach isn’t as good, because it isn’t as clear what the code is trying to do. If you are reviewing this code, you have to get to the very end of the line to realize that we’re only picking one value. You have to carefully think if you want to order the list in ASCending or DESCending order. A naive implementation of this query would also be slower than our first attempt, because first the entire list would need to be sorted before the most recent event was selected (MySQL is almost certainly smart enough to optimize this away). Finally, if our goal is to get a short solution, this line is less clear and more verbose than our attempt based off the max function.

Now we need to write a query that uses the latest date and selects events in the 7 days prior to our event. We can either use our query above as a subquery, or store the results as a variable and use that. We will start by using the variable @recent, as it leads to more readable code, and then optimize once we’re done.

Our solution so far looks like this:

Step 2: Select events within 7 days of @recent

Given an event_date, we need to determine if it was within 7 days of @recent. MySQL provides 60 functions for dealing with date calculations. Some of the options relevant to us are:

DATE_ADD(event_date,INTERVAL 7 DAY) > @recentTrue if event_date + 7 is earlier than @recent.
Does include most recent event.
DATE_SUB(@recent, INTERVAL 7 DAY) <= event_dateTrue if event_date is 7 or fewer days earlier than @recent.
Does include most recent event.
DATEDIFF(@recent,event_date) <= 7Same as DATE_SUB, but does subtraction in days (which saves some characters).
Does include most recent event.

All of these include the most recent event. We can throw out the most recent event by adding another DATEDIFF, or by using BETWEEN to ensure that our DATEDIFF is also bigger than 1. Because we are assured that there is at most one event per day, eliminating the most recent one can be achieved with event_date != @recent.

We now add a SELECT with a WHERE clause to get the events within 7 days, so our code now takes the form:

There is a really sneaky way of eliminating the extra clause to see if the event is the most recent event. We are interested in selecting events where

 

Instead of using subtraction, what do we know about the reciprocal of date differences? We know that

 

 

We get the lower limit by @recent being 7 days ahead of event_date. If the difference between the dates is 8 days, then 1/(@recent-event_date) is . The cool part is the upper limit is achieved if @recent and event_date are one day apart. If we look at the case where @recent and event_date are the same day, then 1/(@recent-event_date) is undefined, so MySQL will return false for any comparison. This means we don’t have to worry about the upper limit, and instead we only care if

 

 

Writing this another way:

 

 

Our shorter (but conceptually harder to understand) solution is

Step 3: Order the output

Once we’ve gotten this far, it’s simply a case of adding an ORDER BY clause to the end. Finally we have something that passes the test cases and the hidden tests! Let’s call it a solution with 171 characters:

Refinement 1: Subclause

It’s clear that we could get shorter code by naming @recent something like @r. We can also eliminate some characters by running a subclause to get the maximum date:

This code JOINS every event with the single value from the subclause SELECT MAX(event_date) as r FROM Events, which is just the variable we stored in @recent in our previous example. We want to be careful with JOINS, because the number of rows in the returned table is the product of the rows in the two input tables. In this case, the derived table recent only has one row, so we only have to sort through the same number of events as we started with.

Refinement 2: Change >= to >; eliminate BEGIN and END

We can also eliminate a character by noting that DATEDIFF has to return an integer, so requiring 7/DATEDIFF(r,event_date) >= 1 gives the same dates as 8/DATEDIFF(r,event_date) > 1. MySQL doesn’t require BEGIN and END for procedures, so we can eliminate those too. This gets us to:

The shortest solution

The very short solutions to this challenge took a very different approach from the one I just demonstrated. Here’s the code submitted by CodeFighter gennadi_m:

The first step is taking the Event table, and taking the cartesian product (i.e. OUTER JOIN) with itself. In our case, this gives us a 25 row table:

e.ide.namee.event_datev.idv.namev.event_date
1TGIF2016-11-071TGIF2016-11-07
1TGIF2016-11-072TGIF2016-11-28
1TGIF2016-11-073Weekly meeting2016-12-01
1TGIF2016-11-074Weekly meeting2016-12-02
1TGIF2016-11-075Client meeting2015-12-03
2TGIF2016-11-281TGIF2016-11-07
..................

The columns are grouped by “column 2” in the output, e.event_date. The group by is only important when we are using an aggregate function, which is this case is max. The first group on this table is the group e.event_date = 2016-11-07, for example. Out of this group, we ask for the maximum v.event_date, which is 2016-12-02. We can only use aggregate properties, so our table becomes:

e.ide.namee.event_datemax(v.event_date)
1TGIF2016-11-072016-12-02
2TGIF2016-11-282016-12-02
3Weekly meeting2016-12-012016-12-02
4Weekly meeting2016-12-022016-12-02
5Client meeting2015-12-032016-12-02

The combination of using a self-join, grouping on event date,  and max aggregation is just a complicated way of adding the most recent date to every row.

While the query is a lot smaller, joining a table with itself squares the number of entries in it. MySQL might be able to optimize your query in some cases, but you want to be careful before trying this trick in production code!

Tell us…

How did you tackle this challenge? What do you think about the approach I took in solving it, and the refinements I added? Let us hear from you in the CodeFights forum!