Hash Code 2015 selection round

The Google Hash Code 2015 selection round was yesterday. This post summarizes the problem of the contest, and the approach followed by my team (ENS Ulm 1), providing code for a simplified version of our approach that achieves a score of 388 with less than 100 lines of code.

You can read the problem statements in French and English and the input file. In brief, the task was a knapsack variant where you had to assign servers to positions in a grid, and then assign the servers to groups to maximize a certain function that favors balanced assignments.

The decision problem associated to the task (is it possible to achieve score n given the set of constraints) is NP-complete. It is in NP because one can represent solutions to the problem in polynomial space, and test in polynomial time whether a solution achieves the required score and satisfies the input constraints. Hence, an NP algorithm to solve the problem is to guess a solution, check that it is indeed a solution, and check that it achieves the required score. It is NP-hard by a reduction from the NP-hard partition problem. Consider a multiset of numbers that is the input to the partition problem. Create an instance of the Hash Code problem by making this the multiset of the capacities of servers, with all servers having size one. Now, say you have only two rows, a number of columns that is larger than the number of servers, no forbidden squares, and one single group to create. The maximal possible capacity (half of the total capacity) can be achieved if and only if the original multiset of numbers can be partitioned in two sets with equal sums (the set of servers in the first row, and the set of servers in the second row; clearly it is never useful to not use a server in this case, because all can fit on the grid). This reduction from the partition problem to (a very restricted case of) the Hash Code problem shows that it is NP-hard.

Of course, this hardness proof is not relevant in practice, because the contest task was to solve the problem on one specific instance. Hence, in the four hours or so that the contest lasted, the idea was to try heuristics to reach the best possible solution on that specific instance.

My team (ENS Ulm 1) reached the top position one hour after the start of the contest, and stayed there for almost two hours until other teams caught up with us. So we found a fairly good basic solution quickly but then we didn't optimize it very well. (By contrast, we were beaten by teams that had a worse initial solution but better incremental optimizations.) The goal of this post is to explain the basic strategy that we coded in one hour. I give an explanation of the strategy in English, and then a cleaned-up version of the code for it (less than 100 lines!). As this is a simplification of our solution, its final score is 388 (and not 400 as we finally achieved); we conclude by the additional ideas that are not shown in the code.

Strategy

The solution is entirely greedy, meaning that it never changes its mind about the choices that it makes, and does all choices in sequence, in a certain order, and optimizing certain criteria.

We try to place each server sequentially. However, we consider them according to a smart (greedy) order: we sort them by decreasing capacity per cell. In other words, we try to place first the servers that give the most capacity divided by the number of cells that they use up. In case of ties, we sort servers by decreasing size, because when packing it is more efficient to pack larger items first.

For each server, we greedily assign it to a group by choosing the group with lowest guaranteed capacity. This means that we compute the guaranteed capacity of each group: the capacity that is guaranteed to remain for any row deletion, which is just the total capacity of servers assigned to the group minus the maximum over rows of the capacity of servers of this row assigned to the group. Once the group is chosen, we try to find room for the server on the grid. If we can, we will place it there and assign it to the chosen group.

To place the server, we consider all rows in the following order: we sort them by ascending capacity for the chosen group. The intuition is that we would rather put the server on a row where the existing capacity for this group is as small as possible, to balance the capacity of the group on all rows and minimize the impact of the row deletion. Then, in every row considered in that order, we try to place the server greedily by finding the first sequence of contiguous free cells of sufficient size. We place the server whenever we find a suitable location for it, otherwise we choose not to use the server. We then continue with the next server.

To make this faster, we maintain the following information: the table of used cells (initialized with the forbidden cells and updated as we place servers), the sum of the capacities assigned to each group, and the sum of the capacities assigned to each group on each row.

Code

The code is written in C++ for speed, in the terse and monolithic style of programming contests. It is optimized for programmer time and running time, not at all for readability or reusability. However, I went over the code to clean it up, ensure that variables names were logical (if succinct), and add a few critical comments. This code produces a solution with score 388, runs instantaneously, and is about 100 lines long.

#include <algorithm>
#include <cstdio>
#include <vector>

using namespace std;

// generous bound on all possible inputs and sizes
#define MAXN 1002

struct Server {
  int id, z, c;
  Server(int id = 0, int z = 0, int c = 0) : id(id), z(z), c(c) {}
  // sort servers by decreasing c/z then decreasing z
  bool operator< (const Server &s) const {
    // a/b < c/d iff da < bc and (e, f) < (g, h) with f,h < N iff eN+f < gN+h
    return z*s.c*MAXN + s.z < s.z*c*MAXN + z;
  }
};

// declaring those globally means we don't have to initialize them
int R, S, U, P, M;
vector<Server> serv;
char grid[MAXN][MAXN]; // filled grid cells, indexed by row, col
int capa[MAXN]; // capa[g]: total capacity allocated to group g
int gcapa[MAXN][MAXN]; // gcapa[r][g]: total capacity for group g in row r
vector<pair<int, pair<int, int> > > out; // group, row, col

int main(int argc, char **argv) {
  scanf("%d%d%d%d%d", &R, &S, &U, &P, &M);
  for (int i = 0; i < U; i++) {
    int r, s;
    scanf("%d%d", &r, &s);
    grid[r][s] = 1; // grid cell is occupied
  }
  for (int i = 0; i < M; i++) {
    int z, c;
    scanf("%d%d", &z, &c);
    serv.push_back(Server(i, z, c));
    out.push_back(make_pair(-1, make_pair(-1, -1))); // server is unused
  }

  sort(serv.begin(), serv.end());

  for (int s = 0; s < M; s++) {
    // place server s
    int guar[MAXN]; // guar[g]: guaranteed capa for group g
    for (int g = 0; g < P; g++)
      guar[g] = capa[g];
    for (int r = 0; r < R; r++)
      for (int g = 0; g < P; g++)
        guar[g] = min(guar[g], capa[g] - gcapa[r][g]); // capa if killing row r
    int mguar = guar[0], mg = 0; // min and argmin of guar over all groups
    for (int g = 0; g < P; g++)
      if (guar[g] < mguar) {
        mguar = guar[g];
        mg = g;
      }
    // mg is the group with lowest guar, assign s to mg if we can place it
    vector<pair<int, int> > rows;
    for (int r = 0; r < R; r++)
      rows.push_back(make_pair<int, int>(gcapa[r][mg], r));
    sort(rows.begin(), rows.end()); // sort rows by inc capa for group mg
    int fr = -1, fc = -1; // row and col to place s
    for (int rr = 0; rr < R; rr++) {
      int contig = 0; // count contiguous free cells
      fr = rows[rr].second;
      for (int c = 0; c < S; c++) {
        if (!grid[fr][c])
          contig++;
        else
          contig = 0;
        if (contig == serv[s].z) {
          fc = c - (serv[s].z - 1); // we can place s at (fr, fc)
          rr = R; // break out of both loops
          break;
        }
      }
    }
    if (fc >= 0) {
      // assign s to mg and place it at (fr, fc)
      capa[mg] += serv[s].c;
      gcapa[fr][mg] += serv[s].c;
      for (int d = 0; d < serv[s].z; d++)
        grid[fr][fc+d] = 1; // squares occupied by s
      out[serv[s].id] = make_pair(mg, make_pair(fr, fc));
    }
  }

  for (int i= 0 ; i < M; i++)
    if (out[i].first >= 0)
      printf("%d %d %d\n", out[i].first,
          out[i].second.first, out[i].second.second);
    else
      printf("x\n");

  return 0;
}

Extensions

To improve the performance of our solution by 12 points, we changed a few additional things:

Once the list of servers was sorted, we kept the top of that list with the servers whose total size filled all the free space in the grid, and, assuming optimistically that we would be able to place those servers, we sorted them by decreasing size to help the packing.
When considering the rows when choosing where to place a server, we broke ties on the assigned capacity by examining first the rows with the most free space left.
We did random unprincipled tweaks which happened to make our solution better by one or two points.
The last point (from 399 to 400) was gained by a brute-force incremental optimization that tried to perform random swaps.

Our brute-force incremental optimization was not very effective. I think that teams that beat us had a much more efficient approach.