File servers | sagenhaft

Since the last Haskell cicada: Electromagnet Pulse was so much fun, let’s do another one from the same chapter from “Algorithm Design” by Jon Kleinberg and Éva Tardos.

Algorithm Design by Jon Kleinberg, Éva Tardos

"Algorithm Design takes a fresh approach to the algorithms course, introducing algorithmic ideas through the real-world problems that motivate them. In a clear, direct style, Jon Kleinberg and Eva Tardos teach students to analyze and define problems for themselves, and from this to recognize which ...

The exercise is Exercise 12 on page 323:

Suppose we want to replicate a file over a collection of $n$ servers, labeled $S_1, S_2, \ldots, S_n$. To place a copy of the file at server $S_i$ results in a placement cost of $c_i$, for an integer $c_i > 0$. Now, if a user requests the file from server $S_i$, and no copy of the file is present at $S_i$, then the servers $S_{i+1}, S_{i+2}, S_{i+3} \ldots$ are searched in order until a copy of the file is finally found, say at server $S_j$, where $j > i$. This results in an access cost of $j − i$. (Note that the lower-indexed servers $S_{i−1}, S_{i−2}, \ldots$ are not consulted in this search.) The access cost is $0$ if $S_i$ holds a copy of the file. We will require that a copy of the file be placed at server $S_n$, so that all such searches will terminate, at the latest, at $S_n$. We’d like to place copies of the files at the servers so as to minimize the sum of placement and access costs. Formally, we say that a configuration is a choice, for each server $S_i$ with $i = 1, 2, \ldots, n − 1$, of whether to place a copy of the file at $S_i$ or not. (Recall that a copy is always placed at $S_n$.) The total cost of a configuration is the sum of all placement costs for servers with a copy of the file, plus the sum of all access costs associated with all $n$ servers. Give a polynomial-time algorithm to find a configuration of minimum total cost.

The key observation is that lower-indexed servers are not consulted when server $S_i$ is being asked for the file and the file is not there. That gives us a certain independence and direction, namely to the right and it lets us decompose the problem into independent subproblems that are optimal (because if not, they can be replaced with optimal ones without perturbing what happens further right in a configuration).

Let $R_n$ be the cost of the optimal configuration which has the file placed on server $S_n$. Then we have $n$ choices where we place the file left to the server $S_n$ with no other placement between them: we can place it at $S_{n - k}$ for $k = 1, \ldots, n$ (with the convention that $k = n$ and $S_0$ means that there is no placement other than at $S_n$).

Let $C(n,k)$ be the cost when the next placement left of server $S_n$ is server $S_{n-k}$. Then

$$ C(n, k) = c_n + R_{n-k} + \sum_{i = n - k + 1}^{n - 1} (n - i) $$

The sum simplifies with a change in variable

$$ \sum_{i = n - k + 1}^{n - 1} (n - i) = \sum_{j = 1}^{k - 1} (n - (n - k + j)) = \sum_{j = 1}^{k - 1} (k - j) $$

and then one more time

$$ \sum_{j = 1}^{k - 1} (k - j) = \sum_{l = 1}^{k - 1} l = \frac{k (k - 1)}{2} $$

$$ C(n, k) = c_n + R_{n-k} + \frac{k (k - 1)}{2} $$

Then

$$ R_n = min_{k = 1}^n \{C(n, k)\} = min_{k = 1}^n \{c_n + R_{n-k} + \frac{k (k - 1)}{2}\} $$

which comes to

$$ R_n = c_n + min_{k = 1}^n \{R_{n-k} + \frac{k (k - 1)}{2}\} $$

We set $R_0 = 0$ and then we have a recursion fit for dynamic programming. We leave the Haskell implementation as an exercise to the reader.