<h1>The BiMax algorithm</h1>
<p>Kemal Eren, 2013-09-12</p>
<div class="section" id="introduction">
<h2>Introduction</h2>
<p>The focus of this post is BiMax, a method developed to serve as a
baseline in a comparison of biclustering algorithms <a class="footnote-reference" href="#id2" id="id1">[1]</a>. Because
BiMax was developed as a reference method, its objective is
intentionally simple: it finds all biclusters consisting entirely of
1s in a binary matrix.</p>
<p>Specifically, BiMax enumerates all inclusion-maximal biclusters, which
are biclusters of all 1s to which no row or column can be added
without introducing 0s. More formally, an inclusion-maximal bicluster
of <span class="math">\(m \times n\)</span>
binary matrix <span class="math">\(M\)</span>
is a set of rows and a
set of columns <span class="math">\((R, C)\)</span>
, with <span class="math">\(R \in 2^{\{1, \dots, m \}}\)</span>
and <span class="math">\(C \in 2^{\{1, \dots, n\}}\)</span>
such that:</p>
<ul class="simple">
<li><span class="math">\(\forall i \in R, \forall j \in C : M[i, j] = 1\)</span>
</li>
<li>for any other bicluster <span class="math">\((R', C')\)</span>
that meets the first
condition, <span class="math">\((R \subseteq R' \land C \subseteq C') \implies (R
= R' \land C = C')\)</span>
</li>
</ul>
<p>BiMax only works with binary matrices, but of course non-binary data
can be converted to binary data in a number of ways. The simplest
method is to threshold it at some value <span class="math">\(t\)</span>
, where <span class="math">\(t\)</span>
could be the mean, median, some other quantile, or any arbitrary
threshold. Of course, more sophisticated binarizing methods may also
be used.</p>
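<p>For instance, thresholding at the median is a one-liner with NumPy. This is only a sketch; the choice of threshold and of library is an assumption, not part of BiMax itself:</p>

```python
import numpy as np

# Toy real-valued data matrix.
data = np.array([[0.2, 0.9, 0.4],
                 [0.8, 0.1, 0.7],
                 [0.5, 0.6, 0.3]])

# Threshold at the median: entries above t become 1, the rest 0.
t = np.median(data)
binary = (data > t).astype(int)
```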
<p>Now let us see how the algorithm works.</p>
</div>
<div class="section" id="method">
<h2>Method</h2>
<p>BiMax uses a recursive divide and conquer strategy to enumerate all
the biclusters in a <span class="math">\(m \times n\)</span>
matrix <span class="math">\(M\)</span>
. To illustrate
the process we will use the following data matrix, taken from the
original paper:</p>
<img alt="./static/images/bimax_orig.png" src="./static/images/bimax_orig.png" style="width: 45%;" />
<p>The process begins by choosing any row containing a mixture of 0s and
1s. In this case, every row fits this criterion. (If no such row
exists, then clearly either all entries of <span class="math">\(M\)</span>
are 1, in which
case the entire matrix is a single bicluster, or all its entries are
0, in which case it has no biclusters.) We arbitrarily choose the
first row. The chosen row <span class="math">\(r^*\)</span>
will be used to divide <span class="math">\(M\)</span>
into two submatrices, each of which can be processed independently.</p>
<p>To find these submatrices, first the columns <span class="math">\(C = \{1, \dots,
n\}\)</span>
are divided into two sets: those for which row <span class="math">\(r^*\)</span>
is 1,
and those for which it is 0.</p>
<ul class="simple">
<li><span class="math">\(C_U = \{c : M[r^*, c] = 1\}\)</span>
</li>
<li><span class="math">\(C_V = C - C_U\)</span>
</li>
</ul>
<p>Next the <span class="math">\(m\)</span>
rows of <span class="math">\(M\)</span>
are divided into three sets:</p>
<ul class="simple">
<li><span class="math">\(R_U\)</span>
: rows with 1s only in <span class="math">\(C_U\)</span>
</li>
<li><span class="math">\(R_W\)</span>
: rows with 1s in both <span class="math">\(C_U\)</span>
and <span class="math">\(C_V\)</span>
</li>
<li><span class="math">\(R_V\)</span>
: rows with 1s only in <span class="math">\(C_V\)</span>
</li>
</ul>
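<p>A minimal NumPy sketch of this partition step (the helper name and layout are mine, not the original implementation's):</p>

```python
import numpy as np

def partition(M, r_star):
    """Split columns by the chosen row, then group rows by where their 1s fall."""
    CU = np.where(M[r_star] == 1)[0]      # columns where row r* is 1
    CV = np.where(M[r_star] == 0)[0]      # columns where row r* is 0
    in_U = M[:, CU].any(axis=1)           # rows with at least one 1 in C_U
    in_V = M[:, CV].any(axis=1)           # rows with at least one 1 in C_V
    RU = np.where(in_U & ~in_V)[0]        # 1s only in C_U
    RW = np.where(in_U & in_V)[0]         # 1s in both column sets
    RV = np.where(~in_U & in_V)[0]        # 1s only in C_V
    return CU, CV, RU, RW, RV
```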
<p>After rearranging the rows and columns of <span class="math">\(M\)</span>
to make each set
contiguous, the matrix looks like this:</p>
<img alt="./static/images/bimax_rearranged.png" src="./static/images/bimax_rearranged.png" style="width: 50%;" />
<p>This image makes it clear how to divide the matrix for the recursive
step. The submatrix formed by <span class="math">\((R_U, C_V)\)</span>
is empty and cannot
contain any biclusters. The submatrix <span class="math">\(U = (R_U \cup R_W, C_U)\)</span>
,
shown in blue in the next figure, and the submatrix <span class="math">\(V = (R_W
\cup R_V, C_U \cup C_V)\)</span>
, shown in red, contain all possible
biclusters in <span class="math">\(M\)</span>
. The procedure continues by recursively
processing <span class="math">\(U\)</span>
and then <span class="math">\(V\)</span>.</p>
<img alt="./static/images/bimax_u_v.png" src="./static/images/bimax_u_v.png" style="width: 50%;" />
<p>There is one problem yet to address: because <span class="math">\(U\)</span>
and <span class="math">\(V\)</span>
overlap, it is possible that a bicluster in <span class="math">\(U\)</span>
has some rows in
<span class="math">\(V\)</span>
. Then that part of the bicluster would be a maximal
bicluster in <span class="math">\(V\)</span>
but not in <span class="math">\(M\)</span>
. To avoid this error,
BiMax requires that any bicluster found in <span class="math">\(V\)</span>
must contain at
least one column from each column set in <span class="math">\(Z\)</span>
, where <span class="math">\(Z\)</span>
is
the set of all <span class="math">\(C_V\)</span>
column sets in the current call stack.</p>
<p>The full algorithm is available in the paper’s <a class="reference external" href="http://www.tik.ee.ethz.ch/sop/bimax/SupplementaryMaterial/supplement.pdf">supplementary material</a>.</p>
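<p>To make the objective concrete, here is a brute-force enumeration of inclusion-maximal all-1s biclusters. This is not the BiMax recursion, only a naive check that is feasible for tiny matrices:</p>

```python
from itertools import combinations
import numpy as np

def inclusion_maximal_biclusters(M):
    """Brute force over row subsets; only feasible for very small matrices."""
    m, n = M.shape
    found = []
    for k in range(1, m + 1):
        for rows in combinations(range(m), k):
            # Columns that are 1 in every chosen row.
            cols = tuple(int(c) for c in np.where(M[list(rows)].all(axis=0))[0])
            if cols:
                found.append((rows, cols))
    # Discard biclusters strictly contained in another one.
    return sorted((R, C) for (R, C) in found
                  if not any(set(R) <= set(R2) and set(C) <= set(C2)
                             and (R, C) != (R2, C2)
                             for (R2, C2) in found))
```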
</div>
<div class="section" id="example">
<h2>Example</h2>
<p>I have implemented the BiMax algorithm in Python. However, even after
optimizing it and rewriting parts in Cython, mine is so much slower
than the <a class="reference external" href="http://www.tik.ee.ethz.ch/sop/bimax/">original implementation</a> that it will probably not be
added to scikit-learn. For now I have made <a class="reference external" href="https://gist.github.com/kemaleren/6547145">the pure Python version</a> available.</p>
<p>Here is a simple example of BiMax in action. Since the pure Python
version is slow, here we demonstrate it with a small artificial dataset.</p>
<div class="highlight"><pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">matplotlib</span> <span class="kn">import</span> <span class="n">pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">from</span> <span class="nn">bimax</span> <span class="kn">import</span> <span class="n">BiMax</span>
<span class="n">generator</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">RandomState</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">generator</span><span class="o">.</span><span class="n">binomial</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="mi">20</span><span class="p">))</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">BiMax</span><span class="p">()</span>
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="c"># get largest bicluster</span>
<span class="n">idx</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">rows_</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span> <span class="o">*</span> <span class="n">model</span><span class="o">.</span><span class="n">columns_</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">rows_</span><span class="p">))))</span>
<span class="n">bc</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">outer</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">rows_</span><span class="p">[</span><span class="n">idx</span><span class="p">],</span> <span class="n">model</span><span class="o">.</span><span class="n">columns_</span><span class="p">[</span><span class="n">idx</span><span class="p">])</span>
<span class="c"># plot data and overlay largest bicluster</span>
<span class="n">plt</span><span class="o">.</span><span class="n">pcolor</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">cmap</span><span class="o">=</span><span class="n">plt</span><span class="o">.</span><span class="n">cm</span><span class="o">.</span><span class="n">Greys</span><span class="p">,</span> <span class="n">shading</span><span class="o">=</span><span class="s">'faceted'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">pcolor</span><span class="p">(</span><span class="n">bc</span><span class="p">,</span> <span class="n">cmap</span><span class="o">=</span><span class="n">plt</span><span class="o">.</span><span class="n">cm</span><span class="o">.</span><span class="n">Greys</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span> <span class="n">shading</span><span class="o">=</span><span class="s">'faceted'</span><span class="p">)</span>
</pre></div>
<p>Here is the resulting image, with the original data in gray and the
largest bicluster highlighted in black:</p>
<img alt="./static/images/bimax_example.png" src="./static/images/bimax_example.png" style="width: 50%;" />
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>This post showed how BiMax works and demonstrated its use. It is
interesting to note that BiMax’s objective is equivalent to finding
all <a class="reference external" href="https://en.wikipedia.org/wiki/Complete_bipartite_graph">bicliques</a> in the
bipartite graph representation of the binary matrix. A recent <a class="reference external" href="https://www.biomedcentral.com/content/pdf/1471-2105-13-S18-A10.pdf">meeting
abstract</a>
by Voggenreiter et al. in <span class="caps">BMC</span> Bioinformatics describes a variant of
the Bron-Kerbosch algorithm for enumerating bicliques that in their
tests was faster than BiMax.</p>
</div>
<div class="section" id="references">
<h2>References</h2>
<table class="docutils footnote" frame="void" id="id2" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td>Prelić, A., Bleuler, S., Zimmermann, P., Wille, A., Bühlmann,
P., Gruissem, W., … <span class="amp">&</span> Zitzler, E. (2006). <a class="reference external" href="http://ocw.metu.edu.tr/file.php/40/Schedule/reading8.pdf">A systematic
comparison and evaluation of biclustering methods for gene
expression data</a>.
Bioinformatics, 22(9), 1122-1129.</td></tr>
</tbody>
</table>
</div>
<h1>Cheng and Church</h1>
<p>Kemal Eren, 2013-08-08</p>
<div class="section" id="introduction">
<h2>Introduction</h2>
<p>The subject of today’s post is a biclustering algorithm commonly
referred to by the names of its authors, Yizong Cheng and George
Church <a class="footnote-reference" href="#id3" id="id1">[1]</a>. It is one of the best-known biclustering algorithms, with
over 1,400 citations, because it was the first to apply biclustering
to gene microarray data. This algorithm remains popular as a
benchmark: almost every new published biclustering algorithm includes
a comparison with it.</p>
<p>Cheng and Church were interested in finding biclusters with a small
mean squared residue, which is a measure of bicluster homogeneity. To
define the mean squared residue, the following notation will be
necessary: let <span class="math">\(A\)</span>
be a matrix, and let the tuple <span class="math">\((I, J)\)</span>
represent a bicluster of <span class="math">\(A\)</span>
, where <span class="math">\(I\)</span>
is the set of rows
in the bicluster, and <span class="math">\(J\)</span>
is the set of columns. The submatrix
of <span class="math">\(A\)</span>
defined by this bicluster is <span class="math">\(A_{I J}\)</span>
. Let
<span class="math">\(a_{ij}\)</span>
with <span class="math">\(i \in I, j \in J\)</span>
be an element of the
bicluster, and let</p>
<div class="math">
\begin{equation*}
a_{iJ} = \frac{1}{|J|} \sum_{j} a_{ij}
\end{equation*}
</div>
<div class="math">
\begin{equation*}
a_{Ij} = \frac{1}{|I|} \sum_{i} a_{ij}
\end{equation*}
</div>
<div class="math">
\begin{equation*}
a_{I J} = \frac{1}{|I| |J|} \sum_{i, j} a_{ij}
\end{equation*}
</div>
<p>be the row, column, and overall means of the bicluster. Then the
residue of element <span class="math">\(a_{ij}\)</span> is</p>
<div class="math">
\begin{equation*}
a_{ij} - a_{i J} - a_{I j} + a_{I J}
\end{equation*}
</div>
<p>and the mean squared residue of the bicluster is</p>
<div class="math">
\begin{equation*}
H(I, J) = \frac{1}{|I| |J|} \sum_{i \in I, j \in J} \left (a_{ij} - a_{iJ} - a_{Ij} + a_{I J} \right )^2
\end{equation*}
</div>
<p>The residue is meant to measure how an element differs from the row
mean, column mean, and overall mean of the bicluster. If all the
elements of the bicluster have small residues, clearly the mean
squared residue will be small.</p>
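<p>The mean squared residue translates directly into NumPy. This is a hypothetical helper, a direct transcription of the formula above rather than any library's API:</p>

```python
import numpy as np

def msr(a):
    """Mean squared residue H(I, J) of a 2-D array."""
    row_means = a.mean(axis=1, keepdims=True)   # a_iJ
    col_means = a.mean(axis=0, keepdims=True)   # a_Ij
    overall = a.mean()                          # a_IJ
    residues = a - row_means - col_means + overall
    return (residues ** 2).mean()
```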
<p><span class="math">\(H(I, J)\)</span>
achieves a minimum when all the rows and columns of
the bicluster are shifted versions of each other. In other words, if
we can represent every element of the bicluster as <span class="math">\(a_{ij} =
r_i + c_j\)</span>
, where <span class="math">\(r\)</span>
is a column vector with <span class="math">\(|I|\)</span>
entries and <span class="math">\(c\)</span>
is a row vector with <span class="math">\(|J|\)</span>
entries, then
the score of <span class="math">\(A_{I J}\)</span>
will be 0.</p>
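<p>This additive structure is easy to verify numerically. The following sketch builds a shift-pattern matrix from arbitrary offsets and computes its mean squared residue straight from the definition:</p>

```python
import numpy as np

# Row offsets r_i and column offsets c_j; every entry is a_ij = r_i + c_j.
r = np.array([0.0, 2.0, 5.0])
c = np.array([1.0, 3.0, 4.0, 7.0])
a = r[:, None] + c[None, :]

# Mean squared residue computed from the definition: (numerically) zero here.
res = a - a.mean(axis=1, keepdims=True) - a.mean(axis=0) + a.mean()
h = (res ** 2).mean()
```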
<p>To visualize these shifted rows and columns, here is a matrix with a
perfect mean squared residue:</p>
<img alt="./static/images/perfect_msr_matrix.png" src="./static/images/perfect_msr_matrix.png" style="width: 35%;" />
<p>Visualizing the rows and columns of this matrix with a <a class="reference external" href="https://en.wikipedia.org/wiki/Parallel_coordinates">parallel
coordinates</a>
plot makes the pattern much easier to see. Now it is clear that the
row vectors differ from each other by only an additive constant
(left), and similarly for the column vectors (right):</p>
<img alt="./static/images/perfect_msr_parallel_coordinates.png" src="./static/images/perfect_msr_parallel_coordinates.png" style="width: 40%;" />
<img alt="./static/images/perfect_msr_parallel_coordinates_t.png" src="./static/images/perfect_msr_parallel_coordinates_t.png" style="width: 40%;" />
<p>As a corollary, it is easy to see that there are some special cases:
constant biclusters, biclusters with identical rows or columns, and
biclusters with only one row or column clearly have a score of 0.
Also, it is clear that shifting <span class="math">\(A_{I J}\)</span>
to <span class="math">\(A_{I J} + c\)</span>
for any constant <span class="math">\(c\)</span>
does not change its score.</p>
<p>Another corollary is that if the rows and columns of a bicluster are
scaled versions of each other, log-transforming the bicluster makes
its <span class="math">\(H\)</span>
score 0. In other words, if <span class="math">\(u\)</span>
and <span class="math">\(v\)</span>
are
any column vectors, the <span class="math">\(H\)</span>
score of <span class="math">\(u v^\top\)</span>
may be
large, but the <span class="math">\(H\)</span>
score of <span class="math">\(\log(u v^\top)\)</span>
is 0.
Therefore, Cheng and Church can also find biclusters with scaling
patterns, if the dataset is first log-transformed.</p>
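<p>The effect of the log-transform can also be checked numerically. In this sketch a rank-one positive matrix, whose rows are scaled versions of each other, has a large raw score but a (numerically) zero score after taking logs:</p>

```python
import numpy as np

def msr(a):
    """Mean squared residue of a 2-D array, straight from the definition."""
    res = a - a.mean(axis=1, keepdims=True) - a.mean(axis=0) + a.mean()
    return (res ** 2).mean()

# A rank-one, strictly positive matrix: rows are scaled versions of each other.
u = np.array([1.0, 2.0, 4.0])
v = np.array([1.0, 3.0, 9.0])
a = np.outer(u, v)

# The raw H score is large, but the log-transformed matrix scores 0.
raw = msr(a)
transformed = msr(np.log(a))
```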
<p>To visualize the effect of the log-transformation, here are some
scaled vectors (left) and the same vectors after log-transforming them (right):</p>
<img alt="./static/images/scaled_msr_parallel_coordinates.png" src="./static/images/scaled_msr_parallel_coordinates.png" style="width: 40%;" />
<img alt="./static/images/log_scaled_msr_parallel_coordinates.png" src="./static/images/log_scaled_msr_parallel_coordinates.png" style="width: 40%;" />
<p>Let us now see how to find these kinds of biclusters.</p>
</div>
<div class="section" id="the-algorithm">
<h2>The algorithm</h2>
<p>The Cheng and Church algorithm tries to find biclusters that are as large as
possible, with the restriction that their <span class="math">\(H\)</span>
score must be less
than some threshold <span class="math">\(\delta\)</span>
. Like most biclustering problems,
this one is <span class="caps">NP</span>-hard. Therefore, the method proceeds via a simple
greedy approach: it starts with the largest possible bicluster, then
removes the rows and columns whose removal most reduces <span class="math">\(H(I, J)\)</span>.</p>
<p>As Cheng and Church prove in their paper, this greedy removal can be
done efficiently, without the need to recalculate <span class="math">\(H\)</span>
for every
possible row and column removal. To do so, we define the mean squared
residue of any row <span class="math">\(i\)</span>
or any column <span class="math">\(j\)</span>
in the bicluster as:</p>
<div class="math">
\begin{equation*}
d(i) = \frac{1}{|J|} \sum_{j \in J} \left (a_{ij} - a_{iJ} - a_{Ij} + a_{I J} \right )^2
\end{equation*}
</div>
<div class="math">
\begin{equation*}
d(j) = \frac{1}{|I|} \sum_{i \in I} \left (a_{ij} - a_{iJ} - a_{Ij} + a_{I J} \right )^2
\end{equation*}
</div>
<p>Then this part of the algorithm, the single node deletion step,
proceeds as follows:</p>
<blockquote>
<p><strong>Algorithm</strong>: node deletion</p>
<p><strong>input</strong>: matrix <span class="math">\(A\)</span>
, row set <span class="math">\(I\)</span>
, column set <span class="math">\(J\)</span>
, <span class="math">\(\delta \geq 0\)</span>
</p>
<p><strong>output</strong>: row set <span class="math">\(I'\)</span>
and column set <span class="math">\(J'\)</span>
so that <span class="math">\(H(I', J') \leq \delta\)</span>
</p>
<p><strong>while</strong> <span class="math">\(H(I, J) > \delta\)</span> :</p>
<blockquote>
<p>find the row <span class="math">\(r = \mathop{\arg\max}\limits_{i \in I} d(i)\)</span>
and the column <span class="math">\(c = \mathop{\arg\max}\limits_{j \in J} d(j)\)</span>
</p>
<p><strong>if</strong> <span class="math">\(d(r) > d(c)\)</span>
<strong>then</strong> remove <span class="math">\(r\)</span>
from <span class="math">\(I\)</span>
<strong>else</strong> remove <span class="math">\(c\)</span>
from <span class="math">\(J\)</span>
</p>
</blockquote>
<p><strong>return</strong> <span class="math">\(I\)</span>
and <span class="math">\(J\)</span>
</p>
</blockquote>
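<p>A compact Python sketch of single node deletion, transcribed from the pseudocode above (not the paper's own code):</p>

```python
import numpy as np

def residues(sub):
    """Residue matrix of a submatrix."""
    return sub - sub.mean(axis=1, keepdims=True) - sub.mean(axis=0) + sub.mean()

def single_node_deletion(A, rows, cols, delta):
    """Greedily drop the worst row or column until H(I, J) <= delta."""
    rows, cols = list(rows), list(cols)
    while True:
        res = residues(A[np.ix_(rows, cols)])
        if (res ** 2).mean() <= delta:
            return rows, cols
        d_rows = (res ** 2).mean(axis=1)   # d(i) for each remaining row
        d_cols = (res ** 2).mean(axis=0)   # d(j) for each remaining column
        r, c = d_rows.argmax(), d_cols.argmax()
        if d_rows[r] > d_cols[c]:
            del rows[r]
        else:
            del cols[c]
```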
<p>In order to speed up node deletion, Cheng and Church also uses a
method to remove multiple rows and columns at once. This multiple node
deletion step proceeds as follows:</p>
<blockquote>
<p><strong>Algorithm</strong>: multiple node deletion</p>
<p><strong>input</strong>: matrix <span class="math">\(A\)</span>
, row set <span class="math">\(I\)</span>
, column set <span class="math">\(J\)</span>
, <span class="math">\(\delta \geq 0\)</span>
, <span class="math">\(\alpha > 0\)</span>
</p>
<p><strong>output</strong>: row set <span class="math">\(I'\)</span>
and column set <span class="math">\(J'\)</span>
so that <span class="math">\(H(I', J') \leq \delta\)</span>
</p>
<p><strong>while</strong> <span class="math">\(H(I, J) > \delta\)</span> :</p>
<blockquote>
<p>remove from <span class="math">\(I\)</span>
all rows with <span class="math">\(d(i) > \alpha H(I, J)\)</span>
</p>
<p>remove from <span class="math">\(J\)</span>
all columns with <span class="math">\(d(j) > \alpha H(I, J)\)</span>
</p>
<p><strong>if</strong> <span class="math">\(I\)</span>
and <span class="math">\(J\)</span>
have not changed <strong>then</strong> switch to single node deletion</p>
</blockquote>
<p><strong>return</strong> <span class="math">\(I\)</span>
and <span class="math">\(J\)</span>
</p>
</blockquote>
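<p>Multiple node deletion can be sketched in the same style. Note one simplification: the paper recomputes the means between the row pass and the column pass, while this sketch filters both against the same residue matrix:</p>

```python
import numpy as np

def multiple_node_deletion(A, rows, cols, delta, alpha):
    """Drop every row/column whose residue score exceeds alpha * H(I, J)."""
    rows, cols = list(rows), list(cols)
    while True:
        sub = A[np.ix_(rows, cols)]
        res = sub - sub.mean(axis=1, keepdims=True) - sub.mean(axis=0) + sub.mean()
        h = (res ** 2).mean()
        if h <= delta:
            break
        d_rows = (res ** 2).mean(axis=1)
        d_cols = (res ** 2).mean(axis=0)
        new_rows = [r for r, d in zip(rows, d_rows) if d <= alpha * h]
        new_cols = [c for c, d in zip(cols, d_cols) if d <= alpha * h]
        if new_rows == rows and new_cols == cols:
            break   # no progress: the caller switches to single node deletion
        rows, cols = new_rows, new_cols
    return rows, cols
```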
<p>Multiple and single node deletion stop when <span class="math">\(H(I, J) \leq \delta\)</span>
.
At this point, the algorithm tries a node addition step to add any
rows or columns that do not make <span class="math">\(H(I, J)\)</span>
worse. This step
optionally also adds rows if their inverses match the pattern of the
bicluster. In the context of microarray data, such rows represent
genes with the same regulation pattern, but inverted. The following
two vectors, each the inverse of the other, demonstrate this concept:</p>
<img alt="./static/images/inverse_msr_parallel_coordinates.png" src="./static/images/inverse_msr_parallel_coordinates.png" style="width: 40%;" />
<p>The node addition algorithm proceeds as follows:</p>
<blockquote>
<p><strong>Algorithm</strong>: node addition</p>
<p><strong>input</strong>: matrix <span class="math">\(A\)</span>
, row set <span class="math">\(I\)</span>
, column set <span class="math">\(J\)</span>
</p>
<p><strong>output</strong>: row set <span class="math">\(I'\)</span>
and column set <span class="math">\(J'\)</span>
so that <span class="math">\(H(I', J') \leq H(I, J)\)</span>
</p>
<p><strong>iterate</strong>:</p>
<blockquote>
<p>compute <span class="math">\(a_{i J}\)</span>
for all <span class="math">\(i\)</span>
, <span class="math">\(a_{I j}\)</span>
for all <span class="math">\(j\)</span>
, <span class="math">\(a_{I J}\)</span>
, and <span class="math">\(H(I, J)\)</span>
</p>
<p>add to <span class="math">\(J\)</span>
any column <span class="math">\(j \notin J\)</span>
with <span class="math">\(\frac{1}{|I|} \sum_{i \in I} \left (a_{ij} - a_{iJ} - a_{Ij} + a_{I J} \right )^2 \leq H(I, J)\)</span>
</p>
<p>recompute <span class="math">\(a_{i J}\)</span>
, <span class="math">\(a_{I J}\)</span>
, and <span class="math">\(H(I, J)\)</span>
</p>
<p>add to <span class="math">\(I\)</span>
any row <span class="math">\(i \notin I\)</span>
with <span class="math">\(\frac{1}{|J|} \sum_{j \in J} \left (a_{ij} - a_{iJ} - a_{Ij} + a_{I J} \right )^2 \leq H(I, J)\)</span>
</p>
<p>add to <span class="math">\(I\)</span>
any row <span class="math">\(i \notin I\)</span>
with <span class="math">\(\frac{1}{|J|} \sum_{j \in J} \left (-a_{ij} + a_{iJ} - a_{Ij} + a_{I J} \right )^2 \leq H(I, J)\)</span>
</p>
<p><strong>if</strong> <span class="math">\(I\)</span>
and <span class="math">\(J\)</span>
have not changed <strong>then</strong> break</p>
</blockquote>
<p><strong>return</strong> <span class="math">\(I\)</span>
and <span class="math">\(J\)</span>
</p>
</blockquote>
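<p>A sketch of node addition, omitting the optional inverse-row step (again my own transcription, not the paper's code):</p>

```python
import numpy as np

def h_score(sub):
    """Mean squared residue of a submatrix."""
    res = sub - sub.mean(axis=1, keepdims=True) - sub.mean(axis=0) + sub.mean()
    return (res ** 2).mean()

def node_addition(A, rows, cols):
    """Add back rows/columns whose residue score does not exceed H(I, J)."""
    rows, cols = list(rows), list(cols)
    while True:
        changed = False
        # Column pass: score each candidate column against the current bicluster.
        sub = A[np.ix_(rows, cols)]
        h, a_iJ, a_IJ = h_score(sub), sub.mean(axis=1), sub.mean()
        for j in range(A.shape[1]):
            if j not in cols:
                col = A[rows, j]
                if ((col - a_iJ - col.mean() + a_IJ) ** 2).mean() <= h:
                    cols.append(j)
                    changed = True
        # Row pass: recompute the means over the (possibly grown) column set.
        sub = A[np.ix_(rows, cols)]
        h, a_Ij, a_IJ = h_score(sub), sub.mean(axis=0), sub.mean()
        for i in range(A.shape[0]):
            if i not in rows:
                row = A[i, cols]
                if ((row - row.mean() - a_Ij + a_IJ) ** 2).mean() <= h:
                    rows.append(i)
                    changed = True
        if not changed:
            return sorted(rows), sorted(cols)
```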
<p>After node addition, the bicluster is added to the list of results and
the algorithm starts again from the beginning. However, because this
is a deterministic procedure, it would just find the same bicluster
again. Therefore, after finding a bicluster, its entries in the
original data are replaced by entries drawn from a uniform random
distribution over the range determined by the minimum and maximum of
the original dataset.</p>
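<p>The masking step itself is short; here is a sketch assuming a NumPy <code>RandomState</code> for reproducibility:</p>

```python
import numpy as np

def mask_bicluster(A, rows, cols, generator):
    """Overwrite a found bicluster with uniform noise over the data's range."""
    lo, hi = A.min(), A.max()
    A[np.ix_(rows, cols)] = generator.uniform(lo, hi, (len(rows), len(cols)))
```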
<p>The whole algorithm looks like this:</p>
<blockquote>
<p><strong>Algorithm</strong>: Cheng and Church</p>
<p><strong>input</strong>: matrix <span class="math">\(A\)</span>
, number of clusters <span class="math">\(N\)</span>
, <span class="math">\(\delta \geq 0\)</span>
, <span class="math">\(\alpha > 0\)</span>
</p>
<p><strong>output</strong>: a set of <span class="math">\(N\)</span>
biclusters, each with <span class="math">\(H(I, J) \leq \delta\)</span>
</p>
<p><strong>for</strong> <span class="math">\(i \leftarrow 1 \dots N\)</span> :</p>
<blockquote>
<p>initialize <span class="math">\((I, J)\)</span>
to all rows and all columns</p>
<p>multiple node deletion</p>
<p>single node deletion</p>
<p>node addition</p>
<p>append <span class="math">\((I, J)\)</span>
to results</p>
<p>mask <span class="math">\(A_{I J}\)</span>
with random entries</p>
</blockquote>
<p><strong>return</strong> results</p>
</blockquote>
<p>Now that we understand how the algorithm works, let us see it in action.</p>
</div>
<div class="section" id="example">
<h2>Example</h2>
<p>I have implemented Cheng and Church for scikit-learn. To see it in
action, we cluster the same lymphoma/leukemia dataset <a class="footnote-reference" href="#id4" id="id2">[2]</a> clustered
in the original paper and available at the <a class="reference external" href="http://arep.med.harvard.edu/biclustering/">supplementary information
page</a>. The data has
shape <span class="math">\(4096 \times 96\)</span>
. The parameters used for Cheng and Church
were taken directly from the paper.</p>
<div class="highlight"><pre><span class="kn">import</span> <span class="nn">urllib</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">pandas</span> <span class="kn">import</span> <span class="n">DataFrame</span>
<span class="kn">from</span> <span class="nn">pandas.tools.plotting</span> <span class="kn">import</span> <span class="n">parallel_coordinates</span>
<span class="kn">import</span> <span class="nn">matplotlib.pylab</span> <span class="kn">as</span> <span class="nn">plt</span>
<span class="kn">from</span> <span class="nn">sklearn.cluster.bicluster</span> <span class="kn">import</span> <span class="n">ChengChurch</span>
<span class="c"># get data</span>
<span class="n">url</span> <span class="o">=</span> <span class="s">"http://arep.med.harvard.edu/biclustering/lymphoma.matrix"</span>
<span class="n">lines</span> <span class="o">=</span> <span class="n">urllib</span><span class="o">.</span><span class="n">urlopen</span><span class="p">(</span><span class="n">url</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
<span class="c"># insert a space before all negative signs</span>
<span class="n">lines</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="s">' -'</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">'-'</span><span class="p">))</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">' '</span><span class="p">)</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">lines</span><span class="p">)</span>
<span class="n">lines</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">line</span> <span class="k">if</span> <span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">lines</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">lines</span><span class="p">)</span>
<span class="c"># replace missing values, just as in the paper</span>
<span class="n">generator</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">RandomState</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">idx</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">data</span> <span class="o">==</span> <span class="mi">999</span><span class="p">)</span>
<span class="n">data</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span> <span class="o">=</span> <span class="n">generator</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="o">-</span><span class="mi">800</span><span class="p">,</span> <span class="mi">801</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">idx</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
<span class="c"># cluster with same parameters as original paper</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">ChengChurch</span><span class="p">(</span><span class="n">n_clusters</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">max_msr</span><span class="o">=</span><span class="mi">1200</span><span class="p">,</span>
<span class="n">deletion_threshold</span><span class="o">=</span><span class="mf">1.2</span><span class="p">,</span> <span class="n">inverse_rows</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="c"># find bicluster with smallest msr and plot it</span>
<span class="n">msr</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">a</span><span class="p">:</span> <span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">power</span><span class="p">(</span><span class="n">a</span> <span class="o">-</span> <span class="n">a</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="o">-</span>
<span class="n">a</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="n">a</span><span class="o">.</span><span class="n">mean</span><span class="p">(),</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>
<span class="n">msrs</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">msr</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">get_submatrix</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">data</span><span class="p">))</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">))</span>
<span class="n">arr</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">get_submatrix</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">argmin</span><span class="p">(</span><span class="n">msrs</span><span class="p">),</span> <span class="n">data</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">DataFrame</span><span class="p">(</span><span class="n">arr</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s">'row'</span><span class="p">]</span> <span class="o">=</span> <span class="nb">map</span><span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="nb">range</span><span class="p">(</span><span class="n">arr</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
<span class="n">parallel_coordinates</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="s">'row'</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mf">1.5</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'column'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'expression level'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">gca</span><span class="p">()</span><span class="o">.</span><span class="n">legend_</span> <span class="o">=</span> <span class="bp">None</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div>
<p>The biclustering step took 33.9 seconds on a laptop with a Core
i5-2520M processor.</p>
<p>By visualizing the parallel coordinates plot of the bicluster with the
smallest <span class="math">\(H\)</span>
score, it is easy to see that its rows are
approximately shifted versions of each other. Each of the six rows
represents the expression profile of a gene. This bicluster has an
<span class="math">\(H\)</span>
score of 853.14.</p>
<img alt="./static/images/cheng_church_parallel_coordinates.png" src="./static/images/cheng_church_parallel_coordinates.png" style="width: 100%;" />
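<p>To make this concrete, here is a self-contained check (NumPy; the
<code>msr</code> helper mirrors the lambda above) that rows which are exact
shifted copies of one another form a perfect additive bicluster with an
<span class="math">\(H\)</span> score of 0:</p>

```python
import numpy as np

def msr(a):
    """Mean squared residue (the H score) of a submatrix."""
    residue = (a - a.mean(axis=1, keepdims=True)
                 - a.mean(axis=0) + a.mean())
    return (residue ** 2).mean()

# Rows that are exact shifted copies of one another form a perfect
# additive bicluster: the H score is 0.
base = np.array([1.0, 3.0, 2.0, 5.0])
shifted = np.vstack([base + s for s in (0.0, 2.0, -1.0)])
print(msr(shifted))  # 0.0 up to floating-point error
```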
<p>This functionality is not yet merged into scikit-learn. Until it is,
check out my <a class="reference external" href="https://github.com/kemaleren/scikit-learn/tree/cheng_church">personal branch</a>.</p>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>We have defined the mean squared residue, which is used to find
biclusters with additive shift patterns in their rows and columns. We
have seen how the Cheng and Church algorithm works, and how to use the
version implemented in scikit-learn.</p>
<p>Though subsequent algorithms have improved on it in various ways,
Cheng and Church remains a popular benchmarking method. The next post
will cover BiMax, a simple algorithm designed specifically for
benchmarking that nevertheless gets good results.</p>
</div>
<div class="section" id="references">
<h2>References</h2>
<table class="docutils footnote" frame="void" id="id3" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td>Cheng, Y., <span class="amp">&</span> Church, <span class="caps">G. M.</span>(2000, August). <a class="reference external" href="ftp://samba.ad.sdsc.edu/pub/sdsc/biology/ISMB00/157.pdf">Biclustering of
expression data</a>. In
Ismb (Vol. 8, pp. 93-103).</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id4" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id2">[2]</a></td><td>Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I.
S., Rosenwald, A., … <span class="amp">&</span> Staudt, <span class="caps">L. M.</span>(2000). <a class="reference external" href="http://llmpp.nih.gov/lymphoma/images/naturepaper.pdf">Distinct types
of diffuse large B-cell lymphoma identified by gene expression
profiling</a>.
Nature, 403(6769), 503-511.</td></tr>
</tbody>
</table>
</div>
Spectral biclustering, part 22013-07-12T18:30:00-04:00Kemal Erentag:kemaleren.com,2013-07-12:spectral-biclustering-part-2.html<div class="section" id="introduction">
<h2>Introduction</h2>
<p>Continuing the discussion of spectral biclustering methods, this post
covers the Spectral Biclustering algorithm (Kluger et al., 2003)
<a class="footnote-reference" href="#id2" id="id1">[1]</a>. This method was originally formulated for clustering tumor
profiles collected via <span class="caps">RNA</span> <a class="reference external" href="https://en.wikipedia.org/wiki/Microarray">microarrays</a>. This section introduces
the checkerboard bicluster structure that the algorithm fits. The next
section describes the algorithm in detail. Finally, in the last
section we will see how it can be used for clustering real microarray data.</p>
<p>The data collected from a gene expression microarray experiment may be
arranged in a matrix <span class="math">\(A\)</span>
, where the rows represent genes and the
columns represent individual microarrays. Entry <span class="math">\(A_{ij}\)</span>
measures the amount of <span class="caps">RNA</span> produced by gene <span class="math">\(i\)</span>
that was
measured by microarray <span class="math">\(j\)</span>
. If each microarray was used to
measure tumor tissue, then each column of <span class="math">\(A\)</span>
represents the
gene expression profile of that tumor.</p>
<p>For each kind of tumor, we expect subsets of genes to behave
differently. For instance, in liposarcoma a certain subset of genes
may be highly active, while in chondrosarcoma that same subset of
genes may show almost no activity. Assuming these patterns of
differing expression levels are consistent, then the data would
exhibit a checkerboard pattern. Each block represents a subset of
genes that is similarly expressed in a subset of tumors.</p>
<p>Let us see an artificial example. Rearranging the rows and columns of
the original matrix (left) reveals the checkerboard (right):</p>
<img alt="./static/images/checkerboard_before.png" src="./static/images/checkerboard_before.png" style="width: 40%;" />
<img alt="./static/images/checkerboard_after.png" src="./static/images/checkerboard_after.png" style="width: 40%;" />
<p>The Spectral Biclustering algorithm was created to find these
checkerboard patterns, if they exist. Let us see how it works.</p>
</div>
<div class="section" id="the-algorithm">
<h2>The algorithm</h2>
<p>To motivate the algorithm, let us start with a perfect checkerboard
matrix <span class="math">\(A\)</span>
. Let us also assume we have row and column
classification vectors <span class="math">\(r\)</span>
 and <span class="math">\(c\)</span>
, with
piecewise-constant values corresponding to the row and column
partitions of the matrix. Then applying <span class="math">\(A\)</span>
to <span class="math">\(c\)</span>
yields
a new row classification vector <span class="math">\(r’\)</span>
with the same block
pattern as the original, and vice versa:</p>
<div class="math">
\begin{align*}
A c = r’ \\
A^{\top} r = c’
\end{align*}
</div>
<p>Substitution gives:</p>
<div class="math">
\begin{align*}
A A^{\top} r = r’ \\
A^{\top} A c = c’
\end{align*}
</div>
<p>where <span class="math">\(r\)</span>
, <span class="math">\(r’\)</span>
, <span class="math">\(c\)</span>
, and <span class="math">\(c’\)</span>
are not
necessarily the same as in the previous equations. These equations look
like the coupled eigenvalue problem:</p>
<div class="math">
\begin{align*}
A A^{\top} r = \lambda^2 r \\
A^{\top} A c = \lambda^2 c
\end{align*}
</div>
<p>where eigenvectors <span class="math">\(r\)</span>
and <span class="math">\(c\)</span>
have the same eigenvalue
<span class="math">\(\lambda^2\)</span>
. This problem can be solved via the singular value
decomposition. So if the matrix has a checkerboard structure, a pair
of singular vectors will give us the appropriate row and column
classification vectors. This is the heart of the algorithm.</p>
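<p>As a quick numeric check of this idea (a NumPy sketch, not code from the
paper): for a perfect checkerboard matrix, a pair of singular vectors is
piecewise constant on the row and column partitions. Here the second pair
carries the partition information.</p>

```python
import numpy as np

# A perfect 2x2 checkerboard: rows 0-1 vs 2-3, columns 0-2 vs 3-5.
A = np.kron(np.array([[4.0, 1.0], [1.0, 4.0]]), np.ones((2, 3)))
U, s, Vt = np.linalg.svd(A)
r, c = U[:, 1], Vt[1]  # the second singular vector pair
# r is constant within each row partition, c within each column partition.
print(np.round(r, 3), np.round(c, 3))
```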
<p>Before applying the <span class="caps">SVD</span>, however, it is useful to normalize the matrix
to make the checkerboard pattern more obvious. For instance, if each
row has been multiplied by a different scaling factor, applying the
matrix to the column classification vector gives a noisy row
classification vector. If we first scale each row of the matrix so
that they have the same mean, the resulting row classification vector
will be closer to a piecewise-constant vector.</p>
<p>It turns out that we can perform this scaling using the diagonal
matrix <span class="math">\(R^{-1/2}\)</span>
that we first saw in the <a class="reference external" href="./spectral-biclustering-part-1.html">last post</a>. Similarly, we can scale the
columns using <span class="math">\(C^{-1/2}\)</span>
. Therefore, we can enhance the
checkerboard pattern by normalizing the matrix to <span class="math">\(A_n =
R^{-1/2} A C^{-1/2}\)</span>
, exactly as in Dhillon’s Spectral Co-Clustering
algorithm. The rows of matrix <span class="math">\(A_n\)</span>
all share the same mean, and
its columns share a different mean.</p>
<p>To demonstrate how normalization can be useful, here is a
visualization of a perfect checkerboard matrix in which each row and
column has been multiplied by some random scaling factor (left). After
normalization, the checkerboards are more uniform (right):</p>
<img alt="./static/images/before_normalization.png" src="./static/images/before_normalization.png" style="width: 40%;" />
<img alt="./static/images/scale_normalization.png" src="./static/images/scale_normalization.png" style="width: 40%;" />
<p>Kluger et al. introduced another normalization method, which they
called bistochastization. This method makes all rows and all columns
have the same mean value. Matrices with this property are called
bistochastic. In this method, the matrix is repeatedly normalized
until convergence. The first step is the same:</p>
<div class="math">
\begin{equation*}
A_1 = R_0^{-1/2} A C_0^{-1/2}
\end{equation*}
</div>
<p>Each subsequent step repeats the normalization of the matrix obtained
in the previous step:</p>
<div class="math">
\begin{equation*}
A_{t+1} = R_t^{-1/2} A_t C_t^{-1/2}
\end{equation*}
</div>
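<p>A minimal sketch of this iteration (the <code>bistochastize</code> name
is mine; it assumes a strictly positive matrix):</p>

```python
import numpy as np

def bistochastize(A, tol=1e-10, max_iter=2000):
    """Repeat A_{t+1} = R_t^{-1/2} A_t C_t^{-1/2} until the matrix
    stops changing."""
    X = A.astype(float)
    for _ in range(max_iter):
        r = 1.0 / np.sqrt(X.sum(axis=1))  # diagonal of R_t^{-1/2}
        c = 1.0 / np.sqrt(X.sum(axis=0))  # diagonal of C_t^{-1/2}
        X_new = r[:, None] * X * c
        if np.abs(X_new - X).max() < tol:
            return X_new
        X = X_new
    return X

A = np.random.RandomState(0).rand(5, 4) + 0.1
B = bistochastize(A)
# After convergence all row sums agree, and all column sums agree.
print(B.sum(axis=1), B.sum(axis=0))
```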
<p>Again, let us visualize the result of normalizing the original matrix
(left). Bistochastization makes the pattern even more obvious (right):</p>
<img alt="./static/images/before_normalization.png" src="./static/images/before_normalization.png" style="width: 40%;" />
<img alt="./static/images/bistochastic_normalization.png" src="./static/images/bistochastic_normalization.png" style="width: 40%;" />
<p>Finally, Kluger et al. also introduced a third method, log
normalization. First, the log of the data matrix is computed, <span class="math">\(L
= \log A\)</span>
, then the final matrix is computed according to the formula:</p>
<div class="math">
\begin{equation*}
K_{ij} = L_{ij} - \overline{L_{i \cdot}} - \overline{L_{\cdot
j}} + \overline{L_{\cdot \cdot}}
\end{equation*}
</div>
<p>where <span class="math">\(\overline{L_{i \cdot}}\)</span>
 is the mean of row <span class="math">\(i\)</span>
, <span class="math">\(\overline{L_{\cdot j}}\)</span>
 is the mean of column <span class="math">\(j\)</span>
, and
<span class="math">\(\overline{L_{\cdot \cdot}}\)</span>
is the overall mean of <span class="math">\(L\)</span>
.
This transformation removes the systematic variability among rows and
columns. The remaining value <span class="math">\(K_{ij}\)</span>
captures the interaction
between row <span class="math">\(i\)</span>
and column <span class="math">\(j\)</span>
that cannot be explained by
systematic variability among rows, among columns, or within the entire matrix.</p>
<p>An interesting consequence of this method is that adding a positive
constant to <span class="math">\(K\)</span>
 makes it bistochastic.</p>
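<p>A minimal sketch of this transformation (the <code>log_normalize</code>
name is mine, not from the paper). Because of the double centering, every
row and every column of the result sums to zero:</p>

```python
import numpy as np

def log_normalize(A):
    """K = L - row means - column means + overall mean, with L = log(A).
    Assumes a strictly positive matrix."""
    L = np.log(A)
    return (L - L.mean(axis=1, keepdims=True)
              - L.mean(axis=0) + L.mean())

A = np.exp(np.random.RandomState(0).rand(4, 5))
K = log_normalize(A)
# Double centering: every row and every column of K sums to ~0.
print(np.round(K.sum(axis=1), 12), np.round(K.sum(axis=0), 12))
```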
<p>To visualize the result, we first transform the matrix by taking
<span class="math">\(\exp(A)\)</span>
(left). Then the log normalization reveals the
checkerboard pattern (right).</p>
<img alt="./static/images/before_log_normalization.png" src="./static/images/before_log_normalization.png" style="width: 40%;" />
<img alt="./static/images/log_normalization.png" src="./static/images/log_normalization.png" style="width: 40%;" />
<p>After the matrix has been normalized according to one of these
methods, its first <span class="math">\(p\)</span>
singular vectors are computed, as we have
seen earlier. Now the problem is to somehow convert these singular
vectors into partitioning vectors.</p>
<p>If log normalization was used, all the singular vectors are
meaningful, because the contribution of the overall matrix has already
been discarded. If independent normalization or bistochastization were
used, however, the first singular vectors <span class="math">\(u_1\)</span>
and
<span class="math">\(v_1\)</span>
are discarded. From now on, the “first” singular vectors
refer to <span class="math">\(U = [u_2 \dots u_{p+1}]\)</span>
and <span class="math">\(V = [v_2 \dots
v_{p+1}]\)</span>
except in the case of log normalization.</p>
<p>There are two methods described in the paper for converting these
singular vectors into row and column partitions. Both rely on
determining the vectors that can be best approximated by a
piecewise-constant vector. In my implementation, the approximations
for each vector are found using one-dimensional k-means, and the best
vectors are chosen by the Euclidean distance between each vector and its approximation.
<p>In the first method, the best left singular vector <span class="math">\(u_b\)</span>
determines the row partitioning and the best right singular vector
<span class="math">\(v_b\)</span>
determines the column partitioning. The assignments of
rows and columns to biclusters are determined by the k-means clustering
of <span class="math">\(u_b\)</span>
and <span class="math">\(v_b\)</span> .</p>
<p>The second method is the one I chose to implement. In this method, the
data is projected to the best subset of singular vectors. For
instance, if <span class="math">\(p\)</span>
singular vectors were calculated, the <span class="math">\(q\)</span>
best are found as described, where <span class="math">\(q<p\)</span>
. Let <span class="math">\(U_b\)</span>
be the
matrix with columns the <span class="math">\(q\)</span>
best left singular vectors, and
similarly <span class="math">\(V_b\)</span>
for the right. To partition the rows, the rows
of <span class="math">\(A\)</span>
are projected to a <span class="math">\(q\)</span>
dimensional space:
<span class="math">\(A V_{b}\)</span>
. Treating the <span class="math">\(m\)</span>
rows of this <span class="math">\(m \times
q\)</span>
matrix as samples and clustering using k-means yields the row
labels. Similarly, projecting the columns to <span class="math">\(A^{\top} U_{b}\)</span>
and clustering this <span class="math">\(n \times q\)</span>
matrix yields the column labels.</p>
<p>Here, then, is the complete algorithm. First, normalize according to
one of the three methods:</p>
<ol class="arabic simple">
<li>Independent row and column normalization: <span class="math">\(A_n = R^{-1/2} A C^{-1/2}\)</span>
</li>
<li>Bistochastization: repeated row and column normalization until convergence.</li>
<li>Log normalization: <span class="math">\(K_{ij} = L_{ij} - \overline{L_{i \cdot}} - \overline{L_{\cdot j}} + \overline{L_{\cdot \cdot}}\)</span>
</li>
</ol>
<p>Then compute the first few singular vectors, <span class="math">\(U\)</span>
and <span class="math">\(V\)</span>
.
Determine the best subset, <span class="math">\(U_b\)</span>
and <span class="math">\(V_b\)</span>
. Project the
rows to <span class="math">\(A V_b\)</span>
and cluster using k-means to obtain row
partitions. Project the columns to <span class="math">\(A^{T} U_b\)</span>
and cluster using
k-means to obtain the column partitions.</p>
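<p>As a rough end-to-end sketch of these steps (plain NumPy with a toy
k-means; the function names are illustrative and not scikit-learn's API),
using independent row and column normalization and the projection method:</p>

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Tiny Lloyd's k-means with farthest-first init (for illustration;
    any k-means implementation would do)."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, ((X - centers[labels]) ** 2).sum()

def spectral_bicluster(A, k, n_vectors=3, n_best=2):
    """Sketch of the projection method with independent normalization."""
    An = np.diag(A.sum(axis=1) ** -0.5) @ A @ np.diag(A.sum(axis=0) ** -0.5)
    U, _, Vt = np.linalg.svd(An)
    U, V = U[:, 1:n_vectors + 1], Vt[1:n_vectors + 1].T  # drop u_1, v_1

    def best(M):
        # keep the columns best approximated by a piecewise-constant
        # vector, as measured by 1-D k-means cost
        costs = [kmeans(col[:, None], k)[1] for col in M.T]
        return M[:, np.argsort(costs)[:n_best]]

    rows, _ = kmeans(A @ best(V), k)
    cols, _ = kmeans(A.T @ best(U), k)
    return rows, cols

# A noiseless 2x2 checkerboard: rows 0-2 vs 3-5, columns likewise.
A = np.kron(np.array([[10.0, 1.0], [1.0, 10.0]]), np.ones((3, 3)))
row_labels, col_labels = spectral_bicluster(A, k=2)
print(row_labels, col_labels)
```

<p>On this toy matrix the recovered row labels split rows 0&#8211;2 from
rows 3&#8211;5, and likewise for the columns.</p>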
</div>
<div class="section" id="example-clustering-microarray-data">
<h2>Example: clustering microarray data</h2>
<p>Now we will demonstrate the algorithm on a real microarray dataset
from the <a class="reference external" href="https://www.ncbi.nlm.nih.gov/geo/">Gene Expression Omnibus</a> (<span class="caps">GEO</span>). Here we download and
cluster <a class="reference external" href="https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS4407"><span class="caps">GDS4407</span></a>, which
contains gene expression profiles of mononuclear blood cells of thirty
patients: eleven have acute myeloid leukemia (<span class="caps">AML</span>), another nine have
<span class="caps">AML</span> with multilineage dysplasia, and ten are the control group. Since
there are three types of samples, we will try to cluster the data into
a <span class="math">\(3 \times 3\)</span> checkerboard.</p>
<div class="highlight"><pre><span class="kn">import</span> <span class="nn">urllib</span>
<span class="kn">import</span> <span class="nn">gzip</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">Bio</span> <span class="kn">import</span> <span class="n">Geo</span>
<span class="kn">from</span> <span class="nn">sklearn.cluster.bicluster</span> <span class="kn">import</span> <span class="n">SpectralBiclustering</span>
<span class="c"># download and parse text file</span>
<span class="n">name</span> <span class="o">=</span> <span class="s">"<span class="caps">GDS4407</span>"</span>
<span class="n">filename</span> <span class="o">=</span> <span class="s">"{}.soft.gz"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
<span class="n">url</span> <span class="o">=</span> <span class="p">(</span><span class="s">"ftp://ftp.ncbi.nlm.nih.gov/"</span>
<span class="s">"geo/datasets/GDS4nnn/{}/soft/{}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">filename</span><span class="p">))</span>
<span class="n">urllib</span><span class="o">.</span><span class="n">urlretrieve</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">filename</span><span class="p">)</span>
<span class="n">handle</span> <span class="o">=</span> <span class="n">gzip</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span>
<span class="n">records</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">Geo</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="n">handle</span><span class="p">))</span>
<span class="n">handle</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="nb">map</span><span class="p">(</span><span class="nb">float</span><span class="p">,</span> <span class="n">row</span><span class="p">[</span><span class="mi">2</span><span class="p">:])</span>
<span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">records</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">table_rows</span><span class="p">[</span><span class="mi">1</span><span class="p">:]))</span>
<span class="c"># do clustering</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">SpectralBiclustering</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="n">method</span><span class="o">=</span><span class="s">"bistochastic"</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
</pre></div>
<p>This code requires <a class="reference external" href="http://biopython.org/wiki/Main_Page">BioPython</a>
for parsing the text file downloaded from the <span class="caps">GEO</span>. Note that the
method used here for downloading the data and converting it to a NumPy
array is ad-hoc and probably fragile. The fitting step took 6.91
seconds on a laptop with a Core i5-2520M processor.</p>
<p>Here is a visualization of the checkerboard pattern found by the model
(left), along with the corresponding rearrangement of the
data, normalized by bistochastization (right):</p>
<img alt="./static/images/microarray_checkerboard.png" src="./static/images/microarray_checkerboard.png" style="width: 40%;" />
<img alt="./static/images/microarray_sorted.png" src="./static/images/microarray_sorted.png" style="width: 40%;" />
<p>As I mentioned in the last post, this implementation will eventually
be available in scikit-learn. In the meantime, it is available in my
<a class="reference external" href="https://github.com/kemaleren/scikit-learn/tree/spectral_biclustering">personal branch</a>.</p>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>We have now covered both spectral biclustering methods. As you have
seen, there are some similarities between them. Both involve
normalizing the data, finding the first few singular vectors, and
using k-means to partition the rows and columns.</p>
<p>The methods differ in their bicluster structure. In fact, the
checkerboard structure may have a different number of row clusters
than column clusters, whereas the Spectral Co-Clustering algorithm
requires that they have the same number. The methods also differ in
how many singular vectors they use and how they project and cluster
the rows and columns. Spectral Co-Clustering simultaneously projects
and clusters rows and columns, whereas Spectral Biclustering
does each separately.</p>
<p>Like the other algorithms we have seen, the Spectral Biclustering
algorithm was developed to address a specific problem — in this case,
analyzing microarray data — but it can be applied to any data matrix.</p>
<p>In the next post we will return to microarrays in order to cover the
first biclustering algorithm developed specifically for microarray
data: Cheng and Church.</p>
</div>
<div class="section" id="references">
<h2>References</h2>
<table class="docutils footnote" frame="void" id="id2" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td>Kluger, Y., Basri, R., Chang, J. T., <span class="amp">&</span> Gerstein, M. (2003).
<a class="reference external" href="http://pdf.aminer.org/000/303/275/global_biclustering_of_microarray_data.pdf">Spectral biclustering of microarray data: coclustering genes and
conditions.</a>
Genome research, 13(4), 703-716.</td></tr>
</tbody>
</table>
</div>
Spectral biclustering, part 12013-07-04T13:00:00-04:00Kemal Erentag:kemaleren.com,2013-07-04:spectral-biclustering-part-1.html<div class="section" id="introduction">
<h2>Introduction</h2>
<p>This is the second entry in the series on biclustering algorithms,
this time covering spectral biclustering. This is the first part,
focusing on the Spectral Co-Clustering algorithm (Dhillon, 2001) <a class="footnote-reference" href="#id4" id="id1">[1]</a>.
The next part will focus on a related algorithm, Spectral Biclustering
(Kluger et al., 2003) <a class="footnote-reference" href="#id5" id="id2">[2]</a>.</p>
<p>To motivate the spectral biclustering problem, let us return to our
friend Bob, who hosted the party in the previous article. Bob is
thinking about music again, but this time he is writing song lyrics.
Bob plans to imitate popular songs by including similar words in his
lyrics. To learn which words to use, he collects the lyrics to many
popular songs into a <span class="math">\(w \times d\)</span>
term frequency matrix
<span class="math">\(A\)</span>
, where <span class="math">\(A_{ij}\)</span>
is the number of times word <span class="math">\(i\)</span>
appears in song <span class="math">\(j\)</span> .</p>
<p>Bob knows that his song database contains songs of many different
genres. Since the words that are popular in one genre may not be
popular in another, he decides to use biclustering again. Bob wants to
jointly cluster the rows and columns of <span class="math">\(A\)</span>
to find subsets of
words that are used more frequently in subsets of songs.</p>
<p>This problem can be converted into a graph partitioning problem:
create a graph with <span class="math">\(w\)</span>
word vertices, <span class="math">\(d\)</span>
song vertices,
and <span class="math">\(wd\)</span>
edges, each between a song and a word. The edge between
word <span class="math">\(i\)</span>
 and song <span class="math">\(j\)</span>
has weight <span class="math">\(A_{ij}\)</span>
. This
graph is <a class="reference external" href="https://duckduckgo.com/Bipartite_graph">bipartite</a>
because there are two disjoint sets of vertices (songs and words) with
no edges within sets, and every song is connected to every word. Here
is a small example of such a bipartite graph.</p>
<img alt="./static/images/bipartite.png" src="./static/images/bipartite.png" style="width: 80%;" />
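<p>As a sketch, the weighted adjacency matrix of this bipartite graph can be
assembled directly from <span class="math">\(A\)</span> (the helper name
here is illustrative):</p>

```python
import numpy as np

def bipartite_adjacency(A):
    """Adjacency matrix of the bipartite graph: the first w vertices
    are words, the remaining d are songs, and the edge between word i
    and song j has weight A[i, j]."""
    w, d = A.shape
    W = np.zeros((w + d, w + d))
    W[:w, w:] = A
    W[w:, :w] = A.T
    return W

A = np.array([[3.0, 0.0, 1.0],
              [2.0, 5.0, 0.0]])  # 2 words x 3 songs
W = bipartite_adjacency(A)
print(W)
```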
<p>To find biclusters of words and songs, Bob wants to partition the
graph so that edges within partitions have heavy weights and edges
between partitions have light weights. Furthermore, he would like each
partition to be about the same size. This problem, known as
normalized cut, will be formulated more rigorously in the next
section. Here is the partitioning of the example graph that achieves
Bob’s goal:</p>
<img alt="./static/images/bipartite-clustered.png" src="./static/images/bipartite-clustered.png" style="width: 80%;" />
<p>This same data can be visualized as a matrix, too, of course. The
original matrix <span class="math">\(A\)</span>
corresponds to the unpartitioned
graph (left). After rearranging rows and columns, the biclusters
corresponding to heavy graph partitions are visible on the diagonal (right).</p>
<img alt="./static/images/song_data.png" src="./static/images/song_data.png" style="width: 40%;" />
<img alt="./static/images/song_biclusters.png" src="./static/images/song_biclusters.png" style="width: 40%;" />
<p>The idea of converting a data matrix into a bipartite graph motivates
many biclustering algorithms. In this case, the goal of finding the
optimal normalized cut leads us to the spectral clustering family of
algorithms. Let us see how they work.</p>
</div>
<div class="section" id="spectral-clustering">
<h2>Spectral clustering</h2>
<p>To demonstrate spectral clustering, we will use a toy dataset
consisting of concentric rings:</p>
<img alt="./static/images/circles.png" src="./static/images/circles.png" style="width: 40%;" />
<p>We would like the samples within each ring to cluster together.
However, algorithms like <a class="reference external" href="https://en.wikipedia.org/wiki/K-means_clustering">k-means</a> will not work,
because samples in different rings are closer to each other than
samples in the same ring. It might lead, for example, to this result:</p>
<img alt="./static/images/circles_kmeans.png" src="./static/images/circles_kmeans.png" style="width: 40%;" />
<p>Spectral clustering will allow us to convert this data to a new space
in which k-means gives better results. To find this new space, we start
by building a graph <span class="math">\(G = \{ V, E \}\)</span>
with a vertex for each
sample. Each pair of samples <span class="math">\(x_i, x_j\)</span>
is connected by an edge
with weight <span class="math">\(s(x_i, x_j)\)</span>
equal to the similarity between them.
The goal when building <span class="math">\(G\)</span>
is to capture the local neighborhood
relationships in the data. For simplicity, we achieve this goal by
setting <span class="math">\(s(x_i, x_j) = 0\)</span>
for all but the nearest neighbors.</p>
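<p>One simple way to build such a graph (a sketch; the 0/1 edge weights and
the choice of <code>k</code> here are arbitrary choices, not prescribed by
the algorithm):</p>

```python
import numpy as np

def knn_graph(X, k=3):
    """Symmetric k-nearest-neighbor graph with 0/1 edge weights."""
    n = len(X)
    d2 = ((X[:, None] - X[None]) ** 2).sum(-1)  # squared distances
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d2[i])[1:k + 1]:  # skip the point itself
            W[i, j] = W[j, i] = 1.0
    return W

# Two well-separated groups of points on a line.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
W = knn_graph(X, k=2)
print(W)
```

<p>For these points the graph splits into two connected components, one per
group.</p>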
<p>Let square matrix <span class="math">\(W\)</span>
be the weighted adjacency matrix for
<span class="math">\(G\)</span>
, so that <span class="math">\(W_{ij}=s(x_i, x_j)\)</span>
when <span class="math">\(i \neq j\)</span>
and <span class="math">\(W_{i, i} = 0\)</span>
. Let <span class="math">\(D\)</span>
be the degree matrix for
<span class="math">\(G\)</span>
, which is a diagonal matrix containing the sum of all
weights for edges incident to vertex <span class="math">\(i\)</span>
: <span class="math">\(D_{i, i} =
\sum_{j=1}^{n} W_{ij}\)</span>
, and <span class="math">\(D_{i, j}=0\)</span>
if <span class="math">\(i \ne j\)</span>
.
Then the <em>Laplacian matrix</em> of graph <span class="math">\(G\)</span>
is defined as
<span class="math">\(L=D-W\)</span>
. The Laplacian is used in many graph problems because of
its useful properties. For this problem we will use the following
property: the smallest eigenvalue of <span class="math">\(L\)</span>
is 0, with eigenvector
<span class="math">\(\mathbb{1}\)</span> .</p>
<p>To see why this property is useful, consider the ring problem again.
In this particular case, we were lucky when building graph <span class="math">\(G\)</span>
,
because it contains two connected components, each corresponding to
one of the clusters we would like to find. The Laplacian makes it easy
to recover those clusters. Since there are no edges between
components, <span class="math">\(W\)</span>
is block diagonal, and therefore <span class="math">\(L\)</span>
is
block diagonal. Each block of <span class="math">\(L\)</span>
is the Laplacian for a
connected component. In this case, since the two submatrices
<span class="math">\(L_1\)</span>
and <span class="math">\(L_2\)</span>
 are also Laplacians, each has
eigenvalue 0 with eigenvector <span class="math">\(\mathbb{1}\)</span>
. Therefore, we
know that the eigenvalue 0 of <span class="math">\(L\)</span>
has multiplicity two, and its
eigenspace is spanned by <span class="math">\([0, \dots , 0, 1, \dots , 1]^\top\)</span>
and
<span class="math">\([1, \dots , 1, 0, \dots , 0]^\top\)</span>
, where the number of
<span class="math">\(1\)</span>
entries in each vector is equal to the number of vertices in
that connected component.</p>
<p>This realization suggests a strategy for finding <span class="math">\(k\)</span>
clusters if
they appear in the graph as connected components:</p>
<ol class="arabic simple">
<li>Build graph <span class="math">\(G\)</span>
and compute its Laplacian <span class="math">\(L\)</span> .</li>
<li>Compute the first <span class="math">\(k\)</span>
eigenvectors of <span class="math">\(L\)</span> .</li>
<li>The nonzero entries of eigenvector <span class="math">\(u_i\)</span>
indicate the
samples belonging to cluster <span class="math">\(i\)</span> .</li>
</ol>
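<p>This strategy can be checked on a toy graph with two connected
components (a NumPy sketch):</p>

```python
import numpy as np

# A graph with two components: a triangle {0, 1, 2} and an edge {3, 4}.
W = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4)]:
    W[i, j] = W[j, i] = 1.0
D = np.diag(W.sum(axis=1))
L = D - W  # the graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)
print(np.round(eigvals, 10))  # eigenvalue 0 appears twice
# The indicator vector of each component lies in the 0-eigenspace.
span = eigvecs[:, :2]
indicator = np.array([1.0, 1.0, 1.0, 0.0, 0.0])
print(np.allclose(span @ (span.T @ indicator), indicator))  # True
```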
<p>Applying this method to the circles dataset does indeed locate the
correct clusters:</p>
<img alt="./static/images/circles_spectral.png" src="./static/images/circles_spectral.png" style="width: 40%;" />
<p>Unfortunately, most of the time <span class="math">\(G\)</span>
will not be so conveniently
separated into connected components. For instance, if we add a bridge
between the rings, the resulting graph has only one connected
component, even with the nearest-neighbor affinities:</p>
<img alt="./static/images/circles_bridged.png" src="./static/images/circles_bridged.png" style="width: 40%;" />
<p>We can deal with this case by removing a few edges to create two
connected components again. To do so, we introduce the notion of a
graph cut. The following definition will be needed: if <span class="math">\(V_i\)</span>
and
<span class="math">\(V_j\)</span>
are sets of vertices,</p>
<div class="math">
\begin{equation*}
W(V_i) = \sum_{a, b \in V_i, a \neq b} W_{ab}
\end{equation*}
</div>
<p>is the sum of edge weights within a partition and</p>
<div class="math">
\begin{equation*}
W(V_i, V_j) = \sum_{a \in V_i, b \in V_j} W_{ab}
\end{equation*}
</div>
<p>is the sum of edge weights between the two partitions. Then given a
partition of the set of vertices <span class="math">\(V = V_1 \cup V_2 \cup \dots \cup
V_k\)</span> ,</p>
<div class="math">
\begin{equation*}
\text{cut}(V_1, \dots, V_k) = \frac{1}{2} \sum_{i=1}^k W(V_i, \overline{V_i})
\end{equation*}
</div>
<p>where <span class="math">\(\overline{V_i}\)</span>
denotes the complement of <span class="math">\(V_i\)</span>
. The
factor of <span class="math">\(\frac{1}{2}\)</span>
corrects for double counting.</p>
<p>We want to partition the graph to minimize the cut, but the minimum is
trivially achieved by setting one <span class="math">\(V_i =
V\)</span>
and <span class="math">\(V_j = \emptyset\)</span>
for all <span class="math">\(j \neq i\)</span>
. To ensure that the
partitions are approximately balanced, we instead minimize the normalized
cut, which divides each term by the weight of its partition:</p>
<div class="math">
\begin{equation*}
\text{ncut}(V_1, \dots, V_k) = \sum_{i=1}^k \frac{\text{cut}(V_i, \overline{V_i})}{W(V_i)}
\end{equation*}
</div>
<p>There are other balanced cuts, such as ratio cut, but here we focus on
normalized cut because it will be used again in the Spectral
Co-Clustering algorithm.</p>
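<p>Following the definitions above, both quantities are easy to compute
directly from the weight matrix. A sketch in NumPy, using an
illustrative graph of two triangles joined by a single bridge edge:</p>

```python
import numpy as np

def cut_value(W, labels):
    """cut(V_1, ..., V_k): half the total weight between partitions."""
    total = 0.0
    for k in np.unique(labels):
        mask = labels == k
        total += W[mask][:, ~mask].sum()
    return total / 2.0

def ncut_value(W, labels):
    """ncut: each between-partition weight divided by W(V_i), the
    within-partition weight (summed over ordered pairs, as above)."""
    total = 0.0
    for k in np.unique(labels):
        mask = labels == k
        between = W[mask][:, ~mask].sum()
        within = W[mask][:, mask].sum()
        total += between / within
    return total

# Two triangles joined by a single bridge edge (2, 3).
W = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[a, b] = W[b, a] = 1.0

labels = np.array([0, 0, 0, 1, 1, 1])
cut_value(W, labels)   # the bridge is the only edge cut
ncut_value(W, labels)
```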
<p>Unfortunately, the normalized cut problem is <span class="caps">NP</span>-hard. However, the
relaxed version of this integer optimization problem can be solved
efficiently. Moreover, it turns out that the first <span class="math">\(k\)</span>
eigenvectors of the generalized eigenproblem <span class="math">\(Lu = \lambda D u\)</span>
provide the solution to the relaxed problem. This real solution can
then be converted into a (potentially suboptimal) integer solution by
a clustering; here we will use k-means.</p>
<p>The steps for spectral clustering, minimizing <span class="math">\(\text{ncut}\)</span>
, are:</p>
<ol class="arabic simple">
<li>Build graph <span class="math">\(G\)</span>
and compute its Laplacian <span class="math">\(L\)</span> .</li>
<li>Compute the first <span class="math">\(k\)</span>
eigenvectors of <span class="math">\(Lu = \lambda D
u\)</span> .</li>
<li>Treat the eigenvectors as a new data set with <span class="math">\(n\)</span>
samples and
<span class="math">\(k\)</span>
features. Use k-means to cluster it.</li>
</ol>
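<p>These three steps can be sketched with SciPy and scikit-learn. The
graph here (two cliques joined by a bridge, names of our own choosing)
is an illustrative stand-in for the rings:</p>

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

# Two 4-vertex cliques joined by a single bridge edge.
W = np.zeros((8, 8))
W[:4, :4] = 1.0
W[4:, 4:] = 1.0
np.fill_diagonal(W, 0.0)
W[3, 4] = W[4, 3] = 1.0

D = np.diag(W.sum(axis=1))
L = D - W

# Step 2: first k eigenvectors of the generalized problem L u = lambda D u.
k = 2
_, eigvecs = eigh(L, D, subset_by_index=[0, k - 1])

# Step 3: treat each row as a k-dimensional sample and run k-means.
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(eigvecs)
```

The optimal normalized cut severs the bridge, so the two cliques end up
in separate clusters.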
<p>In our ring problem, the optimal normalized cut is the one that cuts
through the bridge. The solution found by spectral clustering is the
optimal one:</p>
<img alt="./static/images/circles_bridged_spectral.png" src="./static/images/circles_bridged_spectral.png" style="width: 40%;" />
<p>The properties of the graph Laplacian are what make this method work.
The first <span class="math">\(k\)</span>
eigenvectors provide an alternative representation
of the data in a space in which it is more amenable to clustering by
algorithms such as k-means.</p>
<p>This section gave a simplified introduction to spectral clustering on
a toy example. Hopefully it has provided the reader with some
intuition for how these methods work. However, it has necessarily
skipped many important details. For a more complete overview, see
<a class="footnote-reference" href="#id6" id="id3">[3]</a>, from which this section has been adapted.</p>
</div>
<div class="section" id="spectral-co-clustering">
<h2>Spectral Co-Clustering</h2>
<p>Let us return to Bob’s songwriting to see how our new insights can be
applied to his problem. Remember that Bob had converted his word
frequency matrix into a bipartite graph <span class="math">\(G\)</span>
. Remember also that
he wants to partition it into <span class="math">\(k\)</span>
partitions by finding the
optimal normalized cut. We saw in the previous section how to find a
solution to this problem, which involved finding the eigenvectors of
the matrix <span class="math">\(L\)</span>
. We could do that directly here, too, but we can
avoid working on the <span class="math">\((w+d) \times (w+d)\)</span>
matrix <span class="math">\(L\)</span>
and
instead work with the smaller <span class="math">\(w \times d\)</span>
matrix <span class="math">\(A\)</span> .</p>
<p>If we normalize <span class="math">\(A\)</span>
to <span class="math">\(A_n\)</span>
in a particular way, the
<a class="reference external" href="https://en.wikipedia.org/wiki/Singular_value_decomposition">singular value decomposition</a> of the
resulting matrix <span class="math">\(A_n = U \Sigma V^\top\)</span>
will give us the desired
partitions of the rows and columns of <span class="math">\(A\)</span>
. A subset of the left
singular vectors will give the word partitions, and a subset of the
right singular vectors will give the song partitions.</p>
<p>The normalization is performed as follows:</p>
<div class="math">
\begin{equation*}
A_n = R^{-1/2} A C^{-1/2}
\end{equation*}
</div>
<p>where <span class="math">\(R\)</span>
is the diagonal matrix with entry <span class="math">\(i\)</span>
equal to
<span class="math">\(\sum_{j} A_{ij}\)</span>
and <span class="math">\(C\)</span>
is the diagonal matrix with
entry <span class="math">\(j\)</span>
equal to <span class="math">\(\sum_{i} A_{ij}\)</span> .</p>
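<p>In NumPy this normalization is one line once the row and column sums
are in hand (a sketch; the small matrix is purely illustrative):</p>

```python
import numpy as np

A = np.array([[3., 1., 0.],
              [2., 2., 0.],
              [0., 1., 4.]])

r = A.sum(axis=1)  # diagonal of R: row sums of A
c = A.sum(axis=0)  # diagonal of C: column sums of A

# A_n = R^{-1/2} A C^{-1/2}, using broadcasting instead of
# forming the diagonal matrices explicitly.
A_n = A / np.sqrt(np.outer(r, c))
```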
<p>If <span class="math">\(k=2\)</span>
, we can use the singular vectors <span class="math">\(u_2\)</span>
and
<span class="math">\(v_2\)</span>
(<span class="math">\(u_1\)</span>
and <span class="math">\(v_1\)</span>
are discarded) to compute the
second eigenvector of <span class="math">\(L\)</span> :</p>
<div class="math">
\begin{equation*}
z_2 = \begin{bmatrix} R^{-1/2} u_2 \\
C^{-1/2} v_2
\end{bmatrix}
\end{equation*}
</div>
<p>We know how to proceed from here: as we saw in the previous section,
clustering this eigenvector provides the desired bipartitioning.</p>
<p>If <span class="math">\(k>2\)</span>
, then the <span class="math">\(\ell = \lceil \log_2 k \rceil\)</span>
singular vectors, starting from the second, provide the desired
partitioning information. In this case, we cluster the matrix
<span class="math">\(Z\)</span> :</p>
<div class="math">
\begin{equation*}
Z = \begin{bmatrix} R^{-1/2} U \\
C^{-1/2} V
\end{bmatrix}
\end{equation*}
</div>
<p>where the columns of <span class="math">\(U\)</span>
are <span class="math">\(u_2, \dots, u_{\ell +
1}\)</span>
, and similarly for <span class="math">\(V\)</span> .</p>
<p>Here is the full algorithm for finding <span class="math">\(k\)</span>
clusters, adapted from
the original paper:</p>
<ol class="arabic simple">
<li>Given <span class="math">\(A\)</span>
, normalize it to <span class="math">\(A_n = R^{-1/2} A
C^{-1/2}\)</span> .</li>
<li>Compute <span class="math">\(\ell = \lceil \log_2 k \rceil\)</span>
singular vectors of
<span class="math">\(A_n\)</span>
, <span class="math">\(u_2 \dots u_{\ell+1}\)</span>
and <span class="math">\(v_2 \dots v_{\ell+1}\)</span>
.
Form the matrix <span class="math">\(Z\)</span>
as just shown.</li>
<li>Cluster the <span class="math">\(\ell\)</span>
-dimensional data <span class="math">\(Z\)</span>
with k-means to
obtain the desired <span class="math">\(k\)</span> biclusters.</li>
</ol>
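<p>A minimal dense NumPy sketch of these three steps (it assumes a
nonnegative matrix with no all-zero rows or columns; the scikit-learn
implementation discussed below handles sparse input and more):</p>

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_coclustering(A, k, random_state=0):
    # Step 1: normalize A to A_n = R^{-1/2} A C^{-1/2}.
    r_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    c_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=0))
    A_n = r_inv_sqrt[:, None] * A * c_inv_sqrt[None, :]

    # Step 2: singular vectors u_2..u_{l+1} and v_2..v_{l+1}; form Z.
    U, _, Vt = np.linalg.svd(A_n)
    l = int(np.ceil(np.log2(k)))
    Z = np.vstack([r_inv_sqrt[:, None] * U[:, 1:l + 1],
                   c_inv_sqrt[:, None] * Vt[1:l + 1, :].T])

    # Step 3: cluster the l-dimensional rows of Z with k-means;
    # rows and columns of A are clustered together.
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=random_state).fit_predict(Z)
    return labels[:A.shape[0]], labels[A.shape[0]:]

# Two disjoint blocks of 1s should be recovered exactly.
A = np.zeros((6, 5))
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
row_labels, col_labels = spectral_coclustering(A, k=2)
```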
<p>Notice that the <span class="math">\(w\)</span>
rows and <span class="math">\(d\)</span>
columns of the original
data matrix are converted to the <span class="math">\(w+d\)</span>
rows in matrix <span class="math">\(Z\)</span>
.
Therefore, both rows and columns are treated as samples and clustered together.</p>
</div>
<div class="section" id="example-clustering-documents-and-words">
<h2>Example: clustering documents and words</h2>
<p>To see the algorithm in action, let us try clustering some documents.</p>
<div class="highlight"><pre><span class="kn">from</span> <span class="nn">sklearn.feature_extraction.text</span> <span class="kn">import</span> <span class="n">TfidfVectorizer</span>
<span class="kn">from</span> <span class="nn">sklearn.cluster.bicluster</span> <span class="kn">import</span> <span class="n">SpectralCoclustering</span>
<span class="kn">import</span> <span class="nn">requests</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="n">url_base</span> <span class="o">=</span> <span class="s">"http://www.gutenberg.org/cache/epub/{0}/pg{0}.txt"</span>
<span class="n">ids</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1228</span><span class="p">,</span> <span class="mi">22728</span><span class="p">,</span> <span class="mi">5273</span><span class="p">,</span> <span class="mi">19192</span><span class="p">,</span> <span class="mi">2926</span><span class="p">,</span> <span class="mi">2929</span><span class="p">,</span> <span class="mi">6475</span><span class="p">,</span> <span class="mi">20818</span><span class="p">,</span> <span class="mi">24648</span><span class="p">,</span>
<span class="mi">22428</span><span class="p">,</span> <span class="mi">35937</span><span class="p">,</span> <span class="mi">27477</span><span class="p">,</span> <span class="mi">6630</span><span class="p">,</span> <span class="mi">15636</span><span class="p">,</span> <span class="mi">35744</span><span class="p">,</span> <span class="mi">26556</span><span class="p">,</span> <span class="mi">6574</span><span class="p">,</span> <span class="mi">28247</span><span class="p">,</span>
<span class="mi">25992</span><span class="p">,</span> <span class="mi">28613</span><span class="p">]</span>
<span class="n">docs</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">ids</span><span class="p">:</span>
<span class="n">docs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url_base</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">i</span><span class="p">))</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
<span class="n">time</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">30</span><span class="p">)</span> <span class="c"># be nice to the server</span>
<span class="n">vec</span> <span class="o">=</span> <span class="n">TfidfVectorizer</span><span class="p">(</span><span class="n">stop_words</span><span class="o">=</span><span class="s">'english'</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">SpectralCoclustering</span><span class="p">(</span><span class="n">n_clusters</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">svd_method</span><span class="o">=</span><span class="s">'arpack'</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">vec</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">docs</span><span class="p">)</span><span class="o">.</span><span class="n">T</span>
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
</pre></div>
<p>In this example, we download twenty books from Project Gutenberg, ten
from the astronomy category and ten from the evolution category. The
word frequency matrix is created with <a class="reference external" href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">tf-idf</a> values. The dataset
is a <span class="math">\(38,121 \times 20\)</span>
sparse scipy matrix. The
<cite>SpectralCoclustering</cite> model supports sparse matrices, so we can pass
it directly. The biclustering step took 0.4 seconds on a laptop with a
Core i5-2520M processor.</p>
<p>To examine whether the biclusters are reasonable, we determine the
most important words in each bicluster by subtracting the sum of
frequencies outside the bicluster from the sum of frequencies within
the bicluster. The first bicluster contained thirteen documents, of
which eight were on astronomy. The most important words in this
bicluster were “et”, “copernican”, “bruno”, “paris”, “speech”,
“congregation”, “acid”, “1616”, “scriptures”, and “ii”. The other
bicluster contained the remaining documents, with the following
important words: “species”, “selection”, “case”, “different”,
“number”, “darwin”, “characters”, “size”, “genera”, and “conditions”.
These seem like reasonable words to find in old texts on astronomy and
evolution, respectively.</p>
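<p>The scoring heuristic just described can be sketched as a short
function (the function name and toy matrix are our own; as in the
example above, rows of the data matrix are words and columns are
documents):</p>

```python
import numpy as np

def important_words(data, words, row_labels, col_labels, cluster, n=5):
    """Rank words in a bicluster by their total frequency inside the
    bicluster minus their total frequency outside it."""
    rows = np.where(row_labels == cluster)[0]
    in_cols = col_labels == cluster
    scores = (data[rows][:, in_cols].sum(axis=1)
              - data[rows][:, ~in_cols].sum(axis=1))
    best = np.argsort(scores)[::-1][:n]
    return [words[rows[i]] for i in best]

# Toy example: 4 words x 4 documents, two biclusters.
data = np.array([[5., 4., 0., 0.],
                 [3., 3., 1., 0.],
                 [0., 0., 6., 5.],
                 [0., 1., 2., 2.]])
words = ['star', 'orbit', 'species', 'darwin']
row_labels = np.array([0, 0, 1, 1])
col_labels = np.array([0, 0, 1, 1])
top = important_words(data, words, row_labels, col_labels, cluster=0, n=2)
```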
<p>The functionality demonstrated in this section will eventually be
available in scikit-learn. In the meantime, if you would like to try
the code, check out my <a class="reference external" href="https://github.com/kemaleren/scikit-learn/tree/spectral_biclustering">personal branch</a>.</p>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>This article introduced spectral clustering. It then demonstrated how
the same method was adapted to biclustering in the Spectral
Co-Clustering algorithm. We have seen how the algorithms work on toy
datasets. Finally, we saw how to use scikit-learn to bicluster a more
realistic dataset.</p>
<p>The next post, part two of this series, will continue the discussion
of spectral methods. It will introduce a new type of bicluster, the
checkerboard structure, and a new algorithm for finding checkerboard
patterns, the Spectral Biclustering algorithm (Kluger, 2003).</p>
</div>
<div class="section" id="references">
<h2>References</h2>
<table class="docutils footnote" frame="void" id="id4" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td>Dhillon, <span class="caps">I. S.</span>(2001, August). <a class="reference external" href="ftp://ftp.cis.upenn.edu/pub/datamining/public_html/ReadingGroup/papers/dhillon2001.pdf">Co-clustering documents and words
using bipartite spectral graph partitioning.</a>
In Proceedings of the seventh <span class="caps">ACM</span> <span class="caps">SIGKDD</span> international conference
on Knowledge discovery and data mining (pp. 269-274). <span class="caps">ACM</span>.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id5" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id2">[2]</a></td><td>Kluger, Y., Basri, R., Chang, J. T., <span class="amp">&</span> Gerstein, M. (2003).
<a class="reference external" href="http://pdf.aminer.org/000/303/275/global_biclustering_of_microarray_data.pdf">Spectral biclustering of microarray data: coclustering genes and
conditions.</a>
Genome research, 13(4), 703-716.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id6" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id3">[3]</a></td><td>Von Luxburg, U. (2007). <a class="reference external" href="http://www.cs.columbia.edu/~jebara/4772/papers/Luxburg07_tutorial.pdf">A tutorial on spectral clustering.</a>
Statistics and computing, 17(4), 395-416.</td></tr>
</tbody>
</table>
</div>
An introduction to biclustering2013-06-25T08:00:00-04:00Kemal Erentag:kemaleren.com,2013-06-25:an-introduction-to-biclustering.html<div class="section" id="introduction">
<h2>Introduction</h2>
<p>This is the first in my series of biclustering posts. It serves as an
introduction to the concept. The later posts will cover the individual
algorithms that I am implementing for scikit-learn.</p>
<p>Before talking about biclustering, it is necessary to cover the basics
of clustering. Clustering is a fundamental problem in data mining:
given a set of objects, group them according to some measure of
similarity. This deceptively simple concept has given rise to a wide
variety of problems and the algorithms to solve them. scikit-learn
provides <a class="reference external" href="http://scikit-learn.org/stable/modules/clustering.html">a number of clustering methods</a>, but these
represent only a small fraction of the diversity of the field. For a
more detailed overview of clustering, there are several surveys
available <a class="footnote-reference" href="#id8" id="id1">[1]</a>, <a class="footnote-reference" href="#id9" id="id2">[2]</a>, <a class="footnote-reference" href="#id10" id="id3">[3]</a>.</p>
<p>In many clustering problems, each of the <span class="math">\(n\)</span>
samples to be
clustered is represented by a <span class="math">\(p\)</span>
-dimensional feature vector.
The entire dataset may then be represented by a matrix of shape
<span class="math">\(n \times p\)</span>
. After applying some clustering algorithm, each
sample belongs to one cluster.</p>
<p>To demonstrate, consider a clustering problem with 20 samples and 20
dimensions. Here is a scatterplot of the first two dimensions, with
cluster membership designated by color:</p>
<img alt="/static/images/clusters_scatter.png" src="/static/images/clusters_scatter.png" style="width: 50%;" />
<p>By rearranging the rows of the data matrix, the samples belonging to
each cluster can be made contiguous. In the original data (left) the
clusters are not visible, but in the rearranged data (right), the
correct partition is more obvious:</p>
<img alt="/static/images/clusters_data.png" src="/static/images/clusters_data.png" style="width: 30%;" />
<img alt="/static/images/clusters_sorted.png" src="/static/images/clusters_sorted.png" style="width: 30%;" />
<p>This view of clustering — partitioning the rows of the data matrix —
leads to the definition of biclustering.</p>
<blockquote>
<strong>Biclustering</strong>: a data mining method that simultaneously
clusters both rows and columns of a matrix.</blockquote>
<p>Clustering rows and columns together may seem a strange and
unintuitive thing to do. To see why it can be useful, let us consider
a simple example.</p>
</div>
<div class="section" id="motivating-example-throwing-a-party">
<h2>Motivating example: throwing a party</h2>
<p>Bob is planning a housewarming party for his new three-room house.
Each room has a separate sound system, so he wants to play different
music in each room. As a conscientious host, Bob wants everyone to
enjoy the music. Therefore, he needs to distribute albums and guests
to each room in order to ensure that each guest hears their favorite songs.</p>
<p>Our host has invited fifty guests, and he owns thirty albums. He sends
out a survey to each guest asking if they like or dislike each album.
After receiving their responses, he collects the data into a <span class="math">\(50
\times 30\)</span>
binary matrix <span class="math">\(\boldsymbol M\)</span>
, where <span class="math">\(M_{ij}=1\)</span>
if guest <span class="math">\(i\)</span>
likes album <span class="math">\(j\)</span> .</p>
<p>In addition to ensuring everyone is happy with the music, Bob wants to
distribute people and albums evenly among the rooms of his house. All
the guests will not fit in one room, and there should be enough albums
in each room to avoid repetitions. Therefore, Bob decides to bicluster
his data to maximize the following objective function:</p>
<div class="math">
\begin{equation*}
s(\boldsymbol M, \boldsymbol r, \boldsymbol c) = b(\boldsymbol r,
\boldsymbol c) \cdot \sum_{i, j, k} M_{ij} r_{ki} c_{kj}
\end{equation*}
</div>
<p>where <span class="math">\(r_{ki}\)</span>
is an indicator variable for membership of guest
<span class="math">\(i\)</span>
in cluster <span class="math">\(k\)</span>
, <span class="math">\(c_{kj}\)</span>
is an indicator
variable for album membership, and <span class="math">\(b \in [0, 1]\)</span>
penalizes
unbalanced solutions, i.e., those with biclusters of different sizes.
As the difference in sizes between the largest and the smallest
bicluster grows, <span class="math">\(b\)</span>
decays as:</p>
<div class="math">
\begin{equation*}
b(\boldsymbol r, \boldsymbol c) =
\exp \left ( \frac{ - \left ( \max \left ( \mathcal S \right ) -
\min \left ( \mathcal S \right ) \right )}{ \epsilon } \right )
\end{equation*}
</div>
<p>where</p>
<div class="math">
\begin{equation*}
\mathcal S = \left \{ \left (\sum_{i} r_{ki} \right ) \cdot \left
(\sum_{j} c_{kj} \right ) | k = 1, \dots, 3 \right \}
\end{equation*}
</div>
<p>is the set of bicluster sizes, and <span class="math">\(\epsilon > 0\)</span>
is a parameter that
sets the aggressiveness of the penalty.</p>
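<p>Putting the pieces together, the objective is straightforward to
evaluate (a sketch; the value of <cite>eps</cite>, the indicator-matrix
encoding, and the two-cluster toy example are our own choices):</p>

```python
import numpy as np

def party_score(M, r, c, eps=100.0):
    """s(M, r, c): liked (guest, album) pairs captured by the
    biclusters, discounted by the balance penalty b(r, c)."""
    # r is (k x guests), c is (k x albums); both are 0/1 indicators.
    captured = np.einsum('ij,ki,kj->', M, r, c)
    sizes = r.sum(axis=1) * c.sum(axis=1)  # the set S of bicluster sizes
    b = np.exp(-(sizes.max() - sizes.min()) / eps)
    return b * captured

# Tiny example: 4 guests, 4 albums, two perfectly balanced biclusters.
M = np.ones((4, 4))
r = np.array([[1, 1, 0, 0], [0, 0, 1, 1]])
c = np.array([[1, 1, 0, 0], [0, 0, 1, 1]])
score = party_score(M, r, c)
```

Here the biclusters have equal size, so the penalty is
<span class="math">\(b = 1\)</span> and the score is just the number of
captured likes.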
<p>Bob uses the following algorithm to find his solution: starting with a
random assignment of rows and columns to clusters, he reassigns rows
and columns to improve the objective function until convergence. In a
simulated annealing fashion, he allows suboptimal reassignments to
escape local optima. However, this algorithm is not guaranteed to be
optimal. Had Bob wanted the best solution, the naive approach would
require trying every possible clustering, resulting in <span class="math">\(k^{n+p}
= 3^{80}\)</span>
candidate solutions. This suggests that Bob’s problem is
computationally intractable, and in fact most formulations of
biclustering are <span class="caps">NP</span>-complete <a class="footnote-reference" href="#id12" id="id4">[5]</a>.</p>
<p>After biclustering, the rows and columns of the data matrix may be
reordered to show the assignment of guests and albums to rooms. Here
are the original data set (left) and the clusters that Bob found (right):</p>
<img alt="/static/images/party_data.png" src="/static/images/party_data.png" style="width: 40%;" />
<img alt="/static/images/party_result.png" src="/static/images/party_result.png" style="width: 40%;" />
<p>Although not everyone will enjoy every album, clearly this solution
ensures that most guests will enjoy most albums.</p>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>This article introduced biclustering through a contrived example, but
these methods are not limited to throwing great parties. Any data that
can be represented as a matrix is amenable to biclustering. It is a
popular technique for analyzing gene expression data from microarray
experiments. It has also been applied to recommendation systems,
market research, databases, financial data, and agricultural data, as
well as many other problems.</p>
<p>These diverse problems require more sophisticated algorithms than
Bob’s party planning algorithm. His algorithm only works for binary
data and assigns each row and each column to exactly one cluster.
However, many other types of bicluster structures have been proposed.
Those interested in a more detailed overview of the field may be
interested in the surveys <a class="footnote-reference" href="#id11" id="id5">[4]</a>, <a class="footnote-reference" href="#id12" id="id6">[5]</a>, <a class="footnote-reference" href="#id13" id="id7">[6]</a>.</p>
<p>In the following weeks I will introduce some popular biclustering
algorithms as I implement them. The next post will cover spectral biclustering.</p>
</div>
<div class="section" id="references">
<h2>References</h2>
<table class="docutils footnote" frame="void" id="id8" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td>Jain, A. K., Murty, M. N., <span class="amp">&</span> Flynn, <span class="caps">P. J.</span>(1999). <a class="reference external" href="http://www.cs.rutgers.edu/~mlittman/courses/lightai03/jain99data.pdf">Data
clustering: a review</a>.
<em><span class="caps">ACM</span> computing surveys (<span class="caps">CSUR</span>)</em>, 31(3), 264-323.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id9" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id2">[2]</a></td><td>Berkhin, P. (2006). <a class="reference external" href="http://www-static.cc.gatech.edu/fac/Charles.Isbell/classes/reading/papers/berkhin02survey.pdf">A survey of clustering data mining
techniques</a>.
In <em>Grouping multidimensional data</em> (pp. 25-71). Springer
Berlin Heidelberg.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id10" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id3">[3]</a></td><td>Jain, <span class="caps">A. K.</span>(2010). <a class="reference external" href="http://biometrics.cse.msu.edu/Presentations/FuLectureDec5.pdf">Data clustering: 50 years beyond K-means</a>.
<em>Pattern Recognition Letters</em>, 31(8), 651-666.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id11" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id5">[4]</a></td><td>Tanay, A., Sharan, R., <span class="amp">&</span> Shamir, R. (2005). <a class="reference external" href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.8302&rep=rep1&type=pdf">Biclustering
algorithms: A survey</a>.
<em>Handbook of computational molecular biology</em>, 9, 26-1.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id12" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label">[5]</td><td><em>(<a class="fn-backref" href="#id4">1</a>, <a class="fn-backref" href="#id6">2</a>)</em> Madeira, S. C., <span class="amp">&</span> Oliveira, <span class="caps">A. L.</span>(2004). <a class="reference external" href="http://cs-people.bu.edu/panagpap/Research/Bio/bicluster_survey.pdf">Biclustering
algorithms for biological data analysis: a survey</a>.
<em>Computational Biology and Bioinformatics, <span class="caps">IEEE</span>/<span class="caps">ACM</span>
Transactions on</em>, 1(1), 24-45.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id13" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id7">[6]</a></td><td>Busygin, S., Prokopyev, O., <span class="amp">&</span> Pardalos, <span class="caps">P. M.</span>(2008).
<a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S0305054807000159">Biclustering in data mining</a>.
<em>Computers <span class="amp">&</span> Operations Research</em>, 35(9), 2964-2987.</td></tr>
</tbody>
</table>
</div>
Google Summer of Code 20132013-06-06T15:02:00-04:00Kemal Erentag:kemaleren.com,2013-06-06:google-summer-of-code-2013.html<p>This summer I will be participating in the <a class="reference external" href="http://www.google-melange.com/gsoc/homepage/google/gsoc2013">Google Summer of Code</a> by
implementing biclustering algorithms for <a class="reference external" href="http://scikit-learn.org/">scikit-learn</a>. As required by the Python Foundation, I
will be reporting my progress on this blog.</p>
Biclustering paper accepted2012-05-21T17:30:00-04:00Kemal Erentag:kemaleren.com,2012-05-21:biclustering-paper-accepted.html<p>Our paper, <em>A Comparative Analysis of Biclustering Algorithms for Gene
Expression Data</em> has been accepted by <a class="reference external" href="http://bib.oxfordjournals.org/">Briefings in Bioinformatics</a>.</p>
New position at Heidelberg University2012-03-29T16:30:00-04:00Kemal Erentag:kemaleren.com,2012-03-29:new-position-at-heidelberg-university.html<p>On 7 March 2012 I successfully defended my Master’s Thesis,
<em>Application of biclustering algorithms to biological data</em>, which
will soon be available at the <a class="reference external" href="http://etd.ohiolink.edu/">OhioLINK <span class="caps">ETD</span> Center</a>.</p>
<p>Next month I will officially begin a new position in the
<a class="reference external" href="http://hci.iwr.uni-heidelberg.de/MIP/">Multidimensional Image Processing lab</a> at <a class="reference external" href="http://www.uni-heidelberg.de/">Heidelberg University</a>.</p>
New software: BiBench2011-12-13T04:44:00-05:00Kemal Erentag:kemaleren.com,2011-12-13:new-software-bibench.html<p>My associates and I at the <span class="caps">HPC</span> Lab have just released some new
software: <a class="reference external" href="http://bmi.osu.edu/hpc/software/bibench/index.html">BiBench</a>, a Python
package for biclustering. In addition to lots of other functionality,
it provides a common interface to twelve biclustering algorithms,
including our own <a class="reference external" href="http://bmi.osu.edu/hpc/software/cpb/index.html"><span class="caps">CPB</span></a> algorithm.</p>
<p>BiBench makes biclustering gene expression datasets easy. In only a
few lines of code you can download a <a class="reference external" href="http://www.ncbi.nlm.nih.gov/geo/"><span class="caps">GDS</span></a> dataset, bicluster it, and
perform <a class="reference external" href="http://www.geneontology.org/"><span class="caps">GO</span></a> enrichment analysis on
the genes of the resulting biclusters. Though we developed BiBench to
specifically target gene expression data, it is useful for any data
that may be meaningfully biclustered.</p>
<p>A download link, installation instructions, a tutorial, and complete
documentation of all of BiBench’s modules are available at <a class="reference external" href="http://bmi.osu.edu/hpc/software/bibench/index.html">the
project page</a>.</p>
We got runner up for best poster at ACM BCB 20112011-08-07T13:14:00-04:00Kemal Erentag:kemaleren.com,2011-08-07:we-got-runner-up-for-best-poster-at-acm-bcb-2011.html<p>I just got back from presenting a poster at the <span class="caps">ACM</span> Conference on
Bioinformatics, Computational Biology and Biomedicine in Chicago, <span class="caps">IL</span>. We
were voted runner up for best poster! The poster will soon be available
at the <a class="reference external" href="http://bmi.osu.edu/hpc/"><span class="caps">HPC</span> lab website</a>.</p>