Hierarchical cost

dasgupta_cost(tree, edge_weights, leaf_graph)

Dasgupta's cost is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph.

dendrogram_purity(tree, leaf_labels)

Weighted average of the purity of each node of the tree with respect to a ground truth labelization of the tree leaves.

tree_sampling_divergence(tree, edge_weights, ...)

Tree sampling divergence is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph.

dasgupta_cost(tree, edge_weights, leaf_graph, mode='dissimilarity')[source]

Dasgupta’s cost is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph.

Let \(T\) be a tree representing a hierarchical clustering of the graph \(G=(V, E)\).

If \(w\) is a dissimilarity function on the edges \(E\) of the graph (mode is equal to "dissimilarity"), then the Dasgupta’s cost is defined as:

\[dasgupta(T, V, E, w) = \sum_{\{x,y\}\in E} \frac{area(lca_T(x,y))}{w(\{x,y\})}\]

where \(area\) is the area of a node in the tree and \(lca_T\) is the lowest common ancestor of two nodes.

If \(w\) is a similarity function on the edges \(E\) of the graph (mode is equal to "similarity"), then the Dasgupta’s cost is defined as:

\[dasgupta(T, V, E, w) = \sum_{\{x,y\}\in E} area(lca_T(x,y)) \times w(\{x,y\})\]
See:

S. Dasgupta. “A cost function for similarity-based hierarchical clustering .” In Proc. STOC, pages 118–127, Cambridge, MA, USA, 2016

Complexity:

The runtime complexity is \(\mathcal{O}(n\log(n) + m)\) with \(n\) the number of nodes in \(T\) and \(m\) the number of edges in \(E\).

Parameters:
  • tree – Input tree

  • edge_weights – Edge weights on the leaf graph of the input tree

  • leaf_graph – Leaf graph of the input tree (deduced from CptHierarchy)

  • mode – “dissimilarity” or “similarity” (default: “dissimilarity”)

Returns:

a real number

dendrogram_purity(tree, leaf_labels)[source]

Weighted average of the purity of each node of the tree with respect to a ground truth labelization of the tree leaves.

Let \(T\) be a tree with leaves \(V=\{1, \ldots, n\}\). Let \(C=\{C_1, \ldots, C_K\}\) be a partition of \(V\) into \(k\) (label) sets.

The purity of a subset \(X\) of \(V\) with respect to class \(C_\ell\in C\) is the fraction of elements of \(X\) that belongs to class \(C_\ell\):

\[pur(X, C_\ell) = \frac{| X \cap C_\ell |}{| X |}.\]

The purity of \(T\) is the defined as:

\[pur(T) = \frac{1}{Z}\sum_{k=1}^{K}\sum_{x,y\in C_k, x\neq y} pur(lca_T(x,y), C_k)\]

with \(Z=| \{\{x,y\} \subseteq V \mid x\neq y, \exists k, \{x,y\}\subseteq C_k\} |\).

See:

Heller, Katherine A., and Zoubin Ghahramani. “Bayesian hierarchical clustering .” Proc. ICML. ACM, 2005.

Complexity:

The dendrogram purity is computed in \(\mathcal{O}(N\times K \times C^2)\) with \(N\) the number of nodes in the tree, \(K\) the number of classes, and \(C\) the maximal number of children of a node in the tree.

Parameters:
  • tree – input tree

  • leaf_labels – a 1d integral array of length tree.num_leaves()

Returns:

a score between 0 and 1 (higher is better)

tree_sampling_divergence(tree, edge_weights, leaf_graph)[source]

Tree sampling divergence is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph. It measures how well the given edge weighted graph can be reconstructed from the tree alone. It is equal to 0 if and only if the given graph can be fully recovered from the tree.

It is defined as the Kullback-Leibler divergence between the edge sampling model \(p\) and the independent (null) sampling model \(q\) of the nodes of a tree (see attribute_tree_sampling_probability()).

The tree sampling divergence on a tree \(T\) is then

\[TSD(T) = \sum_{x \in T} p(x) \log\frac{p(x)}{q(x)}\]

The definition of the tree sampling divergence was proposed in:

Complexity:

The tree sampling divergence is computed in \(\mathcal{O}(N (\log(N) + C^2) + M)\) with \(N\) the number of nodes in the tree, \(M\) the number of edges in the leaf graph, and \(C\) the maximal number of children of a node in the tree.

Parameters:
  • tree – Input tree

  • edge_weights – Edge weights on the leaf graph (similarities)

  • leaf_graph – Leaf graph of the input tree (deduced from CptHierarchy)

Returns:

a real number