# Hierarchical cost¶

 dasgupta_cost(tree, edge_weights, leaf_graph) Dasgupta’s cost is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph. dendrogram_purity(tree, leaf_labels) Weighted average of the purity of each node of the tree with respect to a ground truth labelization of the tree leaves. tree_sampling_divergence(tree, edge_weights, …) Tree sampling divergence is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph.
dasgupta_cost(tree, edge_weights, leaf_graph)[source]

Dasgupta’s cost is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph.

Let $$T$$ be a tree representing a hierarchical clustering of the graph $$G=(V, E)$$. Let $$w$$ be a dissimilarity function on the edges $$E$$ of the graph.

The Dasgupta’s cost is define as:

$dasgupta(T, V, E, w) = \sum_{\{x,y\}\in E} \frac{area(lca_T(x,y))}{w(\{x,y\})}$
See

S. Dasgupta. “A cost function for similarity-based hierarchical clustering .” In Proc. STOC, pages 118–127, Cambridge, MA, USA, 2016

Complexity

The runtime complexity is $$\mathcal{O}(n\log(n) + m)$$ with $$n$$ the number of nodes in $$T$$ and $$m$$ the number of edges in $$E$$.

Parameters
• tree – Input tree

• edge_weights – Edge weights on the leaf graph (dissimilarities)

• leaf_graph – Leaf graph of the input tree (deduced from CptHierarchy)

Returns

a real number

dendrogram_purity(tree, leaf_labels)[source]

Weighted average of the purity of each node of the tree with respect to a ground truth labelization of the tree leaves.

Let $$T$$ be a tree with leaves $$V=\{1, \ldots, n\}$$. Let $$C=\{C_1, \ldots, C_K\}$$ be a partition of $$V$$ into $$k$$ (label) sets.

The purity of a subset $$X$$ of $$V$$ with respect to class $$C_\ell\in C$$ is the fraction of elements of $$X$$ that belongs to class $$C_\ell$$:

$pur(X, C_\ell) = \frac{| X \cap C_\ell |}{| X |}.$

The purity of $$T$$ is the defined as:

$pur(T) = \frac{1}{Z}\sum_{k=1}^{K}\sum_{x,y\in C_k, x\neq y} pur(lca_T(x,y), C_k)$

with $$Z=| \{\{x,y\} \subseteq V \mid x\neq y, \exists k, \{x,y\}\subseteq C_k\} |$$.

See

Heller, Katherine A., and Zoubin Ghahramani. “Bayesian hierarchical clustering .” Proc. ICML. ACM, 2005.

Complexity

The dendrogram purity is computed in $$\mathcal{O}(N\times K \times C^2)$$ with $$N$$ the number of nodes in the tree, $$K$$ the number of classes, and $$C$$ the maximal number of children of a node in the tree.

Parameters
• tree – input tree

• leaf_labels – a 1d integral array of length tree.num_leaves()

Returns

a score between 0 and 1 (higher is better)

tree_sampling_divergence(tree, edge_weights, leaf_graph)[source]

Tree sampling divergence is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph. It measures how well the given edge weighted graph can be reconstructed from the tree alone. It is equal to 0 if and only if the given graph can be fully recovered from the tree.

It is defined as the Kullback-Leibler divergence between the edge sampling model $$p$$ and the independent (null) sampling model $$q$$ of the nodes of a tree (see attribute_tree_sampling_probability()).

The tree sampling divergence on a tree $$T$$ is then

$TSD(T) = \sum_{x \in T} p(x) \log\frac{p(x)}{q(x)}$

The definition of the tree sampling divergence was proposed in:

Complexity

The tree sampling divergence is computed in $$\mathcal{O}(N (\log(N) + C^2) + M)$$ with $$N$$ the number of nodes in the tree, $$M$$ the number of edges in the leaf graph, and $$C$$ the maximal number of children of a node in the tree.

Parameters
• tree – Input tree

• edge_weights – Edge weights on the leaf graph (similarities)

• leaf_graph – Leaf graph of the input tree (deduced from CptHierarchy)

Returns

a real number