# Hierarchical cost¶

 dasgupta_cost(tree, edge_weights, leaf_graph) Dasgupta’s cost is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph. dendrogram_purity(tree, leaf_labels) Weighted average of the purity of each node of the tree with respect to a ground truth labelization of the tree leaves. tree_sampling_divergence(tree, edge_weights, …) Tree sampling divergence is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph.
dasgupta_cost(tree, edge_weights, leaf_graph, mode='dissimilarity')[source]

Dasgupta’s cost is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph.

Let $$T$$ be a tree representing a hierarchical clustering of the graph $$G=(V, E)$$.

If $$w$$ is a dissimilarity function on the edges $$E$$ of the graph (mode is equal to "dissimilarity"), then the Dasgupta’s cost is defined as:

$dasgupta(T, V, E, w) = \sum_{\{x,y\}\in E} \frac{area(lca_T(x,y))}{w(\{x,y\})}$

where $$area$$ is the area of a node in the tree and $$lca_T$$ is the lowest common ancestor of two nodes.

If $$w$$ is a similarity function on the edges $$E$$ of the graph (mode is equal to "similarity"), then the Dasgupta’s cost is defined as:

$dasgupta(T, V, E, w) = \sum_{\{x,y\}\in E} area(lca_T(x,y)) \times w(\{x,y\})$
See:

S. Dasgupta. “A cost function for similarity-based hierarchical clustering .” In Proc. STOC, pages 118–127, Cambridge, MA, USA, 2016

Complexity:

The runtime complexity is $$\mathcal{O}(n\log(n) + m)$$ with $$n$$ the number of nodes in $$T$$ and $$m$$ the number of edges in $$E$$.

Parameters:
• tree – Input tree

• edge_weights – Edge weights on the leaf graph of the input tree

• leaf_graph – Leaf graph of the input tree (deduced from CptHierarchy)

• mode – “dissimilarity” or “similarity” (default: “dissimilarity”)

Returns:

a real number

dendrogram_purity(tree, leaf_labels)[source]

Weighted average of the purity of each node of the tree with respect to a ground truth labelization of the tree leaves.

Let $$T$$ be a tree with leaves $$V=\{1, \ldots, n\}$$. Let $$C=\{C_1, \ldots, C_K\}$$ be a partition of $$V$$ into $$k$$ (label) sets.

The purity of a subset $$X$$ of $$V$$ with respect to class $$C_\ell\in C$$ is the fraction of elements of $$X$$ that belongs to class $$C_\ell$$:

$pur(X, C_\ell) = \frac{| X \cap C_\ell |}{| X |}.$

The purity of $$T$$ is the defined as:

$pur(T) = \frac{1}{Z}\sum_{k=1}^{K}\sum_{x,y\in C_k, x\neq y} pur(lca_T(x,y), C_k)$

with $$Z=| \{\{x,y\} \subseteq V \mid x\neq y, \exists k, \{x,y\}\subseteq C_k\} |$$.

See:

Heller, Katherine A., and Zoubin Ghahramani. “Bayesian hierarchical clustering .” Proc. ICML. ACM, 2005.

Complexity:

The dendrogram purity is computed in $$\mathcal{O}(N\times K \times C^2)$$ with $$N$$ the number of nodes in the tree, $$K$$ the number of classes, and $$C$$ the maximal number of children of a node in the tree.

Parameters:
• tree – input tree

• leaf_labels – a 1d integral array of length tree.num_leaves()

Returns:

a score between 0 and 1 (higher is better)

tree_sampling_divergence(tree, edge_weights, leaf_graph)[source]

Tree sampling divergence is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph. It measures how well the given edge weighted graph can be reconstructed from the tree alone. It is equal to 0 if and only if the given graph can be fully recovered from the tree.

It is defined as the Kullback-Leibler divergence between the edge sampling model $$p$$ and the independent (null) sampling model $$q$$ of the nodes of a tree (see attribute_tree_sampling_probability()).

The tree sampling divergence on a tree $$T$$ is then

$TSD(T) = \sum_{x \in T} p(x) \log\frac{p(x)}{q(x)}$

The definition of the tree sampling divergence was proposed in:

Complexity:

The tree sampling divergence is computed in $$\mathcal{O}(N (\log(N) + C^2) + M)$$ with $$N$$ the number of nodes in the tree, $$M$$ the number of edges in the leaf graph, and $$C$$ the maximal number of children of a node in the tree.

Parameters:
• tree – Input tree

• edge_weights – Edge weights on the leaf graph (similarities)

• leaf_graph – Leaf graph of the input tree (deduced from CptHierarchy)

Returns:

a real number