Hierarchical cost¶

Dasgupta’s cost is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph. 

Weighted average of the purity of each node of the tree with respect to a ground truth labelization of the tree leaves. 

Tree sampling divergence is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph. 

dasgupta_cost
(tree, edge_weights, leaf_graph)[source]¶ Dasgupta’s cost is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph.
Let \(T\) be a tree representing a hierarchical clustering of the graph \(G=(V, E)\). Let \(w\) be a dissimilarity function on the edges \(E\) of the graph.
The Dasgupta’s cost is define as:
\[dasgupta(T, V, E, w) = \sum_{\{x,y\}\in E} \frac{area(lca_T(x,y))}{w(\{x,y\})}\] See
S. Dasgupta. “A cost function for similaritybased hierarchical clustering .” In Proc. STOC, pages 118–127, Cambridge, MA, USA, 2016
 Complexity
The runtime complexity is \(\mathcal{O}(n\log(n) + m)\) with \(n\) the number of nodes in \(T\) and \(m\) the number of edges in \(E\).
 Parameters
tree – Input tree
edge_weights – Edge weights on the leaf graph (dissimilarities)
leaf_graph – Leaf graph of the input tree (deduced from
CptHierarchy
)
 Returns
a real number

dendrogram_purity
(tree, leaf_labels)[source]¶ Weighted average of the purity of each node of the tree with respect to a ground truth labelization of the tree leaves.
Let \(T\) be a tree with leaves \(V=\{1, \ldots, n\}\). Let \(C=\{C_1, \ldots, C_K\}\) be a partition of \(V\) into \(k\) (label) sets.
The purity of a subset \(X\) of \(V\) with respect to class \(C_\ell\in C\) is the fraction of elements of \(X\) that belongs to class \(C_\ell\):
\[pur(X, C_\ell) = \frac{ X \cap C_\ell }{ X }.\]The purity of \(T\) is the defined as:
\[pur(T) = \frac{1}{Z}\sum_{k=1}^{K}\sum_{x,y\in C_k, x\neq y} pur(lca_T(x,y), C_k)\]with \(Z= \{\{x,y\} \subseteq V \mid x\neq y, \exists k, \{x,y\}\subseteq C_k\} \).
 See
Heller, Katherine A., and Zoubin Ghahramani. “Bayesian hierarchical clustering .” Proc. ICML. ACM, 2005.
 Complexity
The dendrogram purity is computed in \(\mathcal{O}(N\times K \times C^2)\) with \(N\) the number of nodes in the tree, \(K\) the number of classes, and \(C\) the maximal number of children of a node in the tree.
 Parameters
tree – input tree
leaf_labels – a 1d integral array of length tree.num_leaves()
 Returns
a score between 0 and 1 (higher is better)

tree_sampling_divergence
(tree, edge_weights, leaf_graph)[source]¶ Tree sampling divergence is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph. It measures how well the given edge weighted graph can be reconstructed from the tree alone. It is equal to 0 if and only if the given graph can be fully recovered from the tree.
It is defined as the KullbackLeibler divergence between the edge sampling model \(p\) and the independent (null) sampling model \(q\) of the nodes of a tree (see
attribute_tree_sampling_probability()
).The tree sampling divergence on a tree \(T\) is then
\[TSD(T) = \sum_{x \in T} p(x) \log\frac{p(x)}{q(x)}\]The definition of the tree sampling divergence was proposed in:
Charpentier, B. & Bonald, T. (2019). “Tree Sampling Divergence: An InformationTheoretic Metric for Hierarchical Graph Clustering.” Proceedings of IJCAI.
 Complexity
The tree sampling divergence is computed in \(\mathcal{O}(N (\log(N) + C^2) + M)\) with \(N\) the number of nodes in the tree, \(M\) the number of edges in the leaf graph, and \(C\) the maximal number of children of a node in the tree.
 Parameters
tree – Input tree
edge_weights – Edge weights on the leaf graph (similarities)
leaf_graph – Leaf graph of the input tree (deduced from
CptHierarchy
)
 Returns
a real number