Hierarchical cost
- dasgupta_cost: Dasgupta’s cost is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph.
- dendrogram_purity: Weighted average of the purity of each node of the tree with respect to a ground truth labeling of the tree leaves.
- tree_sampling_divergence: Tree sampling divergence is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph.
dasgupta_cost(tree, edge_weights, leaf_graph)

Dasgupta’s cost is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph.
Let \(T\) be a tree representing a hierarchical clustering of the graph \(G=(V, E)\). Let \(w\) be a dissimilarity function on the edges \(E\) of the graph.
Dasgupta’s cost is defined as:
\[dasgupta(T, V, E, w) = \sum_{\{x,y\}\in E} \frac{area(lca_T(x,y))}{w(\{x,y\})}\]
- See:
S. Dasgupta. “A cost function for similarity-based hierarchical clustering.” In Proc. STOC, pages 118–127, Cambridge, MA, USA, 2016.
- Complexity:
The runtime complexity is \(\mathcal{O}(n\log(n) + m)\) with \(n\) the number of nodes in \(T\) and \(m\) the number of edges in \(E\).
- Parameters:
tree – Input tree
edge_weights – Edge weights on the leaf graph (dissimilarities)
leaf_graph – Leaf graph of the input tree (deduced from CptHierarchy)
- Returns:
a real number (lower is better)
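Example (a minimal usage sketch, not part of the reference: it assumes Higra is installed and imported as hg, and builds the hierarchy with hg.bpt_canonical; the graph and weights below are purely illustrative):

    import higra as hg
    import numpy as np

    # Small edge-weighted graph: 4 vertices, weights are dissimilarities.
    graph = hg.UndirectedGraph(4)
    graph.add_edges((0, 0, 1, 2), (1, 2, 3, 3))
    edge_weights = np.array([1.0, 4.0, 3.0, 1.0])

    # Canonical binary partition tree of the graph; bpt_canonical attaches
    # the graph to the tree (CptHierarchy), so leaf_graph can be deduced.
    tree, altitudes = hg.bpt_canonical(graph, edge_weights)

    cost = hg.dasgupta_cost(tree, edge_weights)  # leaf_graph deduced from tree
    print(cost)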
dendrogram_purity(tree, leaf_labels)

Weighted average of the purity of each node of the tree with respect to a ground truth labeling of the tree leaves.
Let \(T\) be a tree with leaves \(V=\{1, \ldots, n\}\). Let \(C=\{C_1, \ldots, C_K\}\) be a partition of \(V\) into \(K\) (label) sets.
The purity of a subset \(X\) of \(V\) with respect to class \(C_\ell\in C\) is the fraction of elements of \(X\) that belong to class \(C_\ell\):
\[pur(X, C_\ell) = \frac{| X \cap C_\ell |}{| X |}.\]
The purity of \(T\) is then defined as:
\[pur(T) = \frac{1}{Z}\sum_{k=1}^{K}\sum_{x,y\in C_k, x\neq y} pur(lca_T(x,y), C_k)\]
with \(Z=| \{\{x,y\} \subseteq V \mid x\neq y, \exists k, \{x,y\}\subseteq C_k\} |\), the number of distinct intra-class pairs of leaves.
- See:
Heller, Katherine A., and Zoubin Ghahramani. “Bayesian hierarchical clustering.” Proc. ICML. ACM, 2005.
- Complexity:
The dendrogram purity is computed in \(\mathcal{O}(N\times K \times C^2)\) with \(N\) the number of nodes in the tree, \(K\) the number of classes, and \(C\) the maximal number of children of a node in the tree.
- Parameters:
tree – input tree
leaf_labels – a 1d integer array of length tree.num_leaves()
- Returns:
a score between 0 and 1 (higher is better)
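Example (a minimal usage sketch, not part of the reference: it assumes Higra is imported as hg; the tree and labels are purely illustrative). The tree is given by its parent array, with leaves 0 to 3, internal nodes 4 and 5, and root 6:

    import higra as hg
    import numpy as np

    # Leaves 0..3; node 4 covers {0, 1}, node 5 covers {2, 3}, 6 is the root.
    tree = hg.Tree((4, 4, 5, 5, 6, 6, 6))

    # Each class is exactly a subtree: perfect purity.
    print(hg.dendrogram_purity(tree, np.array([0, 0, 1, 1])))  # 1.0

    # Every intra-class lca is the root, whose leaves are half in each class.
    print(hg.dendrogram_purity(tree, np.array([0, 1, 0, 1])))  # 0.5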
tree_sampling_divergence(tree, edge_weights, leaf_graph)

Tree sampling divergence is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph. It measures how well the given edge weighted graph can be reconstructed from the tree alone. It is equal to 0 if and only if the given graph can be fully recovered from the tree.
It is defined as the Kullback-Leibler divergence between the edge sampling model \(p\) and the independent (null) sampling model \(q\) of the nodes of a tree (see attribute_tree_sampling_probability()). The tree sampling divergence on a tree \(T\) is then:
\[TSD(T) = \sum_{x \in T} p(x) \log\frac{p(x)}{q(x)}\]
The definition of the tree sampling divergence was proposed in:
Charpentier, B. & Bonald, T. (2019). “Tree Sampling Divergence: An Information-Theoretic Metric for Hierarchical Graph Clustering.” Proceedings of IJCAI.
- Complexity:
The tree sampling divergence is computed in \(\mathcal{O}(N (\log(N) + C^2) + M)\) with \(N\) the number of nodes in the tree, \(M\) the number of edges in the leaf graph, and \(C\) the maximal number of children of a node in the tree.
- Parameters:
tree – Input tree
edge_weights – Edge weights on the leaf graph (similarities)
leaf_graph – Leaf graph of the input tree (deduced from CptHierarchy)
- Returns:
a real number
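Example (a minimal usage sketch, not part of the reference: it assumes Higra is imported as hg; the graph, the similarity weights, and the similarity-to-dissimilarity conversion are purely illustrative):

    import higra as hg
    import numpy as np

    # Small graph; for tree sampling divergence the weights are similarities.
    graph = hg.UndirectedGraph(4)
    graph.add_edges((0, 0, 1, 2), (1, 2, 3, 3))
    similarities = np.array([4.0, 1.0, 1.0, 4.0])

    # bpt_canonical expects dissimilarities, so invert the similarities
    # (one simple conversion among others, for illustration only).
    tree, altitudes = hg.bpt_canonical(graph, 1.0 / similarities)

    # TSD itself is computed with the similarity weights; leaf_graph is
    # deduced from the tree (CptHierarchy).
    print(hg.tree_sampling_divergence(tree, similarities))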