Hierarchical cost

- dasgupta_cost — Dasgupta's cost is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph.
- dendrogram_purity — Weighted average of the purity of each node of the tree with respect to a ground truth labeling of the tree leaves.
- tree_sampling_divergence — Tree sampling divergence is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph.
- dasgupta_cost(tree, edge_weights, leaf_graph, mode='dissimilarity')[source]
Dasgupta’s cost is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph.
Let \(T\) be a tree representing a hierarchical clustering of the graph \(G=(V, E)\).
If \(w\) is a dissimilarity function on the edges \(E\) of the graph (mode is equal to "dissimilarity"), then Dasgupta's cost is defined as:\[dasgupta(T, V, E, w) = \sum_{\{x,y\}\in E} \frac{area(lca_T(x,y))}{w(\{x,y\})}\]where \(area\) is the area of a node in the tree and \(lca_T\) is the lowest common ancestor of two nodes.
If \(w\) is a similarity function on the edges \(E\) of the graph (mode is equal to "similarity"), then Dasgupta's cost is defined as:\[dasgupta(T, V, E, w) = \sum_{\{x,y\}\in E} area(lca_T(x,y)) \times w(\{x,y\})\]- See:
S. Dasgupta. "A cost function for similarity-based hierarchical clustering." In Proc. STOC, pages 118–127, Cambridge, MA, USA, 2016.
- Complexity:
The runtime complexity is \(\mathcal{O}(n\log(n) + m)\) with \(n\) the number of nodes in \(T\) and \(m\) the number of edges in \(E\).
- Parameters:
tree – Input tree
edge_weights – Edge weights on the leaf graph of the input tree
leaf_graph – Leaf graph of the input tree (deduced from CptHierarchy)
mode – "dissimilarity" or "similarity" (default: "dissimilarity")
- Returns:
a real number
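The dissimilarity-mode formula above can be sketched in plain Python. This is an illustrative reimplementation for a toy example, not the library's optimized code; it assumes a Higra-style parent array (leaves first, children stored before their parents, root last and its own parent), with the area of a node taken as the number of leaves in its subtree:

```python
# Illustrative sketch of Dasgupta's cost in "dissimilarity" mode.
# Assumes a parent-array tree: leaves 0..num_leaves-1, internal nodes
# appear after all of their children, and the root is its own parent.

def dasgupta_cost(parents, num_leaves, edges, weights):
    n = len(parents)
    # area of a node = number of leaves in its subtree,
    # accumulated bottom-up (children come before parents)
    area = [1 if i < num_leaves else 0 for i in range(n)]
    for i in range(n - 1):  # skip the root (last node)
        area[parents[i]] += area[i]

    def lca(x, y):
        # naive lowest common ancestor by walking to the root
        anc = set()
        while True:
            anc.add(x)
            if parents[x] == x:
                break
            x = parents[x]
        while y not in anc:
            y = parents[y]
        return y

    # sum over graph edges of area(lca) / dissimilarity
    return sum(area[lca(x, y)] / w for (x, y), w in zip(edges, weights))


# Toy example: 4 leaves, two binary clusters {0,1} and {2,3} under a root.
parents = [4, 4, 5, 5, 6, 6, 6]
edges = [(0, 1), (2, 3), (1, 2)]
weights = [1.0, 1.0, 4.0]
cost = dasgupta_cost(parents, 4, edges, weights)  # 2/1 + 2/1 + 4/4 = 5.0
```

Cheap edges (low dissimilarity) merged low in the tree contribute little, while the expensive cross-cluster edge is divided by its large dissimilarity, which is exactly what the cost rewards.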
- dendrogram_purity(tree, leaf_labels)[source]
Weighted average of the purity of each node of the tree with respect to a ground truth labeling of the tree leaves.
Let \(T\) be a tree with leaves \(V=\{1, \ldots, n\}\). Let \(C=\{C_1, \ldots, C_K\}\) be a partition of \(V\) into \(K\) (label) sets.
The purity of a subset \(X\) of \(V\) with respect to class \(C_\ell\in C\) is the fraction of elements of \(X\) that belong to class \(C_\ell\):
\[pur(X, C_\ell) = \frac{| X \cap C_\ell |}{| X |}.\]The purity of \(T\) is then defined as:
\[pur(T) = \frac{1}{Z}\sum_{k=1}^{K}\sum_{x,y\in C_k, x\neq y} pur(lca_T(x,y), C_k)\]with \(Z=| \{\{x,y\} \subseteq V \mid x\neq y, \exists k, \{x,y\}\subseteq C_k\} |\).
- See:
Heller, Katherine A., and Zoubin Ghahramani. "Bayesian hierarchical clustering." Proc. ICML. ACM, 2005.
- Complexity:
The dendrogram purity is computed in \(\mathcal{O}(N\times K \times C^2)\) with \(N\) the number of nodes in the tree, \(K\) the number of classes, and \(C\) the maximal number of children of a node in the tree.
- Parameters:
tree – input tree
leaf_labels – a 1d integer array of length tree.num_leaves()
- Returns:
a score between 0 and 1 (higher is better)
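The definition above can be evaluated directly from the formula. The sketch below is a naive quadratic illustration, not the library's optimized algorithm; it again assumes a parent-array tree with children stored before their parents:

```python
# Naive sketch of dendrogram purity: average, over all same-class leaf
# pairs, of the purity of the leaf set of their lowest common ancestor.
# Assumes a parent-array tree (leaves first, children before parents).

def dendrogram_purity(parents, num_leaves, leaf_labels):
    n = len(parents)
    # leaf set of each node, accumulated bottom-up
    leaves = [set() for _ in range(n)]
    for i in range(num_leaves):
        leaves[i].add(i)
    for i in range(n - 1):  # skip the root
        leaves[parents[i]] |= leaves[i]

    def lca(x, y):
        anc = set()
        while True:
            anc.add(x)
            if parents[x] == x:
                break
            x = parents[x]
        while y not in anc:
            y = parents[y]
        return y

    total, count = 0.0, 0
    for x in range(num_leaves):
        for y in range(x + 1, num_leaves):
            if leaf_labels[x] == leaf_labels[y]:
                X = leaves[lca(x, y)]
                same = sum(1 for v in X if leaf_labels[v] == leaf_labels[x])
                total += same / len(X)  # pur(lca leaf set, class of the pair)
                count += 1
    return total / count  # count plays the role of Z


# Tree with clusters {0,1} and {2,3}: labels matching the clusters give
# purity 1.0; interleaved labels force every same-class pair to the root.
parents = [4, 4, 5, 5, 6, 6, 6]
perfect = dendrogram_purity(parents, 4, [0, 0, 1, 1])  # 1.0
mixed = dendrogram_purity(parents, 4, [0, 1, 0, 1])    # 0.5
```

When the tree's clusters align with the labels, every same-class pair meets in a pure node and the score is 1; when labels are interleaved, pairs only meet at the impure root and the score drops.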
- tree_sampling_divergence(tree, edge_weights, leaf_graph)[source]
Tree sampling divergence is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph. It measures how well the given edge weighted graph can be reconstructed from the tree alone. It is equal to 0 if and only if the given graph can be fully recovered from the tree.
It is defined as the Kullback-Leibler divergence between the edge sampling model \(p\) and the independent (null) sampling model \(q\) of the nodes of a tree (see attribute_tree_sampling_probability()). The tree sampling divergence on a tree \(T\) is then
\[TSD(T) = \sum_{x \in T} p(x) \log\frac{p(x)}{q(x)}\]The definition of the tree sampling divergence was proposed in:
Charpentier, B. & Bonald, T. (2019). “Tree Sampling Divergence: An Information-Theoretic Metric for Hierarchical Graph Clustering.” Proceedings of IJCAI.
- Complexity:
The tree sampling divergence is computed in \(\mathcal{O}(N (\log(N) + C^2) + M)\) with \(N\) the number of nodes in the tree, \(M\) the number of edges in the leaf graph, and \(C\) the maximal number of children of a node in the tree.
- Parameters:
tree – Input tree
edge_weights – Edge weights on the leaf graph (similarities)
leaf_graph – Leaf graph of the input tree (deduced from
CptHierarchy
)
- Returns:
a real number
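A naive sketch following the definitions in Charpentier & Bonald (2019) is given below. It is illustrative only (the library's implementation is the optimized one whose complexity is stated above) and assumes a parent-array tree: \(p\) aggregates normalized edge weights at the lca of each edge, and \(q\) aggregates the degree-based null model over all leaf pairs:

```python
import math

# Illustrative sketch of tree sampling divergence.
# Assumes a parent-array tree (leaves first, children before parents).
# p(node): probability that a weight-sampled edge has its lca at node.
# q(node): same probability when the two endpoints are drawn
#          independently, proportionally to their weighted degrees.

def tree_sampling_divergence(parents, num_leaves, edges, weights):
    n = len(parents)
    W = sum(weights)  # total edge weight
    degree = [0.0] * num_leaves
    for (x, y), w in zip(edges, weights):
        degree[x] += w
        degree[y] += w

    def lca(x, y):
        anc = set()
        while True:
            anc.add(x)
            if parents[x] == x:
                break
            x = parents[x]
        while y not in anc:
            y = parents[y]
        return y

    p = [0.0] * n
    q = [0.0] * n
    for (x, y), w in zip(edges, weights):
        p[lca(x, y)] += w / W
    for x in range(num_leaves):
        for y in range(x + 1, num_leaves):
            # each unordered pair counted once, hence the factor 2
            q[lca(x, y)] += 2 * (degree[x] / (2 * W)) * (degree[y] / (2 * W))

    # KL divergence between p and q over tree nodes
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


# Minimal example: a single edge between two leaves under one root.
tsd = tree_sampling_divergence([2, 2, 2], 2, [(0, 1)], [1.0])  # log(2)
```

For the single-edge graph, the edge model puts all mass on the root while the null model only puts half there, giving \(1 \cdot \log(1/0.5) = \log 2\); richer graphs score higher when the tree concentrates edge mass where the null model does not expect it.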