Hierarchical cost¶

`dasgupta_cost`(tree, edge_weights, leaf_graph)	Dasgupta’s cost is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph.
`dendrogram_purity`(tree, leaf_labels)	Weighted average of the purity of each node of the tree with respect to a ground truth labelization of the tree leaves.
`tree_sampling_divergence`(tree, edge_weights, …)	Tree sampling divergence is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph.

dasgupta_cost(tree, edge_weights, leaf_graph, mode='dissimilarity')[source]¶

Dasgupta’s cost is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph.

Let \(T\) be a tree representing a hierarchical clustering of the graph \(G=(V, E)\).

If \(w\) is a dissimilarity function on the edges \(E\) of the graph (mode is equal to "dissimilarity"), then the Dasgupta’s cost is defined as:

\[dasgupta(T, V, E, w) = \sum_{\{x,y\}\in E} \frac{area(lca_T(x,y))}{w(\{x,y\})}\]

where \(area\) is the area of a node in the tree and \(lca_T\) is the lowest common ancestor of two nodes.

If \(w\) is a similarity function on the edges \(E\) of the graph (mode is equal to "similarity"), then the Dasgupta’s cost is defined as:

\[dasgupta(T, V, E, w) = \sum_{\{x,y\}\in E} area(lca_T(x,y)) \times w(\{x,y\})\]

See:: S. Dasgupta. “A cost function for similarity-based hierarchical clustering .” In Proc. STOC, pages 118–127, Cambridge, MA, USA, 2016
Complexity:

The runtime complexity is \(\mathcal{O}(n\log(n) + m)\) with \(n\) the number of nodes in \(T\) and \(m\) the number of edges in \(E\).

Parameters:

tree – Input tree
edge_weights – Edge weights on the leaf graph of the input tree
leaf_graph – Leaf graph of the input tree (deduced from CptHierarchy)
mode – “dissimilarity” or “similarity” (default: “dissimilarity”)

Returns:

a real number

dendrogram_purity(tree, leaf_labels)[source]¶

Weighted average of the purity of each node of the tree with respect to a ground truth labelization of the tree leaves.

Let \(T\) be a tree with leaves \(V=\{1, \ldots, n\}\). Let \(C=\{C_1, \ldots, C_K\}\) be a partition of \(V\) into \(k\) (label) sets.

The purity of a subset \(X\) of \(V\) with respect to class \(C_\ell\in C\) is the fraction of elements of \(X\) that belongs to class \(C_\ell\):

\[pur(X, C_\ell) = \frac{| X \cap C_\ell |}{| X |}.\]

The purity of \(T\) is the defined as:

\[pur(T) = \frac{1}{Z}\sum_{k=1}^{K}\sum_{x,y\in C_k, x\neq y} pur(lca_T(x,y), C_k)\]

with \(Z=| \{\{x,y\} \subseteq V \mid x\neq y, \exists k, \{x,y\}\subseteq C_k\} |\).

See:: Heller, Katherine A., and Zoubin Ghahramani. “Bayesian hierarchical clustering .” Proc. ICML. ACM, 2005.
Complexity:

The dendrogram purity is computed in \(\mathcal{O}(N\times K \times C^2)\) with \(N\) the number of nodes in the tree, \(K\) the number of classes, and \(C\) the maximal number of children of a node in the tree.

Parameters:

tree – input tree
leaf_labels – a 1d integral array of length tree.num_leaves()

Returns:

a score between 0 and 1 (higher is better)

tree_sampling_divergence(tree, edge_weights, leaf_graph)[source]¶

Tree sampling divergence is an unsupervised measure of the quality of a hierarchical clustering of an edge weighted graph. It measures how well the given edge weighted graph can be reconstructed from the tree alone. It is equal to 0 if and only if the given graph can be fully recovered from the tree.

It is defined as the Kullback-Leibler divergence between the edge sampling model \(p\) and the independent (null) sampling model \(q\) of the nodes of a tree (see attribute_tree_sampling_probability()).

The tree sampling divergence on a tree \(T\) is then

\[TSD(T) = \sum_{x \in T} p(x) \log\frac{p(x)}{q(x)}\]

The definition of the tree sampling divergence was proposed in:

Charpentier, B. & Bonald, T. (2019). “Tree Sampling Divergence: An Information-Theoretic Metric for Hierarchical Graph Clustering.” Proceedings of IJCAI.

Complexity:

The tree sampling divergence is computed in \(\mathcal{O}(N (\log(N) + C^2) + M)\) with \(N\) the number of nodes in the tree, \(M\) the number of edges in the leaf graph, and \(C\) the maximal number of children of a node in the tree.

Parameters:

tree – Input tree
edge_weights – Edge weights on the leaf graph (similarities)
leaf_graph – Leaf graph of the input tree (deduced from CptHierarchy)

Returns:

a real number