# Advanced Approach To Data Interpretation By Rs Agarwal.11

An exemplary illustration of how maps of cell types and states can support different levels of resolution is the structure-rich topologies generated by PAGA based on scRNA-seq [14], see Fig. 1. At the highest levels of resolution, these topologies also reflect intermediate cell states and the developmental trajectories passing through them. A similar approach that also allows for consistently zooming into more detailed levels of resolution is provided by hierarchical stochastic neighbor embedding (HSNE, Pezzotti et al. [15]), a method pioneered on mass cytometry datasets [16, 17]. In addition, manifold learning [18, 19] and metric learning [20, 21] may provide further theoretical support for even more accurate maps, because they provide sound theories about reasonable, continuous distance metrics, instead of just distinct, discrete clusters.

Measurement error requires denoising methods or approaches that quantify uncertainty and propagate it down analysis pipelines. Where methods cannot deal with abundant missing values, imputation approaches may be useful. While the true population manifold that generated the data is never known, one can usually obtain an estimate of it that can be used for both denoising and imputation.
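As a minimal sketch of how an estimated manifold can be used for denoising, the example below approximates the manifold by a k-nearest-neighbor graph between cells and replaces each cell's profile by the average over its neighborhood. The function name `knn_smooth` and the toy matrix are assumptions made for illustration; this is not the procedure of any particular published tool.

```python
import numpy as np

def knn_smooth(X, k=5):
    """Average each cell (row) with its k nearest neighbours in expression space."""
    X = np.asarray(X, dtype=float)
    # Pairwise squared Euclidean distances between cells.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # Indices of the k + 1 closest cells; the cell itself (distance zero)
    # is normally among them.
    idx = np.argsort(d2, axis=1)[:, :k + 1]
    # X[idx] has shape (cells, k + 1, genes); average over the neighbourhood.
    return X[idx].mean(axis=1)

# Toy data: two pairs of similar cells measured on two genes.
toy = np.array([[0.0, 0.0],
                [0.0, 2.0],
                [10.0, 10.0],
                [10.0, 12.0]])
smoothed = knn_smooth(toy, k=1)
```

With `k=1`, each cell in the toy matrix is averaged with its single closest neighbor, so the two cells of each pair end up with the same profile. Larger neighborhoods remove more noise but also blur real differences between nearby cell states.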

The first category of methods generally seeks to infer a probabilistic model that captures the data generation mechanism. Such generative models can be used to probabilistically determine which observed zeros correspond to technical zeros (to be imputed) and which correspond to biological zeros (to be left alone). There are many model-based imputation methods already available that use ideas from clustering (e.g., k-means), dimension reduction, regression, and other techniques to impute technical zeros, oftentimes combining ideas from several of these approaches (Table 2 (A)).
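To make the distinction between technical and biological zeros concrete, the sketch below assumes a zero-inflated Poisson model: with probability `dropout_rate` a count is lost for technical reasons, and otherwise it is drawn from a Poisson distribution with mean `mean_expression`. Both parameter names and the choice of a zero-inflated Poisson are illustrative assumptions, not the model of a specific method in Table 2. Under that model, Bayes' rule gives the probability that an observed zero is technical.

```python
import math

def prob_technical_zero(dropout_rate, mean_expression):
    """Posterior probability that an observed zero is a technical dropout.

    Assumes a zero-inflated Poisson: a zero arises either from dropout
    (probability dropout_rate) or from a Poisson draw of zero
    (probability (1 - dropout_rate) * exp(-mean_expression)).
    """
    if not 0.0 < dropout_rate <= 1.0:
        raise ValueError("dropout_rate must be in (0, 1]")
    if mean_expression < 0.0:
        raise ValueError("mean_expression must be non-negative")
    zero_from_dropout = dropout_rate
    zero_from_biology = (1.0 - dropout_rate) * math.exp(-mean_expression)
    return zero_from_dropout / (zero_from_dropout + zero_from_biology)

# A zero in a highly expressed gene is probably technical,
# a zero in a lowly expressed gene is probably biological.
p_high = prob_technical_zero(dropout_rate=0.1, mean_expression=5.0)
p_low = prob_technical_zero(dropout_rate=0.1, mean_expression=0.1)
```

In a full imputation method, both parameters would themselves be estimated from the data, for example per gene or per cluster of similar cells. Only zeros with a high posterior probability of being technical would then be imputed.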

Accounting for uncertainty in cell type assignment and for double use of data will require, first of all, a systematic study of their impact. Integrative approaches in which clustering and differential testing are simultaneously performed [113] can address both issues. However, integrative methods typically require bespoke implementations, precluding a direct combination between arbitrary clustering and differential testing tools. In such cases, the adaptation of selective inference methods [114] could provide an alternative solution, with an approach based on correcting the selection bias recently proposed [115].
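The effect of using the same data twice can be illustrated with a small simulation. The sketch below draws cells from a single homogeneous population, so there are no true clusters. It then "clusters" them with a median split on one gene, which is a deliberately crude stand-in for a real clustering algorithm, and tests that same gene for differential expression. All names and settings here are assumptions chosen for illustration.

```python
import numpy as np

def welch_t(a, b):
    """Welch's t-statistic for the difference in means of two samples."""
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (a.mean() - b.mean()) / se

rng = np.random.default_rng(0)

# One homogeneous population: 200 cells, 2 independent genes.
expr = rng.normal(size=(200, 2))

# "Cluster" the cells by splitting gene 0 at its median.
labels = expr[:, 0] > np.median(expr[:, 0])

# Test the gene that defined the clusters (double use of the data) ...
t_same = welch_t(expr[labels, 0], expr[~labels, 0])
# ... and a gene that played no role in the clustering.
t_other = welch_t(expr[labels, 1], expr[~labels, 1])

print(f"t for clustering gene:  {t_same:.1f}")
print(f"t for independent gene: {t_other:.1f}")
```

The gene used to define the groups appears strongly differentially expressed even though the population is homogeneous, whereas the independent gene shows no such inflation. Selective inference methods aim to correct test statistics for exactly this kind of selection effect.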

However, unsupervised approaches involve manual cluster annotation. There are two major caveats: (i) manual annotation is time-consuming, and (ii) it limits the reproducibility of the results. Cell atlases, as reference systems that systematically capture cell types and states, either tissue specific or across different tissues, remedy this issue (see data integration approach +X+S in Fig. 6). They will need to be able to embed new data points into a stable reference framework that allows for different levels of resolution and will have to eventually capture transitional cell states that fall in between clearly annotated cell clusters (see Fig. 1 for an idea of what cell atlas type reference systems could look like).

Fig. 6. Approaches for integrating single-cell measurement datasets across measurement types, samples, and experiments, as also described in Table 4.

- 1S: clustering of cells from one sample from one experiment requires no data integration.
- +S: integration of one measurement type across samples requires the linking of cell populations/clusters.
- +X+S: integration of one measurement type across experiments conducted in separate laboratories requires stable reference systems like cell atlases (compare Fig. 1).
- +M1C: integration of multiple measurement types obtained from the same cell highlights the problem of data sparsity of all available measurement types and the dependency of measurement types that needs to be accounted for.
- +M+C: integration of different measurement types from different cells of the same cell population requires special care in matching cells through meaningful profiles.
- +all: one possibility for easing data integration across measurement types from separate cells would be to have a stable reference (cell atlas) across multiple measurement types, capturing different cell states, cell populations, and organisms. Effectively, this combines the challenges and promises of the approaches +X+S, +M1C, and +M+C.

Some basic approaches to CNV calling from scDNA-seq data are available. These are usually based on hidden Markov models (HMMs) where the hidden variables correspond to copy number states, as, for example, in Aneufinder [235]. Another tool, Ginkgo, provides interactive CNV detection using circular binary segmentation, but is only available as a web-based tool [236]. ScRNA-seq data, which does not suffer from the errors and biases of WGA, can also be used to call CNVs or loss of heterozygosity events: an approach called HoneyBADGER [237] utilizes a probabilistic hidden Markov model, whereas the R package inferCNV simply averages the expression over adjacent genes [238].
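The HMM idea can be sketched in a few lines. The example below assumes that read counts per genomic bin follow a Poisson distribution whose mean is proportional to the hidden copy number, and that copy number states tend to persist between adjacent bins. It then recovers the most likely state sequence with the Viterbi algorithm. The function name, the parameters `reads_per_copy` and `p_stay`, and the Poisson emission model are illustrative assumptions, not the model used by Aneufinder or HoneyBADGER.

```python
import numpy as np
from math import lgamma

def viterbi_copy_number(counts, states=(1, 2, 3, 4),
                        reads_per_copy=50.0, p_stay=0.99):
    """Most likely copy-number path for binned read counts.

    states must hold at least two positive copy numbers.
    """
    counts = np.asarray(counts, dtype=float)
    states = np.asarray(states, dtype=float)
    n, k = counts.size, states.size

    # Poisson log-likelihood of every bin (rows) under every state (columns).
    lam = reads_per_copy * states
    log_fact = np.array([lgamma(c + 1.0) for c in counts])
    log_emit = (counts[:, None] * np.log(lam)[None, :]
                - lam[None, :] - log_fact[:, None])

    # Transitions: stay with probability p_stay, otherwise switch uniformly.
    log_trans = np.full((k, k), np.log((1.0 - p_stay) / (k - 1)))
    np.fill_diagonal(log_trans, np.log(p_stay))

    # Forward pass: best log-score of any path ending in each state.
    score = np.log(1.0 / k) + log_emit[0]
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + log_trans      # cand[i, j]: from i to j
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(k)] + log_emit[t]

    # Backward pass: follow the stored pointers from the best final state.
    path = np.empty(n, dtype=int)
    path[-1] = int(np.argmax(score))
    for t in range(n - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return states[path].astype(int)

# Toy profile: a gain of one copy in the middle five bins.
toy_counts = [98, 103, 101, 97, 100, 148, 155, 151, 149, 152, 99, 102, 100, 96, 104]
calls = viterbi_copy_number(toy_counts)
```

Real scDNA-seq counts are overdispersed and biased by WGA, so practical tools would need more robust emission models and bias correction than this sketch provides.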

For copy number variation calling, software has previously been published mostly in conjunction with data-driven studies. Here, a systematic analysis of biases in the most common WGA methods for copy number variation calling (including newer methods to come) could further inform method development. The already mentioned approach of leveraging amplification bias for phasing could also be informative [241].

(ii) In addition to the growing number of cells (taxa), the breadth of genomic sites and genomic alterations that can be queried per genome also increases. Classical approaches thus need to scale not only with the number of single cells queried (see above), but also with the length of the input MSA. Here, previous efforts for parallelization [259, 260] and other optimization efforts [261] exist and can be built upon. The breadth of sequencing data also allows determination of large numbers of invariant sites, which further raises the question of whether including them will change the results of phylogenetic inferences in the context of cancer. The bias that results from excluding invariant sites from the inference is known as ascertainment bias. For phylogenetic analyses of closely related individuals from a few populations, it has been shown that accounting for ascertainment bias alters branch lengths, but not the resulting tree topologies per se [262].

Incorporating CNVs in the reconstruction of tumor phylogeny can be helpful for understanding tumor progression, as they represent one of the most common mutation types associated with tumor hypermutability [272]. CNVs in single cells have been extensively studied in the context of tumor evolution and clonal dynamics [273, 274]. Reconstructing a phylogeny with CNVs is not straightforward. The challenges are not only related to experimental limits, such as the complexity of bulk sequencing data [275] and amplification biases [276], but also involve computational constraints. First, the causal mechanisms, such as breakage-fusion-bridge cycles [277] and chromosome missegregation [278], can lead to overlapping copy number events [279]. Second, inferring a phylogeny with CNV data requires quantifying biologically motivated transition probabilities for changes in copy numbers. Towards that goal, approaches to calculate the distance between whole copy number profiles [280] are a first step. But for them, a number of challenges remain, with several of the underlying problems known to be NP-hard [280].
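To give a feel for what a distance between whole copy number profiles can look like, the sketch below computes a deliberately simplified event distance. It assumes that each event adds or removes exactly one copy over a contiguous run of bins, and it ignores biological constraints such as the fact that a segment lost entirely cannot be regained. Under these assumptions, the minimum number of events equals the summed upward steps in the difference between the two profiles, with the difference padded by zeros at both ends. The function name is hypothetical, and this is not the method of [280].

```python
import numpy as np

def interval_event_distance(profile_a, profile_b):
    """Minimum number of single-copy interval gains/losses turning a into b.

    Simplified model: no constraint on intermediate copy numbers.
    """
    a = np.asarray(profile_a, dtype=int)
    b = np.asarray(profile_b, dtype=int)
    if a.ndim != 1 or a.shape != b.shape:
        raise ValueError("profiles must be 1-D and of equal length")
    # Per-bin change needed, padded so events touching the ends are counted.
    diff = np.concatenate(([0], b - a, [0]))
    # Every event creates one upward and one downward step in diff,
    # so the number of events is the total size of the upward steps.
    steps = np.diff(diff)
    return int(np.clip(steps, 0, None).sum())

normal = [2, 2, 2, 2, 2, 2]
clone_1 = [2, 3, 3, 3, 2, 2]   # one gain spanning three bins
clone_2 = [1, 3, 3, 3, 2, 2]   # the same gain plus a one-bin loss

d_normal_1 = interval_event_distance(normal, clone_1)
d_normal_2 = interval_event_distance(normal, clone_2)
d_1_2 = interval_event_distance(clone_1, clone_2)
```

Because a single event can change many bins at once, this distance treats one long gain as one event, unlike a bin-wise distance such as the Manhattan distance. Adding realistic constraints is where the harder computational problems mentioned above arise.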

However, co-measuring different types of quantities in the same cell can be experimentally challenging or even impossible at this point in time. One way around this problem is to analyze a population of cells that is homogeneous in terms of some cell type or state, taking different measurement types in different single cells (approach +M+C). After collecting different measurement types in different single cells, one needs to combine the data in a way that is biologically meaningful. An example is to group cells based on commonalities in their genotype profile (Fig. 6), which become evident only after a scDNA-seq experiment. This will require careful validation of the assumptions made when matching cells via such a grouping, possibly including functional validation of group differences.

For integrating across multiple measurement types from separate cells (approach +M+C), all of which stem from a population of cells that is homogeneous with respect to some selection criterion, technologies such as 10x Genomics [171] for scRNA-seq and direct library preparation (DLP [341]) for scDNA-seq establish a scalable experimental basis. The greater analytical challenge is to identify subpopulations that had so far remained invisible, and whose identification is crucial so as to not combine different types of data in mistaken ways. An example of this is the identification of distinct cancer clones from cells sampled from seemingly homogeneous tumor tissue. Here, only performing scDNA-seq experiments can definitively reveal the clonal structure of a tumor. If one wishes to correctly link mutation with transcription profiles, ignoring the clonal structure of a tumor could be misleading. Several analytical methods that address this problem have recently emerged: (i) clonealign [91] assumes a copy number dosage effect on transcription to assign gene expression states to clones, (ii) cardelino [342] aligns clone-specific SNVs in scRNA-seq to those inferred from bulk exome data in order to infer clone-specific expression patterns, and (iii) MATCHER [18] uses manifold alignment to combine scM&T-seq [336] with sc-GEM [337], leveraging the common set of loci. All of these methods are based on biologically meaningful assumptions on how to summarize data measurements across different measurement types and samples, despite their different physical origin.
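The dosage assumption can be illustrated with a toy assignment rule. The sketch below assumes that a gene's expression rises with its copy number, removes gene-specific baselines from both the expression matrix and the clone copy number profiles, and assigns each cell to the clone whose copy number deviations best match the cell's expression deviations. The function name and the scoring by inner product are assumptions for illustration. This is far simpler than the statistical model of clonealign and is not a substitute for it.

```python
import numpy as np

def assign_cells_to_clones(expr, clone_cn):
    """Assign each cell to the clone that best matches a dosage effect.

    expr:     cells x genes expression matrix (scRNA-seq).
    clone_cn: clones x genes copy-number matrix (from scDNA-seq).
    Returns one clone index per cell.
    """
    expr = np.asarray(expr, dtype=float)
    clone_cn = np.asarray(clone_cn, dtype=float)
    # Remove gene-specific baselines so that only differences between
    # cells, and between clones, remain.
    expr_dev = expr - expr.mean(axis=0)
    cn_dev = clone_cn - clone_cn.mean(axis=0)
    # A cell scores highly for a clone if it over-expresses the genes
    # that the clone has gained and under-expresses those it has lost.
    scores = expr_dev @ cn_dev.T
    return np.argmax(scores, axis=1)

# Two clones that differ in the copy number of genes 3 and 4.
clone_cn = np.array([[2, 2, 2, 2],
                     [2, 2, 4, 4]])
# Four cells, roughly ten expression units per gene copy.
expr = np.array([[20, 20, 20, 20],
                 [21, 19, 20, 22],
                 [20, 20, 40, 40],
                 [19, 21, 41, 38]])
assignment = assign_cells_to_clones(expr, clone_cn)
```

The centering step implicitly assumes that the sequenced cells sample the clones in a reasonably balanced way. A probabilistic model would handle unbalanced clone frequencies, noise, and assignment uncertainty explicitly.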

Experimental technologies that enable taking multiple measurement types in the same cell (approach +M1C in Fig. 6 and Table 4) are on the rise and will make it possible to assay more cells at higher fidelity and reduced cost. While this type of data naturally links measurement types within single cells, the single-cell data science (SCDS) challenge is to account for dependencies among those measurement types for any obtainable combinations of them. As a prominent example, consider how gene expression increases with higher genomic copy number, a phenomenon known as measurement linkage [343], which has not been addressed for different measurement types taken in the same cell. Statistical models for leveraging those measurement type combinations thus pose formidable SCDS challenges.