% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/TopDom.R
\name{TopDom}
\alias{TopDom}
\title{Identify Topological Domains from a Hi-C Contact Matrix}
\usage{
TopDom(
  data,
  window.size,
  outFile = NULL,
  statFilter = TRUE,
  ...,
  debug = getOption("TopDom.debug", FALSE)
)
}
\arguments{
\item{data}{A TopDomData object, or the pathname to a normalized Hi-C contact matrix file as read by \code{\link[=readHiC]{readHiC()}}, that specifies N bins.}

\item{window.size}{The number of bins to extend (as a non-negative integer). The recommended range is {5, ..., 20}.}

\item{outFile}{(optional) The filename without extension of the three result files optionally produced. See details below.}

\item{statFilter}{(logical) Specifies whether non-significant topological-domain boundaries should be dropped or not.}

\item{...}{Additional arguments passed to \code{\link[=readHiC]{readHiC()}}.}

\item{debug}{If \code{TRUE}, debug output is produced.}
}
\value{
A named list of class \code{TopDom} with data.frame elements \code{binSignal}, \code{domain}, and \code{bed}.
\itemize{
\item The \code{binSignal} data frame (N-by-7) holds the mean contact frequency, local extremum, and p-value for every bin. The first four columns hold basic bin information given by the matrix file, such as bin id (\code{id}), chromosome (\code{chr}), start coordinate (\code{from.coord}), and end coordinate (\code{to.coord}) for each bin. The last three columns (\code{local.ext}, \code{mean.cf}, and \code{p-value}) hold values computed by the TopDom algorithm. The columns are:
\itemize{
\item \code{id}: Bin ID
\item \code{chr}: Chromosome
\item \code{from.coord}: Start coordinate of bin
\item \code{to.coord}: End coordinate of bin
\item \code{local.ext}:
\itemize{
\item \code{-1}: Local minimum.
\item \code{-0.5}: Gap region.
\item \code{0}: General bin.
\item \code{1}: Local maximum.
}
\item \code{mean.cf}: Average of contact frequencies between lower and upper regions for bin \emph{i = 1,2,...,N}.
\item \code{p-value}: P-value computed by the Wilcoxon rank-sum test. See Shin et al. (2016) for more details.
}
\item The \code{domain} data frame (D-by-7): Every bin is categorized by basic building block, such as gap, domain, or boundary. Each row indicates a basic building block. The first five columns include the basic information about the block; the \code{tag} column indicates the class of the building block.
\itemize{
\item \code{id}: Identifier of block
\item \code{chr}: Chromosome
\item \code{from.id}: Start bin index of the block
\item \code{from.coord}: Start coordinate of the block
\item \code{to.id}: End bin index of the block
\item \code{to.coord}: End coordinate of the block
\item \code{tag}: Categorized name of the block. Three block types are possible:
\itemize{
\item \code{gap}
\item \code{domain}
\item \code{boundary}
}
\item \code{size}: Size of the block
}
\item The \code{bed} data frame (D-by-4) is a representation of the \code{domain} data frame in the \href{https://genome.ucsc.edu/FAQ/FAQformat.html#format1}{BED file format}. It has four columns:
\itemize{
\item \code{chrom}: The name of the chromosome.
\item \code{chromStart}: The starting position of the feature in the chromosome. The first base in a chromosome is numbered 0.
\item \code{chromEnd}: The ending position of the feature in the chromosome. The \code{chromEnd} base is \emph{not} included in the feature. For example, the first 100 bases of a chromosome are defined as \code{chromStart=0}, \code{chromEnd=100}, and span the bases numbered 0-99.
\item \code{name}: Defines the name of the BED line. This label is displayed to the left of the BED line in the \href{https://genome.ucsc.edu/cgi-bin/hgGateway}{UCSC Genome Browser} window when the track is open to full display mode or directly to the left of the item in pack mode.
}
}

If argument \code{outFile} is non-\code{NULL}, then the three elements (\code{binSignal}, \code{domain}, and \code{bed}) returned are also written to tab-delimited files whose names are \code{outFile} followed by the extensions \file{.binSignal}, \file{.domain}, and \file{.bed}, respectively. None of the files have row names, and all but the BED file have column names.
}
\description{
Identify Topological Domains from a Hi-C Contact Matrix
}
\section{Window size}{

The \code{window.size} parameter is by design the only tuning parameter in the TopDom method and affects the amount of smoothing applied when calculating the TopDom bin signals. The binning window extends symmetrically downstream and upstream from the bin such that the bin signal is the average of \code{window.size^2} contact frequencies. For details, see Equation (1) and Figure 1 in Shin et al. (2016).

Typically, the number of identified topological domains (TDs) decreases while their average length increases as this window-size parameter increases (Figure 2). The default is \code{window.size = 5} (bins), which is motivated as: "Considering the previously reported minimum TD size (approx. 200 kb) (Dixon et al., 2012) and our bin size of 40 kb, \emph{w}[indow.size] = 5 is a reasonable setting" (Shin et al., 2016).
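
As a rough sketch of the averaging described above, and assuming the notation of Equation (1) in Shin et al. (2016) as we read it (the symbols \emph{w}, \emph{U_i}, and \emph{D_i} are shorthand used here for illustration, not arguments of \code{TopDom()}), the signal for bin \emph{i} with \emph{w} = \code{window.size} can be written as:

\deqn{\mathrm{binSignal}(i) = \frac{1}{w^2} \sum_{l=1}^{w} \sum_{m=1}^{w} \mathrm{cont.freq}(U_i(l), D_i(m))}{binSignal(i) = (1 / w^2) * sum_{l=1..w} sum_{m=1..w} cont.freq(U_i(l), D_i(m))}

where \emph{U_i} = (\emph{i}-\emph{w}+1, ..., \emph{i}) are the \emph{w} bins ending at bin \emph{i} and \emph{D_i} = (\emph{i}+1, ..., \emph{i}+\emph{w}) are the \emph{w} bins that follow it. With \code{window.size = 5}, each bin signal therefore averages 5 * 5 = 25 contact frequencies, and with 40 kb bins each of the two windows spans 5 * 40 kb = 200 kb.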
}

\examples{
path <- system.file("exdata", package = "TopDom", mustWork = TRUE)

## Original count data (on a subset of the bins to speed up example)
chr <- "chr19"
pathname <- file.path(path, sprintf("nij.\%s.gz", chr))
data <- readHiC(pathname, chr = chr, binSize = 40e3, bins = 1:500)
print(data)  ## a TopDomData object

## Find topological domains using the TopDom method
fit <- TopDom(data, window.size = 5L)
print(fit)  ## a TopDom object

## Display the largest domain
td <- subset(subset(fit$domain, tag == "domain"), size == max(size))
print(td)  ## a data.frame

## Subset TopDomData object
data_s <- subsetByRegion(data, region = td, margin = 0.9999)
print(data_s)  ## a TopDomData object
vp <- grid::viewport(angle = -45, width = 0.7, y = 0.3)
gg <- ggCountHeatmap(data_s)
gg <- gg + ggDomain(td, color = "#cccc00") + ggDomainLabel(td)
print(gg, newpage = TRUE, vp = vp)

gg <- ggCountHeatmap(data_s, colors = list(mid = "white", high = "black"))
gg_td <- ggDomain(td, delta = 0.08)
dx <- attr(gg_td, "gg_params")$dx
gg <- gg + gg_td + ggDomainLabel(td, vjust = 2.5)
print(gg, newpage = TRUE, vp = vp)

## Subset TopDom object
fit_s <- subsetByRegion(fit, region = td, margin = 0.9999)
print(fit_s)  ## a TopDom object
for (kk in seq_len(nrow(fit_s$domain))) {
  gg <- gg + ggDomain(fit_s$domain[kk, ], dx = dx * (4 + kk \%\% 2), color = "red", size = 1)
}
print(gg, newpage = TRUE, vp = vp)

gg <- ggCountHeatmap(data_s)
gg_td <- ggDomain(td, delta = 0.08)
dx <- attr(gg_td, "gg_params")$dx
gg <- gg + gg_td + ggDomainLabel(td, vjust = 2.5)
fit_s <- subsetByRegion(fit, region = td, margin = 0.9999)
for (kk in seq_len(nrow(fit_s$domain))) {
  gg <- gg + ggDomain(fit_s$domain[kk, ], dx = dx * (4 + kk \%\% 2), color = "blue", size = 1)
}
print(gg, newpage = TRUE, vp = vp)
}
\references{
\itemize{
\item Shin et al., TopDom: an efficient and deterministic method for identifying topological domains in genomes, \emph{Nucleic Acids Research}, 44(7): e70, April 2016.
DOI: 10.1093/nar/gkv1505, PMCID: \href{https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4838359/}{PMC4838359}, PMID: \href{https://pubmed.ncbi.nlm.nih.gov/26704975/}{26704975}
\item Shin et al., \R script \file{TopDom_v0.0.2.R}, 2017 (originally from \code{http://zhoulab.usc.edu/TopDom/}; later available on \url{https://github.com/jasminezhoulab/TopDom} via \url{https://zhoulab.dgsom.ucla.edu/pages/software})
\item Shin et al., TopDom Manual, 2016-07-08 (originally from \code{http://zhoulab.usc.edu/TopDom/TopDom\%20Manual_v0.0.2.pdf}; later available on \url{https://github.com/jasminezhoulab/TopDom} via \url{https://zhoulab.dgsom.ucla.edu/pages/software})
\item Hanjun Shin, Understanding the 3D genome organization in topological domain level, Doctor of Philosophy Dissertation, University of Southern California, March 2017, \url{https://digitallibrary.usc.edu/cdm/ref/collection/p15799coll40/id/347735}
\item Dixon JR, Selvaraj S, Yue F, Kim A, et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. \emph{Nature}; 485(7398):376-80, April 2012. DOI: 10.1038/nature11082, PMCID: \href{https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3356448/}{PMC3356448}, PMID: 22495300.
}
}
\author{
Hanjun Shin, Harris Lazaris, and Gangqing Hu. \R package, help, and code refactoring by Henrik Bengtsson.
}