The HiC class

A class for handling HiC analysis.

class hifive.hic.HiC(filename, mode='r', silent=False)

This is the class for handling HiC analysis.

This class relies on Fend and HiCData for genomic position and interaction count data. Use this class to perform filtering of fends based on coverage, model fend bias and distance dependence, and downstream analysis and manipulation. This includes binning of data, plotting of data, modeling of data, and statistical analysis.

Note

This class is also available as hifive.HiC

When initialized, this class creates an h5dict in which to store all data associated with this object.

Parameters:
  • filename (str.) – The file name of the h5dict. This should end with the suffix ‘.hdf5’
  • mode (str.) – The mode to open the h5dict with. This should be ‘w’ for creating or overwriting an h5dict with name given in filename.
  • silent (bool.) – Indicates whether to print information about function execution for this object.
Returns:

HiC class object.

Attributes:
  • file (str.) - A string containing the name of the file passed during object creation for saving the object to.
  • silent (bool.) - A boolean indicating whether to suppress all of the output messages.
  • history (str.) - A string containing all of the commands executed on this object and their outcome.
  • normalization (str.) - A string stating which type of normalization has been performed on this object. This starts with the value ‘none’.
  • comm (class) - A link to the MPI.COMM_WORLD class from the mpi4py package. If this package isn’t present, this is set to ‘None’.
  • rank (int.) - The rank integer of this process, if running with mpi, otherwise set to zero.
  • num_procs (int.) - The number of processes being executed in parallel. If mpi4py package is not present, this is set to one.

In addition, many other attributes are initialized to the ‘None’ state.

calculate_quality(filename, resolution=1000000, coverage=[1.0, 0.5, 0.25, 0.12, 0.06], noise=[1.0, 0.75, 0.5, 0.25, 0.0], chroms=[])

Write individual chromosome and overall quality metrics for a HiC dataset to a text file.

This function uses a weighted spatial consistency approach to calculate the consistency within a HiC dataset. Briefly, when two regions in the genome occur near each other, the distances from these regions to all other chromosome regions should be similar to each other. Conversely, regions that are spatially far apart should have either uncorrelated or inversely correlated sets of distances with each other. To assess this consistency, HiC read matrices are converted into P-values under a negative binomial distribution. For each pairwise combination of bins (intra-chromosomal only), the correlation between the matrix columns corresponding to the bin pair is calculated. The chromosome quality score is the sum of the correlations weighted by the corresponding interactions minus the sum of the unweighted mean of correlations. The overall quality is the euclidean mean of the chromosome quality scores.

In order to make the quality metric more robust, values are calculated for several different coverage and noise values. Low-coverage datasets are generated by randomly removing reads. Noise is modeled as a combination of the overall distance dependence curve and bin-specific correction values found from the matrix balancing. The final quality metric is calculated by finding the overall linear regression line slope across all coverage and noise combinations, finding the 100% coverage intercept using the calculated slope, and finding the 0% noise value. This step adds slight improvements over noise injected to 100% coverage data and the original quality metric.

Parameters:
  • filename (str.) – The name of the file to write the results to.
  • resolution (int.) – The size of bins to partition the genome into prior to calculating the quality values.
  • coverage (list) – A list of floats describing which coverage levels to calculate quality scores for. Values can range from greater than zero to 1.
  • noise (list) – A list of floats describing which noise levels (fraction of reads from noise) to calculate quality scores for. Values can range from zero to 1.
  • chroms (list) – A list of chromosome names to calculate quality scores for.
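
The text above says only that low-coverage datasets are generated by randomly removing reads. A minimal sketch of one way to do that, assuming each read is kept independently with probability equal to the coverage level (the function name and the seeding are illustrative, not part of HiFive):

```python
import random


def downsample_counts(counts, coverage, seed=0):
    """Thin read counts by keeping each read with probability `coverage`.

    Assumption: independent per-read thinning. The documentation does not
    state how HiFive removes reads internally.
    """
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be greater than 0 and at most 1")
    rng = random.Random(seed)
    thinned = []
    for count in counts:
        # Each of the `count` reads survives with probability `coverage`.
        kept = sum(1 for _ in range(count) if rng.random() < coverage)
        thinned.append(kept)
    return thinned


bin_counts = [12, 0, 7, 30, 5]
half_coverage = downsample_counts(bin_counts, 0.5)
full_coverage = downsample_counts(bin_counts, 1.0)
```

At a coverage of 1.0 every read is kept, so the counts come back unchanged.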
calculate_replicate_quality(hic2, filename, resolution=1000000, chroms=[])

Write individual chromosome and overall quality metrics comparing two HiC datasets (presumably replicates) to a text file.

This function uses a weighted spatial consistency approach to calculate the consistency between HiC data replicates. Briefly, when two regions in the genome occur near each other, the distances from these regions to all other chromosome regions should be similar to each other. Conversely, regions that are spatially far apart should have either uncorrelated or inversely correlated sets of distances with each other. To assess the inter-dataset consistency, HiC read matrices are converted into P-values under a negative binomial distribution. For each pairwise combination of bins (intra-chromosomal only), the correlation between the matrix columns corresponding to the bin pair is calculated. The chromosome quality score is the sum of the correlations in the first HiC dataset weighted by the corresponding interactions from the second HiC dataset and the converse weighted correlations minus the sum of the unweighted means of correlations. The overall quality is the euclidean mean of the chromosome quality scores.

Parameters:
  • hic2 (HiC.) – The HiC file to compare interactions with.
  • filename (str.) – The name of the file to write the results to.
  • resolution (int.) – The size of bins to partition the genome into prior to calculating the quality values.
  • chroms (list) – A list of chromosome names to calculate quality scores for.
cis_heatmap(chrom, start=None, stop=None, startfend=None, stopfend=None, binsize=0, binbounds=None, datatype='enrichment', arraytype='compact', maxdistance=0, skipfiltered=False, returnmapping=False, dynamically_binned=False, minobservations=0, searchdistance=0, expansion_binsize=0, removefailed=False, image_file=None, proportional=False, includediagonal=False, **kwargs)

Return a heatmap of cis data of the type and shape specified by the passed arguments.

This function returns a heatmap for a single chromosome region, bounded by either ‘start’ and ‘stop’ or ‘startfend’ and ‘stopfend’ (‘start’ and ‘stop’ take precedence), or if given, the outer coordinates of the array passed by ‘binbounds’. If none of these are specified, data for the complete chromosome is used. The data in the array is determined by the ‘datatype’, being raw, fend-corrected, distance-corrected, enrichment, or expected data. The array shape is given by ‘arraytype’ and can be compact, upper, or full. See hic_binning for further explanation of ‘datatype’ and ‘arraytype’. The returned data will include interactions ranging from zero to ‘maxdistance’ apart. If maxdistance is zero, all interactions within the requested bounds are returned. If using dynamic binning (‘dynamically_binned’ is set to True), ‘minobservations’, ‘searchdistance’, ‘expansion_binsize’, and ‘removefailed’ are used to control the dynamic binning process. Otherwise these arguments are ignored.

Parameters:
  • chrom (str.) – The name of a chromosome to obtain data from.
  • start (int.) – The smallest coordinate to include in the array, measured from fend midpoints. If both ‘start’ and ‘startfend’ are given, ‘start’ will override ‘startfend’. If unspecified, this will be set to the midpoint of the first fend for ‘chrom’. Optional.
  • stop (int.) – The largest coordinate to include in the array, measured from fend midpoints. If both ‘stop’ and ‘stopfend’ are given, ‘stop’ will override ‘stopfend’. If unspecified, this will be set to the midpoint of the last fend plus one for ‘chrom’. Optional.
  • startfend (int.) – The first fend to include in the array. If unspecified and ‘start’ is not given, this is set to the first fend in ‘chrom’. In cases where ‘start’ is specified and conflicts with ‘startfend’, ‘start’ is given preference. Optional.
  • stopfend (int.) – The first fend not to include in the array. If unspecified and ‘stop’ is not given, this is set to the last fend in ‘chrom’ plus one. In cases where ‘stop’ is specified and conflicts with ‘stopfend’, ‘stop’ is given preference. Optional.
  • binsize (int.) – This is the coordinate width of each bin. If ‘binsize’ is zero, unbinned data is returned. If ‘binbounds’ is not None, this value is ignored.
  • binbounds (numpy array) – An array containing start and stop coordinates for a set of user-defined bins. Any fend not falling in a bin is ignored. Optional.
  • datatype (str.) – This specifies the type of data that is processed and returned. Options are ‘raw’, ‘distance’, ‘fend’, ‘enrichment’, and ‘expected’. Observed values are always in the first index along the last axis, except when ‘datatype’ is ‘expected’. In this case, filter values replace counts. Conversely, if ‘raw’ is specified, unfiltered fends return value of one. Expected values are returned for ‘distance’, ‘fend’, ‘enrichment’, and ‘expected’ values of ‘datatype’. ‘distance’ uses only the expected signal given distance for calculating the expected values, ‘fend’ uses only fend correction values, and both ‘enrichment’ and ‘expected’ use both correction and distance mean values.
  • arraytype (str.) – This determines what shape of array data are returned in. Acceptable values are ‘compact’, ‘full’, and ‘upper’. ‘compact’ means data are arranged in an N x M x 2 array, where N is the number of fends or bins and M is the maximum number of steps between included fend pairs or bin pairs, and data are stored such that bin n,m contains the interaction values between n and n + m + 1. ‘full’ returns a square, symmetric array of size N x N x 2. ‘upper’ returns only the flattened upper triangle of a full array, excluding the diagonal, with size (N * (N - 1) / 2) x 2.
  • maxdistance (int.) – This specifies the maximum coordinate distance between bins that will be included in the array. If set to zero, all distances are included.
  • skipfiltered (bool.) – If ‘True’, all interaction bins for filtered out fends are removed and a reduced-size array is returned.
  • returnmapping (bool.) – If ‘True’, a list is returned containing the data array and either a 1d array of the fend numbers included in the data array (if unbinned) or a 2d array of N x 4 containing the first fend and last fend plus one included in each bin and the first and last coordinates (if binned). Otherwise only the data array is returned.
  • dynamically_binned (bool.) – If ‘True’, return dynamically binned data.
  • minobservations (int.) – The fewest number of observed reads needed for a bin to be counted as valid and stop expanding.
  • searchdistance (int.) – The furthest distance from the bin midpoint to expand bounds. If this is set to zero, there is no limit on expansion distance.
  • expansion_binsize (int.) – The size of bins to use for data to pull from when expanding dynamic bins. If set to zero, unbinned data is used.
  • removefailed (bool.) – If a non-zero ‘searchdistance’ is given, it is possible for a bin not to meet the ‘minobservations’ criterion before the search stops. If this occurs and ‘removefailed’ is True, the observed and expected values for that bin are set to zero.
  • image_file (str.) – If a filename is specified, a PNG image file is written containing the heatmap data. Arguments for the appearance of the image can be passed as additional keyword arguments.
  • proportional (bool.) – Indicates whether interactions should proportionally contribute to bins based on the amount of overlap instead of being attributed solely based on midpoint. Only valid for binned heatmaps and does not work in conjunction with dynamic binning.
  • includediagonal (bool.) – If true, interactions with both ends falling in the same bin are included. This changes the size of the upper array to N * (N + 1) / 2 and increases the compact array’s first axis by one.
Returns:

Array in format requested with ‘arraytype’ containing data requested with ‘datatype’. If returnmapping is True, a list is returned containing the requested data array and an array of associated positions (dependent on the binning options selected).
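
The three ‘arraytype’ layouts described above can be related to one another with a few lines of plain Python. This is an illustrative sketch only: it drops the final axis of size 2 (observed and expected values) for brevity, assumes the upper triangle is flattened row by row, and uses helper names that are not part of HiFive.

```python
def compact_to_full(compact):
    """Expand an N x M compact layout into a symmetric N x N layout.

    Entry compact[n][m] holds the value for the pair (n, n + m + 1).
    """
    n_bins = len(compact)
    full = [[0.0] * n_bins for _ in range(n_bins)]
    for n, row in enumerate(compact):
        for m, value in enumerate(row):
            partner = n + m + 1
            if partner < n_bins:  # slots past the last bin carry no data
                full[n][partner] = value
                full[partner][n] = value
    return full


def full_to_upper(full):
    """Flatten the upper triangle (diagonal excluded), row by row."""
    n_bins = len(full)
    return [full[i][j] for i in range(n_bins) for j in range(i + 1, n_bins)]


compact = [
    [1.0, 2.0],  # bin 0 with bins 1 and 2
    [3.0, 4.0],  # bin 1 with bins 2 and 3
    [5.0, 0.0],  # bin 2 with bin 3; the second slot is past the last bin
    [0.0, 0.0],  # bin 3 has no partners to its right
]
full = compact_to_full(compact)
upper = full_to_upper(full)
```

With N = 4 bins the flattened upper triangle has 4 * 3 / 2 = 6 entries, matching the size formula given for ‘upper’.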

filter_fends(mininteractions=10, mindistance=0, maxdistance=0, usereads='cis')

Iterate over the dataset and remove fends that do not have ‘mininteractions’ within ‘maxdistance’ of themselves using only unfiltered fends.

In order to create a set of fends that all have the necessary number of interactions, after each round of filtering, fend interactions are retallied using only interactions that have unfiltered fends at both ends.

Parameters:
  • mininteractions (int.) – The required number of interactions for keeping a fend in analysis.
  • mindistance (int.) – The minimum inter-fend distance used to count fend interactions.
  • maxdistance (int.) – The maximum inter-fend distance used to count fend interactions. A value of 0 indicates no maximum should be used.
  • usereads (str.) – Specifies which set of interactions to use, ‘cis’, ‘trans’, or ‘all’.
Returns:

None
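
The repeated retallying described above can be sketched as a loop that runs until no further fend is removed. This is a simplified illustration, assuming interactions are given as a dictionary of fend-index pairs to read counts and ignoring the ‘mindistance’, ‘maxdistance’, and ‘usereads’ options; the function name is hypothetical.

```python
def iterative_filter(pairs, num_fends, mininteractions):
    """Return a list of booleans marking which fends survive filtering.

    pairs: dict mapping (fend_a, fend_b) -> read count (hypothetical input).
    """
    valid = [True] * num_fends
    while True:
        # Retally using only interactions whose two fends are both valid.
        totals = [0] * num_fends
        for (a, b), count in pairs.items():
            if valid[a] and valid[b]:
                totals[a] += count
                totals[b] += count
        failed = [
            i for i in range(num_fends)
            if valid[i] and totals[i] < mininteractions
        ]
        if not failed:
            return valid
        for i in failed:
            valid[i] = False


pairs = {(0, 1): 10, (1, 2): 3, (2, 3): 2, (0, 3): 1}
kept = iterative_filter(pairs, num_fends=4, mininteractions=5)
```

In this toy input, fend 3 fails in the first round. Removing it drops fend 2 below the threshold, so fend 2 is removed in the second round, which is why a single pass is not sufficient.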

find_binning_fend_corrections(mindistance=0, maxdistance=0, chroms=[], num_bins=[20, 20, 20], parameters=['even', 'even', 'even-const'], model=['gc', 'len', 'distance'], learning_threshold=1.0, max_iterations=10, usereads='cis', pseudocounts=0)

Using a multivariate binning model, learn correction values for combinations of model parameter bins. This function is MPI compatible.

Parameters:
  • mindistance (int.) – The minimum inter-fend distance to be included in modeling.
  • maxdistance (int.) – The maximum inter-fend distance to be included in modeling.
  • chroms (list) – A list of chromosomes to calculate corrections for. If set as None, all chromosome corrections are found.
  • remove_distance (bool.) – Use distance dependence curve in prior probability calculation for each observation.
  • model (list) – A list of fend features to be used in model. Valid values are ‘len’, ‘distance’, and any features included in the creation of the associated Fend object. The ‘distance’ parameter is only good with ‘cis’ or ‘all’ reads. If used with ‘all’, distances will be partitioned into n - 1 bins and the final distance bin will contain all trans data.
  • num_bins (list) – A list of the number of approximately equal-sized bins to divide model components into.
  • parameters (list) – A list of types, one for each model parameter. Types can be either ‘even’ or ‘fixed’, indicating whether each parameter bin should contain approximately even numbers of interactions or be of fixed width spanning 1 / Nth of the range of the parameter’s values, respectively. Parameter types can also have the suffix ‘-const’ to indicate that the parameter should not be optimized.
  • learning_threshold (float) – The minimum change in log-likelihood needed to continue iterative learning process.
  • max_iterations (int.) – The maximum number of iterations to use for learning model parameters.
  • usereads (str.) – Specifies which set of interactions to use, ‘cis’, ‘trans’, or ‘all’.
  • pseudocounts (int.) – The number of pseudo-counts to add to each bin prior to seeding and learning normalization values.
Returns:

None
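
The difference between the ‘even’ and ‘fixed’ parameter types can be illustrated on a small list of feature values. This is a sketch under simplifying assumptions: it balances the number of values per bin rather than the number of interactions, and the helper names are illustrative rather than HiFive functions.

```python
import bisect


def fixed_bin_edges(values, num_bins):
    """Interior edges of bins that each span 1 / num_bins of the value range."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    return [low + width * i for i in range(1, num_bins)]


def even_bin_edges(values, num_bins):
    """Interior edges chosen so bins hold roughly equal numbers of values."""
    ordered = sorted(values)
    total = len(ordered)
    return [ordered[(total * i) // num_bins] for i in range(1, num_bins)]


def bin_counts(values, edges):
    """Count how many values land in each bin defined by the interior edges."""
    counts = [0] * (len(edges) + 1)
    for value in values:
        counts[bisect.bisect_right(edges, value)] += 1
    return counts


fragment_lengths = [1, 2, 3, 4, 5, 6, 7, 8, 40, 100]
fixed_counts = bin_counts(fragment_lengths, fixed_bin_edges(fragment_lengths, 4))
even_counts = bin_counts(fragment_lengths, even_bin_edges(fragment_lengths, 4))
```

With skewed values such as these, fixed-width bins leave most observations in the first bin, while even bins spread them out. That is the usual motivation for preferring ‘even’ when a feature has a long tail.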

Attributes:
  • model_parameters (ndarray) - A numpy array of strings containing model parameter names. If distance was included in the ‘model’ option, it is not included in this array since it is only for learning values, not for subsequent correction.
  • binning_num_bins (ndarray) - A numpy array of type int32 containing the number of bins for each non-distance model parameter.
  • binning_corrections (ndarray) - A numpy array of type float32 and length equal to the sum of binning_num_bins * (binning_num_bins - 1) / 2. This array contains a 1D stack of correction values, ordered according to the parameter order in the ‘model_parameters’ attribute.
  • binning_correction_indices (ndarray) - A numpy array of type int32 and length equal to the number of non-distance model parameters plus one. This array contains the first position in ‘binning_corrections’ for the first bin of the model parameter in the corresponding position in the ‘model_parameters’ array. The last position in the array contains the total number of binning correction values.
  • binning_fend_indices (ndarray) - A numpy array of type int32 and size N x M x 2 where M is the number of non-distance model parameters and N is the number of fends. This array contains the binning index for each parameter for each fend for the first and the second position in the correction array.

The ‘normalization’ attribute is updated to ‘binning’.

find_distance_parameters(numbins=90, minsize=200, maxsize=0, corrected=False)

Count reads and possible interactions from valid fend pairs in each distance bin to find mean bin signals. This function is MPI compatible.

This partitions the range of interaction distances (measured from midpoints of the involved fends) from ‘minsize’ to ‘maxsize’ into a number of partitions equal to ‘numbins’. The first bin contains all distances less than or equal to ‘minsize’. The remaining bins are defined such that their log ranges are equal to one another. The curve defined by the mean interaction value of each bin can be smoothed using a triangular smoothing operation.
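
The bin layout described above, one bin up to ‘minsize’ followed by bins of equal width in log space, can be sketched as follows. The helper names are illustrative, and the sketch requires an explicit ‘maxsize’ rather than deriving it from the data as the real function does when ‘maxsize’ is zero.

```python
import bisect
import math


def distance_bin_cutoffs(numbins, minsize, maxsize):
    """Upper cutoffs for each distance bin.

    The first cutoff is minsize; the remaining cutoffs are evenly spaced
    in log space up to maxsize.
    """
    if numbins < 2 or not 0 < minsize < maxsize:
        raise ValueError("need numbins >= 2 and 0 < minsize < maxsize")
    log_min = math.log(minsize)
    log_step = (math.log(maxsize) - log_min) / (numbins - 1)
    return [math.exp(log_min + log_step * i) for i in range(numbins)]


def distance_bin(distance, cutoffs):
    """Index of the first bin whose upper cutoff is at least `distance`."""
    return min(bisect.bisect_left(cutoffs, distance), len(cutoffs) - 1)


cutoffs = distance_bin_cutoffs(numbins=5, minsize=200, maxsize=2000000)
short_range_bin = distance_bin(150, cutoffs)
mid_range_bin = distance_bin(5000, cutoffs)
```

Here each bin after the first spans a tenfold range of distances, so the cutoffs are approximately 200, 2,000, 20,000, 200,000, and 2,000,000.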

Parameters:
  • numbins (int.) – The number of bins to divide the distance range into. The first bin extends from zero to ‘minsize’, while the remaining bins are divided into evenly-spaced log-sized bins from ‘minsize’ to ‘maxsize’ or the maximum inter-fend distance, whichever is greater.
  • minsize (int.) – The upper size limit of the smallest distance bin.
  • maxsize (int.) – If this value is larger than the largest included chromosome, it will extend bins out to maxsize. If this value is smaller, it is ignored.
  • corrected (bool.) – If True, correction values are applied to counts prior to summing.
Returns:

None

Attributes:
  • distance_parameters (ndarray) - A numpy array of type float32 and size of N x 3 where N is one less than the number of distance bins containing at least one valid observation out of the ‘numbins’ number of bins that the distance range was divided into. The first column contains the upper distance cutoff for each bin, the second column contains the slope associated with each bin line segment, and the third column contains the line segment intercepts. Line segments describe the relationship of observation counts versus distance.
  • bin_distance_parameters (ndarray) - A numpy array of type float32 and size of N x 3 where N is one less than the number of distance bins containing at least one valid observation out of the ‘numbins’ number of bins that the distance range was divided into. The first column contains the upper distance cutoff for each bin, the second column contains the slope associated with each bin line segment, and the third column contains the line segment intercepts. Line segments describe the relationship of binary observations versus distance.
  • chromosome_means (ndarray) - A numpy array of type float32 and length equal to the number of chromosomes. This is initialized to zeros until fend correction values are found.
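
A table of (upper cutoff, slope, intercept) rows like ‘distance_parameters’ can be evaluated by locating the segment that covers a given distance and applying its line. The sketch below is a hypothetical illustration: the text does not say whether cutoffs and signals are stored in log space, so it simply applies the segments on whatever scale the rows use, and the toy values are invented.

```python
import bisect


def expected_signal(x, params):
    """Evaluate a piecewise-linear curve stored as (cutoff, slope, intercept).

    Values beyond the last cutoff reuse the final segment (an assumption).
    """
    cutoffs = [row[0] for row in params]
    index = min(bisect.bisect_left(cutoffs, x), len(params) - 1)
    _, slope, intercept = params[index]
    return slope * x + intercept


# Invented toy segments; adjacent segments meet at the cutoffs 2.0 and 4.0.
toy_params = [
    (2.0, -1.0, 5.0),
    (4.0, -0.5, 4.0),
    (8.0, -0.25, 3.0),
]
near = expected_signal(1.0, toy_params)
middle = expected_signal(3.0, toy_params)
far = expected_signal(6.0, toy_params)
```

Storing one row per segment keeps lookup to a binary search plus one multiply and add, which matters when the curve is evaluated for every fend pair.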
find_express_fend_corrections(iterations=100, mindistance=0, maxdistance=0, remove_distance=True, usereads='cis', mininteractions=0, minchange=0.0001, chroms=[], precorrect=False, binary=False, kr=False)

Using iterative matrix-balancing approximation, learn correction values for each valid fend. This function is MPI compatible.

Parameters:
  • iterations (int.) – The minimum number of iterations to use for learning fend corrections.
  • mindistance (int.) – This is the minimum distance between fend midpoints needed to be included in the analysis. All possible and observed interactions with a distance shorter than this are ignored. If ‘usereads’ is set to ‘trans’, this value is ignored.
  • maxdistance (int.) – The maximum inter-fend distance to be included in modeling. If ‘usereads’ is set to ‘trans’, this value is ignored.
  • remove_distance (bool.) – Specifies whether the estimated distance-dependent portion of the signal is removed prior to learning fend corrections.
  • usereads (str.) – Specifies which set of interactions to use, ‘cis’, ‘trans’, or ‘all’.
  • mininteractions (int.) – If a non-zero ‘mindistance’ is specified or only ‘trans’ interactions are used, fend filtering will be performed again to ensure that the data being used is sufficient for analyzed fends. This parameter may specify how many interactions are needed for valid fends. If not given, the value used for the last call to filter_fends() is used or, barring that, one.
  • minchange (float) – The minimum mean change in fend correction parameter values needed to keep running past ‘iterations’ number of iterations. If using the Knight-Ruiz algorithm this is the residual cutoff.
  • chroms (list) – A list of chromosomes to calculate corrections for. If set as None, all chromosome corrections are found.
  • precorrect (bool.) – Use binning-based corrections in expected value calculations, resulting in a chained normalization approach.
  • binary (bool.) – Use binary indicator instead of counts.
  • kr (bool.) – Use the Knight Ruiz matrix balancing algorithm instead of weighted matrix balancing. This option ignores ‘iterations’.
Returns:

None

Attributes:
  • corrections (ndarray) - A numpy array of type float32 and length equal to the number of fends. All invalid fends have an associated correction value of zero.

The ‘normalization’ attribute is updated to ‘express’ or ‘binning-express’, depending on if the ‘precorrect’ option is selected. In addition, the ‘chromosome_means’ attribute is updated such that the mean correction (sum of all valid chromosomal correction value pairs) is adjusted to zero and the corresponding chromosome mean is adjusted the same amount but the opposite sign.
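
For intuition, iterative matrix balancing can be sketched on a small dense matrix. This is a generic damped balancing loop, not HiFive's weighted implementation or the Knight-Ruiz algorithm. It omits the distance-dependence removal, and unlike ‘iterations’ in the real function, it treats its iteration count as a maximum. The toy counts are built from known biases so the result can be checked.

```python
import math


def balance(matrix, max_iterations=200, minchange=1e-9):
    """Learn one correction per row so corrected row sums approach 1.

    The corrected value for a pair is matrix[i][j] / (c[i] * c[j]).
    """
    size = len(matrix)
    corrections = [1.0] * size
    for _ in range(max_iterations):
        row_sums = [
            sum(matrix[i][j] / (corrections[i] * corrections[j])
                for j in range(size))
            for i in range(size)
        ]
        # Damped update: the square root keeps symmetric updates stable.
        updated = [
            c * math.sqrt(s) if s > 0 else c
            for c, s in zip(corrections, row_sums)
        ]
        change = sum(abs(u - c) for u, c in zip(updated, corrections)) / size
        corrections = updated
        if change < minchange:
            break
    return corrections


true_biases = [1.0, 2.0, 4.0]
counts = [[bi * bj for bj in true_biases] for bi in true_biases]
corrections = balance(counts)
balanced = [
    [counts[i][j] / (corrections[i] * corrections[j]) for j in range(3)]
    for i in range(3)
]
```

Because the toy counts are an outer product of the biases, the learned corrections come out proportional to 1 : 2 : 4 and every corrected row sums to about 1.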

find_probability_fend_corrections(mindistance=0, maxdistance=0, minchange=0.0001, max_iterations=1000, learningstep=0.5, chroms=[], precalculate=True, precorrect=False, model='binomial')

Using gradient descent, learn correction values for each valid fend based on a binomial or Poisson distribution of observations. This function is MPI compatible.

Parameters:
  • mindistance (int.) – The minimum inter-fend distance to be included in modeling.
  • maxdistance (int.) – The maximum inter-fend distance to be included in modeling.
  • minchange (float) – The cutoff threshold for early learning termination for the maximum absolute gradient value.
  • max_iterations (int.) – The maximum number of iterations to carry on gradient descent for.
  • learningstep (float) – The scaling factor by which to decrease the learning rate if a step doesn’t meet the Armijo criterion.
  • chroms (list) – A list of chromosomes to calculate corrections for. If set as None, all chromosome corrections are found.
  • precalculate (bool.) – Specifies whether the correction values should be initialized at the fend means.
  • precorrect (bool.) – Use binning-based corrections in expected value calculations, resulting in a chained normalization approach.
  • model (str.) – Which probability model to use, either ‘poisson’ or ‘binomial’. If ‘poisson’ is chosen, read counts are used. If ‘binomial’ is chosen, reads are converted to a 0/1 indicator of observed/unobserved status.
Returns:

None

Attributes:
  • corrections (ndarray) - A numpy array of type float32 and length equal to the number of fends. All invalid fends have an associated correction value of zero.

The ‘normalization’ attribute is updated to ‘probability’ or ‘binning-probability’, depending on if the ‘precorrect’ option is selected. In addition, the ‘chromosome_means’ attribute is updated such that the mean correction (sum of all valid chromosomal correction value pairs) is adjusted to zero and the corresponding chromosome mean is adjusted the same amount but the opposite sign.
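
The roles of ‘learningstep’, ‘minchange’, and ‘max_iterations’ can be seen in a generic gradient descent with Armijo backtracking. The sketch below minimizes a simple quadratic rather than a binomial or Poisson likelihood. The constant `c1`, the initial step of 1.0, and the function names are assumptions made for illustration.

```python
def armijo_descent(f, grad, x, learningstep=0.5, minchange=1e-4,
                   max_iterations=1000, c1=1e-4):
    """Gradient descent that shrinks the step until the Armijo test passes."""
    for _ in range(max_iterations):
        g = grad(x)
        # Stop early once the largest absolute gradient value is small.
        if max(abs(v) for v in g) < minchange:
            break
        fx = f(x)
        grad_norm_sq = sum(v * v for v in g)
        step = 1.0
        candidate = x
        while step > 1e-12:
            candidate = [xi - step * gi for xi, gi in zip(x, g)]
            # Armijo criterion: require a sufficient decrease in f.
            if f(candidate) <= fx - c1 * step * grad_norm_sq:
                break
            step *= learningstep
        x = candidate
    return x


def objective(p):
    return (p[0] - 3.0) ** 2 + 2.0 * (p[1] + 1.0) ** 2


def gradient(p):
    return [2.0 * (p[0] - 3.0), 4.0 * (p[1] + 1.0)]


solution = armijo_descent(objective, gradient, [0.0, 0.0])
```

A smaller ‘learningstep’ shrinks the step faster after a rejected trial, trading fewer function evaluations per iteration for potentially shorter accepted steps.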

find_trans_means()

Calculate the mean signals across all valid fend-pair trans interactions for each chromosome pair.

Returns:

None

Attributes:
  • trans_mean (float) - A float corresponding to the mean signal of inter-chromosome interactions.
learn_fend_3D_model(chrom, minobservations=10)

Learn coordinates for a 3D model of data using an approximate PCA dimensional reduction.

This function makes use of the mlpy function PCAFast() to reduce the data to a set of three coordinates per fend. Cis data for all unfiltered fends for the specified chromosome are dynamically binned to yield a complete distance matrix. The diagonal is set equal to the highest valid enrichment value after dynamic binning. This N x N matrix is passed to PCAFast() and reduced to an N x 3 matrix.

Parameters:
  • chrom (str.) – The chromosome to learn the model for.
  • minobservations (int.) – The minimum number of observed reads needed to cease bin expansion in the dynamic binning phase.
Returns:

Array containing a row for each valid fend and columns containing X coordinate, Y coordinate, Z coordinate, and sequence coordinate (fend midpoint).

load()

Load analysis parameters from h5dict specified at object creation and open h5dicts for associated HiCData and Fend objects.

Any call of this function will overwrite current object data with values from the last save() call.

Returns:

None

load_data(filename)

Load fend-pair counts and fend object from HiCData object.

Parameters:

filename (str.) – Specifies the file name of the HiCData object to associate with this analysis.

Returns:

None

Attributes:
  • datafilename (str.) - A string containing the relative path of the HiCData file.
  • fendfilename (str.) - A string containing the relative path of the Fend file associated with the HiCData file.
  • fends (filestream) - A filestream to the hdf5 Fend file such that all saved Fend attributes can be accessed through this class attribute.
  • data (filestream) - A filestream to the hdf5 HiCData file such that all saved HiCData attributes can be accessed through this class attribute.
  • chr2int (dict.) - A dictionary that converts chromosome names to chromosome indices.
  • filter (ndarray) - A numpy array of type int32 and size N where N is the number of fends. This contains the inclusion status of each fend with a one indicating included and zero indicating excluded and is initialized with all fends included.

When a HiCData object is associated with the project file, the ‘history’ attribute is updated with the history of the HiCData object.

reset_filter()

Return all fends to a valid filter state.

Returns:

None

save(out_fname=None)

Save analysis parameters to h5dict.

Parameters:
  • out_fname (str.) – Specifies the file name of the HiC object to save this analysis to.
Returns:

None

trans_heatmap(chrom1, chrom2, start1=None, stop1=None, startfend1=None, stopfend1=None, binbounds1=None, start2=None, stop2=None, startfend2=None, stopfend2=None, binbounds2=None, binsize=1000000, skipfiltered=False, datatype='enrichment', returnmapping=False, dynamically_binned=False, minobservations=0, searchdistance=0, expansion_binsize=0, removefailed=False, image_file=None, **kwargs)

Return a heatmap of trans data of the type and shape specified by the passed arguments.

This function returns a heatmap for trans interactions between two chromosomes within a region, bounded by either ‘start1’, ‘stop1’, ‘start2’ and ‘stop2’ or ‘startfend1’, ‘stopfend1’, ‘startfend2’, and ‘stopfend2’ (‘start’ and ‘stop’ take precedence), or if given, the outer coordinates of the arrays passed by ‘binbounds1’ and ‘binbounds2’. The data in the array is determined by the ‘datatype’, being raw, fend-corrected, distance-corrected, enrichment, or expected data. The array shape is always rectangular. See hic_binning for further explanation of ‘datatype’. If using dynamic binning (‘dynamically_binned’ is set to True), ‘minobservations’, ‘searchdistance’, ‘expansion_binsize’, and ‘removefailed’ are used to control the dynamic binning process. Otherwise these arguments are ignored.

Parameters:
  • chrom1 (str.) – The name of the first chromosome to obtain data from.
  • chrom2 (str.) – The name of the second chromosome to obtain data from.
  • start1 (int.) – The coordinate at the beginning of the smallest bin from ‘chrom1’. If unspecified, ‘start1’ will be the first multiple of ‘binsize’ below the ‘startfend1’ mid. If there is a conflict between ‘start1’ and ‘startfend1’, ‘start1’ is given preference. Optional.
  • stop1 (int.) – The largest coordinate to include in the array from ‘chrom1’, measured from fend midpoints. If both ‘stop1’ and ‘stopfend1’ are given, ‘stop1’ will override ‘stopfend1’. ‘stop1’ will be shifted higher as needed to make the last bin of size ‘binsize’. Optional.
  • startfend1 (int.) – The first fend from ‘chrom1’ to include in the array. If unspecified and ‘start1’ is not given, this is set to the first valid fend in ‘chrom1’. In cases where ‘start1’ is specified and conflicts with ‘startfend1’, ‘start1’ is given preference. Optional.
  • stopfend1 (int.) – The first fend not to include in the array from ‘chrom1’. If unspecified and ‘stop1’ is not given, this is set to the last valid fend in ‘chrom1’ + 1. In cases where ‘stop1’ is specified and conflicts with ‘stopfend1’, ‘stop1’ is given preference. Optional.
  • binbounds1 (numpy array) – An array containing start and stop coordinates for a set of user-defined bins to use for partitioning ‘chrom1’. Any fend not falling in a bin is ignored.
  • start2 (int.) – The coordinate at the beginning of the smallest bin from ‘chrom2’. If unspecified, ‘start2’ will be the first multiple of ‘binsize’ below the ‘startfend2’ mid. If there is a conflict between ‘start2’ and ‘startfend2’, ‘start2’ is given preference. Optional.
  • stop2 (int.) – The largest coordinate to include in the array from ‘chrom2’, measured from fend midpoints. If both ‘stop2’ and ‘stopfend2’ are given, ‘stop2’ will override ‘stopfend2’. ‘stop2’ will be shifted higher as needed to make the last bin of size ‘binsize’. Optional.
  • startfend2 (int.) – The first fend from ‘chrom2’ to include in the array. If unspecified and ‘start2’ is not given, this is set to the first valid fend in ‘chrom2’. In cases where ‘start2’ is specified and conflicts with ‘startfend2’, ‘start2’ is given preference. Optional.
  • stopfend2 (int.) – The first fend not to include in the array from ‘chrom2’. If unspecified and ‘stop2’ is not given, this is set to the last valid fend in ‘chrom2’ + 1. In cases where ‘stop2’ is specified and conflicts with ‘stopfend2’, ‘stop2’ is given preference. Optional.
  • binbounds2 (numpy array) – An array containing start and stop coordinates for a set of user-defined bins to use for partitioning ‘chrom2’. Any fend not falling in a bin is ignored.
  • binsize (int.) – This is the coordinate width of each bin. If binbounds is not None, this value is ignored.
  • skipfiltered (bool.) – If ‘True’, all interaction bins for filtered out fends are removed and a reduced-size array is returned.
  • datatype (str.) – This specifies the type of data that is processed and returned. Options are ‘raw’, ‘distance’, ‘fend’, ‘enrichment’, and ‘expected’. Observed values are always in the first index along the last axis, except when ‘datatype’ is ‘expected’. In this case, filter values replace counts. Conversely, if ‘raw’ is specified, unfiltered fends return value of one. Expected values are returned for ‘distance’, ‘fend’, ‘enrichment’, and ‘expected’ values of ‘datatype’. ‘distance’ uses only the expected signal given distance for calculating the expected values, ‘fend’ uses only fend correction values, and both ‘enrichment’ and ‘expected’ use both correction and distance mean values.
  • returnmapping (bool.) – If ‘True’, a list containing the data array and two 2d arrays of N x 4 containing the first fend and last fend plus one included in each bin and first and last coordinates for the first and second chromosomes is returned. Otherwise only the data array is returned.
  • dynamically_binned (bool.) – If ‘True’, return dynamically binned data.
  • minobservations (int.) – The fewest number of observed reads needed for a bin to be counted as valid and stop expanding.
  • searchdistance (int.) – The furthest distance from the bin midpoint to expand bounds. If this is set to zero, there is no limit on expansion distance.
  • expansion_binsize (int.) – The size of bins to use for data to pull from when expanding dynamic bins. If set to zero, unbinned data is used.
  • removefailed (bool.) – If a non-zero ‘searchdistance’ is given, it is possible for a bin not to meet the ‘minobservations’ criteria before stopping looking. If this occurs and ‘removefailed’ is True, the observed and expected values for that bin are zero.
  • image_file (str.) – If a filename is specified, a PNG image file is written containing the heatmap data. Arguments for the appearance of the image can be passed as additional keyword arguments.
Returns:

Array in format requested with ‘arraytype’ containing data requested with ‘datatype’. If ‘returnmapping’ is True, a list is returned containing the requested data array and an array of associated positions (dependent on the binning options selected).
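
As a rough illustration of how ‘stop2’ may be shifted higher so that the last bin has size ‘binsize’, the sketch below rounds a stop coordinate up to the next bin boundary. The helper name shift_stop is hypothetical and is not part of hifive; the library's own handling may differ in detail.

```python
def shift_stop(start, stop, binsize):
    """Return the smallest value >= stop such that (value - start)
    is a whole multiple of binsize."""
    if binsize <= 0:
        raise ValueError("binsize must be positive")
    if stop < start:
        raise ValueError("stop must not be smaller than start")
    span = stop - start
    # Ceiling division gives the number of bins needed to cover the span.
    num_bins = -(-span // binsize)
    return start + num_bins * binsize


# A 25 kb span with 10 kb bins needs three bins, so stop moves up to 30000.
new_stop = shift_stop(0, 25000, 10000)
```

A stop that already falls on a bin boundary is returned unchanged.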

write_heatmap(filename, binsize, includetrans=True, datatype='enrichment', chroms=[], dynamically_binned=False, minobservations=0, searchdistance=0, expansion_binsize=0, removefailed=False, format='hdf5')

Create a file containing binned interaction arrays, bin positions, and an index of included chromosomes. This function is MPI compatible.

Parameters:
  • filename (str.) – Location to write heatmap object to. If format is ‘txt’, this should be a filename prefix.
  • binsize (int.) – Size of bins for interaction arrays.
  • includetrans (bool.) – Indicates whether trans interaction arrays should be calculated and saved.
  • datatype (str.) – This specifies the type of data that is processed and returned. Options are ‘raw’, ‘distance’, ‘fend’, ‘enrichment’, and ‘expected’. Observed values are always in the first index along the last axis, except when ‘datatype’ is ‘expected’. In this case, filter values replace counts. Conversely, if ‘raw’ is specified, unfiltered fends return a value of one. Expected values are returned for ‘distance’, ‘fend’, ‘enrichment’, and ‘expected’ values of ‘datatype’. ‘distance’ uses only the expected signal given distance for calculating the expected values, ‘fend’ uses only fend correction values, and both ‘enrichment’ and ‘expected’ use both correction and distance mean values.
  • chroms (list) – A list of chromosome names indicating which chromosomes should be included. If left empty, all chromosomes are included. Optional.
  • dynamically_binned (bool.) – If ‘True’, return dynamically binned data.
  • minobservations (int.) – The fewest number of observed reads needed for a bin to be counted as valid and stop expanding.
  • searchdistance (int.) – The furthest distance from the bin midpoint to expand bounds. If this is set to zero, there is no limit on expansion distance.
  • expansion_binsize (int.) – The size of bins to use for data to pull from when expanding dynamic bins. If set to zero, unbinned data is used.
  • removefailed (bool.) – If a non-zero ‘searchdistance’ is given, it is possible for a bin not to meet the ‘minobservations’ criteria before stopping looking. If this occurs and ‘removefailed’ is True, the observed and expected values for that bin are zero.
  • format (str.) – A string indicating whether to save heatmaps as text matrices (‘txt’), an HDF5 file of numpy arrays (‘hdf5’), or a numpy npz file (‘npz’).
Returns:

None
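
The interplay of ‘minobservations’, ‘searchdistance’, and ‘removefailed’ can be illustrated with a simplified one-dimensional sketch. This is not hifive's implementation, which operates on two-dimensional interaction arrays; the function expand_bin and its symmetric, one-bin-at-a-time expansion are assumptions made purely for illustration.

```python
def expand_bin(counts, index, binsize, minobservations,
               searchdistance=0, removefailed=False):
    """Widen a window around counts[index] until it holds at least
    minobservations reads. Returns the summed reads for the window."""
    total = counts[index]
    radius = 0  # number of bins added on each side so far
    while total < minobservations:
        radius += 1
        # With a non-zero searchdistance, stop once the next bins'
        # midpoints lie further than that from the starting midpoint.
        if searchdistance > 0 and radius * binsize > searchdistance:
            return 0 if removefailed else total
        lo, hi = index - radius, index + radius
        if lo < 0 and hi >= len(counts):
            # Nothing left to add on either side.
            return 0 if removefailed else total
        if lo >= 0:
            total += counts[lo]
        if hi < len(counts):
            total += counts[hi]
    return total


counts = [1, 0, 2, 0, 4]
unlimited = expand_bin(counts, 2, 10, 5)
limited = expand_bin(counts, 2, 10, 5, searchdistance=10, removefailed=True)
```

With no search limit the window grows until the threshold is met. With a limit of one bin width the threshold cannot be reached, and ‘removefailed’ zeroes the result.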

The following attributes are created within the hdf5 dictionary file. Arrays are accessible as datasets while the resolution is held as an attribute.

Attributes:
  • resolution (int.) - The bin size that data are accumulated in.
  • chromosomes (ndarray) - A numpy array of strings listing all of the chromosomes included in the heatmaps.
  • N.positions (ndarray) - A series of numpy arrays of type int32, one for each chromosome where N is the chromosome name, containing one row for each bin and four columns denoting the start and stop coordinates and first fend and last fend plus one for each bin.
  • N.counts (ndarray) - A series of numpy arrays of type int32, one for each chromosome where N is the chromosome name, containing the observed counts for valid fend combinations. Arrays are in an upper-triangle format such that they have N * (N - 1) / 2 entries where N is the number of fends or bins in the chromosome.
  • N.expected (ndarray) - A series of numpy arrays of type float32, one for each chromosome where N is the chromosome name, containing the expected counts for valid fend combinations. Arrays are in an upper-triangle format such that they have N * (N - 1) / 2 entries where N is the number of fends in the chromosome.
  • N.enrichment (ndarray) - A series of numpy arrays of type float32, one for each chromosome where N is the chromosome name, containing the observed / expected counts for valid fend combinations. Arrays are in an upper-triangle format such that they have N * (N - 1) / 2 entries where N is the number of fends in the chromosome.
  • N_by_M.counts (ndarray) - A series of numpy arrays of type int32, one for each chromosome pair N and M if trans data are included, containing the observed counts for valid fend combinations. The chromosome name order specifies which axis corresponds to which chromosome.
  • N_by_M.expected (ndarray) - A series of numpy arrays of type float32, one for each chromosome pair N and M if trans data are included, containing the expected counts for valid fend combinations. The chromosome name order specifies which axis corresponds to which chromosome.
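
The compact upper-triangle arrays hold N * (N - 1) / 2 entries, one per pair of distinct bins. The sketch below maps a bin pair to a position in such a flat array, assuming row-major ordering of the pairs (0, 1), (0, 2), and so on up to (N - 2, N - 1). That ordering is an assumption and should be checked against the file being read; the function name triu_index is hypothetical.

```python
def triu_index(i, j, n):
    """Flat index of the pair (i, j), i < j, in a row-major upper-triangle
    array of n * (n - 1) // 2 entries that excludes the diagonal."""
    if not 0 <= i < j < n:
        raise ValueError("require 0 <= i < j < n")
    # Rows 0..i-1 contribute (n - 1) + (n - 2) + ... + (n - i) entries.
    row_offset = i * n - i * (i + 1) // 2
    return row_offset + (j - i - 1)


n = 4
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
indices = [triu_index(i, j, n) for i, j in pairs]
```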
write_multiresolution_heatmap(filename, datatype='fend', maxbinsize=1280000, minbinsize=5000, trans_maxbinsize=None, trans_minbinsize=None, minobservations=5, chroms=None, includetrans=True, midbinsize=40000)

Create a multi-resolution heatmap file containing data for each requested chromosome. This function is MPI-compatible.

Parameters:
  • filename (str.) – Location to write the multi-resolution heatmap to.
  • datatype (str.) – This specifies the type of data that is processed and returned. Options are ‘raw’, ‘distance’, ‘fend’, and ‘enrichment’. Observed values are always in the first index along the last axis. If ‘raw’ is specified, unfiltered fends return a value of one. Expected values are returned for ‘distance’, ‘fend’, and ‘enrichment’ values of ‘datatype’. ‘distance’ uses only the expected signal given distance for calculating the expected values, ‘fend’ uses only fend correction values, and ‘enrichment’ uses both correction and distance mean values.
  • maxbinsize (int.) – The maximum sized bin (lowest resolution) heatmap to be produced for each chromosome.
  • minbinsize (int.) – The minimum sized bin (highest resolution) heatmap to be produced for each chromosome. The maxbinsize and minbinsize must differ by a factor that is a power of 2 (e.g. minbinsize * 2^N = maxbinsize for some integer N).
  • trans_maxbinsize (int.) – The maximum sized bin (lowest resolution) heatmap to be produced for inter-chromosomal interactions. If trans_maxbinsize is None, the value maxbinsize is used for inter-chromosomal interactions.
  • trans_minbinsize (int.) – The minimum sized bin (highest resolution) heatmap to be produced for inter-chromosomal interactions. If trans_minbinsize is None, the value minbinsize is used for inter-chromosomal interactions. The trans_maxbinsize and trans_minbinsize must differ by a factor that is a power of 2 (e.g. trans_minbinsize * 2^N = trans_maxbinsize for some integer N).
  • minobservations (int.) – The minimum number of reads needed for a bin to be considered valid and be included in the heatmap.
  • chroms (list) – A list of chromosomes to include in the multi-resolution heatmap. If chroms is None, all chromosomes will be included.
  • includetrans (bool.) – Indicates whether to calculate all of the inter-chromosomal interaction multi-resolution heatmaps.
  • midbinsize (int.) – This is used to determine the smallest bin size (highest resolution) complete heatmap to generate in producing the multi-resolution heatmap. It does not affect the resulting output but can be used to limit the total memory usage, with higher values using less memory but more time.
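
The power-of-two constraint on ‘minbinsize’ and ‘maxbinsize’ can be checked before calling the function. The helper below is a hypothetical sketch, not part of hifive; it returns the bin sizes from largest to smallest, or raises an error if the constraint is not met.

```python
def resolution_ladder(minbinsize, maxbinsize):
    """Return [maxbinsize, maxbinsize / 2, ..., minbinsize], requiring
    minbinsize * 2**N == maxbinsize for some integer N >= 0."""
    if minbinsize <= 0 or maxbinsize < minbinsize:
        raise ValueError("need 0 < minbinsize <= maxbinsize")
    sizes = [minbinsize]
    # Keep doubling until the largest requested bin size is reached or passed.
    while sizes[-1] < maxbinsize:
        sizes.append(sizes[-1] * 2)
    if sizes[-1] != maxbinsize:
        raise ValueError("maxbinsize must equal minbinsize * 2**N")
    return sizes[::-1]


# The default minbinsize and maxbinsize satisfy the constraint with N = 8.
ladder = resolution_ladder(5000, 1280000)
```

The same check applies to ‘trans_minbinsize’ and ‘trans_maxbinsize’ when they are given.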

The multi-resolution heatmap file has the following structure.

Header:
  • magic number “42054205” - 4 bytes (8 hexadecimal digits)
  • flag indicating if trans are included - 4 bytes (1 int32)
  • number of chromosomes “N” - 4 bytes (1 int32)
  • chromosome names - N * 10 bytes (N * 10 chars)
  • chrom index bounds - (N * (N + 1) / 2 + 1) * 4 bytes (N * (N + 1) / 2 + 1 int32)
  • intra-chrom top layer number of partitions - N * 4 bytes (N int32)
  • inter-chrom top layer number of partitions - N * 4 bytes (N int32)
  • chrom total data bins - (N * (N + 1) / 2) * 4 bytes (N int32)
  • chrom total index bins - (N * (N + 1) / 2) * 4 bytes (N int32)
  • intra-chrom start coordinates - N * 4 bytes (N int32)
  • intra-chrom stop coordinates - N * 4 bytes (N int32)
  • inter-chrom start coordinates - N * 4 bytes (N int32)
  • inter-chrom stop coordinates - N * 4 bytes (N int32)
  • chrom min scores - (N * (N + 1) / 2) * 4 bytes (N * (N + 1) / 2 float32)
  • chrom max scores - (N * (N + 1) / 2) * 4 bytes (N * (N + 1) / 2 float32)
  • intra-chromosome largest bin size - 4 bytes (1 int32)
  • intra-chromosome smallest bin size - 4 bytes (1 int32)
  • inter-chromosome largest bin size - 4 bytes (1 int32)
  • inter-chromosome smallest bin size - 4 bytes (1 int32)
  • minimum number of observed reads - 4 bytes (1 int32)
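
The leading header fields can be read with Python's struct module. The sketch below builds a small synthetic header in memory and parses the magic number, trans flag, chromosome count, and chromosome names. Little-endian byte order and space-padded 10-character names are assumptions made here for illustration; only the order and size of the fields follow the description above. The function name read_header_prefix is hypothetical.

```python
import io
import struct

MAGIC = bytes.fromhex("42054205")  # the 4-byte magic number


def read_header_prefix(handle):
    """Parse the magic number, trans flag, chromosome count and names."""
    if handle.read(4) != MAGIC:
        raise ValueError("not a multi-resolution heatmap file")
    # Two int32 values; '<' selects little-endian (an assumption).
    includetrans, num_chroms = struct.unpack("<ii", handle.read(8))
    names = []
    for _ in range(num_chroms):
        raw = handle.read(10)  # each name occupies 10 chars
        # Padding with spaces or NUL bytes is assumed and stripped.
        names.append(raw.decode("ascii").strip(" \x00"))
    return bool(includetrans), names


# Synthetic example data, not taken from a real file.
fake = MAGIC + struct.pack("<ii", 1, 2) + b"1".ljust(10) + b"X".ljust(10)
trans, chroms = read_header_prefix(io.BytesIO(fake))
```

The remaining header arrays could be read the same way once the byte order has been confirmed against a real file.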

Interaction and index arrays:
  • data_array_1_by_1 float32, index_array_1_by_1 int32, shape_array_1_by_1 int32
  • data_array_1_by_2 float32, index_array_1_by_2 int32, shape_array_1_by_2 int32
  • ...
  • data_array_1_by_N float32, index_array_1_by_N int32, shape_array_1_by_N int32
  • data_array_2_by_2 float32, index_array_2_by_2 int32, shape_array_2_by_2 int32
  • data_array_2_by_3 float32, index_array_2_by_3 int32, shape_array_2_by_3 int32
  • data_array_2_by_N float32, index_array_2_by_N int32, shape_array_2_by_N int32
  • ...
  • data_array_N_by_N float32, index_array_N_by_N int32, shape_array_N_by_N int32

Each data array starts with a flattened array of the complete heatmap with the largest bin size. Intra-chromosomal heatmaps are upper-triangle arrays including the diagonal while inter-chromosomal arrays are rectangles. For each bin, there is a corresponding position in the index array pointing to the start index in the data array for the interactions contained within the data bin, partitioned into smaller bin sizes. Bins are always partitioned by a factor of 2. If none of the partitioned bins pass the minimum observation threshold, the index is -1. The shape array contains an integer indicating the number and position of valid bins (and therefore the number of bins, starting with the index number, containing data underneath the higher-level bin). Shape values are converted from a binary number with each bit representing whether or not each subpartition contains valid data, going left to right for the top row and then the bottom row. For example, a subdivision containing data only in the top-left bin would have a value of 2, whereas a completely full set of subpartitions would have a value of 15. The smallest binsize data does not have corresponding positions in the index array or shape array as there are no further sub-partitionings. Indices in the index array are relative to the data array start position, given in the ‘chrom index bounds’ portion of the header.
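
Because each set bit in a shape value marks one valid sub-bin, the number of child entries stored for a parent bin equals the number of set bits. The sketch below, using a hypothetical helper child_range, returns the positions of a bin's children relative to the data array start, and an empty range when the index is -1. It deliberately does not interpret which bit corresponds to which sub-bin.

```python
def child_range(index, shape):
    """Positions (relative to the data array start) of the valid
    sub-bins stored beneath a parent bin."""
    if index == -1:
        # No sub-bin passed the minimum observation threshold.
        return range(0)
    num_children = bin(shape).count("1")  # one set bit per valid sub-bin
    return range(index, index + num_children)


full = list(child_range(100, 15))   # all four sub-bins valid
single = list(child_range(100, 2))  # a single valid sub-bin
empty = list(child_range(-1, 0))    # no valid sub-bins
```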