The fivec_binning module

This module contains scripts for generating compact, upper-triangle, and full matrices of 5C interaction data.

Concepts

Data can be arranged in compact, complete (full), or flattened (row-major) upper-triangle arrays. Compact arrays are N x M, where N is the number of forward probe fragments and M is the number of reverse probe fragments. Data can be raw, fragment-corrected, distance-dependence removed, or enrichment values. Compact and complete arrays are 3-dimensional, with observed values in the first layer of the third dimension and expected values in the second layer. Upper-triangle arrays are the exception: they are 2-dimensional, with observed and expected values divided along the second axis.
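
The relationship between the full and flattened upper-triangle layouts can be sketched in a few lines. The helper names below (`upper_index`, `full_to_upper`) are hypothetical and not part of hifive, and plain nested lists stand in for numpy arrays. The sketch assumes row-major ordering of the pairs (i, j) with i < j and the diagonal excluded, as described above.

```python
def upper_index(i, j, n):
    """Row-major position of pair (i, j), i < j, in a flattened upper triangle
    that excludes the diagonal (illustrative helper, not a hifive function)."""
    if not 0 <= i < j < n:
        raise ValueError("requires 0 <= i < j < n")
    return i * n - i * (i + 1) // 2 + (j - i - 1)


def full_to_upper(full):
    """Flatten an N x N x 2 nested list into an (N * (N - 1) / 2) x 2 list."""
    n = len(full)
    return [list(full[i][j]) for i in range(n) for j in range(i + 1, n)]


# Toy 4 x 4 x 2 array: observed = i + j, expected = 1.0 everywhere.
n = 4
full = [[[float(i + j), 1.0] for j in range(n)] for i in range(n)]
upper = full_to_upper(full)

# The pair (1, 3) lands at flat position 4 and keeps both layers.
pair = upper[upper_index(1, 3, n)]
```

With numpy, the same pair ordering is what `numpy.triu_indices(n, 1)` produces.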

API documentation

hifive.fivec_binning.bin_cis_array(data_array, data_mapping, binsize=10000, binbounds=None, start=None, stop=None, arraytype='full', returnmapping=False, **kwargs)

Create an array of format ‘arraytype’ and fill ‘binsize’ bins or bins defined by ‘binbounds’ with data provided in the array passed by ‘data_array’.

Parameters:
  • data_array (numpy array) – A 2d (upper) or 3d (full) array containing data to be binned. Array format will be determined from the number of dimensions.
  • data_mapping (numpy array) – An N x 4 2d integer array containing the start and stop coordinates, and start and stop fragments for each of the N bin ranges in ‘data_array’.
  • binsize (int.) – This is the coordinate width of each bin. If binbounds is not None, this value is ignored.
  • binbounds (numpy array) – An array containing start and stop coordinates for a set of user-defined bins. Any bin from ‘data_array’ not falling in a bin is ignored.
  • start (int.) – The coordinate at the beginning of the first bin of the binned data. If unspecified, ‘start’ will be the first multiple of ‘binsize’ below the first coordinate from ‘data_mapping’. If ‘binbounds’ is given, ‘start’ is ignored. Optional.
  • stop (int.) – The coordinate at the end of the last bin of the binned data. If unspecified, ‘stop’ will be the first multiple of ‘binsize’ after the last coordinate from ‘data_mapping’. If needed, ‘stop’ is adjusted upward to create a complete last bin. If ‘binbounds’ is given, ‘stop’ is ignored. Optional.
  • arraytype (str.) – This determines the shape of the array in which data are returned. Acceptable values are ‘full’ and ‘upper’. ‘full’ returns a square, symmetric array of size N x N x 2. ‘upper’ returns only the flattened upper triangle of a full array, excluding the diagonal, with size (N * (N - 1) / 2) x 2.
  • returnmapping (bool.) – If ‘True’, a list containing the data array and a 2d array containing first coordinate included and excluded from each bin, and the first fragment included and excluded from each bin is returned. Otherwise only the data array is returned.
Returns:

Array in the format requested with ‘arraytype’ containing binned data pulled from ‘data_array’, or, if ‘returnmapping’ is True, a list containing the binned data array and the mapping array.
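
The rules for ‘start’, ‘stop’, and ‘binsize’ described above can be sketched as follows. This is an illustration of the documented behavior, not hifive's implementation: `default_bin_bounds` is a hypothetical helper, and the treatment of coordinates that fall exactly on a multiple of ‘binsize’ is an assumption.

```python
def default_bin_bounds(first_coord, last_coord, binsize, start=None, stop=None):
    """Return a list of (start, stop) bin bounds following the rules described
    for 'start' and 'stop' (illustrative helper, not a hifive function)."""
    if start is None:
        # First multiple of binsize at or below the first coordinate (assumption
        # for the exact-multiple case).
        start = (first_coord // binsize) * binsize
    if stop is None:
        # First multiple of binsize strictly after the last coordinate (assumption).
        stop = (last_coord // binsize + 1) * binsize
    # Adjust stop upward so that the last bin is complete.
    n_bins = -(-(stop - start) // binsize)  # ceiling division
    stop = start + n_bins * binsize
    return [(start + k * binsize, start + (k + 1) * binsize) for k in range(n_bins)]


# Data spanning 12500..47200 with 10 kb bins gives four bins from 10000 to 50000.
bounds = default_bin_bounds(12500, 47200, 10000)

# A user-supplied stop that leaves a partial bin is pushed upward.
adjusted = default_bin_bounds(12500, 47200, 10000, start=12000, stop=45000)
```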

hifive.fivec_binning.dynamically_bin_cis_array(unbinned, unbinnedpositions, binned, binbounds, minobservations=50, searchdistance=0, removefailed=True, **kwargs)

Expand bins in ‘binned’ to include additional data provided in ‘unbinned’ as necessary to meet the ‘minobservations’ or ‘searchdistance’ criteria.

Parameters:
  • unbinned (numpy array) – A full or upper array containing data to be binned. Array format will be determined from the number of dimensions.
  • unbinnedpositions (numpy array) – A 2d integer array indicating the first and last coordinate of each bin in ‘unbinned’ array.
  • binned (numpy array) – A full or upper array containing binned data to be dynamically binned. Array format will be determined from the number of dimensions. Data in this array will be altered by this function.
  • binbounds (numpy array) – An N x 2 integer array indicating the start and end position of each of N bins in ‘binned’ array.
  • minobservations (int.) – The fewest number of observed reads needed for a bin to be counted as valid and stop expanding.
  • searchdistance (int.) – The furthest distance from the bin midpoint to expand bounds. If this is set to zero, there is no limit on expansion distance.
  • removefailed (bool.) – If a non-zero ‘searchdistance’ is given, a bin may fail to meet the ‘minobservations’ criterion before the search stops. If this occurs and ‘removefailed’ is True, the observed and expected values for that bin are set to zero.
Returns:

None
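
The interplay of ‘minobservations’, ‘searchdistance’, and ‘removefailed’ can be illustrated with a simplified one-dimensional analog. This is only a sketch of the expansion rule as documented: `expand_bin` is a hypothetical helper, it works on a single bin along one axis rather than on a 2d interaction array, and it returns totals instead of altering an array in place.

```python
def expand_bin(positions, observed, expected, bounds,
               minobservations=50, searchdistance=0, removefailed=True):
    """Grow one bin around its midpoint until it holds enough observed reads.

    positions: coordinates of the unbinned data points.
    observed, expected: per-point values, aligned with positions.
    bounds: (start, stop) of the bin being expanded.
    Returns the (observed, expected) totals for the expanded bin.
    """
    mid = (bounds[0] + bounds[1]) / 2.0
    # Visit unbinned points from nearest to furthest from the bin midpoint.
    order = sorted(range(len(positions)), key=lambda k: abs(positions[k] - mid))
    obs_total = 0.0
    exp_total = 0.0
    for k in order:
        if obs_total >= minobservations:
            break  # criterion met, stop expanding
        if searchdistance > 0 and abs(positions[k] - mid) > searchdistance:
            break  # expansion limit reached
        obs_total += observed[k]
        exp_total += expected[k]
    if obs_total < minobservations and removefailed:
        return 0.0, 0.0  # the bin failed the criterion and is blanked
    return obs_total, exp_total


positions = [100, 200, 300, 400, 500]
observed = [10, 20, 5, 30, 40]
expected = [1.0] * 5

unlimited = expand_bin(positions, observed, expected, (250, 350))
limited = expand_bin(positions, observed, expected, (250, 350), searchdistance=50)
kept = expand_bin(positions, observed, expected, (250, 350),
                  searchdistance=50, removefailed=False)
```

In this toy case the unlimited search absorbs the three nearest points before reaching 50 reads, while the 50-unit limit leaves the bin short and it is either zeroed or left with its partial totals, depending on ‘removefailed’.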

hifive.fivec_binning.dynamically_bin_trans_array(unbinned, unbinnedpositions1, unbinnedpositions2, binned, binbounds1, binbounds2, minobservations=50, searchdistance=0, removefailed=True, **kwargs)

Expand bins in ‘binned’ to include additional data provided in ‘unbinned’ as necessary to meet the ‘minobservations’ or ‘searchdistance’ criteria.

Parameters:
  • unbinned (numpy array) – A full array containing data to be binned.
  • unbinnedpositions1 (numpy array) – A 2d integer array indicating the first and last coordinate of each bin in ‘unbinned’ array along the first axis.
  • unbinnedpositions2 (numpy array) – A 2d integer array indicating the first and last coordinate of each bin in ‘unbinned’ array along the second axis.
  • binned (numpy array) – A full array containing binned data to be dynamically binned. Data in this array will be altered by this function.
  • binbounds1 (numpy array) – An N x 2 integer array indicating the start and end position of each of N bins in ‘binned’ array along the first axis.
  • binbounds2 (numpy array) – An N x 2 integer array indicating the start and end position of each of N bins in ‘binned’ array along the second axis.
  • minobservations (int.) – The fewest number of observed reads needed for a bin to be counted as valid and stop expanding.
  • searchdistance (int.) – The furthest distance from the bin midpoint to expand bounds. If this is set to zero, there is no limit on expansion distance.
  • removefailed (bool.) – If a non-zero ‘searchdistance’ is given, a bin may fail to meet the ‘minobservations’ criterion before the search stops. If this occurs and ‘removefailed’ is True, the observed and expected values for that bin are set to zero.
Returns:

None

hifive.fivec_binning.find_cis_signal(fivec, region, binsize=0, binbounds=None, start=None, stop=None, startfrag=None, stopfrag=None, datatype='enrichment', arraytype='full', skipfiltered=False, returnmapping=False, **kwargs)

Create an array of format ‘arraytype’ and fill with data requested in ‘datatype’.

Parameters:
  • fivec (FiveC) – A FiveC class object containing fragment and count data.
  • region (int.) – The index of the region to pull data from.
  • binsize (int.) – This is the coordinate width of each bin. A value of zero indicates unbinned. If binbounds is not None, this value is ignored.
  • binbounds (numpy array) – An array containing start and stop coordinates for a set of user-defined bins. Any fragment not falling in a bin is ignored.
  • start (int.) – The smallest coordinate to include in the array, measured from fragment midpoints. If ‘binbounds’ is given, this value is ignored. If both ‘start’ and ‘startfrag’ are given, ‘start’ will override ‘startfrag’. If unspecified, this will be set to the midpoint of the first fragment for ‘region’, adjusted to the first multiple of ‘binsize’ if not zero. Optional.
  • stop (int.) – The largest coordinate to include in the array, measured from fragment midpoints. If ‘binbounds’ is given, this value is ignored. If both ‘stop’ and ‘stopfrag’ are given, ‘stop’ will override ‘stopfrag’. If unspecified, this will be set to the midpoint of the last fragment plus one for ‘region’, adjusted to the last multiple of ‘start’ + ‘binsize’ if not zero. Optional.
  • startfrag (int.) – The first fragment to include in the array. If ‘binbounds’ is given, this value is ignored. If unspecified and ‘start’ is not given, this is set to the first fragment in ‘region’. In cases where ‘start’ is specified and conflicts with ‘startfrag’, ‘start’ is given preference. Optional.
  • stopfrag (int.) – The first fragment not to include in the array. If ‘binbounds’ is given, this value is ignored. If unspecified and ‘stop’ is not given, this is set to the last fragment in ‘region’ plus one. In cases where ‘stop’ is specified and conflicts with ‘stopfrag’, ‘stop’ is given preference. Optional.
  • datatype (str.) – This specifies the type of data that is processed and returned. Options are ‘raw’, ‘distance’, ‘fragment’, ‘enrichment’, and ‘expected’. Observed values are always in the first index along the last axis, except when ‘datatype’ is ‘expected’. In this case, filter values replace counts. Conversely, if ‘raw’ is specified, unfiltered fragments return a value of one. Expected values are returned for ‘distance’, ‘fragment’, ‘enrichment’, and ‘expected’ values of ‘datatype’. ‘distance’ uses only the expected signal given distance for calculating the expected values, ‘fragment’ uses only fragment correction values, and both ‘enrichment’ and ‘expected’ use both correction and distance mean values. ‘enrichment’ also scales both observed and expected by the standard deviation, giving a completely normalized set of values.
  • arraytype (str.) – This determines the shape of the array in which data are returned. Acceptable values are ‘compact’ (though only when ‘binsize’ is zero), ‘full’, and ‘upper’. ‘compact’ means data are arranged in an N x M x 2 array where N and M are the number of forward and reverse probe fragments, respectively. ‘full’ returns a square, symmetric array of size N x N x 2 where N is the total number of fragments. ‘upper’ returns only the flattened upper triangle of a full array, excluding the diagonal, with size (N * (N - 1) / 2) x 2, where N is the total number of fragments.
  • skipfiltered (bool.) – If ‘True’, all interaction bins for filtered out fragments are removed and a reduced-size array is returned.
  • returnmapping (bool.) – If ‘True’, a list is returned containing the data array and either one or two 2d mapping arrays. Each mapping array contains the first coordinate included in and excluded from each bin, and the first fragment included in and excluded from each bin. For an upper array, one mapping array corresponding to both axes is returned. For a compact array, two are returned, corresponding to the first and second axis. Otherwise only the data array is returned.
Returns:

Array in format requested with ‘arraytype’ containing data requested with ‘datatype’.
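
The precedence between coordinate arguments (‘start’, ‘stop’) and fragment arguments (‘startfrag’, ‘stopfrag’) can be sketched as below. `resolve_fragment_range` is a hypothetical helper, not a hifive function, and it assumes that ‘stop’ behaves as an exclusive bound, in line with the documented default of the last fragment midpoint plus one.

```python
from bisect import bisect_left


def resolve_fragment_range(midpoints, start=None, stop=None,
                           startfrag=None, stopfrag=None):
    """Return the half-open fragment range [startfrag, stopfrag) to include.

    midpoints: sorted fragment midpoints for one region.
    Coordinates take precedence over fragment indices, as documented.
    """
    if start is not None:
        # First fragment whose midpoint is at or after 'start'.
        startfrag = bisect_left(midpoints, start)
    elif startfrag is None:
        startfrag = 0  # first fragment in the region
    if stop is not None:
        # Assumption: 'stop' is exclusive, so a fragment whose midpoint
        # equals 'stop' is left out.
        stopfrag = bisect_left(midpoints, stop)
    elif stopfrag is None:
        stopfrag = len(midpoints)  # last fragment in the region plus one
    return startfrag, stopfrag


mids = [1050, 2300, 4100, 5600, 7900]

everything = resolve_fragment_range(mids)
by_fragment = resolve_fragment_range(mids, startfrag=2, stopfrag=4)
# 'start' conflicts with 'startfrag' here, so 'start' wins.
overridden = resolve_fragment_range(mids, start=2000, startfrag=0)
```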

hifive.fivec_binning.find_trans_signal(fivec, region1, region2, binsize=0, binbounds1=None, start1=None, stop1=None, startfrag1=None, stopfrag1=None, binbounds2=None, start2=None, stop2=None, startfrag2=None, stopfrag2=None, datatype='enrichment', arraytype='compact', skipfiltered=False, returnmapping=False, **kwargs)

Create an array of format ‘arraytype’ and fill with data requested in ‘datatype’.

Parameters:
  • fivec (FiveC) – A FiveC class object containing fragment and count data.
  • region1 (int.) – The index of the first region to pull data from.
  • region2 (int.) – The index of the second region to pull data from.
  • binsize (int.) – This is the coordinate width of each bin. A value of zero indicates unbinned. If binbounds is not None, this value is ignored.
  • binbounds1 (numpy array) – An array containing start and stop coordinates for a set of user-defined bins for region1. Any fragment not falling in a bin is ignored.
  • start1 (int.) – The smallest coordinate to include in the array from ‘region1’, measured from fragment midpoints. If ‘binbounds1’ is given, this value is ignored. If both ‘start1’ and ‘startfrag1’ are given, ‘start1’ will override ‘startfrag1’. If unspecified, this will be set to the midpoint of the first fragment for ‘region1’, adjusted to the first multiple of ‘binsize’ if not zero. Optional.
  • stop1 (int.) – The largest coordinate to include in the array from ‘region1’, measured from fragment midpoints. If ‘binbounds1’ is given, this value is ignored. If both ‘stop1’ and ‘stopfrag1’ are given, ‘stop1’ will override ‘stopfrag1’. If unspecified, this will be set to the midpoint of the last fragment plus one for ‘region1’, adjusted to the last multiple of ‘start1’ + ‘binsize’ if not zero. Optional.
  • startfrag1 (int.) – The first fragment to include in the array from ‘region1’. If ‘binbounds1’ is given, this value is ignored. If unspecified and ‘start1’ is not given, this is set to the first fragment in ‘region1’. In cases where ‘start1’ is specified and conflicts with ‘startfrag1’, ‘start1’ is given preference. Optional.
  • stopfrag1 (int.) – The first fragment not to include in the array from ‘region1’. If ‘binbounds1’ is given, this value is ignored. If unspecified and ‘stop1’ is not given, this is set to the last fragment in ‘region1’ plus one. In cases where ‘stop1’ is specified and conflicts with ‘stopfrag1’, ‘stop1’ is given preference. Optional.
  • binbounds2 (numpy array) – An array containing start and stop coordinates for a set of user-defined bins for region2. Any fragment not falling in a bin is ignored.
  • start2 (int.) – The smallest coordinate to include in the array from ‘region2’, measured from fragment midpoints. If ‘binbounds2’ is given, this value is ignored. If both ‘start2’ and ‘startfrag2’ are given, ‘start2’ will override ‘startfrag2’. If unspecified, this will be set to the midpoint of the first fragment for ‘region2’, adjusted to the first multiple of ‘binsize’ if not zero. Optional.
  • stop2 (int.) – The largest coordinate to include in the array from ‘region2’, measured from fragment midpoints. If ‘binbounds2’ is given, this value is ignored. If both ‘stop2’ and ‘stopfrag2’ are given, ‘stop2’ will override ‘stopfrag2’. If unspecified, this will be set to the midpoint of the last fragment plus one for ‘region2’, adjusted to the last multiple of ‘start2’ + ‘binsize’ if not zero. Optional.
  • startfrag2 (int.) – The first fragment to include in the array from ‘region2’. If ‘binbounds2’ is given, this value is ignored. If unspecified and ‘start2’ is not given, this is set to the first fragment in ‘region2’. In cases where ‘start2’ is specified and conflicts with ‘startfrag2’, ‘start2’ is given preference. Optional.
  • stopfrag2 (int.) – The first fragment not to include in the array from ‘region2’. If ‘binbounds2’ is given, this value is ignored. If unspecified and ‘stop2’ is not given, this is set to the last fragment in ‘region2’ plus one. In cases where ‘stop2’ is specified and conflicts with ‘stopfrag2’, ‘stop2’ is given preference. Optional.
  • datatype (str.) – This specifies the type of data that is processed and returned. Options are ‘raw’, ‘distance’, ‘fragment’, ‘enrichment’, and ‘expected’. Observed values are always in the first index along the last axis, except when ‘datatype’ is ‘expected’. In this case, filter values replace counts. Conversely, if ‘raw’ is specified, unfiltered fragments return a value of one. Expected values are returned for ‘distance’, ‘fragment’, ‘enrichment’, and ‘expected’ values of ‘datatype’. ‘distance’ uses only the expected signal given distance for calculating the expected values, ‘fragment’ uses only fragment correction values, and both ‘enrichment’ and ‘expected’ use both correction and distance mean values. ‘enrichment’ also scales both observed and expected by the standard deviation, giving a completely normalized set of values.
  • arraytype (str.) – This determines the shape of the array in which data are returned. Acceptable values are ‘compact’ (though only when ‘binsize’ is zero) and ‘full’. ‘compact’ means data are arranged in an N x M x 2 array where N and M are the number of forward and reverse probe fragments, respectively. This will only return the array of forward primers from ‘region1’ and reverse primers from ‘region2’. ‘full’ returns a square, symmetric array of size N x N x 2 where N is the total number of fragments.
  • skipfiltered (bool.) – If ‘True’, all interaction bins for filtered out fragments are removed and a reduced-size array is returned.
  • returnmapping (bool.) – If ‘True’, a list is returned containing the data array and either one or four 2d mapping arrays. Each mapping array contains the first coordinate included in and excluded from each bin, and the first fragment included in and excluded from each bin. For a full array, one mapping array corresponding to both axes is returned. For a compact array, four are returned: the first and second axis for ‘region1’ forward fragments by ‘region2’ reverse fragments, and the first and second axis for ‘region1’ reverse fragments by ‘region2’ forward fragments. Otherwise only the data array (or data arrays, if compact) is returned.
Returns:

Array in format requested with ‘arraytype’ containing data requested with ‘datatype’.
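
The effect of ‘skipfiltered’ on a compact array can be pictured as dropping the rows and columns that belong to filtered fragments. The sketch below assumes per-fragment filter flags (1 = valid, 0 = filtered) are available for each axis. `drop_filtered` is a hypothetical helper, and nested lists stand in for numpy arrays.

```python
def drop_filtered(compact, row_valid, col_valid):
    """Remove rows and columns of an N x M x 2 nested list whose fragments
    are flagged as filtered (0) in row_valid or col_valid."""
    rows = [i for i, ok in enumerate(row_valid) if ok]
    cols = [j for j, ok in enumerate(col_valid) if ok]
    return [[list(compact[i][j]) for j in cols] for i in rows]


# Toy 3 x 4 x 2 compact array: observed = 10 * i + j, expected = 1.0.
compact = [[[float(10 * i + j), 1.0] for j in range(4)] for i in range(3)]

# Second forward fragment and third reverse fragment are filtered out.
reduced = drop_filtered(compact, row_valid=[1, 0, 1], col_valid=[1, 1, 0, 1])
```

The result is a reduced 2 x 3 x 2 array, so any mapping arrays requested with ‘returnmapping’ are needed to relate its rows and columns back to the original fragments.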

hifive.fivec_binning.write_heatmap_dict(fivec, filename, binsize, includetrans=True, datatype='enrichment', regions=[], arraytype='full', dynamically_binned=False, minobservations=0, searchdistance=0, expansion_binsize=0, removefailed=False, **kwargs)

Create an h5dict file containing binned interaction arrays, bin positions, and an index of included regions.

Parameters:
  • fivec (FiveC) – A FiveC class object containing fragment and count data.
  • filename (str.) – Location to write h5dict object to.
  • binsize (int.) – Size of bins for interaction arrays. If ‘binsize’ is zero, fragment interactions are returned without binning.
  • includetrans (bool.) – Indicates whether trans interaction arrays should be calculated and saved.
  • datatype (str.) – This specifies the type of data that is processed and returned. Options are ‘raw’, ‘distance’, ‘fragment’, ‘enrichment’, and ‘expected’. Observed values are always in the first index along the last axis, except when ‘datatype’ is ‘expected’. In this case, filter values replace counts. Conversely, if ‘raw’ is specified, non-filtered bins return a value of 1. Expected values are returned for ‘distance’, ‘fragment’, ‘enrichment’, and ‘expected’ values of ‘datatype’. ‘distance’ uses only the expected signal given distance for calculating the expected values, ‘fragment’ uses only fragment correction values, and both ‘enrichment’ and ‘expected’ use both correction and distance mean values.
  • arraytype (str.) – This determines what shape of array data are returned in if unbinned heatmaps are requested. Acceptable values are ‘compact’ and ‘full’. ‘compact’ means data are arranged in a N x M array where N is the number of bins, M is the maximum number of steps between included bin pairs, and data are stored such that bin n,m contains the interaction values between n and n + m + 1. ‘full’ returns a square, symmetric array of size N x N.
  • regions (list.) – If given, indicates which regions should be included. If left empty, all regions are included.
  • dynamically_binned (bool.) – If ‘True’, return dynamically binned data.
  • minobservations (int.) – The fewest number of observed reads needed for a bin to be counted as valid and stop expanding.
  • searchdistance (int.) – The furthest distance from the bin midpoint to expand bounds. If this is set to zero, there is no limit on expansion distance.
  • expansion_binsize (int.) – The size of bins to use for data to pull from when expanding dynamic bins. If set to zero, unbinned data is used.
  • removefailed (bool.) – If a non-zero ‘searchdistance’ is given, a bin may fail to meet the ‘minobservations’ criterion before the search stops. If this occurs and ‘removefailed’ is True, the observed and expected values for that bin are set to zero.
Returns:

None
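
The ‘compact’ layout described for ‘arraytype’ above, where entry n,m holds the interaction between bin n and bin n + m + 1, can be expanded into the ‘full’ layout as sketched here. `compact_to_full` is a hypothetical helper operating on a single value layer, not a hifive function, and it assumes that entries pointing past the last bin are padding.

```python
def compact_to_full(compact):
    """Expand an N x M compact layer into an N x N symmetric layer.

    compact[n][m] is taken to hold the value for bins n and n + m + 1.
    """
    n_bins = len(compact)
    full = [[0.0] * n_bins for _ in range(n_bins)]
    for i, row in enumerate(compact):
        for m, value in enumerate(row):
            j = i + m + 1
            if j < n_bins:  # entries beyond the last bin are padding
                full[i][j] = value
                full[j][i] = value  # mirror across the diagonal
    return full


# Four bins, interactions stored for up to two steps apart.
compact = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0], [0.0, 0.0]]
full = compact_to_full(compact)
```

Pairs more than M steps apart are not stored in the compact layout, so they come out as zero in the expanded array.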