The HiCData class

A class for handling HiC read data.

class hifive.hic_data.HiCData(filename, mode='r', silent=False)

This class handles interaction count data for HiC experiments.

This class stores mapped paired-end reads, indexing them by fragment-end (fend) number, in an h5dict.

Note

This class is also available as hifive.HiCData

When initialized, this class creates an h5dict in which to store all data associated with this object.

Parameters:
  • filename (str.) – The file name of the h5dict. This should end with the suffix ‘.hdf5’
  • mode (str.) – The mode to open the h5dict with. This should be ‘w’ for creating or overwriting an h5dict with the name given in filename.
  • silent (bool.) – Indicates whether to print information about function execution for this object.
Returns:

HiCData class object.

Attributes:
  • file (str.) - A string containing the name of the file passed during object creation for saving the object to.
  • silent (bool.) - A boolean indicating whether to suppress all of the output messages.
  • history (str.) - A string containing all of the commands executed on this object and their outcomes.
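
As a rough illustration of how the constructor and the methods documented on this page fit together, a session might look like the sketch below. It uses only signatures listed here; the function name, the file names passed to it, and the maxinsert value of 500 are placeholders rather than recommendations. The hifive import is deferred so the function can be defined without the package installed.

```python
def build_hic_dataset(fend_fname, raw_fnames, out_fname, maxinsert=500):
    """Create a new HiCData h5dict from raw mapped-read text files.

    Hypothetical helper: all arguments are illustrative and
    maxinsert=500 is an arbitrary placeholder value.
    """
    import hifive  # deferred: only required when the function is called

    # mode='w' creates (or overwrites) the h5dict named by out_fname
    data = hifive.HiCData(out_fname, mode='w', silent=True)
    # Associate the fend file and read the tab-separated read pairs
    data.load_data_from_raw(fend_fname, raw_fnames, maxinsert)
    # Write the object's parameters to the h5dict
    data.save()
    return data
```

An existing h5dict would instead be opened with the default mode and populated with load().
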
export_to_mat(outfilename)

Write reads loaded in data object to text file in HiCPipe-compatible ‘mat’ format.

Parameters: outfilename (str.) – Specifies the file to save data in.
Returns: None
load()

Load data from h5dict specified at object creation.

Any call of this function will overwrite current object data with values from the last save() call.

Returns: None
load_binned_data_from_matrices(fendfilename, filename, format=None)

Read interaction counts from a tab-separated set of matrix files, one per chromosome, and place in h5dict.

Each file is assumed to contain a complete matrix of integers divided into equal-width bins. If row and column names are present, the bin ranges will be taken from the labels in the format “XXX|XXX|chrX:XXX-XXX”, where only the block of text following the last “|” is looked at. If no labels are present, bins are assumed to begin at coordinate zero.
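
The label convention described above can be sketched as follows. This is not HiFive's own parser, only a minimal illustration of reading the block after the last “|”; the function name and the sample label are made up for the example.

```python
def parse_bin_label(label):
    """Return (chromosome, start, stop) from a label such as 'a|b|chr1:0-10000'."""
    # Only the text after the last '|' is examined
    region = label.rsplit('|', 1)[-1]
    # Split on the last ':' to separate the chromosome from the range
    chrom, span = region.rsplit(':', 1)
    start, stop = span.split('-')
    return chrom, int(start), int(stop)


print(parse_bin_label('HiC|mm9|chr1:3000000-3010000'))
# ('chr1', 3000000, 3010000)
```
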

Parameters:
  • fendfilename (str.) – This specifies the file name of the Fend object to associate with the dataset.
  • filename (str.) – The file containing the data matrices. If data are in individual text files, this should be a filename template with an ‘*’ in place of the chromosome(s) names. Each chromosome and chromosome pair in the fend file will be checked and loaded if present. If format is not passed, the format of the matrices will be inferred from the filename (a ‘*’ will default to ‘txt’, otherwise the filename extension will be used). If data are in hdf5 or npz format, the individual matrices should either be named with the chromosome name or ‘N.counts’ where N is the chromosome name. For inter-chromosomal interactions, names should be ‘N_by_M’ for chromosomes N and M.
  • format (str.) – The format of the file(s) to load data from.
Returns:

None

Attributes:
  • fendfilename (str.) - A string containing the relative path of the fend file.
  • cis_data (ndarray) - A numpy array of type int32 and shape N x 3 where N is the number of valid non-zero intra-chromosomal bin pairings observed in the data. The first column contains the bin index (from the ‘bins’ array in the fend object) of the upstream bin, the second column contains the index of the downstream bin, and the third column contains the number of reads observed for that bin pair.
  • cis_indices (ndarray) - A numpy array of type int64 and a length of the number of bins + 1. Each position contains the first entry for the correspondingly-indexed bin in the first column of ‘cis_data’. For example, all of the downstream cis interactions for the bin at index 5 in the fend object ‘bins’ array are in cis_data[cis_indices[5]:cis_indices[6], :].
  • trans_data (ndarray) - A numpy array of type int32 and shape N x 3 where N is the number of valid non-zero inter-chromosomal bin pairings observed in the data. The first column contains the bin index (from the ‘bins’ array in the fend object) of the upstream bin (upstream also refers to the lower indexed chromosome in this context), the second column contains the index of the downstream bin, and the third column contains the number of reads observed for that bin pair.
  • trans_indices (ndarray) - A numpy array of type int64 and a length of the number of bins + 1. Each position contains the first entry for the correspondingly-indexed bin in the first column of ‘trans_data’. For example, all of the downstream trans interactions for the bin at index 5 in the fend object ‘bins’ array are in trans_data[trans_indices[5]:trans_indices[6], :].
  • fends (ndarray) - A filestream to the hdf5 fend file such that all saved bin attributes can be accessed through this class attribute.
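
The relationship between a data array and its index array can be illustrated with a toy example. The sketch below uses plain Python lists in place of numpy arrays and invented counts; build_indices is a hypothetical helper showing one way such an index could be derived, not HiFive's implementation.

```python
def build_indices(data, num_bins):
    """Return a list of length num_bins + 1 giving the first row for each bin.

    `data` holds (upstream, downstream, count) rows sorted by upstream index.
    """
    indices = [0] * (num_bins + 1)
    # Count rows per upstream bin, stored one slot to the right
    for upstream, _, _ in data:
        indices[upstream + 1] += 1
    # A running sum turns the per-bin counts into start offsets
    for i in range(1, num_bins + 1):
        indices[i] += indices[i - 1]
    return indices


# Toy data: five observed bin pairs among eight bins
cis_data = [(0, 1, 4), (0, 3, 1), (2, 3, 7), (5, 6, 2), (5, 7, 9)]
cis_indices = build_indices(cis_data, num_bins=8)

# All downstream interactions for the bin at index 5
rows_for_bin_5 = cis_data[cis_indices[5]:cis_indices[6]]
print(rows_for_bin_5)
# [(5, 6, 2), (5, 7, 9)]
```

With numpy arrays the same lookup is written cis_data[cis_indices[5]:cis_indices[6], :], and a bin with no downstream interactions yields an empty slice.
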

When data is loaded the ‘history’ attribute is updated to include the history of the fend file that becomes associated with it.

load_data_from_bam(fendfilename, filelist, maxinsert, skip_duplicate_filtering=False)

Read interaction counts from one or more pairs of BAM-formatted alignment files and place in h5dict.

Parameters:
  • fendfilename (str.) – This specifies the file name of the Fend object to associate with the dataset.
  • filelist (list) – A list of mapped sequencing runs. Each run should be a list of the first and second read-end BAM files, e.g. [[run1_1, run1_2], [run2_1, run2_2], ...]. If only one pair of files is needed, the list may contain just the two file path strings.
  • maxinsert (int.) – A cutoff for filtering paired end reads whose total distance to their respective restriction sites exceeds this value.
  • skip_duplicate_filtering (bool.) – Do not remove PCR duplicates. This allows much lower memory requirements, since files can be processed in chunks.
Returns:

None

Attributes:
  • fendfilename (str.) - A string containing the relative path of the fend file.
  • cis_data (ndarray) - A numpy array of type int32 and shape N x 3 where N is the number of valid non-zero intra-chromosomal fend pairings observed in the data. The first column contains the fend index (from the ‘fends’ array in the fend object) of the upstream fend, the second column contains the index of the downstream fend, and the third column contains the number of reads observed for that fend pair.
  • cis_indices (ndarray) - A numpy array of type int64 and a length of the number of fends + 1. Each position contains the first entry for the correspondingly-indexed fend in the first column of ‘cis_data’. For example, all of the downstream cis interactions for the fend at index 5 in the fend object ‘fends’ array are in cis_data[cis_indices[5]:cis_indices[6], :].
  • trans_data (ndarray) - A numpy array of type int32 and shape N x 3 where N is the number of valid non-zero inter-chromosomal fend pairings observed in the data. The first column contains the fend index (from the ‘fends’ array in the fend object) of the upstream fend (upstream also refers to the lower indexed chromosome in this context), the second column contains the index of the downstream fend, and the third column contains the number of reads observed for that fend pair.
  • trans_indices (ndarray) - A numpy array of type int64 and a length of the number of fends + 1. Each position contains the first entry for the correspondingly-indexed fend in the first column of ‘trans_data’. For example, all of the downstream trans interactions for the fend at index 5 in the fend object ‘fends’ array are in trans_data[trans_indices[5]:trans_indices[6], :].
  • fends (ndarray) - A filestream to the hdf5 fend file such that all saved fend attributes can be accessed through this class attribute.
  • maxinsert (int.) - An integer denoting the maximum included sum of the distances between both read ends and their downstream RE sites.

When data is loaded the ‘history’ attribute is updated to include the history of the fend file that becomes associated with it.
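
The two accepted shapes of the filelist argument for load_data_from_bam (a list of two-file runs, or a single flat pair of paths) can be reconciled as in the sketch below. The helper name and file names are hypothetical, and this is only an assumption about how such input could be normalized, not a description of HiFive's internals.

```python
def normalize_bam_filelist(filelist):
    """Return filelist as a list of [first_end, second_end] pairs."""
    # A flat pair of path strings is treated as a single run
    if len(filelist) == 2 and all(isinstance(f, str) for f in filelist):
        return [list(filelist)]
    runs = []
    for run in filelist:
        # Every run must supply exactly two files, one per read end
        if isinstance(run, str) or len(run) != 2:
            raise ValueError('each run needs exactly two BAM files: %r' % (run,))
        runs.append(list(run))
    return runs


print(normalize_bam_filelist(['run1_1.bam', 'run1_2.bam']))
# [['run1_1.bam', 'run1_2.bam']]
```
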

load_data_from_mat(fendfilename, filename)

Read interaction counts from a HiCPipe-compatible ‘mat’ text file and place in h5dict.

Parameters:
  • fendfilename (str.) – This specifies the file name of the Fend object to associate with the dataset.
  • filename (str.) – File name of a ‘mat’ file containing fend pair and interaction count data.
Returns:

None

Attributes:
  • fendfilename (str.) - A string containing the relative path of the fend file.
  • cis_data (ndarray) - A numpy array of type int32 and shape N x 3 where N is the number of valid non-zero intra-chromosomal fend pairings observed in the data. The first column contains the fend index (from the ‘fends’ array in the fend object) of the upstream fend, the second column contains the index of the downstream fend, and the third column contains the number of reads observed for that fend pair.
  • cis_indices (ndarray) - A numpy array of type int64 and a length of the number of fends + 1. Each position contains the first entry for the correspondingly-indexed fend in the first column of ‘cis_data’. For example, all of the downstream cis interactions for the fend at index 5 in the fend object ‘fends’ array are in cis_data[cis_indices[5]:cis_indices[6], :].
  • trans_data (ndarray) - A numpy array of type int32 and shape N x 3 where N is the number of valid non-zero inter-chromosomal fend pairings observed in the data. The first column contains the fend index (from the ‘fends’ array in the fend object) of the upstream fend (upstream also refers to the lower indexed chromosome in this context), the second column contains the index of the downstream fend, and the third column contains the number of reads observed for that fend pair.
  • trans_indices (ndarray) - A numpy array of type int64 and a length of the number of fends + 1. Each position contains the first entry for the correspondingly-indexed fend in the first column of ‘trans_data’. For example, all of the downstream trans interactions for the fend at index 5 in the fend object ‘fends’ array are in trans_data[trans_indices[5]:trans_indices[6], :].
  • fends (ndarray) - A filestream to the hdf5 fend file such that all saved fend attributes can be accessed through this class attribute.

When data is loaded the ‘history’ attribute is updated to include the history of the fend file that becomes associated with it.

load_data_from_raw(fendfilename, filelist, maxinsert, skip_duplicate_filtering=False)

Read interaction counts from one or more text files and place in h5dict.

Files should contain both mapped ends of a read, one read per line, separated by tabs. Each line should be in the following format:

chromosome1    coordinate1  strand1   chromosome2    coordinate2  strand2

where strands are given by the characters ‘+’ and ‘-’.
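
A line in this format can be read as in the sketch below. The ReadPair type and parse_raw_line function are hypothetical names introduced for illustration, and the sample coordinates are invented.

```python
from collections import namedtuple

ReadPair = namedtuple('ReadPair', 'chrom1 coord1 strand1 chrom2 coord2 strand2')


def parse_raw_line(line):
    """Parse one tab-separated line describing both mapped ends of a read."""
    fields = line.rstrip('\n').split('\t')
    if len(fields) != 6:
        raise ValueError('expected 6 tab-separated fields, got %d' % len(fields))
    chrom1, coord1, strand1, chrom2, coord2, strand2 = fields
    # Strands must be given as '+' or '-'
    for strand in (strand1, strand2):
        if strand not in ('+', '-'):
            raise ValueError('invalid strand: %r' % strand)
    return ReadPair(chrom1, int(coord1), strand1, chrom2, int(coord2), strand2)


pair = parse_raw_line('chr1\t3001234\t+\tchr1\t3456789\t-\n')
print(pair.chrom1, pair.coord1, pair.strand1, pair.chrom2, pair.coord2, pair.strand2)
# chr1 3001234 + chr1 3456789 -
```
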

Parameters:
  • fendfilename (str.) – This specifies the file name of the Fend object to associate with the dataset.
  • filelist (list) – A list containing all of the file names of mapped read text files to be included in the dataset. If only one file is needed, this may be passed as a string.
  • maxinsert (int.) – A cutoff for filtering paired end reads whose total distance to their respective restriction sites exceeds this value. If data was produced without a restriction enzyme (fend object has no fend data, only bin data), this integer specifies the maximum intra-chromosomal insert size for which strandedness is considered during filtering. Fragments below the maxinsert size are only kept if they occur on the same orientation strand. This filtering is skipped if maxinsert is None.
  • skip_duplicate_filtering (bool.) – Do not remove PCR duplicates. This allows much lower memory requirements, since files can be processed in chunks.
Returns:

None

Attributes:
  • fendfilename (str.) - A string containing the relative path of the fend file.
  • cis_data (ndarray) - A numpy array of type int32 and shape N x 3 where N is the number of valid non-zero intra-chromosomal fend pairings observed in the data. The first column contains the fend index (from the ‘fends’ array in the fend object) of the upstream fend, the second column contains the index of the downstream fend, and the third column contains the number of reads observed for that fend pair.
  • cis_indices (ndarray) - A numpy array of type int64 and a length of the number of fends + 1. Each position contains the first entry for the correspondingly-indexed fend in the first column of ‘cis_data’. For example, all of the downstream cis interactions for the fend at index 5 in the fend object ‘fends’ array are in cis_data[cis_indices[5]:cis_indices[6], :].
  • trans_data (ndarray) - A numpy array of type int32 and shape N x 3 where N is the number of valid non-zero inter-chromosomal fend pairings observed in the data. The first column contains the fend index (from the ‘fends’ array in the fend object) of the upstream fend (upstream also refers to the lower indexed chromosome in this context), the second column contains the index of the downstream fend, and the third column contains the number of reads observed for that fend pair.
  • trans_indices (ndarray) - A numpy array of type int64 and a length of the number of fends + 1. Each position contains the first entry for the correspondingly-indexed fend in the first column of ‘trans_data’. For example, all of the downstream trans interactions for the fend at index 5 in the fend object ‘fends’ array are in trans_data[trans_indices[5]:trans_indices[6], :].
  • fends (ndarray) - A filestream to the hdf5 fend file such that all saved fend attributes can be accessed through this class attribute.
  • maxinsert (int.) - An integer denoting the maximum included sum of the distances between both read ends and their downstream RE sites.

When data is loaded the ‘history’ attribute is updated to include the history of the fend file that becomes associated with it.
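
For data produced without a restriction enzyme, the strand-based use of maxinsert described in the parameter list for load_data_from_raw can be pictured with the sketch below. It is an interpretation, not HiFive's code: the function name is hypothetical, the insert size is assumed to be the absolute difference between the two coordinates, and "below" is read as strictly less than.

```python
def keep_pair(chrom1, coord1, strand1, chrom2, coord2, strand2, maxinsert):
    """Decide whether a read pair survives the strand-based insert filter."""
    # Filtering is skipped entirely when maxinsert is None
    if maxinsert is None:
        return True
    # Only intra-chromosomal pairs are subject to the filter
    if chrom1 != chrom2:
        return True
    # Assumption: insert size is the absolute coordinate difference
    if abs(coord2 - coord1) >= maxinsert:
        return True
    # Short inserts are kept only when both ends share an orientation
    return strand1 == strand2


print(keep_pair('chr1', 1000, '+', 'chr1', 1200, '-', maxinsert=500))
# False
print(keep_pair('chr1', 1000, '+', 'chr1', 1200, '+', maxinsert=500))
# True
```
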

save()

Save analysis parameters to h5dict.

Returns: None