pybigwig 安装和使用

pybigwig 安装和使用

pyBigWig是用C编写的调用libBigWig库的python扩展,可以快速访问和处理bigBed和bigWig文件。

 

安装

pip install pybigwig

 

调用

import pyBigWig

读取和写入文件

pyBigWig.open

#本地文件
bw = pyBigWig.open("test/test.bw")
#远程文件
bb = pyBigWig.open("https://www.encodeproject.org/files/ENCFF001JBR/@@download/ENCFF001JBR.bigBed")
#写入文件
bw = pyBigWig.open("test/output.bw", "w")

 

查看文件类型

>>> bw = pyBigWig.open("test/test.bw")
>>> bw.isBigWig()
True
>>> bw.isBigBed()
False

bigwig.stats

 

# 查看染色体和长度

>>> bw.chroms()
dict_proxy({'1': 195471971L, '10': 130694993L})
>>> bw.chroms("1")
195471971L


# 查看header


>>> bw.header()
{'maxVal': 2L, 'sumData': 272L, 'minVal': 0L, 'version': 4L, 'sumSquared': 500L, 'nLevels': 1L, 'nBasesCovered': 154L}

 


# 查看一段位置的总结

min,coverage,std
#输出范围的均值

Typically we want to quickly access the average value over a range, which is very simple:
>>> bw.stats("1", 0, 3)
[0.2000000054637591]

 

Suppose instead of the mean value, we instead wanted the maximum value:

>>> bw.stats("1", 0, 3, type="max")
[0.30000001192092896]

 

Other options are "min" (the minimum value), "coverage" (the fraction of bases covered), and "std" (the standard deviation of the values).

 

It's often the case that we would instead like to compute values of some number of evenly spaced bins in a given interval, which is also simple:
>>> bw.stats("1",99, 200, type="max", nBins=2)
[1.399999976158142, 1.5]

nBins defaults to 1, just as type defaults to mean.

 

If the start and end positions are omitted then the entire chromosome is used:
#未设置起始终止位置,输出整条染色体信息
>>> bw.stats("1")
[1.3351851569281683]

A note on statistics and zoom levels

>>> import pyBigWig
>>> from numpy import mean
>>> bw = pyBigWig.open("http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeCrgMapabilityAlign75mer.bigWig")
>>> bw.stats('chr1', 89294, 91629)
[0.20120902053804418]
>>> mean(bw.values('chr1', 89294, 91629))
0.22213841940688142
>>> bw.stats('chr1', 89294, 91629, exact=True)
[0.22213841940688142]


# 一段位置内单碱基的值
While the stats() method can be used to retrieve the original values for each base (e.g., by setting nBins to the number of bases), it's preferable to instead use the values() accessor.

使用stats() nBins设置为位置范围内碱基总数也可以实现此功能。
>>> bw.values("1", 0, 3)
[0.10000000149011612, 0.20000000298023224, 0.30000001192092896]

 

The list produced will always contain one value for every base in the range specified. If a particular base has no associated value in the bigWig file then the returned value will be nan.
>>> bw.values("1", 0, 4)
[0.10000000149011612, 0.20000000298023224, 0.30000001192092896, nan]


# Retrieve all intervals in a range

Sometimes it's convenient to retrieve all entries overlapping some range. This can be done with the intervals() function:
>>> bw.intervals("1", 0, 3)
((0, 1, 0.10000000149011612), (1, 2, 0.20000000298023224), (2, 3, 0.30000001192092896))


#返回值组成为起始位置,终止位置,值
# Retrieving bigBed entries

As opposed to bigWig files, bigBed files hold entries, which are intervals with an associated string. You can access these entries using the entries() function:
>>> bb = pyBigWig.open("https://www.encodeproject.org/files/ENCFF001JBR/@@download/ENCFF001JBR.bigBed")
>>> bb.entries('chr1', 10000000, 10020000)
[(10009333, 10009640, '61035\t130\t-\t0.026\t0.42\t404'), (10014007, 10014289, '61047\t136\t-\t0.029\t0.42\t404'), (10014373, 10024307, '61048\t630\t-\t5.420\t0.00\t2672399')]
#返回起始位置,终止位置,其他信息组成的string

 

The output is a list of entry tuples. The tuple elements are the start and end position of each entry, followed by its associated string. The string is returned exactly as it's held in the bigBed file, so parsing it is left to you. To determine what the various fields are in these string, consult the SQL string:
>>> bb.SQL()
table RnaElements
"BED6 + 3 scores for RNA Elements data"
    (
    string chrom;      "Reference sequence chromosome or scaffold"
    uint   chromStart; "Start position in chromosome"
    uint   chromEnd;   "End position in chromosome"
    string name;       "Name of item"
    uint   score;      "Normalized score from 0-1000"
    char[1] strand;    "+ or - or . for unknown"
    float level;       "Expression level such as RPKM or FPKM. Set to -1 for no data."
    float signif;      "Statistical significance such as IDR. Set to -1 for no data."
    uint score2;       "Additional measurement/count e.g. number of reads. Set to 0 for no data."
    )

 


# Add a header to a bigWig file

If you've opened a file for writing then you'll need to give it a header before you can add any entries. The header contains all of the chromosomes, in order, and their sizes. If your genome has two chromosomes, chr1 and chr2, of lengths 1 and 1.5 million bases, then the following would add an appropriate header:
>>> bw.addHeader([("chr1", 1000000), ("chr2", 1500000)])
>>> bw.addHeader([("chr1", 1000000), ("chr2", 1500000)], maxZooms=0)


#Adding entries to a bigWig file

Assuming you've opened a file for writing and added a header, you can then add entries. Note that the entries must be added in order, as bigWig files always contain ordered intervals. There are three formats that bigWig files can use internally to store entries. The most commonly observed format is identical to a bedGraph file:
>>> bw.addEntries(["chr1", "chr1", "chr1"], [0, 100, 125], ends=[5, 120, 126], values=[0.0, 1.0, 200.0])
>>> bw.addEntries("chr1", [500, 600, 635], values=[-2.0, 150.0, 25.0], span=20)
>>> bw.addEntries("chr1", 900, values=[-5.0, -20.0, 25.0], span=20, step=30)
>>> bw.addEntries(["chr1", "chr1", "chr1"], [100, 0, 125], ends=[120, 5, 126], values=[0.0, 1.0, 200.0], validate=False)

 


# Close a bigWig or bigBed file

A file can be closed with a simple bw.close(), as is commonly done with other file types. For files opened for writing, closing a file writes any buffered entries to disk, constructs and writes the file index, and constructs zoom levels. Consequently, this can take a bit of time.
bw.close()
# pyBigWig支持Numpy数据类型输入

    检查是否当前版本pyBigWig是否支持Numpy数据类型输入

>>> import pyBigWig
>>> pyBigWig.numpy
1
# 1表示支持
>>> import pyBigWig
>>> import numpy
>>> bw = pyBigWig.open("/tmp/delete.bw", "w")
>>> bw.addHeader([("1", 1000)], maxZooms=0)
>>> chroms = np.array(["1"] * 10)
>>> starts = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90], dtype=np.int64)
>>> ends = np.array([5, 15, 25, 35, 45, 55, 65, 75, 85, 95], dtype=np.int64)
>>> values0 = np.array(np.random.random_sample(10), dtype=np.float64)
>>> bw.addEntries(chroms, starts, ends=ends, values=values0)
>>> bw.close()

    values()直接返回numpy 对象

>>> bw = bw.open("/tmp/delete.bw")
>>> bw.values('1', 0, 10, numpy=True)
[ 0.74336642  0.74336642  0.74336642  0.74336642  0.74336642         nan
     nan         nan         nan         nan]
>>> type(bw.values('1', 0, 10, numpy=True))
<type 'numpy.ndarray'>



query(grangestat='max'num_bins=Noneexact=True)[source]

Signature rework/wrapper around pyBigWig’s stats() and intervals().

Parameters
  • grange (Dict[strAny]) –

    The genomic range to query. Should have at least the following structure:

    {
        'chrom': str,
        'start': int,
        'end': int
    }
    

     

  • num_bins (Optional[int]) – Pass an integer to split grange into num_bins bins of equal width, and return a summary statistic for each bin. Pass None to return all bigwig features in grange without binning.

  • stat (str) – The summary statistic to use if num_bins is not None.

  • exact (bool) – Pass True to ignore bigwig zoom levels when computing summary statistics and return the exact answer instead.

Returns

A list of bed features with ‘value’ keys representing the results of the query.

Return type

List[Dict[str, Any]]

 

ds = bwf.stats(gr.chrom, gr.initial, gr.final, type="mean", nBins=steps)
tmp=bw.stats(str(region.chrom), region.start,region.stop)

 


REF
链接:https://www.jianshu.com/p/d12756705759

https://github.com/deeptools/pyBigWig

https://github.com/deeptools/pyBigWig/blob/master/libBigWig/README.md

https://lib5c.readthedocs.io/en/stable/lib5c.contrib.pybigwig.bigwig/

 

posted @ 2023-10-10 19:57  emanlee  阅读(319)  评论(0编辑  收藏  举报