K2 / lhotse: analysis of data loading and the training flow
class K2SpeechRecognitionDataset(torch.utils.data.Dataset):
The PyTorch Dataset for the speech recognition task using k2 library.
This dataset expects to be queried with lists of cut IDs,
for which it loads features and automatically collates/batches them.
To use it with a PyTorch DataLoader, set ``batch_size=None``
and provide a :class:`SimpleCutSampler` sampler.
Each item in this dataset is a dict of:
.. code-block::
{
    'inputs': float tensor with shape determined by :attr:`input_strategy`:
              - single-channel:
                - features: (B, T, F)
                - audio: (B, T)
              - multi-channel: currently not supported
    'supervisions': [
        {
            'sequence_idx': Tensor[int] of shape (S,)
            'text': List[str] of len S
            # For feature input strategies
            'start_frame': Tensor[int] of shape (S,)
            'num_frames': Tensor[int] of shape (S,)
            # For audio input strategies
            'start_sample': Tensor[int] of shape (S,)
            'num_samples': Tensor[int] of shape (S,)
            # Optionally, when return_cuts=True
            'cut': List[AnyCut] of len S
        }
    ]
}
Dimension symbols legend:
* ``B`` - batch size (number of Cuts)
* ``S`` - number of supervision segments (greater or equal to B, as each Cut may have multiple supervisions)
* ``T`` - number of frames of the longest Cut
* ``F`` - number of features
<details>
The 'sequence_idx' field is the index of the Cut used to create the example in the Dataset.
def __getitem__(self, cuts: CutSet) -> Dict[str, Union[torch.Tensor, List[str]]]:
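To make the relationship between `B`, `S` and `sequence_idx` concrete, here is a minimal pure-Python sketch. It is not lhotse's collation code: `ToySupervision` and `collate_supervisions` are hypothetical names, the frame numbers are made up, and plain lists stand in for the tensors that the real dataset returns.

```python
from typing import Dict, List, NamedTuple


class ToySupervision(NamedTuple):
    """Hypothetical stand-in for one supervision segment inside a cut."""
    text: str
    start_frame: int
    num_frames: int


def collate_supervisions(cuts: List[List[ToySupervision]]) -> Dict[str, list]:
    """Flatten per-cut supervisions into parallel lists of length S."""
    batch: Dict[str, list] = {
        "sequence_idx": [], "text": [], "start_frame": [], "num_frames": [],
    }
    for cut_idx, supervisions in enumerate(cuts):
        for sup in supervisions:
            # sequence_idx records which cut (row of `inputs`) the segment belongs to.
            batch["sequence_idx"].append(cut_idx)
            batch["text"].append(sup.text)
            batch["start_frame"].append(sup.start_frame)
            batch["num_frames"].append(sup.num_frames)
    return batch


# B = 2 cuts, S = 3 supervisions: the second cut holds two segments.
toy_cuts = [
    [ToySupervision("hey matt", 0, 120)],
    [ToySupervision("yes", 10, 40), ToySupervision("oh nothing", 90, 150)],
]
supervisions = collate_supervisions(toy_cuts)
print(supervisions["sequence_idx"])
```

Because a cut may carry several supervisions, `S >= B`, and `sequence_idx` is what lets a loss function map each segment back to its row in `inputs`.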
CutSet is defined as follows
class CutSet(Serializable, AlgorithmMixin):
CutSet -> MonoCut -> DataCut (fields are accessed directly as attributes) -> Cut (call to_dict() to get a dict)
CutSet ties together all types of data -- audio, features and supervisions, and is suitable to represent
training/dev/test sets.
.. note::
:class:`~lhotse.cut.CutSet` is the basic building block of PyTorch-style Datasets for speech/audio processing tasks.
When coming from Kaldi, there is really no good equivalent -- the closest concept may be Kaldi's "egs" for training
neural networks, which are chunks of feature matrices and corresponding alignments used respectively as inputs and
supervisions. :class:`~lhotse.cut.CutSet` is different because it provides you with all kinds of metadata,
and you can select just the interesting bits to feed them to your models.
CutSet initialization
Three different ways to cut the original data (alignment information is required)
:class:`~lhotse.cut.CutSet` can be created from any combination of :class:`~lhotse.audio.RecordingSet`,
:class:`~lhotse.supervision.SupervisionSet`, and :class:`~lhotse.features.base.FeatureSet`
with :meth:`lhotse.cut.CutSet.from_manifests`::
>>> from lhotse import CutSet
>>> cuts = CutSet.from_manifests(recordings=my_recording_set)
>>> cuts2 = CutSet.from_manifests(features=my_feature_set)
>>> cuts3 = CutSet.from_manifests(
... recordings=my_recording_set,
... features=my_feature_set,
... supervisions=my_supervision_set,
... )
When creating a :class:`.CutSet` with :meth:`.CutSet.from_manifests`, the resulting cuts will have the same duration
as the input recordings or features. For long recordings, it is not viable for training.
We provide several methods to transform the cuts into shorter ones.
Consider the following scenario::
                      Recording
    |-------------------------------------------|
    "Hey, Matt!"     "Yes?"         "Oh, nothing"
    |----------|     |----|         |-----------|
    .......... CutSet.from_manifests() ..........
                        Cut1
    |-------------------------------------------|
    ............. Example CutSet A ..............
        Cut1          Cut2              Cut3
    |----------|     |----|         |-----------|
    ............. Example CutSet B ..............
             Cut1                   Cut2
    |---------------------||--------------------|
    ............. Example CutSet C ..............
                Cut1         Cut2
                |---|      |------|
The CutSet's A, B and C can be created like::
>>> cuts_A = cuts.trim_to_supervisions()
>>> cuts_B = cuts.cut_into_windows(duration=5.0)
>>> cuts_C = cuts.trim_to_unsupervised_segments()
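The three cutting strategies boil down to interval arithmetic on the recording's timeline. The sketch below reproduces that arithmetic in plain Python under stated assumptions: the function names only mirror the lhotse methods, the timings are invented, and details such as keeping a shorter final window are choices made for this sketch rather than confirmed lhotse behaviour.

```python
import math
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def trim_to_supervisions(supervisions: List[Interval]) -> List[Interval]:
    """CutSet A: one cut per supervision segment."""
    return sorted(supervisions)


def cut_into_windows(total: float, duration: float) -> List[Interval]:
    """CutSet B: consecutive fixed-size windows; the last one may be shorter."""
    num_windows = math.ceil(total / duration)
    return [(i * duration, min((i + 1) * duration, total)) for i in range(num_windows)]


def unsupervised_segments(total: float, supervisions: List[Interval]) -> List[Interval]:
    """CutSet C: the gaps that no supervision covers."""
    gaps: List[Interval] = []
    cursor = 0.0
    for start, end in sorted(supervisions):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < total:
        gaps.append((cursor, total))
    return gaps


recording_duration = 10.0
sups = [(0.0, 2.5), (4.0, 5.5), (7.0, 10.0)]  # three utterances, as in the diagram

cuts_a = trim_to_supervisions(sups)
cuts_b = cut_into_windows(recording_duration, duration=5.0)
cuts_c = unsupervised_segments(recording_duration, sups)
print(cuts_a, cuts_b, cuts_c)
```

With these timings the sketch yields three cuts for A, two for B and two for C, matching the shape of the diagram.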
CutSet caveats
- Parallel execution
- Operations are non-mutating (changes do not propagate to the original object)
.. note::
Some operations support parallel execution via an optional ``num_jobs`` parameter.
By default, all processing is single-threaded.
.. caution::
Operations on cut sets are not mutating -- they return modified copies of :class:`.CutSet` objects,
leaving the original object unmodified (and all of its cuts are also unmodified).
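The non-mutating behaviour is easy to get wrong when chaining calls, so here is a small illustration with toy classes. `ToyCut` and `ToyCutSet` are hypothetical stand-ins written for this note, not lhotse types; only the "return a modified copy" pattern is taken from the caution above.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class ToyCut:
    id: str
    duration: float


@dataclass(frozen=True)
class ToyCutSet:
    cuts: Tuple[ToyCut, ...]

    def filter(self, predicate: Callable[[ToyCut], bool]) -> "ToyCutSet":
        # Build and return a new set; `self` is left untouched.
        return ToyCutSet(tuple(c for c in self.cuts if predicate(c)))


original = ToyCutSet((ToyCut("a", 3.0), ToyCut("b", 8.0)))

# Result discarded: `original` still holds both cuts afterwards.
original.filter(lambda c: c.duration > 5)

# Correct usage: keep the returned copy.
longer_than_5s = original.filter(lambda c: c.duration > 5)
print(len(original.cuts), len(longer_than_5s.cuts))
```

The practical consequence is that every transformation has to be assigned back, for example `cuts = cuts.filter(...)`.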
CutSet file serialization and dict-like access
:class:`~lhotse.cut.CutSet` can be stored and read from JSON, JSONL, etc. and supports optional gzip compression::
>>> cuts.to_file('cuts.jsonl.gz')
>>> cuts4 = CutSet.from_file('cuts.jsonl.gz')
It behaves similarly to a ``dict``::
>>> 'rec1-1-0' in cuts
True
>>> cut = cuts['rec1-1-0']
>>> for cut in cuts:
...     pass
>>> len(cuts)
127
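For readers unfamiliar with gzip-compressed JSONL, the following standard-library sketch shows the general idea: one JSON object per line, then an index by ID for dict-like lookups. The record fields are heavily simplified assumptions and do not reproduce lhotse's actual manifest schema.

```python
import gzip
import json
import os
import tempfile

# Hypothetical, simplified cut records; real manifests carry many more fields.
toy_cuts = {
    "rec1-1-0": {"id": "rec1-1-0", "start": 0.0, "duration": 2.5},
    "rec1-1-1": {"id": "rec1-1-1", "start": 4.0, "duration": 1.5},
}

path = os.path.join(tempfile.mkdtemp(), "cuts.jsonl.gz")

# Write: one JSON object per line, gzip-compressed.
with gzip.open(path, "wt", encoding="utf-8") as f:
    for record in toy_cuts.values():
        f.write(json.dumps(record) + "\n")

# Read back and index by id, which is what enables dict-like lookups.
loaded = {}
with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        loaded[record["id"]] = record

print("rec1-1-0" in loaded, len(loaded))
```

A line-oriented format like this can be read one record at a time, which is convenient for large manifests.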
CutSet properties and related operations
:class:`~lhotse.cut.CutSet` has some convenience properties and methods to gather information about the dataset::
>>> ids = list(cuts.ids)
>>> speaker_id_set = cuts.speakers
>>> # The following prints a message:
>>> cuts.describe()
Cuts count: 547
Total duration (hours): 326.4
Speech duration (hours): 79.6 (24.4%)
***
Duration statistics (seconds):
mean 2148.0
std 870.9
min 477.0
25% 1523.0
50% 2157.0
75% 2423.0
max 5415.0
dtype: float64
Manipulation examples::
>>> longer_than_5s = cuts.filter(lambda c: c.duration > 5)
>>> first_100 = cuts.subset(first=100)
>>> split_into_4 = cuts.split(num_splits=4)
>>> shuffled = cuts.shuffle()
>>> random_sample = cuts.sample(n_cuts=10)
>>> new_ids = cuts.modify_ids(lambda c: c.id + '-newid')
These operations can be composed to implement more complex operations, e.g.
bucketing by duration:
>>> buckets = cuts.sort_by_duration().split(num_splits=30)
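Bucketing by duration is simply "sort, then split into contiguous chunks". The sketch below shows that on a plain list of durations. `split_evenly` is a hypothetical helper, and the way it distributes leftover items is an assumption of this sketch, not necessarily how `CutSet.split` does it.

```python
from typing import List, Sequence


def split_evenly(items: Sequence[float], num_splits: int) -> List[List[float]]:
    """Split into num_splits contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), num_splits)
    chunks: List[List[float]] = []
    start = 0
    for i in range(num_splits):
        size = base + (1 if i < extra else 0)
        chunks.append(list(items[start:start + size]))
        start += size
    return chunks


durations = [12.0, 3.5, 7.2, 1.1, 9.8, 4.4, 6.0]  # made-up cut durations in seconds

# Sorting first guarantees that each bucket holds cuts of similar length.
buckets = split_evenly(sorted(durations), num_splits=3)
print(buckets)
```

Cuts of similar length in one bucket need less padding when they are batched together, which is the point of bucketing.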
Detaching a CutSet from its underlying data
Cuts in a :class:`.CutSet` can be detached from parts of their metadata::
>>> cuts_no_feat = cuts.drop_features()
>>> cuts_no_rec = cuts.drop_recordings()
>>> cuts_no_sup = cuts.drop_supervisions()
Recommended sorting methods when the CutSet is small
Sometimes specific sorting patterns are useful when a small CutSet represents a mini-batch::
>>> cuts = cuts.sort_by_duration(ascending=False)
>>> cuts = cuts.sort_like(other_cuts)
</details>
<details>
<summary>CutSet batch operations: pad / truncate</summary>
:class:`~lhotse.cut.CutSet` offers some batch processing operations::
>>> cuts = cuts.pad(num_frames=300) # or duration=30.0
>>> cuts = cuts.truncate(max_duration=30.0, offset_type='start') # truncate from start to 30.0s
>>> cuts = cuts.mix(other_cuts, snr=[10, 30], mix_prob=0.5)
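Padding is what turns variable-length utterances into the `(B, T, F)` tensor described earlier. A minimal sketch on nested lists follows. `pad` and `truncate` here are toy functions that work on frames directly, whereas lhotse operates on cut manifests, and the padding value of `0.0` is a placeholder assumption.

```python
from typing import List

Frame = List[float]


def pad(features: List[Frame], num_frames: int, pad_value: float = 0.0) -> List[Frame]:
    """Append constant frames until the utterance has num_frames frames."""
    num_feats = len(features[0])
    missing = max(0, num_frames - len(features))
    return features + [[pad_value] * num_feats for _ in range(missing)]


def truncate(features: List[Frame], max_frames: int) -> List[Frame]:
    """Keep at most max_frames frames, counted from the start."""
    return features[:max_frames]


utt_a = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 frames, F = 2
utt_b = [[0.7, 0.8]]                          # 1 frame,  F = 2

# Pad everything to the longest utterance: the result has shape (B=2, T=3, F=2).
longest = max(len(utt_a), len(utt_b))
batch = [pad(utt, longest) for utt in (utt_a, utt_b)]
print(batch)
```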
</details>
<details>
<summary>CutSet data augmentation (speed / volume / reverb)</summary>
These operations can also be chained one after another.
:class:`~lhotse.cut.CutSet` supports lazy data augmentation/transformation methods which require adjusting some information
in the manifest (e.g., ``num_samples`` or ``duration``).
Note that in the following examples, the audio is untouched -- the operations are stored in the manifest,
and executed upon reading the audio::
>>> cuts_sp = cuts.perturb_speed(factor=1.1)
>>> cuts_vp = cuts.perturb_volume(factor=2.)
>>> cuts_24k = cuts.resample(24000)
>>> cuts_rvb = cuts.reverb_rir(rir_recordings)
.. caution::
If the :class:`.CutSet` contained :class:`~lhotse.features.base.Features` manifests, they will be
detached after performing audio augmentations such as :meth:`.CutSet.perturb_speed`,
:meth:`.CutSet.resample`, :meth:`.CutSet.perturb_volume`, or :meth:`.CutSet.reverb_rir`.
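"Lazy" here means that only the manifest changes: the transform is recorded and the sample count or sampling rate is adjusted, while the audio itself is processed later at read time. The toy class below sketches that bookkeeping. `ToyRecording` is hypothetical, and the plain `round()` used for the new sample count is a simplification, so exact values may differ from lhotse's own rounding.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ToyRecording:
    sampling_rate: int
    num_samples: int
    transforms: Tuple[Tuple[str, float], ...] = ()  # applied later, at read time

    @property
    def duration(self) -> float:
        return self.num_samples / self.sampling_rate

    def perturb_speed(self, factor: float) -> "ToyRecording":
        # Faster playback means fewer samples at the same sampling rate.
        return replace(
            self,
            num_samples=round(self.num_samples / factor),
            transforms=self.transforms + (("speed", factor),),
        )

    def perturb_volume(self, factor: float) -> "ToyRecording":
        # Volume scaling does not change the length, only the transform list.
        return replace(self, transforms=self.transforms + (("volume", factor),))

    def resample(self, sampling_rate: int) -> "ToyRecording":
        return replace(
            self,
            sampling_rate=sampling_rate,
            num_samples=round(self.num_samples * sampling_rate / self.sampling_rate),
            transforms=self.transforms + (("resample", float(sampling_rate)),),
        )


rec = ToyRecording(sampling_rate=16000, num_samples=16000)  # 1 second of audio
chained = rec.perturb_speed(2.0).perturb_volume(2.0).resample(24000)
print(chained.num_samples, chained.duration, chained.transforms)
```

This also shows why precomputed features have to be detached: they were extracted from the unmodified audio and no longer match the adjusted `num_samples` and `duration`.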
</details>
<details>
<summary>CutSet: parallel computation of global mean and variance</summary>
:class:`~lhotse.cut.CutSet` offers parallel feature extraction capabilities
(see :meth:`.CutSet.compute_and_store_features` for details),
and can be used to estimate global mean and variance::
>>> from lhotse import Fbank
>>> cuts = CutSet()
>>> cuts = cuts.compute_and_store_features(
... extractor=Fbank(),
... storage_path='/data/feats',
... num_jobs=4
... )
>>> mvn_stats = cuts.compute_global_feature_stats('/data/features/mvn_stats.pkl', max_cuts=10000)
See also:
- :class:`~lhotse.cut.Cut`
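Global mean and variance can be estimated in one streaming pass by accumulating per-dimension sums and sums of squares, so no more than one feature matrix needs to be in memory at a time. The sketch below shows that accumulation on nested lists. `global_mean_std` is a hypothetical helper written for this note, it uses population variance, and it is not the code behind `compute_global_feature_stats`.

```python
import math
from typing import Iterable, List, Tuple

FeatureMatrix = List[List[float]]  # (num_frames, num_features)


def global_mean_std(matrices: Iterable[FeatureMatrix]) -> Tuple[List[float], List[float]]:
    """Accumulate per-dimension sum and sum of squares over all frames."""
    total: List[float] = []
    total_sq: List[float] = []
    count = 0
    for matrix in matrices:
        for frame in matrix:
            if not total:
                total = [0.0] * len(frame)
                total_sq = [0.0] * len(frame)
            for i, value in enumerate(frame):
                total[i] += value
                total_sq[i] += value * value
            count += 1
    if count == 0:
        raise ValueError("no frames to compute statistics from")
    means = [s / count for s in total]
    # Var[x] = E[x^2] - E[x]^2; clamp tiny negatives caused by rounding.
    stds = [math.sqrt(max(sq / count - m * m, 0.0)) for sq, m in zip(total_sq, means)]
    return means, stds


# Two toy "cuts": 2 frames and 1 frame, with F = 2.
means, stds = global_mean_std([[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]])
print(means, stds)
```

Capping the number of matrices fed into such a loop plays the same role as the `max_cuts` argument above: the estimate comes from a subset of the data rather than from all of it.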
</details>
<details>
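<summary>Hdf5MemoryIssueFix</summary>
* After each batch is read, the counter is incremented by 1.
* When the counter reaches a multiple of the threshold (``reset_interval``), ``close_cached_file_handles`` is called and the counter is reset to 0.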
class Hdf5MemoryIssueFix:
    """
    Use this class to limit the growing memory use when reading from HDF5 files.
    It should be instantiated within the dataloading worker, i.e., the best place
    is likely inside the PyTorch Dataset class.
    Every time a new batch/example is returned, call ``.update()``.
    Once per ``reset_interval`` updates, this object will close all open HDF5 file
    handles, which seems to limit the memory use.
    """

    def __init__(self, reset_interval: int = 100) -> None:
        self.counter = 0
        self.reset_interval = reset_interval

    def update(self) -> None:
        from lhotse import close_cached_file_handles

        if self.counter > 0 and self.counter % self.reset_interval == 0:
            close_cached_file_handles()
            self.counter = 0
        self.counter += 1
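To see exactly when the handles get closed, the counter logic can be exercised in isolation. The sketch below copies the `update()` logic into a hypothetical `ToyHandleReset` class whose reset action is an injectable callback, so it runs without lhotse or HDF5. The callback and the small `reset_interval` are assumptions made purely for illustration.

```python
from typing import Callable, List


class ToyHandleReset:
    """Same counter logic as Hdf5MemoryIssueFix, with an injectable callback."""

    def __init__(self, on_reset: Callable[[], None], reset_interval: int = 100) -> None:
        self.counter = 0
        self.reset_interval = reset_interval
        self.on_reset = on_reset

    def update(self) -> None:
        if self.counter > 0 and self.counter % self.reset_interval == 0:
            self.on_reset()
            self.counter = 0
        self.counter += 1


reset_calls: List[int] = []  # batch numbers at which the callback fired
fix = ToyHandleReset(on_reset=lambda: reset_calls.append(batch_no), reset_interval=3)

for batch_no in range(1, 11):  # pretend to serve 10 batches
    fix.update()

print(reset_calls)  # [4, 7, 10]
```

Because the check runs before the increment, the first reset happens on update number `reset_interval + 1`, and then once every `reset_interval` updates after that.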
</details>