LLM multimodal•audiocraft•av (binding to the FFmpeg API)•Audio/Video/Bitstream•pytorch•sklearn•numpy•pandas•spacy•librosa•opencv
Whether it's ChatGPT and other large language models, or Meta's AI-generated music 🎶,
they all need to process Audio, Video, and Bitstream data.
Text algorithm libraries: SpaCy, numpy, PyTorch/TensorFlow/Transformers, …
Audio algorithm libraries: librosa, numpy, PyTorch/TensorFlow/Transformers, …
Image/Video algorithm libraries: Pillow, OpenCV, numpy, PyTorch/TensorFlow/Transformers, …
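As a minimal illustration of the audio side of this stack (numpy only; the sample rate, tone frequency, and signal are made up for the example), the following computes the magnitude spectrum of a synthetic tone — the kind of preprocessing that librosa/PyTorch pipelines build on:

```python
import numpy as np

# Synthesize one second of a 440 Hz sine tone at 8 kHz (illustrative values).
sr = 8000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)

# Magnitude spectrum via the real FFT; rfftfreq maps bins to Hz.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, d=1 / sr)

# The strongest bin sits at the tone's frequency.
peak_hz = freqs[np.argmax(spectrum)]
print(round(peak_hz))  # → 440
```

With exactly one second of signal, each FFT bin is 1 Hz wide, so the 440 Hz tone lands precisely in bin 440.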
Take audiocraft, open-sourced by Meta (Facebook renamed itself Meta), as an example. Typical tasks in this space include:
- ASR (speech-to-text; voice interaction and recognition)
- TTS (text-to-speech synthesis)
- NLP (natural language processing)
- NLG (natural language generation)
- Content Generation (generating Text/Image/Audio/Video/…)
In audiocraft:
- the NLP part uses the Python library SpaCy;
- the audio/video part uses the Python library av (PyAV, a Cython binding of the FFmpeg C API), which makes Audio/Video/Bitstream handling very convenient for higher-level AI/machine-learning code to call;
- OpenCV and av can also serve as references for wrapping other multimodal content interfaces, covering all media types (Article/Text/Image/Audio/Video/…).
SpaCy: Industrial-Strength Natural Language Processing
https://spacy.io/
The av library: https://pypi.org/project/av/#description
FFmpeg: https://ffmpeg.org/documentation.html
PyAV is a Pythonic binding for the [FFmpeg][ffmpeg] libraries.
We aim to provide all of the power and control of the underlying library, but manage the gritty details as much as possible.
PyAV is for direct and precise access to your media via containers, streams, packets, codecs, and frames.
It exposes a few transformations of that data, and helps you get your data to/from other packages (e.g. Numpy and Pillow).
This power does come with some responsibility, as working with media is horrendously complicated and PyAV can't abstract it away or make all the best decisions for you.
If the `ffmpeg` command does the job without you bending over backwards, PyAV is likely going to be more of a hindrance than a help.
But where you can't work without it, PyAV is a critical tool.
Installation
------------
Due to the complexity of the dependencies, PyAV is not always the easiest Python package to install from source.
Since release 8.0.0 binary wheels are provided on [PyPI][pypi] for Linux, Mac and Windows linked against a modern FFmpeg.
You can install these wheels by running:
```bash
pip install av
```
If you want to use your existing FFmpeg, the source version of PyAV is on [PyPI][pypi] too:
```bash
pip install av --no-binary av
```
Alternative installation methods
--------------------------------
Another way of installing PyAV is via [conda-forge][conda-forge]:
```bash
conda install av -c conda-forge
```
https://github.com/abaelhe/audiocraft
Audiocraft is a PyTorch library for deep learning research on audio generation. At the moment, it contains the code for MusicGen, a state-of-the-art controllable text-to-music model.
MusicGen
Audiocraft provides the code and models for MusicGen, [a simple and controllable model for music generation][arxiv]. MusicGen is a single stage auto-regressive
Transformer model trained over a 32kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz. Unlike existing methods like MusicLM, MusicGen doesn't require a self-supervised semantic representation, and it generates
all 4 codebooks in one pass. By introducing a small delay between the codebooks, we show we can predict
them in parallel, thus having only 50 auto-regressive steps per second of audio.
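The codebook delay pattern can be sketched in a few lines of pure Python (the token strings are dummies; real tokens come from EnCodec): codebook k is shifted right by k steps, so each auto-regressive step emits one token per codebook and all four streams advance in parallel:

```python
# Four codebooks of T tokens each (dummy values standing in for EnCodec tokens).
T = 6
codebooks = [[f"c{k}t{t}" for t in range(T)] for k in range(4)]

PAD = "-"
# Delay pattern: codebook k is delayed by k steps, so step s carries
# token t = s - k of codebook k. Total steps: T + 3.
steps = []
for s in range(T + 3):
    col = [codebooks[k][s - k] if 0 <= s - k < T else PAD for k in range(4)]
    steps.append(col)

for s, col in enumerate(steps[:4]):
    print(s, col)
# 0 ['c0t0', '-', '-', '-']
# 1 ['c0t1', 'c1t0', '-', '-']
# 2 ['c0t2', 'c1t1', 'c2t0', '-']
# 3 ['c0t3', 'c1t2', 'c2t1', 'c3t0']
```

With 4 codebooks sampled at 50 Hz, predicting one column per step like this is what yields roughly 50 auto-regressive steps per second of audio rather than 200.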
Check out our [sample page][musicgen_samples] or test the available demo!
We use 20K hours of licensed music to train MusicGen. Specifically, we rely on an internal dataset of 10K high-quality music tracks, and on the ShutterStock and Pond5 music data.
Installation
Audiocraft requires Python 3.9, PyTorch 2.0.0, and a GPU with at least 16 GB of memory (for the medium-sized model). To install Audiocraft, you can run the following:
```bash
# Best to make sure you have torch installed first, in particular before installing xformers.
# Don't run this if you already have PyTorch installed.
pip install 'torch>=2.0'
# Then proceed to one of the following
pip install -U audiocraft  # stable release
pip install -U git+https://git@github.com/facebookresearch/audiocraft#egg=audiocraft  # bleeding edge
pip install -e .  # or if you cloned the repo locally
```
Meta open-sources the MusicGen music-generation model
2023-06-18 10:18
Meta recently open-sourced its music-generation model MusicGen on GitHub. MusicGen is built for music generation: it can turn text and existing melodies into complete pieces of music. The model is based on the Transformer architecture introduced by Google in 2017.
The research team said: "We trained the model on 20,000 hours of licensed music, and used Meta's EnCodec encoder to break the audio data into smaller units that can be processed in parallel, making MusicGen more efficient and faster to generate than comparable AI models."
Beyond that, MusicGen supports combined text-and-melody input: for example, you can ask for "an upbeat track" and at the same time request "combine it with Beethoven's Ode to Joy".
The team also benchmarked MusicGen's real-world performance. Compared with Google's MusicLM and other music models such as Riffusion, Mousai, and Noise2Music, MusicGen scored better on how well the generated music matches the text prompt and on the plausibility of the composition, overall slightly ahead of Google's MusicLM.
Meta permits commercial use of the model and has published a demo web app on Hugging Face.
Further reading
Google releases MusicLM, a model that generates music from text