Codec 2: A New Low Bit Rate Speech Codec

David Rowe, VK5DGR

Introduction

Codec 2 is an open source low bit rate speech codec designed for communications quality speech at 2400 bit/s and below. Applications include low bandwidth HF/VHF digital radio and VoIP trunking. Codec 2 operating at 2000 bit/s can send 32 phone calls using the bandwidth required for one 64 kbit/s uncompressed phone call. It fills a gap in open source, free-as-in-speech voice codecs beneath 5000 bit/s and is released under the GNU Lesser General Public License (LGPL).

The motivations behind the project are summarised in this blog post.

You can help and support Codec2 development via a Donation.

Status

Alpha release. V0.1A is the current SVN trunk version, a fully functional 2550 bit/s codec (51 bits/frame at a 20ms frame rate). Work is in progress on a 1400 bit/s version in the codec2-dev SVN branch. The target application for 1400 bit/s is digital voice over HF radio.

Here are some samples from the 2550, 2100 and 1400 bit/s versions. As you can hear, it's possible to get about the same quality at 1400 bit/s as at 2550 bit/s.

Codec                       Male    Female
Original                    male    female
Codec 2 V0.1A 2550 bit/s    male    female
Codec 2 2100 bit/s          male    female
Codec 2 1400 bit/s          male    female

Here is Codec 2 operating at 2550 bit/s compared to some other codecs:

Codec                       Male    Female
Original                    male    female
Codec 2 V0.1 2550 bit/s     male    female
Codec 2 V0.1A 2550 bit/s    male    female
MELP 2400 bit/s             male    female
AMBE 2000 bit/s             male    female
LPC-10 2400 bit/s           male    female

Notes: The MELP samples are from an early 1998 simulation. I would welcome any samples processed with a modern version of MELP. The AMBE samples were generated using a DV-Dongle, a USB device containing the DVSI AMBE2000 chip. The LPC-10 samples were generated using the Spandsp library.

Here is a counter example where AMBE really shines compared to Codec 2. In particular the low frequency reproduction is much better. Thank you Kristoff ON1ARF for sending in these samples. Why AMBE works so much better than Codec 2 for this sample compared to say the male sample above (hts1a.wav) is an interesting mystery that I am exploring. Input filtering perhaps? Or a corner case where the Codec 2 parameter estimators (pitch, voicing etc) break down? We shall see.

Codec                       Kristoff
Original                    Kristoff
Codec 2 V0.1A 2550 bit/s    Kristoff
AMBE 2000 bit/s             Kristoff

Here are some samples with acoustic background noise, similar to what would be experienced when driving a truck. As you can see (well, hear) background noise is a tough test for low bit rate vocoders. They achieve high compression rates by being highly optimised for human speech, at the expense of performance with non-speech signals like background noise and music. Note that Codec 2 has just one voicing bit, unlike mixed excitation algorithms like AMBE and MELP.

Codec                       Male with truck noise
Original                    male
Codec 2 V0.1A 2550 bit/s    male
AMBE 2000 bit/s             male
LPC-10 2400 bit/s           male

Progress to Date

 

    1. Linux/gcc simulation (c2sim) which is a test-bed for various modelling and quantisation options – these are controlled via command line switches. Separate encoder (c2enc) and decoder (c2dec) programs that demo Codec2 on the command line. Runs approximately 10x real-time on a modern X86 PC.

 

    2. Original thesis code has been refactored and brought up to modern gcc standards.

 

    3. LPC modelling is working nicely and there is a first pass scalar LSP quantiser working at 36 bits/frame with good voice quality. Lots could be done to reduce the LSP bit rate. A novel approach to LPC modelling uses a single bit to correct low frequency LPC errors. LSP quantisers (simple uniform, hand designed tables) are designed to simply keep low frequency LSP quantisation errors less than 25Hz. Keeping LSP errors less than 25 Hz was found to be more important than minimum MSE in subjective tests. This is a surprise, as regular quantiser design (e.g. Lloyd-max and k-means for VQ design) assumes minimum MSE is the key.

 

    4. A phase model has been developed that uses 0 bits for phase and 1 bit/frame for the voiced/unvoiced decision but delivers OK speech quality. This works suspiciously well – codecs with a single bit voiced/unvoiced decision aren't meant to sound this good. Usually mixed voicing at several bits/frame is required. However the phase and voicing still represent the largest quality drop in the codec.

       With some improvement in phase modelling and voicing, Codec 2 has the potential for toll quality – i.e. similar performance to 8000 bit/s g.729a at around 2000 bit/s.

To give you an idea of where Codec 2 is heading, this is what the codec sounds like at a 10ms frame rate with original phases. The other codec parameters are fully quantised:

Original        Codec 2 (original phases)    g.729a 8000 bit/s
hts1a male      hts1a male                   hts1a male
hts2a female    hts2a female                 hts2a female

If you want to see a really high quality, completely open source, patent free 2000 bit/s codec happen sooner rather than later, please consider sponsoring this work.

 

    5. An experimental post-filter has been developed to improve performance with speech mixed with background noise. The post-filter delivers many of the advantages of mixed voicing but unlike mixed voicing requires zero bits.

 

    6. The Non-Linear Pitch (NLP) pitch estimator is working OK, and a simple pitch tracker has been developed to help with some problem frames.

 

  7. An algorithm has been developed for interpolating sinusoidal model parameters from the 20ms frame rate required for 2400 bit/s coding to the internal native 10ms frame rate. This introduces some artefacts and needs some work. However, as speech is highly correlated over 20ms it should be possible to make interpolation artefact free.
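
Here is a minimal sketch of the kind of interpolation involved. It is not the algorithm from the codec sources: the names are illustrative, and only Wo and energy are shown, while the real codec must also deal with the harmonic amplitudes and voicing.

  /* Sketch only: linearly interpolate model parameters from the 20ms
     transmission frame rate back to the 10ms internal frame rate. */
  typedef struct {
      float Wo;       /* fundamental, radians (pi corresponds to 4 kHz) */
      float energy;   /* frame energy                                   */
  } model_params;

  /* the 10ms frame half way between two decoded 20ms frames */
  model_params interpolate_halfway(model_params prev, model_params next)
  {
      model_params mid;
      mid.Wo     = 0.5f * (prev.Wo + next.Wo);
      mid.energy = 0.5f * (prev.energy + next.energy);
      return mid;
  }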

Current work areas:

 

    1. Improve speech quality by working on the phase modelling, voicing estimation, post-filter, and interpolation.

 

    2. Gather feedback from real-world testing of Alpha code.

 

  3. Integration with modem code to develop an open source digital voice system for HF/VHF radio.

Browse:

http://freetel.svn.sourceforge.net/viewvc/freetel/codec2/

Check Out:


$ svn co https://freetel.svn.sourceforge.net/svnroot/freetel/codec2 codec2
 
Development version (latest and greatest but may not compile cleanly at any given time):
 
$ svn co https://freetel.svn.sourceforge.net/svnroot/freetel/codec2-dev codec2-dev
 

The Mailing List

For any questions, comments, support, or suggestions for applications, end users and developers are welcome to post to the Codec2 Mailing List.

Quick Start

To encode the file raw/hts1a.raw and then decode it to a raw file hts1a_c2.raw:


  $ svn co https://freetel.svn.sourceforge.net/svnroot/freetel/codec2 codec2
  $ cd codec2
  $ ./configure
  $ make
  $ cd src
  $ ./c2demo ../raw/hts1a.raw hts1a_c2.raw
  $ ../script/menu.sh ../raw/hts1a.raw hts1a_c2.raw
 

Development Roadmap

Here is the planned algorithm development road map, presented as a project plan in task list form:

  [X] Milestone 0 - Project kick off
  [X] Milestone 1 - Alpha 2400 bits/s codec
      [X] Spectral amplitudes modelled using LPC
      [X] Phase and voicing model developed
      [X] Pitch estimator
      [X] Spectral amplitudes quantised using LSPs
      [X] Decimation of model parameters from 20ms to 10ms
      [X] Refactor to develop separate encoder/decoder functions
      [X] Complete 2400 bits/s codec demonstrated
      [X] Reduced complexity voicing estimator
      [ ] Test phone call over LAN
      [X] Release 0.1 for Alpha Testing
  [ ] Milestone 2 - Algorithm Development
      [ ] Gather samples from the community with different speakers,
          input filtering, and background noise conditions that
          break the codec.
      [X] Improve quality 
          [X] Voicing estimation
          [X] Phase modelling
          [X] Interpolation
      [ ] 2400 bit/s version
      [ ] 2000 bit/s version
      [ ] Add 400 bit/s FEC
      [ ] 2400 bit/s version with FEC
      [ ] 1200 bit/s version
  [ ] Milestone 3 - Fixed point port
  [ ] Milestone 4 - codec2-on-a-chip embedded DSP/CPU port

How it Works

Here is a presentation on Codec 2 in Power Point or Open Office form.

Codec 2 uses “harmonic sinusoidal speech coding”. Sinusoidal coding was developed at MIT Lincoln Labs in the mid 1980s, starting with some gentlemen called R.J. McAulay and T.F. Quatieri. I worked on these codec algorithms for my PhD during the 1990s. Sinusoidal coding is a close relative of the xMBE codec family, and they often use mixed voicing models similar to those used in MELP.

Speech is modelled as a sum of sinusoids:

 
  for(m=1; m<=L; m++)
    s[n] += A[m]*cos(Wo*m*n + phi[m]);

The sinusoids are multiples of the fundamental frequency Wo (omega-naught), so the technique is known as “harmonic sinusoidal coding”. For each frame, we analyse the speech signal and extract a set of parameters:

 
  Wo, {A}, {phi}

Where Wo is the fundamental frequency (also known as the pitch), { A } is a set of L amplitudes and { phi } is a set of L phases. L is chosen to be equal to the number of harmonics that can fit in a 4 kHz bandwidth:

 
  L = floor(pi/Wo)

Wo is specified in radians normalised to 4 kHz, such that pi radians = 4 kHz. The fundamental frequency in Hz is:

 
  F0 = (8000/(2*pi))*Wo

We then need to encode (quantise) Wo, { A }, { phi } and transmit them to a decoder which reconstructs the speech. A frame might be 10-20ms in length so we update the parameters every 10-20ms (100 to 50 Hz update rate).
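
To make this concrete, here is a minimal sketch of one frame of synthesis in C. It is not taken from the Codec 2 sources; the names, the 20ms frame size and the use of floats are illustrative assumptions.

  #include <math.h>

  #define PI 3.141592653589793f
  #define N  160   /* one 20ms frame at an 8 kHz sample rate */

  /* Sketch only: synthesise one frame of speech from the model parameters.
     Wo is the fundamental in radians, normalised so that PI corresponds to
     4 kHz; A[1..L] are harmonic amplitudes, phi[1..L] harmonic phases. */
  void synthesise_frame(float s[N], float Wo, const float A[], const float phi[])
  {
      int L = (int)floorf(PI / Wo);          /* harmonics that fit below 4 kHz */

      for (int n = 0; n < N; n++) {
          s[n] = 0.0f;
          for (int m = 1; m <= L; m++)
              s[n] += A[m] * cosf(Wo * m * n + phi[m]);
      }
  }

  /* fundamental frequency in Hz, e.g. Wo = 0.0785 radians is roughly 100 Hz */
  float wo_to_hz(float Wo)
  {
      return (8000.0f / (2.0f * PI)) * Wo;
  }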

The speech quality of the basic harmonic sinusoidal model is pretty good, close to transparent. It is also relatively robust to Wo estimation errors. Unvoiced speech (e.g. consonants) is well modelled by a bunch of harmonics with random phases. Speech corrupted with background noise also sounds OK; the background noise doesn't introduce any grossly unpleasant artefacts.

As the parameters are quantised to a low bit rate and sent over the channel, the speech quality drops. The challenge is to achieve a reasonable trade off between speech quality and bit rate.

Bit Allocation

Parameter                       bits/frame
Spectral magnitudes (LSPs)      36
Low frequency LPC correction    1
Energy                          5
Voicing (updated each 10ms)     2
Fundamental Frequency (Wo)      7
Total                           51

At a 20ms update rate, 51 bits/frame is 2550 bit/s. This is just a starting point; there is plenty of scope to get it down further.
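
As a quick sanity check of that arithmetic, here is a small stand-alone C program; the variable names are illustrative, not taken from the codec source.

  #include <stdio.h>

  int main(void)
  {
      const int lsp_bits      = 36;  /* spectral magnitudes (LSPs)   */
      const int lpc_corr_bits = 1;   /* low frequency LPC correction */
      const int energy_bits   = 5;   /* frame energy                 */
      const int voicing_bits  = 2;   /* one voicing bit per 10ms     */
      const int wo_bits       = 7;   /* fundamental frequency (Wo)   */

      int    bits_per_frame = lsp_bits + lpc_corr_bits + energy_bits
                              + voicing_bits + wo_bits;    /* 51 */
      double frames_per_sec = 1.0 / 0.02;                  /* 20ms frames */

      printf("%d bits/frame x %.0f frames/s = %.0f bit/s\n",
             bits_per_frame, frames_per_sec, bits_per_frame * frames_per_sec);
      return 0;
  }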

Challenges

The tough bits of this project are:

1. Parameter estimation, in particular voicing estimation.

2. Reduction of a time-varying number of parameters (L changes with Wo each frame) to a fixed number of parameters required for a fixed bit rate. The trick here is that { A } tend to vary slowly with frequency, so we can “fit” a curve to the set of { A } and send parameters that describe that curve.

3. Discarding the phases { phi }. In most low bit rate codecs the phases are discarded and synthesised at the decoder using a rule-based approach. This also implies the need for a "voicing" model, as voiced speech (vowels) tends to have a different phase structure to unvoiced speech (consonants). The voicing model needs to be accurate (not introduce distortion), and relatively low bit rate.

4. Quantisation of the amplitudes { A } to a small number of bits while maintaining speech quality. For example 30 bits/frame at a 20ms frame rate is 30/0.02 = 1500 bits/s, a large part of our 2400 bit/s “budget”.

5. Performance with different speakers and background noise conditions. This is where you come in – as Codec 2 develops, please send me samples of its performance with various speakers and background noise conditions and together we will improve the algorithm. This approach proved very powerful when developing Oslec. One of the cool things about open source!

Can I help?

Maybe. Check out the Development Roadmap above and see if there is anything that interests you.

Not all of this project is DSP. There are many general C coding tasks like code review, writing a command line soft phone application for testing the codec over a LAN, and patent review.

I will happily accept sponsorship for this project. For example research grants, or development contracts from companies interested in seeing an open source low bit rate speech codec.

You can also donate to the codec2 project via PayPal (which also allows credit card donations).


Thanks to the following for your kind PayPal donations:

Brian Morrison, Andreas Weller, Stuart Brock, Bryan Greenawalt, Anthony Cutler, Martin Flynn, Melvyn Whitten, Glen English, William Scholey, Andreas Bier, David Witten, Clive Ousbey, David Bern, Bryan Pollock, Mario Dearanzeta, Gerhard Burian, Tim Rob, Daniel Cussen, Gareth Davies, Simon Eatough, Neil Brewitt, Robert Eugster, Ramon Gandia, A J Carr, Van Jacobson, Eric Muehlstein, Cecil Casey, Nicola Giacobbe, John Ackermann, Joel Kolstad, Curt E. Mills, James Ahlstrom.

Thanks to the following for your kind Equipment donations:

Melvyn Whitten (headphones), Melvyn Whitten (DV Dongle), TAPR (Yaesu FT-817ND radio).

Thanks to the following for submitting patches or helping out with the code and algorithms:

Bruce Perens, Bill Cowley, Jean-Marc Valin, Gregory Maxwell, Peter Ross, Edwin Torok, Mathieu Rene, Brian West, Bruce Robertson.

Thanks also to the many people who have sent emails of encouragement, publicised codec 2, and participated in the mailing list. If I have forgotten anyone above, please let me know!

Is it Patent Free?

I think so – much of the work is based on old papers from the 60s, 70s and 80s, and the PhD thesis work [2] used as a baseline for this codec was original. A nice little mini project would be to audit the patents used by proprietary 2400 bit/s codecs (MELP and xMBE) and compare.

Proprietary codecs typically have small, novel parts of the algorithm protected by patents. However, proprietary codecs also rely heavily on large bodies of public domain work. The patents cover perhaps 5% of the codec algorithms; proprietary codec designers did not invent most of the algorithms they use in their codecs. Typically, the patents just cover enough to make designing an interoperable codec very difficult. These also tend to be the parts that make their codecs sound good.

However there are many ways to make a codec sound good, so we simply need to choose and develop other methods.

Is Codec2 compatible with xMBE or MELP?

Nope – I don't think it's possible to build a compatible codec without infringing on patents or having access to commercial-in-confidence information.

Hacking

All of my development is on an Ubuntu Linux box. If you would like to play with Codec2 here are some notes:

 

    • src/sim.sh will perform the processing steps required to output speech files at various stages of processing, for example:

      $ cd codec2/src
      $ ./sim.sh hts1a

       

      will produce hts1a_uq (unquantised, i.e. baseline sinusoidal model), hts1a_phase0 (zero phase model), hts1a_lpc10 (10th order LPC model) etc.

 

    • You can then listen to all of these samples (and the original) using:

        $ ./listensim.sh hts1a

 

    • Specific notes about LPC and Phase modelling are below.

 

  • There are some useful scripts in the scripts directory, for example wav2raw.sh, raw2wav.sh, playraw.sh, menu.sh. Note that sim.sh and listensim.sh are in the src directory as that’s where they get used most of the time.

LPC Modelling Notes

Linear Predictive Coding (LPC) modelling is used to model the sine wave amplitudes { A }. The use of LPC in speech coding is common, although the application of LPC modelling to frequency domain coding is fairly novel; LPCs are mainly used in time domain codecs like LPC-10 and CELP.

LPC modelling has a couple of advantages:

 

    • From time domain coding we know a lot about LPCs, for example how to quantise them efficiently using Line Spectrum Pairs (LSPs) (a simple scalar LSP quantiser is sketched just after this list).

 

  • The number of amplitudes varies each frame as Wo and hence L vary. This makes the { A } tricky to quantise and transmit. However it is possible to convey the same information using a fixed number of LPCs which makes the quantisation problem easier.
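
For illustration, here is the kind of uniform scalar quantiser mentioned under Progress to Date, which keeps the low frequency LSP error under 25 Hz. The function name and step size are assumptions; the codec itself uses hand designed tables.

  #include <math.h>

  /* Sketch only: uniform scalar quantisation of a single LSP frequency in Hz.
     With a 50 Hz step the worst case error is 25 Hz, the limit that
     subjective tests found to matter for the low frequency LSPs. */
  float quantise_lsp_hz(float lsp_hz, float step_hz)
  {
      return step_hz * floorf(lsp_hz / step_hz + 0.5f);  /* round to nearest step */
  }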

To test LPC modelling:

  $ ./c2sim ../raw/hts1a.raw --lpc 10 -o hts1a_lpc10.raw

The blog post [4] discusses why LPC modelling works so well when the { A } are recovered via the RMS method (Section 5.1 of the thesis). The equal area model of the LPC spectrum versus the harmonics seems to work remarkably well, especially compared to sampling the LPC spectrum, with SNRs of up to 30dB on female frames.

There is a problem with modelling the low order (e.g. m=1, i.e. the fundamental) harmonics for males. The amplitude of the m=1 harmonic is raised by as much as 30dB after LPC modelling as (I think) LPC spectra must have zero derivative at DC. This means LPC is poor at modelling very low frequency harmonics, which unfortunately the ear is very sensitive to. To correct this, an extra bit has been added to correct LPC modelling errors on the first harmonic. When set, this bit instructs the decoder to attenuate the LPC modelled harmonic by 30dB.
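
A rough sketch of what that looks like on the decoder side; the function and parameter names are assumptions, not taken from the codec source.

  #include <math.h>

  /* Sketch only: if the low frequency correction bit is set, attenuate the
     LPC modelled first harmonic by 30dB before synthesis. */
  void apply_lpc_lf_correction(float A[], int correction_bit)
  {
      if (correction_bit) {
          float gain = powf(10.0f, -30.0f / 20.0f);  /* -30dB is about 0.0316 */
          A[1] *= gain;                              /* m = 1, the fundamental */
      }
  }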

Phase Modelling Notes

I have a “zero order” phase model under constant development. This model synthesises the phases of each harmonic at the decoder side. The model is described in the source code of phase.c.

The zero phase model requires just one voicing bit to be transmitted to the decoder; all other phase information is synthesised using a rule-based model. It seems to work OK for most speech samples, but adds a "clicky" artefact to some low pitched speakers. For reasons I don't yet understand, the model quality drops when the zero phase model is combined with LPC based amplitude modelling. Also see the blog posts below for more discussion of phase models.

To determine voicing we use the MBE algorithm on the first 1 kHz. This attempts to fit an "all voiced" harmonic spectrum and then compares the fit in a Mean Square Error (MSE) sense. This works OK, and is fast, but like all parameter estimators it screws up occasionally. The worst type of error is when voiced speech is accidentally declared unvoiced. So I have biased the threshold towards voiced decisions, which reduces the speech quality a little. More work is required, maybe a mixture of two estimators so their errors are uncorrelated.
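
In outline the decision boils down to comparing the fit against a threshold that is deliberately biased towards voiced. The sketch below only illustrates that idea; it is not the estimator in the codec source, and the threshold value is an assumption.

  #include <math.h>

  /* Sketch only: decide voicing from how well the "all voiced" harmonic fit
     explains the first 1 kHz of the spectrum. */
  int voicing_decision(float band_energy, float fit_error_energy)
  {
      /* ratio of signal energy to model fit error, in dB */
      float snr_db = 10.0f * log10f(band_energy / (fit_error_energy + 1e-6f));

      /* biased low so borderline frames are declared voiced, as declaring
         voiced speech unvoiced is the more audible error */
      const float VOICING_THRESH_DB = 4.0f;   /* assumed value */

      return snr_db > VOICING_THRESH_DB;
  }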

Unvoiced speech can be represented well by random phases and a Wo estimate that jumps around randomly. If Wo is small the number of harmonics is large, which makes the result less periodic and more noise-like to the ear. With Wo jumping around, the phase tracks are discontinuous between frames, which also makes the synthesised signal more noise-like and prevents the formation of time domain pulses that the ear is sensitive to.
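
The following is a highly simplified sketch of the rule-based idea. It is not the algorithm in phase.c; it only illustrates the voiced/unvoiced split described above, and every name is an assumption.

  #include <math.h>
  #include <stdlib.h>

  #define PI 3.141592653589793f

  /* Sketch only: voiced harmonics get coherent phases tied to a common
     excitation instant t0 (in samples); unvoiced harmonics get random
     phases so no time domain pulse can form. */
  void synthesise_phases(float phi[], int L, float Wo, int voiced, float t0)
  {
      for (int m = 1; m <= L; m++) {
          if (voiced)
              phi[m] = -Wo * m * t0;   /* linear phase track across harmonics */
          else
              phi[m] = 2.0f * PI * ((float)rand() / RAND_MAX) - PI;
      }
  }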

Running the Phase Model


  $ ./c2sim ../raw/hts1a.raw hts1a.mdl --phase0 -o hts1a_phase0.raw
 

Octave Scripts

 

    • pl.m – plot a segment from a raw file

 

    • pl2.m – plot the same segments from two different files to compare

 

    • plamp.m – menu based GUI interface to “dump” files; move back and forward through the file examining time and frequency domain parameters, the LPC model, etc.

                $ CFLAGS=-DDUMP ./configure
                $ make clean
                $ make
                 $ cd src 
                $ ./c2sim ../raw/hts1a.raw --lpc 10 --dump hts1a
                $ cd ../octave
                $ octave
                octave:1> plamp("../src/hts1a",25)

 

    • plphase.m – similar to plamp.m but for analysing phase models

                $ ./c2dec --phase [0|1] --dump hts1a_phase
                $ cd ../octave
                $ octave
                octave:1> plphase("../src/hts1a_phase",25)

 

    • plpitch.m – plot two pitch contours (.p files) and compare

 

  • plnlp.m – plots a bunch of NLP pitch estimator states. Screenshot

Directories


  script   - shell scripts to simplify common operations
  speex    - LSP quantisation code borrowed from Speex for testing
  src      - C source code
  octave   - Matlab/Octave scripts
  pitch    - pitch estimator output files
  raw      - speech files in raw format (16 bit signed linear, 8 kHz)
  unittest - Unit test source code
  wav      - speech files in wave file format

Other Uses

The DSP algorithms contained in codec2 may be useful for other DSP applications, for example:

 

    • nlp.c is a working, tested pitch estimator for human speech. NLP is an open source pitch estimator presented in C code, complete with a GUI debugging tool (plnlp.m screenshot). It can be run stand-alone using the tnlp.c unit test program and could be applied to other speech coding research. Pitch estimation is a popular subject in academia; however, most pitch estimators are described in papers, with the fine implementation details left out.

 

  • The basic analysis/synthesis framework could be used for high quality speech synthesis.

Links

  1. Codec 2 presentation in Power Point or Open Office form.
  2. Bruce Perens introducing the codec2 project concept
  3. David’s PhD Thesis, “Techniques for Harmonic Sinusoidal Coding”, used for baseline algorithm
  4. Open Source Low rate Speech Codec Part 1 – Introduction
  5. Open Source Low rate Speech Codec Part 2 – Spectral Magnitudes
  6. Open Source Low rate Speech Codec Part 3 – Phase and Male Speech
  7. Open Source Low rate Speech Codec Part 4 – Zero Phase Model
  8. September 21 2010 – Slashdotted!
  9. Codec 2 – Alpha Release and Voicing
  10. Codec 2 at 1400 bit/s