A paper I was reading, http://www.cs.toronto.edu/~ilya/pubs/2011/LANG-RNN.pdf, uses bits per character as a test metric for estimating the quality of generative computer models of text but doesn't reference how it was calculated. Googling around, I can't really find anything about it. Does anyone know how to calculate it? Python preferably, but pseudo-code or anything works. Thanks!
Bits per character is a measure of the performance of compression methods. It is computed by compressing a string, measuring how many bits the compressed representation takes in total, and dividing by the number of symbols (i.e. characters) in the original string. The fewer bits per character the compressed version takes, the more effective the compression method is. In other words, the authors use their generative language model, among other things, for compression, and assume that high effectiveness of the resulting compression method indicates high accuracy of the underlying generative model. In section 1 they state:
The Rissanen & Langdon (1979) article is the original description of arithmetic coding, a well-known method for text compression. Arithmetic coding operates on the basis of a generative language model, such as the one the authors have built. Given a (possibly empty) sequence of characters, the model predicts what character may come next. Humans can do that, too, given an input sequence.

This fits naturally with arithmetic coding: given an input sequence that has already been encoded, the bit sequence for the next character is determined by the probability distribution over possible next characters. Characters with high probability get a short bit sequence; characters with low probability get a longer one. Then the next character is read from the input and encoded using the bit sequence that was determined from the probability distribution. If the language model is good, the character will have been predicted with high probability, so the bit sequence will be short.

Then the compression continues with the next character, again using the input so far to establish a probability distribution of characters, determining bit sequences, and then reading the actual next character and encoding it accordingly. Note that the generative model is used in every step to establish a new probability distribution, so this is an instance of adaptive arithmetic coding.

After all input has been read and encoded, the total length (in bits) of the result is measured and divided by the number of characters in the original, uncompressed input. If the model is good, it will have predicted the characters with high accuracy, so the bit sequence used for each character will have been short on average, hence the total bits per character will be low.

Regarding ready-to-use implementations: I am not aware of an implementation of arithmetic coding that allows for easy integration of your own generative language model. Most implementations build their own adaptive model on the fly, i.e. they adjust character frequency tables as they read input. One option for you may be to start with arcode. I looked at the code, and it seems as though it may be possible to integrate your own model, although it's not very easy.
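The procedure described above can be sketched without a real arithmetic coder. An ideal arithmetic coder spends about -log2(p) bits on a character the model predicted with probability p (plus a small overhead for the whole message), so the metric can be approximated by averaging -log2(p) over the text. The sketch below is only an illustration under stated assumptions: the function name bits_per_character, the alphabet_size parameter, and the add-one smoothed character-count model are all hypothetical stand-ins, not the model or the exact procedure from the paper.

```python
import math
from collections import Counter


def bits_per_character(text, alphabet_size=256):
    """Average ideal code length in bits per character.

    Assumption: the "model" is a simple adaptive one, add-one smoothed
    character counts over the text seen so far, with at most
    `alphabet_size` distinct characters. Replace the line computing `p`
    with your own model's P(next char | history).
    """
    if not text:
        raise ValueError("text must not be empty")
    counts = Counter()
    total_bits = 0.0
    for seen, ch in enumerate(text):
        # Probability the model assigned to the character that actually came next.
        p = (counts[ch] + 1) / (seen + alphabet_size)
        # Ideal code length for this character under the model.
        total_bits += -math.log2(p)
        # Adapt the model only after the character has been "encoded".
        counts[ch] += 1
    return total_bits / len(text)


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog " * 50
    print(f"{bits_per_character(sample):.3f} bits per character")
```

With a better model the probabilities of the characters that actually occur are higher, so the average drops. The very first character of any text costs exactly 8 bits here, because the untrained model spreads its probability evenly over 256 symbols.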
The sys module has a getsizeof() function; this may be helpful? http://docs.python.org/dev/library/sys
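If the goal is to measure the size of a compressed string, a rough sketch follows. It assumes zlib (DEFLATE) as a stand-in compressor, which is not arithmetic coding and not the paper's model, and the function name zlib_bits_per_character is made up for illustration. Note that sys.getsizeof() reports the in-memory size of the Python object, including interpreter overhead, whereas len() on a bytes object gives the payload size in bytes.

```python
import sys
import zlib


def zlib_bits_per_character(text):
    """Compressed size in bits divided by the number of input characters.

    Assumption: zlib at its highest compression level is used as a
    generic compressor; the result includes zlib's header and checksum.
    """
    if not text:
        raise ValueError("text must not be empty")
    compressed = zlib.compress(text.encode("utf-8"), 9)
    return len(compressed) * 8 / len(text)


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog " * 100
    compressed = zlib.compress(sample.encode("utf-8"), 9)
    print(f"{zlib_bits_per_character(sample):.3f} bits per character")
    # Payload bytes versus the size of the Python bytes object itself.
    print(len(compressed), sys.getsizeof(compressed))
```

For short inputs the fixed zlib overhead dominates, so the figure is only meaningful on reasonably long texts.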
CHAR_BIT (tigcc.ticalc.org/doc/limits.html#CHAR_BIT)? – woozyking Jul 23 '13 at 0:31