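Any recommender starts from user behavior data.
Which fields does that data contain?
Mahout's DataModel supports four fields: a user ID and an item ID, which are required, plus an optional preference value (the user's rating of the item) and an optional timestamp.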
userID,itemID[,preference[,timestamp]]
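To make the line format concrete, here is a minimal sketch of a parser for it. This is not Mahout's own parsing code: the class names PreferenceRecord and PreferenceLineParser are hypothetical, and representing a missing or blank column as null is my assumption.

```java
/** Hypothetical holder for one parsed line; not a Mahout class. */
final class PreferenceRecord {
    final long userID;
    final long itemID;
    final Float preference; // null when the column is absent or blank (assumption)
    final Long timestamp;   // null when the column is absent or blank (assumption)

    PreferenceRecord(long userID, long itemID, Float preference, Long timestamp) {
        this.userID = userID;
        this.itemID = itemID;
        this.preference = preference;
        this.timestamp = timestamp;
    }
}

/** Sketch of a parser for "userID,itemID[,preference[,timestamp]]". */
final class PreferenceLineParser {
    static PreferenceRecord parse(String line) {
        // A limit of -1 keeps trailing empty fields, so "123,456," yields three fields.
        String[] fields = line.split(",", -1);
        if (fields.length < 2 || fields.length > 4) {
            throw new IllegalArgumentException("expected 2 to 4 fields: " + line);
        }
        long userID = Long.parseLong(fields[0].trim());
        long itemID = Long.parseLong(fields[1].trim());
        Float preference = null;
        if (fields.length >= 3 && !fields[2].trim().isEmpty()) {
            preference = Float.valueOf(fields[2].trim());
        }
        Long timestamp = null;
        if (fields.length == 4 && !fields[3].trim().isEmpty()) {
            timestamp = Long.valueOf(fields[3].trim());
        }
        return new PreferenceRecord(userID, itemID, preference, timestamp);
    }
}
```

With this sketch, "1,101" parses to a record with only the two IDs, while "1,101,4.5,1300000000" fills in all four fields.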
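Mahout can load this data from files or from a database.
Judging from the comments in FileDataModel.java, a fair amount of work went into it:
1) After the source file changes, it is reloaded only once a minimum interval has elapsed
2) Incremental updates are supported (no need to re-copy all the data each time)
3) A different storage structure is chosen depending on the number of fields (with or without ratings), which saves memory
In addition,
4) Mahout implements its own data structures over primitive types, which also saves memory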
~/mahout-core/src/main/java/org/apache/mahout/cf/taste/impl/common/FastIDSet.java
~/mahout-core/src/main/java/org/apache/mahout/cf/taste/impl/common/FastByIDMap.java
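Both of these are Mahout's own implementations. They use hashing for fast lookup, and they store raw primitive long values instead of java.lang.Long objects, which saves memory.
FastMap.java is another class of the same kind.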
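To illustrate why primitive storage helps, here is a simplified sketch of a hash set that keeps its keys in a plain long[] array, so no Long object is allocated per element. It is not the FastIDSet source: the class name LongHashSetSketch, the linear probing, the 0.5 load factor, and the choice of Long.MIN_VALUE as the "empty slot" marker are all assumptions made for the example.

```java
import java.util.Arrays;

/** Simplified open-addressing set of primitive longs; an illustration, not Mahout code. */
final class LongHashSetSketch {
    // Sentinel marking an empty slot; it therefore cannot be stored as a key.
    private static final long EMPTY = Long.MIN_VALUE;

    private long[] keys;
    private int size;

    LongHashSetSketch(int expectedSize) {
        int capacity = 16;
        while (capacity < expectedSize * 2) {
            capacity <<= 1; // keep capacity a power of two so masking works
        }
        keys = new long[capacity];
        Arrays.fill(keys, EMPTY);
    }

    /** Adds the key; returns false if it was already present. */
    boolean add(long key) {
        if (key == EMPTY) {
            throw new IllegalArgumentException("sentinel value is reserved");
        }
        if ((size + 1) * 2 > keys.length) {
            grow(); // keep the table at most half full
        }
        int i = indexFor(key, keys.length);
        while (keys[i] != EMPTY) {
            if (keys[i] == key) {
                return false;
            }
            i = (i + 1) & (keys.length - 1); // linear probing
        }
        keys[i] = key;
        size++;
        return true;
    }

    boolean contains(long key) {
        if (key == EMPTY) {
            return false;
        }
        int i = indexFor(key, keys.length);
        while (keys[i] != EMPTY) {
            if (keys[i] == key) {
                return true;
            }
            i = (i + 1) & (keys.length - 1);
        }
        return false;
    }

    int size() {
        return size;
    }

    /** Doubles the table and re-inserts every stored key. */
    private void grow() {
        long[] old = keys;
        keys = new long[old.length * 2];
        Arrays.fill(keys, EMPTY);
        size = 0;
        for (long k : old) {
            if (k != EMPTY) {
                add(k);
            }
        }
    }

    private static int indexFor(long key, int capacity) {
        int h = (int) (key ^ (key >>> 32)); // fold 64 bits into 32
        h ^= (h >>> 16);
        return h & (capacity - 1);
    }
}
```

A java.util.HashSet<Long> pays for a boxed Long plus an entry object per element, whereas a structure like this one pays for one array slot, which is the kind of saving the Mahout classes are after.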
The relevant parts of the FileDataModel Javadoc, with my notes:

How incremental updates work:
This class will also look for update "delta" files in the same directory, with file names that start the same way (up to the first period). These files have the same format, and provide updated data that supersedes what is in the main data file. This is a mechanism that allows an application to push updates to FileDataModel without re-copying the entire data file.

The delta files live in the same directory and are told apart by a number in the name:
Finds update delta files in the same directory as the data file. This finds any file whose name starts the same way as the data file (up to first period) but isn't the data file itself. For example, if the data file is /foo/data.txt.gz, you might place update files at /foo/data.1.txt.gz, /foo/data.2.txt.gz, etc.

The syntax for a delete is a blank preference value:
One small format difference exists. Update files must also be able to express deletes. This is done by ending with a blank preference value, as in "123,456,".

Within a single update file, deletes and updates cannot be mixed:
Note that it's all-or-nothing -- all of the items in the file must express no preference, or the all must. These cannot be mixed. Put another way there will always be the same number of delimiters on every line of the file!
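The file-matching rule described in the Javadoc can be sketched as follows. This is my own illustration rather than the FileDataModel code: the class DeltaFileFinder and its methods are hypothetical, keeping the period as part of the prefix is my reading of "up to the first period", and sorting the matches by path is an assumption about ordering.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Sketch of locating update ("delta") files next to a data file; not Mahout code. */
final class DeltaFileFinder {

    /** True if candidateName shares the data file's prefix but is not the data file itself. */
    static boolean isUpdateFile(String dataFileName, String candidateName) {
        int period = dataFileName.indexOf('.');
        // Assumption: the prefix includes the first period, so "data.txt" does not match "database.txt".
        String prefix = period < 0 ? dataFileName : dataFileName.substring(0, period + 1);
        return candidateName.startsWith(prefix) && !candidateName.equals(dataFileName);
    }

    /** Lists update files in the data file's directory, sorted by path (ordering is an assumption). */
    static List<File> findUpdateFiles(File dataFile) {
        List<File> result = new ArrayList<>();
        File dir = dataFile.getAbsoluteFile().getParentFile();
        File[] siblings = dir == null ? null : dir.listFiles();
        if (siblings == null) {
            return result; // directory missing or unreadable
        }
        for (File f : siblings) {
            if (f.isFile() && isUpdateFile(dataFile.getName(), f.getName())) {
                result.add(f);
            }
        }
        Collections.sort(result); // File is Comparable<File>
        return result;
    }
}
```

For a data file named data.txt.gz, this treats data.1.txt.gz and data.2.txt.gz as update files and ignores data.txt.gz itself as well as unrelated files in the directory.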
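FileDataModel only wraps the reading of data from a file; the actual in-memory storage is handled by GenericDataModel.
A separate article covers in detail how the data is held in memory, so I won't go further into it here.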