GZ.Jackey

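To build a recommender, user behavior data is the foundation.

Which fields does user behavior data contain?

Mahout's DataModel supports four fields: the user ID and item ID are required, while the preference value (the user's rating of the item) and the timestamp are optional.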

userID,itemID[,preference[,timestamp]]
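
To make the line format concrete, here is a minimal sketch of parsing one such line. PrefLineParser and PrefLine are hypothetical names made up for this post, not Mahout classes, and the sketch assumes a comma delimiter and treats a blank preference or timestamp column as "not given".

```java
// Hypothetical helper, not part of Mahout: parses one line of the
// "userID,itemID[,preference[,timestamp]]" format.
final class PrefLineParser {

    /** One parsed record; preference and timestamp are null when absent or blank. */
    static final class PrefLine {
        final long userID;
        final long itemID;
        final Float preference; // null = no preference value given
        final Long timestamp;   // null = no timestamp given

        PrefLine(long userID, long itemID, Float preference, Long timestamp) {
            this.userID = userID;
            this.itemID = itemID;
            this.preference = preference;
            this.timestamp = timestamp;
        }
    }

    static PrefLine parse(String line) {
        // A limit of -1 keeps trailing empty fields, so "123,456," yields 3 tokens.
        String[] t = line.split(",", -1);
        if (t.length < 2 || t.length > 4) {
            throw new IllegalArgumentException("expected 2 to 4 fields: " + line);
        }
        long user = Long.parseLong(t[0].trim());
        long item = Long.parseLong(t[1].trim());
        Float pref = (t.length >= 3 && !t[2].trim().isEmpty())
                ? Float.valueOf(t[2].trim()) : null;
        Long ts = (t.length == 4 && !t[3].trim().isEmpty())
                ? Long.valueOf(t[3].trim()) : null;
        return new PrefLine(user, item, pref, ts);
    }

    public static void main(String[] args) {
        PrefLine p = parse("123,456,4.5,1416305880");
        System.out.println(p.userID + " " + p.itemID + " " + p.preference + " " + p.timestamp);
    }
}
```

IDs are kept as primitive long and the optional columns as nullable wrappers, purely so that "absent" can be told apart from a real value in this sketch.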

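Mahout data sources can be read from files or from a database.

Judging from the comments in FileDataModel.java, quite a lot of work went into it.

1) After the source file is updated, it is reloaded only once a certain interval has passed

2) Incremental updates are supported (no need to re-copy all the data every time)

3) A different storage structure is chosen depending on the number of fields (with or without ratings), which saves memory

In addition,

4) It implements its own data structures over primitive types, which saves memory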

~/mahout-core/src/main/java/org/apache/mahout/cf/taste/impl/common/FastIDSet.java
~/mahout-core/src/main/java/org/apache/mahout/cf/taste/impl/common/FastByIDMap.java
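Both of these custom data types use hashing for fast lookup, and they avoid Java's Long class by working directly with the primitive long type to save memory.
A class of the same kind is FastMap.java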
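
To illustrate why primitive long keys save memory, here is a minimal sketch of a hash set that stores its keys in a plain long[] instead of boxed Long objects. This is not Mahout's FastIDSet: the class name LongHashSetSketch, the linear probing scheme, the 0.5 load factor, and the reserved sentinel value are all assumptions chosen for brevity, and removal is left out.

```java
import java.util.Arrays;

// Minimal sketch of a hash set over primitive long keys.
// Hypothetical class for illustration, not Mahout's FastIDSet.
final class LongHashSetSketch {
    // Reserved marker for an empty slot; this value itself cannot be stored.
    private static final long EMPTY = Long.MIN_VALUE;

    private long[] keys = new long[16];
    private int size;

    LongHashSetSketch() {
        Arrays.fill(keys, EMPTY);
    }

    // Fold the 64-bit key into a non-negative slot index.
    private static int slot(long key, int capacity) {
        int h = (int) (key ^ (key >>> 32));
        return (h & 0x7FFFFFFF) % capacity;
    }

    /** Adds the key; returns false if it was already present. */
    boolean add(long key) {
        if (key == EMPTY) {
            throw new IllegalArgumentException("sentinel value cannot be stored");
        }
        if ((size + 1) * 2 > keys.length) {
            grow(); // keep the table at most half full so probing always terminates
        }
        int i = slot(key, keys.length);
        while (keys[i] != EMPTY) {
            if (keys[i] == key) {
                return false;
            }
            i = (i + 1) % keys.length; // linear probing
        }
        keys[i] = key;
        size++;
        return true;
    }

    boolean contains(long key) {
        int i = slot(key, keys.length);
        while (keys[i] != EMPTY) {
            if (keys[i] == key) {
                return true;
            }
            i = (i + 1) % keys.length;
        }
        return false;
    }

    int size() {
        return size;
    }

    // Double the table and re-insert every stored key.
    private void grow() {
        long[] old = keys;
        keys = new long[old.length * 2];
        Arrays.fill(keys, EMPTY);
        size = 0;
        for (long k : old) {
            if (k != EMPTY) {
                add(k);
            }
        }
    }

    public static void main(String[] args) {
        LongHashSetSketch ids = new LongHashSetSketch();
        ids.add(123L);
        ids.add(456L);
        System.out.println(ids.contains(123L) + " " + ids.size());
    }
}
```

Each key costs 8 bytes in the array, whereas a HashSet<Long> pays for a boxed Long object plus a hash node per entry; that difference is the saving the custom classes are after.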

 

* <p>
 * Incremental updates: This class will also look for update "delta" files in the same
 * directory, with file names that start the same way (up to the first period).
 * These files have the same format, and provide updated data that supersedes
 * what is in the main data file. This is a mechanism that allows an application
 * to push updates to {@link FileDataModel} without re-copying the entire data
 * file.
 * 
 * (Delta files live in the same directory and are told apart by a number.)
 * Finds update delta files in the same directory as the data file. This finds
 * any file whose name starts the same way as the data file (up to first period)
 * but isn't the data file itself. For example, if the data file is
 * /foo/data.txt.gz, you might place update files at /foo/data.1.txt.gz,
 * /foo/data.2.txt.gz, etc.
 * </p>
 *
 * <p>
 * Delete syntax (blank preference): One small format difference exists. Update files must also be
 * able to express deletes. This is done by ending with a blank preference
 * value, as in "123,456,".
 * </p>
 *
 * <p>
 * Deletes and updates cannot be mixed within one delta file: Note that it's all-or-nothing -- all of the items in the
 * file must express no preference, or the all must. These cannot be mixed. Put
 * another way there will always be the same number of delimiters on every line
 * of the file!
 * </p>
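
The delta mechanism described in the comment above can be sketched roughly as follows. This is only an illustration of the idea, not FileDataModel's code: DeltaFilesSketch, findDeltaNames and applyLine are hypothetical names, the prefix is taken to be the text before the first period (one reading of the comment), and the sketch assumes three-column files where a blank third column means delete.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the delta-file idea; not Mahout's implementation.
final class DeltaFilesSketch {

    /**
     * Returns the names that start the same way as the data file (up to its
     * first period) but are not the data file itself, in lexicographic order.
     */
    static List<String> findDeltaNames(String dataFileName, Collection<String> namesInDir) {
        int dot = dataFileName.indexOf('.');
        String prefix = dot < 0 ? dataFileName : dataFileName.substring(0, dot);
        List<String> out = new ArrayList<>();
        for (String name : namesInDir) {
            if (name.startsWith(prefix) && !name.equals(dataFileName)) {
                out.add(name);
            }
        }
        Collections.sort(out);
        return out;
    }

    /**
     * Applies one delta line to an in-memory user -> (item -> preference) map.
     * "123,456,3.0" sets or overwrites a preference; "123,456," deletes it.
     */
    static void applyLine(Map<Long, Map<Long, Float>> prefs, String line) {
        // A limit of -1 keeps the trailing empty field of a delete line.
        String[] t = line.split(",", -1);
        if (t.length < 3) {
            throw new IllegalArgumentException("this sketch expects a preference column: " + line);
        }
        long user = Long.parseLong(t[0].trim());
        long item = Long.parseLong(t[1].trim());
        String value = t[2].trim();
        if (value.isEmpty()) {
            Map<Long, Float> items = prefs.get(user);
            if (items != null) {
                items.remove(item);
                if (items.isEmpty()) {
                    prefs.remove(user);
                }
            }
        } else {
            prefs.computeIfAbsent(user, k -> new HashMap<>()).put(item, Float.valueOf(value));
        }
    }

    public static void main(String[] args) {
        Map<Long, Map<Long, Float>> prefs = new HashMap<>();
        applyLine(prefs, "123,456,3.0");
        applyLine(prefs, "123,456,");
        System.out.println(prefs);
    }
}
```

Note that plain lexicographic sorting puts data.10.txt.gz before data.2.txt.gz, so zero-padded numbers would be needed if the order of the deltas matters in this sketch.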

 

FileDataModel wraps the logic for reading from files; the actual storage is still handled by GenericDataModel.

 

 

 

A separate article covers in detail how the data is stored, so I won't elaborate on it here.

 

posted on 2014-11-18 18:18  GZ.Jackey