Mahout Study Notes 2
Representing preference data
1.The Preference Object
One object represents one user’s preference for one item. Preference is an interface, and the implementation you’re most likely to use is GenericPreference. For example, the following expression creates a representation of user 123’s preference value of 3.0 for item 456: new GenericPreference(123, 456, 3.0f)
2.PreferenceArray
PreferenceArray is an interface whose implementations represent a collection of preferences with an array-like API.
3.FastByIDMap and FastIDSet
Like HashMap, FastByIDMap is hash-based. It uses linear probing, rather than separate chaining, to handle hash collisions. This avoids the need for an additional Map.Entry object per entry; Objects consume a surprising amount of memory.
Keys and members are always long primitives in Mahout recommenders, not Objects. Using long keys saves memory and improves performance.
FastIDSet, the Set counterpart, isn’t implemented using a Map underneath (unlike java.util.HashSet, which wraps a HashMap).
FastByIDMap can act like a cache, because it has a notion of maximum size; beyond this size, infrequently used entries will be removed when new ones are added.
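The collision handling described above can be sketched in plain Java. The class below is an illustrative toy, not Mahout's FastByIDMap: the name LongKeyProbingMapSketch, the Long.MIN_VALUE sentinel, and the grow-at-half-full policy are all assumptions made for this example, and it leaves out removal and the maximum-size cache behavior. It only shows how primitive long keys in parallel arrays make a per-entry Map.Entry object unnecessary.

```java
import java.util.Arrays;

/** Toy open-addressing map with long keys; a sketch, not Mahout's implementation. */
final class LongKeyProbingMapSketch<V> {
    // Assumed sentinel marking an empty slot; this one key value cannot be stored.
    private static final long EMPTY = Long.MIN_VALUE;

    private long[] keys;      // primitive keys, no boxing
    private Object[] values;  // values live in a parallel array, no Map.Entry objects
    private int size;

    LongKeyProbingMapSketch(int capacity) {
        int initial = Math.max(2, capacity);
        keys = new long[initial];
        values = new Object[initial];
        Arrays.fill(keys, EMPTY);
    }

    /** Starts at the key's home slot and steps forward until the key or an empty slot is found. */
    private int findSlot(long key) {
        int index = (int) Math.floorMod(key, (long) keys.length);
        while (keys[index] != EMPTY && keys[index] != key) {
            index = (index + 1) % keys.length; // linear probing: try the next slot
        }
        return index;
    }

    void put(long key, V value) {
        if (key == EMPTY) {
            throw new IllegalArgumentException("key is reserved as the empty marker");
        }
        if ((size + 1) * 2 > keys.length) {
            grow(); // keep the table at most half full so probing always terminates
        }
        int slot = findSlot(key);
        if (keys[slot] == EMPTY) {
            keys[slot] = key;
            size++;
        }
        values[slot] = value;
    }

    @SuppressWarnings("unchecked")
    V get(long key) {
        int slot = findSlot(key);
        return keys[slot] == key ? (V) values[slot] : null;
    }

    int size() {
        return size;
    }

    /** Doubles the arrays and re-inserts every stored key at its new position. */
    private void grow() {
        long[] oldKeys = keys;
        Object[] oldValues = values;
        keys = new long[oldKeys.length * 2];
        values = new Object[keys.length];
        Arrays.fill(keys, EMPTY);
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != EMPTY) {
                int slot = findSlot(oldKeys[i]);
                keys[slot] = oldKeys[i];
                values[slot] = oldValues[i];
            }
        }
    }

    public static void main(String[] args) {
        LongKeyProbingMapSketch<String> map = new LongKeyProbingMapSketch<>(4);
        map.put(1L, "a");
        map.put(5L, "b"); // collides with key 1 in a 4-slot table, so it lands one slot later
        System.out.println(map.get(5L)); // b
    }
}
```

Keys 1 and 5 share a home slot while the table has four slots, so the second insert shows the probe step; a chained HashMap would allocate a linked entry object there instead.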
In-memory DataModels
The abstraction that encapsulates recommender input data in Mahout is DataModel. Implementations of DataModel provide efficient access to data required by various recommender algorithms.
1.GenericDataModel
The simplest DataModel implementation available is an in-memory implementation, GenericDataModel. It’s appropriate when you want to construct your data representation in memory, programmatically, rather than base it on an existing external source of data, such as a file or relational database. It simply accepts preferences as inputs, in the form of a FastByIDMap that maps user IDs to PreferenceArrays with data for those users.
2.File-based data
You won’t typically use GenericDataModel directly. Instead, you’ll likely encounter it via FileDataModel, which reads data from a file and stores the resulting preference data in memory, in a GenericDataModel.
3.Refreshable components
An object that implements the Refreshable interface reloads its state only when it is asked to refresh; it does not refresh automatically or periodically.
4.Update files
FileDataModel supports update files.
5.Database-based data
It’s possible to store and access preference data from a relational database; Mahout supports this.
6.JDBC and MySQL
Preference data is accessed via JDBC, using implementations of JDBCDataModel. At the moment, the primary subclass of JDBCDataModel is one written for use with MySQL 5.x: MySQLJDBCDataModel.
By default, the implementation assumes that all preference data exists in a table called taste_preferences, with a column for user IDs named user_id, a column for item IDs named item_id, and a column for preference values named preference. This schema is illustrated in table 3.1. This table could also contain a field called timestamp, whose type should be compatible with Java’s long type.
7.Configuring via JNDI
8.Configuring programmatically
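// Imports assume MySQL Connector/J 5.x and Mahout's Taste packages.
import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;
import org.apache.mahout.cf.taste.impl.model.jdbc.MySQLJDBCDataModel;
import org.apache.mahout.cf.taste.model.JDBCDataModel;

// Describe the MySQL connection.
MysqlDataSource dataSource = new MysqlDataSource();
dataSource.setServerName("my_database_host");
dataSource.setUser("my_user");
dataSource.setPassword("my_password");
dataSource.setDatabaseName("my_database_name");
// Point the model at a custom table and column names instead of the defaults.
JDBCDataModel dataModel = new MySQLJDBCDataModel(
    dataSource, "my_prefs_table", "my_user_column",
    "my_item_column", "my_pref_value_column");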
Reposted from http://www.douban.com/note/204399134/
Coping without preference values
Some data carries no preference value, for example whether a user viewed a news article (a Boolean signal).
1.When to ignore values
2.In-memory representations without preference values
GenericBooleanPrefDataModel
3.Selecting compatible implementations
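PearsonCorrelationSimilarity and EuclideanDistanceSimilarity are not suitable for Boolean preferences.
When the vectors contain only 0s and 1s, the distance calculation may yield no result at all, or a result that carries no meaning (for example, when every value in both data sets is 1, or every value is 0).
LogLikelihoodSimilarity avoids this problem because it does not compute similarity from the actual preference values.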
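The failure mode above can be reproduced in a few lines of plain Java. This is a sketch under stated assumptions, not Mahout's code: it assumes every known preference is represented as 1.0, the class and method names are invented for the example, and the mapping of the log-likelihood ratio to a similarity via 1 - 1/(1 + LLR) is one common convention assumed here. The log-likelihood ratio itself is the standard G-squared statistic on a 2x2 table of counts: items both users have, items only the first has, items only the second has, and items neither has.

```java
/** Sketch: why value-based metrics break on all-1.0 data, and a count-based alternative. */
final class BooleanPreferenceSimilaritySketch {

    /** Pearson correlation of two equal-length vectors; NaN when either has zero variance. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sumXY = 0.0;
        double sumXX = 0.0;
        double sumYY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sumXY += dx * dy;
            sumXX += dx * dx;
            sumYY += dy * dy;
        }
        return sumXY / Math.sqrt(sumXX * sumYY); // 0.0 / 0.0 is NaN for constant vectors
    }

    /** Euclidean distance of two equal-length vectors. */
    static double euclideanDistance(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    /** Unnormalized entropy: N ln N minus the sum of c ln c over the counts. */
    private static double entropy(long... counts) {
        long total = 0;
        double sumOfTerms = 0.0;
        for (long c : counts) {
            sumOfTerms += xLogX(c);
            total += c;
        }
        return xLogX(total) - sumOfTerms;
    }

    /** G-squared log-likelihood ratio for the 2x2 table [[k11, k12], [k21, k22]]. */
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point rounding.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    /** Assumed mapping of the ratio into [0, 1); larger means more similar. */
    static double similarityFromLlr(double llr) {
        return 1.0 - 1.0 / (1.0 + llr);
    }

    public static void main(String[] args) {
        double[] userA = {1.0, 1.0, 1.0};
        double[] userB = {1.0, 1.0, 1.0};
        System.out.println("Pearson: " + pearson(userA, userB));             // NaN
        System.out.println("Euclidean: " + euclideanDistance(userA, userB)); // 0.0
        // 10 items in common, none unique to either user, 10 items seen by neither.
        double llr = logLikelihoodRatio(10, 0, 0, 10);
        System.out.println("LLR similarity: " + similarityFromLlr(llr));
    }
}
```

With all values equal to 1.0, Pearson has nothing to correlate and the Euclidean distance is 0 for every pair of users, so neither can rank neighbors. The count-based ratio still separates users whose item sets overlap more than chance from those whose sets do not.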
Computing similar neighbors
1.Fixed size
NearestNUserNeighborhood: gives poor results for outliers.
2.Threshold
ThresholdUserNeighborhood: suits sparse data and handles outliers well.
Similarity values range over [-1,1]; the larger the value, the more similar the users.
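The difference between the two strategies can be sketched in plain Java. The class below is a hypothetical illustration, not Mahout's NearestNUserNeighborhood or ThresholdUserNeighborhood: the names, the map-based input, and the choice of an inclusive threshold are assumptions made for the example. It takes precomputed similarities between one target user and the other users.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the two neighborhood strategies over precomputed similarities. */
final class NeighborhoodSketch {

    /** Fixed size: keeps the n most similar users, however low their similarity is. */
    static List<Long> nearestN(Map<Long, Double> similarityByUser, int n) {
        List<Map.Entry<Long, Double>> entries = new ArrayList<>(similarityByUser.entrySet());
        entries.sort(Map.Entry.<Long, Double>comparingByValue().reversed());
        List<Long> neighbors = new ArrayList<>();
        for (Map.Entry<Long, Double> entry : entries.subList(0, Math.min(n, entries.size()))) {
            neighbors.add(entry.getKey());
        }
        return neighbors;
    }

    /** Threshold: keeps every user whose similarity is at least the threshold. */
    static List<Long> aboveThreshold(Map<Long, Double> similarityByUser, double threshold) {
        List<Long> neighbors = new ArrayList<>();
        for (Map.Entry<Long, Double> entry : similarityByUser.entrySet()) {
            if (entry.getValue() >= threshold) {
                neighbors.add(entry.getKey());
            }
        }
        return neighbors;
    }

    public static void main(String[] args) {
        // Hypothetical similarities of users 2, 3 and 4 to the target user.
        Map<Long, Double> similarities = new LinkedHashMap<>();
        similarities.put(2L, 0.9);
        similarities.put(3L, 0.8);
        similarities.put(4L, -0.2);
        System.out.println(nearestN(similarities, 3));         // [2, 3, 4]
        System.out.println(aboveThreshold(similarities, 0.7)); // [2, 3]
    }
}
```

With n = 3 the fixed-size neighborhood is forced to include user 4, whose similarity is negative, while the threshold version simply returns a smaller neighborhood. That is the sense in which the threshold approach copes better with outliers.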