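Implementing Three Bandit Algorithms in Python
I have been reading about recommender systems lately and came across the basic ideas behind several Bandit algorithms. I could not find a good code implementation online, so I implemented the three classic algorithms from those articles myself.
I won't go into the details of the algorithms here; the following blog posts cover them: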
https://blog.csdn.net/z1185196212/article/details/53374194
https://blog.csdn.net/dengxing1234/article/details/73188731
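e_greedy algorithm:
With probability epsilon, pick an arm at random (explore); with probability 1 - epsilon, pick the arm with the highest estimated reward so far (exploit).
import numpy as np

T = 1000       # number of rounds
N = 10         # number of arms
epsilon = 0.1  # exploration probability

true_award = np.random.uniform(0, 1, N)  # true payout probability of each arm
estimated_award = np.zeros(N)            # running mean reward of each arm
item_count = np.zeros(N)                 # number of times each arm was pulled

def e_greedy():
    if np.random.random() < epsilon:
        item = np.random.randint(N)             # explore: uniformly random arm
    else:
        item = int(np.argmax(estimated_award))  # exploit: best arm so far
    award = np.random.binomial(n=1, p=true_award[item])
    return item, award

total_award = 0
for t in range(T):
    item, award = e_greedy()
    total_award += award
    item_count[item] += 1
    # incremental mean, so the estimate is a valid average at every step
    estimated_award[item] += (award - estimated_award[item]) / item_count[item]

print(true_award)
print(estimated_award)
print(total_award)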
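Each arm's estimate is simply the mean of the rewards observed on that arm. As a minimal sketch, that mean can be maintained one reward at a time without storing the history. The helper name `update_mean` and the sample rewards are made up for illustration.

```python
import numpy as np

def update_mean(old_mean, count, reward):
    """Return the mean after adding `reward` to `count` earlier observations."""
    return old_mean + (reward - old_mean) / (count + 1)

# Hypothetical 0/1 rewards observed on a single arm.
rewards = [1, 0, 0, 1, 1, 0, 1, 1]

mean = 0.0
for count, reward in enumerate(rewards):
    mean = update_mean(mean, count, reward)

batch_mean = float(np.mean(rewards))
print(mean, batch_mean)  # the two values agree up to floating-point error
```

Keeping a true running average matters for e_greedy in particular: if the greedy step compared raw reward sums instead of averages, arms that were pulled more often would look better than they are.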
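Thompson Sampling algorithm:
For each arm, draw a random number from beta(win[arm] + 1, lose[arm] + 1), and choose the arm with the largest draw as this round's arm.
import numpy as np

T = 1000  # number of rounds
N = 10    # number of arms

true_award = np.random.uniform(0, 1, N)  # true payout probability of each arm
win = np.zeros(N)   # rounds in which each arm paid out
lose = np.zeros(N)  # rounds in which each arm did not pay out

def Thompson_sampling():
    # one draw per arm from beta(win + 1, lose + 1)
    arm_prob = [np.random.beta(win[i] + 1, lose[i] + 1) for i in range(N)]
    item = np.argmax(arm_prob)
    reward = np.random.binomial(n=1, p=true_award[item])
    return item, reward

total_reward = 0
for t in range(T):
    item, reward = Thompson_sampling()
    if reward == 1:
        win[item] += 1
    else:
        lose[item] += 1
    total_reward += reward

# observed win rate per arm; arms that were never pulled stay at 0
pulls = win + lose
estimated_award = np.divide(win, pulls, out=np.zeros(N), where=pulls > 0)

print(true_award)
print(estimated_award)
print(total_reward)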
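To see why this sampling rule explores less over time, it helps to look at how the beta(win + 1, lose + 1) distribution tightens as an arm accumulates pulls. The sketch below is illustrative only: the helper `beta_summary` and the win/lose counts are invented, and the mean and standard deviation come from the standard closed-form Beta formulas.

```python
import numpy as np

def beta_summary(win, lose):
    """Mean and standard deviation of Beta(win + 1, lose + 1)."""
    a, b = win + 1, lose + 1
    mean = a / (a + b)
    std = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, std

rng = np.random.default_rng(0)  # fixed seed so the samples are repeatable

# Same 70% win rate, observed over more and more pulls (made-up counts).
for win, lose in [(0, 0), (7, 3), (70, 30), (700, 300)]:
    mean, std = beta_summary(win, lose)
    samples = rng.beta(win + 1, lose + 1, size=5)
    print(f"win={win:3d} lose={lose:3d} mean={mean:.3f} std={std:.3f} "
          f"samples={np.round(samples, 2)}")
```

An arm with few pulls has a wide distribution, so its draws occasionally come out on top and it gets tried again; a well-explored arm is judged almost entirely by its observed win rate.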
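UCB algorithm:
Keep adjusting the estimates: the true probability p is estimated by the observed probability p' plus an error term delta, and each round the arm with the largest p' + delta is chosen. In the code below, delta = sqrt(2 * ln(t) / n), where t is the current round and n is the number of times the arm has been chosen.
import numpy as np

T = 1000  # number of rounds
N = 10    # number of arms

# true payout probability of each arm
true_award = np.random.uniform(low=0, high=1, size=N)
estimated_award = np.zeros(N)  # running mean reward of each arm
choose_count = np.zeros(N)     # number of times each arm was pulled
total_award = 0

def cal_delta(t, item):
    if choose_count[item] == 0:
        # unpulled arm: infinite bonus so that every arm is tried at least once
        return np.inf
    return np.sqrt(2 * np.log(t) / choose_count[item])

def UCB(t, N):
    upper_bound_probs = [estimated_award[item] + cal_delta(t, item) for item in range(N)]
    item = np.argmax(upper_bound_probs)
    reward = np.random.binomial(n=1, p=true_award[item])
    return item, reward

for t in range(1, T + 1):
    item, reward = UCB(t, N)
    total_award += reward
    # incremental update of the arm's mean reward
    estimated_award[item] = (choose_count[item] * estimated_award[item] + reward) / (choose_count[item] + 1)
    choose_count[item] += 1

print(true_award)
print(estimated_award)
print(total_award)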
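Because the three scripts above each draw their own random arms, their totals are not directly comparable. Below is a rough sketch of one way to run all three on the same arms. The function names, the shared `(win, count, t, rng)` signature, the `run` helper, and the fixed seeds are assumptions of this sketch rather than part of the scripts above; here UCB simply tries every arm once before applying its bound.

```python
import numpy as np

def e_greedy(win, count, t, rng, epsilon=0.1):
    """Explore with probability epsilon, otherwise pick the best observed mean."""
    if rng.random() < epsilon:
        return int(rng.integers(len(win)))
    mean = win / np.maximum(count, 1)  # unpulled arms get a mean of 0
    return int(np.argmax(mean))

def thompson(win, count, t, rng):
    """Sample each arm from Beta(win + 1, lose + 1) and pick the largest draw."""
    return int(np.argmax(rng.beta(win + 1, count - win + 1)))

def ucb(win, count, t, rng):
    """Pick the arm with the largest mean + sqrt(2 ln t / n) bound."""
    if np.any(count == 0):
        return int(np.argmin(count))  # try every arm once first
    return int(np.argmax(win / count + np.sqrt(2 * np.log(t) / count)))

def run(policy, true_award, T, seed):
    """Play T rounds with `policy`; return total reward and pull counts."""
    rng = np.random.default_rng(seed)
    n = len(true_award)
    win = np.zeros(n)    # payouts seen per arm
    count = np.zeros(n)  # pulls per arm
    total = 0
    for t in range(1, T + 1):
        item = policy(win, count, t, rng)
        reward = int(rng.random() < true_award[item])  # Bernoulli payout
        win[item] += reward
        count[item] += 1
        total += reward
    return total, count

T, N = 1000, 10
true_award = np.random.default_rng(42).uniform(0, 1, N)  # shared by all policies
results = {}
for policy in (e_greedy, thompson, ucb):
    total, count = run(policy, true_award, T, seed=0)
    results[policy.__name__] = (total, count)
    best_pulls = int(count[np.argmax(true_award)])
    print(f"{policy.__name__:10s} total={total} best-arm pulls={best_pulls}")
```

A single run is noisy, so averaging the totals over several seeds gives a fairer picture than one comparison.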