Comprehensive Practice in Data Collection and Fusion Technology

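Course this project belongs to	https://edu.cnblogs.com/campus/fzu/2024DataCollectionandFusiontechnology/
Team name and project overview	Team name: 都给爷爬. Project goal: provide personalized music therapy for people with mental illness. Project need: most music apps on the market require a paid membership and are affected by chart manipulation, so they cannot recommend music in a fully personalized way. We want our personalized music system to bring music therapy to people with mental illness, which is why we chose this public-welfare project. Technical approach: Python (Django, PyTorch, TensorFlow), web crawling, MySQL.
Team member student IDs	102202135 102202146 102202127 102202125 102202139 102202109 102202128
Goal of this project	We want to offer free, public-welfare music therapy software for patients with mental illness. Music is recommended according to what patients and their doctors ask for, so that patients get a reasonably comfortable music therapy experience.
Other references	[1] https://github.com/marl/openl3 [2] https://eva.fing.edu.uy/pluginfile.php/524749/mod_folder/content/0/BERT Pre-training of Deep Bidirectional Transformers for Language Understanding.pdf [3] https://link.springer.com/chapter/10.1007/978-1-4842-6168-2_6 [4] https://blog.csdn.net/weixin_42645636/article/details/135777479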

I. Project Overview

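Gitee link
 Project public address
This project is a free, public-welfare music player for people with mental disorders, built on intelligent recommendation algorithms. We use crawlers to collect songs and playlists from music platforms, so that users have a wide and varied catalog to choose from. We extract audio features with librosa, text features with BERT, and image features with ResNet50. We also connect a large language model that analyzes the user's current state and suggests suitable types of music. Combined with the user's behavioral features, these signals form our collaborative filtering recommendation algorithm, which delivers personalized music recommendations.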
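One simple way to combine per-modality features is to normalize each vector and concatenate them, then compare songs by cosine similarity. The sketch below illustrates that idea only; the function names and the toy vectors are assumptions made for illustration and stand in for real librosa, BERT, and ResNet50 outputs. It is not the project's actual fusion code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return list(vec) if norm == 0 else [v / norm for v in vec]


def fuse_features(audio, text, image):
    """Concatenate the three modality vectors after normalizing each one,
    so that no single modality dominates because of its scale."""
    return l2_normalize(audio) + l2_normalize(text) + l2_normalize(image)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 0.0 if norm_a == 0 or norm_b == 0 else dot / (norm_a * norm_b)


# Toy vectors (made up) standing in for audio, text, and image features.
song_a = fuse_features([0.2, 0.8], [1.0, 0.0, 0.5], [0.3, 0.3])
song_b = fuse_features([0.25, 0.75], [0.9, 0.1, 0.4], [0.3, 0.35])
song_c = fuse_features([0.9, 0.1], [0.0, 1.0, 0.0], [0.8, 0.0])

# song_b was built to resemble song_a, song_c to differ from it.
print(cosine_similarity(song_a, song_b) > cosine_similarity(song_a, song_c))
```

Normalizing before concatenation is a design choice: without it, a modality with large raw values would dominate the similarity.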

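II. Early-Stage Team Work

Settling on a topic

At our first meeting we discussed which project the team should develop.
Team members took an active part in the discussion and proposed the following topics:

An intelligent chat assistant (we wanted to crawl posts from the 弱智吧 forum as training samples to obtain a contrarian AI).
A tool that integrates news text, images, and video to quickly analyze current hot events and generate multimodal news summaries.
A tool that diagnoses a pet's health and gives care advice based on pet photos, videos, and the owner's description.
A music recommendation platform that crawls audio sources from other music platforms to build our own music platform.
In the end, given everyone's overall skills and the tight schedule, we chose the music recommendation system.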

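Requirements analysis

After settling on the topic, the team analyzed the requirements. We found that most music platforms on the market require a paid membership and are affected by chart manipulation, so they cannot offer truly personalized recommendations. The team members also wanted a free music platform for themselves.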

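Division of work

We split the project into five modules: front end, back end, recommendation algorithm, crawler, and cloud platform.
Each module was handled by different team members, and the team leader coordinated communication and integration between modules.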

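Progress report

At the mid-project progress report, following our teacher's advice, we upgraded the music recommendation platform into one dedicated to serving patients with mental disorders. We upgraded the recommendation algorithm accordingly, which led to the final result.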

III. Project Architecture

See the blog of team member 102202146 for details.

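IV. Individual Contribution

I was mainly responsible for the early-stage algorithm (a recommendation algorithm based on user behavior features and song names, that is, two-tower collaborative filtering), for the cloud deployment, for coordinating the interaction between the front end and back end, and for integrating the other modules (crawler, algorithms, and so on) into the project.

Two-tower collaborative filtering algorithm

This part of the project aims to build a personalized music recommender based on users' implicit feedback (favorites, comments, and play history) and a two-tower recommendation model (Two-Tower Model). The two-tower model embeds users and items (songs) separately and uses implicit-feedback features to predict a user's interest in songs they have not yet heard. The approach can work with users' explicit rating data and also make full use of their implicit feedback, which is intended to improve recommendation accuracy.

Implicit feedback generation function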

import numpy as np

# CollectSong, CommentSong and HistorySong are the project's Django models;
# import them from the app's models module.


def generate_implicit_feedback(user_id):
    # Fetch the user's behavior data: favorites, comments and play history
    collected_songs = {x.song_id for x in CollectSong.objects.filter(user_id=user_id)}
    commented_songs = {x.song_id for x in CommentSong.objects.filter(user_id=user_id)}
    history_records = HistorySong.objects.filter(user_id=user_id)
    play_counts = {x.song_id: x.play_count for x in history_records}

    # Dynamic behavior weights that depend on how active the user is
    weight_collect = 3 if len(collected_songs) > 5 else 2
    weight_comment = 2 if len(commented_songs) > 3 else 1
    weight_play = 1 if len(play_counts) < 50 else 0.5

    # Compute the implicit feedback score for every song the user touched
    implicit_feedback = {}
    all_song_ids = collected_songs | commented_songs | set(play_counts)

    for song_id in all_song_ids:
        score = 0
        if song_id in collected_songs:
            score += weight_collect
        if song_id in commented_songs:
            score += weight_comment
        if song_id in play_counts:
            # Log-scale play counts so heavy repeat plays do not dominate
            score += weight_play * np.log(1 + play_counts[song_id])

        implicit_feedback[song_id] = score

    # Normalize scores to [0, 1]; fall back to 1 if there are no scores or all are 0
    max_score = max(implicit_feedback.values(), default=0) or 1
    implicit_feedback_normalized = {k: v / max_score for k, v in implicit_feedback.items()}

    return implicit_feedback_normalized
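To see what the scoring produces, the same weighting and normalization can be run on plain Python data without a database. This is a minimal sketch: `score_songs` and the sample user are hypothetical, and the zero-maximum guard is an added assumption to avoid division by zero.

```python
import math


def score_songs(collected, commented, play_counts):
    """Mirror the weighting above on plain sets and dicts (no Django models)."""
    weight_collect = 3 if len(collected) > 5 else 2
    weight_comment = 2 if len(commented) > 3 else 1
    weight_play = 1 if len(play_counts) < 50 else 0.5

    scores = {}
    for song_id in set(collected) | set(commented) | set(play_counts):
        score = 0.0
        if song_id in collected:
            score += weight_collect
        if song_id in commented:
            score += weight_comment
        if song_id in play_counts:
            score += weight_play * math.log(1 + play_counts[song_id])
        scores[song_id] = score

    # Guard against an empty or all-zero score table before dividing
    max_score = max(scores.values(), default=0) or 1
    return {song_id: s / max_score for song_id, s in scores.items()}


# Hypothetical user: favorited songs 1 and 2, commented on song 2,
# played song 2 nine times and song 3 once.
feedback = score_songs(collected={1, 2}, commented={2}, play_counts={2: 9, 3: 1})
for song_id, value in sorted(feedback.items()):
    print(song_id, round(value, 3))
```

Song 2 combines all three behaviors, so it receives the maximum normalized score of 1.0, while song 3, which was only played once, ranks lowest.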

Two-tower model

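import numpy as np
from tensorflow.keras.layers import Concatenate, Dense, Embedding, Flatten, Input
from tensorflow.keras.models import Model

# RateSong and Song are the project's Django models.


def predict(user_id_to_predict):
    # Load all rating records; each (user, song) pair is one training sample
    rate_list = RateSong.objects.all()
    user_ids = [x.user_id for x in rate_list]
    item_ids = [x.song_id for x in rate_list]
    ratings = [x.rate for x in rate_list]  # loaded, but not used by this version of the model
    if not user_ids:
        return []  # nothing to train on

    # Implicit feedback scores for the target user
    implicit_feedback = generate_implicit_feedback(user_id_to_predict)

    # Embedding table sizes; also cover a target user who has no ratings yet
    num_users = max(max(user_ids), user_id_to_predict) + 1
    num_items = max(item_ids) + 1

    # Input layers
    user_input = Input(shape=(1,), name='user_input')
    item_input = Input(shape=(1,), name='item_input')
    implicit_input = Input(shape=(1,), name='implicit_input')

    # Embedding layers
    embedding_size = 10
    user_embedding = Embedding(input_dim=num_users, output_dim=embedding_size)(user_input)
    item_embedding = Embedding(input_dim=num_items, output_dim=embedding_size)(item_input)

    # Flatten layers
    user_flat = Flatten()(user_embedding)
    item_flat = Flatten()(item_embedding)

    # Implicit feedback layer
    implicit_dense = Dense(32, activation='relu')(implicit_input)

    # Merge layers
    concat = Concatenate()([user_flat, item_flat, implicit_dense])
    dense1 = Dense(64, activation='relu')(concat)
    dense2 = Dense(32, activation='relu')(dense1)

    # Output layer: a probability distribution over all song ids
    output = Dense(num_items, activation='softmax')(dense2)

    # Build the model
    model = Model(inputs=[user_input, item_input, implicit_input], outputs=output)
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

    # Train the model; inputs are reshaped to (N, 1) to match the Input shapes
    implicit_scores = [implicit_feedback.get(x, 0) for x in item_ids]
    model.fit(
        [
            np.array(user_ids).reshape(-1, 1),
            np.array(item_ids).reshape(-1, 1),
            np.array(implicit_scores, dtype='float32').reshape(-1, 1),
        ],
        np.array(item_ids),
        epochs=20,
        batch_size=2
    )

    # Predict for the target user, with the item and implicit inputs set to zero
    user_array = np.array([[user_id_to_predict]])
    predictions = model.predict(
        [user_array, np.zeros_like(user_array), np.zeros((1, 1), dtype='float32')]
    )

    # Take the ids of the highest-scoring songs, best first
    num_recommendations = 8
    top_item_indices = np.argsort(predictions.flatten())[-num_recommendations:][::-1]

    # Map song ids to Song objects once instead of scanning the table per id
    songs_by_id = {song.songId: song for song in Song.objects.all()}
    result_list = [songs_by_id[int(i)] for i in top_item_indices if int(i) in songs_by_id]

    return result_list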
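For comparison, a common two-tower formulation keeps the two towers fully separate and scores a user and a song by the dot product of their embeddings, whereas the function above concatenates both embeddings into one network. The sketch below shows only that scoring and top-k step in plain Python; `recommend` and all embedding values are made up for illustration and are not part of the project's model.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def recommend(user_vec, item_vecs, seen, k=2):
    """Rank items the user has not seen by dot product with the user embedding."""
    scored = [
        (dot(user_vec, vec), item_id)
        for item_id, vec in item_vecs.items()
        if item_id not in seen
    ]
    scored.sort(reverse=True)  # highest score first
    return [item_id for _, item_id in scored[:k]]


# Made-up 3-dimensional embeddings; in practice each tower would be a trained network.
user_embedding = [0.9, 0.1, 0.0]
item_embeddings = {
    101: [1.0, 0.0, 0.0],
    102: [0.0, 1.0, 0.0],
    103: [0.8, 0.2, 0.1],
    104: [0.0, 0.0, 1.0],
}

print(recommend(user_embedding, item_embeddings, seen={101}))
```

Because the item embeddings do not depend on the user, they can be computed once and reused for every request, which is the usual reason for choosing this layout.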

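The early-stage algorithm achieved some results. However, since our project needs multimodal data and fused recommendations, I discussed this with the teammate responsible for the algorithms and, after some research, we settled on the final algorithm: collaborative filtering that combines three-modality data fusion, AI recommendation, and user behavior. That algorithm was implemented by the teammate responsible for the algorithm module.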

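Cloud platform deployment

We built a web application, and since it has to be reachable by users, we rented an ECS server and an elastic public IP on Huawei Cloud to deploy the project.

ECS server and public IP

With the purchase complete, start the cloud server.
requirements.txt
Log in to and configure the cloud server
PuTTY is used to log in here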

sudo apt update
sudo apt install python3 python3-pip python3-venv git -y
sudo apt install mysql-server -y
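Upload the project with WinSCP

Git is not used here mainly because the free tier of Gitee does not support uploading a single file larger than 100 MB, so we use WinSCP directly.
Install requirements.txt
Re-crawl the data and re-extract the features
Modify settings.py
Allow connections through the server's public IP and from localhost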
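In Django, allowed hosts are configured through the `ALLOWED_HOSTS` setting. The fragment below is a sketch of what this step might look like; the exact entries used in the project are not shown in the post, so the list is an assumption based on the public IP that appears in this write-up.

```python
# settings.py (fragment)
# Host names and IPs that Django will serve requests for;
# requests with any other Host header are rejected.
ALLOWED_HOSTS = [
    "60.204.234.223",  # the server's public IP used in this post
    "localhost",
    "127.0.0.1",
]
```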
Run the project
python3 manage.py runserver 0.0.0.0:8000
Then open a browser; the site can be reached through the public IP
http://60.204.234.223:8000

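To keep the project from stopping when the PuTTY connection drops, I run it with the nohup command
nohup python3 manage.py runserver 0.0.0.0:8000 &
Use tail -f nohup.out to view the logs.
At this point, the cloud deployment is complete.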
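Once the server is running in the background, a quick TCP check from another machine can help confirm that port 8000 is actually reachable from outside. The helper below is a minimal sketch using only the standard library; `is_port_open` is a hypothetical name, not part of the project.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts
        return False
```

For example, calling `is_port_open("60.204.234.223", 8000)` from a local machine returning False while the server process is running would point to a firewall or binding problem rather than an application error.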

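V. Reflections

Over the course of this project I went through the full workflow from local development to remote deployment and ran into quite a few challenges. First, while setting up the Django development environment, I came to understand how to configure a Python environment and how to install and use the Django framework. After understanding the project requirements I drafted an initial algorithm design, studied collaborative filtering in depth, and talked with the teammate responsible for the algorithms, which deepened my understanding of multimodal features. When deploying the project to the server, however, I hit many network-related problems: the server could not be reached through its public IP, and firewall settings, occupied ports, and IP binding kept blocking the launch. By troubleshooting and adjusting step by step, I learned how to manage the firewall, inspect processes, and free occupied ports on Linux. In particular, when deploying with Gunicorn, I learned how starting multiple worker processes can improve the project's performance and stability. Debugging and tuning dependencies such as TensorFlow also made me more careful with technical details. Through this practice I gained a more complete picture of the web development workflow and improved my problem-solving and debugging skills.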

posted @ 2024-12-15 23:31 acedia7
