记一次并发导致的事故

记一次并发导致的事故

简化业务逻辑如下:

  1. 创建一批收件人;
  2. 点击发送按钮时发送邮件给这批发件人;

现象是有一批人收到了两封邮件, 还有少部分人收到了 3 封.

原因分析

每次触发发送任务都有一个唯一的 traceid, 记录日志时会带上 traceid, 将日志按 traceid 分组获取第一条和最后一条记录时间, 绘制甘特图如下:

编写脚本绘制甘特图

logfile="logs/watermark_logger.log"
keyword='lock watermark_mail'
ids=$(grep "$keyword" "$logfile" | awk '{print $5}' | sort | uniq)

function format_date() {
    ret=$(echo "$*" | awk '{print $1,$2}' | awk -F ',' '{printf $1}')
    echo $ret
}

for uuid in $ids; do
    first_line=$(grep "$uuid" "$logfile" | head -n 1)
    last_line=$(grep "$uuid" "$logfile" | tail -n 1)
    start=$(format_date $first_line)
    end=$(format_date $last_line)

    name=$(echo ${start} | awk '{print $2}' | tr -d ':')-$(echo ${end} | awk '{print $2}' | tr -d ':')
    # mermaid gant
    echo ${name} :done, des1, ${start},${end}
done

使用 mermeid 将输出渲染为甘特图

gantt
dateFormat  YYYY-MM-DD HH:mm:ss
axisFormat  %Y-%m-%d %H:%M:%S
title       发送开始时间-结束时间
excludes    weekends

232910-233227 :done, des1, 2024-09-18 23:29:10,2024-09-18 23:32:27
232910-233227 :done, des1, 2024-09-18 23:29:10,2024-09-18 23:32:27
233234-233330 :done, des1, 2024-09-18 23:32:34,2024-09-18 23:33:30
233339-233438 :done, des1, 2024-09-18 23:33:39,2024-09-18 23:34:38
233445-233543 :done, des1, 2024-09-18 23:34:45,2024-09-18 23:35:43
233558-233655 :done, des1, 2024-09-18 23:35:58,2024-09-18 23:36:55
233711-233807 :done, des1, 2024-09-18 23:37:11,2024-09-18 23:38:07
233819-233916 :done, des1, 2024-09-18 23:38:19,2024-09-18 23:39:16
234023-234121 :done, des1, 2024-09-18 23:40:23,2024-09-18 23:41:21
234140-234238 :done, des1, 2024-09-18 23:41:40,2024-09-18 23:42:38
234246-234344 :done, des1, 2024-09-18 23:42:46,2024-09-18 23:43:44
234353-234450 :done, des1, 2024-09-18 23:43:53,2024-09-18 23:44:50
234457-234623 :done, des1, 2024-09-18 23:44:57,2024-09-18 23:46:23
234637-234732 :done, des1, 2024-09-18 23:46:37,2024-09-18 23:47:32
234742-234839 :done, des1, 2024-09-18 23:47:42,2024-09-18 23:48:39
234844-234942 :done, des1, 2024-09-18 23:48:44,2024-09-18 23:49:42
234956-235054 :done, des1, 2024-09-18 23:49:56,2024-09-18 23:50:54
235113-235209 :done, des1, 2024-09-18 23:51:13,2024-09-18 23:52:09
235214-235309 :done, des1, 2024-09-18 23:52:14,2024-09-18 23:53:09
235319-235423 :done, des1, 2024-09-18 23:53:19,2024-09-18 23:54:23
235503-235558 :done, des1, 2024-09-18 23:55:03,2024-09-18 23:55:58
235604-235749 :done, des1, 2024-09-18 23:56:04,2024-09-18 23:57:49
235605-235749 :done, des1, 2024-09-18 23:56:05,2024-09-18 23:57:49
235759-235920 :done, des1, 2024-09-18 23:57:59,2024-09-18 23:59:20
235846-000115 :done, des1, 2024-09-18 23:58:46,2024-09-19 00:01:15
235847-000107 :done, des1, 2024-09-18 23:58:47,2024-09-19 00:01:07
235929-000146 :done, des1, 2024-09-18 23:59:29,2024-09-19 00:01:46
000110-000331 :done, des1, 2024-09-19 00:01:10,2024-09-19 00:03:31
000126-000343 :done, des1, 2024-09-19 00:01:26,2024-09-19 00:03:43
000152-000400 :done, des1, 2024-09-19 00:01:52,2024-09-19 00:04:00
000343-000513 :done, des1, 2024-09-19 00:03:43,2024-09-19 00:05:13
000404-000526 :done, des1, 2024-09-19 00:04:04,2024-09-19 00:05:26
000531-000620 :done, des1, 2024-09-19 00:05:31,2024-09-19 00:06:20
000634-000723 :done, des1, 2024-09-19 00:06:34,2024-09-19 00:07:23
001247-001353 :done, des1, 2024-09-19 00:12:47,2024-09-19 00:13:53
gantt dateFormat YYYY-MM-DD HH:mm:ss axisFormat %Y-%m-%d %H:%M:%S title 发送开始时间-结束时间 excludes weekends 232910-233227 :crit, des1, 2024-09-18 23:29:10,2024-09-18 23:32:27 232910-233227 :crit, des1, 2024-09-18 23:29:10,2024-09-18 23:32:27 233234-233330 :done, des1, 2024-09-18 23:32:34,2024-09-18 23:33:30 233339-233438 :done, des1, 2024-09-18 23:33:39,2024-09-18 23:34:38 233445-233543 :done, des1, 2024-09-18 23:34:45,2024-09-18 23:35:43 233558-233655 :done, des1, 2024-09-18 23:35:58,2024-09-18 23:36:55 233711-233807 :done, des1, 2024-09-18 23:37:11,2024-09-18 23:38:07 233819-233916 :done, des1, 2024-09-18 23:38:19,2024-09-18 23:39:16 234023-234121 :done, des1, 2024-09-18 23:40:23,2024-09-18 23:41:21 234140-234238 :done, des1, 2024-09-18 23:41:40,2024-09-18 23:42:38 234246-234344 :done, des1, 2024-09-18 23:42:46,2024-09-18 23:43:44 234353-234450 :done, des1, 2024-09-18 23:43:53,2024-09-18 23:44:50 234457-234623 :done, des1, 2024-09-18 23:44:57,2024-09-18 23:46:23 234637-234732 :done, des1, 2024-09-18 23:46:37,2024-09-18 23:47:32 234742-234839 :done, des1, 2024-09-18 23:47:42,2024-09-18 23:48:39 234844-234942 :done, des1, 2024-09-18 23:48:44,2024-09-18 23:49:42 234956-235054 :done, des1, 2024-09-18 23:49:56,2024-09-18 23:50:54 235113-235209 :done, des1, 2024-09-18 23:51:13,2024-09-18 23:52:09 235214-235309 :done, des1, 2024-09-18 23:52:14,2024-09-18 23:53:09 235319-235423 :done, des1, 2024-09-18 23:53:19,2024-09-18 23:54:23 235503-235558 :done, des1, 2024-09-18 23:55:03,2024-09-18 23:55:58 235604-235749 :crit, des1, 2024-09-18 23:56:04,2024-09-18 23:57:49 235605-235749 :crit, des1, 2024-09-18 23:56:05,2024-09-18 23:57:49 235759-235920 :crit, des1, 2024-09-18 23:57:59,2024-09-18 23:59:20 235846-000115 :crit, des1, 2024-09-18 23:58:46,2024-09-19 00:01:15 235847-000107 :crit, des1, 2024-09-18 23:58:47,2024-09-19 00:01:07 235929-000146 :crit, des1, 2024-09-18 23:59:29,2024-09-19 00:01:46 000110-000331 :crit, des1, 2024-09-19 00:01:10,2024-09-19 00:03:31 000126-000343 :crit, des1, 2024-09-19 00:01:26,2024-09-19 00:03:43 000152-000400 :crit, des1, 2024-09-19 00:01:52,2024-09-19 00:04:00 000343-000513 :crit, des1, 2024-09-19 00:03:43,2024-09-19 00:05:13 000404-000526 :crit, des1, 2024-09-19 00:04:04,2024-09-19 00:05:26 000531-000620 :done, des1, 2024-09-19 00:05:31,2024-09-19 00:06:20 000634-000723 :done, des1, 2024-09-19 00:06:34,2024-09-19 00:07:23 001247-001353 :done, des1, 2024-09-19 00:12:47,2024-09-19 00:13:53

从图中可以看到两处异常(标红处).

第一个异常点, 上下两个 task 在时间上重合了, 说明次处发生了并发操作.

并发原因如下:

  • 查询状态和更改状态不在一个原子操作中;

修复方案:

  • 更改任务发送状态(加锁)时采用原子操作: 具体操作为, 使用update task set status=1 where status!=1 and id=***执行加锁动作, 根据 affected rows 做进一步操作;

第二个异常点, 下面的任务在上个任务结束之前就开始了.

原因如下:

两个任务同时执行, 第一个任务结束时进行解锁, 此时第二个任务仍在运行, 第二个任务结束时又进行了解锁操作, 由此形成了恶性循环.

修复方案:

  • 加锁时同时记录拥有者信息, 解锁时做判断

总结

此时事故遇到的是锁方案中常见的问题:

  1. 原子加锁;
  2. 解锁操作者必须和加锁操作者相同;
    关键操作需严格按照锁理论进行实现.
    另外, 甘特图在分析并发问题时是一个不错的选择.
posted @ 2024-09-25 11:24  Aloe_n  阅读(1)  评论(0编辑  收藏  举报