Data corruption when two jobs run concurrently
job1: move rows older than 24 hours into another table
@Override
@Transactional
public void moveItemsTo48h() {
    List<CrawlItem> historyItems = super.baseMapper.getItemsAfter24Hours();
    historyItems.forEach(a -> {
        Long id = a.getId();
        CrawlItem48h item = BeanUtil.toBean(a, CrawlItem48h.class);
        QueryWrapper<CrawlItem48h> wrapper = new QueryWrapper<>();
        wrapper.eq("mall", a.getMall());
        wrapper.eq("goods_source_sn", a.getGoodsSourceSn());
        CrawlItem48h exist = crawlItem48hMapper.selectOne(wrapper);
        if (exist == null) {
            crawlItem48hMapper.insert(item);
        } else {
            // Upsert: overwrite the existing 48h row but keep its primary key.
            BeanUtil.copyProperties(item, exist, "id");
            crawlItem48hMapper.updateById(exist);
        }
        removeById(id);
    });
}
@Select("select id,goods_source_sn,goods_info_url,source,url_code," +
"thumb_url,zhi_count,buzhi_count,star_count,comments_count,mall,title,emphsis,detail,detail_brief," +
"label,category_text,item_create_time,item_update_time,main_image_url,big_image_urls,small_image_urls," +
"price_text,price,unit_price,actual_buy_link,transfer_link,transfer_result,transfer_remark,coupon_info,taobao_pwd," +
"score,score_minute,keywords,status,remark,creator," +
"creator_id,last_operator,last_operator_id from crawl_items where TIMESTAMPDIFF(HOUR,item_create_time,now()) > 24 for update")
List<CrawlItem> getItemsAfter24Hours();
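A side note on this query (an observation, not a confirmed root cause of the incident): because the WHERE clause wraps `item_create_time` in `TIMESTAMPDIFF`, MySQL cannot use an index on that column, so InnoDB scans the whole table and, under FOR UPDATE, locks every row it examines. A sargable rewrite with equivalent semantics, sketched below, would let the locks cover only the matching rows (assuming an index on `item_create_time` exists):

```sql
-- Hypothetical rewrite (not from the original post): keep the column bare so
-- an index on item_create_time can be used.
-- Note: TIMESTAMPDIFF(HOUR, t, now()) > 24 counts complete hours, i.e.
-- "at least 25 full hours old", which is exactly t <= now() - interval 25 hour.
select id, goods_source_sn, goods_info_url /* ... same column list as above ... */
from crawl_items
where item_create_time <= now() - interval 25 hour
for update
```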
job2: re-crawl the items on the first two pages
@Override
@Transactional
public void reCrawl() {
    List<String> urllist = crawlItemMapper.getFirstRecrawlItems();
    // String.join avoids the trailing-comma trim and the
    // StringIndexOutOfBoundsException the old substring() threw on an empty list.
    String urls = String.join(",", urllist);
    Map<String, Object> paramMap = new HashMap<>();
    paramMap.put("project", "smzdmCrawler");
    paramMap.put("spider", "smzdm_single");
    paramMap.put("url", urls);
    String response = HttpUtil.createRequest(Method.POST, "http://42.192.51.99:6801/schedule.json")
            .basicAuth("david_scrapyd", "david_2021")
            .form(paramMap)
            .execute()
            .body();
    log.info(response);
}

@Select("select goods_info_url from crawl_items where status=1 order by score_minute desc, id desc limit 0,40 for update")
List<String> getFirstRecrawlItems();
Both queries use FOR UPDATE, and both methods are annotated with @Transactional. In theory, job1's SQL should acquire the locks first; job2's SQL would then block until moveItemsTo48h() finishes and releases them, and only afterwards would getFirstRecrawlItems() run. The rows deleted by job1 would then no longer be visible to job2, so they would not be re-crawled and re-inserted. In production, however, the behavior looks as if job2's SQL locked first, so rows job1 had deleted were re-crawled and inserted again. Unfortunately the SQL debug log was not enabled online and MySQL's general_log was off, so for now we can only assume that job2 acquired the locks first.
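Since the database-level lock ordering turned out to be unreliable here, one possible application-level fix (a minimal sketch with illustrative names, not the post's actual code) is to serialize the two jobs behind a shared lock, so the re-crawl query can never interleave with the move-and-delete. The in-memory lists below stand in for crawl_items and crawl_items_48h; across multiple JVM instances a distributed lock (e.g. Redis-based) would be needed instead of a ReentrantLock:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: serialize job1 and job2 with a shared JVM-level lock.
public class JobCoordinator {
    private static final ReentrantLock JOB_LOCK = new ReentrantLock();

    // In-memory stand-ins for crawl_items and crawl_items_48h.
    static final List<String> crawlItems = new ArrayList<>();
    static final List<String> crawlItems48h = new ArrayList<>();

    // job1: move everything to the 48h table, then delete from the source.
    static void moveItemsTo48h() {
        JOB_LOCK.lock();
        try {
            crawlItems48h.addAll(crawlItems);
            crawlItems.clear();
        } finally {
            JOB_LOCK.unlock();
        }
    }

    // job2: read the re-crawl candidates; runs strictly before or after job1.
    static List<String> getFirstRecrawlItems() {
        JOB_LOCK.lock();
        try {
            return new ArrayList<>(crawlItems);
        } finally {
            JOB_LOCK.unlock();
        }
    }

    public static void main(String[] args) {
        crawlItems.add("http://example.com/item/1"); // hypothetical URL
        Thread job1 = new Thread(JobCoordinator::moveItemsTo48h);
        job1.start();
        try {
            job1.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // Moved rows are invisible to job2, so nothing gets re-crawled.
        System.out.println(getFirstRecrawlItems().size()); // prints 0
    }
}
```

With both jobs funneled through JOB_LOCK, job2 always observes the table state either entirely before or entirely after job1's move, so a row deleted by job1 can never be re-inserted by job2.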
喜欢艺术的码农