A Walkthrough of a Complete Spring Batch Data Processing Flow
Requirements
Read data in bulk from an Excel spreadsheet containing 400+ rows, call an API for each record, then consolidate the key fields from the responses and write them to a new Excel file.
Only the first column of the sheet (phone numbers) contains valid data to read. Each phone number is sent to the API to fetch the corresponding student information: college, class, name, and phone number.
Import the dependencies
The key dependencies required:
```xml
<!-- Batch processing framework -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-batch</artifactId>
</dependency>
<!-- Excel tooling -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
</dependency>
<!-- HTTP library -->
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
</dependency>
<!-- JSON handling -->
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
</dependency>
```
Workflow
1. Create the configuration class
The overall Spring Batch flow is:
invoke the defined Job --> the Job executes its associated Steps in order --> within each Step, the preconfigured Reader, Processor, and Writer run in sequence
- Define the Job and Step in a configuration class:
```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import java.io.IOException;
import java.util.Map;

/**
 * Configuration class that defines the Job and Step
 */
@Configuration
@EnableBatchProcessing(modular = true)
public class BatchConfig {

    @Autowired
    private JobBuilderFactory jobBuilderFactory;

    @Autowired
    private StepBuilderFactory stepBuilderFactory;

    /**
     * Configure the Job, deciding which steps to execute
     */
    @Bean
    public Job excelJob(@Qualifier("excelStep") Step step01/*, @Qualifier("step02") Step step02*/) {
        return jobBuilderFactory.get("excelJob")
                .start(step01)
                // .next(step02)
                .build();
    }

    /**
     * Configure the reader, processor, and writer for the step
     */
    @Bean
    public Step excelStep(
            @Qualifier("excelProcessor") ItemProcessor<String, Map<String, String>> excelProcessor) throws IOException {
        return stepBuilderFactory.get("excelStep")
                .<String, Map<String, String>>chunk(1000)
                .reader(new ExcelItemReader("/Users/wangchenbo/Desktop/ytt.xls"))
                .processor(excelProcessor)
                .writer(new ExcelChunkWriter("/Users/wangchenbo/Desktop/data.xls"))
                .build();
    }
}
```
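The article does not show how excelJob is actually triggered. With spring-boot-starter-batch, Spring Boot runs all Job beans once at application startup by default; if you prefer to launch it explicitly, a minimal sketch might look like the following (the JobRunner class and the timestamp parameter are illustrative additions, not part of the original code):

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.boot.CommandLineRunner;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

/**
 * Hypothetical launcher: runs excelJob once at application startup.
 * Shown only to illustrate how the job could be triggered explicitly.
 */
@Configuration
public class JobRunner {

    @Bean
    public CommandLineRunner runExcelJob(JobLauncher jobLauncher, Job excelJob) {
        return args -> {
            // A unique parameter so a fresh job instance is created on each run
            JobParameters params = new JobParametersBuilder()
                    .addLong("timestamp", System.currentTimeMillis())
                    .toJobParameters();
            jobLauncher.run(excelJob, params);
        };
    }
}
```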
- About chunks
Chunk-oriented processing is a common data-processing pattern in which data is read, processed, and written in blocks. Items are read until the configured chunk size is reached; the chunk is then passed through the ItemProcessor and finally written out as a whole. Note that when setting the chunk size you also declare the input and output item types, as in `.<String, Map<String, String>>chunk(1000)` above.
2. Configure the ItemReader
ItemReader is the interface for reading data. Commonly used built-in implementations include:
- JdbcCursorItemReader: reads rows from a database one at a time through a ResultSet cursor
- FlatFileItemReader: reads data from flat files such as CSV or TXT; delimited lines are parsed with a DelimitedLineTokenizer rather than a separate reader class (a CSV sketch follows this list)
- StaxEventItemReader: reads records from XML files
- JdbcPagingItemReader: reads data from a database page by page
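For comparison, a minimal sketch of how the built-in FlatFileItemReader could be used if the source were a CSV file rather than Excel; the file path and bean name here are made up for illustration:

```java
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;

@Configuration
public class CsvReaderExample {

    /**
     * Hypothetical bean, not used by the article's job: reads the first column
     * of a comma-separated file, one line per item.
     */
    @Bean
    public FlatFileItemReader<String> phoneCsvReader() {
        return new FlatFileItemReaderBuilder<String>()
                .name("phoneCsvReader")
                .resource(new FileSystemResource("/tmp/phones.csv"))
                .lineMapper((line, lineNumber) -> line.split(",")[0]) // keep only the first column
                .build();
    }
}
```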
For the current Excel-reading requirement, a custom implementation of the ItemReader interface defines the reading rules. The rules are specified by implementing the interface's read() method.
```java
package ytt;

import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.NonTransientResourceException;
import org.springframework.batch.item.ParseException;
import org.springframework.batch.item.UnexpectedInputException;

import java.io.FileInputStream;
import java.io.IOException;

public class ExcelItemReader implements ItemReader<String> {

    // HSSFWorkbook: for Excel 2003 and earlier (.xls)
    // XSSFWorkbook: for Excel 2007 and later (.xlsx)
    // SXSSFWorkbook: streaming variant of XSSFWorkbook for large .xlsx files
    private HSSFWorkbook workbook;
    private Sheet sheet;
    private int currentRow = 0;

    public ExcelItemReader(String filePath) throws IOException {
        FileInputStream fis = new FileInputStream(filePath);
        this.workbook = new HSSFWorkbook(fis);
        this.sheet = this.workbook.getSheetAt(0);
    }

    @Override
    public String read() throws Exception, UnexpectedInputException, ParseException, NonTransientResourceException {
        // Once past the last row, close the workbook and signal the end of input
        if (currentRow > sheet.getLastRowNum()) {
            closeWorkbook();
            return null;
        }
        // Read the first cell (phone number) of the current row
        Row row = sheet.getRow(currentRow++);
        return row.getCell(0).getStringCellValue();
    }

    private void closeWorkbook() {
        try {
            if (workbook != null) {
                workbook.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```
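A side note on the comments above: the source file here is .xls, hence HSSFWorkbook. If the input were .xlsx instead, the same reader could be built around XSSFWorkbook; a minimal sketch of the changed part (the class name is hypothetical, the rest of read() stays the same):

```java
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import java.io.FileInputStream;
import java.io.IOException;

// Sketch only: the same reader adapted for .xlsx by switching to XSSFWorkbook.
// Declaring the fields against the Workbook/Sheet interfaces lets both formats
// share the rest of the reading logic unchanged.
public class XlsxItemReaderSketch {

    private final Workbook workbook;
    private final Sheet sheet;

    public XlsxItemReaderSketch(String filePath) throws IOException {
        try (FileInputStream fis = new FileInputStream(filePath)) {
            // XSSFWorkbook reads the whole stream up front, so it can be closed here
            this.workbook = new XSSFWorkbook(fis);
        }
        this.sheet = workbook.getSheetAt(0);
    }
}
```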
3. Configure the ItemProcessor
```java
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.util.CollectionUtils;
import org.springframework.util.StringUtils;

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/**
 * Defines the data processing rules. <br>
 * Receives each item from the reader, processes it, and once the preset
 * number of items has accumulated, the writer outputs them as one chunk.
 */
@Configuration
public class Processor {

    /**
     * Expose httpClient as a singleton bean so it is not recreated
     * every time the ItemProcessor runs.
     */
    @Bean
    public CloseableHttpClient httpClient() {
        return HttpClients.createDefault();
    }

    @Bean
    public ItemProcessor<String, Map<String, String>> excelProcessor(CloseableHttpClient httpClient) {
        return phoneNumber -> {
            // Target API endpoint
            String url = "http://116.62.107.xxx:12010/api/User/Personal";
            // Build the POST request
            HttpPost httpPost = new HttpPost(url);
            // Request body
            String requestBody = "{\"data\":{\"LoginName\":\"" + phoneNumber + "\"}}";
            httpPost.setEntity(new StringEntity(requestBody, StandardCharsets.UTF_8));
            // Request headers
            httpPost.setHeader("Authorization", "Bearer /** omitted **/");
            httpPost.setHeader("Content-Type", "application/json;charset=UTF-8");
            httpPost.setHeader("AppVersion", "1.5.8");
            httpPost.setHeader("Host", "116.62.107.xxx:12010");
            // Result map
            HashMap<String, String> map = new HashMap<>();
            try (CloseableHttpResponse response = httpClient.execute(httpPost)) {
                int statusCode = response.getStatusLine().getStatusCode();
                System.out.println("Status code: " + statusCode + " for " + phoneNumber);
                // Read the response entity
                HttpEntity entity = response.getEntity();
                if (entity != null) {
                    String responseBody = EntityUtils.toString(entity);
                    // Parse the JSON and pick out the required fields
                    JSONObject jsonBody = JSON.parseObject(responseBody);
                    if (StringUtils.startsWithIgnoreCase(jsonBody.getJSONObject("data").getString("facName"), "2022")) {
                        map.put("facName", jsonBody.getJSONObject("data").getJSONArray("details").getJSONObject(0)
                                .getString("facName"));
                        map.put("class", jsonBody.getJSONObject("data").getString("facName"));
                        map.put("name", jsonBody.getJSONObject("data").getString("name"));
                        map.put("phone", phoneNumber);
                    }
                }
            }
            // Returning null filters the item out, so it never reaches the writer
            return CollectionUtils.isEmpty(map) ? null : map;
        };
    }
}
```
4. Configure the ItemWriter
Here a custom implementation of the ItemWriter interface is also used as the writer.
```java
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.springframework.batch.item.ItemWriter;

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.List;
import java.util.Map;

/**
 * The writer writes one whole chunk of the preset size at a time. <br>
 * The parameter it receives is the list of processed items in that chunk.
 */
public class ExcelChunkWriter implements ItemWriter<Map<String, String>> {

    private HSSFWorkbook workbook;
    private Sheet sheet;
    private FileOutputStream fos;
    private int rowNum = 1;

    public ExcelChunkWriter(String outPath) throws IOException {
        fos = new FileOutputStream(outPath);
        this.workbook = new HSSFWorkbook();
        this.sheet = workbook.createSheet("已清洗数据"); // sheet name: "cleaned data"
        // Header row: college, class, name, phone number
        Row headerRow = sheet.createRow(0);
        headerRow.createCell(0).setCellValue("学院");
        headerRow.createCell(1).setCellValue("班级");
        headerRow.createCell(2).setCellValue("姓名");
        headerRow.createCell(3).setCellValue("号码");
    }

    @Override
    public void write(List<? extends Map<String, String>> items) throws IOException {
        for (Map<String, String> map : items) {
            Row row = sheet.createRow(rowNum++);
            row.createCell(0).setCellValue(map.get("facName"));
            row.createCell(1).setCellValue(map.get("class"));
            row.createCell(2).setCellValue(map.get("name"));
            row.createCell(3).setCellValue(map.get("phone"));
        }
        // The workbook is flushed and closed after this chunk; that is acceptable here
        // because the chunk size (1000) exceeds the row count, so write() runs only once.
        try {
            workbook.write(fos);
        } finally {
            workbook.close();
            fos.close();
        }
    }
}
```
Notes
- Looking at how ItemReader and ItemProcessor are actually invoked, each read() and process() call deals with a single item. The reader does not hand each item straight to the processor; the framework buffers the items it reads and only passes them on once the preset chunkSize has been reached, and the processor likewise handles the items one at a time. So when implementing ItemReader or ItemProcessor you only need to define how to read or process a single item; the interfaces are deliberately designed around single items to keep that logic simple.
- The implementation methods of ItemReader and ItemProcessor therefore take a single item as their parameter, whereas ItemWriter receives a whole chunk at once, as a `List<? extends T> items` parameter, and writes it in one pass (see the signatures sketched below).
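To make this concrete, these are the three callback signatures as declared in Spring Batch 4.x, condensed for illustration (javadoc, nullability annotations, and the extra declared exception subclasses on read() are trimmed; in the framework each interface lives in its own file):

```java
import java.util.List;

public interface ItemReader<T> {
    T read() throws Exception;                 // returns one item per call, or null when input is exhausted
}

public interface ItemProcessor<I, O> {
    O process(I item) throws Exception;        // transforms one item; returning null filters it out
}

public interface ItemWriter<T> {
    void write(List<? extends T> items) throws Exception; // receives the whole chunk at once
}
```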