Parsing HTML data: extracting the required values with regular expressions and saving them to a database

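The example extracts the contents of the <dt></dt> and <dd></dd> elements inside a <dl>. Because the <dt></dt> element also contains an <a> tag, that case is covered as well. This is my first attempt, so the code is not well optimized; let's learn together.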

<!DOCTYPE html>

<html>

<head>

</head>

<body>

  <dl class="hello">
    <dt> <a href="sdf">小哈同学1号的日记，2017-2-26日</a> </dt>
    <dd> 记录人：小哈同学1号 </dd>
    <dd> 天气：晴 </dd>
    <dd> 心情：<a href="fdg">今天外面有出太阳，可还是很冷，心情还不错！</a></dd>
  </dl>

</body>

</html>

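Step 1: here the HTML has already been stored in the database as a string (storing it is optional; you can also upload the file and analyze its whole content directly).

string json = "the HTML above"; // placeholder: the HTML document shown above, as one string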
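
Step 1 mentions that the HTML can also be read straight from a file instead of the database. A minimal sketch of that alternative, assuming the file is saved as UTF-8; HtmlSource and Load are hypothetical names that do not appear in the original code:

```csharp
using System.IO;
using System.Text;

public static class HtmlSource
{
    // Reads the whole HTML file into one string, ready to be passed to Regex.Matches.
    // Assumption: the file is UTF-8 encoded; change the encoding if yours differs.
    public static string Load(string path)
    {
        return File.ReadAllText(path, Encoding.UTF8);
    }
}
```

It could then be used as string json = HtmlSource.Load(path); where path points at the uploaded file.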

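Step 2: use a regular expression to get the contents of <dl class="hello">

// Needs: using System.Collections.Generic; using System.Text.RegularExpressions;
MatchCollection medl = Regex.Matches(json, @"<dl class=""hello"">([\s\S]*?)</dl>"); // json is the string to analyze

List<string> mclist = new List<string>(); // holds the "key：value" entries collected for one entity
// Loop over every <dl> that was found
for (int i = 0; i < medl.Count; i++)
{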

  // Step 3: get the contents of the <a> tags under the <dl>
  // Group 1 captures only the text between <a ...> and </a>, without the tags themselves
  MatchCollection dedt = Regex.Matches(medl[i].Value, @"<a\b[^>]*>([\s\S]*?)</a>");

  if (dedt.Count > 0)
  {
    // The <a> inside <dt> holds two values, the diary title and the date,
    // separated by a full-width comma "，"; split into at most two parts
    string[] titlelist = dedt[0].Groups[1].Value.Split(new[] { '，', ',' }, 2);
    mclist.Add("日记标题：" + titlelist[0].Trim()); // diary title
    if (titlelist.Length > 1)
    {
      mclist.Add("日期：" + titlelist[1].Trim()); // date
    }
  }
  if (dedt.Count > 1)
  {
    mclist.Add("心情：" + dedt[1].Groups[1].Value.Trim()); // mood: the value of the second <a> tag
  }

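  // Step 4: get the contents of the <dd> tags under the <dl>
  // Only <dd> elements containing plain text match here; the mood <dd> wraps an <a> tag and is handled in step 3
  MatchCollection mcdd = Regex.Matches(medl[i].Value, @"(?<=<dd>)([^<]*)(?=</dd>)");
  // Loop over the <dd> tags
  for (int j = 0; j < mcdd.Count; j++)
  {
    // Trim the value so the StartsWith check in GetValueByKey is not defeated by the leading space.
    // If the value still contains unwanted escape characters, remove them with Value.Replace("old", "new").
    mclist.Add(mcdd[j].Value.Trim());
  }
  hellobll.Add(GetModels(mclist)); // save the data in mclist to the database
  // Clear the entries that were just saved so they are not processed again
  mclist.Clear();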

　}

// Maps the collected data onto the matching fields; each value is looked up with this.GetValueByKey.

private Diary GetModels(List<string> data)
{
    Diary model = new Diary(); // Diary is the entity class shown at the end of this post
    model.Title = this.GetValueByKey(data, "日记标题"); // diary title
    model.date = this.GetValueByKey(data, "日期"); // date
    model.Name = this.GetValueByKey(data, "记录人"); // recorded by
    model.Weather = this.GetValueByKey(data, "天气"); // weather
    model.Mood = this.GetValueByKey(data, "心情"); // mood
    return model;
}

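private string GetValueByKey(List<string> data, string key)
{
    // Find the first entry of the form "key：value"; Find returns null when nothing matches
    string result = data.Find(x => x.StartsWith(key, System.StringComparison.Ordinal));
    if (!string.IsNullOrEmpty(result))
    {
        // Strip only the leading key and its colon. Calling Replace on the whole string
        // would also delete the key text or a colon wherever it occurs inside the value.
        result = result.Substring(key.Length).TrimStart('：', ':', ' ').Trim();
    }
    return result;
}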
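
To illustrate the "key：value" convention that GetValueByKey relies on, here is a small standalone sketch using entries shaped like the ones collected above. KeyValueLookup is a hypothetical wrapper class, and this version removes only the leading key and colon, which keeps a value intact even when it contains the key text itself (the mood sentence contains 心情):

```csharp
using System;
using System.Collections.Generic;

public static class KeyValueLookup
{
    // Returns the value of the first "key：value" entry, or null when no entry starts with the key.
    public static string GetValueByKey(List<string> data, string key)
    {
        string entry = data.Find(x => x.StartsWith(key, StringComparison.Ordinal));
        if (string.IsNullOrEmpty(entry))
        {
            return null;
        }
        // Drop the leading key, then the colon (full-width or ASCII) and surrounding spaces.
        return entry.Substring(key.Length).TrimStart('：', ':', ' ').Trim();
    }

    public static void Main()
    {
        var data = new List<string>
        {
            "日记标题：小哈同学1号的日记",
            "心情：今天外面有出太阳，可还是很冷，心情还不错！",
            "天气：晴"
        };
        Console.WriteLine(GetValueByKey(data, "天气"));         // 晴
        Console.WriteLine(GetValueByKey(data, "心情"));         // the full sentence, inner "心情" included
        Console.WriteLine(GetValueByKey(data, "日期") == null); // True: no entry for this key
    }
}
```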

// Entity class

public class Diary
{
    public int ID { get; set; }

    public string Title { get; set; }

    public string Name { get; set; }

    public string date { get; set; } // the date is stored as a string here

    public string Weather { get; set; }

    public string Mood { get; set; }
}

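The regular expressions could be optimized. Note that the mood value is not obtained through the <dd> tag; it is obtained through the <a> tag.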
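
One possible way to tidy this up, offered only as a sketch and not as the original approach: capture the inner HTML of every <dt> and <dd>, strip any nested tags, and split each entry on its first colon. The mood entry then needs no special handling. DiaryParser, ParseDl and StripTags are hypothetical names, and the tag stripping is only adequate for markup as simple as the sample above:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DiaryParser
{
    // Parses one <dl>...</dl> fragment into key/value pairs.
    public static Dictionary<string, string> ParseDl(string dlHtml)
    {
        var fields = new Dictionary<string, string>();

        // <dt>: strip the nested <a> tag, then split "title，date" on the first comma.
        Match dt = Regex.Match(dlHtml, @"<dt[^>]*>([\s\S]*?)</dt>");
        if (dt.Success)
        {
            string[] parts = StripTags(dt.Groups[1].Value).Split(new[] { '，', ',' }, 2);
            fields["日记标题"] = parts[0].Trim();
            if (parts.Length > 1) fields["日期"] = parts[1].Trim();
        }

        // <dd>: once nested tags are stripped, every entry has the form "key：value".
        foreach (Match dd in Regex.Matches(dlHtml, @"<dd[^>]*>([\s\S]*?)</dd>"))
        {
            string[] kv = StripTags(dd.Groups[1].Value).Split(new[] { '：', ':' }, 2);
            if (kv.Length == 2) fields[kv[0].Trim()] = kv[1].Trim();
        }
        return fields;
    }

    // Removes anything that looks like a tag; not a general-purpose HTML parser.
    private static string StripTags(string html)
    {
        return Regex.Replace(html, @"<[^>]+>", string.Empty);
    }

    public static void Main()
    {
        string dl = "<dl class=\"hello\">"
            + "<dt> <a href=\"sdf\">小哈同学1号的日记，2017-2-26日</a> </dt>"
            + "<dd> 记录人：小哈同学1号 </dd>"
            + "<dd> 天气：晴 </dd>"
            + "<dd> 心情：<a href=\"fdg\">今天外面有出太阳，可还是很冷，心情还不错！</a></dd>"
            + "</dl>";
        foreach (KeyValuePair<string, string> field in ParseDl(dl))
        {
            Console.WriteLine(field.Key + " = " + field.Value);
        }
    }
}
```

The resulting dictionary can be mapped onto the entity class directly, which would make the intermediate list and the key lookup unnecessary.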

posted @ 2017-02-26 03:48 向着哆啦前进的乌龟
