论请求网页时的乱码问题

乱码问题很烦人,本人始终没有找到很好的解决方案,在一个抓取网页数据的程序中,最后还是使用了WebBrownser.

开始时使用HttpWebRequest

复制代码
/// <summary>
        /// 根据给定的URL获取网页源代码
        /// </summary>
        /// <param name="url"></param>
        /// <returns></returns>
        public static string GetWebPageSource(string url)
        {
            ServicePointManager.ServerCertificateValidationCallback =
                new System.Net.Security.RemoteCertificateValidationCallback(
                    delegate(object sender, X509Certificate certificate, X509Chain chain, SslPolicyErrors errors)
                    {
                        return true;
                    });

            HttpWebRequest request = WebRequest.CreateDefault(new Uri(url)) as HttpWebRequest;
            request.Method = "GET";
            request.Proxy = new WebProxy { UseDefaultCredentials = true };

            try

            {
                HttpWebResponse response = request.GetResponse() as HttpWebResponse;
                System.IO.Stream responseStream = response.GetResponseStream();

                int bytesRead = 0;
                byte[] buffer = new byte[64 * 1024];
                MemoryStream stmMemory = new MemoryStream();

                //save stream to byte[]
                while ((bytesRead = responseStream.Read(buffer, 0, buffer.Length)) > 0)
                {
                    stmMemory.Write(buffer, 0, bytesRead);
                }
                var bytes = stmMemory.ToArray();
                stmMemory.Close();

                //get charset
                var html = string.Empty;
                var charset = response.CharacterSet;
                if (string.IsNullOrWhiteSpace(charset))
                    charset = "UTF-8";

                //read html as default charset from byte[]
                var ms = new System.IO.MemoryStream(bytes);
                System.IO.StreamReader reader = new System.IO.StreamReader(ms, Encoding.GetEncoding(charset));

                //get charset from html
                var innerCharset = Pubs.DetectCharset(reader.ReadToEnd());
                if (innerCharset != string.Empty)
                    charset = innerCharset;

                //read html as detected charset from byte[]
                ms = new System.IO.MemoryStream(bytes);
                reader = new System.IO.StreamReader(ms, Encoding.GetEncoding(charset));
                html = reader.ReadToEnd();

                return html;
            }
            catch (Exception)
            {
                throw;
            }
        }

public static string DetectCharset(string html)
        {
            var wb = new WebBrowser();
            wb.ScriptErrorsSuppressed = true;
            wb.Navigate("about:blank");
            wb.Document.Write(html);
            var heads = wb.Document.GetElementsByTagName("head");
            if (heads.Count > 0)
            {
                var head = heads[0];
                foreach (HtmlElement child in head.Children)
                {
                    if (child.DomElement != null && child.DomElement as mshtml.HTMLMetaElement != null)
                    {
                        var meta = child.DomElement as mshtml.HTMLMetaElement;
                        try
                        {
                            if (meta.charset != null && meta.charset.Trim() != string.Empty)
                                return meta.charset;
                            if (meta.content != null && meta.content.Contains("charset="))
                            {
                                return meta.content.Substring(meta.content.IndexOf("charset=") + "charset=".Length);
                            }
                        }
                        catch
                        {
                            continue;
                        }
                        
                    }
                }
            }
            return string.Empty;
        }
View Code
复制代码

反正已经使用了WebBrownser了,就干脆直接用WebBrowser请求得了,而且没有乱码,代码又少,使用中没发现有什么大的问题,比较稳定,但发现它归根结义是一个COM组件,导致程序运行时内存一直在升,最后找到原因后在适当的位置Dispose就没什么问题了:

复制代码
public static WebBrowser GetBrowser(string url, int secondsTimeOut)
        {
            var wb = new WebBrowser();
            wb.ScriptErrorsSuppressed = true;
            wb.Navigate(url);
            var startTime = DateTime.Now ;
            while (true)
            {
                Application.DoEvents();
                if (wb.ReadyState == WebBrowserReadyState.Loaded 
                    || wb.ReadyState == WebBrowserReadyState.Complete
                    || startTime.AddSeconds(secondsTimeOut) < DateTime.Now)
                    break;
            }

            return wb;
        }
复制代码

 

 

posted on   空明流光  阅读(408)  评论(0编辑  收藏  举报

编辑推荐:
· SQL Server 2025 AI相关能力初探
· Linux系列:如何用 C#调用 C方法造成内存泄露
· AI与.NET技术实操系列(二):开始使用ML.NET
· 记一次.NET内存居高不下排查解决与启示
· 探究高空视频全景AR技术的实现原理
阅读排行:
· 阿里最新开源QwQ-32B,效果媲美deepseek-r1满血版,部署成本又又又降低了!
· 单线程的Redis速度为什么快?
· SQL Server 2025 AI相关能力初探
· AI编程工具终极对决:字节Trae VS Cursor,谁才是开发者新宠?
· 展开说说关于C#中ORM框架的用法!

导航

< 2025年3月 >
23 24 25 26 27 28 1
2 3 4 5 6 7 8
9 10 11 12 13 14 15
16 17 18 19 20 21 22
23 24 25 26 27 28 29
30 31 1 2 3 4 5
点击右上角即可分享
微信分享提示