I set out to build a template crawler, and off it went crawling.

 

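The pages the crawler fetches reference their images, CSS, and JS by relative paths rather than absolute URLs, so once the markup is used locally the layout falls apart.

To fix this, I asked an expert for help and tidied up the code, which follows.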

 

<?php

// Sample source: one path without a leading slash, one with.
$rpp = '<sdf src="bbbs/sdd" <link rel="stylesheet" type="text/css" href="/public/ui/v2/static/css/basic.css?1594346753">';
$furl = "http://www.baidu.com"; // target URL, for now

// Prefix relative href/src values in $content with the scheme and host of $feed_url.
// Every relative path is treated as relative to the site root.
function relative_to_absolute($content, $feed_url) {
    // Extract scheme and host, e.g. "http://www.baidu.com".
    if (!preg_match('/^(?:https?|ftp):\/\/[^\/\s]+/i', $feed_url, $m)) {
        return $content; // no usable scheme or host: leave the content unchanged
    }
    $base = $m[0];

    // The lookahead skips values that are already absolute ("scheme:"),
    // protocol-relative ("//host"), or pure anchors ("#top").
    return preg_replace_callback(
        '/\b(href|src)="(?!#|\/\/|[a-z][a-z0-9+.\-]*:)([^"]+)"/i',
        function ($match) use ($base) {
            // Strip any leading slash so the result never contains "//" after the host.
            return $match[1] . '="' . $base . '/' . ltrim($match[2], '/') . '"';
        },
        $content
    );
}

print_r(relative_to_absolute($rpp, $furl));

?>

  The output is as follows

<sdf src="http://www.baidu.com/bbbs/sdd" <link rel="stylesheet" type="text/css" href="http://www.baidu.com/public/ui/v2/static/css/basic.css?1594346753">
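One limitation worth keeping in mind: the function above treats every relative path as if it were relative to the site root, which is only right when the crawled page itself sits at the root. The following is a minimal sketch of resolving a single href/src value against the URL of the page it came from. The helper name resolve_reference and the example page URL are made up for illustration, and it does not normalize "../" or "./" segments.

```php
<?php

// Hypothetical helper, not part of the original code: resolve one href/src
// value against the URL of the page it was found on.
function resolve_reference(string $ref, string $page_url): string {
    $parts = parse_url($page_url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $ref; // cannot determine a base, leave the value alone
    }
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Empty values, anchors, and values with their own scheme stay as they are.
    if ($ref === '' || $ref[0] === '#' || preg_match('#^[a-z][a-z0-9+.\-]*:#i', $ref)) {
        return $ref;
    }
    // Protocol-relative ("//host/path"): only the scheme is missing.
    if (strncmp($ref, '//', 2) === 0) {
        return $parts['scheme'] . ':' . $ref;
    }
    // Root-relative ("/path"): append to scheme and host.
    if ($ref[0] === '/') {
        return $origin . $ref;
    }
    // Document-relative ("path"): keep the directory of the page.
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $ref;
}

echo resolve_reference('img/a.png', 'http://www.baidu.com/news/index.html'), "\n";
// http://www.baidu.com/news/img/a.png
```

For a page at the site root this gives the same result as prefixing the host; the two only differ once the page lives in a subdirectory.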

  Hope this solves your problem too.

 

Bonus: matching any domain

(http|https)://[^\s\/]+
^(?:https?:)?//[^\s\/]+
https?://[^\s\/]+
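As a rough illustration of how a pattern like these might be used from PHP, here is a small sketch that pulls the "scheme://host" part out of a URL. The function name extract_origin is invented for this example, and the pattern is an anchored variant of the last one above.

```php
<?php

// Hypothetical helper: return the "scheme://host" prefix of a URL, or null
// when the string does not start with http:// or https://.
function extract_origin(string $url): ?string {
    // '#' is used as the delimiter so the slashes need no escaping.
    if (preg_match('#^https?://[^\s/]+#i', $url, $m)) {
        return $m[0];
    }
    return null;
}

var_dump(extract_origin('http://www.baidu.com/public/ui/v2/static/css/basic.css'));
// "http://www.baidu.com"
var_dump(extract_origin('/public/ui/v2/static/css/basic.css'));
// NULL
```

If the input is known to be a full URL, PHP's built-in parse_url() is probably the sturdier choice; the regex is mainly handy for scanning free text.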