Finding Images Relevant to a Web Page's Content (2): reddit's Approach

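As the previous post explained, content-aggregation sites such as Sina Weibo, Twitter, and Facebook depend on thumbnails for the pages they link to: a shared link is far more inviting with a thumbnail image next to it. reddit, the social news site popular with young people, is one such site, and because its source code is open on GitHub, it is easy to see how it does this.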

The thumbnail-finding algorithm can be found here [1].

The work is done by the _find_thumbnail_image(self) function; I will walk through its code below.

content_type, content = _fetch_url(self.url)
 
# if it's an image. it's pretty easy to guess what we should thumbnail.
if content_type and "image" in content_type and content:
    return self.url
 
if content_type and "html" in content_type and content:
    soup = BeautifulSoup.BeautifulSoup(content)
else:
    return None

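_fetch_url requests the URL and returns the content type and the content of the response.
As the _fetch_url function shows, the type is read from the HTTP response headers, where it is given as a MIME (Multipurpose Internet Mail Extensions) type.
If the URL points to an image, the URL itself is returned; if it points to an HTML (HyperText Markup Language) document, the source is parsed with the BeautifulSoup package; for any other type the function returns None.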
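
The three-way branch described above can be sketched without any network access. This is only an illustration, not reddit's code: the function name classify_response is my own, and unlike the excerpt it strips Content-Type parameters such as "; charset=utf-8" before matching.

```python
def classify_response(content_type, content):
    """Return 'image', 'html', or None, following the three branches above.

    `content_type` is the raw Content-Type header value (may be None),
    `content` is the response body (may be empty).
    """
    if not content_type or not content:
        return None
    # The header may carry parameters, e.g. "text/html; charset=utf-8";
    # keep only the MIME type itself.
    mime = content_type.split(";", 1)[0].strip().lower()
    if "image" in mime:
        return "image"   # the URL itself is the thumbnail source
    if "html" in mime:
        return "html"    # parse the page and look for images
    return None          # PDF, JSON, video, ...: nothing to thumbnail


print(classify_response("text/html; charset=utf-8", b"<html></html>"))
```

Checking for an image first matches the order in the excerpt, so a direct image link never triggers HTML parsing.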

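Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the parse tree.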

# allow the content author to specify the thumbnail:
# <meta property="og:image" content="http://...">
og_image = (soup.find('meta', property='og:image') or
            soup.find('meta', attrs={'name': 'og:image'}))
if og_image and og_image['content']:
    return og_image['content']
 
# <link rel="image_src" href="http://...">
thumbnail_spec = soup.find('link', rel='image_src')
if thumbnail_spec and thumbnail_spec['href']:
    return thumbnail_spec['href']

Next, the code checks whether the page author has specified a thumbnail. The mechanism is the Open Graph protocol described in the previous post.

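Either a meta tag or a link tag can name the page's thumbnail. If the page contains one of them, the job is done: the image address is returned directly. This is convenient, but it has an obvious weakness: nothing checks whether the declared image is actually a good choice. Some sites cut corners and declare the site logo rather than an image related to the page; Stack Overflow is a typical example. That said, such cases are fairly rare.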
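
To try the same lookup without installing BeautifulSoup, here is a minimal sketch built on the standard library's html.parser. It is not reddit's implementation; the names ThumbnailHintParser and find_declared_thumbnail are made up for this example, and like the excerpt it does not validate the image it finds.

```python
from html.parser import HTMLParser


class ThumbnailHintParser(HTMLParser):
    """Collects the first og:image meta tag and the first image_src link tag."""

    def __init__(self):
        super().__init__()
        self.og_image = None
        self.image_src = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and self.og_image is None:
            # Accept both <meta property="og:image"> and <meta name="og:image">.
            key = attrs.get("property") or attrs.get("name")
            if key == "og:image" and attrs.get("content"):
                self.og_image = attrs["content"]
        elif tag == "link" and self.image_src is None:
            if attrs.get("rel") == "image_src" and attrs.get("href"):
                self.image_src = attrs["href"]


def find_declared_thumbnail(html):
    """Return the author-declared thumbnail URL, or None if there is none."""
    parser = ThumbnailHintParser()
    parser.feed(html)
    parser.close()
    # og:image takes priority over image_src, as in the excerpt.
    return parser.og_image or parser.image_src


page = '<head><meta property="og:image" content="http://example.com/a.png"></head>'
print(find_declared_thumbnail(page))
```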

# ok, we have no guidance from the author. look for the largest
# image on the page with a few caveats. (see below)
max_area = 0
max_url = None
for image_url in self._extract_image_urls(soup):
    # When isolated from the context of a webpage, protocol-relative
    # URLs are ambiguous, so let's absolutify them now.
    if image_url.startswith('//'):
        image_url = coerce_url_to_protocol(image_url, self.protocol)
    size = _fetch_image_size(image_url, referer=self.url)
    if not size:
        continue
 
    area = size[0] * size[1]

Next comes a loop: after _extract_image_urls has collected every image on the page, the code visits each image in turn, looking for the largest one.
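
The comment in the loop about protocol-relative URLs deserves a small illustration. reddit uses its own helper, coerce_url_to_protocol; the sketch below swaps in urljoin from the standard library, which gives a "//host/path" URL the scheme of the page it came from. The function name absolutize and the example URLs are assumptions of mine. Note that urljoin also resolves ordinary relative paths, which goes beyond what the excerpt shows.

```python
from urllib.parse import urljoin


def absolutize(image_url, page_url):
    """Resolve image_url against the page it was found on.

    A protocol-relative URL such as "//cdn.example.com/a.png" inherits
    the scheme of page_url; absolute URLs are returned unchanged.
    """
    return urljoin(page_url, image_url)


print(absolutize("//cdn.example.com/a.png", "https://example.com/post/1"))
```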

The loop also applies a few restrictions to each image:

# ignore little images
if area < 5000:
    g.log.debug('ignore little %s' % image_url)
    continue
 
# ignore excessively long/wide images
if max(size) / min(size) > 1.5:
    g.log.debug('ignore dimensions %s' % image_url)
    continue
 
# penalize images with "sprite" in their name
if 'sprite' in image_url.lower():
    g.log.debug('penalizing sprite %s' % image_url)
    area /= 10

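An image's area must be at least 5000 pixels, and the ratio of its longer side to its shorter side must not exceed 1.5. If the URL contains "sprite", the image is penalized by dividing its area by 10.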
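
Put together, the filters and the "largest wins" rule can be sketched as below, working on sizes that are already known so that no download is needed. The names effective_area and pick_largest and the sample data are hypothetical. One caveat: this sketch is Python 3, where / is true division. If the quoted code runs under Python 2 with integer sizes and without `from __future__ import division`, max(size) / min(size) floors, and the aspect check would only reject ratios of 2 or more.

```python
def effective_area(url, size, min_area=5000, max_ratio=1.5):
    """Return the (possibly penalized) area of an image, or None if rejected."""
    width, height = size
    if width <= 0 or height <= 0:
        return None
    area = width * height
    if area < min_area:                      # ignore little images
        return None
    if max(size) / min(size) > max_ratio:    # ignore long or wide images
        return None
    if "sprite" in url.lower():              # penalize sprite sheets
        area /= 10
    return area


def pick_largest(candidates):
    """candidates: iterable of (url, (width, height)). Returns best URL or None."""
    best_url, best_area = None, 0
    for url, size in candidates:
        area = effective_area(url, size)
        if area is not None and area > best_area:
            best_url, best_area = url, area
    return best_url


images = [
    ("http://example.com/icons-sprite.png", (400, 400)),  # penalized to 16000
    ("http://example.com/photo.jpg", (200, 150)),         # area 30000, kept
    ("http://example.com/banner.png", (800, 100)),        # ratio 8, rejected
]
print(pick_largest(images))  # http://example.com/photo.jpg
```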

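_fetch_image_size(image_url, referer=self.url) is the tricky part: to learn the size of each image, the image has to be downloaded. One trick helps here. The dimensions are part of the image file format and are usually stored near the start of the file, so downloading only the beginning of the image is enough to get its size. The function itself is worth reading for the details.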
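
To make the "size lives in the file header" idea concrete, here is a small sketch that reads the dimensions from the first bytes of a PNG or GIF file. It illustrates the trick only and is not how _fetch_image_size is written; sniff_image_size is a name I made up, and JPEG is left out because its dimensions sit in a later segment that takes more work to locate.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def sniff_image_size(header):
    """Return (width, height) from the first bytes of a PNG or GIF, else None."""
    # PNG: 8-byte signature, then the IHDR chunk (4-byte length, 4-byte type),
    # whose data starts with width and height as big-endian 32-bit integers.
    if (len(header) >= 24 and header[:8] == PNG_SIGNATURE
            and header[12:16] == b"IHDR"):
        return struct.unpack(">II", header[16:24])
    # GIF: 6-byte signature, then width and height as little-endian 16-bit integers.
    if len(header) >= 10 and header[:6] in (b"GIF87a", b"GIF89a"):
        return struct.unpack("<HH", header[6:10])
    return None


# Build a fake 24-byte PNG header in place of a real partial download.
fake_png = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 640, 480)
print(sniff_image_size(fake_png))  # (640, 480)
```

With a real image, the header bytes would come from reading just the first chunk of the HTTP response and then closing the connection.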

if area > max_area:
    max_area = area
    max_url = image_url

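That is all there is to it. In one sentence, reddit's approach is: trust the thumbnail the page declares; if there is none, take the largest image on the page, subject to a minimum area and a maximum aspect ratio.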

This is how it looks in practice:

Facebook and other social sites do much the same; this answer describes how Facebook does it [2].

References

[1] https://github.com/reddit/reddit/blob/0fbea80d45c4ce35e50ae6f8b42e5e60d79743ca/r2/r2/lib/media.py

[2] http://stackoverflow.com/questions/1138460/how-does-facebook-sharer-select-images

Posted on 2015-04-28 21:01 by meelo
