The Social Web: data representation

Reference link: click here

The Jupyter notebook used for this assignment is a2_datarep.ipynb:

 "cells": [
   "cell_type": "markdown",
   "metadata": {
    "id": "dcMf4aubeMI9"
   "source": [
    "#  The Social Web: data representation\n",
    "- Instructors: Jacco van Ossenbruggen.\n",
    "- TAs: Ayesha Noorain, Alex Boyko, Caio Silva, Elena Beretta, Mirthe Dankloff.\n",
    "- Exercises for Hands-on session 2\n",
   "cell_type": "markdown",
   "metadata": {
    "id": "Zhts5HMzeMI-"
   "source": [
    "In this session you are going to mine data in various microformats. You will see the differences in what each of the formats can contain and what purpose they serve. We will start by looking at geographical data.\n",
    "- Python 3.8\n",
    "- Python packages: requests, BeautifulSoup4, HTMLParser, rdflib\n"
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 204
    "id": "6f-OtFPPeMJA",
    "outputId": "9bcb836f-4204-4fac-d133-99e81a0b2884"
   "outputs": [],
   "source": [
    "# If you're using a virtualenv, make sure it's activated before running\n",
    "# this cell!\n",
    "!pip install requests\n",
    "!pip install BeautifulSoup4\n",
    "!pip install HTMLParser\n",
    "!pip install rdflib"
   "cell_type": "markdown",
   "metadata": {
    "id": "irPnmIK4eMJd"
   "source": [
    "##  Exercise 1\n",
    "Even if web pages do not use microformat, interesting data can often be extracted from the HTML. You may use packages such as BeautifulSoup to extract arbitrary pieces of data from any HTML page.\n",
    "The example below shows how we can find the URL of first image in the infobox table of the wikipedia page on Amsterdam. Tip: compare the code below with HTML source code of the wikipedia page: the image url is in the \"src\" attribute of the \"img\" element of in the \"table\" element with class=\"infobox\"."
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 34
    "id": "9gpHw90keMJf",
    "outputId": "7ae1fe64-8d85-4a47-cfdf-422284954d81"
   "outputs": [],
   "source": [
    "# -*- coding: utf-8 -*-\n",
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "# This script requires you to add a url of a page with geotags to the commandline, e.g.\n",
    "# python ''\n",
    "URL = ''\n",
    "req = requests.get(URL, headers={'User-Agent' : \"Social Web Course Student\"})\n",
    "soup = BeautifulSoup(req.text)\n",
    "# print(req.text)\n",
    "image1 = soup.findAll('table', class_='infobox')[0].find('img')\n",
    "print(image1['src'])  \n"
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Extracting coordinates from a webpage and reformatting them in the geo microformat (based on Example 8-1 in Mining the Social Web). Note that wikipages may encode long/lat information in different ways. On of the ways used by the Amsterdam wikipedia page is in a span element that is not shown to the user: \n",
    "<span class=\"geo\">52.367; 4.900</span>\n",
    "This span element has a single child: len(geoTag == 1) and no further structure, we have to manually get the long/lat by splitting the string on the ';' semicolon."
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 34
    "id": "LtHtQT9PeMJl",
    "outputId": "8a7f7b52-cdb2-409f-b3f0-ee7adf60a9f7"
   "outputs": [],
   "source": [
    "geoTag = soup.find(True, 'geo')\n",
    "if geoTag and len(geoTag) > 1:\n",
    "        lat = geoTag.find(True, 'latitude').string\n",
    "        lon = geoTag.find(True, 'longitude').string\n",
    "        print ('Location is at'), lat, lon\n",
    "elif geoTag and len(geoTag) == 1:\n",
    "        (lat, lon) = geoTag.string.split(';')\n",
    "        (lat, lon) = (lat.strip(), lon.strip())\n",
    "        print (('Location is at'), lat, lon)\n",
    "        print ('Location not found')\n"
   "cell_type": "markdown",
   "metadata": {
    "id": "8S_bXnjveMJp"
   "source": [
    "### Task 1\n",
    "Can you convert the output of Exercise 1 into KML? Here is the KML documentation: and here you can find a simple example of how it is used:\n",
    "Visualise the point in Google Maps using the following code example:\n",
    "You will have to create your own KML file for the custom map layer, and provide a URL to the KML file inside the JavaScript code, which means that you have to upload the file somewhere. You can use a service like to obtain a URL for your KML file —> paste the code there and request the RAW format URL; use this one in this Task1.\n",
    "Is KML a microformat, why (not)?"
   "cell_type": "markdown",
   "metadata": {
    "id": "kUnka7EyeMJp"
   "source": [
    "## Exercise 2 \n",
    "In order to find information in the web we can use microformats such as [hRecipe]( or's [Recipe]( But first, we'll show you how to find arbitrary tags in a webpage.\n"
   "cell_type": "markdown",
   "metadata": {
    "id": "b0pBs-PVeMJq"
   "source": [
    "### Task 2 \n",
    "Parsing data for a <sub><sup>veggie</sup></sub> spaghetti alla carbonara recipe (from Example 2-7 in Mining the Social Web)."
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "mt9BK_CZeMJr"
   "outputs": [],
   "source": [
    "import requests\n",
    "import json\n",
    "from bs4 import BeautifulSoup\n",
    "# A yummy webpage (feel free to change to your likings.)\n",
    "URL = \"\"\n",
    "# requests will return the html found at the given webpage...\n",
    "page = requests.get(URL)\n",
    "# ...and a BeautifulSoup object can be created from its content.\n",
    "soup = BeautifulSoup(page.content, 'html.parser')\n",
    "listchildren = list(soup.children)\n",
    "# print(listchildren)"
   "cell_type": "markdown",
   "metadata": {
    "id": "IhdMwqykeMJt"
   "source": [
    "We can find any element in the page through *css tag selectors*\n",
    "You can find them all [here](, but shortly these are \".\" for classes, # for ids and plain text for the element name.\n",
    "You can also combine them, so that looking for \".class1.class2\" would select all elements displaying both classes. For a deeper overview please check the above link (or google \"html tag selectors\"). "
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 71
    "id": "PBaiK8OLeMJu",
    "outputId": "5b75f973-41c1-4ad4-fd9f-4f1f7665ba1d"
   "outputs": [],
   "source": [
    "print(len(listchildren)) # we can see here how many children the html doc has got.\n",
    "ingredients_unparsed = soup.select_one(\".tasty-recipes-ingredients\")\n",
    "# let's get all the \"list item\" elements in a list:\n",
    "ing_unp = ingredients_unparsed.findAll('li')\n",
   "cell_type": "markdown",
   "metadata": {
    "id": "tFXVPZhIeMJw"
   "source": [
    "Mmmh... not so pretty yet. How about listing their items using the text method?"
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 204
    "id": "xASBZsnMeMJx",
    "outputId": "7af0f6e9-3b4f-4f34-e444-794087d06e25"
   "outputs": [],
   "source": [
    "ingredients = [t.text for t in ing_unp]\n",
    "# [print(i) for i in ingredients]  # Also prints the generator\n",
    "# Instead\n",
    "for ing in ingredients:\n",
    "    print(ing)"
   "cell_type": "markdown",
   "metadata": {
    "id": "O-RItVHyeMJz"
   "source": [
    "Good. Now the instructions:"
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 54
    "id": "d-3Op4B6eMJ0",
    "outputId": "75a70f0c-86d3-4be9-d2d8-84df91c4f392"
   "outputs": [],
   "source": [
    "instructions_unparsed = soup.select_one(\".tasty-recipes-instructions\")\n",
    "instructions_unparsed = instructions_unparsed.findAll(\"li\")\n",
   "cell_type": "markdown",
   "metadata": {
    "id": "wPWXuglfeMJ2"
   "source": [
    "Let's finish off with the title:"
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 35
    "id": "yg1TnWe2eMJ3",
    "outputId": "05d39a2e-3779-45f1-ddeb-9c6d2ae5f494"
   "outputs": [],
   "source": [
    "title_unparsed = soup.select_one(\".post-header\") # \n",
    "categorical_title = title_unparsed.text.split(\"›\") # website specific divider.\n",
    "recipe_title = categorical_title[-1].strip() # let's remove that ugly space at the beginning.\n",
   "cell_type": "markdown",
   "metadata": {
    "id": "RYb6WtXYeMJ6"
   "source": [
    "## Task 2.1\n",
    "Now it's your turn. Create a function that can scrape any recipe webpage from the same website (other websites will have different class tags). \n",
    "Make sure to:\n",
    "- return itemized content (e.g. ingredients) in a list. You may want to use a list comprehension here.\n",
    "- Not all items have been cleaned of their html markdown (see variables ```ingredients``` vs. ```instructions_unparsed```. Make sure to return a list with human readable content (i.e. by using the ```.text``` attribute).\n"
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 54
    "id": "UQu9ecLEeMJ6",
    "outputId": "a8aa0e14-a8fb-4279-cf32-8dca97ab3412"
   "outputs": [],
   "source": [
    "# -*- coding: utf-8 -*-\n",
    "import requests\n",
    "import json\n",
    "from bs4 import BeautifulSoup\n",
    "# Pass in a URL containing hRecipe, such as\n",
    "URL = \"\"#YOUR RECIPE HERE/\n",
    "# Parse out some of the pertinent information for a recipe.\n",
    "# See\n",
    "def parse_website(url):\n",
    "    page = requests.get(url)\n",
    "    soup = BeautifulSoup(page.content, 'html.parser')\n",
    "    \n",
    "    # You code here\n",
    "    # Parse header and get the title\n",
    "    title_unparsed = soup.select_one(\".post-header\") # \n",
    "    categorical_title = title_unparsed.text.split(\"›\") # website specific divider.\n",
    "    recipe_title = categorical_title[-1].strip() # let's remove that ugly space at the beginning.\n",
    "    fn = recipe_title\n",
    "    # Ingredients\n",
    "    ingredients_unparsed = soup.select_one(\".tasty-recipes-ingredients\")\n",
    "    # let's get all the \"list item\" elements in a list:\n",
    "    ing_unp = ingredients_unparsed.findAll('li')\n",
    "    ingredients = [t.text for t in ing_unp]\n",
    "    # Instructions\n",
    "    instructions_unparsed = soup.select_one(\".tasty-recipes-instructions\")\n",
    "    instructions_unparsed = instructions_unparsed.findAll(\"li\")\n",
    "    instructions = [t.text for t in instructions_unparsed]\n",
    "    return {\n",
    "            'name': fn,\n",
    "            'ingredients': ingredients,\n",
    "            'instructions': instructions,\n",
    "            }\n",
    "    \n",
    "recipe = parse_website(URL)\n",
    "print (recipe)"
   "cell_type": "markdown",
   "metadata": {
    "id": "ccURluAIeMJ8"
   "source": [
    "But How can we get information not only from one website,  but from all? \n",
    "The answer: microformats.\n",
    "But rather than extracting with information manually from the or hRecipe microformats, we can use a package, ```scrape-schema-recipe``` \n",
    "Feel free to experiment with it. "
   "cell_type": "markdown",
   "metadata": {
    "id": "EBY-y_GreMJ8"
   "source": [
    "### Task 2.2\n",
    "hRecipe is a microformat specifically created for recipes.\n",
    "Can you for example easily compare different dessert recipe ingredients? For inspiration you can look back at the exercises you did in Hands-on session 1 where you compared different sets of tweets."
   "cell_type": "markdown",
   "metadata": {
    "id": "n-J8fiLbeMJ9"
   "source": [
    "## Exercise 3"
   "cell_type": "markdown",
   "metadata": {
    "id": "7XBeqJHVeMJ9"
   "source": [
    " is one of the most widely used annotations formats. is a multipurpose  template that has been created by a consortium consisting of Yahoo!, Google and Microsoft. It can describe entities, events, products etc. Check out the vocabulary specs on"
   "cell_type": "markdown",
   "metadata": {
    "id": "fiw8JClyeMJ-"
   "source": [
    "### Task 3\n",
    "Parsing microdata. To parse this data you need to install the rdflib-microdata package, which you have done in one of the previous steps.\n",
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 153
    "id": "X2zr3fOOeMJ-",
    "outputId": "d123f981-d73f-470f-b5e9-8735819f894b"
   "outputs": [],
   "source": [
    "from rdflib import Graph\n",
    "# Source:\n",
    "# Pass in a URL containing microformats\n",
    "URL = \"\"\n",
    "# Initialize a graph\n",
    "g = Graph()\n",
    "# Parse in an RDF file graph dbpedia\n",
    "result = g.parse(location=URL)\n",
    "# Loop through first 10 triples in the graph\n",
    "for index, (sub, pred, obj) in enumerate(g):\n",
    "    print(sub, pred, obj)\n",
    "    if index == 10:\n",
    "        break"
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 34
    "id": "hrQ2EuY5JAn1",
    "outputId": "eba60ebb-7ac5-4451-c16e-3f68e66af7f3"
   "outputs": [],
   "source": [
    "# Print the size of the Graph\n",
    "print(f'Graph has {len(g)} facts')"
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 323
    "id": "IAO1JllwJMqO",
    "outputId": "08f5e32d-d1a6-4a30-878a-ce7b768a8811"
   "outputs": [],
   "source": [
    "# Print out the entire Graph in the RDF Turtle format\n",
   "cell_type": "markdown",
   "metadata": {
    "id": "dzbynasAeMKA"
   "source": [
    "### Task 3.1 \n",
    "Compare the information about a band on to the Facebook Open Graph information about the same band from Facebook. What are the differences? Which format do you think supports better interoperability?"
   "cell_type": "markdown",
   "metadata": {
    "id": "Nocs4YDPeMKB"
   "source": [
    "### Task 3.2\n",
    "Explore the various microformats at and compare the output of the exercises with the output of Think about possible microformats you want to support in your final assignment and read up on how to parse them."
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "Hands-on_2_microformats.ipynb",
   "provenance": [],
   "toc_visible": true
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
 "nbformat": 4,
 "nbformat_minor": 1

In this session we are going to mine data in various microformats. You will see the differences in what each of the formats can contain and what purpose they serve. We will start by looking at geographical data.


1. Python 3.8

2. Python packages: requests, BeautifulSoup4, HTMLParser, rdflib





<a href="">Web Design Blog</a>


<a href="" rel="homepage">Web Design Blog</a>
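
The only difference between the two links above is the `rel` attribute, and that is what a program can key on. A minimal sketch of selecting links by their `rel` value, using the standard library's `html.parser`; the `example.com` URLs, `RelLinkParser`, and `links_with_rel` are hypothetical placeholders introduced for illustration:

```python
from html.parser import HTMLParser

# Hypothetical sample markup; the URLs are placeholders, not from the original post.
SAMPLE_HTML = """
<a href="https://example.com/blog">Web Design Blog</a>
<a href="https://example.com/" rel="homepage">Web Design Blog</a>
"""


class RelLinkParser(HTMLParser):
    """Collect (rel values, href) for every <a> element that has a rel attribute."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        rel = attributes.get("rel")
        if rel:
            # rel may hold several space-separated values
            self.links.append((rel.split(), attributes.get("href")))


def links_with_rel(html, rel_value):
    """Return the href of each link whose rel attribute contains rel_value."""
    parser = RelLinkParser()
    parser.feed(html)
    return [href for rels, href in parser.links if rel_value in rels]


print(links_with_rel(SAMPLE_HTML, "homepage"))
```

The plain link is ignored because nothing in its markup says what role it plays. The annotated one is found without looking at the link text at all.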






<div>Joe Doe</div>
<div>The Example Company</div>
<a href=""></a>


<div class="vcard">
<div class="fn">Joe Doe</div>
<div class="org">The Example Company</div>
<div class="tel">604-555-1234</div>
<a class="url" href=""></a>
</div>
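
With the hCard class names in place, the fields can be read back by name instead of by position. The following is a rough sketch under stated assumptions: it uses the standard library's `html.parser` instead of BeautifulSoup, handles only the three text properties shown above, and `HCardParser` and `parse_hcard` are hypothetical names.

```python
from html.parser import HTMLParser

HCARD_HTML = """
<div class="vcard">
  <div class="fn">Joe Doe</div>
  <div class="org">The Example Company</div>
  <div class="tel">604-555-1234</div>
</div>
"""


class HCardParser(HTMLParser):
    """Collect the text of elements whose class is a known hCard property."""

    PROPERTIES = {"fn", "org", "tel"}

    def __init__(self):
        super().__init__()
        self.card = {}
        self._current = None  # hCard property of the element being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.PROPERTIES:
                self._current = name

    def handle_data(self, data):
        if self._current and data.strip():
            self.card[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None


def parse_hcard(html):
    """Return a dict mapping hCard property names to their text."""
    parser = HCardParser()
    parser.feed(html)
    return parser.card


print(parse_hcard(HCARD_HTML))
```

The same code applied to the unannotated version would return an empty dict, since plain `<div>` elements give a parser nothing to tell a name from a company.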




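1. When crawling web content, the semantics of each content block can be identified more accurately;

2. The content can be operated on, for example to provide access to it or to proofread it, and it can be converted into other related formats for use by external programs and web services.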
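
As an illustration of converting extracted data into another format, a latitude/longitude pair such as the one in the geo microformat could be rewritten as a KML placemark. This is a minimal sketch, not a full answer to Task 1: `point_to_kml` is a hypothetical helper, the placemark name "Amsterdam" is chosen for the example, and only the standard library's `xml.etree.ElementTree` is used.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def point_to_kml(name, lat, lon):
    """Return a KML document string with one Placemark at (lat, lon)."""
    # Serialise KML elements without a namespace prefix.
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    placemark = ET.SubElement(kml, f"{{{KML_NS}}}Placemark")
    ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
    point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
    # KML writes coordinates as longitude,latitude: the reverse of the
    # "lat; lon" order used in the geo span.
    ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")


print(point_to_kml("Amsterdam", 52.367, 4.900))
```

Building the document with an XML library instead of string formatting means the name is escaped correctly if it ever contains characters such as `&` or `<`.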

Exercise 1

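Even if a web page does not use microformats, interesting data can still be extracted from its HTML. We can use a package such as BeautifulSoup to extract arbitrary pieces of data from any HTML page. The code below shows how to find the URL of the first image in the infobox table of the Wikipedia page on Amsterdam. Tip: the image URL is in the "src" attribute of the "img" element in the "table" element with class="infobox".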

# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup

# This script requires you to add a url of a page with geotags to the commandline, e.g.
# python ''
URL = ''

req = requests.get(URL, headers={'User-Agent' : "Social Web Course Student"})
soup = BeautifulSoup(req.text, 'html.parser')  # name the parser explicitly to avoid a warning
# print(req.text)
image1 = soup.findAll('table', class_='infobox')[0].find('img')
print(image1['src'])



posted @ 2022-11-07 03:57  我是球啊