Scrapy for Scraping

By pjain      Published July 27, 2020, 8:04 p.m. in blog AI-Analytics-Data   

==== Scrapy, the Champion Scraper

Scrapy Quick Example

  • Setup: pip install Scrapy

Ex1: Official scraping example

$ scrapy runspider myspider.py  # outputs json { title: ..}

# myspider.py
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('.post-header>h2'):
            yield {'title': title.css('a ::text').get()}

        for next_page in response.css('div.prev-post > a'):
            yield response.follow(next_page, self.parse)

Ex2: Scrape quotes

$ scrapy runspider quotes_spider.py -o quotes.json

[{"author": "Jane Austen",
    "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"
}, ..]
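The exported feed is plain JSON, so it can be inspected with the standard library alone. The following is a minimal sketch, assuming the output has the shape of the sample above; the inline `sample` string is a shortened stand-in for the contents of quotes.json so that the snippet runs on its own, and `load_quotes` is a hypothetical helper name.

```python
import json

# Stand-in for the contents of quotes.json. The shape follows the sample
# output above; the quote text is shortened here for illustration.
sample = '[{"author": "Jane Austen", "text": "\\u201cThe person, be it gentleman or lady...\\u201d"}]'


def load_quotes(raw):
    """Parse the JSON feed and return a list of (author, text) tuples."""
    return [(row["author"], row["text"]) for row in json.loads(raw)]


quotes = load_quotes(sample)
for author, text in quotes:
    # json.loads decodes \u201c and \u201d into real curly quotation marks.
    print(f"{author}: {text}")
```

To read a real export, the `sample` string could be replaced with `open("quotes.json", encoding="utf-8").read()`.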


# ~z/.../quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
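The pagination step hinges on turning a relative `href` into an absolute URL. As a rough sketch of what `response.urljoin` does, the standard library's `urljoin` resolves a link against the URL of the page it was found on. The `href` values below are illustrative assumptions, not taken from the live site, and `absolute` is a hypothetical helper.

```python
from urllib.parse import urljoin

# URL of the page being parsed (the spider's start URL).
page_url = "http://quotes.toscrape.com/tag/humor/"


def absolute(href, base=page_url):
    """Resolve a possibly relative href against the page it was found on."""
    return urljoin(base, href)


# A root-relative link replaces the whole path.
print(absolute("/tag/humor/page/2/"))
# A relative link is resolved against the current "directory".
print(absolute("page/2/"))
# An already absolute link is returned unchanged.
print(absolute("http://example.com/x"))
```

The first two calls both resolve to http://quotes.toscrape.com/tag/humor/page/2/. Note that `response.follow`, used in Ex1, accepts relative URLs directly, which is why that spider needs no explicit join.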

Scrapy 102 - DB, Django integration, Throttling

Scraping integration with Django

  • https://blog.theodo.fr/2019/01/data-scraping-scrapy-django-integration/
  • https://medium.com/@ali_oguzhan/how-to-use-scrapy-with-django-application-c16fabd0e62e
  • Threading scrapper for flights on Django/Python
  • Creating Scrapy scrapers via the Django admin interface https://github.com/nKandel/django-dynamic-scraper

DB integration
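One common route to a database is an item pipeline: a plain class with `open_spider`, `process_item`, and `close_spider` methods that Scrapy calls as items arrive. The sketch below is one possible shape, not a definitive implementation. It assumes items carry `author` and `text` keys as in Ex2, uses SQLite from the standard library, and the class name, table name, and `quotes.db` file name are all hypothetical.

```python
import sqlite3


class SQLitePipeline:
    """Item pipeline sketch: stores each scraped quote in a SQLite table."""

    def __init__(self, db_path="quotes.db"):  # hypothetical default file name
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the DB, ensure the table exists.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS quotes (author TEXT, text TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every item the spider yields. Returning the item lets
        # later pipelines and feed exports still receive it.
        self.conn.execute(
            "INSERT INTO quotes (author, text) VALUES (?, ?)",
            (item.get("author"), item.get("text")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()


# Exercising the pipeline by hand, outside Scrapy, with an in-memory database.
pipeline = SQLitePipeline(":memory:")
pipeline.open_spider(spider=None)
pipeline.process_item({"author": "Jane Austen", "text": "..."}, spider=None)
rows = pipeline.conn.execute("SELECT author FROM quotes").fetchall()
print(rows)
pipeline.close_spider(spider=None)
```

Inside a project, a pipeline like this would be switched on through the `ITEM_PIPELINES` setting, for example `ITEM_PIPELINES = {"myproject.pipelines.SQLitePipeline": 300}`, where `myproject` is a placeholder module path.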

Throttling
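Scrapy exposes throttling through ordinary settings. The sketch below collects the relevant ones in a dict; the setting names are Scrapy's own, while the numeric values and the `THROTTLE_SETTINGS` name are illustrative assumptions that would need tuning per target site.

```python
# Candidate values for a project's settings.py or a spider's custom_settings.
# The numbers are illustrative assumptions, not recommendations.
THROTTLE_SETTINGS = {
    "ROBOTSTXT_OBEY": True,                  # respect robots.txt rules
    "DOWNLOAD_DELAY": 1.0,                   # base delay in seconds between requests to a site
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,     # cap on parallel requests per domain
    "AUTOTHROTTLE_ENABLED": True,            # adapt the delay to observed latency
    "AUTOTHROTTLE_START_DELAY": 1.0,         # initial delay in seconds
    "AUTOTHROTTLE_MAX_DELAY": 30.0,          # upper bound on the delay in seconds
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,  # average parallel requests to aim for
}

for name, value in THROTTLE_SETTINGS.items():
    print(f"{name} = {value!r}")
```

Assigning the dict to a spider's `custom_settings` class attribute applies it to that spider only; copying the entries into settings.py applies them project-wide.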

Scrapy 101

Comparing high-level Scrapy (a full crawler) to low-level BS4 and lxml

  • BeautifulSoup: a Python package for parsing HTML and XML documents
  • lxml: a Pythonic binding for the C libraries libxml2 and libxslt
  • Scrapy: a Python framework for making web crawlers
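To make the division of labor concrete, here is a minimal parsing-only sketch. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup or lxml so that it runs without extra installs; the HTML snippet and the `TitleParser` class are made up for illustration. Everything a crawler framework adds (scheduling requests, following links, exporting items) is absent here.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text content of every <h2> element in a document."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.titles.append("")  # start a new title

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.titles[-1] += data  # text inside nested <a> tags lands here too


# Made-up page fragment, loosely shaped like a blog index.
html = (
    "<div><h2><a href='/a'>First post</a></h2>"
    "<h2><a href='/b'>Second post</a></h2></div>"
)
parser = TitleParser()
parser.feed(html)
parser.close()
print(parser.titles)
```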

"In other words, comparing BeautifulSoup (or lxml) to Scrapy is like comparing jinja2 to Django."- Source: Scrapy FAQ

Resources, open source, tutorials

  • OFF
  • https://github.com/scrapy/slybot

  • Run the home page example (ex 0)

  • Read "Scrapy at a glance" and run it: http://docs.scrapy.org/en/latest/intro/overview.html
  • Tutorial: http://docs.scrapy.org/en/latest/intro/tutorial.html#intro-tutorial

  • Tutorials:
    https://www.analyticsvidhya.com/blog/2017/07/web-scraping-in-python-using-scrapy/
    https://www.digitalocean.com/community/tutorials/how-to-crawl-a-web-page-with-scrapy-and-python-3

  • Crawl a blog url, and find all url from it, then save to mysql. https://github.com/clasense4/scrapy-blog-crawler

  • Develop your first web crawler in Python Scrapy – Python Pandemonium – Medium
  • Python Web Scraping Tutorials – Real Python
  • A REALLY simple, but powerful Python web crawler - palkeo
  • 60 Innovative Website Crawlers for Content Monitoring
  • Python Web Scraping Tutorial using BeautifulSoup
  • Web Scraping with Python: A How-To Guide - Algorithmia
  • How I automated my job search by building a web crawler from scratch
  • SayOne | Blog
  • Here’s how I applied coding to my job – Code Like A Girl
  • Web Scraping Tutorial with Python: Tips and Tricks
  • Anyone know of a good Python based web crawler that I could use? - Stack Overflow
  • How to Crawl the Web Politely with Scrapy
  • Python Web Crawler with Web Front End : Python
  • What's the best way to learn to make web crawlers using Python? - Quora
  • Scrapy Tutorial Series: Web Scraping Using Python | MichaelYin Blog
  • A Basic SEO for Django – Vinta Software
  • Recursively Scraping Web Pages with Scrapy
  • How to scrape a website using Python + Scrapy in 5 simple steps
  • Web Scraping and Crawling Are Perfectly Legal, Right?
  • Building A Concurrent Web Scraper With Python and Selenium | TestDriven.io
  • Running scrapy spider programmatically
  • django web crawler blog - Google Search
  • Small Open-Source Django Projects to Get Started
  • Preventing Web Scraping: Best Practices for Keeping Your Content Safe
  • Django 1.7 + Scrapy

Other open-source scrapers on GitHub

  • clasense4/scrapy-blog-crawler: Crawl a blog url, and find all url from it, then save to mysql.
  • ialuronico/CrawlingBlogs: Python Crawler for Blogs
  • danielfrg/django_crawler: A django blog crawler
  • NISH1001/medium-crawler: A crawler for scraping shit from medium blogs
  • jy617lee/naver_blog_crawler: A Naver crawler that takes a search period and a search term and saves the date, title, and body of the matching Naver blog posts
  • younghz/blog-crawler: It is a blog crawler which is specified to CSDN.
  • hack4code/BlogSpider: spiders crawl blogs (rss | atom | blog)
  • saka947/blogCrawl
  • enixdark/crawlBlog: simple crawl data from blog use scrapy
  • MoonOnTheWay/BlogsData: Blogs data crawled from multiple blog sites
  • zuzhi/blogbot: Crawl engineering blogs with Scrapy
  • L1nf3ng/BlogSpider: a spider crawles for new blogs
  • asharron/BlogCrawler: A web crawler that helps professors read their students blogs by aggregating it into one place.
  • DevblogSearch/blogCrawling
  • oxcow/BlogCrawler: Sina Blog Or Qzone Blog etc Blog Article Crawler
  • LzzZzzB/URL-team-blog_spider: crawl article name and url
  • superman-wrdh/crawl_blog: crawal blog https://66super.com
  • netqyq/blog-crawler: crawl web sites use scrapy(demo)
  • galacticsurfer/blog-crawler: Basic crawler written in python using scrapy framework
  • isidsea/blogspot_crawler: Just simple Blogspot crawler
  • hbprotoss/blog_crawler
  • HemingwayLee/blogger-crawler
  • TongWee/Blog_Crawler: Learning scrapy
  • Elkiscoming/blog-crawler: Crawler to find blog networks
  • frostyplanet/blogbus_crawler
  • linking123/crawlBlogInfo2MySQL: crawl name, url, readers num, comments num of one blog.
  • renruoxu/academic_crawler: Academic paper and blog crawler
  • edeas123/blogcrawler: Microservice to create, schedule and run blog crawls
  • junglestory/scrape_blog_crawler
  • changyy/blogger-title-link-crawler
  • geekypunk/CommonCrawlURLIndex-Blogger: CommonCrawlURLIndex-Blogger
  • lixi5338619/scrapy_boke: Use Scrapy frameworks to crawl blog content
  • kyujin-cho/pyNaverBlogCrawler
  • shunk031/research-blogging-crawler
  • WandyLau/myscrapy: To crawl the blog of Aaron Swartz.
  • KwinnerChen/blog_crawler_powered_by_scrapy
  • vandana-iyer/website-crawler: A python crawler that crawls wordpress travel blogs
  • shaoyikai/blogspider: You can crawling data from a blog web site by this spider engine easily

Scrapy Concepts

Cloud <-- Spiders Crawl the Web
             v
           Item Pipeline ---------> Feed exporter

Scrapy Engine = configured by a "Project" (settings and spiders in subdirs)
    scrapy crawl <spider_name>   # takes the spider's name attribute, not the project name


                    <--- Scheduler ---->
            Scrapy Engine Controller (req/response crawl Web Links)
Internet -<crawl/downloader> REQ--> [Data Responses] --> Spider

                    <--- Scheduler ---->
Spider Data digests [ITEMS .. ] ----> <-- ItemPipeline --->
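The data flow in the diagrams above can be imitated with a toy model in plain Python. This is only a sketch of the roles (scheduler, downloader, spider, item pipeline, feed exporter), not Scrapy's actual engine, which is asynchronous and runs on Twisted. The `FAKE_WEB` dict and every function name here are made up for illustration.

```python
from collections import deque

# Fake "web": URL -> (page text, outgoing links). Entirely made up.
FAKE_WEB = {
    "/": ("home", ["/a", "/b"]),
    "/a": ("page a", ["/b"]),
    "/b": ("page b", ["/"]),
}


def download(url):
    """Downloader stand-in: returns the response for a request."""
    return FAKE_WEB[url]


def spider_parse(url, response):
    """Spider stand-in: yields one item, then a request per outgoing link."""
    text, links = response
    yield {"url": url, "text": text}
    for link in links:
        yield link  # a plain string plays the role of a Request


def pipeline(item):
    """Item pipeline stand-in: post-processes each item."""
    item["text"] = item["text"].upper()
    return item


def crawl(start_url):
    scheduler = deque([start_url])  # queue of pending requests
    seen = {start_url}              # duplicate filter
    exported = []                   # feed exporter stand-in
    while scheduler:
        url = scheduler.popleft()
        for result in spider_parse(url, download(url)):
            if isinstance(result, dict):   # items go to the pipeline
                exported.append(pipeline(result))
            elif result not in seen:       # new requests go back to the scheduler
                seen.add(result)
                scheduler.append(result)
    return exported


items = crawl("/")
print(items)
```

Each page is visited once even though the links form a cycle, because the `seen` set filters duplicate requests before they reach the scheduler.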

AR

Scrapy is written in pure Python and depends on a few key Python packages (among others):

  • lxml, an efficient XML and HTML parser
  • parsel, an HTML/XML data extraction library written on top of lxml
  • w3lib, a multi-purpose helper for dealing with URLs and web page encodings
  • twisted, an asynchronous networking framework
  • cryptography and pyOpenSSL, to deal with various network-level security needs

5a: Py Backend, Scraping, Parsing

  • Go to http://w1.weather.gov/xml/current_obs/ and get the weather report in XML, or go to http://www.nasa.gov/content/nasa-rss-feeds, or look for the Billboard top 100 songs in XML, or anything similar. Download it, then play with parsing it into something interesting to you. Later, figure out how to retrieve it directly from the internet in your script. Come up with interesting ways of displaying the info, or save it in a format such as HTML that you can view with a browser.

  • Try a WSGI web services model between stages 2 and 3, such as Bottle or Gunicorn, or Google App Engine. See http://stackoverflow.com/questions/26362532/bottle-with-gunicorn and http://blog.yprez.com/running-a-bottle-app-with-gunicorn.html for how easy it makes web services. Bots are nice, but they are less practical than server software for about the same range of problems you might want to solve.
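The two exercises above can be sketched together: parse an XML weather report, render it as HTML, and expose the result through a WSGI callable. This is a minimal sketch using only the standard library. The `SAMPLE_XML` document is made up and simplified, so its element names are assumptions about what a real feed might contain, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# Made-up sample. A real feed would be downloaded first and may use
# different element names.
SAMPLE_XML = """<current_observation>
  <location>Example Airport</location>
  <weather>Fair</weather>
  <temp_f>72.0</temp_f>
</current_observation>"""


def parse_observation(xml_text):
    """Turn the XML document into a flat dict of tag -> text."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


def render_html(obs):
    """Render the observation as a small HTML list, escaping all values."""
    rows = "".join(f"<li>{escape(k)}: {escape(v)}</li>" for k, v in obs.items())
    return f"<ul>{rows}</ul>"


def app(environ, start_response):
    """WSGI application: a WSGI server calls this once per request."""
    body = render_html(parse_observation(SAMPLE_XML)).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


# Call the app by hand, the way a WSGI server would.
status_seen = []
response = app(
    {"REQUEST_METHOD": "GET", "PATH_INFO": "/"},
    lambda status, headers: status_seen.append(status),
)
print(status_seen[0], b"".join(response).decode("utf-8"))
```

Because `app` follows the WSGI calling convention, it could be served locally with `wsgiref.simple_server.make_server`, or, with Gunicorn installed and the code saved as a hypothetical weather_app.py, with `gunicorn weather_app:app`.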

A fast high-level screen scraping and web crawling framework for Python. https://github.com/scrapy/scrapy

Scrape, Rating, Agg Tech Top Feeds/Sources of info

Personalized Skins, CSS, Readability

Bensign/tweetmeme: A simpler, more delightful way of viewing Tweets on Techmeme

