Subject : Re: Emigration from Usenet [was: Re: PTD was the most-respected of the AUE regulars ...]
From : not (at) *nospam* telling.you.invalid (Computer Nerd Kev)
Newsgroups : comp.misc
Date : 26. Jul 2024, 13:18:48
Organization : Ausics - https://newsgroups.ausics.net
Message-ID : <66a39428@news.ausics.net>
References : 1 2 3 4 5 6 7 8 9 10 11 12 13 14
User-Agent : tin/2.0.1-20111224 ("Achenvoir") (UNIX) (Linux/2.4.31 (i686))
D <nospam@example.net> wrote:
> Read-only sounds very simple. I usually scrape in Python with the
> requests library and the Beautiful Soup library. A simple scraping
> loop could look like this (modify per web board of course):
> import requests
> from bs4 import BeautifulSoup
>
> email_body = ""
> for page in range(100, 150):
>     html = requests.get("https://www.svt.se/text-tv/" + str(page))
>     soup = BeautifulSoup(html.text, 'html.parser')
>     # The div holding the page text (class name is specific to svt.se)
>     div_bs4 = soup.find('div', {"class": "Content_screenreaderOnly__3Cnkp"})
>     try:
>         email_body += div_bs4.string + "\n"
>     except AttributeError:
>         # soup.find() returned None (page missing or layout changed); skip it
>         pass
> So basically a range of pages, then loop over those pages.
You need to sync it to the messages in the forum index though,
otherwise when they get a spam flood of messages that the admin
deletes, or the thread counter jumps around for some other reason,
the scraper is stuck looking for the next 25 threads after the last
one it saw when it actually needs to jump forwards by 150. I guess
you could interpret the deleted-thread pages and crawl through them,
but then you need the crawler to remember the gap that was left so
it doesn't forget to check for new posts in the threads from before
the spam flood.
So even if it's possible to iterate over threads that way on all
forum platforms (which I'm not sure about), I think it would be
more reliable in the long run to parse the index pages to determine
which threads to retrieve. Also less risk of getting blocked by web
servers for too many requests.
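
Using your requests/BeautifulSoup approach, the index-driven version
I'm imagining might look roughly like this sketch - the forum URL,
the class on the thread links, and the state file name are all just
placeholders, so any real forum would need its own values and parsing:

import json
import os

import requests
from bs4 import BeautifulSoup

STATE_FILE = "seen_threads.json"  # remembers threads already fetched

# Thread URLs seen on previous runs
seen = {}
if os.path.exists(STATE_FILE):
    with open(STATE_FILE) as f:
        seen = json.load(f)

# Fetch the forum index and pull the thread links out of it
# (URL and class name are made up; each forum needs its own)
index = requests.get("https://forum.example.com/index.php")
soup = BeautifulSoup(index.text, "html.parser")

for link in soup.find_all("a", {"class": "thread-title"}):
    url = link.get("href")
    if url and url not in seen:
        thread = requests.get(url)
        # ... parse thread.text and store/format the posts here ...
        seen[url] = True

# Save state so deleted threads or counter jumps can't strand the scraper
with open(STATE_FILE, "w") as f:
    json.dump(seen, f)

Only threads that actually appear on the index get fetched, so gaps
left by deleted spam just drop out of the list instead of stalling
the whole crawl.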
But thanks for the example. I'm not really sure whether an HTML
parser library would be helpful or just a pointless extra layer
of complexity. So far I've just used regular expressions for
scraping webpages. I was thinking along the lines of a template
system defining strings that mark the start/end of each field (and
any key features in between), ideally allowing new forum parsers to
be added without needing to touch the code. There must be things
like that around already...
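
For the template idea, I'm picturing something roughly like this -
the field names and the start/end markers below are invented just
to show the shape of it, and each forum would get its own table,
ideally loaded from a file rather than written into the code:

import re

# A per-forum "template": for each field, the literal strings that
# appear just before and just after it in the page source.
# (Field names and markers are made up; a real forum needs its own.)
TEMPLATE = {
    "author": ('<span class="username">', '</span>'),
    "date":   ('<time datetime="',        '"'),
    "body":   ('<div class="post-body">', '</div>'),
}

def extract_posts(html, template):
    """Pull out every post's fields using only the start/end markers."""
    fields = {}
    for name, (start, end) in template.items():
        # Non-greedy match between the two literal markers
        pattern = re.escape(start) + r"(.*?)" + re.escape(end)
        fields[name] = re.findall(pattern, html, re.DOTALL)
    # Zip the per-field lists back into one dict per post
    return [dict(zip(fields, values)) for values in zip(*fields.values())]

# e.g. extract_posts(page_html, TEMPLATE)
#   -> [{'author': ..., 'date': ..., 'body': ...}, ...]

It falls over as soon as a post is missing one of the fields, but
adding a new forum then only means writing a new table of markers
rather than new parsing code.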
Perhaps I'm determined to make it hard for myself, but if it broke
all the time and was complicated to fix, then that would be worse.
Anyhow, now I've got onto thinking about that, I've wasted all the
time I was actually going to spend finishing a PHP static site
generator to format data that I scraped off a website last week.
That seemed simple at first too...
--
 __ __
#_ < |\| |< _#