A tool for scraping files from imageboards’ threads.

Changelog

0.5.1 - 2021-05-04

Added

  • Message when a file cannot be retrieved.

Fixed

  • Removed the redundant hash comparison when files have the same name;
  • A string that was not marked as an f-string, so the reason why a thread wasn't found is now displayed.

0.5.0 - 2021-05-03

Added

  • The program now makes use of the skip_posts argument. Use the CLI option -S <number> or --skip-posts <number> to set how many posts to skip.
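
A minimal sketch of how such an option could be wired up with argparse. Only the flag names -S and --skip-posts and the skip_posts argument come from this changelog; the parser name, the default of 0, and the helper posts_to_process are assumptions, as is the choice to count the opening post among the skipped ones.

```python
import argparse
from typing import List, Sequence


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only -S/--skip-posts is taken from the changelog.
    parser = argparse.ArgumentParser(prog="scraper-sketch")
    parser.add_argument(
        "-S", "--skip-posts",
        type=int, default=0, metavar="NUMBER",
        help="how many posts to skip before scraping files",
    )
    return parser


def posts_to_process(posts: Sequence[str], skip_posts: int) -> List[str]:
    # Assumption: skipping simply drops the first skip_posts posts.
    return list(posts[skip_posts:])


args = build_parser().parse_args(["-S", "2"])
remaining = posts_to_process(["op", "reply1", "reply2", "reply3"], args.skip_posts)
```

argparse derives the attribute name skip_posts from the long option automatically, so no explicit dest is needed.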

Changed

  • Better, more concise messages;
  • Fixed the inheritance of Scraper's subclasses and rewrote it sanely, which makes future extension easier with far less repetition;
  • Added a general class, TinyboardLikeParser, that implements the post parser for all imageboards based on Tinyboard or those with an identical JSON API. From now on, the names of all such generalisation classes will end with *LikeParser;
  • Changed file_base_url for 8kun.top.
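
A rough sketch of what a *LikeParser generalisation could look like. The names TinyboardLikeParser and file_base_url come from this changelog; everything else is an assumption made for illustration: the constructor, the file_urls method, the ExampleChanParser subclass, the example.invalid host, and the JSON layout (a "posts" list whose items carry "tim", "ext" and optional "extra_files", with files served under /<board>/src/).

```python
from typing import Any, Dict, List


class TinyboardLikeParser:
    """Shared post parsing for boards with the same JSON thread layout."""

    file_base_url = ""  # each concrete parser overrides this

    def __init__(self, board: str, thread_json: Dict[str, Any]) -> None:
        self.board = board
        self.posts = thread_json["posts"]

    def file_urls(self) -> List[str]:
        urls = []
        for post in self.posts:
            # Assumed layout: one file on the post itself, more in "extra_files".
            for item in [post, *post.get("extra_files", [])]:
                if "tim" in item and "ext" in item:
                    urls.append(
                        f"{self.file_base_url}/{self.board}/src/"
                        f"{item['tim']}{item['ext']}"
                    )
        return urls


class ExampleChanParser(TinyboardLikeParser):
    # Hypothetical board: a subclass only has to supply its own base URL.
    file_base_url = "https://example.invalid"


thread = {
    "posts": [
        {"no": 1, "tim": "111", "ext": ".png",
         "extra_files": [{"tim": "222", "ext": ".jpg"}]},
        {"no": 2, "com": "text only"},
    ]
}
urls = ExampleChanParser("b", thread).file_urls()
```

With this shape, changing file_base_url for one board is a one-line edit in its subclass.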

Removed

  • Support for Lolifox, since it's gone.

0.4.1 - 2020-12-08

Fixed

  • Now HTTPException from http.client and URLError from urllib.request are handled;
  • Handling of 2ch.hk's stickers.
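
A minimal sketch of catching both exception types around a download. The function name fetch, its parameters, and the printed message are assumptions, not the project's code. URLError is defined in urllib.error and is also importable from urllib.request.

```python
from http.client import HTTPException
from typing import Optional
from urllib.error import URLError
from urllib.request import Request, urlopen


def fetch(url: str, user_agent: str = "sketch/0.0",
          timeout: float = 10.0) -> Optional[bytes]:
    """Return the response body, or None if the file cannot be retrieved."""
    request = Request(url, headers={"User-Agent": user_agent})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.read()
    except (HTTPException, URLError) as error:
        # Report the failure and let the caller move on to the next file.
        print(f"Cannot retrieve {url}: {error}")
        return None


# A file: URL pointing at a path that does not exist raises URLError.
missing = fetch("file:///nonexistent-dir-for-sketch/none.bin")
```

Returning None instead of raising keeps a batch run going when a single file is gone.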

0.4.0 - 2020-11-18

Added

  • A check for whether a file is a sticker, for 2ch.hk;
  • The encoding for the !op.txt file is now explicitly set to utf-8;
  • Handling of connection errors, so the program no longer crashes if a file doesn't exist or is inaccessible for any other reason; any damaged files that were created are removed;
  • 3 retries if a file was damaged during downloading;
  • Hash matching in the scraper for two files that happen to share the same name and size when the hash reported by an imageboard differs from the file's actual hash. This results in excessive downloading and hash calculations. Hopefully, that is only the case for 2ch.hk.
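
The retry and hash-matching items above can be sketched roughly as follows. This is not the project's code: the function names, the use of MD5, and treating a size mismatch as the sign of a damaged download are all assumptions. The fetch callable is injected so the sketch runs without network access.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, Optional


def md5_hex(data: bytes) -> str:
    # Assumption: the board reports an MD5 hex digest.
    return hashlib.md5(data).hexdigest()


def is_same_file(path: Path, expected_size: int, expected_hash: str) -> bool:
    """True if the file on disk matches by size and by computed hash."""
    if not path.is_file() or path.stat().st_size != expected_size:
        return False
    return md5_hex(path.read_bytes()) == expected_hash


def download_with_retries(fetch: Callable[[], Optional[bytes]], path: Path,
                          expected_size: int, retries: int = 3) -> bool:
    """One attempt plus up to `retries` more; damaged files are removed."""
    for _ in range(1 + retries):
        try:
            data = fetch()
        except OSError:  # URLError is a subclass of OSError
            data = None
        if data is not None and len(data) == expected_size:
            path.write_bytes(data)
            return True
        path.unlink(missing_ok=True)  # drop any partial file (Python 3.8+)
    return False


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "image.png"
    attempts = iter([b"trunc", b"complete-data"])  # first attempt is damaged
    ok = download_with_retries(lambda: next(attempts), target, expected_size=13)
    matches = is_same_file(target, 13, md5_hex(b"complete-data"))
```

Comparing sizes first avoids reading and hashing a file when the cheaper check already rules it out.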

Changed

  • FileInfo class is now a frozen dataclass for memory efficiency.
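
A small sketch of a frozen dataclass. Only the class name FileInfo comes from this changelog; the field names and the example values are assumptions.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class FileInfo:
    # Hypothetical fields; the real class may differ.
    name: str
    size: int
    download_url: str
    hash_value: str


info = FileInfo("cat.png", 127798, "https://example.invalid/cat.png", "abc123")

try:
    info.size = 0  # frozen instances reject attribute assignment
    mutated = True
except FrozenInstanceError:
    mutated = False

# Frozen dataclasses with the default eq=True are hashable,
# so they can be deduplicated in a set.
unique = {info, FileInfo("cat.png", 127798, "https://example.invalid/cat.png", "abc123")}
```

In CPython, frozen=True gives immutability and hashability; per-instance memory is usually reduced by defining __slots__ rather than by freezing.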

Fixed

  • The arguments of the match function that checks for the image.ext pattern were mixed up in places all over the parsers;
  • For 2ch.hk, the check for sub and com was changed to subject and comment.

0.3.0 - 2020-09-09

Added

  • Parser for lolifox.cc.

Removed

  • BasicScraper. It is no longer needed, since there is a faster threaded version.

Fixed

  • Now User-Agent is correctly applied everywhere.

0.2.2 - 2020-07-20

Added

  • Parser for 8kun.top.

Changed

  • The check for whether a site is supported now just looks for a substring;
  • Edited the regex that checks whether a filename is just "image.ext" so that it only checks that "image." is followed by 1 to 4 characters.
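
Both changes can be sketched in a few lines. The exact pattern and the function names are assumptions; the site names are the ones listed as supported in this changelog.

```python
import re
from typing import Optional

# Assumption: "characters" means word characters (letters, digits, underscore).
GENERIC_NAME = re.compile(r"image\.\w{1,4}")

SUPPORTED_SITES = ("4chan.org", "lainchan.org", "2ch.hk", "8kun.top")


def is_generic_name(filename: str) -> bool:
    """True if the filename is just "image." plus 1 to 4 characters."""
    return GENERIC_NAME.fullmatch(filename) is not None


def find_supported_site(url: str) -> Optional[str]:
    """Return the first supported site found as a substring of the URL."""
    return next((site for site in SUPPORTED_SITES if site in url), None)


generic = [name for name in ("image.png", "image.jpeg", "image.backup", "cat.png")
           if is_generic_name(name)]
site = find_supported_site("https://boards.4chan.org/g/thread/1")
```

The first argument of the module-level re.match and re.fullmatch functions is the pattern and the second is the string; calling the method on a compiled pattern, as above, leaves no room to swap them.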

Notes

  • Regarding the size issue on 2ch.hk: it usually does report the size in kB. The problem is that sometimes it is simply wrong.

0.2.1 - 2020-07-18

Changed

  • The program now tells you which thread doesn't exist or is about to be scraped. This is useful for batch processing with scripts.

0.2.0 - 2020-07-18

Added

  • Threaded version of the scraper, so now it is fast as heck!
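
One common way to thread downloads is a thread pool, sketched below. The changelog does not say how the threaded scraper is built, so the use of ThreadPoolExecutor, the function name scrape_threaded, and the worker count are assumptions. The download callable is injected so the sketch runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, TypeVar

T = TypeVar("T")


def scrape_threaded(urls: Iterable[str], download: Callable[[str], T],
                    max_workers: int = 4) -> Dict[str, T]:
    """Run download(url) for every URL on a pool of worker threads."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as url_list.
        return dict(zip(url_list, pool.map(download, url_list)))


# Stand-in for a real download: return the length of each URL.
results = scrape_threaded(["a", "bb", "ccc"], len)
```

Threads suit this workload because downloads spend most of their time waiting on the network rather than running Python code.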

Fixed

  • Handled the situation where the OP's post has no comment and/or subject.

0.1.0 - 2020-07-08

Added

  • JSON parsers for 4chan.org, lainchan.org and 2ch.hk.
  • Basic straightforward scraper that downloads files one by one.

Issues

  • 2ch.hk: I can't figure out what exactly it reports as the size and hash of a file. Example: a file may have a size of 127798 bytes (125K), but 2ch reports 150, and the reported hash doesn't equal the computed one.