Compare commits

52 Commits

Author SHA1 Message Date
Alexander Andreev 43909c2b29
Changelog updated with 0.5.1 changes. 1 year ago
Alexander Andreev acbfaefa9c
Version changed to 0.5.1 in a Makefile. 1 year ago
Alexander Andreev 86ef44aa07
Version changed to 0.5.1. 1 year ago
Alexander Andreev 419fb2b673
Removed excessive comparison of hash. Added message when file cannot be retrieved. 1 year ago
Alexander Andreev 0287d3a132
Turned a string into f-string. 1 year ago
Alexander Andreev 245e33f40d
README updated. lolifox.cc removed. Option --skip-posts added. 1 year ago
Alexander Andreev e092c905b2
Makefile updated to version 0.5.0. 1 year ago
Alexander Andreev 90338073ed
Updated CHANGELOG with version 0.5.0. 1 year ago
Alexander Andreev cdcc184de8
Lolifox removed. Development Status classifier is changed to Alpha. Python 3.7 classifier left to represent oldest supported version. 1 year ago
Alexander Andreev b335891097
Copyright, date, and version are updated. 1 year ago
Alexander Andreev 1213cef776
Lolifox removed. Added skip_posts handling. 1 year ago
Alexander Andreev 78d4a62c17
IB parsers rewritten accordingly to fixed Parser class. 1 year ago
Alexander Andreev f3ef07af68
Rewrite of Parser class because it was fucked up. Now there's no problems with inheritance and its subclasses now more pleasant to write. ThreadNotFoundError now has a reason field. 1 year ago
Alexander Andreev 6373518dc3
Added order=True for FIleInfo to make sure that order of fields is preserved. 1 year ago
Alexander Andreev caf18a1bf0
Added option --skip-posts and messages are now takes just one line. 1 year ago
Alexander Andreev 751549f575
A new generalised class for all imageboards based on Tinyboard or having identical API. 1 year ago
Alexander Andreev 38b5740d73
Removing lolifox.cc parser because this board is dead. 1 year ago
Alexander Andreev 2f9d26427c
Now incrementing _files_downloaded happens when _progress_callback is set. And made super() with no args. 1 year ago
Alexander Andreev e7cf2e7c4b
Added a missing return True statement in _check_file 1 year ago
Alexander Andreev 4f6f56ae7b
Version in a Makefile is changed to 0.4.1. 1 year ago
Alexander Andreev 503eb9959b
Version updated to 0.4.1. 1 year ago
Alexander Andreev cb2e0d77f7
Changelog update for 0.4.1. 1 year ago
Alexander Andreev 93e442939a
Dvach's stickers handling. 1 year ago
Alexander Andreev 6022c9929a
Added HTTP and URL exceptions handling. 1 year ago
Alexander Andreev f79abcc310 In classifiers licence was fixed and added more topics related to a program. 2 years ago
Alexander Andreev 9cdb510325 A little fix for README. 2 years ago
Alexander Andreev 986fdbe7a7 Handling of no arguments passed. 2 years ago
Alexander Andreev 2e6352cb13 Updated changelog. 2 years ago
Alexander Andreev 7b2fcf0899 Improved error handling, retries for damaged files. 2 years ago
Alexander Andreev 21837c5335 Updated changelog. 2 years ago
Alexander Andreev b970973018 ConnectionResetError handling. 2 years ago
Alexander Andreev 6dab626084 Version is changed to 0.4.0. 2 years ago
Alexander Andreev 86b6278657 Updated changelog and readme. 2 years ago
Alexander Andreev 7754a90313 FileInfo is now a frozen dataclass for efficiency. 2 years ago
Alexander Andreev bb47b50c5f _is_file_ok now is _check_file and modified to be more efficient. Also added check for if files happened to share same name and size, but IB said wrong hash. 2 years ago
Alexander Andreev 8403fcf0f2 Now op file is explicitly in utf-8. 2 years ago
Alexander Andreev 647a787974 FIxed arguments for a match function. 2 years ago
Alexander Andreev 6a54b88498 sub and com ->subject and comment. Fixed arguments for match function. 2 years ago
Alexander Andreev 2043fc277f No right to fuck up! Shit... Forgot third part of a version. 2 years ago
Alexander Andreev a106d5b739 Added support for lolifox.cc. Fixed User-Agent usage, so it applied correctly everywhere now. 2 years ago
Alexander Andreev 7825b53121 Did a minor refactoring. Also combined two first lines that are printed for a thread into one. 2 years ago
Alexander Andreev b26152f3ca Moved User-Agent off to __init__ in its own variable. 2 years ago
Alexander Andreev 9ad9fcfd6f Added supported IBs to readme. 2 years ago
Alexander Andreev 2fcd4f0aa7 Updated usage, so I don't have to edit it every time I add a new IB. 2 years ago
Alexander Andreev bfaa9d2778 Reduced summary. Changed URL. Edited keywords to actual domains. 2 years ago
Alexander Andreev 371c6623e9 Updated changelog. 2 years ago
Alexander Andreev 520d88c76a Parser for 8kun.top added. And I changed compares in __init__. 2 years ago
Alexander Andreev 93d2904a4f Regex limited to up to 4 characters after first dot occured. 2 years ago
Alexander Andreev 6df9e573aa Updated version to 0.2.2 2 years ago
Alexander Andreev f21ff0aff5 Oh, fuck me. What a typo... xD 2 years ago
Alexander Andreev c0282f3934 Changelog updated. 2 years ago
Alexander Andreev 4db2e1dc75 A little change of output. 2 years ago
  1. 87
      CHANGELOG.md
  2. 2
      Makefile
  3. 31
      README.md
  4. 11
      scrapthechan/__init__.py
  5. 82
      scrapthechan/cli/scraper.py
  6. 32
      scrapthechan/fileinfo.py
  7. 89
      scrapthechan/parser.py
  8. 30
      scrapthechan/parsers/__init__.py
  9. 53
      scrapthechan/parsers/dvach.py
  10. 25
      scrapthechan/parsers/eightkun.py
  11. 46
      scrapthechan/parsers/fourchan.py
  12. 63
      scrapthechan/parsers/lainchan.py
  13. 51
      scrapthechan/parsers/tinyboardlike.py
  14. 204
      scrapthechan/scraper.py
  15. 15
      scrapthechan/scrapers/basicscraper.py
  16. 39
      scrapthechan/scrapers/threadedscraper.py
  17. 22
      setup.cfg

87
CHANGELOG.md

@ -1,5 +1,92 @@
# Changelog
## 0.5.1 - 2021-05-04
### Added
- Message when a file cannot be retrieved.
### Fixed
- Removed excessive hash comparison when files have the same name;
- A string that was meant to be an f-string is now one, so the reason why a
thread wasn't found is displayed.
## 0.5.0 - 2021-05-03
### Added
- The program now makes use of the skip_posts argument. Use CLI option
`-S <number>` or `--skip-posts <number>` to set how many posts you want to skip.
### Changed
- Better, minified messages;
- Fixed inheritance of `Scraper`'s subclasses; the sane rewrite makes future
extension easy with far less repetition;
- Added a general class `TinyboardLikeParser` that implements a post parser for
all imageboards based on Tinyboard or the ones that have an identical JSON API.
From now on all such generalisation classes will end with `*LikeParser`;
- Changed `file_base_url` for 8kun.top.
### Removed
- Support for Lolifox, since it's gone.
## 0.4.1 - 2020-12-08
### Fixed
- Now HTTPException from http.client and URLError from urllib.request
are handled;
- 2ch.hk's stickers handling.
## 0.4.0 - 2020-11-18
### Added
- For 2ch.hk, a check for whether a file is a sticker was added;
- Encoding for `!op.txt` file was explicitly set to `utf-8`;
- Handling of connection errors was added, so the program won't crash if a file
doesn't exist or is not accessible for any other reason, and any damaged files
that were created will be removed;
- Added 3 retries if a file was damaged during downloading;
- The scraper now compares hashes of two files that happen to share the same
name and size when the hash reported by an imageboard differs from that of
the file. This results in excessive downloading and hash calculations.
Hopefully, that is only the case for 2ch.hk.
### Changed
- FileInfo class is now a frozen dataclass for memory efficiency.
### Fixed
- Found that the arguments for the match function that matches the `image.ext`
pattern were swapped all over the parsers;
- Also, for 2ch.hk the checks for `sub` and `com` were changed to `subject` and
`comment`.
## 0.3.0 - 2020-09-09
### Added
- Parser for lolifox.cc.
### Removed
- BasicScraper. Not needed anymore, there is a faster threaded version.
### Fixed
- Now User-Agent is correctly applied everywhere.
## 0.2.2 - 2020-07-20
### Added
- Parser for 8kun.top.
### Changed
- Checking whether a site is supported is now done by just looking for a
substring.
- Edited the regex that checks if a filename is just an "image.ext" so it only
checks that 1 to 4 characters follow "image.".
### Notes
- Consider that issue with size on 2ch.hk. Usually it really tells the size in
kB. The problem is that sometimes it's just wrong.
## 0.2.1 - 2020-07-18
### Changed
- Now the program tells you which thread doesn't exist or is about to be
scraped. That is useful in batch processing with scripts.
## 0.2.0 - 2020-07-18
### Added
- Threaded version of the scraper, so now it is fast as heck!
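
The 0.2.2 and 0.4.0 entries above both concern the check for generic `image.ext` filenames. Below is a minimal sketch of such a check, assuming the pattern is limited to 1 to 4 characters as the 0.2.2 entry describes. The pattern string and the `is_generic_name` helper are illustrative and not taken from the program's source. Note that `re.match` takes the pattern first and the string second, which is the argument order the 0.4.0 fix refers to.

```python
from re import match

# Assumed pattern: "image." followed by 1 to 4 word characters, nothing else.
GENERIC_NAME = r"^image\.\w{1,4}$"


def is_generic_name(filename: str) -> bool:
    """Hypothetical helper: True if filename is just "image.<ext>"."""
    # re.match(pattern, string): the pattern goes first.
    return match(GENERIC_NAME, filename) is not None
```

With the arguments swapped, the filename would be compiled as a pattern and the check would almost never succeed, which is presumably why the mix-up went unnoticed for a while.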

2
Makefile

@ -1,7 +1,7 @@
build: scrapthechan README.md setup.cfg
python setup.py sdist bdist_wheel
install:
python -m pip install --upgrade dist/scrapthechan-0.2.0-py3-none-any.whl --user
python -m pip install --upgrade dist/scrapthechan-0.5.1-py3-none-any.whl --user
uninstall:
# We change directory so pip uninstall will run, it'll fail otherwise.
@cd ~/

31
README.md

@ -1,8 +1,8 @@
This is a tool for scraping files from imageboards' threads.
It extracts the files from a JSON version of a thread. And then downloads 'em
in a specified output directory or if it isn't specified then creates following
directory hierarchy in a working directory:
It extracts the files from a JSON representation of a thread. And then downloads
'em in a specified output directory or if it isn't specified then creates
following directory hierarchy in a working directory:
<imageboard name>
|-<board name>
@ -24,9 +24,24 @@ separately. E.g. `4chan b 1100500`.
`-o`, `--output-dir` -- output directory where all files will be dumped to.
`--no-op` -- by default OP's post will be saved in a `!op.txt` file. This flag
disables this behaviour. I desided to put an `!` in a name so this file will be
on the top in a directory listing.
`-N`, `--no-op` -- by default OP's post will be saved in a `!op.txt` file. This
flag disables this behaviour. An exclamation mark `!` in a name is for so this
file will be on the top of a directory listing.
`-v`, `--version` prints the version of the program, and `-h`, `--help` prints
help for a program.
`-S <num>`, `--skip-posts <num>` -- skip given number of posts.
`-v`, `--version` prints the version of the program.
`-h`, `--help` prints help for a program.
# Supported imageboards
- [4chan.org](https://4chan.org) since 0.1.0
- [lainchan.org](https://lainchan.org) since 0.1.0
- [2ch.hk](https://2ch.hk) since 0.1.0
- [8kun.top](https://8kun.top) since 0.2.2
# TODO
- Sane rewrite of a program;
- Thread watcher.

11
scrapthechan/__init__.py

@ -1,13 +1,16 @@
__date__ = "18 Jule 2020"
__version__ = "0.2.0"
__date__ = "4 May 2021"
__version__ = "0.5.1"
__author__ = "Alexander \"Arav\" Andreev"
__email__ = "me@arav.top"
__copyright__ = f"Copyright (c) 2020 {__author__} <{__email__}>"
__copyright__ = f"Copyright (c) 2020,2021 {__author__} <{__email__}>"
__license__ = \
"""This program is licensed under the terms of the MIT license.
For a copy see COPYING file in a directory of the program, or
see <https://opensource.org/licenses/MIT>"""
USER_AGENT = f"ScrapTheChan/{__version__}"
VERSION = \
f"ScrapTheChan ver. {__version__} ({__date__})\n\n{__copyright__}\n"\
f"ScrapTheChan ver. {__version__} ({__date__})\n{__copyright__}\n"\
f"\n{__license__}"

82
scrapthechan/cli/scraper.py

@ -3,30 +3,30 @@ from os import makedirs
from os.path import join, exists
from re import search
from sys import argv
from typing import List
from typing import List, Optional
from scrapthechan import VERSION
from scrapthechan.parser import Parser, ParserThreadNotFoundError
from scrapthechan.parser import Parser, ThreadNotFoundError
from scrapthechan.parsers import get_parser_by_url, get_parser_by_site, \
SUPPORTED_IMAGEBOARDS
#from scrapthechan.scrapers.basicscraper import BasicScraper
from scrapthechan.scrapers.threadedscraper import ThreadedScraper
__all__ = ["main"]
USAGE = \
"""Usage: scrapthechan [OPTIONS] (URL | IMAGEBOARD BOARD THREAD)
USAGE: str = \
f"""Usage: scrapthechan [OPTIONS] (URL | IMAGEBOARD BOARD THREAD)
Options:
\t-h,--help -- print this help and exit;
\t-v,--version -- print program's version and exit;
\t-o,--output-dir -- directory where to place scraped files. By default
\t following structure will be created in current directory:
\t <imageboard>/<board>/<thread>;
\t-N,--no-op -- by default OP's post will be written in !op.txt file. This
\t option disables this behaviour;
\t-h,--help -- print this help and exit;
\t-v,--version -- print program's version and exit;
\t-o,--output-dir -- directory where to place scraped files. By default
\t following structure will be created in current directory:
\t <imageboard>/<board>/<thread>;
\t-N,--no-op -- by default OP's post will be written in !op.txt file. This
\t option disables this behaviour;
\t-S,--skip-posts <num> -- skip given number of posts.
Arguments:
\tURL -- URL of a thread;
@ -34,19 +34,19 @@ Arguments:
\tBOARD -- short name of a board. E.g. b;
\tTHREAD -- ID of a thread. E.g. 100500.
Supported imageboards: 4chan.org, 2ch.hk, lainchan.org.
Supported imageboards: {', '.join(SUPPORTED_IMAGEBOARDS)}.
"""
def parse_common_arguments(args: str) -> dict:
r = r"(?P<help>-h|--help)|(?P<version>-v|--version)"
argd = search(r, args)
if not argd is None:
argd = argd.groupdict()
return {
"help": not argd["help"] is None,
"version": not argd["version"] is None }
return None
def parse_common_arguments(args: str) -> Optional[dict]:
r = r"(?P<help>-h|--help)|(?P<version>-v|--version)"
args = search(r, args)
if not args is None:
args = args.groupdict()
return {
"help": not args["help"] is None,
"version": not args["version"] is None }
return None
def parse_arguments(args: str) -> dict:
rlink = r"^(https?:\/\/)?(?P<site>[\w.-]+)[ \/](?P<board>\w+)(\S+)?[ \/](?P<thread>\w+)"
@ -54,15 +54,21 @@ def parse_arguments(args: str) -> dict:
if not link is None:
link = link.groupdict()
out_dir = search(r"(?=(-o|--output-dir) (?P<outdir>\S+))", args)
skip_posts = search(r"(?=(-S|--skip-posts) (?P<skip>\d+))", args)
return {
"site": None if link is None else link["site"],
"board": None if link is None else link["board"],
"thread": None if link is None else link["thread"],
"skip-posts": None if skip_posts is None else int(skip_posts.group('skip')),
"no-op": not search(r"-N|--no-op", args) is None,
"output-dir": None if out_dir is None \
else out_dir.groupdict()["outdir"] }
def main() -> None:
if len(argv) == 1:
print(USAGE)
exit()
cargs = parse_common_arguments(' '.join(argv[1:]))
if not cargs is None:
if cargs["help"]:
@ -79,19 +85,22 @@ def main() -> None:
exit()
try:
parser = get_parser_by_site(args["site"], args["board"], args["thread"])
if not args["skip-posts"] is None:
parser = get_parser_by_site(args["site"], args["board"],
args["thread"], args["skip-posts"])
else:
parser = get_parser_by_site(args["site"], args["board"],
args["thread"])
except NotImplementedError as ex:
print(f"{str(ex)}.")
print(f"Supported image boards are {', '.join(SUPPORTED_IMAGEBOARDS)}")
exit()
except ParserThreadNotFoundError:
print(f"Thread is no longer exist.")
except ThreadNotFoundError as e:
print(f"Thread {args['site']}/{args['board']}/{args['thread']} " \
f"not found. Reason: {e.reason}")
exit()
flen = len(parser.files)
print(f"There are {flen} files in this thread.")
files_count = len(parser.files)
if not args["output-dir"] is None:
save_dir = args["output-dir"]
@ -99,25 +108,26 @@ def main() -> None:
save_dir = join(parser.imageboard, parser.board,
parser.thread)
print(f"They will be saved in {save_dir}.")
print(f"{files_count} files in " \
f"{args['site']}/{args['board']}/{args['thread']}. " \
f"They're going to {save_dir}. ", end="")
makedirs(save_dir, exist_ok=True)
if not args["no-op"]:
print("Writing OP... ", end='')
if parser.op is None:
print("No text's there.")
print("OP's empty.")
elif not exists(join(save_dir, "!op.txt")):
with open(join(save_dir, "!op.txt"), 'w') as opf:
with open(join(save_dir, "!op.txt"), 'w', encoding='utf-8') as opf:
opf.write(f"{parser.op}\n")
print("Done.")
print("OP's written.")
else:
print("Exists.")
print("OP exists.")
scraper = ThreadedScraper(save_dir, parser.files, \
lambda i: print(f"{i}/{flen}", end="\r"))
lambda i: print(f"{i}/{files_count}", end="\r"))
scraper.run()
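
The option handling above works on the joined argument string rather than on `argv` items. A small sketch of how the `--skip-posts` value can be pulled out that way is given below; it reuses the regular expression from the diff, while the `parse_skip_posts` function name is an assumption made for illustration.

```python
from re import search
from typing import Optional


def parse_skip_posts(args: str) -> Optional[int]:
    """Hypothetical helper: extract the -S/--skip-posts value, if any."""
    # Groups inside a lookahead still capture, so the number is available
    # through the named group even though the match itself is empty.
    found = search(r"(?=(-S|--skip-posts) (?P<skip>\d+))", args)
    return None if found is None else int(found.group("skip"))
```

For example, `parse_skip_posts("4chan b 1100500 -S 10")` yields the integer 10, and a string without the option yields `None`, which is the value `main()` checks before passing `skip_posts` to the parser.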

32
scrapthechan/fileinfo.py

@ -1,23 +1,23 @@
"""FileInfo object stores all needed information about a file."""
"""FileInfo object stores information about a file."""
from dataclasses import dataclass
__all__ = ["FileInfo"]
@dataclass(frozen=True, order=True)
class FileInfo:
"""Stores all needed information about a file.
"""Stores information about a file.
Arguments:
- `name` -- name of a file;
- `size` -- size of a file;
- `dlurl` -- full download URL for a file;
- `hash_value` -- hash sum of a file;
- `hash_algo` -- hash algorithm used (e.g. md5).
"""
def __init__(self, name: str, size: int, dlurl: str,
hash_value: str, hash_algo: str) -> None:
self.name = name
self.size = size
self.dlurl = dlurl
self.hash_value = hash_value
self.hash_algo = hash_algo
Fields:
- `name` -- name of a file;
- `size` -- size of a file;
- `download_url` -- full download URL for a file;
- `hash_value` -- hash sum of a file;
- `hash_algorithm` -- hash algorithm used (e.g. md5).
"""
name: str
size: int
download_url: str
hash_value: str
hash_algorithm: str
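
As a quick illustration of what the two decorator flags do, here is a sketch that redeclares the class with the fields from the diff; the sample values and example.org URLs are made up. `frozen=True` forbids assignment after construction, and `order=True` generates comparison methods that compare instances like tuples of their fields. Field order itself follows the declaration order with or without `order=True`.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True, order=True)
class FileInfo:
    name: str
    size: int
    download_url: str
    hash_value: str
    hash_algorithm: str


# Sample values are invented for the demonstration.
a = FileInfo("a.png", 10, "https://example.org/a.png", "abc", "md5")
b = FileInfo("b.png", 5, "https://example.org/b.png", "def", "md5")

# order=True: comparison goes field by field, so `name` decides here.
first = min(a, b)

# frozen=True: assigning to a field raises FrozenInstanceError.
try:
    a.size = 1
    mutated = True
except FrozenInstanceError:
    mutated = False
```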

89
scrapthechan/parser.py

@ -4,16 +4,22 @@ from itertools import chain
from json import loads
from re import findall, match
from typing import List, Optional
from urllib.request import urlopen, urlretrieve
from urllib.request import urlopen, Request, HTTPError
from scrapthechan import USER_AGENT
from scrapthechan.fileinfo import FileInfo
__all__ = ["Parser", "ParserThreadNotFoundError"]
__all__ = ["Parser", "ThreadNotFoundError"]
class ParserThreadNotFoundError(Exception):
pass
class ThreadNotFoundError(Exception):
def __init__(self, reason: str = ""):
self._reason = reason
@property
def reason(self) -> str:
return self._reason
class Parser:
@ -24,28 +30,42 @@ class Parser:
Arguments:
board -- is a name of a board on an image board;
thread -- is a name of a thread inside a board;
posts -- is a list of posts in form of dictionaries exported from a JSON;
thread -- is an id of a thread inside a board;
skip_posts -- number of posts to skip.
All the extracted files will be stored as the `FileInfo` objects."""
__url_thread_json: str = "https://example.org/{board}/{thread}.json"
__url_file_link: str = None
def __init__(self, board: str, thread: str, posts: List[dict],
def __init__(self, board: str, thread: str,
skip_posts: Optional[int] = None) -> None:
self._board = board
self._thread = thread
self._op_post = posts[0]
if not skip_posts is None:
posts = posts[skip_posts:]
self._board: str = board
self._thread: str = thread
self._posts = self._extract_posts_list(self._get_json())
self._op_post: dict = self._posts[0]
self._posts = self._posts[skip_posts:] if not skip_posts is None else self._posts
self._files = list(chain.from_iterable(filter(None, \
map(self._parse_post, posts))))
map(self._parse_post, self._posts))))
@property
def json_thread_url(self) -> str:
raise NotImplementedError
@property
def file_base_url(self) -> str:
raise NotImplementedError
@property
def subject_field(self) -> str:
return "sub"
@property
def comment_field(self) -> str:
return "com"
@property
def imageboard(self) -> str:
"""Returns image board's name."""
return NotImplementedError
raise NotImplementedError
@property
def board(self) -> str:
@ -61,21 +81,40 @@ class Parser:
def op(self) -> str:
"""Returns OP's post as combination of subject and comment separated
by a new line."""
raise NotImplementedError
op = ""
if self.subject_field in self._op_post:
op = f"{self._op_post[self.subject_field]}\n"
if self.comment_field in self._op_post:
op += self._op_post[self.comment_field]
return op if not op == "" else None
@property
def files(self) -> List[FileInfo]:
"""Returns a list of retrieved files as `FileInfo` objects."""
return self._files
def _get_json(self, thread_url: str) -> dict:
"""Gets JSON version of a thread and converts it in a dictionary."""
def _extract_posts_list(self, lst: List) -> List[dict]:
"""This method must be overridden in child classes where you specify
a path in a JSON document where posts are stored. E.g., on 4chan this is
['posts'], and on 2ch.hk it's ['threads'][0]['posts']."""
return lst
def _get_json(self) -> dict:
"""Retrieves a JSON representation of a thread and converts it in
a dictionary."""
try:
with urlopen(thread_url) as url:
thread_url = self.json_thread_url.format(board=self._board, \
thread=self._thread)
req = Request(thread_url, headers={'User-Agent': USER_AGENT})
with urlopen(req) as url:
return loads(url.read().decode('utf-8'))
except:
raise ParserThreadNotFoundError
def _parse_post(self, post: dict) -> List[FileInfo]:
"""Parses a single post and extracts files into `FileInfo` object."""
except HTTPError as e:
raise ThreadNotFoundError(str(e))
except Exception as e:
raise e
def _parse_post(self, post: dict) -> Optional[List[FileInfo]]:
"""Parses a single post and extracts files into `FileInfo` object.
Single object is wrapped in a list for convenient insertion into
a list."""
raise NotImplementedError
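
The rewritten base class follows a template-method layout: the constructor drives the work and subclasses only supply the hooks. A cut-down sketch of that layout is shown below, assuming canned data instead of a network call. `SketchParser` and `CannedParser` are hypothetical names, and files are represented by plain strings rather than `FileInfo` objects to keep the example short.

```python
from itertools import chain
from typing import List, Optional


class SketchParser:
    """Cut-down stand-in for the Parser base class (illustrative only)."""

    subject_field = "sub"
    comment_field = "com"

    def __init__(self, board: str, thread: str,
                 skip_posts: Optional[int] = None) -> None:
        self._board = board
        self._thread = thread
        posts = self._extract_posts_list(self._get_json())
        # OP is taken before skipping, so it survives skip_posts.
        self._op_post = posts[0]
        if skip_posts is not None:
            posts = posts[skip_posts:]
        # Posts without files return None and are filtered out.
        self._files = list(chain.from_iterable(
            filter(None, map(self._parse_post, posts))))

    @property
    def files(self) -> List[str]:
        return self._files

    @property
    def op(self) -> Optional[str]:
        op = ""
        if self.subject_field in self._op_post:
            op = f"{self._op_post[self.subject_field]}\n"
        if self.comment_field in self._op_post:
            op += self._op_post[self.comment_field]
        return op if op != "" else None

    # Hooks that subclasses are expected to override.
    def _get_json(self) -> dict:
        raise NotImplementedError

    def _extract_posts_list(self, document: dict) -> List[dict]:
        raise NotImplementedError

    def _parse_post(self, post: dict) -> Optional[List[str]]:
        raise NotImplementedError


class CannedParser(SketchParser):
    """Hypothetical subclass fed with canned data instead of a request."""

    def _get_json(self) -> dict:
        return {"posts": [
            {"sub": "Subject", "com": "Comment", "tim": 1, "ext": ".png"},
            {"com": "no file here"},
            {"tim": 2, "ext": ".jpg"}]}

    def _extract_posts_list(self, document: dict) -> List[dict]:
        return document["posts"]

    def _parse_post(self, post: dict) -> Optional[List[str]]:
        if "tim" not in post:
            return None
        return [f"{post['tim']}{post['ext']}"]
```

Because the hooks are called from the base constructor, a subclass needs no `__init__` of its own unless it adds state.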

30
scrapthechan/parsers/__init__.py

@ -1,6 +1,6 @@
"""Here are defined the JSON parsers for imageboards."""
from re import search
from typing import List
from typing import List, Optional
from scrapthechan.parser import Parser
@ -8,27 +8,31 @@ from scrapthechan.parser import Parser
__all__ = ["SUPPORTED_IMAGEBOARDS", "get_parser_by_url", "get_parser_by_site"]
SUPPORTED_IMAGEBOARDS: List[str] = ["4chan.org", "lainchan.org", "2ch.hk"]
URLRX = r"https?:\/\/(?P<s>[\w\.]+)\/(?P<b>\w+)\/(?:\w+)?\/(?P<t>\w+)"
SUPPORTED_IMAGEBOARDS: List[str] = ["4chan.org", "lainchan.org", "2ch.hk", \
"8kun.top"]
def get_parser_by_url(url: str) -> Parser:
def get_parser_by_url(url: str, skip_posts: Optional[int] = None) -> Parser:
"""Parses URL and extracts from it site name, board and thread.
And then returns initialised Parser object for detected imageboard."""
URLRX = r"https?:\/\/(?P<s>[\w\.]+)\/(?P<b>\w+)\/(?:\w+)?\/(?P<t>\w+)"
site, board, thread = search(URLRX, url).groups()
return get_parser_by_site(site, board, thread)
return get_parser_by_site(site, board, thread, skip_posts)
def get_parser_by_site(site: str, board: str, thread: str) -> Parser:
def get_parser_by_site(site: str, board: str, thread: str,
skip_posts: Optional[int] = None) -> Parser:
"""Returns an initialised parser for `site` with `board` and `thread`."""
if site in ['boards.4chan.org', 'boards.4channel.org',
'4chan', '4chan.org']:
if '4chan' in site:
from .fourchan import FourChanParser
return FourChanParser(board, thread)
elif site in ['lainchan.org', 'lainchan']:
return FourChanParser(board, thread, skip_posts)
elif 'lainchan' in site:
from .lainchan import LainchanParser
return LainchanParser(board, thread)
elif site in ['2ch.hk', '2ch']:
return LainchanParser(board, thread, skip_posts)
elif '2ch' in site:
from .dvach import DvachParser
return DvachParser(board, thread)
return DvachParser(board, thread, skip_posts)
elif '8kun' in site:
from .eightkun import EightKunParser
return EightKunParser(board, thread, skip_posts)
else:
raise NotImplementedError(f"Parser for {site} is not implemented")

53
scrapthechan/parsers/dvach.py

@ -10,39 +10,54 @@ __all__ = ["DvachParser"]
class DvachParser(Parser):
"""JSON parser for 2ch.hk image board."""
__url_thread_json = "https://2ch.hk/{board}/res/{thread}.json"
__url_file_link = "https://2ch.hk"
def __init__(self, board: str, thread: str,
skip_posts: Optional[int] = None) -> None:
posts = self._get_json(self.__url_thread_json.format(board=board, \
thread=thread))['threads'][0]['posts']
super(DvachParser, self).__init__(board, thread, posts, skip_posts)
super().__init__(board, thread, skip_posts)
@property
def json_thread_url(self) -> str:
return "https://2ch.hk/{board}/res/{thread}.json"
@property
def file_base_url(self) -> str:
return "https://2ch.hk"
@property
def subject_field(self) -> str:
return "subject"
@property
def comment_field(self) -> str:
return "comment"
@property
def imageboard(self) -> str:
return "2ch.hk"
@property
def op(self) -> Optional[str]:
op = ""
if 'sub' in self._op_post:
op = f"{self._op_post['subject']}\n"
if 'com' in self._op_post:
op += self._op_post['comment']
return op if not op == "" else None
def _extract_posts_list(self, lst: List) -> List[dict]:
return lst['threads'][0]['posts']
def _parse_post(self, post) -> Optional[List[FileInfo]]:
if not 'files' in post: return None
files = []
for f in post['files']:
if match(f['fullname'], r"^image\.\w+$") is None:
fullname = f['fullname']
if not 'sticker' in f:
if match(r"^image\.\w+$", f['fullname']) is None:
fullname = f['fullname']
else:
fullname = f['name']
else:
fullname = f['name']
# Here's same thing as 4chan. 2ch.hk also has md5 field, so it is
# completely fine to hardcode `hash_algo`.
files.append(FileInfo(fullname, f['size'],
f"{self.__url_file_link}{f['path']}",
f['md5'], 'md5'))
if 'md5' in f:
files.append(FileInfo(fullname, f['size'],
f"{self.file_base_url}{f['path']}",
f['md5'], 'md5'))
else:
files.append(FileInfo(fullname, f['size'],
f"{self.file_base_url}{f['path']}",
None, None))
return files

25
scrapthechan/parsers/eightkun.py

@ -0,0 +1,25 @@
from typing import Optional
from scrapthechan.parsers.tinyboardlike import TinyboardLikeParser
__all__ = ["EightKunParser"]
class EightKunParser(TinyboardLikeParser):
"""JSON parser for 8kun.top image board."""
def __init__(self, board: str, thread: str,
skip_posts: Optional[int] = None) -> None:
super().__init__(board, thread, skip_posts)
@property
def imageboard(self) -> str:
return "8kun.top"
@property
def json_thread_url(self) -> str:
return "https://8kun.top/{board}/res/{thread}.json"
@property
def file_base_url(self) -> str:
return "https://media.8kun.top/file_dl/{filename}"

46
scrapthechan/parsers/fourchan.py

@ -1,51 +1,25 @@
from re import match
from typing import List, Optional
from typing import Optional
from scrapthechan.fileinfo import FileInfo
from scrapthechan.parser import Parser
from scrapthechan.parsers.tinyboardlike import TinyboardLikeParser
__all__ = ["FourChanParser"]
class FourChanParser(Parser):
class FourChanParser(TinyboardLikeParser):
"""JSON parser for 4chan.org image board."""
__url_thread_json = "https://a.4cdn.org/{board}/thread/{thread}.json"
__url_file_link = "https://i.4cdn.org/{board}/{filename}"
def __init__(self, board: str, thread: str,
skip_posts: Optional[int] = None) -> None:
posts = self._get_json(self.__url_thread_json.format(board=board, \
thread=thread))['posts']
super(FourChanParser, self).__init__(board, thread, posts, skip_posts)
super().__init__(board, thread, skip_posts)
@property
def imageboard(self) -> str:
return "4chan.org"
@property
def op(self) -> Optional[str]:
op = ""
if 'sub' in self._op_post:
op = f"{self._op_post['sub']}\n"
if 'com' in self._op_post:
op += self._op_post['com']
return op if not op == "" else None
def _parse_post(self, post: dict) -> List[FileInfo]:
if not 'tim' in post: return None
dlfname = f"{post['tim']}{post['ext']}"
if "filename" in post:
if match(post['filename'], r"^image\.\w+$") is None:
filename = dlfname
else:
filename = f"{post['filename']}{post['ext']}"
# Hash algorithm is hardcoded since it is highly unlikely that it will
# be changed in foreseeable future. And if it'll change then this line
# will be necessarily updated anyway.
return [FileInfo(filename, post['fsize'],
self.__url_file_link.format(board=self.board, filename=dlfname),
post['md5'], 'md5')]
def json_thread_url(self) -> str:
return "https://a.4cdn.org/{board}/thread/{thread}.json"
@property
def file_base_url(self) -> str:
return "https://i.4cdn.org/{board}/{filename}"

63
scrapthechan/parsers/lainchan.py

@ -1,66 +1,25 @@
from re import match
from typing import List, Optional
from typing import Optional
from scrapthechan.parser import Parser
from scrapthechan.fileinfo import FileInfo
from scrapthechan.parsers.tinyboardlike import TinyboardLikeParser
__all__ = ["LainchanParser"]
class LainchanParser(Parser):
"""JSON parser for lainchan.org image board.
JSON structure is identical to 4chan.org's, so this parser is just inherited
from 4chan.org's parser and only needed things are redefined.
"""
__url_thread_json = "https://lainchan.org/{board}/res/{thread}.json"
__url_file_link = "https://lainchan.org/{board}/src/{filename}"
class LainchanParser(TinyboardLikeParser):
"""JSON parser for lainchan.org image board."""
def __init__(self, board: str, thread: str,
skip_posts: Optional[int] = None) -> None:
posts = self._get_json(self.__url_thread_json.format(board=board, \
thread=thread))['posts']
super(LainchanParser, self).__init__(board, thread, posts, skip_posts)
super().__init__(board, thread, skip_posts)
@property
def imageboard(self) -> str:
return "lainchan.org"
@property
def op(self) -> Optional[str]:
op = ""
if 'sub' in self._op_post:
op = f"{self._op_post['sub']}\n"
if 'com' in self._op_post:
op += self._op_post['com']
return op if not op == "" else None
def _parse_post(self, post) -> List[FileInfo]:
if not 'tim' in post: return None
dlfname = f"{post['tim']}{post['ext']}"
if "filename" in post:
if match(post['filename'], r"^image\.\w+$") is None:
filename = dlfname
else:
filename = f"{post['filename']}{post['ext']}"
files = []
files.append(FileInfo(filename, post['fsize'],
self.__url_file_link.format(board=self.board, filename=dlfname),
post['md5'], 'md5'))
@property
def json_thread_url(self) -> str:
return "https://lainchan.org/{board}/res/{thread}.json"
if "extra_files" in post:
for f in post["extra_files"]:
dlfname = f"{f['tim']}{f['ext']}"
if "filename" in post:
if match(post['filename'], r"^image\.\w+$") is None:
filename = dlfname
else:
filename = f"{post['filename']}{post['ext']}"
dlurl = self.__url_file_link.format(board=self.board, \
filename=dlfname)
files.append(FileInfo(filename, f['fsize'], \
dlurl, f['md5'], 'md5'))
return files
@property
def file_base_url(self) -> str:
return "https://lainchan.org/{board}/src/{filename}"

51
scrapthechan/parsers/tinyboardlike.py

@ -0,0 +1,51 @@
from re import match
from typing import List, Optional
from scrapthechan.parser import Parser
from scrapthechan.fileinfo import FileInfo
__all__ = ["TinyboardLikeParser"]
class TinyboardLikeParser(Parser):
"""Base parser for imageboards that are based on Tinyboard, or have similar
JSON API."""
def __init__(self, board: str, thread: str,
skip_posts: Optional[int] = None) -> None:
super().__init__(board, thread, skip_posts)
def _extract_posts_list(self, lst: List) -> List[dict]:
return lst['posts']
def _parse_post(self, post: dict) -> Optional[List[FileInfo]]:
if not 'tim' in post: return None
dlfname = f"{post['tim']}{post['ext']}"
if "filename" in post:
if match(r"^image\.\w+$", post['filename']) is None:
filename = dlfname
else:
filename = f"{post['filename']}{post['ext']}"
files = []
files.append(FileInfo(filename, post['fsize'],
self.file_base_url.format(board=self.board, filename=dlfname),
post['md5'], 'md5'))
if "extra_files" in post:
for f in post["extra_files"]:
dlfname = f"{f['tim']}{f['ext']}"
if "filename" in f:
if match(r"^image\.\w+$", f['filename']) is None:
filename = dlfname
else:
filename = f"{f['filename']}{f['ext']}"
dlurl = self.file_base_url.format(board=self.board, \
filename=dlfname)
files.append(FileInfo(filename, f['fsize'], \
dlurl, f['md5'], 'md5'))
return files

204
scrapthechan/scraper.py

@ -1,96 +1,146 @@
"""Base Scraper implementation."""
"""Base class for all scrapers that will actually do the job."""
from base64 import b64encode
from os import remove, stat
from os.path import exists, join, getsize
import re
from typing import List, Callable
from urllib.request import urlretrieve, URLopener
from urllib.request import urlretrieve, URLopener, HTTPError, URLError
import hashlib
from http.client import HTTPException
from scrapthechan import __version__
from scrapthechan import USER_AGENT
from scrapthechan.fileinfo import FileInfo
__all__ = ["Scraper"]
class Scraper:
    """Base scraper implementation.

    Arguments:
    save_directory -- a path to a directory where file will be
    saved;
    files -- a list of FileInfo objects;
    download_progress_callback -- a callback function that will be called
    for each file started downloading.
    """
    def __init__(self, save_directory: str, files: List[FileInfo],
        download_progress_callback: Callable[[int], None] = None) -> None:
        self._save_directory = save_directory
        self._files = files
        self._url_opener = URLopener()
        self._url_opener.version = f"ScrapTheChan/{__version__}"
        self._progress_callback = download_progress_callback
    """Base class for all scrapers that will actually do the job.

    Arguments:
    save_directory -- a path to a directory where file will be
    saved;
    files -- a list of FileInfo objects;
    download_progress_callback -- a callback function that will be called
    for each file started downloading.
    """
    def __init__(self, save_directory: str, files: List[FileInfo],
        download_progress_callback: Callable[[int], None] = None) -> None:
        self._save_directory = save_directory
        self._files = files
        self._url_opener = URLopener()
        self._url_opener.addheaders = [('User-Agent', USER_AGENT)]
        self._url_opener.version = USER_AGENT
        self._progress_callback = download_progress_callback
    def run(self):
        raise NotImplementedError
    def run(self):
        raise NotImplementedError
    def _same_filename(self, filename: str, path: str) -> str:
        """Check if there is a file with same name. If so then add incremental
        number enclosed in brackets to a name of a new one."""
        newname = filename
        while exists(join(path, newname)):
            has_extension = newname.rfind(".") != -1
            if has_extension:
                l, r = newname.rsplit(".", 1)
                lbracket = l.rfind("(")
                if lbracket == -1:
                    newname = f"{l}(1).{r}"
                else:
                    num = l[lbracket+1:-1]
                    if num.isnumeric():
                        newname = f"{l[:lbracket]}({int(num)+1}).{r}"
                    else:
                        newname = f"{l}(1).{r}"
            else:
                lbracket = l.rfind("(")
                if lbracket == -1:
                    newname = f"{newname}(1)"
                else:
                    num = newname[lbracket+1:-1]
                    if num.isnumeric():
                        newname = f"{newname[:lbracket]}({int(num)+1})"
        return newname
    def _same_filename(self, filename: str, path: str) -> str:
        """Check whether a file with the same name exists. If so, append an
        incrementing number in brackets to the name of the new file."""
        newname = filename
        while exists(join(path, newname)):
            has_extension = newname.rfind(".") != -1
            if has_extension:
                l, r = newname.rsplit(".", 1)
                lbracket = l.rfind("(")
                if lbracket == -1:
                    newname = f"{l}(1).{r}"
                else:
                    num = l[lbracket+1:-1]
                    if num.isnumeric():
                        newname = f"{l[:lbracket]}({int(num)+1}).{r}"
                    else:
                        newname = f"{l}(1).{r}"
            else:
                # No extension: search the whole name (`l` is unbound here).
                lbracket = newname.rfind("(")
                if lbracket == -1:
                    newname = f"{newname}(1)"
                else:
                    num = newname[lbracket+1:-1]
                    if num.isnumeric():
                        newname = f"{newname[:lbracket]}({int(num)+1})"
                    else:
                        # Always change the name so the loop can terminate.
                        newname = f"{newname}(1)"
        return newname
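
The renaming scheme of `_same_filename` can also be sketched as a standalone function. This is an illustrative variant, not the project's code: `unique_name` and `_COUNTER` are hypothetical names, and the sketch splits the extension with `os.path.splitext` and detects an existing counter with a regular expression instead of `rfind`.

```python
import os
import re
import tempfile

# Matches a stem that already ends in a numeric counter, e.g. "cat(3)".
_COUNTER = re.compile(r"^(?P<stem>.*)\((?P<num>\d+)\)$")


def unique_name(filename: str, directory: str) -> str:
    """Return `filename`, or a "(N)"-suffixed variant that is free in `directory`."""
    stem, ext = os.path.splitext(filename)
    candidate = filename
    while os.path.exists(os.path.join(directory, candidate)):
        found = _COUNTER.match(stem)
        if found:
            # Bump an existing counter: "cat(1)" becomes "cat(2)".
            stem = f"{found.group('stem')}({int(found.group('num')) + 1})"
        else:
            # No counter yet: start at 1.
            stem = f"{stem}(1)"
        candidate = stem + ext
    return candidate


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    for name in ("cat.png", "cat(1).png"):
        open(os.path.join(d, name), "w").close()
    free_name = unique_name("cat.png", d)
    untouched = unique_name("README", d)
```

Every pass through the loop produces a name that has not been tried before, so the loop ends as soon as a free name is reached, with or without an extension.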
    def _hash_file(self, filename: str, hash_algo: str = "md5",
        blocksize: int = 1048576) -> (str, str):
        """Compute hash of a file."""
        hash_func = hashlib.new(hash_algo)
        with open(filename, 'rb') as f:
            buf = f.read(blocksize)
            while len(buf) > 0:
                hash_func.update(buf)
                buf = f.read(blocksize)
        return hash_func.hexdigest(), hash_func.digest()
    def _hash_file(self, filepath: str, hash_algorithm: str = "md5",
        blocksize: int = 1048576) -> (str, str):
        """Compute hash of a file."""
        if hash_algorithm is None:
            return None
        hash_func = hashlib.new(hash_algorithm)
        with open(filepath, 'rb') as f:
            buf = f.read(blocksize)
            while len(buf) > 0:
                hash_func.update(buf)
                buf = f.read(blocksize)
        return hash_func.hexdigest(), b64encode(hash_func.digest()).decode()
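
For reference, a standalone sketch of the same idea: hash a file in fixed-size blocks and return the digest both as a hex string and as Base64 text, so it can be compared against either representation. The name `hash_file` and the sample file are assumptions for illustration.

```python
import hashlib
import os
import tempfile
from base64 import b64encode
from typing import Tuple


def hash_file(path: str, algorithm: str = "md5",
              blocksize: int = 1 << 20) -> Tuple[str, str]:
    """Return (hex digest, Base64 digest) of the file at `path`."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read block by block so large files never sit in memory at once.
        for block in iter(lambda: f.read(blocksize), b""):
            hasher.update(block)
    return hasher.hexdigest(), b64encode(hasher.digest()).decode()


# Demonstration on a small temporary file.
with tempfile.TemporaryDirectory() as d:
    sample = os.path.join(d, "sample.bin")
    with open(sample, "wb") as f:
        f.write(b"hello")
    hex_digest, b64_digest = hash_file(sample)
```

Both values encode the same digest bytes; only the text encoding differs.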
    def _is_file_ok(self, f: FileInfo, filepath: str) -> bool:
        """Check if a file exist and isn't broken."""
        if not exists(filepath):
            return False
        computed_size = getsize(filepath)
        is_size_match = f.size == computed_size \
            or f.size == round(computed_size / 1024)
        hexdig, dig = self._hash_file(filepath, f.hash_algo)
        is_hash_match = f.hash_value == hexdig \
            or f.hash_value == b64encode(dig).decode()
        return is_size_match and is_hash_match
    def _check_file(self, f: FileInfo, filepath: str) -> bool:
        """Check if a file exists and isn't broken."""
        if not exists(filepath):
            return False
        computed_size = getsize(filepath)
        if not (f.size == computed_size \
            or f.size == round(computed_size / 1024)):
            return False
        if f.hash_algorithm is not None:
            hexdig, dig = self._hash_file(filepath, f.hash_algorithm)
            return f.hash_value == hexdig or f.hash_value == dig
        return True
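
The check above accepts a size reported either in bytes or in rounded KiB, and a hash given either as hex or as Base64. A compact in-memory sketch of that acceptance rule follows; `looks_intact` is a hypothetical helper that works on bytes rather than on a file path so that it runs on its own.

```python
import hashlib
from base64 import b64encode
from typing import Optional


def looks_intact(data: bytes, expected_size: int,
                 expected_hash: Optional[str],
                 algorithm: Optional[str] = "md5") -> bool:
    """Return True if `data` matches the advertised size and hash."""
    size = len(data)
    # Accept a size given in bytes or in rounded KiB.
    if expected_size not in (size, round(size / 1024)):
        return False
    # Without a hash there is nothing more to compare.
    if algorithm is None or expected_hash is None:
        return True
    digest = hashlib.new(algorithm, data)
    return expected_hash in (digest.hexdigest(),
                             b64encode(digest.digest()).decode())


# Demonstration with 2 KiB of filler data.
data = b"x" * 2048
hex_hash = hashlib.md5(data).hexdigest()
b64_hash = b64encode(hashlib.md5(data).digest()).decode()
size_in_bytes_ok = looks_intact(data, 2048, hex_hash)
size_in_kib_ok = looks_intact(data, 2, b64_hash)
```

Note that Python's `round` rounds halves to the nearest even integer, so a KiB value that a board computed with a different rounding rule may not match.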
    def _download_file(self, f: FileInfo):
        """Download a single file."""
        filepath = join(self._save_directory, f.name)
        if self._is_file_ok(f, filepath):
            return True
        elif exists(filepath):
            filepath = join(self._save_directory, \
                self._same_filename(f.name, self._save_directory))
        self._url_opener.retrieve(f.dlurl, filepath)
    def _download_file(self, f: FileInfo):
        """Download a single file."""
        is_same_filename = False
        filepath = join(self._save_directory, f.name)
        orig_filepath = filepath
        if self._check_file(f, filepath):
            return
        elif exists(filepath):
            is_same_filename = True
            filepath = join(self._save_directory, \
                self._same_filename(f.name, self._save_directory))
        try:
            retries = 3
            while retries > 0:
                self._url_opener.retrieve(f.download_url, filepath)
                if not self._check_file(f, filepath):
                    remove(filepath)
                    retries -= 1
                else:
                    break
            if retries == 0:
                print(f"Cannot retrieve {f.download_url}, {filepath}.")
                return
            if is_same_filename:
                _, f1_dig = self._hash_file(orig_filepath, f.hash_algorithm)
                _, f2_dig = self._hash_file(filepath, f.hash_algorithm)
                if f1_dig == f2_dig:
                    remove(filepath)
        except FileNotFoundError as e:
            print("File Not Found", filepath)
        except HTTPError as e:
            print("HTTP Error", e.code, e.reason, f.download_url)
            if exists(filepath):
                remove(filepath)
        except HTTPException:
            print("HTTP Exception for", f.download_url)
            if exists(filepath):
                remove(filepath)
        except URLError as e:
            print("URL Error for", f.download_url)
            if exists(filepath):
                remove(filepath)
        except ConnectionResetError:
            print("Connection reset for", f.download_url)
            if exists(filepath):
                remove(filepath)
        except ConnectionRefusedError:
            print("Connection refused for", f.download_url)
            if exists(filepath):
                remove(filepath)
        except ConnectionAbortedError:
            print("Connection aborted for", f.download_url)
            if exists(filepath):
                remove(filepath)
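
The retry-and-verify loop above can be sketched without any network access by injecting the download and the check as callables. `download_with_retries`, `flaky_fetch` and `is_complete` are hypothetical names used only for this illustration. The sketch catches `OSError`, which covers `URLError`, `HTTPError` and the `Connection*Error` family; `http.client.HTTPException` is not an `OSError` and would need its own handler in real code.

```python
import os
import tempfile
from typing import Callable


def download_with_retries(fetch: Callable[[str], None],
                          verify: Callable[[str], bool],
                          path: str, attempts: int = 3) -> bool:
    """Call `fetch(path)` until `verify(path)` passes or attempts run out."""
    for _ in range(attempts):
        try:
            fetch(path)
        except OSError as error:
            print("Download failed:", error)
        else:
            if verify(path):
                return True
        # Drop a partial or corrupt file before the next attempt.
        if os.path.exists(path):
            os.remove(path)
    return False


# A stand-in for a real download: truncated twice, complete on the third call.
calls = {"n": 0}


def flaky_fetch(path: str) -> None:
    calls["n"] += 1
    payload = b"full" if calls["n"] >= 3 else b"fu"
    with open(path, "wb") as f:
        f.write(payload)


def is_complete(path: str) -> bool:
    with open(path, "rb") as f:
        return f.read() == b"full"


with tempfile.TemporaryDirectory() as d:
    succeeded = download_with_retries(flaky_fetch, is_complete,
                                      os.path.join(d, "file.bin"))
    gave_up = not download_with_retries(flaky_fetch, lambda p: False,
                                        os.path.join(d, "other.bin"),
                                        attempts=2)
```

Keeping the cleanup in one place after each failed attempt replaces the repeated `if exists(filepath): remove(filepath)` pairs with a single branch.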

15
scrapthechan/scrapers/basicscraper.py

@@ -1,15 +0,0 @@
"""Implementation of basic sequential one-threaded scraper that downloads
files one by one."""
from scrapthechan.scraper import Scraper
__all__ = ["BasicScraper"]
class BasicScraper(Scraper):
    def run(self):
        """Download files one by one."""
        for i, f in enumerate(self._files, start=1):
            if not self._progress_callback is None:
                self._progress_callback(i)
            self._download_file(f)

39
scrapthechan/scrapers/threadedscraper.py

@@ -7,25 +7,26 @@ from multiprocessing.pool import ThreadPool
from scrapthechan.scraper import Scraper
from scrapthechan.fileinfo import FileInfo
__all__ = ["ThreadedScraper"]
class ThreadedScraper(Scraper):
    def __init__(self, save_directory: str, files: List[FileInfo],
        download_progress_callback: Callable[[int], None] = None) -> None:
        super(ThreadedScraper, self).__init__(save_directory, files,
            download_progress_callback)
        self._files_downloaded = 0
        self._files_downloaded_mutex = Lock()
    def run(self):
        pool = ThreadPool(cpu_count() * 2)
        pool.map(self._thread_run, self._files)
        pool.close()
        pool.join()
    def _thread_run(self, f: FileInfo):
        with self._files_downloaded_mutex:
            self._files_downloaded += 1
            if not self._progress_callback is None:
                self._progress_callback(self._files_downloaded)
        self._download_file(f)
    def __init__(self, save_directory: str, files: List[FileInfo],
        download_progress_callback: Callable[[int], None] = None) -> None:
        super().__init__(save_directory, files, download_progress_callback)