Compare commits

38 Commits

Author SHA1 Message Date
Alexander Andreev 43909c2b29 Changelog updated with 0.5.1 changes. 1 year ago
Alexander Andreev acbfaefa9c Version changed to 0.5.1 in a Makefile. 1 year ago
Alexander Andreev 86ef44aa07 Version changed to 0.5.1. 1 year ago
Alexander Andreev 419fb2b673 Removed excessive comparison of hash. Added message when file cannot be retrieved. 1 year ago
Alexander Andreev 0287d3a132 Turned a string into f-string. 1 year ago
Alexander Andreev 245e33f40d README updated. lolifox.cc removed. Option --skip-posts added. 1 year ago
Alexander Andreev e092c905b2 Makefile updated to version 0.5.0. 1 year ago
Alexander Andreev 90338073ed Updated CHANGELOG with version 0.5.0. 1 year ago
Alexander Andreev cdcc184de8 Lolifox removed. Development Status classifier is changed to Alpha. Python 3.7 classifier left to represent oldest supported version. 1 year ago
Alexander Andreev b335891097 Copyright, date, and version are updated. 1 year ago
Alexander Andreev 1213cef776 Lolifox removed. Added skip_posts handling. 1 year ago
Alexander Andreev 78d4a62c17 IB parsers rewritten accordingly to fixed Parser class. 1 year ago
Alexander Andreev f3ef07af68 Rewrite of Parser class because it was fucked up. Now there's no problems with inheritance and its subclasses now more pleasant to write. ThreadNotFoundError now has a reason field. 1 year ago
Alexander Andreev 6373518dc3 Added order=True for FIleInfo to make sure that order of fields is preserved. 1 year ago
Alexander Andreev caf18a1bf0 Added option --skip-posts and messages are now takes just one line. 1 year ago
Alexander Andreev 751549f575 A new generalised class for all imageboards based on Tinyboard or having identical API. 1 year ago
Alexander Andreev 38b5740d73 Removing lolifox.cc parser because this board is dead. 1 year ago
Alexander Andreev 2f9d26427c Now incrementing _files_downloaded happens when _progress_callback is set. And made super() with no args. 1 year ago
Alexander Andreev e7cf2e7c4b Added a missing return True statement in _check_file 1 year ago
Alexander Andreev 4f6f56ae7b Version in a Makefile is changed to 0.4.1. 1 year ago
Alexander Andreev 503eb9959b Version updated to 0.4.1. 1 year ago
Alexander Andreev cb2e0d77f7 Changelog update for 0.4.1. 1 year ago
Alexander Andreev 93e442939a Dvach's stickers handling. 1 year ago
Alexander Andreev 6022c9929a Added HTTP and URL exceptions handling. 1 year ago
Alexander Andreev f79abcc310 In classifiers licence was fixed and added more topics related to a program. 2 years ago
Alexander Andreev 9cdb510325 A little fix for README. 2 years ago
Alexander Andreev 986fdbe7a7 Handling of no arguments passed. 2 years ago
Alexander Andreev 2e6352cb13 Updated changelog. 2 years ago
Alexander Andreev 7b2fcf0899 Improved error handling, retries for damaged files. 2 years ago
Alexander Andreev 21837c5335 Updated changelog. 2 years ago
Alexander Andreev b970973018 ConnectionResetError handling. 2 years ago
Alexander Andreev 6dab626084 Version is changed to 0.4.0. 2 years ago
Alexander Andreev 86b6278657 Updated changelog and readme. 2 years ago
Alexander Andreev 7754a90313 FileInfo is now a frozen dataclass for efficiency. 2 years ago
Alexander Andreev bb47b50c5f _is_file_ok now is _check_file and modified to be more efficient. Also added check for if files happened to share same name and size, but IB said wrong hash. 2 years ago
Alexander Andreev 8403fcf0f2 Now op file is explicitly in utf-8. 2 years ago
Alexander Andreev 647a787974 FIxed arguments for a match function. 2 years ago
Alexander Andreev 6a54b88498 sub and com ->subject and comment. Fixed arguments for match function. 2 years ago
  1. CHANGELOG.md (57 changed lines)
  2. Makefile (2 changed lines)
  3. README.md (25 changed lines)
  4. scrapthechan/__init__.py (6 changed lines)
  5. scrapthechan/cli/scraper.py (70 changed lines)
  6. scrapthechan/fileinfo.py (32 changed lines)
  7. scrapthechan/parser.py (81 changed lines)
  8. scrapthechan/parsers/__init__.py (24 changed lines)
  9. scrapthechan/parsers/dvach.py (53 changed lines)
  10. scrapthechan/parsers/eightkun.py (58 changed lines)
  11. scrapthechan/parsers/fourchan.py (46 changed lines)
  12. scrapthechan/parsers/lainchan.py (63 changed lines)
  13. scrapthechan/parsers/lolifox.py (65 changed lines)
  14. scrapthechan/parsers/tinyboardlike.py (51 changed lines)
  15. scrapthechan/scraper.py (201 changed lines)
  16. scrapthechan/scrapers/threadedscraper.py (39 changed lines)
  17. setup.cfg (12 changed lines)

57
CHANGELOG.md

@@ -1,6 +1,61 @@
# Changelog
## 0.3 - 2020-09-09
## 0.5.1 - 2021-05-04
## Added
- Message when a file cannot be retrieved.
## Fixed
- Removed excessive hash comparison when files have the same name;
- A string that was not marked as an f-string now is one, so the reason why a
thread wasn't found is displayed.
## 0.5.0 - 2021-05-03
## Added
- The program now makes use of the skip_posts argument. Use the CLI option
`-S <number>` or `--skip-posts <number>` to set how many posts you want to skip.
## Changed
- Better, minified messages;
- Fixed inheritance of `Scraper`'s subclasses through a sane rewrite that makes
future extension easier, with far less repetition.
- Added a general class `TinyboardLikeParser` that implements post parser for
all imageboards based on it or the ones that have identical JSON API. From now
on all such generalisation classes will end with `*LikeParser`;
- Changed `file_base_url` for 8kun.top.
## Removed
- Support for Lolifox, since it's gone.
## 0.4.1 - 2020-12-08
## Fixed
- Now HTTPException from http.client and URLError from urllib.request
are handled;
- 2ch.hk's stickers handling.
## 0.4.0 - 2020-11-18
### Added
- For 2ch.hk, a check for whether a file is a sticker was added;
- Encoding for the `!op.txt` file was explicitly set to `utf-8`;
- Handling of connection errors was added, so the program won't crash if a file
doesn't exist or is not accessible for any other reason, and any damaged files
that were created will be removed;
- Added 3 retries if a file was damaged during downloading;
- The scraper now matches hashes of two files that happen to share the same name
and size when the hash reported by an imageboard is not the same as that of
the file. This results in excessive downloading and hash calculations. Hopefully,
that is only the case for 2ch.hk.
### Changed
- FileInfo class is now a frozen dataclass for memory efficiency.
### Fixed
- Found that arguments for the match function that matches the `image.ext`
pattern were mixed up all over the parsers;
- Also for 2ch.hk, the checks for `sub` and `com` were changed to `subject` and
`comment`.
## 0.3.0 - 2020-09-09
### Added
- Parser for lolifox.cc.
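
The 0.4.0 entry above describes comparing a file already on disk against the size and hash reported by an imageboard. A minimal sketch of such a check follows; `file_matches` is a hypothetical helper written for illustration, not the project's `_check_file`, and accepting the digest in either hex or base64 form is an assumption about how boards may report it.

```python
import base64
import hashlib
import os


def file_matches(path: str, size: int, hash_value: str,
                 hash_algorithm: str = "md5") -> bool:
    """Return True if the file at `path` has the expected size and hash.

    Hypothetical helper. The digest is accepted either as a hex string or
    as base64, which is an assumption made for this sketch.
    """
    # Cheap checks first: a missing file or a size mismatch needs no hashing.
    if not os.path.exists(path) or os.path.getsize(path) != size:
        return False

    hasher = hashlib.new(hash_algorithm)
    with open(path, "rb") as f:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(65536), b""):
            hasher.update(chunk)

    digest = hasher.digest()
    candidates = (digest.hex(), base64.b64encode(digest).decode("ascii"))
    return hash_value in candidates
```

Checking the size before hashing keeps the common case cheap: the hash is only computed when name and size already agree.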

2
Makefile

@@ -1,7 +1,7 @@
build: scrapthechan README.md setup.cfg
python setup.py sdist bdist_wheel
install:
python -m pip install --upgrade dist/scrapthechan-0.3.0-py3-none-any.whl --user
python -m pip install --upgrade dist/scrapthechan-0.5.1-py3-none-any.whl --user
uninstall:
# We change directory so pip uninstall will run, it'll fail otherwise.
@cd ~/

25
README.md

@@ -1,8 +1,8 @@
This is a tool for scraping files from imageboards' threads.
It extracts the files from a JSON version of a thread. And then downloads 'em
in a specified output directory or if it isn't specified then creates following
directory hierarchy in a working directory:
It extracts the files from a JSON representation of a thread and then downloads
them to a specified output directory. If no directory is specified, it creates
the following directory hierarchy in the working directory:
<imageboard name>
|-<board name>
@@ -24,12 +24,15 @@ separately. E.g. `4chan b 1100500`.
`-o`, `--output-dir` -- output directory where all files will be dumped to.
`--no-op` -- by default OP's post will be saved in a `!op.txt` file. This flag
disables this behaviour. I desided to put an `!` in a name so this file will be
on the top in a directory listing.
`-N`, `--no-op` -- by default OP's post will be saved in a `!op.txt` file. This
flag disables this behaviour. The exclamation mark `!` in the name keeps this
file at the top of a directory listing.
`-v`, `--version` prints the version of the program, and `-h`, `--help` prints
help for a program.
`-S <num>`, `--skip-posts <num>` -- skip given number of posts.
`-v`, `--version` prints the version of the program.
`-h`, `--help` prints help for a program.
# Supported imageboards
@@ -37,4 +40,8 @@ help for a program.
- [lainchan.org](https://lainchan.org) since 0.1.0
- [2ch.hk](https://2ch.hk) since 0.1.0
- [8kun.top](https://8kun.top) since 0.2.2
- [lolifox.cc](https://lolifox.cc) since 0.3
# TODO
- Sane rewrite of a program;
- Thread watcher.
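
The `-S <num>` / `--skip-posts <num>` option described above takes a number. As a rough illustration of how such an option can be pulled out of a space-joined argument string with a regular expression, here is a small sketch; `parse_skip_posts` is a hypothetical name, and the pattern is a simplified stand-in rather than the exact one the program uses.

```python
from re import search
from typing import Optional


def parse_skip_posts(args: str) -> Optional[int]:
    """Extract the number given to -S/--skip-posts, or None if absent.

    Hypothetical helper; `args` is the command line joined with spaces.
    """
    found = search(r"(?:-S|--skip-posts) (?P<skip>\d+)", args)
    return None if found is None else int(found.group("skip"))


print(parse_skip_posts("-S 10 4chan b 1100500"))
print(parse_skip_posts("4chan b 1100500"))
```

Returning `None` when the option is absent lets the caller tell "no skipping requested" apart from an explicit `0`.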

6
scrapthechan/__init__.py

@@ -1,8 +1,8 @@
__date__ = "9 September 2020"
__version__ = "0.3.0"
__date__ = "4 May 2021"
__version__ = "0.5.1"
__author__ = "Alexander \"Arav\" Andreev"
__email__ = "me@arav.top"
__copyright__ = f"Copyright (c) 2020 {__author__} <{__email__}>"
__copyright__ = f"Copyright (c) 2020,2021 {__author__} <{__email__}>"
__license__ = \
"""This program is licensed under the terms of the MIT license.
For a copy see COPYING file in a directory of the program, or

70
scrapthechan/cli/scraper.py

@@ -3,7 +3,7 @@ from os import makedirs
from os.path import join, exists
from re import search
from sys import argv
from typing import List
from typing import List, Optional
from scrapthechan import VERSION
from scrapthechan.parser import Parser, ThreadNotFoundError
@@ -15,17 +15,18 @@ from scrapthechan.scrapers.threadedscraper import ThreadedScraper
__all__ = ["main"]
USAGE = \
USAGE: str = \
f"""Usage: scrapthechan [OPTIONS] (URL | IMAGEBOARD BOARD THREAD)
Options:
\t-h,--help -- print this help and exit;
\t-v,--version -- print program's version and exit;
\t-o,--output-dir -- directory where to place scraped files. By default
\t following structure will be created in current directory:
\t <imageboard>/<board>/<thread>;
\t-N,--no-op -- by default OP's post will be written in !op.txt file. This
\t option disables this behaviour;
\t-h,--help -- print this help and exit;
\t-v,--version -- print program's version and exit;
\t-o,--output-dir -- directory where to place scraped files. By default
\t following structure will be created in current directory:
\t <imageboard>/<board>/<thread>;
\t-N,--no-op -- by default OP's post will be written in !op.txt file. This
\t option disables this behaviour;
\t-S,--skip-posts <num> -- skip given number of posts.
Arguments:
\tURL -- URL of a thread;
@@ -37,15 +38,15 @@ Supported imageboards: {', '.join(SUPPORTED_IMAGEBOARDS)}.
"""
def parse_common_arguments(args: str) -> dict:
r = r"(?P<help>-h|--help)|(?P<version>-v|--version)"
args = search(r, args)
if not args is None:
args = args.groupdict()
return {
"help": not args["help"] is None,
"version": not args["version"] is None }
return None
def parse_common_arguments(args: str) -> Optional[dict]:
r = r"(?P<help>-h|--help)|(?P<version>-v|--version)"
args = search(r, args)
if not args is None:
args = args.groupdict()
return {
"help": not args["help"] is None,
"version": not args["version"] is None }
return None
def parse_arguments(args: str) -> dict:
rlink = r"^(https?:\/\/)?(?P<site>[\w.-]+)[ \/](?P<board>\w+)(\S+)?[ \/](?P<thread>\w+)"
@@ -53,15 +54,21 @@ def parse_arguments(args: str) -> dict:
if not link is None:
link = link.groupdict()
out_dir = search(r"(?=(-o|--output-dir) (?P<outdir>\S+))", args)
skip_posts = search(r"(?=(-S|--skip-posts) (?P<skip>\d+))", args)
return {
"site": None if link is None else link["site"],
"board": None if link is None else link["board"],
"thread": None if link is None else link["thread"],
"skip-posts": None if skip_posts is None else int(skip_posts.group('skip')),
"no-op": not search(r"-N|--no-op", args) is None,
"output-dir": None if out_dir is None \
else out_dir.groupdict()["outdir"] }
def main() -> None:
if len(argv) == 1:
print(USAGE)
exit()
cargs = parse_common_arguments(' '.join(argv[1:]))
if not cargs is None:
if cargs["help"]:
@@ -78,17 +85,21 @@ def main() -> None:
exit()
try:
parser = get_parser_by_site(args["site"], args["board"], args["thread"])
if not args["skip-posts"] is None:
parser = get_parser_by_site(args["site"], args["board"],
args["thread"], args["skip-posts"])
else:
parser = get_parser_by_site(args["site"], args["board"],
args["thread"])
except NotImplementedError as ex:
print(f"{str(ex)}.")
print(f"Supported image boards are {', '.join(SUPPORTED_IMAGEBOARDS)}")
exit()
except ThreadNotFoundError:
except ThreadNotFoundError as e:
print(f"Thread {args['site']}/{args['board']}/{args['thread']} " \
"is no longer exist.")
f"not found. Reason: {e.reason}")
exit()
files_count = len(parser.files)
if not args["output-dir"] is None:
@@ -97,23 +108,22 @@ def main() -> None:
save_dir = join(parser.imageboard, parser.board,
parser.thread)
print(f"There are {files_count} files in " \
f"{args['site']}/{args['board']}/{args['thread']}." \
f"They will be saved in {save_dir}.")
print(f"{files_count} files in " \
f"{args['site']}/{args['board']}/{args['thread']}. " \
f"They're going to {save_dir}. ", end="")
makedirs(save_dir, exist_ok=True)
if not args["no-op"]:
print("Writing OP... ", end='')
if parser.op is None:
print("No text's there.")
print("OP's empty.")
elif not exists(join(save_dir, "!op.txt")):
with open(join(save_dir, "!op.txt"), 'w') as opf:
with open(join(save_dir, "!op.txt"), 'w', encoding='utf-8') as opf:
opf.write(f"{parser.op}\n")
print("Done.")
print("OP's written.")
else:
print("Exists.")
print("OP exists.")
scraper = ThreadedScraper(save_dir, parser.files, \

32
scrapthechan/fileinfo.py

@@ -1,23 +1,23 @@
"""FileInfo object stores all needed information about a file."""
"""FileInfo object stores information about a file."""
from dataclasses import dataclass
__all__ = ["FileInfo"]
@dataclass(frozen=True, order=True)
class FileInfo:
"""Stores all needed information about a file.
"""Stores information about a file.
Arguments:
- `name` -- name of a file;
- `size` -- size of a file;
- `dlurl` -- full download URL for a file;
- `hash_value` -- hash sum of a file;
- `hash_algo` -- hash algorithm used (e.g. md5).
"""
def __init__(self, name: str, size: int, dlurl: str,
hash_value: str, hash_algo: str) -> None:
self.name = name
self.size = size
self.dlurl = dlurl
self.hash_value = hash_value
self.hash_algo = hash_algo
Fields:
- `name` -- name of a file;
- `size` -- size of a file;
- `download_url` -- full download URL for a file;
- `hash_value` -- hash sum of a file;
- `hash_algorithm` -- hash algorithm used (e.g. md5).
"""
name: str
size: int
download_url: str
hash_value: str
hash_algorithm: str
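
To show what `frozen=True` and `order=True` give the rewritten `FileInfo`, here is a self-contained sketch that redeclares the same fields and exercises both options. The sample file names and URLs are made up for illustration.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True, order=True)
class FileInfo:
    name: str
    size: int
    download_url: str
    hash_value: str
    hash_algorithm: str


# Sample values are invented for this sketch.
a = FileInfo("a.png", 10, "https://example.org/a.png", "abc", "md5")
b = FileInfo("b.png", 5, "https://example.org/b.png", "def", "md5")

# order=True generates comparisons that go field by field, in the order
# the fields are declared, so `name` decides here.
assert a < b
assert sorted([b, a])[0] is a

# frozen=True rejects attribute assignment after construction.
try:
    a.size = 11
except FrozenInstanceError:
    print("FileInfo is immutable")

# A frozen dataclass with the default eq=True is also hashable.
assert len({a, b, a}) == 2
```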

81
scrapthechan/parser.py

@@ -4,7 +4,7 @@ from itertools import chain
from json import loads
from re import findall, match
from typing import List, Optional
from urllib.request import urlopen, Request
from urllib.request import urlopen, Request, HTTPError
from scrapthechan import USER_AGENT
from scrapthechan.fileinfo import FileInfo
@@ -14,7 +14,12 @@ __all__ = ["Parser", "ThreadNotFoundError"]
class ThreadNotFoundError(Exception):
pass
def __init__(self, reason: str = ""):
self._reason = reason
@property
def reason(self) -> str:
return self._reason
class Parser:
@@ -25,28 +30,42 @@ class Parser:
Arguments:
board -- is a name of a board on an image board;
thread -- is a name of a thread inside a board;
posts -- is a list of posts in form of dictionaries exported from a JSON;
thread -- is an id of a thread inside a board;
skip_posts -- number of posts to skip.
All the extracted files will be stored as the `FileInfo` objects."""
__url_thread_json: str = "https://example.org/{board}/{thread}.json"
__url_file_link: str = None
def __init__(self, board: str, thread: str, posts: List[dict],
def __init__(self, board: str, thread: str,
skip_posts: Optional[int] = None) -> None:
self._board = board
self._thread = thread
self._op_post = posts[0]
if not skip_posts is None:
posts = posts[skip_posts:]
self._board: str = board
self._thread: str = thread
self._posts = self._extract_posts_list(self._get_json())
self._op_post: dict = self._posts[0]
self._posts = self._posts[skip_posts:] if not skip_posts is None else self._posts
self._files = list(chain.from_iterable(filter(None, \
map(self._parse_post, posts))))
map(self._parse_post, self._posts))))
@property
def json_thread_url(self) -> str:
raise NotImplementedError
@property
def file_base_url(self) -> str:
raise NotImplementedError
@property
def subject_field(self) -> str:
return "sub"
@property
def comment_field(self) -> str:
return "com"
@property
def imageboard(self) -> str:
"""Returns image board's name."""
return NotImplementedError
raise NotImplementedError
@property
def board(self) -> str:
@@ -62,22 +81,40 @@ class Parser:
def op(self) -> str:
"""Returns OP's post as combination of subject and comment separated
by a new line."""
raise NotImplementedError
op = ""
if self.subject_field in self._op_post:
op = f"{self._op_post[self.subject_field]}\n"
if self.comment_field in self._op_post:
op += self._op_post[self.comment_field]
return op if not op == "" else None
@property
def files(self) -> List[FileInfo]:
"""Returns a list of retrieved files as `FileInfo` objects."""
return self._files
def _get_json(self, thread_url: str) -> dict:
"""Gets JSON version of a thread and converts it in a dictionary."""
def _extract_posts_list(self, lst: List) -> List[dict]:
"""This method must be overridden in child classes where you specify
a path in a JSON document where posts are stored. E.g., on 4chan this is
['posts'], and on 2ch.hk it's ['threads'][0]['posts']."""
return lst
def _get_json(self) -> dict:
"""Retrieves a JSON representation of a thread and converts it in
a dictionary."""
try:
thread_url = self.json_thread_url.format(board=self._board, \
thread=self._thread)
req = Request(thread_url, headers={'User-Agent': USER_AGENT})
with urlopen(req) as url:
return loads(url.read().decode('utf-8'))
except:
raise ThreadNotFoundError
def _parse_post(self, post: dict) -> List[FileInfo]:
"""Parses a single post and extracts files into `FileInfo` object."""
except HTTPError as e:
raise ThreadNotFoundError(str(e))
except Exception as e:
raise e
def _parse_post(self, post: dict) -> Optional[List[FileInfo]]:
"""Parses a single post and extracts files into `FileInfo` object.
Single object is wrapped in a list for convenient insertion into
a list."""
raise NotImplementedError
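
The rewritten `Parser` moves the shared work into the base class and leaves subclasses to say where the posts live and how to read one post. The sketch below mirrors that shape offline: `SketchParser` and `DemoParser` are hypothetical stand-ins that take an already decoded JSON document instead of fetching one, and the `file` key is invented for the example.

```python
from itertools import chain
from typing import List, Optional


class SketchParser:
    """Offline stand-in for the base class; takes decoded JSON directly."""

    def __init__(self, document: dict,
                 skip_posts: Optional[int] = None) -> None:
        posts = self._extract_posts_list(document)
        # The OP is remembered before skipping, so it survives skip_posts.
        self._op_post: dict = posts[0]
        self._posts = posts if skip_posts is None else posts[skip_posts:]
        # _parse_post returns a list or None; None results are dropped and
        # the remaining lists are flattened into one.
        self._files = list(chain.from_iterable(
            filter(None, map(self._parse_post, self._posts))))

    @property
    def subject_field(self) -> str:
        return "sub"

    @property
    def comment_field(self) -> str:
        return "com"

    @property
    def op(self) -> Optional[str]:
        op = ""
        if self.subject_field in self._op_post:
            op = f"{self._op_post[self.subject_field]}\n"
        if self.comment_field in self._op_post:
            op += self._op_post[self.comment_field]
        return op or None

    @property
    def files(self) -> List[str]:
        return self._files

    def _extract_posts_list(self, document: dict) -> List[dict]:
        raise NotImplementedError

    def _parse_post(self, post: dict) -> Optional[List[str]]:
        raise NotImplementedError


class DemoParser(SketchParser):
    """Hypothetical subclass: only the board-specific parts are defined."""

    def _extract_posts_list(self, document: dict) -> List[dict]:
        return document["posts"]

    def _parse_post(self, post: dict) -> Optional[List[str]]:
        return [post["file"]] if "file" in post else None


thread = {"posts": [
    {"sub": "Subject", "com": "Comment", "file": "a.png"},
    {"com": "no file here"},
    {"file": "b.png"},
]}
parser = DemoParser(thread, skip_posts=1)
print(parser.op)
print(parser.files)
```

With this layout a new board only needs the two overrides (plus, in the real class, its URL properties), which is the reduction in repetition the per-board diffs show.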

24
scrapthechan/parsers/__init__.py

@@ -1,6 +1,6 @@
"""Here are defined the JSON parsers for imageboards."""
from re import search
from typing import List
from typing import List, Optional
from scrapthechan.parser import Parser
@@ -8,33 +8,31 @@ from scrapthechan.parser import Parser
__all__ = ["SUPPORTED_IMAGEBOARDS", "get_parser_by_url", "get_parser_by_site"]
URLRX = r"https?:\/\/(?P<s>[\w\.]+)\/(?P<b>\w+)\/(?:\w+)?\/(?P<t>\w+)"
SUPPORTED_IMAGEBOARDS: List[str] = ["4chan.org", "lainchan.org", "2ch.hk", \
"8kun.top", "lolifox.cc"]
"8kun.top"]
def get_parser_by_url(url: str) -> Parser:
def get_parser_by_url(url: str, skip_posts: Optional[int] = None) -> Parser:
"""Parses URL and extracts from it site name, board and thread.
And then returns initialised Parser object for detected imageboard."""
URLRX = r"https?:\/\/(?P<s>[\w\.]+)\/(?P<b>\w+)\/(?:\w+)?\/(?P<t>\w+)"
site, board, thread = search(URLRX, url).groups()
return get_parser_by_site(site, board, thread)
return get_parser_by_site(site, board, thread, skip_posts)
def get_parser_by_site(site: str, board: str, thread: str) -> Parser:
def get_parser_by_site(site: str, board: str, thread: str,
skip_posts: Optional[int] = None) -> Parser:
"""Returns an initialised parser for `site` with `board` and `thread`."""
if '4chan' in site:
from .fourchan import FourChanParser
return FourChanParser(board, thread)
return FourChanParser(board, thread, skip_posts)
elif 'lainchan' in site:
from .lainchan import LainchanParser
return LainchanParser(board, thread)
return LainchanParser(board, thread, skip_posts)
elif '2ch' in site:
from .dvach import DvachParser
return DvachParser(board, thread)
return DvachParser(board, thread, skip_posts)
elif '8kun' in site:
from .eightkun import EightKunParser
return EightKunParser(board, thread)
elif 'lolifox' in site:
from .lolifox import LolifoxParser
return LolifoxParser(board, thread)
return EightKunParser(board, thread, skip_posts)
else:
raise NotImplementedError(f"Parser for {site} is not implemented")
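
To illustrate what `URLRX` captures, the sketch below applies the same pattern to a thread URL and returns the three named groups. `split_thread_url` is a hypothetical helper, and the URLs are examples only; unlike a bare `.groups()` call on the search result, it returns `None` when the URL does not match.

```python
from re import search
from typing import Optional, Tuple

URLRX = r"https?:\/\/(?P<s>[\w\.]+)\/(?P<b>\w+)\/(?:\w+)?\/(?P<t>\w+)"


def split_thread_url(url: str) -> Optional[Tuple[str, str, str]]:
    """Return (site, board, thread) from a thread URL, or None if no match.

    Hypothetical helper built around the URLRX pattern.
    """
    found = search(URLRX, url)
    if found is None:
        return None
    return found.group("s"), found.group("b"), found.group("t")


print(split_thread_url("https://boards.4chan.org/b/thread/1100500"))
print(split_thread_url("https://2ch.hk/b/res/123.html"))
print(split_thread_url("not a url"))
```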

53
scrapthechan/parsers/dvach.py

@@ -10,39 +10,54 @@ __all__ = ["DvachParser"]
class DvachParser(Parser):
"""JSON parser for 2ch.hk image board."""
__url_thread_json = "https://2ch.hk/{board}/res/{thread}.json"
__url_file_link = "https://2ch.hk"
def __init__(self, board: str, thread: str,
skip_posts: Optional[int] = None) -> None:
posts = self._get_json(self.__url_thread_json.format(board=board, \
thread=thread))['threads'][0]['posts']
super(DvachParser, self).__init__(board, thread, posts, skip_posts)
super().__init__(board, thread, skip_posts)
@property
def json_thread_url(self) -> str:
return "https://2ch.hk/{board}/res/{thread}.json"
@property
def file_base_url(self) -> str:
return "https://2ch.hk"
@property
def subject_field(self) -> str:
return "subject"
@property
def comment_field(self) -> str:
return "comment"
@property
def imageboard(self) -> str:
return "2ch.hk"
@property
def op(self) -> Optional[str]:
op = ""
if 'sub' in self._op_post:
op = f"{self._op_post['subject']}\n"
if 'com' in self._op_post:
op += self._op_post['comment']
return op if not op == "" else None
def _extract_posts_list(self, lst: List) -> List[dict]:
return lst['threads'][0]['posts']
def _parse_post(self, post) -> Optional[List[FileInfo]]:
if not 'files' in post: return None
files = []
for f in post['files']:
if match(f['fullname'], r"^image\.\w{1,4}$") is None:
fullname = f['fullname']
if not 'sticker' in f:
if match(r"^image\.\w+$", f['fullname']) is None:
fullname = f['fullname']
else:
fullname = f['name']
else:
fullname = f['name']
# Here's same thing as 4chan. 2ch.hk also has md5 field, so it is
# completely fine to hardcode `hash_algo`.
files.append(FileInfo(fullname, f['size'],
f"{self.__url_file_link}{f['path']}",
f['md5'], 'md5'))
if 'md5' in f:
files.append(FileInfo(fullname, f['size'],
f"{self.file_base_url}{f['path']}",
f['md5'], 'md5'))
else:
files.append(FileInfo(fullname, f['size'],
f"{self.file_base_url}{f['path']}",
None, None))
return files

58
scrapthechan/parsers/eightkun.py

@@ -1,63 +1,25 @@
from re import match
from typing import List, Optional
from typing import Optional
from scrapthechan.fileinfo import FileInfo
from scrapthechan.parser import Parser
from scrapthechan.parsers.tinyboardlike import TinyboardLikeParser
__all__ = ["EightKunParser"]
class EightKunParser(Parser):
class EightKunParser(TinyboardLikeParser):
"""JSON parser for 8kun.top image board."""
__url_thread_json = "https://8kun.top/{board}/res/{thread}.json"
__url_file_link = "https://media.8kun.top/file_store/{filename}"
def __init__(self, board: str, thread: str,
skip_posts: Optional[int] = None) -> None:
posts = self._get_json(self.__url_thread_json.format(board=board, \
thread=thread))['posts']
super(EightKunParser, self).__init__(board, thread, posts, skip_posts)
super().__init__(board, thread, skip_posts)
@property
def imageboard(self) -> str:
return "8kun.top"
@property
def op(self) -> Optional[str]:
op = ""
if 'sub' in self._op_post:
op = f"{self._op_post['sub']}\n"
if 'com' in self._op_post:
op += self._op_post['com']
return op if not op == "" else None
def _parse_post(self, post: dict) -> List[FileInfo]:
if not 'tim' in post: return None
dlfname = f"{post['tim']}{post['ext']}"
if "filename" in post:
if match(post['filename'], r"^image\.\w{1,4}$") is None:
filename = dlfname
else:
filename = f"{post['filename']}{post['ext']}"
files = []
files.append(FileInfo(filename, post['fsize'],
self.__url_file_link.format(board=self.board, filename=dlfname),
post['md5'], 'md5'))
if "extra_files" in post:
for f in post["extra_files"]:
dlfname = f"{f['tim']}{f['ext']}"
if "filename" in post:
if match(post['filename'], r"^image\.\w+$") is None:
filename = dlfname
else:
filename = f"{post['filename']}{post['ext']}"
dlurl = self.__url_file_link.format(board=self.board, \
filename=dlfname)
files.append(FileInfo(filename, f['fsize'], \
dlurl, f['md5'], 'md5'))
return files
def json_thread_url(self) -> str:
return "https://8kun.top/{board}/res/{thread}.json"
@property
def file_base_url(self) -> str:
return "https://media.8kun.top/file_dl/{filename}"

46
scrapthechan/parsers/fourchan.py

@@ -1,51 +1,25 @@
from re import match
from typing import List, Optional
from typing import Optional
from scrapthechan.fileinfo import FileInfo
from scrapthechan.parser import Parser
from scrapthechan.parsers.tinyboardlike import TinyboardLikeParser
__all__ = ["FourChanParser"]
class FourChanParser(Parser):
class FourChanParser(TinyboardLikeParser):
"""JSON parser for 4chan.org image board."""
__url_thread_json = "https://a.4cdn.org/{board}/thread/{thread}.json"
__url_file_link = "https://i.4cdn.org/{board}/{filename}"
def __init__(self, board: str, thread: str,
skip_posts: Optional[int] = None) -> None:
posts = self._get_json(self.__url_thread_json.format(board=board, \
thread=thread))['posts']
super(FourChanParser, self).__init__(board, thread, posts, skip_posts)
super().__init__(board, thread, skip_posts)
@property
def imageboard(self) -> str:
return "4chan.org"
@property
def op(self) -> Optional[str]:
op = ""
if 'sub' in self._op_post:
op = f"{self._op_post['sub']}\n"
if 'com' in self._op_post:
op += self._op_post['com']
return op if not op == "" else None
def _parse_post(self, post: dict) -> List[FileInfo]:
if not 'tim' in post: return None
dlfname = f"{post['tim']}{post['ext']}"
if "filename" in post:
if match(post['filename'], r"^image\.\w{1,4}$") is None:
filename = dlfname
else:
filename = f"{post['filename']}{post['ext']}"
# Hash algorithm is hardcoded since it is highly unlikely that it will
# be changed in foreseeable future. And if it'll change then this line
# will be necessarily updated anyway.
return [FileInfo(filename, post['fsize'],
self.__url_file_link.format(board=self.board, filename=dlfname),
post['md5'], 'md5')]
def json_thread_url(self) -> str:
return "https://a.4cdn.org/{board}/thread/{thread}.json"
@property
def file_base_url(self) -> str:
return "https://i.4cdn.org/{board}/{filename}"

63
scrapthechan/parsers/lainchan.py

@@ -1,66 +1,25 @@
from re import match
from typing import List, Optional
from typing import Optional
from scrapthechan.parser import Parser
from scrapthechan.fileinfo import FileInfo
from scrapthechan.parsers.tinyboardlike import TinyboardLikeParser
__all__ = ["LainchanParser"]
class LainchanParser(Parser):
"""JSON parser for lainchan.org image board.
JSON structure is identical to 4chan.org's, so this parser is just inherited
from 4chan.org's parser and only needed things are redefined.
"""
__url_thread_json = "https://lainchan.org/{board}/res/{thread}.json"
__url_file_link = "https://lainchan.org/{board}/src/{filename}"
class LainchanParser(TinyboardLikeParser):
"""JSON parser for lainchan.org image board."""
def __init__(self, board: str, thread: str,
skip_posts: Optional[int] = None) -> None:
posts = self._get_json(self.__url_thread_json.format(board=board, \
thread=thread))['posts']
super(LainchanParser, self).__init__(board, thread, posts, skip_posts)
super().__init__(board, thread, skip_posts)
@property
def imageboard(self) -> str:
return "lainchan.org"
@property
def op(self) -> Optional[str]:
op = ""
if 'sub' in self._op_post:
op = f"{self._op_post['sub']}\n"
if 'com' in self._op_post:
op += self._op_post['com']
return op if not op == "" else None
def _parse_post(self, post) -> List[FileInfo]:
if not 'tim' in post: return None
dlfname = f"{post['tim']}{post['ext']}"
if "filename" in post:
if match(post['filename'], r"^image\.\w{1,4}$") is None:
filename = dlfname
else:
filename = f"{post['filename']}{post['ext']}"
files = []
files.append(FileInfo(filename, post['fsize'],
self.__url_file_link.format(board=self.board, filename=dlfname),
post['md5'], 'md5'))
@property
def json_thread_url(self) -> str:
return "https://lainchan.org/{board}/res/{thread}.json"
if "extra_files" in post:
for f in post["extra_files"]:
dlfname = f"{f['tim']}{f['ext']}"
if "filename" in post:
if match(post['filename'], r"^image\.\w+$") is None:
filename = dlfname
else:
filename = f"{post['filename']}{post['ext']}"
dlurl = self.__url_file_link.format(board=self.board, \
filename=dlfname)
files.append(FileInfo(filename, f['fsize'], \
dlurl, f['md5'], 'md5'))
return files
@property
def file_base_url(self) -> str:
return "https://lainchan.org/{board}/src/{filename}"

65
scrapthechan/parsers/lolifox.py

@@ -1,65 +0,0 @@
from re import match
from typing import List, Optional
from scrapthechan.parser import Parser
from scrapthechan.fileinfo import FileInfo
__all__ = ["LolifoxParser"]
class LolifoxParser(Parser):
"""JSON parser for lolifox.cc image board.
JSON structure is identical to lainchan.org.
"""
__url_thread_json = "https://lolifox.cc/{board}/res/{thread}.json"
__url_file_link = "https://lolifox.cc/{board}/src/{filename}"
def __init__(self, board: str, thread: str,
skip_posts: Optional[int] = None) -> None:
posts = self._get_json(self.__url_thread_json.format(board=board, \
thread=thread))['posts']
super(LolifoxParser, self).__init__(board, thread, posts, skip_posts)
@property
def imageboard(self) -> str:
return "lolifox.cc"
@property
def op(self) -> Optional[str]:
op = ""
if 'sub' in self._op_post:
op = f"{self._op_post['sub']}\n"
if 'com' in self._op_post:
op += self._op_post['com']
return op if not op == "" else None
def _parse_post(self, post) -> List[FileInfo]:
if not 'tim' in post: return None
dlfname = f"{post['tim']}{post['ext']}"
if "filename" in post:
if match(post['filename'], r"^image\.\w{1,4}$") is None:
filename = dlfname
else:
filename = f"{post['filename']}{post['ext']}"
files = []
files.append(FileInfo(filename, post['fsize'],
self.__url_file_link.format(board=self.board, filename=dlfname),
post['md5'], 'md5'))
if "extra_files" in post:
for f in post["extra_files"]:
dlfname = f"{f['tim']}{f['ext']}"
if "filename" in post:
if match(post['filename'], r"^image\.\w+$") is None:
filename = dlfname
else:
filename = f"{post['filename']}{post['ext']}"
dlurl = self.__url_file_link.format(board=self.board, \
filename=dlfname)
files.append(FileInfo(filename, f['fsize'], \
dlurl, f['md5'], 'md5'))
return files

51
scrapthechan/parsers/tinyboardlike.py

@@ -0,0 +1,51 @@
from re import match
from typing import List, Optional
from scrapthechan.parser import Parser
from scrapthechan.fileinfo import FileInfo
__all__ = ["TinyboardLikeParser"]
class TinyboardLikeParser(Parser):
"""Base parser for imageboards that are based on Tinyboard, or have similar
JSON API."""
def __init__(self, board: str, thread: str,
skip_posts: Optional[int] = None) -> None:
super().__init__(board, thread, skip_posts)
def _extract_posts_list(self, lst: List) -> List[dict]:
return lst['posts']
def _parse_post(self, post: dict) -> Optional[List[FileInfo]]:
if not 'tim' in post: return None
dlfname = f"{post['tim']}{post['ext']}"
if "filename" in post:
if match(r"^image\.\w+$", post['filename']) is None:
filename = dlfname
else:
filename = f"{post['filename']}{post['ext']}"
files = []
files.append(FileInfo(filename, post['fsize'],
self.file_base_url.format(board=self.board, filename=dlfname),
post['md5'], 'md5'))
if "extra_files" in post:
for f in post["extra_files"]:
dlfname = f"{f['tim']}{f['ext']}"
if "filename" in post:
if match(r"^image\.\w+$", post['filename']) is None:
filename = dlfname
else:
filename = f"{post['filename']}{post['ext']}"
dlurl = self.file_base_url.format(board=self.board, \
filename=dlfname)
files.append(FileInfo(filename, f['fsize'], \
dlurl, f['md5'], 'md5'))
return files
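
In the Tinyboard-style JSON handled above, a post carries its first file in its own fields and any further files under `extra_files`. One way to treat both uniformly is sketched below; `iter_post_files` and `download_name` are hypothetical helpers, not part of the project, and the sample post is invented. Each entry is read through its own `tim` and `ext` fields.

```python
from typing import List


def iter_post_files(post: dict) -> List[dict]:
    """Return the post's own file entry followed by its extra files.

    Hypothetical helper: a post without `tim` is treated as having no files.
    """
    if "tim" not in post:
        return []
    return [post] + list(post.get("extra_files", []))


def download_name(entry: dict) -> str:
    """Server-side file name: the entry's `tim` followed by its `ext`."""
    return f"{entry['tim']}{entry['ext']}"


# Invented sample post with one main file and one extra file.
post = {
    "tim": "1600000000000", "ext": ".png",
    "extra_files": [{"tim": "1600000000001", "ext": ".jpg"}],
}
print([download_name(entry) for entry in iter_post_files(post)])
```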

201
scrapthechan/scraper.py

@@ -5,8 +5,9 @@ from os import remove, stat
from os.path import exists, join, getsize
import re
from typing import List, Callable
from urllib.request import urlretrieve, URLopener
from urllib.request import urlretrieve, URLopener, HTTPError, URLError
import hashlib
from http.client import HTTPException
from scrapthechan import USER_AGENT
from scrapthechan.fileinfo import FileInfo
@ -15,83 +16,131 @@ __all__ = ["Scraper"]
class Scraper:
    """Base class for all scrapers that will actually do the job.

    Arguments:
    save_directory -- a path to a directory where files will be saved;
    files -- a list of FileInfo objects;
    download_progress_callback -- a callback function that will be called
    for each file that starts downloading.
    """
    def __init__(self, save_directory: str, files: List[FileInfo],
            download_progress_callback: Callable[[int], None] = None) -> None:
        self._save_directory = save_directory
        self._files = files
        self._url_opener = URLopener()
        self._url_opener.addheaders = [('User-Agent', USER_AGENT)]
        self._url_opener.version = USER_AGENT
        self._progress_callback = download_progress_callback
    def run(self):
        raise NotImplementedError
    def _same_filename(self, filename: str, path: str) -> str:
        """Check if there is a file with the same name. If so, append an
        incremental number enclosed in brackets to the name of the new one."""
        newname = filename
        while exists(join(path, newname)):
            has_extension = newname.rfind(".") != -1
            if has_extension:
                l, r = newname.rsplit(".", 1)
                lbracket = l.rfind("(")
                if lbracket == -1:
                    newname = f"{l}(1).{r}"
                else:
                    num = l[lbracket+1:-1]
                    if num.isnumeric():
                        newname = f"{l[:lbracket]}({int(num)+1}).{r}"
                    else:
                        newname = f"{l}(1).{r}"
            else:
                # No extension: search the whole name (`l` is not defined here).
                lbracket = newname.rfind("(")
                if lbracket == -1:
                    newname = f"{newname}(1)"
                else:
                    num = newname[lbracket+1:-1]
                    if num.isnumeric():
                        newname = f"{newname[:lbracket]}({int(num)+1})"
                    else:
                        # Mirror the extension branch so the loop always advances.
                        newname = f"{newname}(1)"
        return newname
    def _hash_file(self, filename: str, hash_algo: str = "md5",
            blocksize: int = 1048576) -> (str, str):
        """Compute hash of a file."""
        hash_func = hashlib.new(hash_algo)
        with open(filename, 'rb') as f:
            buf = f.read(blocksize)
            while len(buf) > 0:
                hash_func.update(buf)
                buf = f.read(blocksize)
        return hash_func.hexdigest(), hash_func.digest()
    def _hash_file(self, filepath: str, hash_algorithm: str = "md5",
            blocksize: int = 1048576) -> (str, str):
        """Compute hash of a file."""
        if hash_algorithm is None:
            return None
        hash_func = hashlib.new(hash_algorithm)
        with open(filepath, 'rb') as f:
            buf = f.read(blocksize)
            while len(buf) > 0:
                hash_func.update(buf)
                buf = f.read(blocksize)
        return hash_func.hexdigest(), b64encode(hash_func.digest()).decode()
    def _is_file_ok(self, f: FileInfo, filepath: str) -> bool:
        """Check if a file exists and isn't broken."""
        if not exists(filepath):
            return False
        computed_size = getsize(filepath)
        is_size_match = f.size == computed_size \
            or f.size == round(computed_size / 1024)
        hexdig, dig = self._hash_file(filepath, f.hash_algo)
        is_hash_match = f.hash_value == hexdig \
            or f.hash_value == b64encode(dig).decode()
        return is_size_match and is_hash_match
    def _check_file(self, f: FileInfo, filepath: str) -> bool:
        """Check if a file exists and isn't broken."""
        if not exists(filepath):
            return False
        computed_size = getsize(filepath)
        if not (f.size == computed_size \
            or f.size == round(computed_size / 1024)):
            return False
        if f.hash_algorithm is not None:
            hexdig, dig = self._hash_file(filepath, f.hash_algorithm)
            return f.hash_value == hexdig or f.hash_value == dig
        return True
    def _download_file(self, f: FileInfo):
        """Download a single file."""
        filepath = join(self._save_directory, f.name)
        if self._is_file_ok(f, filepath):
            return True
        elif exists(filepath):
            filepath = join(self._save_directory, \
                self._same_filename(f.name, self._save_directory))
        self._url_opener.retrieve(f.dlurl, filepath)
    def _download_file(self, f: FileInfo):
        """Download a single file."""
        is_same_filename = False
        filepath = join(self._save_directory, f.name)
        orig_filepath = filepath
        if self._check_file(f, filepath):
            return
        elif exists(filepath):
            is_same_filename = True
            filepath = join(self._save_directory, \
                self._same_filename(f.name, self._save_directory))
        try:
            retries = 3
            while retries > 0:
                self._url_opener.retrieve(f.download_url, filepath)
                if not self._check_file(f, filepath):
                    remove(filepath)
                    retries -= 1
                else:
                    break
            if retries == 0:
                print(f"Cannot retrieve {f.download_url}, {filepath}.")
                return
            # _hash_file returns None when no algorithm is set, so only
            # compare digests when there is one to compute.
            if is_same_filename and f.hash_algorithm is not None:
                _, f1_dig = self._hash_file(orig_filepath, f.hash_algorithm)
                _, f2_dig = self._hash_file(filepath, f.hash_algorithm)
                if f1_dig == f2_dig:
                    remove(filepath)
        except FileNotFoundError as e:
            print("File Not Found", filepath)
        except HTTPError as e:
            print("HTTP Error", e.code, e.reason, f.download_url)
            if exists(filepath):
                remove(filepath)
        except HTTPException:
            print("HTTP Exception for", f.download_url)
            if exists(filepath):
                remove(filepath)
        except URLError as e:
            print("URL Error for", f.download_url)
            if exists(filepath):
                remove(filepath)
        except ConnectionResetError:
            print("Connection reset for", f.download_url)
            if exists(filepath):
                remove(filepath)
        except ConnectionRefusedError:
            print("Connection refused for", f.download_url)
            if exists(filepath):
                remove(filepath)
        except ConnectionAbortedError:
            print("Connection aborted for", f.download_url)
            if exists(filepath):
                remove(filepath)
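
The retry logic in the new `_download_file` follows a download, verify, retry pattern. A minimal standalone sketch of that pattern is shown below, under stated assumptions: `file_md5`, `fetch_verified`, and `demo` are hypothetical names, the transfer is an injected callable rather than `URLopener.retrieve`, and verification uses only an MD5 hex digest (the size check is omitted).

```python
import hashlib
import os
import tempfile
from typing import Callable, Tuple


def file_md5(path: str, blocksize: int = 1048576) -> str:
    """Hash a file in blocks so large files are never read in one piece."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(blocksize), b""):
            digest.update(block)
    return digest.hexdigest()


def fetch_verified(fetch: Callable[[str], None], path: str,
                   expected_md5: str, retries: int = 3) -> bool:
    """Call fetch(path) up to `retries` times until the digest matches."""
    for _ in range(retries):
        fetch(path)
        if os.path.exists(path) and file_md5(path) == expected_md5:
            return True
        # Drop the broken result so a later attempt starts clean.
        if os.path.exists(path):
            os.remove(path)
    return False


def demo() -> Tuple[bool, int]:
    """Simulate a transfer that is corrupted twice and succeeds on the third try."""
    good = b"payload"
    calls = []

    def flaky_fetch(path: str) -> None:
        calls.append(path)
        data = good if len(calls) == 3 else b"truncated"
        with open(path, "wb") as fh:
            fh.write(data)

    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "file.bin")
        ok = fetch_verified(flaky_fetch, target, hashlib.md5(good).hexdigest())
        return ok, len(calls)
```

Passing the transfer in as a callable keeps the retry loop testable without network access; in the scraper the same role is played by `self._url_opener.retrieve`.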

39
scrapthechan/scrapers/threadedscraper.py

@ -7,25 +7,26 @@ from multiprocessing.pool import ThreadPool
from scrapthechan.scraper import Scraper
from scrapthechan.fileinfo import FileInfo
__all__ = ["ThreadedScraper"]
class ThreadedScraper(Scraper):
    def __init__(self, save_directory: str, files: List[FileInfo],
            download_progress_callback: Callable[[int], None] = None) -> None:
        super(ThreadedScraper, self).__init__(save_directory, files,
            download_progress_callback)