
Compare commits

26 Commits

Author SHA1 Message Date
Alexander Andreev 43909c2b29
Changelog updated with 0.5.1 changes. 1 year ago
Alexander Andreev acbfaefa9c
Version changed to 0.5.1 in a Makefile. 1 year ago
Alexander Andreev 86ef44aa07
Version changed to 0.5.1. 1 year ago
Alexander Andreev 419fb2b673
Removed excessive comparison of hash. Added message when file cannot be retrieved. 1 year ago
Alexander Andreev 0287d3a132
Turned a string into f-string. 1 year ago
Alexander Andreev 245e33f40d
README updated. lolifox.cc removed. Option --skip-posts added. 1 year ago
Alexander Andreev e092c905b2
Makefile updated to version 0.5.0. 1 year ago
Alexander Andreev 90338073ed
Updated CHANGELOG with version 0.5.0. 1 year ago
Alexander Andreev cdcc184de8
Lolifox removed. Development Status classifier is changed to Alpha. Python 3.7 classifier left to represent oldest supported version. 1 year ago
Alexander Andreev b335891097
Copyright, date, and version are updated. 1 year ago
Alexander Andreev 1213cef776
Lolifox removed. Added skip_posts handling. 1 year ago
Alexander Andreev 78d4a62c17
IB parsers rewritten accordingly to fixed Parser class. 1 year ago
Alexander Andreev f3ef07af68
Rewrite of Parser class because it was fucked up. Now there's no problems with inheritance and its subclasses now more pleasant to write. ThreadNotFoundError now has a reason field. 1 year ago
Alexander Andreev 6373518dc3
Added order=True for FIleInfo to make sure that order of fields is preserved. 1 year ago
Alexander Andreev caf18a1bf0
Added option --skip-posts and messages are now takes just one line. 1 year ago
Alexander Andreev 751549f575
A new generalised class for all imageboards based on Tinyboard or having identical API. 1 year ago
Alexander Andreev 38b5740d73
Removing lolifox.cc parser because this board is dead. 1 year ago
Alexander Andreev 2f9d26427c
Now incrementing _files_downloaded happens when _progress_callback is set. And made super() with no args. 1 year ago
Alexander Andreev e7cf2e7c4b
Added a missing return True statement in _check_file 1 year ago
Alexander Andreev 4f6f56ae7b
Version in a Makefile is changed to 0.4.1. 1 year ago
Alexander Andreev 503eb9959b
Version updated to 0.4.1. 1 year ago
Alexander Andreev cb2e0d77f7
Changelog update for 0.4.1. 1 year ago
Alexander Andreev 93e442939a
Dvach's stickers handling. 1 year ago
Alexander Andreev 6022c9929a
Added HTTP and URL exceptions handling. 1 year ago
Alexander Andreev f79abcc310
In classifiers licence was fixed and added more topics related to a program. 2 years ago
Alexander Andreev 9cdb510325
A little fix for README. 2 years ago
  1. CHANGELOG.md (33)
  2. Makefile (2)
  3. README.md (19)
  4. scrapthechan/__init__.py (6)
  5. scrapthechan/cli/scraper.py (64)
  6. scrapthechan/fileinfo.py (2)
  7. scrapthechan/parser.py (81)
  8. scrapthechan/parsers/__init__.py (24)
  9. scrapthechan/parsers/dvach.py (55)
  10. scrapthechan/parsers/eightkun.py (58)
  11. scrapthechan/parsers/fourchan.py (46)
  12. scrapthechan/parsers/lainchan.py (63)
  13. scrapthechan/parsers/tinyboardlike.py (38)
  14. scrapthechan/scraper.py (31)
  15. scrapthechan/scrapers/threadedscraper.py (39)
  16. setup.cfg (12)

33
CHANGELOG.md

@ -1,5 +1,38 @@
# Changelog
## 0.5.1 - 2021-05-04
## Added
- Message when a file cannot be retrieved.
## Fixed
- Removed excessive hash comparison when files has same name;
- A string forgotten to set to be a f-string, so now it displays a reason of why
thread wasn't found.
## 0.5.0 - 2021-05-03
## Added
- Now program makes use of skip_posts argument. Use CLI option `-S <number>`
or `--skip-posts <number>` to set how much posts you want to skip.
## Changed
- Better, minified messages;
- Fixed inheritance of `Scraper`'s subclasses and its sane rewrite that led to
future easy extension with way less repeating.
- Added a general class `TinyboardLikeParser` that implements post parser for
all imageboards based on it or the ones that have identical JSON API. From now
on all such generalisation classes will end with `*LikeParser`;
- Changed `file_base_url` for 8kun.top.
## Removed
- Support for Lolifox, since it's gone.
## 0.4.1 - 2020-12-08
## Fixed
- Now HTTPException from http.client and URLError from urllib.request
are handled;
- 2ch.hk's stickers handling.
## 0.4.0 - 2020-11-18
### Added
- For 2ch.hk check for if a file is a sticker was added;

2
Makefile

@ -1,7 +1,7 @@
build: scrapthechan README.md setup.cfg
python setup.py sdist bdist_wheel
install:
python -m pip install --upgrade dist/scrapthechan-0.4.0-py3-none-any.whl --user
python -m pip install --upgrade dist/scrapthechan-0.5.1-py3-none-any.whl --user
uninstall:
# We change directory so pip uninstall will run, it'll fail otherwise.
@cd ~/

19
README.md

@ -24,12 +24,15 @@ separately. E.g. `4chan b 1100500`.
`-o`, `--output-dir` -- output directory where all files will be dumped to.
`--no-op` -- by default OP's post will be saved in a `!op.txt` file. This flag
disables this behaviour. An exclamation mark `!` in a name is for so this file
will be on the top of a directory listing.
`-N`, `--no-op` -- by default OP's post will be saved in a `!op.txt` file. This
flag disables this behaviour. An exclamation mark `!` in a name is for so this
file will be on the top of a directory listing.
`-v`, `--version` prints the version of the program, and `-h`, `--help` prints
help for a program.
`-S <num>`, `--skip-posts <num>` -- skip given number of posts.
`-v`, `--version` prints the version of the program.
`-h`, `--help` prints help for a program.
# Supported imageboards
@ -37,4 +40,8 @@ help for a program.
- [lainchan.org](https://lainchan.org) since 0.1.0
- [2ch.hk](https://2ch.hk) since 0.1.0
- [8kun.top](https://8kun.top) since 0.2.2
- [lolifox.cc](https://lolifox.cc) since 0.3.0
# TODO
- Sane rewrite of a program;
- Thread watcher.

6
scrapthechan/__init__.py

@ -1,8 +1,8 @@
__date__ = "18 November 2020"
__version__ = "0.4.0"
__date__ = "4 May 2021"
__version__ = "0.5.1"
__author__ = "Alexander \"Arav\" Andreev"
__email__ = "me@arav.top"
__copyright__ = f"Copyright (c) 2020 {__author__} <{__email__}>"
__copyright__ = f"Copyright (c) 2020,2021 {__author__} <{__email__}>"
__license__ = \
"""This program is licensed under the terms of the MIT license.
For a copy see COPYING file in a directory of the program, or

64
scrapthechan/cli/scraper.py

@ -3,7 +3,7 @@ from os import makedirs
from os.path import join, exists
from re import search
from sys import argv
from typing import List
from typing import List, Optional
from scrapthechan import VERSION
from scrapthechan.parser import Parser, ThreadNotFoundError
@ -15,17 +15,18 @@ from scrapthechan.scrapers.threadedscraper import ThreadedScraper
__all__ = ["main"]
USAGE = \
USAGE: str = \
f"""Usage: scrapthechan [OPTIONS] (URL | IMAGEBOARD BOARD THREAD)
Options:
\t-h,--help -- print this help and exit;
\t-v,--version -- print program's version and exit;
\t-o,--output-dir -- directory where to place scraped files. By default
\t following structure will be created in current directory:
\t <imageboard>/<board>/<thread>;
\t-N,--no-op -- by default OP's post will be written in !op.txt file. This
\t option disables this behaviour;
\t-h,--help -- print this help and exit;
\t-v,--version -- print program's version and exit;
\t-o,--output-dir -- directory where to place scraped files. By default
\t following structure will be created in current directory:
\t <imageboard>/<board>/<thread>;
\t-N,--no-op -- by default OP's post will be written in !op.txt file. This
\t option disables this behaviour;
\t-S,--skip-posts <num> -- skip given number of posts.
Arguments:
\tURL -- URL of a thread;
@ -37,15 +38,15 @@ Supported imageboards: {', '.join(SUPPORTED_IMAGEBOARDS)}.
"""
def parse_common_arguments(args: str) -> dict:
r = r"(?P<help>-h|--help)|(?P<version>-v|--version)"
args = search(r, args)
if not args is None:
args = args.groupdict()
return {
"help": not args["help"] is None,
"version": not args["version"] is None }
return None
def parse_common_arguments(args: str) -> Optional[dict]:
r = r"(?P<help>-h|--help)|(?P<version>-v|--version)"
args = search(r, args)
if not args is None:
args = args.groupdict()
return {
"help": not args["help"] is None,
"version": not args["version"] is None }
return None
def parse_arguments(args: str) -> dict:
rlink = r"^(https?:\/\/)?(?P<site>[\w.-]+)[ \/](?P<board>\w+)(\S+)?[ \/](?P<thread>\w+)"
@ -53,10 +54,12 @@ def parse_arguments(args: str) -> dict:
if not link is None:
link = link.groupdict()
out_dir = search(r"(?=(-o|--output-dir) (?P<outdir>\S+))", args)
skip_posts = search(r"(?=(-S|--skip-posts) (?P<skip>\d+))", args)
return {
"site": None if link is None else link["site"],
"board": None if link is None else link["board"],
"thread": None if link is None else link["thread"],
"skip-posts": None if skip_posts is None else int(skip_posts.group('skip')),
"no-op": not search(r"-N|--no-op", args) is None,
"output-dir": None if out_dir is None \
else out_dir.groupdict()["outdir"] }
@ -82,17 +85,21 @@ def main() -> None:
exit()
try:
parser = get_parser_by_site(args["site"], args["board"], args["thread"])
if not args["skip-posts"] is None:
parser = get_parser_by_site(args["site"], args["board"],
args["thread"], args["skip-posts"])
else:
parser = get_parser_by_site(args["site"], args["board"],
args["thread"])
except NotImplementedError as ex:
print(f"{str(ex)}.")
print(f"Supported image boards are {', '.join(SUPPORTED_IMAGEBOARDS)}")
exit()
except ThreadNotFoundError:
except ThreadNotFoundError as e:
print(f"Thread {args['site']}/{args['board']}/{args['thread']} " \
"is no longer exist.")
f"not found. Reason: {e.reason}")
exit()
files_count = len(parser.files)
if not args["output-dir"] is None:
@ -101,23 +108,22 @@ def main() -> None:
save_dir = join(parser.imageboard, parser.board,
parser.thread)
print(f"There are {files_count} files in " \
f"{args['site']}/{args['board']}/{args['thread']}." \
f"They will be saved in {save_dir}.")
print(f"{files_count} files in " \
f"{args['site']}/{args['board']}/{args['thread']}. " \
f"They're going to {save_dir}. ", end="")
makedirs(save_dir, exist_ok=True)
if not args["no-op"]:
print("Writing OP... ", end='')
if parser.op is None:
print("No text's there.")
print("OP's empty.")
elif not exists(join(save_dir, "!op.txt")):
with open(join(save_dir, "!op.txt"), 'w', encoding='utf-8') as opf:
opf.write(f"{parser.op}\n")
print("Done.")
print("OP's written.")
else:
print("Exists.")
print("OP exists.")
scraper = ThreadedScraper(save_dir, parser.files, \
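
Review note: the new `--skip-posts` handling boils down to one regex lookup plus a list slice. The sketch below isolates those two steps so they can be tried without the rest of the CLI. The regex is copied from the diff; the helper names `parse_skip_posts` and `skip_posts` are hypothetical and do not exist in the repository.

```python
from re import search
from typing import Optional


def parse_skip_posts(args: str) -> Optional[int]:
    """Return the number given after -S/--skip-posts, or None if absent."""
    # Same lookahead pattern as in parse_arguments in the diff above.
    found = search(r"(?=(-S|--skip-posts) (?P<skip>\d+))", args)
    return None if found is None else int(found.group("skip"))


def skip_posts(posts: list, skip: Optional[int]) -> list:
    """Drop the first `skip` posts; keep everything when skip is None."""
    return posts if skip is None else posts[skip:]
```

For example, `parse_skip_posts("4chan b 1100500 -S 10")` yields `10`, and a command line without the option yields `None`, which leaves the post list untouched.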

2
scrapthechan/fileinfo.py

@ -5,7 +5,7 @@ from dataclasses import dataclass
__all__ = ["FileInfo"]
@dataclass(frozen=True)
@dataclass(frozen=True, order=True)
class FileInfo:
"""Stores information about a file.

81
scrapthechan/parser.py

@ -4,7 +4,7 @@ from itertools import chain
from json import loads
from re import findall, match
from typing import List, Optional
from urllib.request import urlopen, Request
from urllib.request import urlopen, Request, HTTPError
from scrapthechan import USER_AGENT
from scrapthechan.fileinfo import FileInfo
@ -14,7 +14,12 @@ __all__ = ["Parser", "ThreadNotFoundError"]
class ThreadNotFoundError(Exception):
pass
def __init__(self, reason: str = ""):
self._reason = reason
@property
def reason(self) -> str:
return self._reason
class Parser:
@ -25,28 +30,42 @@ class Parser:
Arguments:
board -- is a name of a board on an image board;
thread -- is a name of a thread inside a board;
posts -- is a list of posts in form of dictionaries exported from a JSON;
thread -- is an id of a thread inside a board;
skip_posts -- number of posts to skip.
All the extracted files will be stored as the `FileInfo` objects."""
__url_thread_json: str = "https://example.org/{board}/{thread}.json"
__url_file_link: str = None
def __init__(self, board: str, thread: str, posts: List[dict],
def __init__(self, board: str, thread: str,
skip_posts: Optional[int] = None) -> None:
self._board = board
self._thread = thread
self._op_post = posts[0]
if not skip_posts is None:
posts = posts[skip_posts:]
self._board: str = board
self._thread: str = thread
self._posts = self._extract_posts_list(self._get_json())
self._op_post: dict = self._posts[0]
self._posts = self._posts[skip_posts:] if not skip_posts is None else self._posts
self._files = list(chain.from_iterable(filter(None, \
map(self._parse_post, posts))))
map(self._parse_post, self._posts))))
@property
def json_thread_url(self) -> str:
raise NotImplementedError
@property
def file_base_url(self) -> str:
raise NotImplementedError
@property
def subject_field(self) -> str:
return "sub"
@property
def comment_field(self) -> str:
return "com"
@property
def imageboard(self) -> str:
"""Returns image board's name."""
return NotImplementedError
raise NotImplementedError
@property
def board(self) -> str:
@ -62,22 +81,40 @@ class Parser:
def op(self) -> str:
"""Returns OP's post as combination of subject and comment separated
by a new line."""
raise NotImplementedError
op = ""
if self.subject_field in self._op_post:
op = f"{self._op_post[self.subject_field]}\n"
if self.comment_field in self._op_post:
op += self._op_post[self.comment_field]
return op if not op == "" else None
@property
def files(self) -> List[FileInfo]:
"""Returns a list of retrieved files as `FileInfo` objects."""
return self._files
def _get_json(self, thread_url: str) -> dict:
"""Gets JSON version of a thread and converts it in a dictionary."""
def _extract_posts_list(self, lst: List) -> List[dict]:
"""This method must be overridden in child classes where you specify
a path in a JSON document where posts are stored. E.g., on 4chan this is
['posts'], and on 2ch.hk it's ['threads'][0]['posts']."""
return lst
def _get_json(self) -> dict:
"""Retrieves a JSON representation of a thread and converts it in
a dictionary."""
try:
thread_url = self.json_thread_url.format(board=self._board, \
thread=self._thread)
req = Request(thread_url, headers={'User-Agent': USER_AGENT})
with urlopen(req) as url:
return loads(url.read().decode('utf-8'))
except:
raise ThreadNotFoundError
def _parse_post(self, post: dict) -> List[FileInfo]:
"""Parses a single post and extracts files into `FileInfo` object."""
except HTTPError as e:
raise ThreadNotFoundError(str(e))
except Exception as e:
raise e
def _parse_post(self, post: dict) -> Optional[List[FileInfo]]:
"""Parses a single post and extracts files into `FileInfo` object.
Single object is wrapped in a list for convenient insertion into
a list."""
raise NotImplementedError
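
Review note: the rewritten `Parser` follows a template-method layout. The base class owns the flow (extract posts, remember the OP, apply `skip_posts`, build the OP text) while subclasses only supply field names and the path to the post list. The sketch below shows that layout without network access. `SketchParser` and `SketchDvachParser` are hypothetical simplifications that take an already decoded document instead of fetching JSON.

```python
from typing import List, Optional


class SketchParser:
    """Simplified stand-in for Parser; takes a decoded document directly."""

    def __init__(self, document: dict, skip_posts: Optional[int] = None) -> None:
        posts = self._extract_posts_list(document)
        self._op_post = posts[0]  # the OP is remembered before skipping
        self._posts = posts if skip_posts is None else posts[skip_posts:]

    @property
    def subject_field(self) -> str:
        return "sub"

    @property
    def comment_field(self) -> str:
        return "com"

    @property
    def op(self) -> Optional[str]:
        op = ""
        if self.subject_field in self._op_post:
            op = f"{self._op_post[self.subject_field]}\n"
        if self.comment_field in self._op_post:
            op += self._op_post[self.comment_field]
        return op or None

    def _extract_posts_list(self, document: dict) -> List[dict]:
        raise NotImplementedError


class SketchDvachParser(SketchParser):
    """Overrides only what differs for a 2ch.hk-shaped document."""

    @property
    def subject_field(self) -> str:
        return "subject"

    @property
    def comment_field(self) -> str:
        return "comment"

    def _extract_posts_list(self, document: dict) -> List[dict]:
        return document["threads"][0]["posts"]


doc = {"threads": [{"posts": [
    {"subject": "Hello", "comment": "first"},
    {"comment": "second"},
    {"comment": "third"},
]}]}
parser = SketchDvachParser(doc, skip_posts=2)
```

Because the OP is captured before slicing, `parser.op` still returns the first post's text even when that post itself is skipped.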

24
scrapthechan/parsers/__init__.py

@ -1,6 +1,6 @@
"""Here are defined the JSON parsers for imageboards."""
from re import search
from typing import List
from typing import List, Optional
from scrapthechan.parser import Parser
@ -8,33 +8,31 @@ from scrapthechan.parser import Parser
__all__ = ["SUPPORTED_IMAGEBOARDS", "get_parser_by_url", "get_parser_by_site"]
URLRX = r"https?:\/\/(?P<s>[\w\.]+)\/(?P<b>\w+)\/(?:\w+)?\/(?P<t>\w+)"
SUPPORTED_IMAGEBOARDS: List[str] = ["4chan.org", "lainchan.org", "2ch.hk", \
"8kun.top", "lolifox.cc"]
"8kun.top"]
def get_parser_by_url(url: str) -> Parser:
def get_parser_by_url(url: str, skip_posts: Optional[int] = None) -> Parser:
"""Parses URL and extracts from it site name, board and thread.
And then returns initialised Parser object for detected imageboard."""
URLRX = r"https?:\/\/(?P<s>[\w\.]+)\/(?P<b>\w+)\/(?:\w+)?\/(?P<t>\w+)"
site, board, thread = search(URLRX, url).groups()
return get_parser_by_site(site, board, thread)
return get_parser_by_site(site, board, thread, skip_posts)
def get_parser_by_site(site: str, board: str, thread: str) -> Parser:
def get_parser_by_site(site: str, board: str, thread: str,
skip_posts: Optional[int] = None) -> Parser:
"""Returns an initialised parser for `site` with `board` and `thread`."""
if '4chan' in site:
from .fourchan import FourChanParser
return FourChanParser(board, thread)
return FourChanParser(board, thread, skip_posts)
elif 'lainchan' in site:
from .lainchan import LainchanParser
return LainchanParser(board, thread)
return LainchanParser(board, thread, skip_posts)
elif '2ch' in site:
from .dvach import DvachParser
return DvachParser(board, thread)
return DvachParser(board, thread, skip_posts)
elif '8kun' in site:
from .eightkun import EightKunParser
return EightKunParser(board, thread)
elif 'lolifox' in site:
from .lolifox import LolifoxParser
return LolifoxParser(board, thread)
return EightKunParser(board, thread, skip_posts)
else:
raise NotImplementedError(f"Parser for {site} is not implemented")
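
Review note: `get_parser_by_url` calls `.groups()` directly on the result of `search`, which raises `AttributeError` when the URL does not match. The sketch below reuses the `URLRX` pattern from the diff and returns `None` for that case instead. The function name `split_thread_url` is hypothetical.

```python
from re import search
from typing import Optional, Tuple

# Pattern copied from the diff above.
URLRX = r"https?:\/\/(?P<s>[\w\.]+)\/(?P<b>\w+)\/(?:\w+)?\/(?P<t>\w+)"


def split_thread_url(url: str) -> Optional[Tuple[str, str, str]]:
    """Return (site, board, thread), or None when the URL does not match."""
    found = search(URLRX, url)
    if found is None:
        return None
    return found.group("s"), found.group("b"), found.group("t")
```

The optional non-capturing group absorbs the path segment between board and thread (`thread` on 4chan, `res` on the Tinyboard-style boards), so only three values come back.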

55
scrapthechan/parsers/dvach.py

@ -10,41 +10,54 @@ __all__ = ["DvachParser"]
class DvachParser(Parser):
"""JSON parser for 2ch.hk image board."""
__url_thread_json = "https://2ch.hk/{board}/res/{thread}.json"
__url_file_link = "https://2ch.hk"
def __init__(self, board: str, thread: str,
skip_posts: Optional[int] = None) -> None:
posts = self._get_json(self.__url_thread_json.format(board=board, \
thread=thread))['threads'][0]['posts']
super(DvachParser, self).__init__(board, thread, posts, skip_posts)
super().__init__(board, thread, skip_posts)
@property
def json_thread_url(self) -> str:
return "https://2ch.hk/{board}/res/{thread}.json"
@property
def file_base_url(self) -> str:
return "https://2ch.hk"
@property
def subject_field(self) -> str:
return "subject"
@property
def comment_field(self) -> str:
return "comment"
@property
def imageboard(self) -> str:
return "2ch.hk"
@property
def op(self) -> Optional[str]:
op = ""
if 'subject' in self._op_post:
op = f"{self._op_post['subject']}\n"
if 'comment' in self._op_post:
op += self._op_post['comment']
return op if not op == "" else None
def _extract_posts_list(self, lst: List) -> List[dict]:
return lst['threads'][0]['posts']
def _parse_post(self, post) -> Optional[List[FileInfo]]:
if not 'files' in post: return None
files = []
for f in post['files']:
if 'sticker' in f:
continue
if match(r"^image\.\w+$", f['fullname']) is None:
fullname = f['fullname']
if not 'sticker' in f:
if match(r"^image\.\w+$", f['fullname']) is None:
fullname = f['fullname']
else:
fullname = f['name']
else:
fullname = f['name']
# Here's same thing as 4chan. 2ch.hk also has md5 field, so it is
# completely fine to hardcode `hash_algo`.
files.append(FileInfo(fullname, f['size'],
f"{self.__url_file_link}{f['path']}",
f['md5'], 'md5'))
if 'md5' in f:
files.append(FileInfo(fullname, f['size'],
f"{self.file_base_url}{f['path']}",
f['md5'], 'md5'))
else:
files.append(FileInfo(fullname, f['size'],
f"{self.file_base_url}{f['path']}",
None, None))
return files

58
scrapthechan/parsers/eightkun.py

@ -1,63 +1,25 @@
from re import match
from typing import List, Optional
from typing import Optional
from scrapthechan.fileinfo import FileInfo
from scrapthechan.parser import Parser
from scrapthechan.parsers.tinyboardlike import TinyboardLikeParser
__all__ = ["EightKunParser"]
class EightKunParser(Parser):
class EightKunParser(TinyboardLikeParser):
"""JSON parser for 8kun.top image board."""
__url_thread_json = "https://8kun.top/{board}/res/{thread}.json"
__url_file_link = "https://media.8kun.top/file_store/{filename}"
def __init__(self, board: str, thread: str,
skip_posts: Optional[int] = None) -> None:
posts = self._get_json(self.__url_thread_json.format(board=board, \
thread=thread))['posts']
super(EightKunParser, self).__init__(board, thread, posts, skip_posts)
super().__init__(board, thread, skip_posts)
@property
def imageboard(self) -> str:
return "8kun.top"
@property
def op(self) -> Optional[str]:
op = ""
if 'sub' in self._op_post:
op = f"{self._op_post['sub']}\n"
if 'com' in self._op_post:
op += self._op_post['com']
return op if not op == "" else None
def _parse_post(self, post: dict) -> List[FileInfo]:
if not 'tim' in post: return None
dlfname = f"{post['tim']}{post['ext']}"
if "filename" in post:
if match(r"^image\.\w+$", post['filename']) is None:
filename = dlfname
else:
filename = f"{post['filename']}{post['ext']}"
files = []
files.append(FileInfo(filename, post['fsize'],
self.__url_file_link.format(board=self.board, filename=dlfname),
post['md5'], 'md5'))
if "extra_files" in post:
for f in post["extra_files"]:
dlfname = f"{f['tim']}{f['ext']}"
if "filename" in post:
if match(r"^image\.\w+$", post['filename']) is None:
filename = dlfname
else:
filename = f"{post['filename']}{post['ext']}"
dlurl = self.__url_file_link.format(board=self.board, \
filename=dlfname)
files.append(FileInfo(filename, f['fsize'], \
dlurl, f['md5'], 'md5'))
return files
def json_thread_url(self) -> str:
return "https://8kun.top/{board}/res/{thread}.json"
@property
def file_base_url(self) -> str:
return "https://media.8kun.top/file_dl/{filename}"

46
scrapthechan/parsers/fourchan.py

@ -1,51 +1,25 @@
from re import match
from typing import List, Optional
from typing import Optional
from scrapthechan.fileinfo import FileInfo
from scrapthechan.parser import Parser
from scrapthechan.parsers.tinyboardlike import TinyboardLikeParser
__all__ = ["FourChanParser"]
class FourChanParser(Parser):
class FourChanParser(TinyboardLikeParser):
"""JSON parser for 4chan.org image board."""
__url_thread_json = "https://a.4cdn.org/{board}/thread/{thread}.json"
__url_file_link = "https://i.4cdn.org/{board}/{filename}"
def __init__(self, board: str, thread: str,
skip_posts: Optional[int] = None) -> None:
posts = self._get_json(self.__url_thread_json.format(board=board, \
thread=thread))['posts']
super(FourChanParser, self).__init__(board, thread, posts, skip_posts)
super().__init__(board, thread, skip_posts)
@property
def imageboard(self) -> str:
return "4chan.org"
@property
def op(self) -> Optional[str]:
op = ""
if 'sub' in self._op_post:
op = f"{self._op_post['sub']}\n"
if 'com' in self._op_post:
op += self._op_post['com']
return op if not op == "" else None
def _parse_post(self, post: dict) -> List[FileInfo]:
if not 'tim' in post: return None
dlfname = f"{post['tim']}{post['ext']}"
if "filename" in post:
if match(r"^image\.\w+$", post['filename']) is None:
filename = dlfname
else:
filename = f"{post['filename']}{post['ext']}"
# Hash algorithm is hardcoded since it is highly unlikely that it will
# be changed in foreseeable future. And if it'll change then this line
# will be necessarily updated anyway.
return [FileInfo(filename, post['fsize'],
self.__url_file_link.format(board=self.board, filename=dlfname),
post['md5'], 'md5')]
def json_thread_url(self) -> str:
return "https://a.4cdn.org/{board}/thread/{thread}.json"
@property
def file_base_url(self) -> str:
return "https://i.4cdn.org/{board}/{filename}"

63
scrapthechan/parsers/lainchan.py

@ -1,66 +1,25 @@
from re import match
from typing import List, Optional
from typing import Optional
from scrapthechan.parser import Parser
from scrapthechan.fileinfo import FileInfo
from scrapthechan.parsers.tinyboardlike import TinyboardLikeParser
__all__ = ["LainchanParser"]
class LainchanParser(Parser):
"""JSON parser for lainchan.org image board.
JSON structure is identical to 4chan.org's, so this parser is just inherited
from 4chan.org's parser and only needed things are redefined.
"""
__url_thread_json = "https://lainchan.org/{board}/res/{thread}.json"
__url_file_link = "https://lainchan.org/{board}/src/{filename}"
class LainchanParser(TinyboardLikeParser):
"""JSON parser for lainchan.org image board."""
def __init__(self, board: str, thread: str,
skip_posts: Optional[int] = None) -> None:
posts = self._get_json(self.__url_thread_json.format(board=board, \
thread=thread))['posts']
super(LainchanParser, self).__init__(board, thread, posts, skip_posts)
super().__init__(board, thread, skip_posts)
@property
def imageboard(self) -> str:
return "lainchan.org"
@property
def op(self) -> Optional[str]:
op = ""
if 'sub' in self._op_post:
op = f"{self._op_post['sub']}\n"
if 'com' in self._op_post:
op += self._op_post['com']
return op if not op == "" else None
def _parse_post(self, post) -> List[FileInfo]:
if not 'tim' in post: return None
dlfname = f"{post['tim']}{post['ext']}"
if "filename" in post:
if match(r"^image\.\w+$", post['filename']) is None:
filename = dlfname
else:
filename = f"{post['filename']}{post['ext']}"
files = []
files.append(FileInfo(filename, post['fsize'],
self.__url_file_link.format(board=self.board, filename=dlfname),
post['md5'], 'md5'))
@property
def json_thread_url(self) -> str:
return "https://lainchan.org/{board}/res/{thread}.json"
if "extra_files" in post:
for f in post["extra_files"]:
dlfname = f"{f['tim']}{f['ext']}"
if "filename" in post:
if match(r"^image\.\w+$", post['filename']) is None:
filename = dlfname
else:
filename = f"{post['filename']}{post['ext']}"
dlurl = self.__url_file_link.format(board=self.board, \
filename=dlfname)
files.append(FileInfo(filename, f['fsize'], \
dlurl, f['md5'], 'md5'))
return files
@property
def file_base_url(self) -> str:
return "https://lainchan.org/{board}/src/{filename}"

38
scrapthechan/parsers/lolifox.py → scrapthechan/parsers/tinyboardlike.py

@ -4,37 +4,21 @@ from typing import List, Optional
from scrapthechan.parser import Parser
from scrapthechan.fileinfo import FileInfo
__all__ = ["LolifoxParser"]
__all__ = ["TinyboardLikeParser"]
class LolifoxParser(Parser):
"""JSON parser for lolifox.cc image board.
JSON structure is identical to lainchan.org.
"""
__url_thread_json = "https://lolifox.cc/{board}/res/{thread}.json"
__url_file_link = "https://lolifox.cc/{board}/src/{filename}"
class TinyboardLikeParser(Parser):
"""Base parser for imageboards that are based on Tinyboard, or have similar
JSON API."""
def __init__(self, board: str, thread: str,
skip_posts: Optional[int] = None) -> None:
posts = self._get_json(self.__url_thread_json.format(board=board, \
thread=thread))['posts']
super(LolifoxParser, self).__init__(board, thread, posts, skip_posts)
@property
def imageboard(self) -> str:
return "lolifox.cc"
super().__init__(board, thread, skip_posts)
@property
def op(self) -> Optional[str]:
op = ""
if 'sub' in self._op_post:
op = f"{self._op_post['sub']}\n"
if 'com' in self._op_post:
op += self._op_post['com']
return op if not op == "" else None
def _extract_posts_list(self, lst: List) -> List[dict]:
return lst['posts']
def _parse_post(self, post) -> List[FileInfo]:
def _parse_post(self, post: dict) -> Optional[List[FileInfo]]:
if not 'tim' in post: return None
dlfname = f"{post['tim']}{post['ext']}"
@ -46,8 +30,9 @@ class LolifoxParser(Parser):
filename = f"{post['filename']}{post['ext']}"
files = []
files.append(FileInfo(filename, post['fsize'],
self.__url_file_link.format(board=self.board, filename=dlfname),
self.file_base_url.format(board=self.board, filename=dlfname),
post['md5'], 'md5'))
if "extra_files" in post:
@ -58,8 +43,9 @@ class LolifoxParser(Parser):
filename = dlfname
else:
filename = f"{post['filename']}{post['ext']}"
dlurl = self.__url_file_link.format(board=self.board, \
dlurl = self.file_base_url.format(board=self.board, \
filename=dlfname)
files.append(FileInfo(filename, f['fsize'], \
dlurl, f['md5'], 'md5'))
return files
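
Review note: `TinyboardLikeParser` always formats `file_base_url` with both `board` and `filename`, although the 8kun template only contains `{filename}`. This works because `str.format` ignores keyword arguments that the template does not use. A small sketch with the templates from the diff; the helper `file_url` is hypothetical:

```python
# URL templates copied from the diff above.
FOURCHAN = "https://i.4cdn.org/{board}/{filename}"
EIGHTKUN = "https://media.8kun.top/file_dl/{filename}"


def file_url(template: str, board: str, filename: str) -> str:
    """Fill a file URL template; unused keyword arguments are ignored."""
    return template.format(board=board, filename=filename)
```

The reverse case is not forgiving: a template that names a placeholder the caller does not pass raises `KeyError`.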

31
scrapthechan/scraper.py

@ -5,8 +5,9 @@ from os import remove, stat
from os.path import exists, join, getsize
import re
from typing import List, Callable
from urllib.request import urlretrieve, URLopener, HTTPError
from urllib.request import urlretrieve, URLopener, HTTPError, URLError
import hashlib
from http.client import HTTPException
from scrapthechan import USER_AGENT
from scrapthechan.fileinfo import FileInfo
@ -66,6 +67,8 @@ class Scraper:
def _hash_file(self, filepath: str, hash_algorithm: str = "md5",
blocksize: int = 1048576) -> (str, str):
"""Compute hash of a file."""
if hash_algorithm is None:
return None
hash_func = hashlib.new(hash_algorithm)
with open(filepath, 'rb') as f:
buf = f.read(blocksize)
@ -82,8 +85,10 @@ class Scraper:
if not (f.size == computed_size \
or f.size == round(computed_size / 1024)):
return False
hexdig, dig = self._hash_file(filepath, f.hash_algorithm)
return f.hash_value == hexdig or f.hash_value == dig
if not f.hash_algorithm is None:
hexdig, dig = self._hash_file(filepath, f.hash_algorithm)
return f.hash_value == hexdig or f.hash_value == dig
return True
def _download_file(self, f: FileInfo):
"""Download a single file."""
@ -101,20 +106,32 @@ class Scraper:
while retries > 0:
self._url_opener.retrieve(f.download_url, filepath)
if not self._check_file(f, filepath):
print(filepath, f.size, f.hash_value)
remove(filepath)
retries -= 1
else:
break
if retries == 0:
print(f"Cannot retrieve {f.download_url}, {filepath}.")
return
if is_same_filename:
f1_hexdig, f1_dig = self._hash_file(orig_filepath, f.hash_algorithm)
f2_hexdig, f2_dig = self._hash_file(filepath, f.hash_algorithm)
if f1_hexdig == f2_hexdig or f1_dig == f2_dig:
_, f1_dig = self._hash_file(orig_filepath, f.hash_algorithm)
_, f2_dig = self._hash_file(filepath, f.hash_algorithm)
if f1_dig == f2_dig:
remove(filepath)
except FileNotFoundError as e:
print("File Not Found", filepath)
except HTTPError as e:
print("HTTP Error", e.code, e.reason, f.download_url)
if exists(filepath):
remove(filepath)
except HTTPException:
print("HTTP Exception for", f.download_url)
if exists(filepath):
remove(filepath)
except URLError as e:
print("URL Error for", f.download_url)
if exists(filepath):
remove(filepath)
except ConnectionResetError:
print("Connection reset for", f.download_url)
if exists(filepath):
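
Review note: with this change `_hash_file` returns a bare `None` when no algorithm is given, while the same-filename branch still unpacks its result into two names. That unpacking would raise `TypeError` for a file without a hash. The standalone sketch below returns a pair in every case. It assumes the second value is the base64 form of the digest. The diff does not show how the real method builds it, so treat that as an assumption.

```python
import hashlib
from base64 import b64encode
from typing import Optional, Tuple


def hash_file(filepath: str, hash_algorithm: Optional[str] = "md5",
        blocksize: int = 1048576) -> Tuple[Optional[str], Optional[str]]:
    """Return (hex digest, base64 digest), or (None, None) without an algorithm."""
    if hash_algorithm is None:
        # A pair keeps callers that unpack two values working.
        return None, None
    hash_func = hashlib.new(hash_algorithm)
    with open(filepath, "rb") as f:
        # Read in blocks so large files are never loaded whole.
        for buf in iter(lambda: f.read(blocksize), b""):
            hash_func.update(buf)
    # Assumption: the second value is the base64-encoded raw digest.
    return hash_func.hexdigest(), b64encode(hash_func.digest()).decode("ascii")
```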

39
scrapthechan/scrapers/threadedscraper.py

@ -7,25 +7,26 @@ from multiprocessing.pool import ThreadPool
from scrapthechan.scraper import Scraper
from scrapthechan.fileinfo import FileInfo
__all__ = ["ThreadedScraper"]
class ThreadedScraper(Scraper):
def __init__(self, save_directory: str, files: List[FileInfo],
download_progress_callback: Callable[[int], None] = None) -> None:
super(ThreadedScraper, self).__init__(save_directory, files,
download_progress_callback)
self._files_downloaded = 0
self._files_downloaded_mutex = Lock()
def run(self):
pool = ThreadPool(cpu_count() * 2)
pool.map(self._thread_run, self._files)
pool.close()
pool.join()
def _thread_run(self, f: FileInfo):
with self._files_downloaded_mutex:
self._files_downloaded += 1
if not self._progress_callback is None:
self._progress_callback(self._files_downloaded)
self._download_file(f)
def __init__(self, save_directory: str, files: List[FileInfo],
download_progress_callback: Callable[[int], None] = None) -> None:
super().__init__(save_directory, files, download_progress_callback)
self._files_downloaded = 0
self._files_downloaded_mutex = Lock()
def run(self):
pool = ThreadPool(cpu_count() * 2)
pool.map(self._thread_run, self._files)
pool.close()
pool.join()
def _thread_run(self, f: FileInfo):
if not self._progress_callback is None:
with self._files_downloaded_mutex:
self._files_downloaded += 1
self._progress_callback(self._files_downloaded)
self._download_file(f)
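
Review note: the new `_thread_run` takes the lock only when a progress callback is set, and it increments and reports inside the same critical section. The sketch below reproduces that pattern with the download step stubbed out. `CountingWorker` is a hypothetical class, not part of the package.

```python
from multiprocessing.pool import ThreadPool
from threading import Lock
from typing import Callable, List, Optional


class CountingWorker:
    """Hypothetical stand-in for ThreadedScraper without real downloads."""

    def __init__(self, items: List[str],
            progress_callback: Optional[Callable[[int], None]] = None) -> None:
        self._items = items
        self._progress_callback = progress_callback
        self._done = 0
        self._done_mutex = Lock()

    def run(self, threads: int = 4) -> None:
        # map() blocks until all items are processed; the context manager
        # then tears the pool down.
        with ThreadPool(threads) as pool:
            pool.map(self._thread_run, self._items)

    def _thread_run(self, item: str) -> None:
        if self._progress_callback is not None:
            with self._done_mutex:
                # Increment and report under one lock so each value is
                # reported exactly once.
                self._done += 1
                self._progress_callback(self._done)
        # A real worker would download `item` here.


seen: List[int] = []
CountingWorker([f"file{i}" for i in range(20)], seen.append).run()
```

One consequence of the pattern: the counter advances before the download happens and only when a callback is set, so it counts started files rather than finished ones.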

12
setup.cfg

@ -1,7 +1,7 @@
[metadata]
name = scrapthechan
version = attr: scrapthechan.__version__
description = Scrap the files posted in a thread on an imageboard.
description = Scrap the files from the imageboards.
long_description = file: README.md
long_description_content_type = text/markdown
author = Alexander "Arav" Andreev
@ -14,18 +14,20 @@ keywords =
2ch.hk
lainchan.org
8kun.top
lolifox.cc
license = MIT
license_file = COPYING
classifiers =
Development Status :: 2 - Pre-Alpha
Development Status :: 3 - Alpha
Environment :: Console
Intended Audience :: End Users/Desktop
License :: Other/Proprietary License
License :: OSI Approved :: MIT License
Natural Language :: English
Operating System :: OS Independent
Programming Language :: Python :: 3.7
Programming Language :: Python :: 3.8
Topic :: Communications :: BBS
Topic :: Internet :: WWW/HTTP
Topic :: Internet :: WWW/HTTP :: Dynamic Content :: Message Boards
Topic :: Text Processing
Topic :: Utilities
[options]
