From 63a549878e19e0d02bcb643c75977b086d6f3850 Mon Sep 17 00:00:00 2001 From: Oleg Valter Date: Sat, 24 Aug 2024 03:15:45 +0300 Subject: [PATCH 01/12] documented extractor's CLI options --- README.md | 112 ++++++++++++++++++++++++++++++------------------------ 1 file changed, 63 insertions(+), 49 deletions(-) diff --git a/README.md b/README.md index 930810a..8aaa437 100644 --- a/README.md +++ b/README.md @@ -8,7 +8,7 @@ This section contains background on why this project exists. If you know and/or don't care, feel free to skip to the next section. -In June 2023, Stack Exchange [briefly cancelled the data dump](https://meta.stackexchange.com/q/389922/332043), and backpedalled after a significant amount of backlash from the community. The status-quo of uploads to archive.org was restored. In the slightly more than a year between June 2023 and July 2024, it looked like they were staying off that path. Notably, they made a [logical shift to the _exact_ dates involved in the upload](https://meta.stackexchange.com/q/398279/) to deal with archive.org being slow. In December 2023, they [announced a delay in the upload](https://meta.stackexchange.com/q/395197/) likely to avoid speculation that another cancellation was happening. +In June 2023, Stack Exchange [briefly cancelled the data dump](https://meta.stackexchange.com/q/389922/332043), and backpedalled after a significant amount of backlash from the community. The status-quo of uploads to archive.org was restored. In the slightly more than a year between June 2023 and July 2024, it looked like they were staying off that path. Notably, they made a [logical shift to the _exact_ dates involved in the upload](https://meta.stackexchange.com/q/398279/) to deal with archive.org being slow. In December 2023, they [announced a delay in the upload](https://meta.stackexchange.com/q/395197/) likely to avoid speculation that another cancellation was happening. We appeared to be out of the woods. But this repo wouldn't exist if that was the case now, would it? @@ -20,20 +20,20 @@ The current revision can be read [here](https://meta.stackexchange.com/q/401324/ Here's what's happening: -* SE is moving the data dump from archive.org to their own infrastructure -* They're discontinuing the archive.org dump, which makes it significantly harder to archive the data if SE, for example, were to go out of business - * In addition to discontinuing the archive.org dump, they're imposing significant restrictions on the data dump -* They're doing the first revision **without the possibility to download the entire data dump** in one click, a drastic QOL reduction from the current situation. +- SE is moving the data dump from archive.org to their own infrastructure +- They're discontinuing the archive.org dump, which makes it significantly harder to archive the data if SE, for example, were to go out of business + - In addition to discontinuing the archive.org dump, they're imposing significant restrictions on the data dump +- They're doing the first revision **without the possibility to download the entire data dump** in one click, a drastic QOL reduction from the current situation. This is an opinionated summary of the reason why; SE wants to capitalise on AI companies that need training data, and have decided that the community doesn't matter in that process. The current process, while not nearly as restricting as rev. 
1 and 2, is a symptom of precisely one thing; Stack Exchange doesn't care about its users, but rather cares about finding new ways to profit off user data. **Stack Exchange, Inc. is now the single biggest threat to the community**, and to the platform's user-generated and [permissively-licensed content](https://stackoverflow.com/help/licensing) that the community has spent countless hours creating precisely _because_ the data is public. -That is why this project exists; this is meant to automate the data dump download process for non-commercial license-compliant use, since Stack Exchange, Inc. couldn't be bothered adding a "download all" button from day 1. +That is why this project exists; this is meant to automate the data dump download process for non-commercial license-compliant use, since Stack Exchange, Inc. couldn't be bothered adding a "download all" button from day 1. -As an added bonus, since this project already exists, there's an accompanying system to automatically convert the data dump to other formats. In my experience, the vast majority of applications building on the data dump do not work directly with the XML. Other, more convenient data formats are often created as an intermediate. Aside using it as an intermediate for various forms of analysis, there are a couple major examples of other distribution forms that are listed later in this README. +As an added bonus, since this project already exists, there's an accompanying system to automatically convert the data dump to other formats. In my experience, the vast majority of applications building on the data dump do not work directly with the XML. Other, more convenient data formats are often created as an intermediate. Aside using it as an intermediate for various forms of analysis, there are a couple major examples of other distribution forms that are listed later in this README. -While these are preprocessed distributions of the data dump, this project is also meant to help converting to these various formats. While unlikely to replace the source code for either of these two examples, I hope the transformer system here can get rid of boilerplate for other projects. +While these are preprocessed distributions of the data dump, this project is also meant to help converting to these various formats. While unlikely to replace the source code for either of these two examples, I hope the transformer system here can get rid of boilerplate for other projects. ## Known archives of new data dumps @@ -45,20 +45,18 @@ A [different project](https://communitydatadump.com/index.html) is currently mai This list contains converter tools that work on all sites and all tables. 
-| Maintainer | Format(s) | First-party torrent available | Converter | -| --- | --- | --- | --- | -| Maxwell175 | SQLite, Postgres, MSSQL | Partially[^2] | [AGPL-3.0](https://github.com/Maxwell175/StackExchangeDumpConverter) | - +| Maintainer | Format(s) | First-party torrent available | Converter | +| ---------- | ----------------------- | ----------------------------- | -------------------------------------------------------------------- | +| Maxwell175 | SQLite, Postgres, MSSQL | Partially[^2] | [AGPL-3.0](https://github.com/Maxwell175/StackExchangeDumpConverter) | ### Other data dump distributions and conversion tools For completeness (well, sort of, none of these lists are exhaustive), this is a list of incomplete archives (archives that limit the number of included tables and/or sites) -| Maintainer | Format | Torrent available | Converter | Site(s) | Tables | -| --- | --- | --- | --- | --- | --- | -| Brent Ozar | [MSSQL](https://www.brentozar.com/archive/2015/10/how-to-download-the-stack-overflow-database-via-bittorrent/) | Yes | [MIT-licensed](https://github.com/BrentOzarULTD/soddi) | Stack Overflow only | All tables | -| Jason Punyon | [SQLite](https://seqlite.puny.engineering/) | No | Closed-source[^1] | All sites | Posts only | - +| Maintainer | Format | Torrent available | Converter | Site(s) | Tables | +| ------------ | -------------------------------------------------------------------------------------------------------------- | ----------------- | ------------------------------------------------------ | ------------------- | ---------- | +| Brent Ozar | [MSSQL](https://www.brentozar.com/archive/2015/10/how-to-download-the-stack-overflow-database-via-bittorrent/) | Yes | [MIT-licensed](https://github.com/BrentOzarULTD/soddi) | Stack Overflow only | All tables | +| Jason Punyon | [SQLite](https://seqlite.puny.engineering/) | No | Closed-source[^1] | All sites | Posts only | ## Using the downloader @@ -66,18 +64,18 @@ Note that it's stongly encouraged that you use a venv. To set one up, run `pytho ### Requirements -* Python 3.10 or newer[^3] -* `pip3 install -r requirements.txt` -* Lots of storage. The 2024Q1 data dump was 92GB compressed. -* A display you can access somehow (physical or virtual, but you need to be able to see it) to be able to solve captchas -* Email and password login for Stack Exchange - Google, Facebook, GitHub, and other login methods are not supported, and will not be supported. - * If you don't have this, see [this meta question](https://meta.stackexchange.com/a/1847/332043) for instructions. -* Firefox installed - * Snap and flatpak users may run into problems; it's strongly recommended to have a non-snap/flatpak installation of Firefox and Geckodriver. - * Known errors: - * "The geckodriver version may not be compatible with the detected firefox version" - update Firefox and Geckodriver. If this still doesn't work, consider switching to a non-snap installation of Firefox and Geckodriver. - * "Your Firefox profile cannot be loaded" - One of Geckodriver or Firefox is Snap-based, while the other is not. [Consider switching to a non-snap installation](https://stackoverflow.com/a/72531719/6296561) of Firefox, or verifying that your PATH is set correctly. 
- If you need to manually install Geckodriver (which shouldn't normally be necessary; it's often bundled with Firefox in one way or another), the binaries are on [GitHub](https://github.com/mozilla/geckodriver/releases)
- Python 3.10 or newer[^3]
- `pip3 install -r requirements.txt`
- Lots of storage. The 2024Q1 data dump was 92GB compressed.
- A display you can access somehow (physical or virtual, but you need to be able to see it) to be able to solve captchas
- Email and password login for Stack Exchange - Google, Facebook, GitHub, and other login methods are not supported, and will not be supported.
  - If you don't have this, see [this meta question](https://meta.stackexchange.com/a/1847/332043) for instructions.
- Firefox installed
  - Snap and flatpak users may run into problems; it's strongly recommended to have a non-snap/flatpak installation of Firefox and Geckodriver.
  - Known errors:
    - "The geckodriver version may not be compatible with the detected firefox version" - update Firefox and Geckodriver. If this still doesn't work, consider switching to a non-snap installation of Firefox and Geckodriver.
    - "Your Firefox profile cannot be loaded" - One of Geckodriver or Firefox is Snap-based, while the other is not. [Consider switching to a non-snap installation](https://stackoverflow.com/a/72531719/6296561) of Firefox, or verifying that your PATH is set correctly.
  - If you need to manually install Geckodriver (which shouldn't normally be necessary; it's often bundled with Firefox in one way or another), the binaries are on [GitHub](https://github.com/mozilla/geckodriver/releases)

The downloader does **not** support Docker due to the display requirement.

1. Make sure you have all the requirements from the Requirements section.
2. Copy `config.example.json` to `config.json`
3. Open `config.json`, and edit in the values. The values are described within the JSON file itself.
4. Run the extractor with `python3 -m sedd`. If you're on Windows, you may need to run `python -m sedd` instead.

#### CLI options

The extractor CLI supports the following configuration options:

| Short | Long          | Type     | Default      | Description                                                                  |
| ----- | ------------- | -------- | ------------ | ---------------------------------------------------------------------------- |
| `-o`  | `--outputDir` | Optional | `/downloads` | Specifies the directory to download the archives to.                        |
| -     | `--dry-run`   | Optional | -            | If set, does not download any archives; only traverses the network's sites. |

#### Captchas and other misc. barriers

This software is designed around Selenium, a browser automation tool. This does, however, mean that the program can be stopped by various bot defenses. This would happen even if you downloaded all the [~183 data dumps](https://stackexchange.com/sites#questionsperday) fully by hand, because it's a _lot_ of repeated operations.
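Since a full run takes hours and can be interrupted by captchas at any point, it's usually started once and then left alone. For reference, a typical invocation combining the options documented above might look like the following sketch (the output path is only an example; the flags are the ones from the CLI options table):

```bash
# Walk the network's sites without downloading anything (dry run)
python3 -m sedd --dry-run

# Real run: download every site's archives into ./downloads
python3 -m sedd -o downloads
```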
This is where notification systems come in; expecting you to sit and watch for potentially a significant number of hours is not a good use of time. If anything happens, you'll be notified, so you don't have to continuously watch the program. Currently, only a native desktop notifier is supported, but support for other notifiers may be added in the future. @@ -100,17 +107,17 @@ This is where notification systems come in; expecting you to sit and watch for p As of Q1 2024, the data dump was a casual 93GB in compressed size. If you have your own system to transform the data dump after downloading, you only need to worry about the raw size of the data dump. -However, if you use the built-in transformer pipeline, you'll need to expect a _lot_ more data use. +However, if you use the built-in transformer pipeline, you'll need to expect a _lot_ more data use. The output, by default, is compressed back into 7z if dealing with a file-based transformer. Due to this, an intermediate file write is performed prior to compressing back into a .7z. At runtime, you need: -* The compressed data dump; at least 92GB and increasing with each dump -* The compressed converted data dump; depending on compression rates for the specific format, this anywhere from a little less than the original size to significantly larger -* A significant amount of space for intermediate files. While these will be deleted as soon as they're done and compressed, they'll take up a significant amount of space on the disk in the meanwhile +- The compressed data dump; at least 92GB and increasing with each dump +- The compressed converted data dump; depending on compression rates for the specific format, this anywhere from a little less than the original size to significantly larger +- A significant amount of space for intermediate files. While these will be deleted as soon as they're done and compressed, they'll take up a significant amount of space on the disk in the meanwhile Note that the transformer pipeline is executed separately; see the transformer section below. -#### Execution time +#### Execution time One of the major downsides with the way this project functions is that it's subject to Cloudflare bullshit. This means that the total time to download is `(combined size of data dumps) / (internet speed) + (rate limiting) + (navigation overhead) + (time to solve captchas)`. While navigation overhead and rate limiting (hopefully) doesn't account for a significant share of time, it can potentially be significant. It's certainly a slower option than archive.org's torrent. @@ -122,7 +129,8 @@ Once you've downloaded the data dumps, you may want to transform it into a more This section assumes you have Docker installed, with [docker-compose-v2](https://docs.docker.com/compose/migrate/). -From the root directory, run +From the root directory, run + ```bash docker compose up ``` @@ -130,10 +138,12 @@ docker compose up This automatically binds `downloads` and `out` in the current working directory to the docker container. If you want to change these paths, you'll need to edit `docker-compose.yml` manually for now. Additionally, the following environment variables are defined and forwarded to the build: -* `SEDD_OUTPUT_TYPE`: Any output type supported by the program. These are: `json`, `sqlite`. -* `SPDLOG_LEVEL`: Sets the logging level. Usually not necessary unless you want verbose output, or you're trying to debug something. + +- `SEDD_OUTPUT_TYPE`: Any output type supported by the program. These are: `json`, `sqlite`. 
+- `SPDLOG_LEVEL`: Sets the logging level. Usually not necessary unless you want verbose output, or you're trying to debug something. If you have a UNIX shell (i.e. not cmd or powershell; Windows users can use Git Bash), you can run + ```bash SEDD_OUTPUT_TYPE=sqlite docker compose up ``` @@ -145,20 +155,23 @@ If you insist on using cmd or PowerShell instead of a good shell, setting the va ### Native #### Requirements -* C++20 compiler -* CMake 3.10 or newer -* Linux-specific (TEMPORARY): `libtbb-dev`, or equivalent on your favourite distro. Optional, but required for multithreaded support under libstdc++ + +- C++20 compiler +- CMake 3.10 or newer +- Linux-specific (TEMPORARY): `libtbb-dev`, or equivalent on your favourite distro. Optional, but required for multithreaded support under libstdc++ Other dependencies (stc, libarchive, spdlog, and pugixml) are automatically handled by CMake using FetchContent. Unlike the downloader, this component can run without a display. #### Running + TL;DR: + ```bash cd transformer mkdir build cd build # Option 1: debug: -cmake .. -DCMAKE_BUILD_TYPE=Debug +cmake .. -DCMAKE_BUILD_TYPE=Debug # Option 2: release mode; strongly recommended for anything that needs the performance: cmake .. -DCMAKE_BUILD_TYPE=Release # --- @@ -166,7 +179,7 @@ cmake .. -DCMAKE_BUILD_TYPE=Release cmake --build . -j 8 # Note: this only works after running the Python downloader -# For early testing, I've been populating this folder with +# For early testing, I've been populating this folder with # files from the old archive.org data dump. # The last argument is the path to the downloaded data # *UNIX: @@ -180,9 +193,10 @@ Pass `--help` to see the available formatters for your current version of the da ### Supported transformers Currently, the following transformers are supported: -* `json` -* `sqlite` - * Note: All data related to a site is merged into a single database + +- `json` +- `sqlite` + - Note: All data related to a site is merged into a single database ## Language rationale @@ -192,7 +206,7 @@ C++ does not really support Selenium, which is effectively a requirement for the Python, on the other hand, infuriatingly doesn't support 7z streaming, at least not in a convenient format. There's the `libarchive` package, but it refuses to build. `python-libarchive` allegedly does, but [Windows support is flaky](https://github.com/smartfile/python-libarchive/issues/38), so the transformer might've had to be separated from the downloader anyway. There's py7zr, which does work everywhere, but it [doesn't support 7z streaming](https://github.com/miurahr/py7zr/issues/579). -7z and XML streaming are both _critical_ for the processing pipeline. If you plan to convert the entire data dump, you'll eventually run into `stackoverflow.com-PostHistory.7z`, which is 39GB compressed, and **181GB uncompressed** in the 2024 Q1 data dump. As time passes, this will likely continue to grow, and the absurd amounts of RAM required to just tank the full size [is barely supported on modern and _very_ high-end hardware](https://www.reddit.com/r/buildapc/comments/17hqk3k/what_happened_to_256gb_ram_capacity_motherboards/). Finding someone able to tank that is going to be difficult for the vast majority of people. +7z and XML streaming are both _critical_ for the processing pipeline. If you plan to convert the entire data dump, you'll eventually run into `stackoverflow.com-PostHistory.7z`, which is 39GB compressed, and **181GB uncompressed** in the 2024 Q1 data dump. 
As time passes, this will likely continue to grow, and the absurd amounts of RAM required to just tank the full size [are barely supported on modern and _very_ high-end hardware](https://www.reddit.com/r/buildapc/comments/17hqk3k/what_happened_to_256gb_ram_capacity_motherboards/). Finding someone able to tank that is going to be difficult for the vast majority of people.

Consequently, direct `libarchive` support is beneficial, and rather than writing an entire new Python wrapper (or taking over an existing one), it's easier to just write that part in C++. Also, since it might be easier to run this particular part in a Docker container to avoid downloading build tools on certain systems, having it be fully headless is an advantage.

On the bright side, this should mean faster processing compared to Python.

## License

The code is under the MIT license; see the `LICENSE` file.

The data downloaded and produced is under various versions of [CC-By-SA](https://stackoverflow.com/help/licensing), as per Stack Exchange's licensing rules, in addition to whatever extra rules they try to impose on the data dump.

From d5a07118aa28196c3898aec0caa32fdd657cf5f7 Mon Sep 17 00:00:00 2001
From: Oleg Valter
Date: Sat, 24 Aug 2024 03:24:49 +0300
Subject: [PATCH 02/12] added --skip-loaded extractor CLI option & documented it

---
 README.md    |  9 +++---
 sedd/main.py | 84 +++++++++++++++++++++++++++++++++++++++-------------
 2 files changed, 69 insertions(+), 24 deletions(-)

diff --git a/README.md b/README.md
index 8aaa437..5d46602 100644
--- a/README.md
+++ b/README.md
@@ -92,10 +92,11 @@ The downloader does **not** support Docker due to the display requirement.

The extractor CLI supports the following configuration options:

| Short | Long            | Type     | Default      | Description                                                                                                                                                   |
| ----- | --------------- | -------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `-o`  | `--outputDir`   | Optional | `/downloads` | Specifies the directory to download the archives to.                                                                                                         |
| `-s`  | `--skip-loaded` | Optional | -            | If set, skips archives that have already been downloaded. An archive counts as downloaded if a non-empty file for it already exists in the output directory. |
| -     | `--dry-run`     | Optional | -            | If set, does not download any archives; only traverses the network's sites.                                                                                  |

#### Captchas and other misc. 
barriers diff --git a/sedd/main.py b/sedd/main.py index 16050db..ac33e3f 100644 --- a/sedd/main.py +++ b/sedd/main.py @@ -21,7 +21,13 @@ prog="sedd", description="Automatic (unofficial) SE data dump downloader for the anti-community data dump format", ) - +parser.add_argument( + "-s", "--skip-loaded", + required=False, + default=False, + action="store_true", + dest="skip_loaded" +) parser.add_argument( "-o", "--outputDir", required=False, @@ -38,6 +44,7 @@ args = parser.parse_args() + def get_download_dir(): download_dir = args.output_dir @@ -48,15 +55,17 @@ def get_download_dir(): return download_dir + options = Options() options.enable_downloads = True options.set_preference("browser.download.folderList", 2) options.set_preference("browser.download.manager.showWhenStarting", False) options.set_preference("browser.download.dir", get_download_dir()) -options.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/x-gzip") +options.set_preference( + "browser.helperApps.neverAsk.saveToDisk", "application/x-gzip") browser = webdriver.Firefox( - options = options + options=options ) if not os.path.exists("ubo.xpi"): print("Downloading uBO") @@ -74,11 +83,14 @@ def get_download_dir(): email = config["email"] password = config["password"] + def kill_cookie_shit(browser: WebDriver): sleep(3) - browser.execute_script("""let elem = document.getElementById("onetrust-banner-sdk"); if (elem) { elem.parentNode.removeChild(elem); }""") + browser.execute_script( + """let elem = document.getElementById("onetrust-banner-sdk"); if (elem) { elem.parentNode.removeChild(elem); }""") sleep(1) + def is_logged_in(browser: WebDriver, site: str): url = f"{site}/users/current" browser.get(url) @@ -86,6 +98,7 @@ def is_logged_in(browser: WebDriver, site: str): return "/users/" in browser.current_url + def login_or_create(browser: WebDriver, site: str): if is_logged_in(browser, site): print("Already logged in") @@ -125,7 +138,22 @@ def login_or_create(browser: WebDriver, site: str): break -def download_data_dump(browser: WebDriver, site: str, etags: Dict[str, str]): +def is_file_downloaded(site_or_url: str): + file_name = f"{re.sub(r'https://', '', site_or_url)}.7z" + + file_name = re.sub(r'^alcohol', 'beer', file_name) + file_name = re.sub(r'^mattermodeling', 'materials', file_name) + file_name = re.sub(r'^communitybuilding', 'moderators', file_name) + file_name = re.sub(r'^medicalsciences', 'health', file_name) + file_name = re.sub(r'^psychology', 'cogsci', file_name) + file_name = re.sub(r'^writing', 'writers', file_name) + file_name = re.sub(r'^video', 'avp', file_name) + file_name = re.sub(r'^meta\.(es|ja|pt|ru)\.', r'\1.meta.', file_name) + + return os.path.isfile(os.path.join(args.output_dir, file_name)) + + +def download_data_dump(browser: WebDriver, site: str, meta_url: str, etags: Dict[str, str]): print(f"Downloading data dump from {site}") def _exec_download(browser: WebDriver): @@ -168,30 +196,46 @@ def _exec_download(browser: WebDriver): url = browser.execute_script("return window.extractedUrl;") utils.extract_etag(url, etags) - sleep(5); + sleep(5) + main_loaded = is_file_downloaded(site) + meta_loaded = is_file_downloaded(meta_url) - browser.get(f"{site}/users/data-dump-access/current") - _exec_download(browser) + if not args.skip_loaded or not main_loaded or not meta_loaded: + if args.skip_loaded and main_loaded: + print(f"Already downloaded main for site {site}") + else: + browser.get(f"{site}/users/data-dump-access/current") + _exec_download(browser) + + if args.skip_loaded and 
meta_loaded: + print(f"Already downloaded meta for site {site}") + else: + print(meta_url) + browser.get(f"{meta_url}/users/data-dump-access/current") + _exec_download(browser) - if site not in ["https://meta.stackexchange.com", "https://stackapps.com"]: - # https://regex101.com/r/kG6nTN/1 - meta_url = re.sub(r"(https://(?:[^.]+\.(?=stackexchange))?)", r"\1meta.", site) - print(meta_url) - browser.get(f"{meta_url}/users/data-dump-access/current") - _exec_download(browser) etags: Dict[str, str] = {} for site in sites.sites: print(f"Extracting from {site}...") - login_or_create(browser, site) - download_data_dump( - browser, - site, - etags - ) + if site not in ["https://meta.stackexchange.com", "https://stackapps.com"]: + # https://regex101.com/r/kG6nTN/1 + meta_url = re.sub( + r"(https://(?:[^.]+\.(?=stackexchange))?)", r"\1meta.", site) + + if args.skip_loaded and is_file_downloaded(site) and is_file_downloaded(meta_url): + print(f"Already downloaded main & meta for site {site}") + else: + login_or_create(browser, site) + download_data_dump( + browser, + site, + meta_url, + etags + ) # TODO: replace with validation once downloading is verified done # (or export for separate, later verification) From 2975a3637c67ebdaca5368b54201fff94b8606f7 Mon Sep 17 00:00:00 2001 From: Oleg Valter Date: Sat, 24 Aug 2024 03:28:13 +0300 Subject: [PATCH 03/12] don't leave Firefox dangling after successful run --- sedd/main.py | 2 ++ 1 file changed, 2 insertions(+) diff --git a/sedd/main.py b/sedd/main.py index ac33e3f..ad6694b 100644 --- a/sedd/main.py +++ b/sedd/main.py @@ -241,3 +241,5 @@ def _exec_download(browser: WebDriver): # (or export for separate, later verification) # Though keeping it here, removing files and re-running downloads feels like a better idea print(etags) + +browser.quit() \ No newline at end of file From b2a3fde0630f83bd796ffa74fd42b843756f79e8 Mon Sep 17 00:00:00 2001 From: Oleg Valter Date: Sat, 24 Aug 2024 03:50:25 +0300 Subject: [PATCH 04/12] always quit the browser on error --- sedd/main.py | 37 +++++++++++++++++++------------------ 1 file changed, 19 insertions(+), 18 deletions(-) diff --git a/sedd/main.py b/sedd/main.py index ad6694b..9332c0e 100644 --- a/sedd/main.py +++ b/sedd/main.py @@ -218,28 +218,29 @@ def _exec_download(browser: WebDriver): etags: Dict[str, str] = {} -for site in sites.sites: - print(f"Extracting from {site}...") +try: + for site in sites.sites: + print(f"Extracting from {site}...") - if site not in ["https://meta.stackexchange.com", "https://stackapps.com"]: - # https://regex101.com/r/kG6nTN/1 - meta_url = re.sub( - r"(https://(?:[^.]+\.(?=stackexchange))?)", r"\1meta.", site) + if site not in ["https://meta.stackexchange.com", "https://stackapps.com"]: + # https://regex101.com/r/kG6nTN/1 + meta_url = re.sub( + r"(https://(?:[^.]+\.(?=stackexchange))?)", r"\1meta.", site) - if args.skip_loaded and is_file_downloaded(site) and is_file_downloaded(meta_url): - print(f"Already downloaded main & meta for site {site}") - else: - login_or_create(browser, site) - download_data_dump( - browser, - site, - meta_url, - etags - ) + if args.skip_loaded and is_file_downloaded(site) and is_file_downloaded(meta_url): + print(f"Already downloaded main & meta for site {site}") + else: + login_or_create(browser, site) + download_data_dump( + browser, + site, + meta_url, + etags + ) +finally: + browser.quit() # TODO: replace with validation once downloading is verified done # (or export for separate, later verification) # Though keeping it here, removing files and 
re-running downloads feels like a better idea print(etags) - -browser.quit() \ No newline at end of file From cdf517f8fe9eea628037aa10751ea24530699fd6 Mon Sep 17 00:00:00 2001 From: Oleg Valter Date: Sat, 24 Aug 2024 04:00:48 +0300 Subject: [PATCH 05/12] added file size check when deciding whether to redownload --- sedd/main.py | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/sedd/main.py b/sedd/main.py index 9332c0e..6efe2f8 100644 --- a/sedd/main.py +++ b/sedd/main.py @@ -138,6 +138,14 @@ def login_or_create(browser: WebDriver, site: str): break +def check_file(file_name: str): + try: + res = os.stat(os.path.join(args.output_dir, file_name)) + return res.st_size > 0 + except FileNotFoundError: + return False + + def is_file_downloaded(site_or_url: str): file_name = f"{re.sub(r'https://', '', site_or_url)}.7z" @@ -150,7 +158,7 @@ def is_file_downloaded(site_or_url: str): file_name = re.sub(r'^video', 'avp', file_name) file_name = re.sub(r'^meta\.(es|ja|pt|ru)\.', r'\1.meta.', file_name) - return os.path.isfile(os.path.join(args.output_dir, file_name)) + return check_file(file_name) def download_data_dump(browser: WebDriver, site: str, meta_url: str, etags: Dict[str, str]): From 5f371413b35742addb95da49c0abb78743f62006 Mon Sep 17 00:00:00 2001 From: Oleg Valter Date: Sat, 24 Aug 2024 04:25:33 +0300 Subject: [PATCH 06/12] added watchdog as a requirement --- requirements.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/requirements.txt b/requirements.txt index c429db6..eca96dc 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,2 +1,3 @@ selenium==4.23.1 desktop-notifier==5.0.1 +watchdog==0.8.2 \ No newline at end of file From ec8de6edbbdc84ff6fe75bd047b31dd4f467c0c5 Mon Sep 17 00:00:00 2001 From: Oleg Valter Date: Sat, 24 Aug 2024 06:32:39 +0300 Subject: [PATCH 07/12] added file-related utils --- sedd/utils.py | 38 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 38 insertions(+) diff --git a/sedd/utils.py b/sedd/utils.py index 83a17f0..ba4bd5b 100644 --- a/sedd/utils.py +++ b/sedd/utils.py @@ -2,6 +2,8 @@ import requests as r from urllib.parse import urlparse import os.path +import re + def extract_etag(url: str, etags: Dict[str, str]): res = r.get( @@ -21,3 +23,39 @@ def extract_etag(url: str, etags: Dict[str, str]): etags[filename] = etag print(f"ETag for {filename}: {etag}") + + +def get_file_name(site_or_url: str): + file_name = f"{re.sub(r'https://', '', site_or_url)}.7z" + + file_name = re.sub(r'^alcohol', 'beer', file_name) + file_name = re.sub(r'^mattermodeling', 'materials', file_name) + file_name = re.sub(r'^communitybuilding', 'moderators', file_name) + file_name = re.sub(r'^medicalsciences', 'health', file_name) + file_name = re.sub(r'^psychology', 'cogsci', file_name) + file_name = re.sub(r'^writing', 'writers', file_name) + file_name = re.sub(r'^video', 'avp', file_name) + file_name = re.sub(r'^meta\.(es|ja|pt|ru)\.', r'\1.meta.', file_name) + + return file_name + + +def check_file(base_path: str, file_name: str): + try: + res = os.stat(os.path.join(base_path, file_name)) + return res.st_size > 0 + except FileNotFoundError: + return False + + +def remove_old_file(base_path: str, site_or_url: str): + try: + file_name = get_file_name(site_or_url) + os.remove(os.path.join(base_path, file_name)) + except FileNotFoundError: + pass + + +def is_file_downloaded(base_path: str, site_or_url: str): + file_name = get_file_name(site_or_url) + return check_file(base_path, file_name) From 8da911fb06287ec56e64a61135fdefa0596f2a78 Mon Sep 17 
00:00:00 2001 From: Oleg Valter Date: Sat, 24 Aug 2024 08:39:01 +0300 Subject: [PATCH 08/12] added watcher for pending downloads --- sedd/data/files_map.py | 24 ++++++++++++ sedd/main.py | 80 ++++++++++++++++++++++------------------ sedd/utils.py | 38 ++++++++++++------- sedd/watcher/__init__.py | 0 sedd/watcher/handler.py | 50 +++++++++++++++++++++++++ sedd/watcher/observer.py | 28 ++++++++++++++ sedd/watcher/state.py | 21 +++++++++++ 7 files changed, 191 insertions(+), 50 deletions(-) create mode 100644 sedd/data/files_map.py create mode 100644 sedd/watcher/__init__.py create mode 100644 sedd/watcher/handler.py create mode 100644 sedd/watcher/observer.py create mode 100644 sedd/watcher/state.py diff --git a/sedd/data/files_map.py b/sedd/data/files_map.py new file mode 100644 index 0000000..1ae2843 --- /dev/null +++ b/sedd/data/files_map.py @@ -0,0 +1,24 @@ +# for most sites, dump names correspond to domain name. +# However, due to some subdomain shenanigans, a couple of sites differ: +files_map: dict[str, str] = { + 'alcohol.stackexchange.com': 'beer.stackexchange.com', + 'alcohol.meta.stackexchange.com': 'beer.meta.stackexchange.com', + 'mattermodeling.stackexchange.com': 'materials.stackexchange.com', + 'mattermodeling.meta.stackexchange.com': 'materials.meta.stackexchange.com', + 'communitybuilding.stackexchange.com': 'moderators.stackexchange.com', + 'communitybuilding.meta.stackexchange.com': 'moderators.meta.stackexchange.com', + 'medicalsciences.stackexchange.com': 'health.stackexchange.com', + 'medicalsciences.meta.stackexchange.com': 'health.meta.stackexchange.com', + 'psychology.stackexchange.com': 'cogsci.stackexchange.com', + 'psychology.meta.stackexchange.com': 'cogsci.meta.stackexchange.com', + 'writing.stackexchange.com': 'writers.stackexchange.com', + 'writing.meta.stackexchange.com': 'writers.meta.stackexchange.com', + 'video.stackexchange.com': 'avp.stackexchange.com', + 'video.meta.stackexchange.com': 'avp.meta.stackexchange.com', + 'meta.es.stackoverflow.com': 'es.meta.stackoverflow.com', + 'meta.ja.stackoverflow.com': 'ja.meta.stackoverflow.com', + 'meta.pt.stackoverflow.com': 'pt.meta.stackoverflow.com', + 'meta.ru.stackoverflow.com': 'ru.meta.stackoverflow.com', +} + +inverse_files_map: dict[str, str] = {v: k for k, v in files_map.items()} diff --git a/sedd/main.py b/sedd/main.py index 6efe2f8..c01c911 100644 --- a/sedd/main.py +++ b/sedd/main.py @@ -5,6 +5,8 @@ from selenium.common.exceptions import NoSuchElementException from typing import Dict +from .watcher.observer import register_pending_downloads_observer, Observer + from sedd.data import sites from time import sleep import json @@ -13,6 +15,8 @@ from .meta import notifications import re import os +import sys +from traceback import print_exception import argparse from . 
import utils @@ -138,29 +142,6 @@ def login_or_create(browser: WebDriver, site: str): break -def check_file(file_name: str): - try: - res = os.stat(os.path.join(args.output_dir, file_name)) - return res.st_size > 0 - except FileNotFoundError: - return False - - -def is_file_downloaded(site_or_url: str): - file_name = f"{re.sub(r'https://', '', site_or_url)}.7z" - - file_name = re.sub(r'^alcohol', 'beer', file_name) - file_name = re.sub(r'^mattermodeling', 'materials', file_name) - file_name = re.sub(r'^communitybuilding', 'moderators', file_name) - file_name = re.sub(r'^medicalsciences', 'health', file_name) - file_name = re.sub(r'^psychology', 'cogsci', file_name) - file_name = re.sub(r'^writing', 'writers', file_name) - file_name = re.sub(r'^video', 'avp', file_name) - file_name = re.sub(r'^meta\.(es|ja|pt|ru)\.', r'\1.meta.', file_name) - - return check_file(file_name) - - def download_data_dump(browser: WebDriver, site: str, meta_url: str, etags: Dict[str, str]): print(f"Downloading data dump from {site}") @@ -206,27 +187,30 @@ def _exec_download(browser: WebDriver): sleep(5) - main_loaded = is_file_downloaded(site) - meta_loaded = is_file_downloaded(meta_url) + main_loaded = utils.is_file_downloaded(args.output_dir, site) + meta_loaded = utils.is_file_downloaded(args.output_dir, meta_url) if not args.skip_loaded or not main_loaded or not meta_loaded: if args.skip_loaded and main_loaded: - print(f"Already downloaded main for site {site}") + pass else: browser.get(f"{site}/users/data-dump-access/current") + utils.remove_old_file(args.output_dir, site) _exec_download(browser) if args.skip_loaded and meta_loaded: - print(f"Already downloaded meta for site {site}") + pass else: - print(meta_url) browser.get(f"{meta_url}/users/data-dump-access/current") + utils.remove_old_file(args.output_dir, meta_url) _exec_download(browser) etags: Dict[str, str] = {} try: + state, observer = register_pending_downloads_observer(args.output_dir) + for site in sites.sites: print(f"Extracting from {site}...") @@ -235,8 +219,11 @@ def _exec_download(browser: WebDriver): meta_url = re.sub( r"(https://(?:[^.]+\.(?=stackexchange))?)", r"\1meta.", site) - if args.skip_loaded and is_file_downloaded(site) and is_file_downloaded(meta_url): - print(f"Already downloaded main & meta for site {site}") + main_loaded = utils.is_file_downloaded(args.output_dir, site) + meta_loaded = utils.is_file_downloaded(args.output_dir, meta_url) + + if args.skip_loaded and main_loaded and meta_loaded: + pass else: login_or_create(browser, site) download_data_dump( @@ -245,10 +232,31 @@ def _exec_download(browser: WebDriver): meta_url, etags ) -finally: - browser.quit() -# TODO: replace with validation once downloading is verified done -# (or export for separate, later verification) -# Though keeping it here, removing files and re-running downloads feels like a better idea -print(etags) + if observer: + pending = state.size() + + print(f"Waiting for {pending} download{'s'[:pending^1]} to complete") + + while True: + if state.empty(): + observer.stop() + browser.quit() + break + else: + sleep(1) + +except: + exception = sys.exc_info() + + try: + print_exception(exception) + except: + print(exception) + + browser.quit() +finally: + # TODO: replace with validation once downloading is verified done + # (or export for separate, later verification) + # Though keeping it here, removing files and re-running downloads feels like a better idea + print(etags) diff --git a/sedd/utils.py b/sedd/utils.py index ba4bd5b..7c0198c 100644 --- 
a/sedd/utils.py +++ b/sedd/utils.py @@ -4,6 +4,9 @@ import os.path import re +from .data.files_map import files_map, inverse_files_map +from .data.sites import sites + def extract_etag(url: str, etags: Dict[str, str]): res = r.get( @@ -25,22 +28,29 @@ def extract_etag(url: str, etags: Dict[str, str]): print(f"ETag for {filename}: {etag}") -def get_file_name(site_or_url: str): - file_name = f"{re.sub(r'https://', '', site_or_url)}.7z" +def get_file_name(site_or_url: str) -> str: + domain = re.sub(r'https://', '', site_or_url) + + try: + file_name = files_map[domain] + return f'{file_name}.7z' + except KeyError: + return f'{domain}.7z' + + +def is_dump_file(file_name: str) -> bool: + file_name = re.sub(r'\.7z$', '', file_name) - file_name = re.sub(r'^alcohol', 'beer', file_name) - file_name = re.sub(r'^mattermodeling', 'materials', file_name) - file_name = re.sub(r'^communitybuilding', 'moderators', file_name) - file_name = re.sub(r'^medicalsciences', 'health', file_name) - file_name = re.sub(r'^psychology', 'cogsci', file_name) - file_name = re.sub(r'^writing', 'writers', file_name) - file_name = re.sub(r'^video', 'avp', file_name) - file_name = re.sub(r'^meta\.(es|ja|pt|ru)\.', r'\1.meta.', file_name) + try: + inverse_files_map[file_name] + except KeyError: + origin = f'https://{file_name}' + return origin in sites - return file_name + return True -def check_file(base_path: str, file_name: str): +def check_file(base_path: str, file_name: str) -> bool: try: res = os.stat(os.path.join(base_path, file_name)) return res.st_size > 0 @@ -48,7 +58,7 @@ def check_file(base_path: str, file_name: str): return False -def remove_old_file(base_path: str, site_or_url: str): +def remove_old_file(base_path: str, site_or_url: str) -> None: try: file_name = get_file_name(site_or_url) os.remove(os.path.join(base_path, file_name)) @@ -56,6 +66,6 @@ def remove_old_file(base_path: str, site_or_url: str): pass -def is_file_downloaded(base_path: str, site_or_url: str): +def is_file_downloaded(base_path: str, site_or_url: str) -> bool: file_name = get_file_name(site_or_url) return check_file(base_path, file_name) diff --git a/sedd/watcher/__init__.py b/sedd/watcher/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/sedd/watcher/handler.py b/sedd/watcher/handler.py new file mode 100644 index 0000000..ccace45 --- /dev/null +++ b/sedd/watcher/handler.py @@ -0,0 +1,50 @@ +from .state import DownloadState +from ..utils import is_dump_file + +import os + +# watchdog has trouble with new Python versions not having MutableSet +import collections +from sys import version_info + +if version_info.major == 3 and version_info.minor >= 10: + from collections.abc import MutableSet + collections.MutableSet = collections.abc.MutableSet +else: + from collections import MutableSet + +from watchdog.observers import Observer +from watchdog.events import FileSystemEventHandler + + +class CleanupHandler(FileSystemEventHandler): + download_state: DownloadState + observer: Observer + + def __init__(self, observer: Observer, state: DownloadState): + super() + + self.download_state = state + self.observer = observer + + def on_created(self, event): + file_name = os.path.basename(event.src_path) + + # # we can safely ignore part file creations + if file_name.endswith('.part'): + return + + if is_dump_file(file_name): + print(f"Download started: {file_name}") + self.download_state.add(file_name) + + def on_moved(self, event): + file_name: str = os.path.basename(event.dest_path) + + # we can safely ignore part file removals + 
if file_name.endswith('.part'): + return + + if is_dump_file(file_name): + print(f"Download finished: {file_name}") + self.download_state.remove(file_name) diff --git a/sedd/watcher/observer.py b/sedd/watcher/observer.py new file mode 100644 index 0000000..f34b861 --- /dev/null +++ b/sedd/watcher/observer.py @@ -0,0 +1,28 @@ +from threading import current_thread, main_thread + +from .handler import CleanupHandler +from .state import DownloadState + +# watchdog has trouble with new Python versions not having MutableSet +import collections +from sys import version_info + +if version_info.major == 3 and version_info.minor >= 10: + from collections.abc import MutableSet + collections.MutableSet = collections.abc.MutableSet +else: + from collections import MutableSet + +from watchdog.observers import Observer + + +def register_pending_downloads_observer(output_dir: str): + if current_thread() is main_thread(): + observer = Observer() + state = DownloadState() + handler = CleanupHandler(observer, state) + + observer.schedule(handler, output_dir, recursive=True) + observer.start() + + return state, observer diff --git a/sedd/watcher/state.py b/sedd/watcher/state.py new file mode 100644 index 0000000..53a49ea --- /dev/null +++ b/sedd/watcher/state.py @@ -0,0 +1,21 @@ +from typing import Set + + +class DownloadState: + # list of filenames pending download + pending: Set[str] = set() + + def size(self): + return len(self.pending) + + def empty(self): + return self.size() == 0 + + def add(self, file: str): + self.pending.add(file) + + def remove(self, file: str): + self.pending.remove(file) + + +download_state = DownloadState() From 3d47ec25cd7ccb2bee0d755e9540f816eefb2f5d Mon Sep 17 00:00:00 2001 From: Oleg Valter Date: Sat, 24 Aug 2024 09:26:03 +0300 Subject: [PATCH 09/12] switched removal of old files to archiving in case the criteria change --- sedd/main.py | 12 ++++++++++-- sedd/utils.py | 18 ++++++++++++++++-- 2 files changed, 26 insertions(+), 4 deletions(-) diff --git a/sedd/main.py b/sedd/main.py index c01c911..768f621 100644 --- a/sedd/main.py +++ b/sedd/main.py @@ -195,14 +195,20 @@ def _exec_download(browser: WebDriver): pass else: browser.get(f"{site}/users/data-dump-access/current") - utils.remove_old_file(args.output_dir, site) + + if not args.dry_run: + utils.archive_file(args.output_dir, site) + _exec_download(browser) if args.skip_loaded and meta_loaded: pass else: browser.get(f"{meta_url}/users/data-dump-access/current") - utils.remove_old_file(args.output_dir, meta_url) + + if not args.dry_run: + utils.archive_file(args.output_dir, meta_url) + _exec_download(browser) @@ -242,6 +248,8 @@ def _exec_download(browser: WebDriver): if state.empty(): observer.stop() browser.quit() + + utils.cleanup_archive(args.output_dir) break else: sleep(1) diff --git a/sedd/utils.py b/sedd/utils.py index 7c0198c..07f33ad 100644 --- a/sedd/utils.py +++ b/sedd/utils.py @@ -3,6 +3,7 @@ from urllib.parse import urlparse import os.path import re +import sys from .data.files_map import files_map, inverse_files_map from .data.sites import sites @@ -58,14 +59,27 @@ def check_file(base_path: str, file_name: str) -> bool: return False -def remove_old_file(base_path: str, site_or_url: str) -> None: +def archive_file(base_path: str, site_or_url: str) -> None: try: file_name = get_file_name(site_or_url) - os.remove(os.path.join(base_path, file_name)) + file_path = os.path.join(base_path, file_name) + os.rename(file_path, f"{file_path}.old") except FileNotFoundError: pass +def cleanup_archive(base_path: str) -> 
None: + try: + file_entries = os.listdir(base_path) + + for entry in file_entries: + if entry.endswith('.old'): + entry_path = os.path.join(base_path, entry) + os.remove(entry_path) + except: + print(sys.exc_info()) + + def is_file_downloaded(base_path: str, site_or_url: str) -> bool: file_name = get_file_name(site_or_url) return check_file(base_path, file_name) From d61dc3154cca3c76d189a73b0089fb445c123020 Mon Sep 17 00:00:00 2001 From: Oleg Valter Date: Sat, 24 Aug 2024 18:13:45 +0300 Subject: [PATCH 10/12] switched watchdog to version 4.0.2 --- requirements.txt | 2 +- sedd/watcher/handler.py | 16 +++------------- sedd/watcher/observer.py | 13 +------------ 3 files changed, 5 insertions(+), 26 deletions(-) diff --git a/requirements.txt b/requirements.txt index eca96dc..de5e0c8 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,3 +1,3 @@ selenium==4.23.1 desktop-notifier==5.0.1 -watchdog==0.8.2 \ No newline at end of file +watchdog==4.0.2 \ No newline at end of file diff --git a/sedd/watcher/handler.py b/sedd/watcher/handler.py index ccace45..00fa424 100644 --- a/sedd/watcher/handler.py +++ b/sedd/watcher/handler.py @@ -1,21 +1,11 @@ -from .state import DownloadState -from ..utils import is_dump_file - import os -# watchdog has trouble with new Python versions not having MutableSet -import collections -from sys import version_info - -if version_info.major == 3 and version_info.minor >= 10: - from collections.abc import MutableSet - collections.MutableSet = collections.abc.MutableSet -else: - from collections import MutableSet - from watchdog.observers import Observer from watchdog.events import FileSystemEventHandler +from .state import DownloadState +from ..utils import is_dump_file + class CleanupHandler(FileSystemEventHandler): download_state: DownloadState diff --git a/sedd/watcher/observer.py b/sedd/watcher/observer.py index f34b861..c78f192 100644 --- a/sedd/watcher/observer.py +++ b/sedd/watcher/observer.py @@ -1,20 +1,9 @@ from threading import current_thread, main_thread +from watchdog.observers import Observer from .handler import CleanupHandler from .state import DownloadState -# watchdog has trouble with new Python versions not having MutableSet -import collections -from sys import version_info - -if version_info.major == 3 and version_info.minor >= 10: - from collections.abc import MutableSet - collections.MutableSet = collections.abc.MutableSet -else: - from collections import MutableSet - -from watchdog.observers import Observer - def register_pending_downloads_observer(output_dir: str): if current_thread() is main_thread(): From 3c94318e6a2b518e70b34c4d5b7cee4b8fafe67c Mon Sep 17 00:00:00 2001 From: Oleg Valter Date: Sat, 24 Aug 2024 18:35:20 +0300 Subject: [PATCH 11/12] fixed typing warning due to Observer now being a variable --- sedd/watcher/handler.py | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/sedd/watcher/handler.py b/sedd/watcher/handler.py index 00fa424..2b68789 100644 --- a/sedd/watcher/handler.py +++ b/sedd/watcher/handler.py @@ -1,6 +1,6 @@ import os -from watchdog.observers import Observer +from watchdog.observers.api import BaseObserverSubclassCallable from watchdog.events import FileSystemEventHandler from .state import DownloadState @@ -9,9 +9,9 @@ class CleanupHandler(FileSystemEventHandler): download_state: DownloadState - observer: Observer + observer: BaseObserverSubclassCallable - def __init__(self, observer: Observer, state: DownloadState): + def __init__(self, observer: BaseObserverSubclassCallable, state: 
DownloadState): super() self.download_state = state From 7d974f2d61f372563221a8cfa1a1d4fac650517d Mon Sep 17 00:00:00 2001 From: Oleg Valter Date: Sat, 24 Aug 2024 20:26:55 +0300 Subject: [PATCH 12/12] Keyboard interrupt is not an exception that needs printing --- sedd/main.py | 3 +++ 1 file changed, 3 insertions(+) diff --git a/sedd/main.py b/sedd/main.py index 768f621..3a5748c 100644 --- a/sedd/main.py +++ b/sedd/main.py @@ -254,6 +254,9 @@ def _exec_download(browser: WebDriver): else: sleep(1) +except KeyboardInterrupt: + pass + except: exception = sys.exc_info()