Not blocked yet:
start -> 2024-03-11T10:11:57.051663083Z
now ---> 2024-03-12T03:36:41.353799992Z
unix1 -> 1710151917 ("stat --format=%Y ./www.canterlot.com-2024-03-11-9b3a5a01/id" and normal stat says 2024-03-11 10:11:57.553640017 +0000)
unix2 -> 1710215173 ("stat --format=%Y ./www.canterlot.com-2024-03-11-9b3a5a01/wpull.log" and normal stat says 2024-03-12 03:46:03.762802180 +0000)
size --> 237 MB /z9/warc/012/www.canterlot.com-2024-03-11-9b3a5a01
200x3 -> https://www.canterlot.com/gallery/image/8158-yama-san-from-the-mountains/ + https://www.canterlot.com/gallery/album/1167-ondrea + https://www.canterlot.com/gallery/image/8380-rainpng/ (all recent)
est. --> 100 GB final size (with many image files, it could be 200 GB)
ran ---> 63,256 seconds (1710215173 - 1710151917)
down --> 3.747 KB/s (237000/63256)
left --> roughly 99 GB (99,000,000 KB)
eta ---> roughly 26,421,137 seconds or 306 days (99000000/3.747 and 26421137/60/60/24)
notes -> The delay file can be changed to contain "3000" (or whatever number) while grab-site is still running, and grab-site will then use that delay instead. Doing that seems to cause no problems. grab-site option of interest = --permanent-error-status-codes STATUS_CODES = "A comma-separated list of HTTP status codes to treat as a permanent error and therefore *not* retry (default: 401,403,404,405,410)". The wpull.db file can be opened by running "sqlite3 -column -header -csv 'wpull.db'"; then view tables by running ".tables"; then view rows by running "select * from tablename;".
What's the fate of this grab? "Probably" my computer will crash/reboot and then I won't return to it, so I'll just get a small portion of that site, which requires a delay between requests. Or, I could keep working on it in various ways. A delay of 5000ms-10000ms will take 306 or 612 days; let's say it will take about a year. A delay of 5000ms will maybe take "only" 150 days to download all of that website. I wish that grab-site was more fault-tolerant.
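The ran/down/eta arithmetic in those notes can be redone with a small shell function. This is only a rough sketch: the name eta_days and its argument order are made up for illustration, it assumes a POSIX sh with awk available, and it assumes the average rate so far holds for the rest of the crawl (which a changed delay would obviously break).

```sh
#!/bin/sh
# eta_days (hypothetical helper): estimate the remaining crawl time
# from the average download rate so far.
#   $1 = start time in Unix seconds (e.g. mtime of the "id" file)
#   $2 = latest time in Unix seconds (e.g. mtime of wpull.log)
#   $3 = KB downloaded so far
#   $4 = KB believed to be left (a guess, as in the notes)
eta_days() {
    awk -v s="$1" -v e="$2" -v got="$3" -v left="$4" 'BEGIN {
        ran  = e - s              # seconds the crawl has been running
        rate = got / ran          # average KB/s so far
        secs = left / rate        # seconds remaining at that rate
        printf "%.3f KB/s, %.0f days\n", rate, secs / 86400
    }'
}

# Numbers taken from the notes: 237 MB done, roughly 99 GB left.
eta_days 1710151917 1710215173 237000 99000000
# prints: 3.747 KB/s, 306 days
```

The first two arguments could instead come straight from the same stat calls used in the notes, e.g. "$(stat --format=%Y ./www.canterlot.com-2024-03-11-9b3a5a01/id)" and the matching call on wpull.log.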
Apparently Common Crawl has a lot of www.canterlot.com, but it doesn't have the content.invisioncic.com outlinks or recent data.
Anyone rsync millions of files? It was a drag that bash deleted my paused job that was doing that:
>$ utc; rsync -a --info=progress2 /d1/path1/ /d2/path1/; utc # ~2,072,198 items
>2024-03-10T14:47:03.267513346Z