In-progress
-----------
- [-] Convert to Typed Racket
- [x] build executable (otherwise too slow)
- [-] add signatures
- [ ] inner
- [ ] imports
- [-] commands:
- [x] c | crawl
  Discover new peers mentioned by known peers.
- [x] r | read
- see timeline ops above
- [ ] w | write
- [x] mentions from timeline messages
- [x] @<source.nick source.url>
- [x] @<source.url>
- [ ] "following" from timeline comments: # following = <nick> <uri>
- [ ] Parse User-Agent web access logs.
- [-] Update peer ref file(s)
  - [x] peers-all
  - [x] peers-mentioned
  - [ ] peers-followed (by others, parsed from comments)
  - [ ] peers-down (net errors)
  - [ ] redirects?
Rough sketch from late 2019:
let read file =
...
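
The two mention formats above (`@<nick url>` and `@<url>`) could be extracted
with a single regexp. A sketch; `mentions` is a hypothetical helper name:

```racket
#lang racket
;; Sketch: pull mentions out of a twtxt message body.
;; Matches both "@<nick url>" and "@<url>"; nick is #f in the second form.
(define (mentions str)
  (for/list ([m (in-list (regexp-match* #px"@<(?:([^ >]+) +)?([a-z]+://[^>]+)>"
                                        str #:match-select cdr))])
    (cons (first m) (second m))))

(mentions "hey @<joe http://e.example/tw.txt> and @<http://a.example/tw.txt>")
;; → '(("joe" . "http://e.example/tw.txt") (#f . "http://a.example/tw.txt"))
```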
Backlog
-------
- [ ] Crawl all cache/objects/*, not given peers.
  BUT, in order to build an A-mentioned-B graph, we need to know the nick
  associated with the URI whose object we're examining. How to do that?
- [ ] Crawl downloaded web access logs
- [ ] download-command hook to grab the access logs

  (define (parse log-line)
    (match (regexp-match #px"([^/]+)/([^ ]+) +\\(\\+([a-z]+://[^;]+); *@([^\\)]+)\\)" log-line)
      [(list _ client version uri nick) (cons nick uri)]
      [_ #f]))

  (list->set (filter-map parse (file->lines "logs/combined-access.log")))

  ;; #:build? #t so the returned paths include logs-dir and
  ;; file-or-directory-type can actually find them.
  (filter (λ (p) (equal? 'file (file-or-directory-type p)))
          (directory-list logs-dir #:build? #t))

- [ ] user-agent file as CLI option - need to run at least the crawler as another user
- [ ] Support fetching rsync URIs
- [ ] Check for peer duplicates:
- [ ] same nick for N>1 URIs
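
The duplicate check could group URIs by nick and flag any nick bound to more
than one. A sketch under the assumption that peers are a list of
`(cons nick uri)` pairs; `duplicate-nicks` is a hypothetical name:

```racket
#lang racket
;; Sketch: return the nicks that appear with more than one distinct URI,
;; paired with those URIs (order of the URIs is unspecified).
(define (duplicate-nicks peers)
  (define by-nick
    (for/fold ([h (hash)]) ([p (in-list peers)])
      (hash-update h (car p) (λ (uris) (set-add uris (cdr p))) (set))))
  (for/list ([(nick uris) (in-hash by-nick)]
             #:when (> (set-count uris) 1))
    (cons nick (set->list uris))))
```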
Done
----
- [x] Dedup read-in peers before using them.
- [x] Prevent redundant downloads
- [x] Check ETag
- [x] Check Last-Modified if no ETag was provided
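
The validator preference above (ETag first, Last-Modified only as a fallback)
boils down to picking the right conditional-request header. A sketch;
`conditional-headers` is a hypothetical helper whose result would be passed
as `#:headers` to `http-sendrecv`, and a 304 response means the cached copy
is still fresh:

```racket
#lang racket
;; Sketch: headers for a conditional GET. Prefer If-None-Match when an
;; ETag was cached; fall back to If-Modified-Since; else no validator.
(define (conditional-headers etag last-modified)
  (cond [etag          (list (string-append "If-None-Match: " etag))]
        [last-modified (list (string-append "If-Modified-Since: " last-modified))]
        [else          '()]))
```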