In-progress
-----------
- [-] timeline limits
  - [x] by time range
  - [ ] by msg count
  - [ ] per peer
  - [ ] total
  Not necessary for short format, because we have Unix head/tail,
  but may be convenient for long format (because a msg spans multiple lines).
- [-] Convert to Typed Racket
  - [x] build executable (otherwise too slow)
  - [-] add signatures
    - [ ] inner
    - [ ] imports
- [-] commands:
  - [x] c | crawl
    Discover new peers mentioned by known peers.
  - [x] r | read
    see timeline ops above
  - [ ] w | write
- [x] mentions from timeline messages
  - [x] @<source.nick source.url>
  - [x] @<source.url>
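Both mention forms can be captured with one pattern; a sketch (helper name is made up; assumes mentions never nest and URIs contain no spaces):

```racket
(require racket/match)

;; Extract @<nick uri> and @<uri> mentions from a message.
;; Returns (nick-or-#f . uri) pairs; nick is #f for the @<uri> form.
(define (mentions msg)
  (for/list ([groups (in-list (regexp-match* #px"@<(?:([^ >]+) +)?([a-z]+://[^ >]+)>"
                                             msg
                                             #:match-select cdr))])
    (match groups
      [(list nick uri) (cons nick uri)])))
```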
- [ ] "following" from timeline comments: # following = <nick> <uri>
  1. split file lines in 2 groups: comments and messages
  2. dispatch messages parsing as usual
  3. dispatch comments parsing for:
     - # following = <nick> <uri>
     - what else?
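A minimal sketch of steps 1 and 3 (helper names are hypothetical; only the `# following = <nick> <uri>` form is matched):

```racket
(require racket/list racket/match)

;; Step 1: split a timeline file's lines into comments and messages.
(define (split-comments/messages lines)
  (partition (λ (line) (regexp-match? #px"^\\s*#" line)) lines))

;; Step 3: parse a comment of the form "# following = <nick> <uri>".
;; Returns (nick . uri) on a match, #f otherwise.
(define (parse-following line)
  (match (regexp-match #px"^\\s*#\\s*following\\s*=\\s*(\\S+)\\s+(\\S+)\\s*$" line)
    [(list _ nick uri) (cons nick uri)]
    [_ #f]))
```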
- [ ] Parse User-Agent web access logs.
- [-] Update peer ref file(s)
  - [x] peers-all
  - [x] peers-mentioned
  - [ ] peers-followed (by others, parsed from comments)
  - [ ] peers-down (net errors)
  - [ ] redirects?
  Rough sketch from late 2019:

      let read file =
        ...
Backlog
-------
- [ ] Support date without time in timestamps
- [ ] Associate cached object with nick.
- [ ] Crawl downloaded web access logs
- [ ] download-command hook to grab the access logs

      ;; Rough sketch (needs racket/match and racket/set):
      (define (parse log-line)
        (match (regexp-match #px"([^/]+)/([^ ]+) +\\(\\+([a-z]+://[^;]+); *@([^\\)]+)\\)" log-line)
          [(list _ client version uri nick) (cons nick uri)]
          [_ #f]))

      (list->set (filter-map parse (file->lines "logs/combined-access.log")))

      (filter (λ (p) (equal? 'file (file-or-directory-type p)))
              (directory-list logs-dir))

- [ ] user-agent file as CLI option - need to run at least the crawler as another user
- [ ] Support fetching rsync URIs
- [ ] Check for peer duplicates:
- [ ] same nick for N>1 URIs
- [ ] same URI for N>1 nicks
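Both duplicate checks could share one helper; a sketch, assuming peers are already read in as a deduplicated list of (nick . uri) pairs (`dupes` is a made-up name):

```racket
(require racket/list)

;; Group peers by a key and keep only groups with more than one member.
(define (dupes key peers)
  (filter (λ (group) (> (length group) 1))
          (group-by key peers)))

;; same nick for N>1 URIs: (dupes car peers)
;; same URI for N>1 nicks: (dupes cdr peers)
```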
- [ ] download times per peer
- [ ] Support redirects
  - should permanent redirects update the peer ref somehow?
- [ ] optional text wrap
- [ ] write
- [ ] peer refs set operations (perhaps better done externally?)
- [ ] timeline as a result of a query (peer ref set op + filter expressions)
- [ ] config files
Done
----
- [x] Crawl all cache/objects/*, not given peers.
- [x] Support time ranges (i.e. reading the timeline between given time points)
- [x] Dedup read-in peers before using them.
- [x] Prevent redundant downloads
  - [x] Check ETag
  - [x] Check Last-Modified if no ETag was provided
  - [x] Parse rfc2822 timestamps
- [x] caching (use cache by default, unless explicitly asked for update)
  - [x] value --> cache
  - [x] value <-- cache