# vim:sw=2:sts=2:
TODO
====

Legend:
- [ ] not started
- [-] in-progress
- [x] done
- [~] cancelled

In-progress
-----------
- [-] Convert to Typed Racket
  - [x] build executable (otherwise too slow)
  - [-] add signatures
    - [x] top-level
    - [ ] inner
    - [ ] imports
- [-] commands:
  - [x] c | crawl
    Discover new peers mentioned by known peers.
  - [x] r | read
    - see timeline ops above
  - [ ] w | write
    - arg or stdin
    - nick expand to URI
    - Watch FIFO for lines, then read, timestamp and append [+ upload].
      Can be part of a "live" mode, along with background polling and
      incremental printing. Sort of an ii-like IRC experience.
  - [ ] q | query
    - see timeline ops above
    - see hashtag and channels above
  - [x] d | download
    - [ ] options:
      - [ ] all - use all known peers
      - [ ] fast - all except peers known to be slow or unavailable
        REQUIRES: stats
  - [x] u | upload
    - calls user-configured command to upload user's own timeline file to their server
  Looks like a better CLI parser than "racket/cmdline": https://docs.racket-lang.org/natural-cli/
  But it is no longer necessary now that I've figured out how to chain (command-line ..) calls.
- [-] Output formats:
  - [x] text long
  - [x] text short
  - [ ] HTML
  - [ ] JSON
- [-] Peer discovery
  - [-] parse peer refs from peer timelines
    - [x] mentions from timeline messages
      - [x] @<source.nick source.url>
      - [x] @<source.url>
    - [ ] "following" from timeline comments: # following = <nick> <uri>
  - [ ] Parse User-Agent web access logs.
  - [-] Update peer ref file(s)
    - [x] peers-all
    - [x] peers-mentioned
    - [ ] peers-followed (by others, parsed from comments)
    - [ ] peers-down (net errors)
    - [ ] redirects?
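  Sketch of how the two mention forms and the "following" comment listed
  above could be matched. This is an illustration, not tt's actual parser:
  the names `mentions` and `following` and both patterns are assumptions,
  and <nick>/<uri> in the comment form are read as placeholders rather than
  literal angle brackets.

```racket
#lang racket

;; Hypothetical helper: all peer refs mentioned in one timeline message.
;; Returns a list of (nick-or-#f . uri) pairs, covering both
;;   @<source.nick source.url>   and   @<source.url>
(define (mentions msg)
  (for/list ([groups (in-list
                      (regexp-match* #px"@<(?:([^ \t>]+)[ \t]+)?([^ \t>]+)>"
                                     msg
                                     #:match-select cdr))])
    ;; groups is (list nick-or-#f uri)
    (cons (first groups) (second groups))))

;; Hypothetical helper: peer ref from a "# following = nick uri" comment line.
;; Returns (nick . uri), or #f when the line is not such a comment.
(define (following line)
  (match (regexp-match
          #px"^#[ \t]*following[ \t]*=[ \t]*([^ \t]+)[ \t]+([^ \t]+)[ \t]*$"
          line)
    [(list _ nick uri) (cons nick uri)]
    [_ #f]))
```

  A mention without a nick yields #f in the nick position, so the caller
  still has to decide how to name such a peer.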
  Rough sketch from late 2019:
    let read file =
      ...
    let write file peers =
      ...
    let fetch peer =
      (* Fetch could mean either or both of:
       * - fetch peer's we-are-twtxt.txt
       * - fetch peer's twtxt.txt and extract mentioned peer URIs
       * *)
      ...
    let test peers =
      ...
    let rec discover peers_old =
      let peers_all =
        Set.fold peers_old ~init:peers_old ~f:(fun peers p ->
          match fetch p with
          | Error _ ->
              (* TODO: Should p be moved to down set here? *)
              log_warning ...;
              peers
          | Ok peers_fetched ->
              Set.union peers peers_fetched
        )
      in
      (* peers_all is a superset of peers_old: stop once nothing new was added. *)
      if Set.is_empty (Set.diff peers_all peers_old) then
        peers_all
      else
        discover peers_all
    let rec loop interval peers_old =
      let peers_all = discover peers_old in
      let (peers_up, peers_down) = test peers_all in
      write "peers-all.txt" peers_all;
      write "peers-up.txt" peers_up;
      write "peers-down.txt" peers_down;
      sleep interval;
      loop interval peers_all
    let () =
      loop (int_of_string Sys.argv.(1)) (read "peers-all.txt")
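  The same fixed-point idea in Racket, as a minimal sketch. Assumptions:
  `fetch` is a caller-supplied stand-in that returns a set of peers, or #f
  on error, and peers are any values comparable with equal?. A round that
  adds no new peers ends the recursion.

```racket
#lang racket

;; Hypothetical sketch: grow the peer set until a full round adds nothing.
;; fetch : peer -> (or/c #f set?)   (#f stands for a failed fetch)
(define (discover fetch peers-old)
  (define peers-all
    (for/fold ([peers peers-old])
              ([p (in-set peers-old)])
      (define fetched (fetch p))
      ;; On error keep what we have; otherwise merge in the new refs.
      (if fetched (set-union peers fetched) peers)))
  ;; peers-all is a superset of peers-old, so equal sizes mean no growth.
  (if (= (set-count peers-all) (set-count peers-old))
      peers-all
      (discover fetch peers-all)))
```

  Note that every round re-fetches all known peers, not only the newly
  found ones. Fine for a sketch, wasteful for a real crawl.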

Backlog
-------
- [ ] Crawl all cache/objects/*, not given peers.
  BUT, in order to build A-mentioned-B graph, we need to know the nick
  associated with the URI whose object we're examining. How to do that?
- [ ] Crawl downloaded web access logs
  - [ ] download-command hook to grab the access logs

    ; Extract (nick . uri) from a twtxt-style User-Agent, or #f if none.
    (define (parse log-line)
      (match (regexp-match #px"([^/]+)/([^ ]+) +\\(\\+([a-z]+://[^;]+); *@([^\\)]+)\\)" log-line)
        [(list _ client version uri nick) (cons nick uri)]
        [_ #f]))

    (list->set (filter-map parse (file->lines "logs/combined-access.log")))

    ; #:build? #t prefixes each entry with logs-dir, so the type check
    ; does not depend on the current directory.
    (filter (λ (p) (equal? 'file (file-or-directory-type p))) (directory-list logs-dir #:build? #t))

- [ ] user-agent file as CLI option - need to run at least the crawler as another user
- [ ] Support fetching rsync URIs
- [ ] Check for peer duplicates:
  - [ ] same nick for N>1 URIs
  - [ ] same URI for N>1 nicks
- [ ] Background polling and incremental timeline updates.
  We can mark which messages have already been printed and print new ones as
  they come in.
  REQUIRES: polling
- [ ] Polling mode/command, where tt periodically polls peer timelines
- [ ] nick tiebreaker(s)
  - [ ] some sort of a hash of URI?
  - [ ] angry-purple-tiger kind of thingie?
  - [ ] P2P nick registration?
    - [ ] Peers vote by claiming to have seen a nick->uri mapping?
      The inherent race condition would be a feature, since all user name
      registrations are races.
      REQUIRES: blockchain
- [ ] stats
  - [ ] download times per peer
- [ ] Support redirects
  - should permanent redirects update the peer ref somehow?
- [ ] Support time ranges (i.e. reading the timeline between given time points)
- [ ] optional text wrap
- [ ] write
- [ ] timeline limits
- [ ] peer refs set operations (perhaps better done externally?)
- [ ] timeline as a result of a query (peer ref set op + filter expressions)
- [ ] config files
- [ ] highlight mentions
- [ ] filter on mentions
- [ ] highlight hashtags
- [ ] filter on hashtags
- [ ] hashtags as channels? initial hashtag special?
- [ ] query language
- [ ] console logger colors by level ('error)
- [ ] file logger ('debug)
- [ ] Support immutable timelines
  - store individual messages
    - where?
      - something like DBM or SQLite - faster
      - filesystem - transparent, easily published - probably best
- [ ] block(chain/tree) of twtxts
  - distributed twtxt.db
  - each twtxt.txt is a ledger
  - peers can verify states of ledgers
  - peers can publish known nick->url mappings
  - peers can vote on nick->url mappings
  - we could break time periods into blocks
  - how to handle the fact that many (most?) twtxts are unseen by peers
  - longest X wins?
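
Sketch for the "Check for peer duplicates" item above. Assumptions: peers
are (nick . uri) pairs, and the helper names are made up for illustration.
Both checks are the same grouping, just keyed on the other half of the pair.

```racket
#lang racket

;; Hypothetical helper: group peers by `key` and keep only the keys that
;; map to more than one distinct `val`.
;; Returns a list of (key-value val1 val2 ...).
(define (duplicates-by key val peers)
  (for*/list ([group (in-list (group-by key peers))]
              [vals (in-value (remove-duplicates (map val group)))]
              #:when (> (length vals) 1))
    (cons (key (first group)) vals)))

;; same nick for N>1 URIs
(define (same-nick-many-uris peers) (duplicates-by car cdr peers))

;; same URI for N>1 nicks
(define (same-uri-many-nicks peers) (duplicates-by cdr car peers))
```

Exact repeats of the same (nick . uri) pair are not reported, since they
collapse to a single distinct value.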

Done
----
- [x] Dedup read-in peers before using them.
- [x] Prevent redundant downloads
  - [x] Check ETag
  - [x] Check Last-Modified if no ETag was provided
    - [x] Parse rfc2822 timestamps
- [x] caching (use cache by default, unless explicitly asked for update)
  - [x] value --> cache
  - [x] value <-- cache
  REQUIRES: d command
- [x] Logger sync before exit.
- [x] Implement rfc3339->epoch
- [x] Remove dependency on rfc3339-old
- [x] remove dependency on http-client
- [x] Build executable
  Implies fix of "collection not found" when executing the built executable
  outside the source directory:

    collection-path: collection not found
      collection: "tt"
      in collection directories:
      context...:
       /usr/share/racket/collects/racket/private/collect.rkt:11:53: fail
       /usr/share/racket/collects/setup/getinfo.rkt:17:0: get-info
       /usr/share/racket/collects/racket/contract/private/arrow-val-first.rkt:555:3
       /usr/share/racket/collects/racket/cmdline.rkt:191:51
       '|#%mzc:p
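
Sketch for the "Prevent redundant downloads" item above: which conditional
request header to send, given what was cached from the previous response.
Not tt's actual code. The function name and the string-or-#f arguments are
assumptions. If-None-Match and If-Modified-Since are the standard HTTP
request headers paired with ETag and Last-Modified.

```racket
#lang racket

;; Hypothetical helper: pick the conditional header for a re-download.
;; cached-etag, cached-last-modified : (or/c string? #f), saved verbatim
;; from the previous response's ETag and Last-Modified headers.
(define (conditional-headers cached-etag cached-last-modified)
  (cond
    ;; Prefer the ETag when the server provided one.
    [cached-etag
     (list (format "If-None-Match: ~a" cached-etag))]
    ;; Otherwise fall back to the Last-Modified timestamp.
    [cached-last-modified
     (list (format "If-Modified-Since: ~a" cached-last-modified))]
    ;; Nothing cached: plain unconditional request.
    [else '()]))
```

A 304 Not Modified response to such a request means the cached object can
be reused as is.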


Cancelled
---------
- [~] named timelines/peer-sets
  REASON: That is basically files of peers, which we already support.