#milkymist IRC log for Sunday, 2012-01-22

lekernellarsc: sequential(1) means you accept data on every second cycle, and process it on the other. sequential(0) is what you mean.12:16
lekernelso yes, pipeline(0) = sequential(0) ... but to avoid this, having zero latency is not permitted for pipeline/sequential and one should use combinatorial instead12:17
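
A rough plain-Python sketch of the token timing implied by these scheduling models (not actual Migen code; the function names and the simulation style are invented for illustration): pipeline(N) accepts a token every cycle and produces its result N cycles later, while sequential(N) accepts a token, processes it for N cycles, and only then accepts the next one, so pipeline(0) and sequential(0) coincide.

    def simulate_pipeline(inputs, latency):
        """pipeline(N): one token accepted per cycle, result N cycles later."""
        return [(cycle + latency, x) for cycle, x in inputs]

    def simulate_sequential(inputs, latency):
        """sequential(N): a token is accepted, processed for N cycles, and the
        next token is only accepted afterwards (period of N+1 cycles)."""
        out, ready_at = [], 0
        for cycle, x in inputs:
            start = max(cycle, ready_at)      # producer is stalled while busy
            out.append((start + latency, x))  # result after N processing cycles
            ready_at = start + latency + 1    # earliest cycle for the next token
        return out

    stream = [(c, c) for c in range(5)]       # one token offered per cycle
    print(simulate_pipeline(stream, 1))       # throughput: one token per cycle
    print(simulate_sequential(stream, 1))     # sequential(1): every second cycle
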
lekernelwpwrak: not only is it more complicated, but it doesn't make sense, because the base Actor class already includes the control logic for you when you use one of the predefined scheduling models12:18
lekernelso removing scheduling models would increase code redundancy, not reduce it12:18
lekernellarsc: having both properties could be an option, yes.12:20
wpwrakyeah, i was thinking that you may be able to determine what interface and thus scheduling model you need from the functionality12:21
wpwrakbut yes, it would be harder for the computer12:21
lekernellarsc: can every graph of sequential/pipelined actors be described by these two numbers? hmm...12:22
lekernelif so, it'd be very interesting to use them12:23
lekernelbut there will be a problem if the graph has several input ports, which can accept tokens independently12:31
wpwrakif you have out = f(inA, inB), would the choice between backpressure (if inX is ready before inY, you stall it until inY is ready) and internal merging (e.g., accept inX and buffer it, perform the operation when inY is ready) be expressed by varying pipeline(N) ?12:34
larscyou actually already have a problem if a pipelined actor is before a sequential actor. for example if your data arrives in bursts you'd probably want to fill the pipeline and then stall12:35
wpwrakyou'd also need a means for expressing if inX and inY are synchronized and always arrive at the same time12:35
wpwrakand of course, if inX and inY are bursty, you'd need to be able to specify the burst size12:36
wpwrakwell, burst = 0 could be backpressure, burst = 1 could be buffer-until-the-other(s)-is/are-ready, etc.12:37
lekernelwpwrak: synchronized inX and inY mean combining them into a single token12:37
wpwrakah, nice12:37
lekernelthe token data type is a tree of integers (i.e. a record, which can include other records as members)12:38
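
A small plain-Python sketch of the two policies discussed above for out = f(inA, inB); the model and the names are invented for illustration. Without buffering, an early token is not absorbed and its producer stalls until the join fires (backpressure); with buffering (assumed deep enough here), the early side keeps streaming and the tokens wait inside the actor.

    def join_fire_cycles(a_arrivals, b_arrivals, buffered):
        """a_arrivals/b_arrivals: cycles at which the producers offer their
        i-th tokens; the i-th firing needs both.  Returns the fire cycles and
        the cycles at which inA's tokens are actually taken."""
        fires, a_offers, a_free, b_free = [], [], 0, 0
        for ta, tb in zip(a_arrivals, b_arrivals):
            a_off, b_off = max(ta, a_free), max(tb, b_free)
            fire = max(a_off, b_off)
            fires.append(fire)
            a_offers.append(a_off)
            if buffered:
                a_free, b_free = a_off + 1, b_off + 1   # early side absorbed
            else:
                a_free = b_free = fire + 1              # early producer stalls
        return fires, a_offers

    a = [0, 1, 2, 3]          # inA bursts early
    b = [0, 4, 8, 12]         # inB trickles in
    print(join_fire_cycles(a, b, buffered=False))  # fires [0,4,8,12], inA taken at [0,1,5,9]
    print(join_fire_cycles(a, b, buffered=True))   # same fires, inA taken at [0,1,2,3]
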
wpwrakif you have unsynchronized data paths that travel together (e.g., VGA + DDC), can you still express them as some sort of compound ? or is that too exotic12:39
larscwell, you can always split them12:40
larscvga + ddc would describe the physical signal12:41
lekernellarsc: we can abandon that property, and make pipelines stall the input whenever the output is stalled, no matter what's in them12:42
lekernelactually it uses a tad less resources to do it this way12:42
larsclekernel: yes12:42
larscwe can, but depending on the workload it might be suboptimal12:43
larscand i think the latency properties do not apply to the whole graph, but rather to a path in the graph12:44
larscsignals which are synchronized lie on the same edge12:47
wpwrakhow would write-back memory be modeled ? (non-blocking interface) -- (FIFO) -- (blocking interface) -- (bus access & arbiter) ?12:48
wpwrak(assuming the FIFO is sized for the worst case)12:48
larscwpwrak: that's probably dynamic12:49
larscbut you can give an upper and a lower bound i guess12:49
wpwrakdata-dependent delays could also get interesting. e.g., for SIMD units.12:49
wpwraklarsc: i was wondering more about the interfaces. properly dimensioning the FIFO is another problem. not necessarily a trivial one, of course ;-)12:50
larscwpwrak: which interfaces?12:51
larscyou'd have the same handshake signals as everything else i guess12:51
wpwrakbetween the processing elements12:51
wpwrakhmm, but you don't always handshake, do you ?12:52
larscthe idea of migen flow is to remove the handshakes if we know from the scheduling model they are not necessary12:52
wpwrakyup. so how would that write-back memory interface work ?12:53
wpwrakor, in general, any FIFO with such properties12:53
wpwraki.e., it needs backpressure at the output but not at the input. it needs a data strobe on both ends.12:54
larscyes12:55
larscbut i still don't get what the question is12:56
lekernelwpwrak: then that FIFO would always assert its ack signal12:56
lekernelyou can't have non-blocking interfaces, only blocking interfaces that may not block in practice12:56
wpwrakso migen wouldn't see a difference ?12:57
lekernelno12:57
wpwraki see12:57
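
A minimal sketch of that point (illustrative Python, invented class name): the write-back FIFO still presents the usual blocking handshake, but since it is sized for the worst case its upstream ack is constant 1, so the upstream handshake logic can be optimized away, while real backpressure remains only on the bus-arbiter side.

    class WorstCaseFifo:
        """A FIFO sized for the worst-case backlog: upstream it is a
        'blocking' interface that never blocks (ack tied to 1), downstream
        it still obeys the arbiter's ack."""
        def __init__(self, depth):
            self.depth = depth
            self.tokens = []

        def sink_ack(self):
            return 1                            # always ready for the producer

        def push(self, token):
            assert len(self.tokens) < self.depth, "worst-case sizing was wrong"
            self.tokens.append(token)

        def source(self, downstream_ack):
            if self.tokens and downstream_ack:  # genuine backpressure here
                return self.tokens.pop(0)
            return None
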
lekernellarsc: and yes, running the pipelines this way is suboptimal. but using verilog and automatic par is suboptimal compared to programming LUTs and interconnect manually, too.12:57
wpwrakmaybe the verilog synthesizer can figure it out when it sees that the ACK is always set :)12:58
larscit will12:58
larscor at least should12:58
lekernelwpwrak: in some cases it does, but it won't model the speed of data in and out of the fifo12:58
larsc;)12:58
wpwrakyou mean the traffic profiles = the FIFO size ? yeah, that would be asking for a bit too much :)12:59
wpwrakthere are a few nice bookshelves full of queuing theory and descendant fields of science ... that nobody should be forced to read ;-)13:00
wpwrakin practice, you do everything with simulation anyway (coming from the networking side)13:01
larsclekernel: if our graph can accept inputs independently there has to be an actor somewhere in the graph which can accept input independently and thus has dynamic scheduling13:20
larsc(given the graph is connected)13:20
larsci wonder if this is minimum cut/maximum flow13:30
larsclekernel: consider this: http://metafoo.de/flow.png edges are actors, circles are sources/sinks. the strobe signal of sink1 still depends on all the sources' strobe signals13:42
larscyes, i think this is a min cut/max flow problem. if you have a graph you can calculate both delays over the whole graph. and if you create handshaking signals based on these you can remove all handshaking inside the graph itself14:05
larscto do this you'd insert supernode synchronizers before the inputs and after the outputs14:05
larscand then you'd implement the handshaking between the supernode input and the supernode output14:06
larsclike this http://metafoo.de/flow2.png14:13
larscand the grey circles are the supernode stuff14:13
larsci wonder if the graph has to be a DAC for this to work14:16
larsci think you can even calculate the maximum flow for each of the delays for a graph independently and it will still work14:20
wolfspraulwpwrak: since you did led color comparisons, I just ran into a remark from Casio in 2010, where they say that for some high-end projectors, they produce component colors in the following way "red by red led, blue by a blue laser and green converted by phosphor from a blue laser"14:49
wpwrakhehe :)14:53
wpwrakthe green leds in M1 are already a lot better than the "super-bright" green leds i have (selected some 5-6 years ago). but still, in comparison to red, green sucks.14:55
wpwrakat least, at ~10 mA and 100% duty cycle, they manage to outshine my red LEDs (also 5-6 years old) running at ~6 mA, 15%14:57
larscso to optimize the handshaking of any graph you'd do:15:23
larscremove all actors with dynamic scheduling and then calculate the connected components.15:23
larscfor each component check whether it is a DAC or not. If it is not a DAC calculate the minimum set of edges that have to be removed in order for it to be a DAG.15:23
larscRemove these edges from the graph, the nodes which have been connected by a removed edge will need handshake signals, handle them like other sink or source nodes.15:23
larscThen for each component calculate the total delays. Based on these delays generate the handshake signals, these handshake signals will be used for all sinks and all sources of the component.15:23
lekernelDAC?15:27
lekernelI know DAG, but not DAC15:27
larscyeah, DAG15:29
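
A rough plain-Python sketch of that procedure on a simple adjacency-dict graph; the helpers, names and graph representation are invented here, and a real implementation would work on Migen's actor/network objects and compute a genuinely minimal feedback edge set rather than this greedy DFS cut.

    def feedback_edges(succ):
        """Back edges found by DFS; removing them leaves a DAG (greedy, not
        guaranteed minimal).  These links keep their own handshake signals."""
        WHITE, GREY, BLACK = 0, 1, 2
        colour = {n: WHITE for n in succ}
        cut = []
        def dfs(v):
            colour[v] = GREY
            for w in succ[v]:
                if colour[w] == GREY:
                    cut.append((v, w))
                elif colour[w] == WHITE:
                    dfs(w)
            colour[v] = BLACK
        for n in succ:
            if colour[n] == WHITE:
                dfs(n)
        return cut

    def total_delay(succ, node_delay):
        """Longest path through the (acyclic) component: the delay used to
        generate the shared handshake for all of its sources and sinks."""
        memo = {}
        def longest_from(v):
            if v not in memo:
                memo[v] = node_delay[v] + max(
                    (longest_from(w) for w in succ[v]), default=0)
            return memo[v]
        return max(longest_from(n) for n in succ)

    # dynamically scheduled actors already stripped; one connected component:
    graph = {"src": ["a"], "a": ["b", "c"], "b": ["sink"], "c": ["sink"],
             "sink": []}
    delays = {"src": 0, "a": 2, "b": 3, "c": 1, "sink": 0}
    assert feedback_edges(graph) == []     # already a DAG, nothing to cut
    print(total_delay(graph, delays))      # 5: drives the component handshake
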
lekerneland yes... I had some ideas about removing handshaking inside graphs, but I have not developed them. your method seems correct and interesting :)15:38
lekernelbut I think dealing with DAGs only is sufficient, no?15:38
lekernelif you have a cyclic graph with only seq/pipeline actors, it will never terminate15:39
larscif it also depends on external input it will. e.g. a LFSR15:41
lekernelso, it's not really useful in practice imho15:41
lekernelhmm...15:41
lekernelthen there's another problem. all actors also have a "busy" signal that they should assert when they have data that shouldn't be lost in any of their registers15:43
lekernelthis is useful to signal the completion of a hardware accelerator to a CPU, for example15:43
lekernelif we have a never-ending dataflow system, the global "busy" signal will stay asserted and this doesn't make much sense15:44
lekernel(global busy = OR of all actors' busy signals with the current design)15:45
lekernelfor the LFSR I'd implement it in one actor too :)15:46
larsci don't see a problem with this right now15:48
larscthe lfsr is not busy if there is no new input15:49
lekernelbut once there has been some input, it will be busy trying to send tokens to itself15:51
lekerneland in the case of a pure sequential or pipelined actor, this will never end15:52
larscwe might just need to say that a feedback connection does not cause busy to be asserted15:55
lekernelor simply avoid those complexities entirely, and impose that all sub-graphs made of seq/pipe actors must be acyclic?15:57
lekernelare they really useful?15:57
lekernelseems we need a dynamically scheduled actor in the feedback path anyway, to provide the first value(s)15:59
larscmaybe. i don't know. but given the current definition of busy in actor.py I still don't see why busy should be asserted permanently if there is a cycle16:00
lekernelbusy is asserted when attempting to send a token which is stalled16:00
larsci think we can start without support for cycles16:00
lekerneluntil the end of the stall16:00
larscbut it will only transition from non-busy to busy if all inputs are pending16:02
lekernelyes, but if one of its inputs is connected to its output, it will be busy because it tries to send data to itself, and cannot receive it until the other inputs also have data16:04
larschm16:05
lekernelwpwrak: regarding those laser phosphors, I wonder if they can be used to make EPR photon pairs. the crystals normally used for this are crap expensive, e.g. http://www.eksmaoptics.com/en/p/beta-barium-borate-bbo-crystals-29816:05
larsclekernel: is a counter really cheaper than a one-hot shift register for small numbers? (referring to the sequential control fragment)16:11
lekerneldepends how small16:12
lekernelbut I haven't done the math (which should be straightforward enough though)16:12
lekernelit's just a small detail16:13
larsci was just thinking about the control fragment for an actor with sequential and pipeline delay. you'd just use a shift register like for the pipeline control fragment with the length being pipeline delay + sequential delay.16:17
larscand ack_o would only be asserted if the lower n entries are zero16:18
larscwhere n is the sequential delay16:18
larscanother approach is to just stick the sequential logic in front of the pipeline logic and always add a new entry to the shift register when the timer triggers16:19
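
A cycle-level plain-Python sketch of the first variant (names invented): one shift register of length pipeline delay + sequential delay, a token's bit entering at position 0 when it is accepted, the result presented while the bit sits in the last stage, and ack_o asserted only while the lowest sequential-delay positions are empty.

    def control_fragment(tokens_available, pipeline_delay, sequential_delay,
                         cycles):
        """Returns (cycles in which a token was accepted,
                    cycles in which a result was presented)."""
        sr = [0] * (pipeline_delay + sequential_delay)
        accepted, completed = [], []
        for cycle in range(cycles):
            ack_o = not any(sr[:sequential_delay])   # lowest n entries empty
            if sr[-1]:                               # result in the last stage
                completed.append(cycle)
            take = ack_o and tokens_available(cycle)
            if take:
                accepted.append(cycle)
            sr = ([1] if take else [0]) + sr[:-1]    # clock edge: shift
        return accepted, completed

    # pipeline_delay=1, sequential_delay=2: a token accepted every 3rd cycle,
    # each result 3 cycles after its token; sequential_delay=0 degenerates to
    # the plain pipeline control fragment (one token per cycle).
    print(control_fragment(lambda c: True, 1, 2, 12))
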
lekernelyou can use an LFSR instead of the counter. even faster due to no carry propagation (though on an FPGA, the carry chains are pretty fast)...16:20
lekernelthere are lots of possible optimizations, but the counter is simple to write/understand and good enough imo16:20
lekernelif it turns out that those counters are using more than 0.5% of the FPGA resources or cause timing problems in a design, then I'll pay them more attention16:21
lekernelbut I believe they won't16:22
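
For reference, a small plain-Python illustration of the LFSR-instead-of-counter idea, assuming a 4-bit Fibonacci LFSR with the primitive polynomial x^4 + x^3 + 1 (15 states before repeating, starting state must be non-zero): instead of comparing a binary count against N, you precompute the state the LFSR reaches after N steps and compare against that constant, with no carry chain involved.

    def lfsr4_step(state):
        """One step of a 4-bit Fibonacci LFSR (feedback from bits 3 and 2,
        polynomial x^4 + x^3 + 1, maximal length)."""
        feedback = ((state >> 3) ^ (state >> 2)) & 1
        return ((state << 1) | feedback) & 0xF

    def terminal_state(start, n):
        """State reached after n steps; comparing against this constant
        replaces the '== N' comparator of a binary counter."""
        s = start
        for _ in range(n):
            s = lfsr4_step(s)
        return s

    start = 0b0001
    target = terminal_state(start, 10)      # "count to 10" without carries
    s, count = start, 0
    while s != target:
        s = lfsr4_step(s)
        count += 1
    print(count)                            # 10
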
larsclekernel: hm, I don't quite understand what Cat() does16:28
lekernelthe FHDL thing?16:28
lekernelit simply concatenates bit vectors together16:28
lekernelCat(a, b, c) ==> {c, b, a} in Verilog16:29
larschm, ok16:29
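
A tiny arithmetic check of that expansion (plain Python, not FHDL): in Cat(a, b, c) the first argument ends up in the least significant bits, which is exactly what the Verilog {c, b, a} ordering produces.

    def cat(*parts):
        """parts are (value, width) pairs; the first argument fills the LSBs."""
        out, shift = 0, 0
        for value, width in parts:
            out |= (value & ((1 << width) - 1)) << shift
            shift += width
        return out

    a, b, c = 0b1, 0b10, 0b111                 # widths 1, 2 and 3
    print(bin(cat((a, 1), (b, 2), (c, 3))))    # 0b111101, i.e. {c, b, a}
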
wpwraklekernel: (photon pairs) hmm, maybe. but the photons would be at very different energy levels.17:22
lekernelhm?17:25
lekernelas I understand it, those things take one blue photon and turn it into two green photons17:26
lekernelor are there more complicated things, e.g. blue to green + IR?17:27
wpwraki think such things take one blue and emit one green plus one IR (or dissipate the energy in some other way, e.g., mechanical excitation)17:27
wpwrakyeah17:27
wpwrakhttp://en.wikipedia.org/wiki/Stokes_shift17:27
wpwrakhere's a nice animation: http://www.abdserotec.com/resources/flow-cytometry-ebook/principles-of-fluorescence/stokes-shift.html17:29
wpwrak(somewhat different chemistry, though)17:29
lekernelso I guess I should grow my own BBO crystals if I want cheap EPR pairs17:29
wpwrakyeah. have a little crystal garden in the backyard ;-)17:32
larsclekernel: the sequential control logic takes N+1 cycles; is this a bug or a feature?17:58
lekernelhmm, not sure what you mean18:02
lekernelwhat it should do, when input tokens are always available and output tokens always accepted, is accept/send tokens in one cycle, then process for N cycles18:03
lekernelif you include communication then yes, N+1 is a feature, not a bug18:03
lekernelbut I have not simulated this code yet, so you can probably easily find problems there18:04
larscok, but the one extra cycle is actually pipeline latency, isn't it?18:04
lekernelin some way, yes, it's a 'pipeline'18:04
lekernelyou want to switch to the X pipeline stages + Y sequential cycles model?18:05
larsci think it is so much nicer to work with18:06
Fallenoua draft/summary of mailing list discussion/bunch of ideas/questions about the MMU : http://piratepad.net/RSE6AWxIIa19:12
FallenouI'm not sure I have understood everything that has been said on the mailing list about the MMU, so feel free to correct anything I got wrong and/or to add more accurate thoughts19:13
FallenouOne problem with the described design is explained at the bottom; I don't know the solution, so if you know one :)19:14
wpwrakkewl. that piratepad crashes konqueror19:27
Fallenouoops, maybe I'd better e-mail this19:44
lekernelIf two processes are mapping the same PA with two different VA : we will have Cache coherency troubles.19:46
lekernelHow can we deal with this ?19:46
FallenouI'm not sure whether there was an answer to this problem or not on the mailing list19:47
lekernelthe TLB only affects the upper parts of the address... and if the lower part is sufficient to address the whole cache, the two mappings will end up at the same place in the cache19:47
lekerneluse just a software mapped TLB19:49
lekernels/mapped/managed19:49
Fallenouok, so no hardwired page table walker19:49
lekernelyes19:49
lekerneljust add special instructions that modify the TLB19:49
Fallenouyep19:49
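
A rough model of the division of work implied by a software-managed TLB (plain Python; the page-table layout, the tlb_write name and the exception plumbing are all invented): the hardware only looks the address up in the TLB and raises an exception on a miss, and the kernel-mode handler, running with translation disabled, walks the page table in software and writes the entry back with the special TLB-update instruction.

    PAGE_BITS = 12                     # 4 KiB pages assumed

    class TLBMiss(Exception):
        def __init__(self, vaddr):
            self.vaddr = vaddr

    class SoftTLB:
        """Hardware side: a plain lookup table, no page-table walker."""
        def __init__(self):
            self.entries = {}          # vpn -> (pfn, flags)

        def lookup(self, vaddr):
            vpn = vaddr >> PAGE_BITS
            if vpn not in self.entries:
                raise TLBMiss(vaddr)   # hardware just raises the exception
            pfn, flags = self.entries[vpn]
            offset = vaddr & ((1 << PAGE_BITS) - 1)
            return (pfn << PAGE_BITS) | offset, flags

        def tlb_write(self, vpn, pfn, flags):
            self.entries[vpn] = (pfn, flags)   # models the special instruction

    def miss_handler(tlb, page_table, vaddr):
        """Kernel-mode software: translation is off here, so the walk uses
        physical addresses directly (modelled by a plain dict)."""
        vpn = vaddr >> PAGE_BITS
        pfn, flags = page_table[vpn]   # a missing entry would fault the process
        tlb.tlb_write(vpn, pfn, flags)
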
lekerneland regarding cache aliases, there are no problems even with context switches, if you use virtually indexed physically tagged caches with page and cache sizes appropriately chosen19:51
Fallenouthis is the part of the mail exchange I didn't understand: the alias problems going away if page and cache sizes are appropriately chosen19:51
lekernelIn 1 clock cycle the TLB answers with its value, which is the 20-bit physical pfn + a few permission bits like READ/WRITE/EXECUTE => nope19:54
lekernelyou have separate data and instruction caches19:55
wpwraki don't think the VA1 != VA2 but PA1 == PA2 case goes away easily (assuming the page is used r/w)19:55
lekernelso you'll have separate data TLB and instruction TLB. then e.g. non-executable pages are simply not loaded into the instruction TLB19:55
Fallenouoh ok19:56
FallenouI thought it was possible to cope with only 1 TLB accessed by both Instruction and Memory stages19:56
Fallenouto save a bit of on-chip ram19:57
lekernelit is certainly possible, but probably more complicated19:57
lekernelit'd save on-chip RAM only if some pages are both data and code and therefore can be shared among the two TLBs, which rarely happens imo19:58
FallenouI guess sometimes it needs to happen19:58
Fallenou(injecting code via load&stores and jumping on it)19:58
lekernelthis only happens at program startup or shared library loads19:59
lekernel(afaik)19:59
Fallenoumaybe modules loading19:59
Fallenoubtw , do you want virtual kernel addresses or physical ones ?19:59
lekernelyes, but that too isn't something that should be optimized (especially since I'm not sure this would actually provide a noticeable speed gain)20:00
lekernelI think the simplest is to disable address translation in kernel mode20:00
Fallenouok so we just flush Caches and TLBs when doing weird things20:00
lekerneland the CPU also boots in kernel mode, so it's backward compatible20:00
Fallenouyep ok20:01
lekernelyou can use the LM32 "software exception" instruction to implement syscalls imo20:01
lekerneland exceptions should also put the CPU in kernel mode20:02
Fallenouthere is system call exception20:02
Fallenouscall instruction20:02
lekernelah, it's even called "system call" :) maybe they had something in mind20:02
Fallenouhehe yep =)20:03
lekernelwpwrak: I think page size = cache size solves a lot of problems20:03
Fallenougreat it's the case at the moment :'20:04
Fallenoubut I still don't see why20:04
Fallenoumaybe I need a drawing20:04
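
A small numeric illustration of the point (plain Python, invented example addresses): the low 12 bits of a 4 KiB page are identical in the VA and the PA, so as long as the cache index is taken entirely from those bits (way size <= page size), every mapping of a shared physical page selects the same cache line and no alias can form; with a larger way, index bits above the page offset can differ between mappings.

    PAGE_BITS, LINE_BITS = 12, 5               # 4 KiB pages, 32-byte lines

    def cache_index(addr, way_bits):
        return (addr >> LINE_BITS) & ((1 << (way_bits - LINE_BITS)) - 1)

    pa  = 0x00300ABC                           # shared physical location
    va1 = 0x10000ABC                           # mapping in process 1
    va2 = 0x20001ABC                           # mapping in process 2, same low 12 bits

    # way size == page size (4 KiB): the index uses only the shared low bits,
    # so both mappings and the PA pick the same cache line -> no alias
    assert cache_index(va1, 12) == cache_index(va2, 12) == cache_index(pa, 12)

    # way size 8 KiB: bit 12 enters the index and differs between the two VAs,
    # so the same data could end up cached in two different lines
    print(cache_index(va1, 13), cache_index(va2, 13))   # 85 213
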
lekernelFallenou: btw the caches are two-way set-associative20:04
Fallenouoh really ?20:04
Fallenoumy milkymist sources are not up to date I guess20:04
Fallenouoh or maybe associativity = 1 means two-way20:05
Fallenou0..120:05
FallenouI read the generate part too fast I guess20:05
lekernelthis isn't very complicated, you can just picture it as two direct-mapped caches running in parallel plus a policy that replaces a missed line in only one of the caches20:06
Fallenouhummm weird20:06
lekernelhm no, you're right20:06
Fallenouok20:06
lekernelsome mistake... it should be 2-way20:06
lekernelI think I changed this to work around some spartan6 xst problems and then forgot to restore it20:06
lekernelanyway, the TLB should work for 1-way and 2-way (and don't be afraid, it's not hard - you only need a second physical tag comparator for the second way)20:07
Fallenouyes it does not seem too hard20:07
Fallenoujust a multiplexer20:07
lekernelat each memory access you have 3 lookups: one in the TLB, one in the first way of the cache, and one in the second way20:09
Fallenouthen I compare two tags from two ways20:09
lekernelone pipeline stage later, you compare the PA from the TLB to the tag in each way of the cache20:09
Fallenouwith the TLB output20:09
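
Putting those two stages together, a rough plain-Python model of the two-way virtually-indexed, physically-tagged lookup (sizes and data structures invented: 4 KiB pages, 4 KiB per way, 32-byte lines, each way a list of (physical tag, line data) entries): stage one indexes both ways and the TLB in parallel using the untranslated low bits, stage two compares the physical tag from the TLB against each way.

    PAGE_BITS, WAY_BITS, LINE_BITS = 12, 12, 5   # way size == page size

    def vipt_lookup(vaddr, tlb, way0, way1):
        # stage 1: three parallel lookups on the untranslated low bits
        offset = vaddr & ((1 << LINE_BITS) - 1)
        index = (vaddr >> LINE_BITS) & ((1 << (WAY_BITS - LINE_BITS)) - 1)
        pfn = tlb[vaddr >> PAGE_BITS]            # TLB translates the upper bits
        line0, line1 = way0[index], way1[index]  # (physical tag, data) per way
        # stage 2: compare the physical tag from the TLB against each way
        ptag = pfn                               # with WAY_BITS == PAGE_BITS the
        for tag, data in (line0, line1):         # physical tag is simply the pfn
            if tag == ptag:
                return data[offset]              # hit
        return None                              # miss: refill one of the ways
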
Fallenouand I don't have to bother with TLB coherency between instruction and memory stages ?20:10
Fallenousoftware will have to deal with it (flushing i guess) ?20:11
lekernelnah, I think the software should take care of this20:11
Fallenouok good20:11
lekerneland since address translation is disabled in kernel mode, the kernel can always run even with inconsistent TLB content20:11
Fallenouright20:12
Fallenouthat's good, especially when you run the "tlb miss" exception handler :)20:12
Fallenoudo you mind if I send an e-mail to the ML in order to sum up all of this ? and maybe to ask if someone can give a crystal clear explanation of the page size = cache size simplification that makes our problems go away ?20:14
lekernelno I told you you can ask all the questions you want regarding the MMU topic20:15
lekernelthere was a thread on it http://comments.gmane.org/gmane.comp.multimedia.milkymist.devel/132320:15
lekernelsee Wesley W. Terpstra | 26 Apr 10:0120:16
Fallenouyep I read all this thread actually20:16
FallenouAny access to the shared page will use the same low bits index into that page (12 bits in our example) <=21:05
FallenouI really don't get this in Wesley e-mail :x21:05
Fallenouoh or maybe he says the two processes access the same byte of the same physical page21:06
Fallenouin this case same byte => same offset in the page => same 12 low bits21:06
Fallenouok :o21:06
lekernelthe lower (12?) bits are the same for PAs and VAs21:21
lekernelthe TLB only translates the upper bits21:21
Fallenouyes yes21:22
FallenouI think I begin to understand Wesley e-mail21:23
larsclekernel: hm, one thing i did not consider is that we still need the individual trigger signals.21:25
lekernellarsc: yes, but you can use one single FSM to generate those21:27
lekernelor just a "timeline"21:27
larscwhich is just a shift register?21:28
lekernelyeah basically, though for the one in corelogic it's a counter and comparators21:29
larschm21:29
larsci suppose that makes sense21:31
lekernelit's not always that simple though... eg if you have one pipelined actor feeding a sequential actor21:32
larscor maybe not since we have to keep track of multiple tokens21:32
kristianpaulhow do you track a token ? a nice mix of FSMs? :-)22:02
lekernelat the moment, with one piece of control logic per actor22:20
lekernellarsc: if there's a lot of actor time-sharing (which isn't implemented at all atm), it definitely makes sense to use a FSM/timeline for control though (then it becomes a bit like the PFPU)22:21
Fallenoulekernel: would you use dual ported ram for the TLB in order to make it easy for lookups to happen AND for any instruction to modify a TLB entry ?22:49
Fallenouor maybe use just a single ported ram and then more complex arbitration22:49
Fallenoulm32 seems to be using dual ported rams for dcache-data, icache-data, dcache-tags and icache-tags22:52
wolfspraulshould we make it possible for an expansion board to supply power to m1?22:52
larsclekernel: time-sharing as in tdm?23:00
Fallenougn23:09
wolfsprauln823:13
kristianpauli think it should not, unless you are talking about a battery pack expansion board for M1 :)23:47
wolfspraulsolar panel roof :-)23:47
wolfspraulnuclear battery board for infinite power23:47
kristianpaulhe23:52
kristianpaulnot infinite, even voyager is running out of power i read23:53
kristianpaulnasa said no more than 10 years left.. let's read this again by then :)23:56
--- Mon Jan 23 201200:00

Generated by irclog2html.py 2.9.2 by Marius Gedminas - find it at mg.pov.lt!