#milkymist IRC log for Wednesday, 2013-05-08

stekernI've been playing with running openrisc (mor1kx) on M1 again, this time using milkymist-ng. I've made the stuff available here: https://github.com/skristiansson/milkymist-ng-mor1kx/ if someone might be interested.07:03
stekernit was actually pleasantly painless to drop it in to milkymist-ng, most changes are of sw nature07:05
lekernelstekern, cool08:39
lekernelstill able to run at 83MHz?08:39
lekernelseems so... :)08:40
lekernelNumber of Slice LUTs:                      6,766 out of  27,288   24%08:43
stekernyup, it's not very small (yet)08:43
stekerna couple of features can be omitted (like the internal timer, overflow exceptions, add with carry etc)08:44
lekernelthat's for the whole SoC - doesn't look bad...08:45
lekernellet me get the precise number for the lm32 version08:45
stekernI got ~1700 slices for mor1kx and ~700 for lm32 I think08:45
lekernelNumber of Slice LUTs:                      4,700 out of  27,288   17%08:47
lekernelso, 2K LUTs to go ... :p08:53
stekernyep ;)08:53
stekernI'll update the configuration so it's more identical to lm32, but I think lm32 still beats it08:54
lekernelyou can try this too08:55
stekernthe or1k architecture has a lot of special registers that need quite some space08:56
lekernelcan't push them into BRAM?08:56
lekernelif they are seldom used, multi-cycle access may be acceptable08:57
stekernI've been meaning to run those benchmarks, up until now I've only run coremark and dhrystone08:57
lekernelbtw one thing I want to do with migen is automatic virtual BRAM ports using multiplied + phase aligned clocks08:58
stekernyeah, pushing them into bram is something that could be done to some of them, but the whole address space is a bit annoying (basically they are divided into groups)08:58
lekernelBRAM is fast, several hundred MHz, while the rest of the fabric is the slowness pig we know08:58
lekernelso you could easily have a 4-port BRAM out of a 2-port BRAM with 2x clock multiplication for many designs08:59
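A rough behavioural sketch of the idea above, in plain Python rather than migen: a 2-port memory clocked twice per system cycle serves four logical ports, two per phase. The class and method names are hypothetical, and the model ignores clock phase alignment and timing entirely.

```python
class DualPortRAM:
    """Behavioural model of a 2-port BRAM: two accesses per fast-clock cycle."""

    def __init__(self, depth):
        self.mem = [0] * depth

    def cycle(self, port_a, port_b):
        # Each request is (addr, write_enable, write_data) or None.
        # Reads return the value stored before this cycle's writes (read-first).
        results = [None if req is None else self.mem[req[0]]
                   for req in (port_a, port_b)]
        for req in (port_a, port_b):
            if req is not None and req[1]:
                self.mem[req[0]] = req[2]
        return results


class FourPortWrapper:
    """Four logical ports time-multiplexed onto a 2-port RAM running at 2x clock."""

    def __init__(self, depth):
        self.ram = DualPortRAM(depth)

    def slow_cycle(self, requests):
        # One system-clock cycle equals two fast-clock cycles.
        # Ports 0 and 1 are served in phase 0, ports 2 and 3 in phase 1,
        # so ports 2 and 3 observe writes made by ports 0 and 1 in the same cycle.
        assert len(requests) == 4
        phase0 = self.ram.cycle(requests[0], requests[1])
        phase1 = self.ram.cycle(requests[2], requests[3])
        return phase0 + phase1


ram = FourPortWrapper(16)
readback = ram.slow_cycle([(3, True, 42), None, (3, False, 0), None])
```

One consequence of this scheme is visible in the model: the four ports are not symmetric, since the later phase sees data written in the earlier one.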
stekernhmm, aren't bram outputs usually slower than register outputs?09:27
lekernelthere's some clock-to-output delay, yes09:29
lekernelon slowtan6 it's 2.10ns, or 1.75ns if you enable the output register (ie reads take 2 cycles, pipelined)09:31
lekernelsetup/hold are all under 1ns09:31
lekerneland you can clock at 280MHz max09:32
lekernelmost designs aren't that fast09:32
lekernelthis output register seems pretty useless, if all you get is 0.35ns of extra time at the output at the expense of one more cycle of latency ...09:33
lekernelyou certainly get a better deal by registering outside the BRAM09:37
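The trade-off can be put into numbers with a small sketch. The clock-to-output figures are the ones quoted above; the 83 MHz system clock is taken from earlier in the log, and the helper names are made up for illustration.

```python
T_CKO_UNREGISTERED_NS = 2.10  # BRAM clock-to-output without the output register
T_CKO_REGISTERED_NS = 1.75    # with the output register (one more cycle of latency)


def clock_period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def slack_after_bram_ns(freq_mhz, clock_to_out_ns):
    """Time left in one cycle for routing and logic once the BRAM output is valid."""
    return clock_period_ns(freq_mhz) - clock_to_out_ns


# What the output register buys per read, at the cost of an extra cycle.
gain_ns = T_CKO_UNREGISTERED_NS - T_CKO_REGISTERED_NS
# Share of an 83 MHz cycle that this gain represents.
gain_fraction_at_83mhz = gain_ns / clock_period_ns(83)
```

At 83 MHz the period is about 12 ns, so the register recovers only a few percent of a cycle while costing a whole one.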
lekernelanother crippled S6 feature it seems...09:39
stekernit's only for marketing, "with built-in registers" ;)09:40
stekernof course, if you need the result registered and the output delay isn't a problem, you'd benefit from them, but the use cases sound a bit restricted, yes09:43
Fallenou_lekernel: I don't remember the reason why Milkymist(-ng or not) SoC uses a lm32 core configured with 512 B of I and D cache, knowing that caches can go up to 32 kB on lm3213:57
Fallenou_keeping the caches below or equal to 4 kB is indeed helping for the MMU part (no cache alias problem)14:00
Fallenou_in some of your slides there is a graphic about cache hit probability, you seemed to have chosen 32 kB at that time in order to get cache hit 95% of the time14:04
Fallenou_but I remember you had synthesis issues with big caches as well ...14:05
Fallenou_on the other hand 512 B seems small, when you know you can go up to 4 kB without risking any cache aliasing (caused by VIPT cache)14:05
lekernelit's not 512 bytes, it's 256*16 bytes14:54
lekernelso 4K14:54
lekernelbig caches cause timing problems on slowtan614:54
Fallenouright, I mixed up things, I did 256*2 instead of *16 ... (habit of converting 16 bits to 2 bytes ...)14:57
Fallenouok so 4K, perfect :)14:57
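The arithmetic and the aliasing condition can be sketched as follows. This assumes a direct-mapped, virtually indexed and physically tagged cache and 4 kB pages; the function names are hypothetical and not part of lm32 or NetBSD.

```python
def cache_size_bytes(sets, line_bytes, ways=1):
    """Total cache capacity: sets * line size * associativity."""
    return sets * line_bytes * ways


def can_alias(sets, line_bytes, page_bytes):
    """True when a VIPT cache may hold the same physical line at two indexes.

    That happens when one way (sets * line_bytes) is larger than a page,
    because some index bits then come from the virtual page number.
    """
    return sets * line_bytes > page_bytes


# Configuration discussed above: 256 lines of 16 bytes each.
size = cache_size_bytes(sets=256, line_bytes=16)
aliasing = can_alias(sets=256, line_bytes=16, page_bytes=4096)
```

With these numbers the cache is exactly one page large, which is the biggest direct-mapped size that stays alias-free under the stated assumptions.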
lekernelI might want larger caches when moving to a less slow FPGA...14:59
Fallenouit would be cool to allow software to read cache size then14:59
Fallenoufor the OS to adapt and handle cache aliasing issues when they are possible15:00
lekernelsend a patch :)15:00
FallenouSH4 cpu allows to read cache size for instance15:00
FallenouI won't hesitate ;)15:00
Fallenouit may end up in CFG2 or maybe in a CFG315:00
Fallenoufor now, I will only handle the current cache configuration hard coded in NetBSD kernel15:01
Fallenoufirst things first :)15:01
GitHub97[NetBSD] fallen pushed 2 new commits to master: http://git.io/rq36JA15:04
GitHub97NetBSD/master 7be2287 Yann Sionneau: Update TODO15:04
GitHub97NetBSD/master ed27bd4 Yann Sionneau: Move TLB helpers into cpu.h15:04
GitHub46[NetBSD] fallen pushed 1 new commit to master: http://git.io/ZPUoqA15:06
GitHub46NetBSD/master 81a01e8 Yann Sionneau: Add implementation of pmap.9 MD functions...15:06
lekernelthe figure you are talking about is about the TMU cache, not the CPU cache15:09
Fallenouoh, ok15:10
Fallenou95% seemed a bit high :)15:10
stekernthat's interesting, I would have expected the opposite, larger cache, less tag bits to compare against, less timing problems.15:35
stekernFallenou: on or1k, you can read the cache size out of an spr, but then we are 2000 luts larger than lm32 too ;)15:37
lekernelyou need more BRAMs too15:38
lekernelso they spread on more area on the chip, and then the particularly slow S6 routing does the rest ...15:39
Fallenoustekern: that's convenient :)15:42
stekernlekernel: ah, yeah, that of course makes sense, several brams might slow things down15:46
stekern(aliasing) that's another thing that's slightly annoying in or1k, you max out on 16kb with a 2-way cache if you don't want to worry about it15:50
stekernwhat page size did you decide on in the end?15:51
Fallenoufor now I'm going for 4 kB pages15:53
Fallenouit seems to be the size used almost everywhere (except "big pages" options and such)15:53
stekernoh, so you're actually worse off in that regard ;)16:06
Fallenouwhen you say 16 kB, is it total cache size ? taking into account the associativity ?16:08
stekernwhat are the benefits of having a smaller cache size, for me it was already predefined to 8kb when I did the mmus for mor1kx, so I haven't given it much thought16:08
stekernerr, smaller page size16:09
Fallenouyou get more fine grain management of your virtual memory16:09
Fallenouso less fragmentation I would say16:09
FallenouI mean, the kernel in a few places allocates in multiples of the page size16:10
Fallenoubut it does not need that much memory usually (8 kB or 16 kB)16:10
FallenouBut personally I didn't give the page size a big thought16:11
stekernyeah, that's obvious of course, but is there something else? perhaps that's reason enough though16:12
FallenouI took 4 kB for granted because it's almost everywhere in the literature16:12
Fallenoufor instance, on recent linux kernel for x86, do they use bigger pages? (for Ubuntu, debian etc)16:12
FallenouI know there is an option for that, but I don't know if it's checked or not16:13
stekern(16kb) yes, total cache, 2*8kb16:13
GitHub177[milkymist-ng] sbourdeauducq pushed 1 new commit to master: http://git.io/b65uyw16:13
GitHub177milkymist-ng/master 8e76c96 Sebastien Bourdeauducq: timer, uart: EventSourceLevel -> EventSourceProcess16:13
GitHub42[migen] sbourdeauducq pushed 1 new commit to master: http://git.io/f9HVcQ16:14
GitHub42migen/master b9b6df6 Sebastien Bourdeauducq: bank/eventmanager: refactor, rename EventSourceLevel -> EventSourceProcess, add fully externally controlled event source16:14
stekernyou could of course do more ways, but then the replacement logic becomes a lot more complicated16:14
Fallenouwell not that much, if you use round robin for instance16:15
stekernat least if you use lru16:15
Fallenoubut it's still the same issue of routing more block rams etc16:15
FallenouI wonder if lru has a big impact on performance16:16
Fallenouwhen you have 2 ways for instance16:16
stekernlru for 2-way is dead simple16:18
stekernjust 1 bit to check against16:19
stekernbut, I agree, round robin for 4-way wouldn't be that complex16:20
FallenouI mean, is it really better than rr ?16:20
stekernmaybe I should do that, keep lru for 2-ways, and do rr for 4-ways16:21
stekernit probably depends on the application, I haven't done any comparisons16:22
stekernbut i've got the impression that it would be better16:23
FallenouI would think that when you increase the number of ways, indeed lru can start to get interesting, because you really have a "bigger" choice to make16:26
Fallenou1 among 4 (or 8 or more)16:27
Fallenoubut 1 among 2 seems a poor choice anyway16:27
Fallenouusing rr or lru16:27
FallenouI think only a very precise software benchmark could really show better performance with lru than rr when the associativity is 216:28
Fallenoubut it's just a feeling, I don't really know :)16:28
stekernyeah, it probably doesn't make a big difference16:31
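The two policies compared above can be sketched in a few lines. This is a software model only, not mor1kx or lm32 code; the class names are hypothetical, and keeping one round-robin counter per set is an assumption (a single global counter is another option).

```python
class LRU2Way:
    """LRU for a 2-way cache: one bit per set names the least recently used way."""

    def __init__(self, num_sets):
        self.lru_way = [0] * num_sets

    def touch(self, set_index, way):
        # After an access to `way`, the other way becomes least recently used.
        self.lru_way[set_index] = 1 - way

    def victim(self, set_index):
        return self.lru_way[set_index]


class RoundRobin:
    """Round robin for N ways: a per-set counter that advances on each replacement."""

    def __init__(self, num_sets, ways):
        self.ways = ways
        self.next_way = [0] * num_sets

    def victim(self, set_index):
        way = self.next_way[set_index]
        self.next_way[set_index] = (way + 1) % self.ways
        return way


lru = LRU2Way(num_sets=4)
lru.touch(0, 0)            # way 0 was just used in set 0
lru_victim = lru.victim(0)

rr = RoundRobin(num_sets=4, ways=4)
rr_victims = [rr.victim(2) for _ in range(5)]
```

The LRU state updates on every hit, while the round-robin counter only changes on a replacement, which is part of why it stays simple at higher associativity.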
GitHub17[migen] sbourdeauducq pushed 1 new commit to master: http://git.io/5GmfyA17:00
GitHub17migen/master 10212e8 Sebastien Bourdeauducq: dma_asmi: cleanup17:00
GitHub59[milkymist-ng] sbourdeauducq pushed 1 new commit to master: http://git.io/-OfjOA19:00
GitHub59milkymist-ng/master 89dbc37 Sebastien Bourdeauducq: cif: do not generate write function for CSRStatus19:00
GitHub126[migen] sbourdeauducq pushed 1 new commit to master: http://git.io/8HoOow19:01
GitHub126migen/master c82b53f Sebastien Bourdeauducq: bank/description/AutoCSR: add autocsr_exclude19:01
GitHub48[milkymist-ng] sbourdeauducq pushed 3 new commits to master: http://git.io/T96SmA20:33
GitHub48milkymist-ng/master 29efa85 Sebastien Bourdeauducq: dvisampler: new DMA engine (buggy)20:33
GitHub48milkymist-ng/master b3d87e1 Sebastien Bourdeauducq: software/videomixer: use new DMA engine20:33
GitHub48milkymist-ng/master 66b4bae Sebastien Bourdeauducq: top: connect dvisampler DMA IRQs20:33
GitHub82[milkymist-ng] sbourdeauducq pushed 1 new commit to master: http://git.io/lTVgFQ20:52
GitHub82milkymist-ng/master d685ed2 Sebastien Bourdeauducq: dvisampler/dma: bugfixes20:52
lekernelyay! all works now.21:00
lekernelthere is some noise on the picture that I suspect is due to poor SI21:00
wpwrakyou get clean frames ?21:00
wpwrakcongratulations !21:00
lekernelon the VGA framebuffer, and in color :)21:00
lekernelwith just a couple random pixels21:01
lekernelprobably SI, it gets worse when the pixel clock increases21:01
lekerneland 800x600 is pure noise (even sync fails)21:04
lekernelwell I hope the direct TMDS board will fix this issue21:04
lekernelI think I can consider myself lucky that at least 640x480 works :) it's pretty much on the brink of failure, and debugging SI would waste days...21:05
wpwrakyeah, the expansion header isn't a great place for high-speed signals21:05
wpwrakindeed. and now you can also do the mixing and fading :)21:06
lekernelif I can get my chB to work... parts for assembling two extra boards are with fedex atm...21:07
Fallenoucongratz :)21:09
wpwrakalways order a generous number of spares :)21:10
Fallenouthere is no correction code on dvi to fix SI caused errors?21:10
larscHDMI has BCH for data islands though21:16
larscDVI is really just VGA in digital21:16
--- Thu May 9 2013 00:00
