#milkymist IRC log for Monday, 2011-08-15

FallenouI just discovered it's easy to generate a .pkg or a .dmg once you've written a Portfile (MacPorts)00:09
Fallenoujust a one-line command00:09
FallenouSo I generated lm32-rtems-binutils.dmg and lm32-rtems-gcc.dmg00:09
Fallenouwill upload it somewhere on the wiki I guess00:09
zumbiis it possible to reverse engineer an FPGA bitstream?00:11
Fallenouyes it is00:11
Fallenousome companies are generating netlists from bitstreams00:11
zumbiyes, I heard that00:11
zumbibut which tools are needed?00:12
Fallenouthat's why usually bitstreams are encrypted00:12
Fallenouto prevent you from reverse engineering it00:12
FallenouI have no idea00:12
FallenouI never tried such a process00:12
FallenouBut I guess you won't find tools to do that easily out there00:12
zumbiI thought so... :/00:12
Fallenouif such tools exist, they must be jealously kept by those who wrote them00:13
Fallenoubut I don't know, never really searched for it00:13
Fallenoumaybe lekernel knows of such software00:13
Fallenouask him when he will be back :)00:13
zumbiok, thanks00:13
Fallenouhe is the FPGA expert00:13
wolfspraulhow close in format are the bitstreams between different fpga vendors?00:18
kristianpaulzumbi: in the meantime you can check http://lekernel.net/blog/2010/04/fpga-reverse-engineering-challenge-hackito-ergo-sum/00:44
kristianpauljust in case00:44
kristianpaululgic website had this information, dunno what happened..00:44
zumbikristianpaul: hey! yes I found ulgic site, but nothing there.. but it looks like it had something in the past01:03
zumbiwolfspraul: each vendor has different bitstream afaik01:04
wolfspraulyes different, but I'm wondering whether the fundamentals are different, or more 'how' different they are01:06
zumbii don't really know01:06
zumbibut i suspect those differ quite a bit01:07
awwpwrak, about the A4809E3R-440DN, 4.312-4.488 V; bad that we need to search for a compatible part at digikey or mouser for easier sample orders.04:41
wolfspraulaw: in the future, we choose components preferably from standard digi-key parts unless there is a very good reason to not do so06:23
awwolfspraul, okay06:24
awincluding Mouser? or NO?06:25
wolfspraulalso OK. _COMMON_ part, that's the key06:25
wolfspraulthe choice of the AIC reset part looks wrong to me. in hindsight we are always smarter but I see nothing that's good about it.06:26
wolfspraulwe even had to buy a whole reel of 3000 parts for 270 USD. all wrong ;-)06:26
wolfspraulthat alone costs 3 USD / board for a run of 90, and 2910 parts forever in our 'archive of bad sourcing decisions'06:26
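The reel arithmetic above can be checked with a short sketch. The figures (a 270 USD reel of 3000 parts, a run of 90 boards) come from the discussion; one part per board and the function name are assumptions made for illustration.

```python
def reel_cost_per_board(reel_price_usd, reel_size, boards, parts_per_board=1):
    """Return (cost per board, leftover parts) when a whole reel must be bought."""
    used = boards * parts_per_board
    if used > reel_size:
        raise ValueError("run needs more parts than one reel holds")
    # The whole reel price is spread over the boards actually built.
    return reel_price_usd / boards, reel_size - used


cost, leftover = reel_cost_per_board(270, 3000, 90)
print(f"{cost:.2f} USD per board, {leftover} parts left over")
```

With these inputs the sketch reproduces the 3 USD per board and the 2910 unused parts mentioned in the log.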
wolfspraulif we are lucky, we find a matching part from another manufacturer, but I won't hold my breath06:27
wolfspraulthere's a lot of reset ics, but once you go through the exact requirements we have here it shrinks fast (I did a little digikey searching...)06:28
wpwrakhah, i was wondering how that part ended up in M1 :) and i'd say the best parts come from digi-key and at least one other source :)07:21
wpwrakwouldn't do if some shiny new part was previously on digi-key's archive of bad sourcing decisions ;-) well, they tell you when it becomes non-stocked, so i guess that's a warning07:22
wpwrakhmm, what's the maximum "5V" voltage the chip needs to survive ? are 6 V enough ?07:38
wpwrak(the A4809 goes up to 12 V)07:38
wolfspraul6V sounds enough (a bit more would probably be better though, I assume this is coming directly from the power adapter?)07:40
wpwrakdirectly after L10, so after the protection circuit, if that one is still around (not sure what the status is there, i remember you have some problems with it)07:41
wolfsprauldon't understand07:43
wolfspraulwhat are you getting at?07:43
wpwrakweren't there some issues with the protection circuit causing troubles ? or are they resolved now ?07:44
wpwraksomething like bad beads07:44
wpwrakor a bad fuse or such07:44
wpwraki don't remember the details. only that some parts were removed. but i don't know if this applies to rc3.07:45
wolfspraulall problems turned out to be faulty measurements07:46
wpwrakoh, cool :) very good. then 6 V should be plenty :)07:47
wpwrakhow about this guy ? http://search.digikey.com/scripts/DkSearch/dksus.dll?Detail&name=APX803-44SAG-7DICT-ND07:51
wpwraka bit power-hungry in comparison, but who cares07:51
wolfsprauloh wow, that is the same one I found earlier, I guess my skills do slowly rise a bit up from zero...07:53
wolfspraulit says 140ms 'minimum' reset timeout07:53
wolfspraulwhat does that mean? the current one has 200 ms07:53
wolfspraulalso I wasn't sure whether the pins are at the right places, can it be dropped on the existing rc3 footprint?07:54
wpwraklook at the range. seems to be a very fuzzy parameter. it's nominally 200 ms, too07:54
wpwrak(pin-compatible) as far as i can tell, yes. same size, same pin assignment.07:55
wpwrakthis one looks like a second source: http://search.digikey.com/scripts/DkSearch/dksus.dll?Detail&name=576-3834-1-ND07:55
wpwrakalso exists in a slower variant ;-) (1.12 s)07:55
wolfspraulwe can just buy several at once to try some theories, if that helps07:57
wpwrakthat may not be a bad idea. something for the R&D lab :)07:58
wpwrakSOT-23 for such a part seems mighty big, though07:58
wpwraklet me run a package comparison ...07:58
lekernelsot-23 is fine... that's what's being used atm08:02
wpwrakhmm no, sot-23 seems to be the most common choice08:02
lekernelso there is space for it08:02
lekerneland it's easier to rework in case of yet another fuckup08:02
wpwraklekernel: yes, i was looking for what makes sense to stock for future R&D08:03
wpwrakof course :)08:03
wolfspraulaw: you there? before you reflash your next board, can you ping us here? then we can try to force USB into full-speed mode as Werner described08:06
wolfsprauljust wait until the next time you need to reflash, then we do it...08:07
awwolfspraul, alright.08:08
wpwrak3rd candidate: http://search.digikey.com/scripts/DkSearch/dksus.dll?Detail&name=MAX6348UR44%2BTCT-ND08:09
wolfspraulwow is that expensive08:09
wolfspraulthe first one is best08:09
wpwrakwell, 3rd source :)08:10
wolfspraulof course for samples we can buy a few08:10
wolfspraulI'm wondering how you can sell something that's 5 or 10 times more expensive than a competitor that can be used as a drop-in replacement08:10
wpwrak(buy maxim) i wouldn't bother. they're for the "due diligence" appendix ;-)08:10
wolfspraulmaybe they have some very outstanding performance parameters that some customers need?08:11
wolfspraulor tolerances? or some customers just totally trust their brand?08:11
wpwrakmaybe the military likes them ? :)08:11
wolfspraulok, some old government contracts or other large bureaucratic customers keeping those parts alive? another option08:12
wolfspraulthe diodes inc. one is ca. 14 cents / 1k, the Maxim one 1.35 USD / 1k08:13
wolfspraulalmost 10 times more08:13
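As a quick check of the "almost 10 times" remark, the ratio can be computed from the two prices quoted above. Both figures are the approximate per-part prices at 1k quantity given in the log, and the variable names are only illustrative.

```python
diodes_usd = 0.14  # Diodes Inc. part, approx. price per part at 1k (from the log)
maxim_usd = 1.35   # Maxim part, price per part at 1k (from the log)

ratio = maxim_usd / diodes_usd
print(f"Maxim part costs about {ratio:.1f}x the Diodes Inc. part")
```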
wpwrakmaybe it's because they have such a large choice of parameter and output configurations08:15
wpwrakof course, for all we know, AIT may have a lot more. that data sheet alone could "generate" something like 700 different parts.08:16
wolfspraulAIT parts were used on Ben/AVT08:17
wpwrak(if all specified part number combinations really exist, which seems unlikely)08:17
wpwrakaah, that's where it comes from :)08:17
wolfspraulyes, I also wondered :-)08:17
wpwrakit had that "friends from taiwan" feeling to it :)08:17
wpwraklike so many of those parts we had in openmoko. without data sheets, no second source in the known universe, etc. :)08:18
wolfspraulwe can't even say much about the part or manufacturer, but us being such a little guy with so much design verification and changes all the time, it's a difficult source08:18
wpwrakand of course, the company dead before openmoko :)08:18
wolfspraulonce you are making large quantities of whatever all the time, they may be the best source of all08:19
wolfspraulwho knows08:19
wpwrakyou can always switch back once you're sure08:19
wolfspraulin large quantities datasheet availability doesn't matter08:19
wolfsprauloh yes, definitely08:19
wpwrakthere's probably great potential in penny-pinching parts08:19
wpwrakeach cent you save is a million dollars once you reach 100M+ quantities :)08:20
wolfspraulwhat matters is that your source can follow your forecast flexibly, that the quality of their parts is stable, that you have a good sales contact for problems, etc.08:20
wolfspraulbut at our quantities and level of uncertainty, that's all pretty much the last thing we worry about :-)08:20
wpwrakthat's what we dream of worrying about ;-)08:21
wolfspraulok so those 3 reset parts are all the same idea, should we buy a few of each? anything else to add?08:23
wolfspraulI understand that this fix is surely a fix, since with the 2.6v reset ic we are out of spec. so the fix is correct in any case. the unknown is whether it fixes the flash corruption.08:23
wolfspraulif it does - nothing else to worry about. if it does not - then what?08:24
wpwrakif it doesn't, then it may be a sw or fpga problem. e.g., sending out spurious transactions08:25
awnew steps:1. insert DC jack08:26
aw2. middle button08:26
aw3. wait for booting, wait for render, let it render 30 seconds08:26
aw4. unplug DC jack08:26
aw5. insert DC jack08:26
aw6. press middle button but then run the test software over jtag serial08:26
aw7. run the test software only until the CRC check is finished, and record the results08:26
aw8. if the CRC check fails, abort the render cycles here08:26
aw9. if the CRC check passes, unplug DC jack08:26
aw10. go back to step #108:26
wpwraki would only get the ~200 ms from diodes and the 1.12 s from micrel08:26
wolfspraul1.1s ?08:26
awnow 0x7c is available. hope that we run into a flash problem soon08:26
wolfspraulaw: you ran 10 render cycles with crc checks on 0x7c?08:27
wolfspraulremember when you do the next flashing, ping us here08:27
awhope from now on can catch flash problem then dig into08:27
wolfspraulfor the usb full-speed thing08:27
wpwrakaw: (new steps) sounds good. i wouldn't call it a "render cycle", though :)08:28
awokay. ping guys. ;-)08:28
wolfspraulah ok08:28
wolfspraullet's see (opening werner's instructions :-))08:28
wolfspraulaw: get the board ready, plug usb cable into your notebook as usual08:28
wolfspraulafter connecting the cable, run 'dmesg'08:29
wolfspraulin the last few lines, you should see something like "usb 2-1: new high speed USB device [...]"08:29
wolfsprauldo you see that?08:29
wpwrak(nor corruption analysis) i think we'll know more about this when we get better data from the crc experiment. e.g., whether there are patterns in where and when it strikes.08:30
awwolfspraul, what does this mean? for each board or when meet "next flashing"?08:30
awyes, i just saw Werner's email and marked firstly08:30
wolfspraullet's try now08:30
wolfspraulif it works, we will probably do it for each board08:30
wolfspraulbut let's try08:30
wolfspraulyou ready?08:30
wolfspraul1. plug in usb cable, like you normally flash08:30
wolfspraul2. run 'dmesg'08:31
wpwrak(analysis) so far, we only have very spurious results, and many have causal dependencies in them, which further twist the probabilities. so it's hard to tell anything from the existing data, except that bad things happen.08:31
wolfspraulwpwrak: how would you call it [instead of render cycle]08:31
wolfspraul'render cycle' because it's a full cycle from power on to rendering back to power off08:31
wpwrak(cycle) does the cycle even involve rendering anything ? i thought it was now just  power up -> CRC -> power down08:32
wolfsprauldid you read the list #1 - #1008:32
wpwrakor does the test sw render ?08:33
aw[15332.010338] ftdi_sio 2-3:1.1: device disconnected08:33
aw[15491.956073] usb 2-3: new high speed USB device using ehci_hcd and address 808:33
aw[15492.093767] usb 2-3: configuration #1 chosen from 1 choice08:33
aw[15492.096349] usb 2-3: Ignoring serial port reserved for JTAG08:33
aw[15492.099598] ftdi_sio 2-3:1.1: FTDI USB Serial Device converter detected08:33
aw[15492.099664] usb 2-3: Detected FT2232H08:33
aw[15492.099670] usb 2-3: Number of endpoints 208:33
aw[15492.099676] usb 2-3: Endpoint 1 MaxPacketSize 51208:33
aw[15492.099682] usb 2-3: Endpoint 2 MaxPacketSize 51208:33
wolfspraulboot, render (30 seconds), cycle, test software (crc)08:33
aw[15492.099687] usb 2-3: Setting MaxPacketSize 51208:33
aw[15492.099989] usb 2-3: FTDI USB Serial Device converter now attached to ttyUSB008:33
wolfspraulok enough08:33
wolfspraulaw: do you see the "usb 2-3:"08:33
wpwrakah, step 3 !08:33
wpwrakright, i skipped some steps :) i wouldn't power cycle twice per loop08:34
wolfspraulso that means the m1 board was connected to your notebook _BUS_ 2, and _PORT_ 308:34
wolfspraulaw: ok?08:34
wpwrakdrop steps 2-508:34
wolfspraulwpwrak: wait let me do the full-speed thing first08:34
wpwraknod :)08:35
awwolfspraul, yes, saw "usb 2-3"08:35
wolfspraulI can do multiple threads in parallel, but Adam probably cannot08:35
wpwrakyes yes :)08:35
wolfspraulaw: ok, that means 'bus' 2, 'port' 308:35
awi can't. now do full speed first08:35
wolfspraulnow: echo 3 >/sys/bus/usb/drivers/usb/usb2/../companion08:36
wolfspraulthe '3' and '2' are coming from your dmesg output08:36
wolfspraulnote that there needs to be a space after the '3' in "echo 3 >/sys/bus/..."08:37
wolfspraulafter executing the 'echo' line, run 'dmesg' again and paste some lines from the end here08:38
awadam@adam-laptop:~/m1_adam/snapshots/2011-07-13/for-rc3$ echo 3 >/sys/bus/usb/drivers/usb/usb2/../companion08:38
awbash: /sys/bus/usb/drivers/usb/usb2/../companion: Permission denied08:38
awping? you there?08:40
wpwraksudo, yes08:40
wpwraksudo /bin/bash08:41
wpwrakthen run the command08:41
xiangfuecho 3 | sudo tee /sys/bus/usb/drivers/usb/usb2/../companion08:41
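The mapping from the dmesg output to the command above can be sketched as follows. This is a minimal illustration, not a tool used in the log: the regular expression and the function name are assumptions, and the sysfs path is copied verbatim from the command in the conversation. It only handles a device plugged directly into a root port ("usb 2-3:"), not one behind a hub.

```python
import re

# Matches lines such as
# "usb 2-3: new high speed USB device using ehci_hcd and address 8"
NEW_DEVICE = re.compile(r"usb (\d+)-(\d+): new (?:high|full|low)[- ]speed USB device")


def companion_command(dmesg_line):
    """Build the 'force full-speed' command for the port named in a dmesg line.

    Returns None if the line is not a 'new ... speed USB device' message.
    """
    match = NEW_DEVICE.search(dmesg_line)
    if match is None:
        return None
    bus, port = match.group(1), match.group(2)
    # Path layout taken as-is from the command used in the log.
    path = f"/sys/bus/usb/drivers/usb/usb{bus}/../companion"
    return f"echo {port} | sudo tee {path}"


line = "[15491.956073] usb 2-3: new high speed USB device using ehci_hcd and address 8"
print(companion_command(line))
```

For the line pasted earlier, bus 2 and port 3 give the same command that was run here. The sketch only builds the string; it does not touch sysfs itself.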
wpwrakwow :)08:42
aw[15492.099989] usb 2-3: FTDI USB Serial Device converter now attached to ttyUSB008:43
aw[16147.368219] usb 2-3: USB disconnect, address 808:43
aw[16147.368829] ftdi_sio ttyUSB0: FTDI USB Serial Device converter now disconnected from ttyUSB008:43
aw[16147.368870] ftdi_sio 2-3:1.1: device disconnected08:43
aw[16147.624074] usb 6-1: new full speed USB device using uhci_hcd and address 208:43
aw[16147.767106] usb 6-1: not running at top speed; connect to a high speed hub08:43
aw[16147.795229] usb 6-1: configuration #1 chosen from 1 choice08:43
aw[16147.803204] usb 6-1: Ignoring serial port reserved for JTAG08:43
aw[16147.807510] ftdi_sio 6-1:1.1: FTDI USB Serial Device converter detected08:43
aw[16147.807554] usb 6-1: Detected FT2232H08:43
aw[16147.807557] usb 6-1: Number of endpoints 208:43
aw[16147.807559] usb 6-1: Endpoint 1 MaxPacketSize 6408:43
aw[16147.807562] usb 6-1: Endpoint 2 MaxPacketSize 6408:43
wpwrakyes ! :)08:43
aw[16147.807564] usb 6-1: Setting MaxPacketSize 6408:43
aw[16147.808882] usb 6-1: FTDI USB Serial Device converter now attached to ttyUSB008:43
awmm...full speed device now.08:43
wpwraktriumph ! :)08:43
awthen? does this mean that I have to enter commands everytime when test each board?08:44
wpwrakmake sure you use the longest cable you have ;-)08:44
wpwrakthe port configuration should be permanent (until you reboot the PC)08:44
awoah..sorry that i used a shorter cable..okay...change to long cable08:44
wpwrakbut you can check with dmesg. unplug and replug, then see if it still comes up as full-speed08:44
awumm..sounds good (until reboot the PC)08:45
wolfspraulwhy long cable?08:45
awi see08:45
wolfspraulwe are not trying to fix every bug on the planet08:45
wpwrakworst case: you need to run the command each time you re-plug the usb-jtag08:45
wpwrakwolfspraul: opportunistic testing :)08:45
wolfspraulwait, let's be clear and precise08:45
awhmm...sounds different idea..i standby and listening firstly. ;-)08:45
wolfspraulI am focusing on the run of 90 boards, already badly delayed08:46
wolfspraulwe can postpone discoveries of all kinds until after sales have started08:46
wolfspraulfull-speed is good08:46
wolfspraulAdam can switch to 100% full-speed for the rest of the run now08:46
wolfspraulbut I would say the same thing about the short cable08:46
wolfspraulwe are trying to fix rc3 bugs, not make sure Adam's entire lab is bug free08:46
wolfspraulmy opinion08:47
wpwrak(postpone) well, as you wish. confirmation that full-speed is the cure may create an action item before shipping, though.08:47
wolfspraulcure of which bug?08:47
wolfspraullibusb bug?08:47
wolfspraulwe don't even know which bug :-)08:47
wpwrakcure of the reflash failures08:47
wpwrakwell, there's that, yes08:48
wolfspraulaw: which m1 board do you have attached now?08:48
wpwrakof course, are we sure there's even a bug in libusb ? :)08:48
wolfspraulthat's exactly what I want to avoid getting into now08:48
awwolfspraul, 0x7c08:48
wpwrakthat's the fun bit with stochastic bugs - it happens, then you change X and it doesn't happen. but are you sure it went away because you changed X or just because you didn't test often enough ? :)08:49
wpwrakanyway, we can deal with this later, okay08:49
wolfspraulaw: above you said 0x7C is available (testing finished)08:49
wolfspraulare you planning to reflash 0x7C now?08:49
wpwraki think a fully tested and okay board is a good start08:49
wpwrakno need to reflash until CRC errors happen08:49
wolfspraulyes but I don't understand whether or why Adam wants to reflash 0x7C now, if he just said it's 100% pass08:50
wolfspraulprobably a misunderstanding somewhere...08:50
awwolfspraul, yes 0x7c was done successfully with "new steps" for rendering.08:50
wolfspraulaw: ok, so that sounds like 0x7C is finished.08:51
wolfspraullet's make a little test with our new full-speed happiness08:51
awbut 0x7c not ready for reflashing with "full speed" reflash. i just tried to learn commands. ;-)08:51
awso what's next step here though?08:52
awor just when I meet d2/d3 dimly list again? then ping here?08:52
wolfspraulaw: you don't need to reflash anything just because the USB speed is full-speed now08:53
wolfspraulthe idea is that for new boards that you reflash from now on, you make sure they are flashed in full-speed08:53
awso i keep using shorter usb cable and fix usb failure boards first. ;-)08:53
wolfspraulaw: should we try a test on 0x32 ?08:53
wolfspraulthose things are unrelated08:54
wolfspraulyes, keep using the short cable08:54
wpwrakaw: you should reflash after each CRC failure. we assume that "d2/d3 dim" would also be a CRC failure. but there can be other CRC failures that do not cause "d2/d3 dim"08:54
wolfspraulaw: I just told him earlier to not reflash after crc failure to not remove evidence.08:54
wolfspraulI meant wpwrak : I just told adam earlier ...08:54
wolfspraulthat's the hard part now, avoiding confusion08:55
awumm...confused me now08:55
wolfspraulaw: wait one second08:55
wolfspraulyes of course08:55
wpwrakaw: well, detect CRC error -> analyze -> reflash :)08:55
wolfspraulaw: long usb cable versus short usb cable08:55
wolfspraulthat one first08:55
wolfspraulaw: please do _NEVER_ use the long cable08:55
wolfspraulalways use the short cable08:55
wolfspraulUSB full-speed versus high-speed08:55
wpwrakwolfspraul: i would use it later :) it's still the cable you ship, no ?08:55
wolfspraulno confusion08:56
wolfspraulI want to finish this run.08:56
wolfspraulaw: USB full-speed versus high-speed08:56
wolfspraulfrom now on, please always set your USB speed to full-speed on your notebook, before running reflash_m1.sh08:56
awokay: 1. NEVER use long cable 2. from now on use full speed commands08:57
wolfspraulyou can check in dmesg08:57
wolfspraulwhen you see "new high-speed device detected", that's wrong08:57
wolfspraulyou want to see "full-speed"08:57
wolfspraulif it says 'high-speed', you need to use the echo command to force it to full-speed08:57
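The dmesg check described above could be automated roughly like this. It is a sketch under assumptions: the message wording is the one seen in the dmesg pastes in this log (the pattern also accepts a hyphenated "high-speed" spelling), the most recently attached USB device is assumed to be the JTAG adapter, and both function names are hypothetical.

```python
import re

# "usb 6-1: new full speed USB device using uhci_hcd and address 2"
SPEED = re.compile(r"usb (\d+-[\d.]+): new (high|full|low)[- ]speed USB device")


def last_enumeration_speed(dmesg_text):
    """Return (device, speed) for the most recent enumeration line, or None."""
    matches = SPEED.findall(dmesg_text)
    return matches[-1] if matches else None


def ok_to_reflash(dmesg_text):
    """True only if the most recently attached device enumerated at full speed."""
    latest = last_enumeration_speed(dmesg_text)
    return latest is not None and latest[1] == "full"


sample = """\
[15491.956073] usb 2-3: new high speed USB device using ehci_hcd and address 8
[16147.368219] usb 2-3: USB disconnect, address 8
[16147.624074] usb 6-1: new full speed USB device using uhci_hcd and address 2
"""
print(last_enumeration_speed(sample))
print(ok_to_reflash(sample))
```

If `ok_to_reflash` returns False, the echo command has to be run again before reflash_m1.sh.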
awalright to use echo commands08:58
wolfspraulyou can check it every time after you plug in a new m1, before running reflash_m1.sh08:58
wolfspraulaw: do you think it's clear how you check full-speed, and force to full-speed ?08:58
wolfspraulif you are not clear, before running reflash_m1.sh, just paste the last lines from dmesg here08:58
awwolfspraul, clear...from now on always use full-speed even i don't meet d2/d3 dimly or else flash problems.08:59
wolfspraulbecause you have to set full-speed _BEFORE_ running reflash_m1.sh08:59
wolfspraulso you cannot know at that point whether you run into problems or not09:00
awsurely if I meet flash problem again, i ping here in parallel09:00
wolfspraulthe full-speed thing is _BEFORE_ running reflash_m1.sh09:00
wolfspraulevery time before you run reflash_m1.sh, you check the full-speed thing09:00
wolfspraulabout reflashing09:00
wolfspraulI think after you have flashed some m1 board once, and it can boot (and render), after that you should _NEVER_ reflash it a second time.09:01
awokay.you just told me. ;-)09:01
wolfspraulif you run into any error after a successful rendering, leave the board untouched. just note the error, and put the board aside.09:01
wolfspraulwe can then study the test results and think about which step to take on which board.09:02
wolfspraulthat's all from me :-)09:02
wolfspraul3 easy items09:02
wolfspraul1) always use short cable09:02
awokay...and keep testing another board first09:02
wolfspraul2) always make sure USB is in full-speed before running reflash_m1.sh09:02
wolfspraul3) do not reflash a board again after it has already rendered09:02
wolfspraulaw: wait, one more thing09:03
wolfspraulI want to make one test with 0x3209:03
wolfspraulaw: can we make one special test with 0x32 ?09:03
awnow? why not?09:04
wolfspraulwell, just asking :-)09:04
wolfspraulso yes, please get 0x32, and plug it in, and paste the last lines of dmesg here09:04
wpwrakhmm, i'd rather focus on one board at a time. so 0x7c. cycle until CRC, analyze CRC, then reflash and test 0x7c some more09:04
wolfspraul"plug it in", I mean usb jtag09:04
wpwrakwolfspraul: 0x32 is with usb problems ?09:05
wolfspraulyou said all flashing problems would go away with full-speed09:05
wolfspraulI'm trying to pick one where that may actually happen, maybe 0x32 http://en.qi-hardware.com/wiki/Milkymist_One_run_3_schedule#Test_Results09:05
wolfspraul0x34 was strange, so maybe not09:05
wolfspraulit's just a quick test09:06
wolfspraulin 2 minutes we know that full-speed will make no difference on 0x32 :-)09:06
wolfspraulhe he09:06
wolfspraulgrant me that joy09:06
wpwraksure. let's get the small things out of the way first :)09:07
wolfspraulif this works, I'll get my first afternoon beer09:07
wpwrak0x3a too, no ?09:07
aw0x32: [17567.832765] ftdi_sio 6-1:1.1: device disconnected09:07
aw[17656.533069] usb 6-1: new full speed USB device using uhci_hcd and address 309:07
aw[17656.673148] usb 6-1: not running at top speed; connect to a high speed hub09:07
aw[17656.700375] usb 6-1: configuration #1 chosen from 1 choice09:07
aw[17656.707317] usb 6-1: Ignoring serial port reserved for JTAG09:07
aw[17656.712410] ftdi_sio 6-1:1.1: FTDI USB Serial Device converter detected09:07
aw[17656.712563] usb 6-1: Detected FT2232H09:07
aw[17656.712570] usb 6-1: Number of endpoints 209:07
aw[17656.712576] usb 6-1: Endpoint 1 MaxPacketSize 6409:07
aw[17656.712582] usb 6-1: Endpoint 2 MaxPacketSize 6409:07
aw[17656.712587] usb 6-1: Setting MaxPacketSize 6409:07
aw[17656.717182] usb 6-1: FTDI USB Serial Device converter now attached to ttyUSB009:08
awnow is full speed, so what steps you want to try?09:08
wolfspraulaw: perfect. just run reflash_m1.sh09:08
wolfspraulyes 0x3A is nice too! thanks. I didn't see it...09:08
awbe noticed that now it stays d2/d3 dimly lit.09:08
awwait...use xiangfu's last 'erase' version, right?09:09
wolfspraulsure why not09:09
wolfspraulalways use the new reflash_m1.sh with erase now, I see no reason why not09:09
wpwrakseems we have more: 0x55, 0x67, 0x6d, 0x6f, 0x70, 0x77, ...09:09
wpwrak0x7a is a bit weird, but may also be the same09:10
wolfspraulwell you are brave09:11
awhmm...stops at 'Bitstream length: 1484404'09:11
awstandby next analysis step now..he he ;-)09:11
awwhat's meaning of "GRRRR"? ;-)09:12
wpwrakseems that my "full speed" theory is wrong :-(09:12
wpwrakah well, in any case it shouldn't make things worse ...09:13
awwpwrak, not bad that a way would be came out from you. :-) never sad though..we hear you.09:13
wolfspraulaw: let's try the same quick test on 0x3A09:13
wolfspraulI do believe full-speed is good, we should always use it and it will help eliminate a few strange flashing problems. But I don't believe it has any impact on the physical/electrical condition of a particular m1 board.09:14
wolfspraulI trust the little jtag board and the ftdi chip. once the nor is written it's written. the strangeness must come from the m1 boards themselves.09:15
aw0x3a: good still detect with full speed and stays d2/d3 dimly lit after powered -on. now to reflash. ;-)09:16
awmm...same stopped at 'Bitstream length: 1484404'09:16
wolfspraulaw: try to disconnect/reconnect the jtag-serial board too09:16
wolfspraulaw: now try to disconnect/reconnect the jtag-serial board09:17
wolfspraul(power off everything first)09:17
awhmm...need to power off09:17
wolfspraulthen reflash_m1.sh in full-speed again09:17
awsame stopped there. :-(09:18
wolfspraulone sec09:18
wolfspraulcan you try reflashing with Xilinx Impact?09:19
wolfsprauland the xilinx cable09:19
awhmm...seems different image i quite don't know this.09:19
awneed to ask xiangfu before do this. :-)09:19
wolfspraulwe can do that later09:20
awlast time rc2 I used Lekernel's image09:20
wolfspraulon all boards with flashing problems, we can try Xilinx Impact and Xilinx cable later09:20
wolfspraulaw: ok, let's stop the full-speed tests right now09:20
wolfspraulWerner had another idea I like09:20
wolfspraulaw: you just finished 0x7C, right?09:21
awso from now on i still use full-speed to continue tests?09:21
awyes, finished 0x7c09:21
wolfspraulalways full-speed09:21
awalright full-speed now09:21
wolfspraulso werner wants to make a special test on 0x7c09:21
wolfspraullike this:09:21
wolfspraulwait I write first09:21
wolfspraul1. plug DC jack in09:22
wolfspraul2. middle button, escape to test software, run test software until CRC checks09:22
wolfspraul3. unplug DC jack09:22
wolfspraul4. go back to step #109:22
wolfsprauljust that09:22
wolfspraulWe are hoping that after some cycles, the CRC checks will find a corruption09:23
wolfspraulthe cycles should be fast, so you can try 100 or 20009:23
wolfspraulstart with 100 :-)09:23
wpwrakand please count the cycles09:23
wpwrakerr, i'd stop at the first CRC error09:23
wolfsprauloh sure09:23
wpwrakthen analyze09:23
wolfspraulsorry that wasn't clear09:24
wolfspraulaw: of course you stop at the first CRC error09:24
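The short cycle test agreed on above (power up, CRC check, power down, count the cycles, stop at the first CRC error or after 100 passes) has the following shape. The steps themselves are manual in the log, so `power_cycle_and_check_crc` is a hypothetical stand-in for them and the whole block is only a sketch of the bookkeeping.

```python
def run_crc_cycles(power_cycle_and_check_crc, max_cycles=100):
    """Repeat power-up -> CRC check -> power-down until the first CRC failure.

    power_cycle_and_check_crc is a hypothetical callable standing in for the
    manual steps (plug in DC jack, middle button, run the test software up to
    the CRC check, unplug). It receives the cycle number and returns True
    when the CRC check passes.

    Returns (cycles_run, failed_cycle), where failed_cycle is None if every
    cycle passed.
    """
    for cycle in range(1, max_cycles + 1):
        if not power_cycle_and_check_crc(cycle):
            # Stop at the first error so the flash contents can be analyzed.
            return cycle, cycle
    return max_cycles, None


# Simulated board whose CRC check fails on cycle 7.
print(run_crc_cycles(lambda n: n != 7))
```

Returning the cycle count along with the failing cycle keeps the number Werner asked for, even when no error shows up within the limit.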
wolfspraulwpwrak: be warned (well, I warn myself). I believe this kind of testing may damage the nor chip or more, and turn a board unflashable for days or forever. :-)09:25
wolfspraulaw: no worries, I just explain my theory to Werner... You can have fun :-) We have enough boards now to ruin some :-)09:25
awwolfspraul, ha...yes, from last rc2 experiences. ;-)09:25
wolfspraulwe should have taken it much more seriously on rc209:26
wolfspraulI learnt a lot09:26
wolfspraulbut that's another story, now we try to rescue rc3 and make good boards09:26
wpwrakwolfspraul: the chip should be good for a few kcycles09:26
wolfspraulyou will see soon09:26
wolfspraulit's a bug somewhere, an electrical problem09:26
wolfspraulsome kind of shock, over-current, over-voltage, whatever09:27
wolfspraulyou saw Adam's reaction just now when I wrote this :-)09:27
aw2 times09:27
wpwrakwolfspraul: hmm, let's hope it's not overvoltage or such. the reset chip replacement couldn't fix that.09:27
wolfspraulI know09:28
wolfspraulso I keep asking "how comfortable are we" :-)09:28
wolfspraulbecause I'm not :-)09:28
wolfspraulI made some big mistakes in rc2, like I said - already learning...09:28
wolfspraulbut that analysis doesn't help now, so let's make the best out of the rc3 situation we have right in front of us09:29
wolfspraulsometimes all you have left is that some luck happens09:29
wolfspraula lucky day!09:29
wolfspraulmaybe today?09:29
wolfspraullet's look for signs!!09:29
wpwraksigns and portents :)09:30
wolfspraulwpwrak: do we have any theory what kind of damage or impact may turn the nor chip, or something else, unflashable for several days, but then flashable again?09:34
wolfspraulbecause Adam has seen that so many times now that we can rule out it just being some sort of noise09:34
wolfspraulAdam will regularly let an unflashable board 'rest' for several days, and then try again, because we have seen a lot come back alive after such a resting period09:35
xiangfuwpwrak, you may already saw my patches on urjtag 'lockflash' 'unlockflash'. I have some question about how this urjtag works.09:35
wolfspraulit's not 5 minutes, or an hour, the effect is noticeable after 1 day or 2 days or so09:35
wpwrakhmm no, no idea09:35
wpwrakcould be some temperature dependency09:36
wolfspraulnot sure09:36
xiangfuwpwrak, http://dpaste.com/594592/ line 20 ~ 28 is read back the lock bit and check.09:36
wolfspraulbecause the boards worked fine before, including reflashing09:36
wpwrakone test could be like this: if board X magically recovers, try all other boards with reflash problems at that time too. if it's temperature, some of them may also come back09:36
wolfspraulyou mean room temperature?09:37
wolfsprauldefinitely not. it's a time based phenomenon.09:37
xiangfuwpwrak, I want know how urjtag know the 'cfi_array->address' ?  for now I understand: 1. upload the fjmem.bit 2. then nor flash working 3. how urjtag know what is the address of nor flash data port?09:37
awwolfspraul, last in rc2 we damaged our boards by "fast-powered cycling" though..not keep 5 seconds between power-on like this time09:37
wolfspraulyes sure, and the reset circuit is also there. let's just focus on trying to reproduce the flash bug now, I'm only saying if it falls into an unflashable state, I wouldn't be surprised.09:38
wolfspraullike 0x32 or 0x3A we just looked at09:38
wpwrakxiangfu: (address of flash) isn't this configured somewhere ?09:39
awwpwrak, why no use high speed to capture the tests I am doing?09:39
wpwrakwolfspraul: (time) hard to distinguish the two09:39
awi felt this test if use full-speed?09:39
wolfspraulit shouldn't matter. you think it's slower now?09:40
wolfspraulI think you should always use full-speed, even for this test.09:40
awsince last week i met CRC err by high speed. :-)09:40
wolfsprauldon't say that otherwise Werner will jump up and hurt his head09:41
awso i just wanted to clarify what purpose you wanted to catch?09:41
awoah...sorry ;-)09:41
wolfspraulno no, just joking09:41
wolfspraulI am just joking09:41
wolfspraulaw: I think always use full-speed09:41
wolfspraulfor everything09:41
awalright .;-)09:41
wpwrakxiangfu: are you sure about  URJ_BUS_WRITE (bus, adr + 0x02, CFI_INTEL_CMD_READ_IDENTIFIER);  ?09:42
wpwrakxiangfu: the data sheet seems to want 0x1a (table 8, page 19)09:42
wpwrakxiangfu: ah, sorry, misread it. it's not 0x1A but IA :)09:44
wpwrakxiangfu: so, if i understand things right: URJ_BUS_WRITE (bus, adr, CFI_INTEL_CMD_READ_IDENTIFIER);09:45
wpwrakxiangfu: and then sr = URJ_BUS_READ (bus, cfi_array->address+2);09:46
wpwrakhmm, vanished :(09:46
wpwrakaw: you're running the CRC check each time ?09:47
awhaven't spotted CRC err though. ;-)09:48
wolfspraulit could take 100-200 cycles09:50
wpwrakwasn't 100-200 the rate of "dim LEDs" ?09:51
wolfspraulunfortunately we know so little. it could be that some boards will never exhibit the problem.09:51
wpwrakwith the CRC check, we should hit it ~10-20 times more often, assuming uniform distribution09:51
wolfspraulmaybe it is caused by some unfortunate part tolerances coming together09:51
wpwrakthat could be the case, too09:51
wolfspraulI don't believe that, but let's see09:51
wpwrakmaybe it's also a question of giving the board enough time to discharge09:52
wolfspraulif we know for sure that some boards are safe, they are good to go09:52
wolfspraulthe bad thing is that we currently do 10 render cycles (30 seconds each) in our testing09:52
wolfsprauland we had boards failing on cycle #2 #6 #9 etc.09:52
wolfspraulnot good09:52
wolfspraulwhy should '10' be the magic number to determine that the board is stable?09:52
wpwrakif we have the baseline probability, we can calculate how many tests you need to be, say, 99% sure the problem doesn't appear09:53
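The baseline-probability calculation mentioned above can be sketched as follows. This is a minimal Python sketch, not something posted in the channel: it assumes every power cycle fails independently with the same probability `p_fail`, and the function names `runs_needed` and `confidence_after` are hypothetical. If the failure turns out to be board-specific or temperature-dependent, the independence assumption does not hold and the numbers are only a rough guide.

```python
import math


def runs_needed(p_fail, confidence=0.99):
    """Smallest number of clean cycles n such that a fault with per-cycle
    probability p_fail would have shown up at least once with the given
    confidence, i.e. 1 - (1 - p_fail)**n >= confidence.

    Assumption: cycles are independent and p_fail is constant.
    """
    if not 0.0 < p_fail < 1.0:
        raise ValueError("p_fail must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


def confidence_after(n_clean, p_fail):
    """Probability that a fault with per-cycle probability p_fail would have
    appeared at least once during n_clean cycles (same assumptions)."""
    return 1.0 - (1.0 - p_fail) ** n_clean


if __name__ == "__main__":
    # Hypothetical baseline rates, chosen only for illustration.
    for p in (0.1, 0.01):
        print(p, runs_needed(p), round(confidence_after(10, p), 3))
```

Under these assumptions, 10 clean cycles say little about a rare fault: for a fault that appears once in 100 cycles, `confidence_after(10, 0.01)` stays below 0.1.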
wolfspraulwe don't need to look at or find root causes for all sorts of strange flash/dim lit/reconfig/whatever boards. we have enough time for that once we have cleared 40, 50, 60 or more to go out09:53
wolfspraulI think what helps is if we can more clearly see the different bugs separately that are probably overlapping here09:54
wolfspraulwhich is why I like the full-speed stuff, short cable, crc checks, etc.09:54
wolfspraulalso the reset ic idea09:54
wolfspraulnot just idea, that seems to be a clean fix/improvement that is good no matter what other things we find09:55
wolfspraulwpwrak: speaking about that. you really want the 1.12s delay ic?09:56
wpwrakyeah, if the reset chip does anything useful at all, then this is an improvement09:56
wolfspraulI mean - can that work at all?09:56
wolfspraulyes I think the reset ic is fine, helps09:57
wolfspraulpretty sure about that09:57
wolfspraulso we order 100 of this: http://search.digikey.com/scripts/DkSearch/dksus.dll?Detail&name=APX803-44SAG-7DICT-ND09:57
wpwrak(1.12 s) for R&D, it may be good to have. not sure it would be desirable in M109:57
wolfsprauland 3 or 4 of this: http://search.digikey.com/scripts/DkSearch/dksus.dll?Detail&name=576-3835-1-ND09:57
wpwraki usually get at least 10, unless the item is very expensive :)09:58
wolfspraulI'm half Chinese09:59
wolfspraulso 509:59
wolfspraulfor the cycle testing adam is doing on 0x7A now, I propose we stop that at 100 successful cycles09:59
wolfsprauland let Adam continue to go through the whole batch as planned10:00
awno, 0x7c10:00
wolfspraulsorry 0x7C10:00
wolfspraulthat's because we already have several improvements now (short cable, full-speed, crc checks in test software which is logged), and then we have more testing data to look at and thing about10:00
wolfspraulthink about10:00
wolfspraulthen we can zoom in on clusters, or try to find clusters, or try to find boards where it is easy to reproduce some particularly interesting behavior10:01
wpwrakyeah, 100 should be plenty. i would have expected to see an error much earlier. maybe we've removed the step that actually causes the problem. but let's try a few more boards first.10:01
wolfspraulyes we may have removed the step10:01
wpwrak(zoom in) yes10:01
wolfspraulor the problem is only showing on particular boards10:01
awso remove new steps?10:02
wpwrakthat could also be10:02
wolfspraulaw: no, just continue10:02
wolfspraulWerner and I are discussing the next steps10:02
wpwrakmaybe, when 0x7c is done, pick one that has had NOR corruption before10:02
wolfspraulWerner cannot wait looking at the interesting stuff NOW ;-)10:02
wolfspraulalso from now on, Adam will run crc checks between the 10 render cycles10:03
wolfspraulthat may show something (or not)10:03
wolfspraulwe could increase the 10 render cycles to 15 ?10:03
wolfspraulthey are time consuming though10:03
wolfspraulthat's 30 minutes testing for each board, easily10:03
wolfspraulnah let's only do 10 now10:05
wolfspraulI don't need more evidence that boards fail at #12 or #1410:05
wolfspraulI need to find the root cause10:05
wolfspraulin that thinking we could even reduce the cycles to 5 :-)10:06
wpwraklekernel: in adam's usual test, boot and render for some minutes, do NOR access (read or write) occur after the rendering starts ?10:06
wolfspraul30 seconds render10:06
wpwrakwolfspraul: wait wait .. for now, we don't have rendering in the loop10:06
wolfspraulI am thinking once he's back to going through the batch10:07
wolfspraulI think we can reduce to 5 cycles10:07
wpwrakwolfspraul: let's keep the simplified loop and apply it to a board that's known not to be immune10:07
wolfspraulbut with crc checks in between10:07
wolfspraulwhich one?10:07
wolfspraul(looking at list)10:07
wolfspraulwpwrak: how about 0x39 ?10:08
wolfspraulclean and simple10:08
wolfspraulone cycle - and out :-)10:08
wolfspraulmaybe too simple, maybe a little later...10:08
wpwrak0x39 sounds excellent :)10:09
wolfspraul0x54, also nice10:09
wolfspraul8th cycle10:09
wolfspraulok, 0x3910:09
wpwrakanother up to 100 tries with 0x3910:09
wolfsprauloh my10:10
wolfsprauldinner time for Adam :-)10:10
wpwrakif that still doesn't do anything, add rendering to the loop10:10
wpwrakcan you go from test to render ? or do you have to reset in between ?10:10
wolfspraullike I said, instead of doing time consuming tests on single boards now, we can also proceed going through the batch with the process that we improved in details10:10
wolfspraulwpwrak: maybe he could go from test to render over software reset (press three buttons), instead of pulling the DC cable10:11
wpwrakwe need a larger number of tests for now. statistical baseline.10:11
wolfspraulyes but on which boards?10:11
wpwrak(sw reset) yes, that's an option10:11
wolfspraulyou may be hitting on a board that may never show the problem, just wasting time10:11
wpwrak0x39 looks promising :) it did it once. we know it can ;-)10:11
wpwrakif it all of a sudden doesn't do it, that's interesting, too10:12
wolfspraulwpwrak: what do you want to do on 0x39 actually? reflash it?10:13
wolfspraulfirst try whether it boots now10:13
wolfspraulboards have come back after X days, though I am not sure exactly from which of the multiple failure conditions we may actually be looking at10:14
wpwrakyeah, would be fun if the NOR corruption would somehow have healed itself ;-)10:14
wolfspraulso first try to boot 0x39, see what happens. if no reconfigure -> reflash_m1.sh with erase and full-speed10:14
wolfspraulI am telling you we have seen enough such cases now10:14
wpwraktry to boot and if it boots, run the CRC check10:15
wolfspraulgood idea10:15
wpwraki have that mental image of a guy in a prison cell counting the days with scratch marks on the wall. adam must be doing something similar, counting the tries until he can lay the board to rest :)10:18
wolfspraulwpwrak: I do think he should continue going through the batch first, before 0x3910:23
wpwrakaw: ah, and please paste (to pastebin.com, or similar) the console output of the 100th run10:23
wolfspraulbut if you want him to do 0x39 next, ok with me10:23
wpwraki'd prefer 0x39. let's make the thing happen before changing an unknown set of variables10:24
wolfspraulaw: thanks a lot!10:26
wolfspraulcan you post the console output of the last run to pastebin.com ?10:26
wolfspraulhave you used pastebin.com before?10:26
awmy logs are over though10:26
awwon't show 100 times for you!10:26
awbut you can only trust my log though. :-)10:27
wpwrakperfect. thanks !10:27
awor tell me how my terminal log can save longer message it can. ha ;-)10:27
wpwraknaw, just wanted to see one :)10:27
wolfspraulaw: what terminal program do you use?10:28
wpwraknothing unusual there, so the tests appear to be good10:28
awI need to wrap my rubbish for preparation pm 7:00 and dinner10:28
wpwraknow, 0x39. this will be fun :)10:28
wolfspraulaw: want to go to dinner first, or next test?10:28
awwpwrak, I'll back soon with 0x39 though following your idea. ;-)10:29
awis okay? ;-)10:29
wolfspraulyes perfect10:29
wolfspraulaw: you prefer mouser instead of digi-key, right?10:30
wolfspraulis mouser faster, or what is the reason?10:30
awwolfspraul, yes. Mouser won't charge me extra business fee when over NTD3000 which including shipping fee. ;-)10:30
wolfspraulok, so mouser is cheaper10:31
wolfspraulalright, I will lookup the reset parts in mouser, I want to get the order out asap10:31
awthe digikey will always charge me an extra business tax 5% of whole order price of batch.10:31
awthat's why i used to order in Mouser. ;-)10:31
wolfspraul5%, ok10:32
awi gotta go though10:32
wolfsprauldoesn't sound like a big drama10:32
wolfspraulok, later10:32
wpwrak(5% charge) interesting. no such charge here, it seems. at least i see nothing that looks like it in the invoice10:32
awi'm back. so 0x39 for next. ;-)11:33
awso use same steps like 0x7c's? 100 times, right?11:34
wpwrakfirst, let's see if it boots in its present state (without reflashing)11:41
wpwrakif it does, please run the CRC test11:43
wpwrakif it doesn't boot, do you know how to read out the NOR via gdb ?11:43
aw1. it can boot > rendering now11:47
aw2. CRC is okay while testing11:49
awi don't know how to read out of NOR via gdb11:49
awso now I go for 10 times of rendering -> power cycle -> middle btn -> test program -> CRC -> power-cycle -> rendering ?11:50
awor 100 times?11:50
awor i learn to read NOR firstly?11:51
wolfspraulit's rendering now?11:55
awit's in test program now11:56
wolfspraulaw: what is 0x39 doing now? (sorry, just back)11:56
wolfspraulcrc was ok?11:56
awit was okay11:56
wolfspraulthat confirms my suspicion that the 'cannot reconfigure' bug we see is not always a nor corruption11:57
wolfspraulin fact the only time we saw a corruption for now was on an rc2, which may not be comparable11:57
wolfspraulaw: maybe let's do 20 cycles of the same style as 0x7c before11:57
awin rc2? was xiangfu show it us, right?11:58
wolfspraulI think we should ignore that result until we have rc3 data11:58
awum...got it11:58
wolfspraullet's do 20 cycles like before11:58
wolfspraulmy prediction: there will be no problem11:58
awalright, so i start to count11:58
wolfspraulbut who knows :-)11:58
awoah~~man! d2/d3 dimly lit now. :(11:59
wolfspraulwell great11:59
awso next?11:59
wolfspraulthat was the first power cycle?12:00
wolfspraulyou just unplug, replug -> d2/d3 dimly lit?12:00
awbelongs to SECOND power on12:01
wolfsprauldo you know how to read nor via jtag?12:01
awso it's second powered -on12:02
awdon't know12:02
awi see xiangfu seems to have a read flash script12:02
wolfspraulI wish we could rework the reset ic to 5v/4.4v on 0x3912:03
wolfspraulI think you should continue with the rest of the batch now, I have no further questions about 0x39 right now12:03
wolfspraulunless xiangfu shows up and tells us how to read nor quickly, or unless wpwrak has other questions12:04
awthis https://github.com/milkymist/scripts/blob/master/scripts/read_flash_m1.sh  ?12:04
wolfspraulwe can go back to 0x39 later12:04
wolfspraulsure, you can try :-)12:04
wolfspraulread the standby bitstream first12:04
wolfspraulah yes, you can just run it12:05
awhow the command's syntax?12:05
wolfsprauljust run ./read_flash_m1.sh12:05
awand which file it will write in?12:06
wolfspraulin your home dir ~/.qi/milkymist/readback/_date_/standby.fpg12:07
wolfspraulif it works12:07
wolfsprauljust be brave, run and see what happens12:07
wolfspraulconnect jtag with usb full-speed as always12:07
awadam@adam-laptop:~/m1_adam/snapshots/2011-07-13/for-rc3$ sudo ./read_flash_m1.sh12:11
aw./read_flash_m1.sh: 6: Syntax error: newline unexpected12:11
awafter chmod +x read_flash_m1.sh12:11
awxiangfu, how to use https://github.com/milkymist/scripts/blob/master/scripts/read_flash_m1.sh ?12:12
wolfspraulstrange how did you download it? try this url https://raw.github.com/milkymist/scripts/master/scripts/read_flash_m1.sh12:14
wolfspraulmaybe something wrong with newlines?12:14
awi used : wget --no-check-certificate https://github.com/milkymist/scripts/blob/master/scripts/read_flash_m1.sh12:15
wolfspraultry my url12:16
xiangfuyes. it should be "https://raw.github.com/milkymist/scripts/master/scripts/read_flash_m1.sh"12:16
wolfspraulmaybe you got the entire web page :-)12:16
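The download mix-up above (fetching the GitHub "blob" page instead of the raw script, which then fails with a shell syntax error) can be illustrated with a short Python sketch. It is not part of the milkymist scripts: the helper names `blob_to_raw` and `looks_like_html` are hypothetical, and the `raw.github.com` host is simply the one quoted in the log, so it may differ for other setups.

```python
from urllib.parse import urlsplit, urlunsplit


def blob_to_raw(url, raw_host="raw.github.com"):
    """Rewrite a github.com '.../blob/<branch>/<path>' page URL into the
    raw-file form quoted in the log. raw_host is an assumption taken
    from the URLs posted in the channel."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected layout: owner / repo / "blob" / branch / path...
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError("not a github.com blob URL: %s" % url)
    raw_path = "/" + "/".join(segments[:2] + segments[3:])
    return urlunsplit((parts.scheme, raw_host, raw_path, "", ""))


def looks_like_html(text):
    """Cheap check for 'downloaded the web page instead of the script'."""
    head = text.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


if __name__ == "__main__":
    page = "https://github.com/milkymist/scripts/blob/master/scripts/read_flash_m1.sh"
    print(blob_to_raw(page))
```

Running `looks_like_html` on the downloaded file before `chmod +x` would catch this case early, since a real shell script starts with a `#!` line rather than an HTML doctype.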
awoah :-) poor adam12:18
wolfspraulwhat? it worked?12:19
awnot work. it stops the same . wait ...i copy msg log12:20
wolfspraulI think we stop work on 0x39 now. wait a day and maybe it boots again :-)12:21
awyes, let's stop 0x39 now12:21
wolfspraulI suggest you go back to the normal procedure, continue with all boards and all known fixes12:22
wolfspraulI propose a change to the render cycles, we already said that you run the crc test software after each render cycle12:22
wolfspraulI also think you should reduce the number from 10 to 512:22
wolfspraulso here's the list:12:22
wolfspraul1. only use short cable (as before)12:22
awmm..so your normal procedure is now becoming:12:22
awgo on12:22
wolfspraul2. always run reflash_m1.sh in usb full-speed mode12:23
wolfspraul3. run the test software (crc part) after each render cycle12:23
wolfspraul4. reduce the number of render cycles from 10 to 512:23
wolfspraulthat's all12:23
awgot it12:23
wolfspraulaw: so what you've found is that if a board is in d2/d3 dimly lit status, it cannot be reflashed over jtag-serial, and the nor can also not be read over jtag-serial12:37
wolfspraulwe could try to reseat (disconnect/reconnect) the jtag-serial board, and we could try to reflash with Xilinx Impact12:37
wolfspraulbut I suggest to do that later12:38
wolfspraulthe real showstopper is to find the reason why a board can go from seemingly normal to this state. we have to fix that before boards can go out.12:38
wolfsprauland the only idea right now seems to be the new reset ic12:38
wolfspraulI think whether it's from the fpga, software or electrical, the m1 is doing something really bad to the nor chip under some circumstances12:43
awwolfspraul, after it's in d2/d3 dimly lit, cannot be read over jtag-serial, but bad that I forgot to reflash it again. but from previous other boards' histories, once board is in d2/d3 dimly lit, it seems always stopped at "Bitstream length: 1484404"12:43
awbut we can try reflash 0x39 tomorrow12:43
wolfspraulthe famous "let's wait 1 day"12:43
wolfspraulI think let's continue with all boards first12:43
wolfspraulmore fixes, more data12:44
wolfspraulI need complete overview over the failure clusters12:44
wolfspraulin parallel the new reset ics are ordered12:44
awokay..i continue tests12:44
wolfspraulaw: do you know how to _READ_ the nor chip with Xilinx Impact?12:44
wolfspraulyou could try to read the nor from 0x39 with Xilinx Impact12:45
wolfspraulbut yeah, I suggest - do that later12:45
awhmm...need to do this later though ;-)12:45
wpwrakhmm, interesting ..13:57
wpwrak(sorry, fell asleep and missed part of the fun)13:57
wpwrakso maybe we don't have a NOR corruption after all. that would be good :)13:58
lekernelwpwrak, any other ideas about what is happening, then?13:59
lekerneltemperature dependent timing failures?13:59
wpwrakcould be some analog domain weirdness of the diode-based reset circuit ... but i don't have any clear error path for that13:59
wpwrakwhat's puzzling me is that JTAG and normal operation run into trouble with the NOR14:00
wpwrakotherwise, i would have suspected problems with the timing of NOR bus access cycles14:01
wpwrakmaybe some of the signals are just too weak ? a voltage check could help to clarify this14:01
wolfspraulthe sample set was smaller (run of 40 instead of 90), and there was a lot less testing in rc2 than rc3, but I am pretty sure this same 'kind' of bug already existed in rc214:03
wolfspraulso I think that rules out anything new that got introduced by the reset ic or diode14:03
wolfspraulbig guess though, just from thinking about what cases I saw or remember14:03
wpwrakwolfspraul: you think it's the same as the NOR corruption ?14:04
wolfspraulwell, the best data I have now are the rc3 test results14:04
wpwrak(which could of course just be invalid data showing up, without the NOR itself being compromised)14:05
wolfspraulso I scan them, top to bottom and back up, on the 'notes' column14:05
wolfspraulwhat I see now, even though Adam is not finished yet, is easily 20-30 boards that all fall into one 'group'14:05
wolfsprauld2/d3 dimly lit, cannot reconfigure, cannot reflash14:06
wolfsprauljust counted: 1714:08
wolfspraul46 have passed, 1 adam, 17 in that 'group', 26 in other failure states currently14:09
wolfspraulthat 26 will come down more14:09
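The tally quoted above adds up to the 90-board rc3 run, which puts the "dim LED" group at about 19% of the batch. A minimal Python sketch of that arithmetic follows; the dictionary keys and the function name `failure_shares` are hypothetical labels, only the four counts come from the log.

```python
def failure_shares(counts):
    """Return (total, {category: fraction of total}) for a board tally."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty tally")
    return total, {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    # Counts as stated in the channel; category names are illustrative.
    rc3_tally = {
        "passed": 46,
        "adam": 1,
        "dim_led_group": 17,
        "other_failures": 26,
    }
    total, shares = failure_shares(rc3_tally)
    print(total)
    for name, share in shares.items():
        print(name, round(100 * share, 1))
```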
wpwrakplus, 0x39 seems to be able to enter this state, whatever it is, relatively easily. let's make this our preferred candidate for now.14:09
wpwrakand if it is in this state, it doesn't seem to get out without a power cycle. but maybe this is just a lack of time14:09
wolfspraulso that's a big group already (17, counted conservatively), and growing14:10
wpwrakhas a board with dim LEDs been left running for a long time, say, overnight ?14:10
wolfspraulyou mean in dim LED state?14:10
wolfspraulafaik it's not running then14:10
wolfsprauldim LED means no boot14:10
wpwraklekernel: on CRC failure, will the FPGA just keep on trying forever ? or does it eventually give up ?14:11
lekerneliirc it tries 3 times or something like that14:11
wpwrakwolfspraul: i mean leave it on, see if it eventually succeeds14:11
lekernelbut i'm not sure14:11
wpwraklekernel: and then ?14:11
lekernelstays in unconfigured state14:12
wpwrakbleh :-(14:12
lekernelin any case, loading fjmem.bit will stop all other configuration attempts14:12
wpwrakwhat's fjmem.bit ?14:12
wolfspraulwpwrak: you want to do voltage check on which wires?14:14
wpwrakthe "three button salute" triggers a reset, right ? does it also work if unconfigured ?14:14
lekernelwpwrak, the bitstream that is loaded to give urjtag a "fast" jtag access to the flash14:14
wpwrakwolfspraul: basically all the NOR signals. pick a convenient line, e.g., OE, do, say, a read cycle, then see how they behave14:15
wolfspraulwpwrak: 0x39 does not just 'enter' this state easily, more importantly it is 'in' this state right now and we don't know how to get it out14:17
wpwrakwolfspraul: if one can't quite decide whether it should be 0 or 3.3 V, we may have found our problem. maybe set trigger on OE#, then start with RP#, WE#, DQ0, A0, then do the rest of DQx and Ax14:17
wpwrakwolfspraul: it does seem to get out of the state sometimes.14:17
wpwrakwolfspraul: ah, before DQ0, also CE014:18
wolfspraulyes, but next time we have a board in a state we may zoom in then14:18
wolfspraulthere's two different things I think14:18
wpwraklekernel: maybe something to check out would be if all the FPGA I/O cells of NOR pins are properly configured14:18
wolfspraulsome event that gets it into this state, and some situation or effect that holds it there14:18
wpwrakwolfspraul: yes. could be temperature plus tolerances. the tolerances enable. the temperature makes it happen.14:19
wpwrakor maybe humidity, phase of the moon, ... ;-)14:19
wolfspraulI doubt it's room temperature. parts temperature - yes, possible.14:19
wpwrakpart temp. starts at room temp. :) then you do a bit of testing, it fails, keeps on failing, you give up, put it away, and then it works, until ...14:20
wpwrakof course, if we're unlucky, probing the signals "fixes" it14:21
wolfspraulpah, tough14:22
wolfspraulcan't seem to be able to pin it down14:22
wolfspraulI already ordered some more nor flash, just in case :-)14:22
wolfspraulhave to ramp up the efforts a bit14:22
wpwrakyou suspect the NOR could simply be bad ?14:22
wolfspraulunfortunately adam doesn't have a tsop-56 or whatever package it is tester that could test and scan the entire nor chip at once :-)14:23
wolfspraulI don't know14:23
wolfspraul'bad' as in what?14:23
wpwrakwhere's a cheap 56 channel analog scope with active probes when you need one ? ;-)14:23
wolfspraulmaybe we are operating it outside of spec?14:23
wolfspraulnot 'bad' as in broken parts or so, no14:23
wolfspraulnot at this rate of 20% or more14:24
wpwrakyou got it from a reputable source ?14:24
wolfspraulahh :-)14:24
wolfspraulyes I think so14:24
wolfsprauland no, there is no indication that that's the problem14:25
wolfspraulthis chip is made on a 65nm process14:25
wolfspraulafaik nobody in China can do it yet14:25
wolfspraulanyway, no, the parts are good14:25
wolfspraulalthough if replacing some 'fixes' the bug of course I'd do that for now14:25
wpwraklekernel: is there also a "slow" jtag access to flash ? i.e., just good old bit-banging ?14:26
wolfspraulbut since we don't even know how to test whether the bug 'exists' on a particular board or not (if it is even board dependant), that wouldn't help either14:26
wpwrakwolfspraul: yeah, let's consider signal integrity for now14:27
lekernelafaik fjmem.bit is just bit banging, but you don't need to scan the 450+ pins of the BGA every time14:27
wolfspraullekernel: can you make images of the same sources we have now for Xilinx Impact?14:27
wolfspraulor are they the same?14:28
wpwraklekernel: so fjmem.bit is different from the regular NOR access algorithm ? i.e., much slower bus cycles ?14:28
lekernelbut this has nothing to do with failure of the flash _after_ it has been written14:28
lekernelyou need to convert to .mcs to use xilinx impact14:28
wolfspraulthat's something we can try to bypass any libusb/urjtag/jtag-serial issue, although I don't think that's the root cause of the problem14:28
lekernelwith srecord for example14:28
wpwraklekernel: and when the FPGA boots from NOR, does it always use a built-in bus protocol or does it, say, load a bit from NOR, then switches ?14:29
lekernelit uses the hardwired configuration system14:29
lekernelit seems you can send commands from the flash to change a few things while it's running, but I don't know14:30
wpwrakokay, so we have 2-3 entirely different bus protocol implementations. seems unlikely that all of them would just be wrong.14:30
wolfspraulwpwrak: hey, you will like this14:30
Action: wpwrak ducks14:31
wolfspraulI followed the wiki to find our flash source, and it is the World Peace Industrial Group!!!14:31
wolfspraulif that's not trustworthy, then sorry, I cannot help you14:31
wolfspraulI'm serious14:31
wolfspraulworld peace14:31
wolfspraulthat's where we buy from14:31
wpwrakhaven't we met them before ?14:32
Fallenouthey cannot sell wrong parts then :-)14:32
wpwrakFallenou: well, osama got the nobel peace prize, so ...14:32
wpwrakerr, obama. damn.14:32
wolfspraulno but they are fine, really14:32
Fallenoui wtf'ed a few seconds14:33
wolfspraulalso in this kind of part you rarely have problems, unless you really buy returned/used parts or so, and who does that...14:33
wolfspraulthis part is too high-end14:33
wpwrakwell, could be rejects14:33
wolfspraulit's not14:34
Fallenouwell you would not be the first to have troubles with flash parts14:34
wpwrakbut let's assume for now it's a bus problem14:34
wpwrakFallenou: understatement of the year ;-) you should work as a nuclear power spokesperson :)14:34
wpwrakFallenou: of course, you admitted that a problem exists at all. so maybe not :)14:35
Fallenou hehe14:35
wolfspraulis it possible that the problem is in the fpga not the nor chip?14:36
wpwrakof course. same story there.14:36
wpwrakit doesn't seem to be a configuration problem, though, since the hardwired bus protocol also trips14:37
Fallenoui meant i heard a few people complaining about flash parts behaving strangely , even just soldered brand new ones14:37
Fallenoubad blocks problems and so on14:37
wolfspraulno I don't mean in terms of bad parts or so, that's not the case for sure. I'm just wondering what kind of problem it might be, theoretically.14:37
wpwrakFallenou: NOR or NAND ?14:38
wolfspraulwe could unsolder the nor of 0x39 and put it on a good board to see what happens there14:38
Fallenouwas nand i think14:38
wolfsprauloh no14:38
wolfspraulnow Werner will be busy for a while14:38
Fallenoumaybe it does not apply here at all14:38
wpwrakFallenou: bad blocks are normal in NAND. and they have a very subtle definition of what constitutes a "good" block, too :)14:38
wolfspraulthe problem of reseating a nor chip to another board is that it's quite intrusive and may create or mask problems14:39
wpwrakFallenou:  a "good" block is one with 0 or 1 error, i.e., few enough errors that the ECC can still fix it14:39
wolfspraulso we may just get noise back14:39
wpwrakwolfspraul: i'd look at the signal first. if we trust both FPGA and NOR, the problem must be on the bus :)14:40
Fallenouwell ok nevermind sorry for the noise ;)14:40
wpwrakwolfspraul: first step: do something that exercises the bus and see if there's an anomaly14:40
wolfspraulwe could take 0x39 and try to read the standby bitstream14:41
wolfsprauland compare with a board where that works14:41
wolfspraulwhat happens on 0x39 now - the ftdi chip loads a small bitstream into the fpga, and then it tries to read from nor via fpga14:42
wolfspraulbut that fails/hangs ?14:42
wolfspraulany visibility into that?14:42
wolfspraulcan the fpga 'log' all bus activity? :-)14:42
wolfspraulhe he14:42
wolfsprauljust thinking, maybe nonsense14:43
lekernelurjtag might have some debug mode14:43
lekernelalso, inputting the commands manually one by one instead of using the batch script would already help14:43
lekerneland you can 'pld load' directly the soc design14:44
lekerneland run the test program14:44
wpwraklekernel: does the FPGA take its master clock from the video codec ? or is there some other crystal ?14:44
wpwrakah, Y2 .. so there must be a Y1 ...14:45
lekernelfor configuration, it uses an internal oscillator14:45
wpwrak(found Y1, it's audio)14:46
wolfspraulpld load bitstream is interesting, we should have a little script for that too14:46
wolfsprauljust in case14:46
wolfspraulbut most likely the test program would then fail accessing the nor, no?14:46
wpwrak(internal) okay, so no risk of weird clock due to an unconfigured oscillator14:46
wolfsprauldefinitely something to try though14:46
wpwrakmaybe crosstalk, reflections, ...14:47
lekernelwhy would they happen suddenly?14:47
lekernelalso the jtag interface has a very slow clock14:48
wpwrakmaybe it happens all of the time, just barely below the threshold14:48
wpwrak(jtag) so fjmem.bit is clocked by jtag, not the internal osc ?14:49
wolfspraulxiangfu: can you make a script that uses urjtag to pld load the soc and then runs the test program, all without accessing the nor chip?14:50
lekernelapparently there's some clock in fjmem too14:52
lekernelbut failure of the clock would not impede configuration14:53
lekernelthe dim LEDs would go off at power up even if the 50MHz oscillator has failed14:53
wpwrakah, by mwalle. just when he left for vacation !14:56
lekerneli don't see fjmem related to the board failure problem14:56
wpwrakjust a place where one could put diagnostic things. after all, it's one of the areas where we do experience the problem14:57
wpwraklet me give the NOR data sheet a careful read ...14:57
lekernelthe whole things looks as if the NOR chip stops responding now14:58
lekernelmaybe its reset pin is held active by a crappy reset circuit?14:58
wpwrakthat's one possibility15:01
wpwrakwhat's surprising then is that rc2 would suffer too. but of course, if we're seeing two different bugs, that would explain it15:01
wpwrake.g., rc2 corrupts NOR due to absent reset circuit, while rc3 fails to read NOR due to present but faulty reset circuit15:01
wpwrakread or write, maybe15:02
lekernelI have not had or heard of a non-reflashable rc2 board ...15:02
lekernelwolfspraul, do you have rc2 boards that can't be reflashed?15:02
lekernelwith a fully non responsive flash chip?15:02
wolfspraulcould be, wait (checking wiki)15:03
wpwraklekernel: do we know the chip is unresponsive ? or does flashing just give up on the first offense ?15:03
lekernelit seems urjtag doesn't even detect the flash chip with CFI. of course this has to be confirmed.15:03
wolfspraulwe have to ask Adam tomorrow, there are several marked 'hold' http://en.qi-hardware.com/wiki/Milkymist_One_RC2_Test_Plan#Report_of_Milkymist_One_RC2_Board15:05
wolfspraul0x1A, 0x2C15:06
wolfspraulbut the problem with rc2 is that it's a smaller run15:06
wolfsprauland most boards are sent out, with the tougher rc3 checks more might have shown problems15:06
wolfspraulI don't think Adam has any single functioning rc2 board left15:07
wolfspraulbut there are 3-4 as 'hold', so they must have some issues15:07
wpwrakthere's always an internal pull-up on STS ?  (replacing DNP R61)15:07
wolfspraulbut I would hesitate to put any one of them into the same rc3 'flash problem' category we can see quite clearly now15:07
wolfspraulthat's why I'm so eager to focus on this _group_ we can see clearly on rc3 now15:07
wolfspraulso we don't get lost in some rare exotic cases15:08
lekernelI'm not sure STS is used anywhere15:08
wpwrakokay, let's ignore rc2 then. this means that we still have 0 confirmed NOR corruptions in rc3.15:08
lekernelfjmem has it15:08
wolfspraulwpwrak: yes, possible! [rc2 corrupts nor but rc3 has a different problem]15:09
wolfspraulthere is also still the chance that the 4.4v reset ic will do the magic15:10
wolfspraulalthough verifying that will be hard15:10
wolfspraulSebastien said "urjtag doesn't even detect the flash chip with CFI"15:16
wolfspraulhe was probably looking at some log, but which one?15:16
wolfsprauland what does that mean? the detection is probably also a longer sequence of signals going back and forth15:16
wolfspraulcould still be on the fpga or nor side, or in between :-)15:16
xiangfuthere is 'debug all' in jtag. which will output a loooot of message :D15:21
wolfsprauladd it as a commented-out line to all scripts15:22
wolfspraulthat's a good start :-)15:22
kristianpaulhum, a common thing is also the clock that feeds fpga main clock (clk50), but i guess this is unlikely a problem..15:25
kristianpaulyeah, pld load will confirm at least fpga is okay, if you can get full soc to load, afaik bios still need to be read from NOR15:28
xiangfukristianpaul, yes. I will try to fully load the soc tomorrow. I may need to build a loadable soc bitstream tomorrow.15:34
Fallenouhum I have troubles compiling rtems , it fails in the zlib part15:34
kristianpaulzlib magically solves by recompiling it again i think15:35
kristianpaulcompile and install15:35
kristianpaulwell that was time agoooo15:35
FallenouI guess now flickernoise is using rtems' zlib, since it's no longer in the requirements on the wiki15:35
Fallenoutoo bad rtems zlib does not compile :o15:36
Fallenouat least on my mac os15:36
wpwrakmaybe it's just an analog domain problem on the signals we can't tell by looking at the schematics. or maybe it's a chain of events that sets off the trouble.15:38
wpwrakanyway, next step: try to boot 0x39. while it does reconfigure, reboot. when it fails to reconfigure, try to read back the NOR. that should clarify the NOR corruption theory.15:39
wpwrak(at least a little :)15:40
lekernelFallenou, (zlib issue) please post that to the RTEMS mailing list; I have it, JP Bonn has it, and my friend Ralf is denying any problem exists16:41
kristianpaul(denying) that used to happen :-)16:41
Fallenoulekernel: ahah ok16:51
Fallenoulekernel: you opened a PR ?16:52
lekernelno I posted on the ML and all I got was stupid replies from Ralf16:52
Fallenouok will post and then open a PR16:53
Fallenoulekernel: how do you workaround it ?16:53
lekerneltypedef long z_off64_t in zconf.h16:53
Fallenouok found your e-mail17:00
Fallenoubootstrap is damn slow17:20
Action: Fallenou is testing with their cvs head17:20
Fallenoulekernel: it does work with their lm32_evr cvs head : http://pastebin.com/5JHfDyLj17:30
FallenouI did the same steps as in your last e-mail about this problem17:30
Fallenouit built all lm32_evr bsp without any error17:30
Fallenouwill try milkymist bsp from their cvs head17:31
lekernellm32_evr didn't work for me either17:35
Fallenoufor me it did work17:36
Fallenougotta go bbl17:36
Thihihttp://kukka.siilo.fi/~kuutio/11-08-13-kissastuskausi.mkv - you guys might be interested in this. A small sample of what I do with a projector and a camera. Music has been ripped off from Boards of Canada.20:01
Action: wpwrak wonders if the NOR problem could still be INIT_B -> FLASH_RESET contamination22:08
wpwrake.g., if "fix2" has a design flaw or if it frequently gets implemented in the wrong way22:09
wpwrakone test could be to remove D16 (FLASH_RESET_N to reset out). this should then remove any contamination, but may bring back the NOR corruption.23:13
wpwrakoh, and an alternative to using logic gates instead of the diodes in rc4 could be to have a second reset chip, dedicated on FLASH_RESET_N.23:13
wpwrakthat could also be used to test whether properly separating FLASH_RESET_N from PROGRAM_B_2 and INIT_B would solve all the NOR problems. i.e., remove D16, add a reset chip in parallel to the existing one, and let it drive exclusively FLASH_RESET_N23:21
--- Tue Aug 16 201100:00
