2018-11-20 21:54:48
ok, we need most bang for buck on worker nodes. 1g network, roughly 8 cores at 2.4ghz

Fireduck
2018-11-20 21:55:21
so call it 19 ghz

Fireduck
2018-11-20 22:09:49
bang per buck is tough if you do not limit the power budget to something sensible, as refurbished high-end enterprise gear will get you quite far

Rotonen
2018-11-20 22:10:04
ha

Fireduck
2018-11-20 22:10:25
So that ~20ghz bullshit number I just made up will get about 1MH/s and about max out the gigE link

Fireduck
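
Back-of-envelope for that gigE ceiling, assuming the PoW costs 6 snowfield reads of 16 bytes per hash attempt (the 16-byte read size comes up later in this log; the 6-reads-per-hash count is an assumption here):
```bash
hashrate=1000000        # 1 MH/s
reads_per_hash=6        # assumption: 6 snowfield reads per PoW attempt
read_bytes=16           # each read is 16 bytes
echo "$((hashrate * reads_per_hash * read_bytes / 1000000)) MB/s of field reads"
# -> 96 MB/s, which is about all a ~125 MB/s (raw) gigE link will carry
```
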
2018-11-20 22:10:28
so that is the top end

Fireduck
2018-11-20 22:11:13
that can be done fanless

Rotonen
2018-11-20 22:12:05
Thinking something like that. ~$500

Fireduck
2018-11-20 22:15:30
maybe even load 3 or 4 gb of the field in local memory

Fireduck
2018-11-20 22:30:37
why the ryzen?

Rotonen
2018-11-20 22:31:31
also that power supply is way overspec for that, you'll not be in the efficient range of that thing

Rotonen
2018-11-20 22:31:56
good

Fireduck
2018-11-20 22:32:03
why not ryzen?

Fireduck
2018-11-20 22:32:09
seems to be a good price point

Fireduck
2018-11-20 22:32:23
meh, i'd go for a sub-100$ cpu for that

Rotonen
2018-11-20 22:32:36
it's still dual channel memory

Rotonen
2018-11-20 22:32:52
the limitation I am working around is not the memory channel in this case

Fireduck
2018-11-20 22:33:36
also the gpu seems unnecessary if you can just boot those off pxe

Rotonen
2018-11-20 22:33:55
mostly for console on initial setup

Fireduck
2018-11-20 22:34:04
if I end up building more than one, I can move it around

Fireduck
2018-11-20 22:34:26
i'm an autodiscovery fabric kinda thinker when it comes to clustering stuff like that

Rotonen
2018-11-20 22:34:41
it should figure out on its own what it is and what it should do

Rotonen
2018-11-20 22:34:59
figure out the setup once and just hook more things into the network

Rotonen
2018-11-20 22:35:06
agreed

Fireduck
2018-11-20 22:35:34
current plan is we have a few nodes that can store and serve field 8 from ram

Fireduck
2018-11-20 22:35:42
put those on 10g switch ports

Fireduck
2018-11-20 22:35:52
then hang a bunch of 1g workers off the switch

Fireduck
2018-11-20 22:36:06
i'm still not figuring out how you get enough data into the crunching part so the cpu makes any sense

Rotonen
2018-11-20 22:36:18
would need boxes and arrows kinda representations

Rotonen
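
Boxes and arrows for the plan as described above (node and worker counts are illustrative; the port speeds are the ones stated):
```
  [memory node]   [memory node]     <- field 8 held and served from RAM
       |10g            |10g
    +--+---------------+--+
    |        switch       |
    +--+----+----+----+---+
       |1g  |1g  |1g  |1g
   [worker][worker][worker][worker] ...   <- cheap 1g boxes doing the hashing
```
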
2018-11-20 22:36:35
Have you read my arktika notes on marshalling for network calls?

Fireduck
2018-11-20 22:36:46
yes, but the cpu still reads off ram

Rotonen
2018-11-20 22:37:13
sure, and the ram is plenty fast. ddr4 is like 40GB/s per channel or something

Fireduck
2018-11-20 22:37:43
you're read count limited and if you marshall into 4k you still get an up to factor of 256 improvement there

Rotonen
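
Reading the factor of 256 as the ratio of the marshalled message size to the raw read size (assuming 16-byte field reads packed into 4 KiB messages):
```bash
# one 4 KiB network message carries up to 4096 / 16 = 256 reads' worth of data,
# so batching turns 256 tiny random requests into a single round trip
echo $((4096 / 16))
```
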
2018-11-20 22:38:06
I am absolutely CPU limited currently :wink:

Fireduck
2018-11-20 22:38:12
I have lots of bandwidth and data

Fireduck
2018-11-20 22:39:08
based on the shares you post, i'm not really seeing that

Rotonen
2018-11-20 22:39:58
hehe, true

Fireduck
2018-11-20 22:40:24
I should switch back to field 7 for now

Fireduck
2018-11-20 22:41:20
i'm not even sure how one would metrify if any snowblossom mining is cpu limited or not

Rotonen
2018-11-20 22:41:36
the limiting factor is the 'it is not aliens' of astronomy for all things computers

Rotonen
2018-11-20 22:41:57
I have a mode for that

Fireduck
2018-11-20 22:42:10
I have an arktika layer that returns random results

Fireduck
2018-11-20 22:42:19
to see how fast you could mine if you were never waiting for data

Fireduck
2018-11-20 22:42:41
i'm aware of the fake layer, yes

Rotonen
2018-11-20 22:43:06
but that's missing a lot of pipeline blockers vs. real mining

Rotonen
2018-11-20 22:43:38
sure. I get similar results if I do network mining from a machine with the field in memory.

Fireduck
2018-11-20 22:43:43
i'm mostly trying to think of what happens on the micro ops level on a cpu when mining

Rotonen
2018-11-20 22:43:50
magic

Fireduck
2018-11-20 22:44:07
i'm still pegging you as the aliens guy on that one

Rotonen
2018-11-20 22:44:44
on bursts i can see that happening as there can be a neat queue to consume off of

Rotonen
2018-11-20 22:45:00
but it's still a flowthrough problem so eventually both sides hit the read size issue?

Rotonen
2018-11-20 22:46:13
unless there is something fancy in how a cpu directly consumes stuff as interrupts off the network interface on some higher end networking gear in the mix, i'm not figuring out how it misses the ram read density limits on either side

Rotonen
2018-11-20 22:50:16
could be i'm missing something the lower level data caches do when consuming off a neater queue

Rotonen
2018-11-20 22:52:07
yeah, I have no idea

Fireduck
2018-11-20 22:52:11
I just see what works

Fireduck
2018-11-20 22:52:46
if i did not misattribute your miners, you were doing 300k .. 400k at peak on the pool, which matches what two ram channels should be doing

Rotonen
2018-11-20 22:53:17
that was me having trouble with /dev/shm inefficiencies vs direct in process memory

Fireduck
2018-11-20 22:53:40
have you recently metrified fscache vs. memfield?

Rotonen
2018-11-20 22:53:41
once I switch back to field 7, you should see me go to about 1mh/s, which maxes out my 1g network

Fireduck
2018-11-20 22:53:52
between the memory node and the workers

Fireduck
2018-11-20 22:54:31
i'm not seeing performance differences on just reading off the disk when the field is cached vs. fiddling with memfield

Rotonen
2018-11-20 22:54:48
and way less of a hassle to not have to bother with the heap size nonsense

Rotonen
2018-11-20 22:54:52
probably cause you are CPU bound :wink:

Fireduck
2018-11-20 22:55:11
was pushing 8MH/s last time i metrified, of course possible

Rotonen
2018-11-20 22:55:20
reading as fast as I can from /dev/shm on my ddr2 box was about 18GB/s

Fireduck
2018-11-20 22:55:29
reading from direct in process ram was about 120GB/s

Fireduck
2018-11-20 22:55:50
that could be something strange about that system, I don't know

Fireduck
2018-11-20 22:56:05
that beast is a sparc, so most bets are off for comparability

Rotonen
2018-11-20 22:56:12
also solaris

Rotonen
2018-11-20 22:57:06
but cat the chunks to /dev/null once and see what you get 'off the disk'?

Rotonen
2018-11-20 22:57:09
i'm left curious

Rotonen
2018-11-20 22:59:55
I'll check that out

Fireduck
2018-11-20 23:00:13
not sure why that didn't occur to me

Fireduck
2018-11-20 23:00:43
also for emulating precache just dd with a blocksize and a count to /dev/null

Rotonen
2018-11-20 23:00:50
or even easier with a chunked field

Rotonen
2018-11-20 23:00:58
also xargs -P for doing that in parallel

Rotonen
2018-11-20 23:01:45
with dd you get a fun sequential read disk benchmark on top

Rotonen
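
A rough sketch of those checks for a chunked field on disk; the field directory and chunk names below are placeholders, the pattern is just the cat/dd/xargs flow described above:
```bash
FIELD=/path/to/snowblossom.7     # placeholder: directory holding the field chunks

# 'off the disk' / page cache check: stream every chunk to /dev/null and time it
time cat "$FIELD"/* > /dev/null

# emulate precache: fixed block size and count, and a free sequential-read benchmark
dd if="$FIELD"/chunk.0 of=/dev/null bs=1M count=1024

# same thing over all chunks, several readers in parallel
printf '%s\n' "$FIELD"/* | xargs -P 8 -I{} dd if={} of=/dev/null bs=1M
```
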
2018-11-20 23:03:10
I'll try it in a bit, I need to head out

Fireduck
2018-11-20 23:14:09
looking into this a bit more out of curiosity, it could be that the magic sauce for arktika is the translation lookaside buffer being parallel by default

Rotonen
2018-11-20 23:16:08
i have a curious hunch i've been benchmarking with set associative caches in between

Rotonen
2018-11-20 23:24:15
but i'm still not ultimately seeing how anything gets around a 128 byte fetch taking about 40ns

Rotonen
2018-11-20 23:26:36
unless something makes those 128 bytes luckier than they ought to be

Rotonen
2018-11-20 23:26:56
where queues and fetch ordering might actually play a role

Rotonen
2018-11-20 23:27:28
and/or pipelining execution order trickery and micro op fusions

Rotonen
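
Taking the 128-byte / ~40 ns figures at face value (plus the same 6-reads-of-16-bytes-per-hash assumption as above), the limit being talked around looks roughly like this:
```bash
# one fully serialized stream of independent random reads:
#   1 read / 40 ns             -> ~25 M reads/s
#   128 B / 40 ns              -> ~3.2 GB/s touched, but only 16 B of each fetch is useful
#   25 M reads/s / 6 per hash  -> ~4 MH/s ceiling per serialized stream
# anything past that has to come from overlapping misses (banks, channels,
# prefetch, more threads), not from raw bandwidth
echo "$((1000 / 40)) M reads/s, $((128 * 1000 / 40)) MB/s, ~$((1000 / 40 / 6)) MH/s"
```
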
2018-11-20 23:28:00
m68k was so simple in comparison to the modern madness

Rotonen
2018-11-20 23:31:01
what is 128 bytes?

Fireduck
2018-11-20 23:31:05
yeah, it's probably the uncore global queue stuff shuffling things around
https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf

Rotonen
2018-11-20 23:31:09
a dram access is 128 bytes

Rotonen
2018-11-20 23:31:19
ah

Fireduck
2018-11-20 23:31:52
the network message will be a tight 8k of data, if that all gets loaded in l2 cache or something it should be good

Fireduck
2018-11-20 23:32:54
but really, I don't know

Fireduck
2018-11-20 23:32:56
looking at the newer generations of that document, i'm seriously losing track of how the hell stuff works under the hood
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/xeon-e5-2600-v2-uncore-manual.pdf

Rotonen
2018-11-20 23:34:41
phew, they wrote one for dummies, let's see if one can still play catch-up
https://software.intel.com/en-us/articles/how-memory-is-accessed

Rotonen
2018-11-20 23:35:31
hm, 256 byte reads

Rotonen
2018-11-20 23:35:59
that'd mean i'm off by a factor of two on both the latency and read size, but i'm glad that works out about the same all the way through ddr2 to ddr4 :stuck_out_tongue:

Rotonen
2018-11-20 23:38:43
`Before 3D XPoint™ DIMMs, interleaving was done per one or two cache lines (64 bytes or 128 bytes), but DIMM non-volatile memory characteristics motivated a change to every four cache lines (256 bytes). So now four adjacent cache lines go to the same channel, and then the next set of four cache lines go to the next channel.`
ok that explains

Rotonen
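
A quick illustration of what that quoted paragraph means for channel mapping; this ignores any address-bit hashing a real controller may do, it's just the stated 'four adjacent 64-byte lines per channel' rule on a two-channel system:
```bash
# channel = (address / 256) mod number_of_channels
for addr in 0 64 128 192 256 320 384 448 512; do
  echo "addr ${addr}: channel $(( (addr / 256) % 2 ))"
done
# cache lines 0..3 (bytes 0-255) land on channel 0, lines 4..7 (256-511) on channel 1, ...
```
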
2018-11-20 23:41:27
oh nifty, they actually rolled their own performance counters for this stuff
https://software.intel.com/en-us/vtune-amplifier-help-getting-started

Rotonen
2018-11-20 23:44:16
@Fireduck you got really lucky that every new memory tech improvement is essentially eaten away by the CPUs wanting to do larger and larger fetches by default as 'no one does small fetches anymore' :smile:

Rotonen
2018-11-20 23:45:15
and the page cache should perform the same as memfield as far as i can tell

Rotonen
2018-11-20 23:46:14
also intel is clever with its performance tips as every so often there's `Use More of the Hardware` in there :stuck_out_tongue:

Rotonen
2018-11-20 23:46:43
this is reading almost like an arktika mining guide `The extra hardware may be an additional core, an additional processor, or an additional motherboard. ` :smile:

Rotonen
2018-11-20 23:47:36
Yeah, who the hell wants 16 bytes?

Fireduck
2018-11-20 23:49:16
damn, i think this might actually be why people on windows are struggling
https://stackoverflow.com/questions/44355217/why-doesnt-the-jvm-emit-prefetch-instructions-on-windows-x86...

Rotonen
2018-11-20 23:50:12
the intel guides tell you to either throw more threads at fetching data or to make all fetches through explicit `_mm_prefetch`, and the jvm does not do that on windows

Rotonen
2018-11-20 23:50:33
but more threads should do roughly the same, which explains why people have to ramp up a lot more and still not quite get there

Rotonen
2018-11-20 23:51:16
Interesting

Fireduck
2018-11-20 23:53:21
heh, you can hamstring your hardware by turning that off on the hardware level, did not know :stuck_out_tongue:
https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors Disclosure of H/W prefetchers control on some Intel processors This article discloses the MSR setting that can be used to control the various h/w prefetchers that are available on Intel processors based on the following microarchitectures: Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, and Broadwell.

Rotonen
2018-11-20 23:53:31
i guess that'll help debugging ghosts in the machine

Rotonen
2018-11-20 23:54:19
though you'd have to roll your own efi script between the bios handoff and your boot loader to do that

Rotonen
2018-11-20 23:54:23
or roll your own efi

Rotonen
2018-11-20 23:54:59
oh god

Fireduck
2018-11-20 23:55:07
nope, someone figured a way around
https://01.org/msr-tools

Rotonen
2018-11-20 23:55:26
and intel did it itself apparently :smile:

Rotonen
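
For reference, a sketch of doing that with msr-tools; this assumes one of the Intel generations the disclosure article covers (Nehalem up through Broadwell), where MSR 0x1a4 bits 0-3 gate the four hardware prefetchers (bit set = prefetcher off):
```bash
sudo modprobe msr            # expose /dev/cpu/*/msr for rdmsr/wrmsr
sudo rdmsr -a 0x1a4          # read the current setting on every core
sudo wrmsr -a 0x1a4 0xf      # disable all four hardware prefetchers everywhere
sudo wrmsr -a 0x1a4 0x0      # put them back when done benchmarking
```
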
2018-11-20 23:56:49
but as the jvm added the prefetch stuff in 2001, i suppose no one is around to answer why they omitted windows and why no one ever got back around to it
https://bugs.openjdk.java.net/browse/JDK-4453409

Rotonen
2018-11-20 23:58:16
but i guess one is way off the happy path when looking into things where the relevant assembly comes warning-posted with underscore prefixes

Rotonen