@asdf Is that you?
I am working on a new miner, the threading is getting complicated
it should close the gap between NVME and ram mining. Not entirely, but closer anyways.
how much difference between NVME and RAM mining so far?
Currently, pretty big. A good RAM miner would be about 4 MH/s. A solid NVME miner would be about 100kh/s.
It should be noted the RAM miner will probably cost at least $7k or so and the NVME could probably be $400
But with my new setup, in theory that same NVME miner rig could do maybe 700kh/s
Assuming it has an ok chunk of ram to work with (like 32gb or so)
but that is numbers in a spreadsheet right now. I need to write a good bit of code with a fairly crafty thread model without overwhelming the garbage collection or creating a NUMA nightmare
so we shall see
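To put those numbers side by side, here is a rough sketch of hash rate per dollar. It uses only the figures quoted above (4 MH/s for about $7k, 100 kh/s for about $400, and the hoped-for 700 kh/s on the same $400 rig); the class and method names are made up for illustration.

```java
public class HashPerDollar {
    /** Returns kh/s obtained per dollar of hardware. */
    static double khsPerDollar(double khs, double costUsd) {
        if (costUsd <= 0) throw new IllegalArgumentException("cost must be positive");
        return khs / costUsd;
    }

    public static void main(String[] args) {
        double ram = khsPerDollar(4000.0, 7000.0);     // 4 MH/s RAM rig, ~$7k
        double nvme = khsPerDollar(100.0, 400.0);      // 100 kh/s NVMe rig, ~$400
        double nvmeNew = khsPerDollar(700.0, 400.0);   // theoretical 700 kh/s, same rig

        System.out.println("RAM rig:            " + ram + " kh/s per $");
        System.out.println("NVMe rig (today):   " + nvme + " kh/s per $");
        System.out.println("NVMe rig (planned): " + nvmeNew + " kh/s per $");
    }
}
```

On those figures the RAM rig works out to roughly 0.57 kh/s per dollar, the current NVMe rig to 0.25, and the spreadsheet version of the new miner to 1.75, which is why closing the gap matters even if RAM stays faster in absolute terms.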
really a big gap
I am new here, and figuring out how to get a start since I don't have RAM OR NVME in hand at the moment.
You can run a node and buy a little SNOW on an exchange
you don't need to mine
but if you want to mine, if you have a computer with a free M.2 slot, you can get an NVME for not much money
The Intel 760P line seems like the best bang for buck
the Samsung 970 PRO drives are nice as well
OK, thanks
Roughly how much SNOW would I get if I buy an Intel 760P to mine?
Assuming you get 56kh/s, use the calculator on http://snowblossom-explorer.org/
really not much: less than 0.5 SNOW/day even if I have 100kh/s, according to the calculator.
yeah
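For anyone who wants to sanity-check a calculator result by hand, the usual back-of-the-envelope estimate is your share of the network hash rate times the coins issued per day. The sketch below assumes that model; it is not taken from the explorer's calculator, and the network rate, blocks per day and block reward in `main` are placeholders, not real Snowblossom values.

```java
public class MiningEstimate {
    /**
     * Expected coins per day for a miner with a fixed share of the network hash rate.
     * myRate and networkRate must use the same unit (kh/s here).
     */
    static double coinsPerDay(double myRate, double networkRate,
                              double blocksPerDay, double blockReward) {
        if (networkRate <= 0) throw new IllegalArgumentException("networkRate must be positive");
        return (myRate / networkRate) * blocksPerDay * blockReward;
    }

    public static void main(String[] args) {
        // Placeholder inputs for illustration only. Look up the real values
        // on the explorer before trusting any number that comes out of this.
        double networkRate = 1_000_000.0; // kh/s, hypothetical
        double blocksPerDay = 144.0;      // hypothetical
        double blockReward = 50.0;        // hypothetical

        System.out.println(coinsPerDay(56.0, networkRate, blocksPerDay, blockReward));
    }
}
```

This is an average over a long period; a solo miner at that share would see long stretches of nothing, which is what pools smooth out.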
can I connect to a testnet to test my hardware performance?
For mining? Not really. The testnet uses very small fields so you'll end up with it all in cache
No reason to not test mining performance on mainnet
OK, got it
Anyone try NVME RAID 0 before?
yes, do raid 1 until you cannot fit the field anymore
Intel VROC would be interesting, AFAIK no one tried that yet
how much improvement?
nothing unexpected either way
Hey everyone, been a while since I've been around here. just saying hi! I see we're on a new field
have you guys settled on the logo yet?
i might throw something together if you're still open to submissions
Not really settled
@Rotonen I need some sort of guide on NUMA programming
I think you have an idea of where I am at, but I am just guessing on how to write code to work well in a NUMA setup
Basically, trying to reduce the number of cross thread locks, since they could be cross socket as well.
Also trying to have data used by a single thread until needed
but I might be doing everything wrong
If you take a look at https://github.com/snowblossomcoin/snowblossom/blob/master/miner/src/surf/MagicQueue.java that might help
```
package snowblossom.miner.surf;

import java.nio.ByteBuffer;
import java.util.Map;
import java.util.HashMap;
import java.util.LinkedList;

/**
 * Data optimization based on guesses about how NUMA works
 * and trying to keep things simple for the GC. So probably all wrong.
 */
public class MagicQueue
{
  /**
   * Collection of ByteBuffers for each bucket, ready to be read
   */
  private final LinkedList<ByteBuffer>[] global_buckets;
  private final int max_chunk_size;

  /**
   * Each thread accumulates data in this map before it is saved
   * to the global buckets.
   */
  private final ThreadLocal<Map<Integer, ByteBuffer> > local_buff;

  @SuppressWarnings("unchecked") // generic array creation
  public MagicQueue(int max_chunk_size, int bucket_count)
  {
    this.max_chunk_size = max_chunk_size;
    global_buckets = new LinkedList[bucket_count];
    for(int i=0; i<bucket_count; i++)
    {
      global_buckets[i] = new LinkedList<>();
    }

    local_buff = new ThreadLocal<Map<Integer, ByteBuffer>>() {
      @Override
      protected Map<Integer,ByteBuffer> initialValue()
      {
        return new HashMap<Integer, ByteBuffer>(bucket_count*2+1, 0.5f);
      }
    };
  }

  /**
   * Returns a ByteBuffer that is ready to accept writes up to data_size
   * as needed. Might already have data in it. Can only be used in this thread.
   * Might not get saved to the global bucket until flush is called.
   */
  public ByteBuffer openWrite(int bucket, int data_size)
  {
    Map<Integer, ByteBuffer> local = local_buff.get();
    ByteBuffer current = local.get(bucket);
    if (current != null)
    {
      if (current.remaining() >= data_size) return current;

      // Not enough room left: hand the buffer over to the global bucket.
      // writeToBucket takes the bucket lock and either merges or appends it.
      writeToBucket(bucket, current);
    }

    ByteBuffer fresh = ByteBuffer.allocate(max_chunk_size);
    local.put(bucket, fresh);
    return fresh;
  }

  /**
   * @param data A byte buffer open for writes
   */
  private void writeToBucket(int bucket, ByteBuffer data)
  {
    LinkedList<ByteBuffer> lst = global_buckets[bucket];
    synchronized(lst)
    {
      ByteBuffer last = lst.peekLast();
      if ((last != null) && (last.remaining() >= data.position()))
      {
        // Small enough to merge into the tail buffer
        data.flip();
        last.put(data);
      }
      else
      {
        lst.add(data);
      }
    }
  }

  /**
   * Returns null or a ByteBuffer with position 0 and limit set to how much data is there,
   * ready for reading.
   */
  public ByteBuffer readBucket(int bucket)
  {
    LinkedList<ByteBuffer> lst = global_buckets[bucket];
    synchronized(lst)
    {
      ByteBuffer bb = lst.poll();
      if (bb == null) return null;
      bb.flip();
      return bb;
    }
  }

  /**
   * Pushes everything the calling thread has buffered into the global buckets.
   */
  public void flushFromLocal()
  {
    for(Map.Entry<Integer,ByteBuffer> me : local_buff.get().entrySet())
    {
      int b = me.getKey();
      ByteBuffer bb = me.getValue();
      writeToBucket(b, bb);
    }
    local_buff.get().clear();
  }
}
```
@Fireduck step 1, C
@Fireduck or rethink as IPC across thread pool processes and use numactl
@Fireduck otherwise you'll slam into the wall of the GC not yet being numa aware
see JEP 345 and associated discussions
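A minimal sketch of the "one process per NUMA node plus numactl" idea, assuming a Linux box with `numactl` installed. It only builds (and optionally starts) the command lines; `--cpunodebind` and `--membind` are real numactl options, but the node count, the `miner-worker.jar` name and the `--run` switch are hypothetical, and how the workers talk to each other (gRPC, pipes, whatever) is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

public class NumaLauncher {
    /** Builds a numactl command line that pins one worker process to a single NUMA node. */
    static List<String> commandForNode(int node, List<String> workerCommand) {
        List<String> cmd = new ArrayList<>();
        cmd.add("numactl");
        cmd.add("--cpunodebind=" + node); // run only on this node's CPUs
        cmd.add("--membind=" + node);     // allocate only from this node's memory
        cmd.addAll(workerCommand);
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        int nodes = 2; // assumption: two sockets; check with `numactl --hardware`
        List<String> worker = List.of("java", "-jar", "miner-worker.jar"); // hypothetical jar
        boolean run = args.length > 0 && args[0].equals("--run");

        List<Process> procs = new ArrayList<>();
        for (int n = 0; n < nodes; n++) {
            List<String> cmd = commandForNode(n, worker);
            System.out.println(String.join(" ", cmd));
            // Only start real processes when explicitly asked to
            if (run) procs.add(new ProcessBuilder(cmd).inheritIO().start());
        }
        for (Process p : procs) p.waitFor();
    }
}
```

Each JVM then has its heap and its threads on one node, so the GC not being NUMA-aware stops mattering; the cost is that anything shared has to cross a process boundary.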
on a brief literature overview of the past 5 years, seems someone in Glasgow has been doing something close enough to the byte queue you're trying to go for http://www.dcs.gla.ac.uk/~jsinger/pdfs/lcpc15.pdf
ha, that does look like the same problem
I am probably just prematurely optimizing anyways
is there anything which *needs* to be reaped mining-time?
just turn off the GC?
or try go - nigh native gRPC, goroutines suit the problem well, windows is a first class citizen in the packaging ecosystem
and that turns the GC off for the objects, yeah :smile:
I can avoid the GC pretty well by just reusing a lot of bytebuffers
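Roughly what "reusing a lot of bytebuffers" can look like, as a minimal sketch rather than what the miner actually does. `BufferPool` and its methods are names invented here; only `ByteBuffer` and `ConcurrentLinkedQueue` are standard library.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedQueue;

public class BufferPool {
    private final ConcurrentLinkedQueue<ByteBuffer> free = new ConcurrentLinkedQueue<>();
    private final int bufferSize;

    public BufferPool(int bufferSize) {
        this.bufferSize = bufferSize;
    }

    /** Hands out a cleared buffer, allocating a new one only when the pool is empty. */
    public ByteBuffer acquire() {
        ByteBuffer bb = free.poll();
        if (bb == null) return ByteBuffer.allocate(bufferSize);
        bb.clear(); // position 0, limit = capacity; old contents get overwritten
        return bb;
    }

    /** Returns a buffer to the pool. The caller must not use it afterwards. */
    public void release(ByteBuffer bb) {
        if (bb.capacity() != bufferSize) {
            throw new IllegalArgumentException("buffer does not belong to this pool");
        }
        free.add(bb);
    }

    public static void main(String[] args) {
        BufferPool pool = new BufferPool(1024);
        ByteBuffer a = pool.acquire();
        a.putLong(42L);
        pool.release(a);

        ByteBuffer b = pool.acquire();
        System.out.println(a == b);        // true: the same object came back
        System.out.println(b.position());  // 0: it was cleared on acquire
    }
}
```

A shared pool like this is itself a cross-thread touch point, so on a multi-socket box a per-thread pool (for example a `ThreadLocal` holding an `ArrayDeque`) may fit the "keep data with one thread" goal better. That is a guess to benchmark, not a measured result.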
on a boxes and arrows level that library should do everything you need, actually
yes, but you cannot set per object policies
At this point, I am not up to turning off GC
I'd be more likely to implement all the mining in C++ rather than try to do that
go or rust would be sexy
grpc on rust looks painful, though
go is terrible. rust is probably terrible.
I have a strong dislike of "sexy" technologies
i like go, 'if i can think it in bash pipes xargs parallel, i can think it in go'
i guess that's about the aim of the whole thing too
a generalized unixpipe emulator
yeah, for me there are exploratory learn new tech projects
and get shit done projects
and they make very different choices
i suppose you get paid for the first kind? :stuck_out_tongue:
well, at work we use scala, so yeah
I like the concept of functional programming, but in practice it makes me a little crazy
but seriously, browse through this for the how and see about taking a stab at it yourself https://github.com/xerial/jnuma ("A Java library for accessing NUMA (Non Uniform Memory Access) API")
yeah, functional is fine, until you *need* a side effect
which is about every program ever, which actually *did* something
why write any code at all if you didn't want a side effect?
right
going through the tests seems a good starting point for diving into 'how is this thing done' https://github.com/xerial/jnuma/blob/develop/src/test/scala/xerial/jnuma/NumaTest.scala
```
/*
 * Copyright 2012 Taro L. Saito
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
//--------------------------------------
//
// NumaTest.scala
// Since: 2012/11/22 2:24 PM
//
//--------------------------------------

package xerial.jnuma

import util.Random
import java.nio.{ByteOrder, ByteBuffer}
import java.util
import java.io.{OutputStream, FileOutputStream}

/**
 * @author leo
 */
class NumaTest extends MySpec {

  "Numa" should {

    "report NUMA info" taggedAs ("report") in {
      val available = Numa.isAvailable
      val numNodes = Numa.numNodes()
      debug("numa is available: " + available)
      debug("num nodes: " + numNodes)
      for (i <- 0 until numNodes) {
        val n = Numa.nodeSize(i)
        val f = Numa.freeSize(i)
        debug("node %d - size:%,d free:%,d", i, n, f)
      }

      val nodes = (0 until numNodes)
      for (n1 <- nodes; n2 <- n1 until numNodes) {
        val d = Numa.distance(n1, n2)
        debug("distance %s - %s: %d", n1, n2, d)
      }

      def toBitString(b: Array[Long]) = {
        val s = for (i <- 0 until Numa.numCPUs()) yield {
          if ((b(i / 64) & (1L << (i % 64))) == 0) "0" else "1"
        }
        s.mkString
      }

      for (node <- nodes) {
        val cpuVector = Numa.nodeToCpus(node)
        debug("node %d -> cpus %s", node, toBitString(cpuVector))
      }

      val numCPUs = Runtime.getRuntime.availableProcessors();

      val affinity = (0 until numCPUs).par.map { cpu =>
        Numa.getAffinity()
      }
      debug("affinity: %s", affinity.map(toBitString(_)).mkString(", "))

      val preferred = (0 until numCPUs).par.map { cpu =>
        Numa.runOnNode(cpu % numNodes)
        Numa.setPreferred(cpu % numNodes)
        val n = Numa.getPreferredNode
        Numa.runOnAllNodes()
        n
      }
      debug("setting prefererd NUMA nodes: %s", preferred.mkString(", "))

      val s = (0 until numCPUs).par.map { cpu =>
        Numa.setAffinity((cpu + 1) % numCPUs)
        if (cpu % 2 == 0)
          (0 until Int.MaxValue / 10).foreach { i => }
        Numa.getAffinity()
      }
      debug("affinity after setting: %s", s.map(toBitString(_)).mkString(", "))

      val r = (0 until numCPUs).par.map { cpu =>
        Numa.resetAffinity()
        Numa.getAffinity()
      }
      debug("affinity after resetting: %s", r.map(toBitString(_)).mkString(", "))
    }

    "allocate local buffer" in {
      for (i <- 0 until 3) {
        val local = Numa.allocLocal(1024)
        Numa.free(local)
      }
    }

    "allocate buffer on nodes" in {
      val N = 100000

      def access(b: ByteBuffer) {
        val r = new Random(0)
        var i = 0
        val p = 1024
        val buf = new Array[Byte](p)
        while (i < N) {
          b.position(r.nextInt(b.capacity() / p) * p)
          b.get(buf)
          i += 1
        }
      }

      val bl = ByteBuffer.allocateDirect(8 * 1024 * 1024)
      val bj = ByteBuffer.allocate(8 * 1024 * 1024)
      val b0 = Numa.allocOnNode(8 * 1024 * 1024, 0)
      val b1 = Numa.allocOnNode(8 * 1024 * 1024, 1)
      val bi = Numa.allocInterleaved(8 * 1024 * 1024)

      time("numa random access", repeat = 10) {
        block("direct") { access(bl) }
        block("heap") { access(bj) }
        block("numa0") { access(b0) }
        block("numa1") { access(b1) }
        block("interleaved") { access(bi) }
      }

      Numa.free(b0)
      Numa.free(b1)
      Numa.free(bi)
    }

    def radixSort8(buf: ByteBuffer) = {
      val K = 256
      val N = buf.capacity()
      val pile = Array.ofDim[Int](K)
      // count frequencies
      buf.position(0)
      for (i <- 0 until N)
        pile(buf.get(i) + 128) += 1
      // count cumulates
      for (i <- 1 until K) {
        pile(i) += pile(i - 1)
      }

      def split {
        for (i <- 0 until N) {
          var e = buf.get(i)
          var toContinue = true
          while (toContinue) {
            val p = e + 128
            val pileIndex = pile(p) - 1
            pile(p) -= 1
            if (pileIndex < i)
              toContinue = false
            else {
              val tmp = buf.get(pileIndex)
              buf.put(pileIndex, e)
              e = tmp
            }
          }
          buf.put(i, e)
        }
      }
      split
    }

    def radixSort8_local(buf: ByteBuffer) = {
      val K = 256
      val N = buf.capacity() - (2 * 4 * K)
      val countOffset = buf.capacity() / 4
      val pileOffset = countOffset + K
      // count frequencies
      buf.position(0)
      for (i <- 0 until K) {
        buf.putInt(countOffset + i * 4, 0)
      }
      for (i <- 0 until buf.capacity()) {
        val ch = buf.get(i) + 128
        val prevCount = buf.getInt(countOffset + ch * 4)
        buf.putInt(countOffset + ch * 4, prevCount + 1)
      }
      // count cumulates
      for (i <- 0 until K) {
        val prev = if (i == 0) 0 else buf.getInt(countOffset + (i - 1) * 4)
        val current = buf.getInt(countOffset + i * 4)
        buf.putInt(pileOffset + i * 4, prev + current)
      }

      def split {
        for (i <- 0 until N) {
          var e = buf.get(i)
          var toContinue = true
          while (toContinue) {
            val p = e + 128
            val pileIndex = buf.getInt(pileOffset + p * 4) - 1
            buf.putInt(pileOffset + p * 4, pileIndex)
            if (pileIndex < i)
              toContinue = false
            else {
              val tmp = buf.get(pileIndex)
              buf.put(pileIndex, e)
              e = tmp
            }
          }
          buf.put(i, e)
        }
      }
      split
    }

    def radixSort8_array(b: Array[Byte]) = {
      val K = 256
      val N = b.length
      val count = new Array[Int](K + 1)
      // count frequencies
      for (i <- 0 until N) {
        count(b(i) + 128) += 1
      }
      // count cumulates
      for (i <- 1 to K)
        count(i) += count(i - 1)

      def split {
        val pile = new Array[Int](K + 1)
        Array.copy(count, 1, pile, 0, K)
        for (i <- 0 until N) {
          var e = b(i)
          var toContinue = true
          while (toContinue) {
            val p = e + 128
            val pileIndex = pile(p) - 1
            pile(p) -= 1
            if (pileIndex < i)
              toContinue = false
            else {
              val tmp = b(pileIndex)
              b(pileIndex) = e
              e = tmp
            }
          }
          b(i) = e
        }
      }
      split
    }

    "perform microbenchmark" taggedAs ("bench") in {
      val bufferSize = 4 * 1024 * 1024
      when("buffer size is %,d".format(bufferSize))

      val numaBufs = (for (i <- 0 until Numa.numNodes()) yield "numa%d".format(i) -> Numa.allocOnNode(bufferSize, i)) :+ "numa-i" -> Numa.allocInterleaved(bufferSize)
      //val bdirect = ByteBuffer.allocateDirect(bufferSize)
      val bheap = ByteBuffer.allocate(bufferSize).order(ByteOrder.nativeOrder());
      val bufs = numaBufs ++ Map("heap" -> bheap)

      // fill bytes
      def fillBytes(b: ByteBuffer) = {
        var i = 0
        b.clear()
        while (b.remaining() > 0) {
          b.put(i.toByte)
          i += 1
        }
      }

      def fillI…
```
that is interesting
api-wise that thing covers everything i've seen you publicly cuss about, whether for not being in there or for not knowing how to tackle it
whether or not that's well done, that's something you need to test, benchmark, potentially modify
right
and it's apache 2.0 licensed