Sampling from data streams

Sampling from a continuous stream of data is a useful technique for efficiently extracting information from a potentially large body of data. There are a couple of sampling strategies in the literature that vary in their degree of complexity. I'd like to introduce you to a rather simple sampling strategy that is easy to implement, easy to reason about, and might take you a long way before you have to go for more advanced solutions. I'm talking about Bernoulli sampling.
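The core idea is simple: every element that arrives from the stream is admitted into the sample independently with a fixed probability p. Here is a minimal sketch in Java of what that could look like; the class name `BernoulliSampler` and its API are my own for illustration, not taken from the article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Minimal sketch of a Bernoulli sampler: each item offered from the
// stream is kept independently with probability p. For a stream of n
// items, the expected sample size is n * p.
public class BernoulliSampler<T> {

    private final double p;
    private final Random random;
    private final List<T> sample = new ArrayList<>();

    public BernoulliSampler(double p, long seed) {
        this.p = p;
        this.random = new Random(seed); // seeded for reproducibility
    }

    // Offer the next item from the stream; keep it with probability p.
    public void offer(T item) {
        // nextDouble() is uniform over [0.0, 1.0), so the condition
        // holds with probability exactly p.
        if (random.nextDouble() < p) {
            sample.add(item);
        }
    }

    public List<T> sample() {
        return sample;
    }
}
```

Note that the sample size itself is random; only its expectation is n * p, which is the trade-off you accept for a sampler that needs just one random draw per item and no knowledge of the stream's length.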

more ...




Using Kafka for JUnit with Spring Kafka

The last articles gave a couple of examples of how to write Kafka-enabled integration tests at various levels of abstraction using Kafka for JUnit. For component tests, we kept the scenarios quite simple and built a minimal producer and consumer on top of the official kafka-clients library for Java. This is perfectly fine, and my personal recommendation is that you stick to this approach if you have any requirements that aren't particularly standard. Oftentimes, though, another abstraction layer on top of kafka-clients that integrates well with your chosen application framework will suffice. We will take a look at the Spring ecosystem for that matter. Hence, the question is: What do I need to do to write integration tests with Kafka for JUnit in the context of a Spring-based application that leverages Spring Kafka for its messaging capabilities?

more ...