Streaming Data Project
Streaming isn't just for Netflix aficionados! This project focuses on streaming data, using tools like Apache Kafka and ksqlDB. The data source we'll use for extraction is a Python script that generates random text; Kafka will ingest it, and transformations will be done in ksqlDB. Further transformation and loading of the data will be handled in a future post.
Let's Start With Some Text!
As mentioned, this process starts with a Python script that generates oodles of random sentences. This functionality is brought to us by the module "Faker," which is described as "a Python package that generates fake data for you." We're also going to use a Kafka client library for Python, kafka-python, which will help us generate a data stream and shove it into Kafka's hungry maw.
Once they're installed via pip (or pip3, as some would have it), the import statements are straightforward:
from kafka import KafkaProducer
from faker import Faker
Next, we create a Faker instance, which makes the upcoming calls a bit easier:
fake = Faker()
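If you're curious what fake.sentence() actually returns, it's a short, randomly assembled English sentence. Here's a rough stand-in built only on the standard library's "random" module; the word list and sentence shape are invented purely for illustration and have nothing to do with Faker's internals:

```python
import random

# A tiny vocabulary, purely for demonstration -- not Faker's.
WORDS = ["data", "stream", "topic", "broker", "message", "pipeline", "record", "cluster"]

def fake_sentence(min_words=4, max_words=8):
    # Pick a random number of random words, capitalize the first,
    # and end with a period -- roughly the shape Faker's sentence() produces.
    words = random.choices(WORDS, k=random.randint(min_words, max_words))
    return " ".join(words).capitalize() + "."

print(fake_sentence())
```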
We then create a producer that will send everything to Kafka:
producer = KafkaProducer(bootstrap_servers=['localhost:9092'])
Finally, we have a while loop that will keep generating random sentences until the lights go out:
while True:
    message = fake.sentence()
    producer.send('my-topic', message.encode('utf-8'))
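Since the loop above needs a live broker to actually run, here's a broker-free sketch of the same pattern, with a plain deque standing in for the topic and a fixed message count instead of an infinite loop. The names "fake_topic" and "send" are invented stand-ins that mimic the real kafka-python call, not its actual API:

```python
from collections import deque

fake_topic = deque()  # stands in for the Kafka topic

def send(topic, value):
    # Mimics KafkaProducer.send(): the value must already be bytes.
    assert isinstance(value, bytes)
    topic.append(value)

# Five canned sentences instead of an endless Faker loop.
for i in range(5):
    message = f"Sentence number {i}."
    send(fake_topic, message.encode('utf-8'))  # same encode step as the real script

print([m.decode('utf-8') for m in fake_topic])
```

The point of the sketch is the encode step: Kafka stores raw bytes, so the string has to be encoded before it is sent, exactly as in the real script above.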
Setting Up Kafka:
Although I could have set up everything I needed with a Docker container, sometimes it's good to get your hands dirty and install everything yourself. All of the subsequent error messages that you'll doubtless encounter will hone your troubleshooting skills and give you a closer connection to the technologies that you're working with. In this case, Kafka was installed on Linux Mint, using the latest build directly from the Apache website. Once installed, the "bin" folder is your central base of operations and there are some commands that you will usually need to get started - including initializing Zookeeper, which Kafka cannot run without:
- bin/zookeeper-server-start.sh config/zookeeper.properties
- bin/kafka-server-start.sh config/server.properties
- bin/kafka-console-producer.sh --topic my-topic --bootstrap-server localhost:9092
- bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my-topic --from-beginning
Of course, you should probably pick something more descriptive and less boring than "my-topic" for your topic name, but for our purposes this will do. If you already ran the Python script that generates the "Faker" text, then you'll see a big stream of sentences in your Kafka console consumer:
This is all working as intended. Mind you, there's no shortage of sources that you can use to ingest streaming data with Kafka: IoT devices and sensors, stock market APIs, flat files, and of course, the venerable Twitter. Experimenting with all of these is encouraged, but for now, we have a simple, workable example of streaming text.
Setting Up ksqlDB:
Next in the chain, we have ksqlDB, which was downloaded directly from its source, Confluent (unlike Kafka, ksqlDB is a Confluent project rather than an Apache one). Once downloaded, a look at the quick-start documentation was tremendously useful. If you already have Kafka running, then you just have to create a stream:
CREATE STREAM myStream (message VARCHAR) WITH (kafka_topic='my-topic', value_format='delimited');
and then you're running queries in no time:
ksql> SELECT COUNT(*) FROM myStream EMIT CHANGES;
+------------+
|KSQL_COL_0  |
+------------+
|21335       |
|45167       |
|68530       |
|90752       |
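The ever-growing numbers make sense once you remember that EMIT CHANGES is a push query: each new row updates the aggregate, and ksqlDB periodically emits the updated count rather than one final answer (the jumps of tens of thousands reflect batched updates, not missing rows). The behavior can be sketched in plain Python, with an invented three-row stream:

```python
def running_count(stream):
    # Emit the updated COUNT(*) after each arriving row, the way a
    # ksqlDB push query emits changes to an aggregate over time.
    count = 0
    for _row in stream:
        count += 1
        yield count

rows = ["First sentence.", "Second sentence.", "Third sentence."]
print(list(running_count(rows)))  # prints [1, 2, 3]
```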