Streaming Data Project


 

Streaming isn't just for Netflix aficionados! This project will focus on streaming data, using tools like Apache Kafka and ksqlDB. The data source that we will use for Extraction will be a Python script that generates random text. This will be ingested by Kafka and transformations will be done in ksqlDB. Further transformation and loading of the data will be handled in a future post.

Let's Start With Some Text!

As mentioned, this process will start with a python script that will generate oodles of random sentences. This functionality is brought to us by the module "Faker," which is described as "a Python package that generates fake data for you." We're also going to use a Kafka connector for python, kafka-python, which will help us generate a data stream and shove it into Kafka's hungry maw.

Once they're installed via pip (or pip3, as some would have it), the import string is straightforward:

from kafka import KafkaProducer

from faker import Faker

Next, we have a variable, just to make the upcoming string a bit easier:

fake = Faker()

We then have a connection string that will send everything to Kafka:

producer = KafkaProducer(bootstrap_servers=['localhost:9092'])

Finally, we have a while loop that will keep generating random sentences until the lights go out:

while True:

    message = fake.sentence()

    producer.send('my-topic', message.encode('utf-8'))


Setting up Kafka:

Although I could have set up everything I needed with a Docker container, sometimes it's good to get your hands dirty and install everything yourself. All of the subsequent error messages that you'll doubtless encounter will hone your troubleshooting skills and give you a closer connection to the technologies that you're working with. In this case, Kafka was installed on Linux Mint, using the latest build directly from the Apache website. Once installed, the "bin" folder is your central base of operations and there are some commands that you will usually need to get started - including initializing Zookeeper, which Kafka cannot run without:

  • bin/zookeeper-server-start.sh config/zookeeper.properties
  • bin/kafka-server-start.sh config/server.properties
  • bin/kafka-console-producer.sh --topic my-topic --bootstrap-server localhost:9092
  • bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my-topic --from-beginning

Of course, you should probably pick something more descriptive and less boring than "my-topic" for your topic name, but for our purposes this will do. If you already ran the python script that generates the "Faker" text, then you'll see a big stream of sentences in your Kafka producer:


This is all working as intended. Mind you, there's an endless amount of sources that you can use to ingest streaming data with Kafka: IoT devices and sensors, stock market APIs, flat files, and of course, the venerable Twitter. Experimenting with all of these is encouraged but for now, we have a simple, workable example of streaming text to work with. 

Setting Up KsqlDB:

Next in the chain, we have ksqlDB, which was also downloaded directly from its Apache source. Once downloaded, a look at their quick-start documentation was tremendously useful. If you have Kafka already running, then you just have to create a stream:

CREATE STREAM myStream (message VARCHAR) WITH (kafka_topic='my-topic', value_format='delimited');

and then you're running queries in no time:

ksql> SELECT COUNT(*) FROM myStream EMIT CHANGES;

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

|KSQL_COL_0                                                                                                                                                                                  |

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

|21335                                                                                                                                                                                       |

|45167                                                                                                                                                                                       |

|68530                                                                                                                                                                                       |

|90752                                                                                                                                                                                       |

ksql> SELECT * FROM myStream WHERE message LIKE '%society%';
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|MESSAGE                                                                                                                                                                                     |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Ability idea society amount tough begin.                                                                                                                                                    |
|Sell such significant society policy product read within.                                                                                                                                   |
|Tend pull less need early social better society.                                                                                                                                            |
|Bit drive society attorney the.                                                                                                                                                             |
Of course, these queries are all of a very simple type, but in the next installment, we'll look at creating tables, combining streams, and loading the data to an external endpoint. Happy streaming!

Comments

Popular posts from this blog

The Basics of IICS

Real Estate Data Pipeline, Part 1

Imperial to Metric Conversion (and vice-versa) Script