In this article we'll discover an event-driven, stream-based way to create analytics solutions like Google Analytics, Plausible, Aptabase and others, using only existing tools.
We'll use the following tools to build a self-hosted analytics system.
| Tool | Purpose |
| --- | --- |
| Nginx | Reverse proxy: the main endpoint and rate limiter |
| Benthos | Stream processor: gathers analytics events |
| Kafka | Event store: stores all events |
| Vector | Log aggregator: aggregates the logs and loads them into the OLAP database |
| Grafana | Dashboard: real-time view and history browsing |
| ClickHouse | OLAP database: long-term storage for the events |
The requirements for this article are general knowledge of:
- Docker & Docker Swarm
- Nginx
- Stream processing in general
The Idea
By combining existing tools we can get a production-ready product, and all we need to do is keep the underlying technologies up to date.
Here's the end result:
Why?
The main advantage of self-hosted solutions is having control over your data.
Second, it's usually cheaper.
Last, this type of setup scales easily: need more throughput? Just add more nodes running the same services and put them all behind a load balancer.
Architecture
First, we'll use nginx as the only direct endpoint to the server. nginx will reverse proxy each request, based on the domain, to one of two inner endpoints:
- benthos for event capturing
- grafana for accessing the analytics dashboard
benthos will have an http_server input configured to handle incoming requests. benthos will then check whether the message is indeed meant for the app, add some metadata, deliver (produce) the data directly to a kafka topic, and respond with 201 (an empty successful response) to the user immediately.
The reason for that is to keep the event-capturing process as fast as possible, between 10-50ms total! Sending the event straight to kafka is a good way to ensure just that.
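To make this concrete, here's a rough sketch of what such a Benthos pipeline could look like. This is not the repo's actual benthos/config.yml: the project list, metadata field names and topic name are assumptions, and the immediate 201 response handling is omitted.

input:
  http_server:
    address: 0.0.0.0:4195   # the port nginx proxies to
    path: /

pipeline:
  processors:
    - mapping: |
        # keep the original event body
        root = this
        # enrich with request metadata exposed by the http_server input
        root.app = meta("X-App").or("unknown")
        root.user_agent = meta("http_server_user_agent")
        root.received_at = now()
        # drop events whose X-App header is not a known project
        # (in the real setup the list would come from the PROJECTS env var)
        root = if !["app", "website"].contains(meta("X-App").or("")) { deleted() }

output:
  kafka:
    addresses: [ "event-store:9092" ]
    topic: analytics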
On the other hand, we have a vector container subscribed to the kafka topic to receive the events. vector will process each event and add it to the OLAP database. Along the way, vector will parse the user agent, resolve the user's country and fill in other event details.
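And a similarly rough sketch of the Vector side. Again, this is not the repo's actual vector/config.yml: the consumer group, field names and ClickHouse table are assumptions, and the GeoLite country lookup is only hinted at in a comment.

sources:
  events:
    type: kafka
    bootstrap_servers: event-store:9092
    group_id: vector-analytics
    topics: [ analytics ]

transforms:
  enrich:
    type: remap
    inputs: [ events ]
    source: |
      # the Kafka payload is the JSON event Benthos produced
      . = object!(parse_json!(string!(.message)))
      # derive browser and OS details from the captured user agent
      .ua = parse_user_agent(string(.user_agent) ?? "")
      # the user's country is resolved from the bundled GeoLite mmdb (not shown here)

sinks:
  olap:
    type: clickhouse
    inputs: [ enrich ]
    endpoint: http://clickhouse:8123
    database: analytics
    table: events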
Then, an admin user can access the grafana dashboard to view data in real time or filter by dates.
Note: you could use either vector or benthos for both parts of the streaming process; two different services are used here to get familiar with both technologies.
Deployment Steps
In the lab repo you can find terraform files for a one-liner deployment of the N+B+K+V+G+C stack to Hetzner and Cloudflare (for DNS).
Files and Folders
Clone the lab repo and cd into 01-your-own-analytics/docker. You will see these files:
- docker-compose.yml - The main architecture file.
- .env - Contains all necessary environment variables, such as database passwords.
- nginx/analytics.conf - Nginx config for the analytics endpoint, with a request rate limit (see below).
- nginx/dashboard.conf - Nginx config for the Grafana dashboard endpoint.
- benthos/config.yml - Benthos config file.
- vector/config.yml - Vector config file.
- vector/mmdb - MaxMind GeoLite database.
- grafana/dashboards/*.json - Dashboard settings.
- grafana/provisioning/dashboards/clickhouse.yaml - Dashboard loading instructions.
- grafana/provisioning/datasources/clickhouse.xml - ClickHouse datasource settings.
- scripts/install.ts - Installer script that creates the ClickHouse database and table.
- (Optional) scripts/seed.ts - Seeds a year's worth of dummy data.
We use a docker-compose.yml file to describe the architecture for ease of use, and it should be suitable for most use cases. In larger environments you'll probably use Docker Swarm or a full-blown Kubernetes setup.
The project contains a justfile. just is a simple script runner, or as its docs put it, "just is a handy way to save and run project-specific commands."
You can download the right binary for your system here.
For local development with just you can run the command below. You'll be asked for a password; this is for editing your hosts file. Check inside the justfile for more details.
just install-local
To remove everything, run
just remove
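For reference, a just recipe is essentially a named list of shell commands. Something along these lines, though the repo's actual justfile differs:

install-local:
    # map the demo domain to localhost, hence the password prompt
    sudo sh -c 'echo "127.0.0.1 analytics.test" >> /etc/hosts'
    docker compose up -d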
Now, let's go through the deployment process.
Setting nginx endpoints
First, we need to update the domains inside the nginx/analytics.conf and nginx/dashboard.conf files. Here's the analytics configuration, for example.
limit_req_zone $binary_remote_addr zone=base_limit:10m rate=5r/s;
server {
...
server_name analytics.test;
location / {
limit_req zone=base_limit;
...
proxy_pass http://input-pipeline:4195;
}
}
Notice that we set a rate limit for the analytics endpoint only, as we want to prevent abuse of the endpoint that gathers the data. The rate limit is 5 requests per second per IP.
As for the dashboard endpoint, it's recommended to put it behind a firewall or another kind of access control such as Cloudflare Zero Trust, which is free for up to 50 users.
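For reference, the dashboard config is essentially a plain reverse proxy to the Grafana container. A minimal sketch, where the domain, upstream service name and port are assumptions:

server {
    ...
    server_name dashboard.test;
    location / {
        # no rate limit here; access is expected to be restricted upstream
        proxy_pass http://grafana:3000;
    }
}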
Env file
Inside the .env file you can set strong passwords and different usernames for the databases. The only field you might want to change is PROJECTS:
PROJECTS=app,website
This variable contains a list of project names. These names will be used when you create events.
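For reference, a .env might look roughly like this. The CLICKHOUSE_* variable names match the install command further below, but the values here are placeholders you should change:

PROJECTS=app,website
CLICKHOUSE_HOST=clickhouse
CLICKHOUSE_PROTOCOL=http
CLICKHOUSE_PORT=8123
CLICKHOUSE_DB=analytics
CLICKHOUSE_USER=analytics
CLICKHOUSE_PASSWORD=change-me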
Installation
Now, we can start everything by running
docker compose up -d
Creating the Kafka topic
Run this command to execute a binary inside the kafka container that creates a Kafka topic named analytics with 7 days of message retention; after that, Kafka will delete old messages.
docker exec event-store kafka-topics.sh \
--create --topic analytics \
--config retention.ms=604800000 \
--bootstrap-server localhost:9092
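You can confirm the topic was created with the same tool:

docker exec event-store kafka-topics.sh \
  --describe --topic analytics \
  --bootstrap-server localhost:9092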
Now it's best to restart the vector container, as the topic didn't exist until now.
docker compose restart vector
Creating the ClickHouse table
Run this command to create the ClickHouse table.
export $(grep -v '^#' .env | xargs) \
&& docker run --rm \
--env CLICKHOUSE_HOST --env CLICKHOUSE_USER --env CLICKHOUSE_PASSWORD \
--env CLICKHOUSE_PROTOCOL --env CLICKHOUSE_PORT --env CLICKHOUSE_DB \
--network=analytics \
-v ./scripts:/app oven/bun bun /app/install.ts
This command runs the TypeScript file using the bun runtime.
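The actual install.ts lives in the repo, but to give an idea of what such an installer does, here's a rough sketch using the @clickhouse/client package; the table schema below is an assumption based on the event fields used later in this article.

// sketch of an installer script like scripts/install.ts; the real schema differs
import { createClient } from "@clickhouse/client";

const client = createClient({
  url: `${process.env.CLICKHOUSE_PROTOCOL}://${process.env.CLICKHOUSE_HOST}:${process.env.CLICKHOUSE_PORT}`,
  username: process.env.CLICKHOUSE_USER,
  password: process.env.CLICKHOUSE_PASSWORD,
});

// create the database, then a MergeTree table keyed by app and time
await client.command({
  query: `CREATE DATABASE IF NOT EXISTS ${process.env.CLICKHOUSE_DB}`,
});
await client.command({
  query: `
    CREATE TABLE IF NOT EXISTS ${process.env.CLICKHOUSE_DB}.events (
      app         LowCardinality(String),
      type        LowCardinality(String),
      user_id     String,
      session_id  String,
      timestamp   DateTime,
      locale      String,
      path        String,
      appVersion  String,
      country     String,
      details     String
    )
    ENGINE = MergeTree
    ORDER BY (app, timestamp)
  `,
});

await client.close();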
Optional - seed some demo data
To see the dashboard with data, you can run the following command to seed a year's worth of dummy data into the database.
export $(grep -v '^#' .env | xargs) \
&& docker run --rm \
--env CLICKHOUSE_HOST --env CLICKHOUSE_USER --env CLICKHOUSE_PASSWORD \
--env CLICKHOUSE_PROTOCOL --env CLICKHOUSE_PORT --env CLICKHOUSE_DB \
--network=analytics \
-v ./scripts:/app oven/bun bun /app/seed.ts
Create your first event(s)
Now we are ready to create our first event. We'll use curl to POST an event to our Benthos endpoint (through nginx).
curl --location 'http://analytics.test' \
--header 'X-App: app' \
--header 'Content-Type: application/json' \
--data '{
"type": "click",
"user_id": "1",
"session_id": "FaAa31",
"timestamp": "2024-04-17T04:04:12",
"locale": "en",
"path": "/getting-started",
"appVersion": "0.0.1",
"details": {
"meta_a": "meta_a",
"meta_b": "meta_b"
}
}'
Or with JavaScript fetch, for example:
const body = JSON.stringify({
"type": "click",
"user_id": "1",
"session_id": "FaAa31",
"timestamp": "2024-04-17T04:04:12",
"locale": "en",
"path": "/getting-started",
"appVersion": "0.0.1",
"details": {
"meta_a": "meta_a",
"meta_b": "meta_b"
}
});
await fetch("http://analytics.test", {
method: "POST",
headers: {
"X-App": "app",
"Content-Type": "application/json"
},
body,
});
User ID vs Session ID
User ID is an identifier that you'll try to maintain for as long as you can, as this is how you differentiate between users. On mobile you can use a device ID such as identifierForVendor on iOS or AndroidID on Android. On the web, you can store a random unique ID in the browser's Local Storage.
Session ID, on the other hand, lives only for a short period of time. You can create a random unique ID combined with the current user ID and store it in memory; in the browser you can also store it in Session Storage.
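As an illustration, a browser-side helper for the scheme above might look like this; the storage keys and ID format are arbitrary choices, not something the project prescribes.

function getUserId() {
  // persistent per-browser identifier, kept in Local Storage
  let id = localStorage.getItem("analytics_user_id");
  if (!id) {
    id = crypto.randomUUID();
    localStorage.setItem("analytics_user_id", id);
  }
  return id;
}

function getSessionId() {
  // short-lived identifier, combined with the user ID, kept in Session Storage
  let id = sessionStorage.getItem("analytics_session_id");
  if (!id) {
    id = `${getUserId()}-${crypto.randomUUID()}`;
    sessionStorage.setItem("analytics_session_id", id);
  }
  return id;
}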
View the data in Grafana
Now access your Grafana. The first time, you'll log in using admin as both the username and password. Then you'll be prompted to change the password.
You can also watch events occurring in real time by enabling Grafana's auto refresh from the top right.
That's it!
ClickHouse is very efficient: 119,716,577 rows weigh 21.91GB in total, but on disk the data is compressed and weighs only 4.93GB!
You can deploy an N+B+K+V+G+C stack on a €16.18 server at Hetzner and get:
- 4 vCPU
- 8GB RAM
- 160GB SSD
- 20TB bandwidth
To reach that number of events you'd need to create 9,976,381 events a month, or 332,546 events a day, or 3-4 events per second.
This N+B+K+V+G+C deployment is just a simple example, and you can of course adapt it to your needs.
Expelliarmus!