2025年大数据工具之dbt

大数据工具之dbt大数据工具之 dbt 简介 ETL 是英文 Extract Transform Load 的缩写 用来描述将数据从来源端经过抽取 extract 转换 transform 加载 load 至目的端的过程 ETL 一词较常用在数据仓库 dbt

大家好,我是讯享网,很高兴认识大家。

大数据工具之dbt

简介

ETL,是英文Extract-Transform-Load的缩写,用来描述将数据从来源端经过抽取(extract)、转换(transform)、加载(load)至目的端的过程。ETL一词较常用在数据仓库.

dbt 是一个很不错的进行ETL中的T处理的工具,灵活简单,我们需要写的就是select 语句,剩下的dbt 帮助我们进行处理

官网地址:https://docs.getdbt.com/

环境要求

1、Python版本需要3.7及以上

2、需要拉取国外镜像,有可能失败,添加国内镜像 daocloud镜像加速器:

curl -sSL https://get.daocloud.io/daotools/set_mirror.sh | sh -s http://d1d9aef0.m.daocloud.io 该脚本可以将 --registry-mirror 加入到你的 Docker 配置文件 /etc/docker/daemon.json 中 

讯享网

Docker安装

github地址:https://github.com/dbt-labs/dbt-core/tree/v1.0.0/docker

因为dbt一般依赖定时任务驱动执行数据转换,所以在官方的基础镜像上将crontab打包进去,一共四个文件(注意路径目录):

crontab

讯享网# /etc/crontab: system-wide crontab # Unlike any other crontab you don't have to run the `crontab' # command to install the new version when you edit this file # and files in /etc/cron.d. These files also have username fields, # that none of the other crontabs do. SHELL=/bin/sh PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin # Example of job definition: # .---------------- minute (0 - 59) # | .------------- hour (0 - 23) # | | .---------- day of month (1 - 31) # | | | .------- month (1 - 12) OR jan,feb,mar,apr ... # | | | | .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat # | | | | | # * * * * * user-name command to be executed #17 * * * * root cd / && run-parts --report /etc/cron.hourly #25 6 * * * root test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) #47 6 * * 7 root test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.weekly ) #52 6 1 * * root test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.monthly ) #每小时执行一次dbt项目 0 0 0/1 * * root cd /usr/app/dbt_project && dbt run > /var/log/dbt-crontab.log 2>&1 

Dockerfile

ARG BASE_IMAGE=python:3.8-slim-bullseye FROM $BASE_IMAGE RUN apt-get update \ && apt-get dist-upgrade -y \ && apt-get install -y --no-install-recommends \ git \ ssh-client \ software-properties-common \ make \ build-essential \ ca-certificates \ libpq-dev \ cron \ vim \ && apt-get clean \ && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* COPY ./requirements.txt / COPY ./entrypoint.sh / RUN chmod +x entrypoint.sh RUN pip install --upgrade pip setuptools RUN pip install --requirement requirements.txt ENV PYTHONIOENCODING=utf-8 ENV LANG C.UTF-8 WORKDIR /usr/app VOLUME /usr/app RUN /etc/init.d/cron start ENTRYPOINT ["/entrypoint.sh"] 

entrypoint.sh

讯享网#!/bin/bash set -x # 保存环境变量,开启crontab服务 env >> /etc/default/locale /etc/init.d/cron start /bin/bash 

requirements.txt

agate==1.6.3 attrs==21.2.0 Babel==2.9.1 certifi==2021.10.8 cffi==1.15.0 charset-normalizer==2.0.8 click==8.0.3 colorama==0.4.4 dbt-core==1.0.0 dbt-extractor==0.4.0 dbt-postgres==1.0.0 future==0.18.2 hologram==0.0.14 idna==3.3 importlib-metadata==4.8.2 isodate==0.6.0 Jinja2==2.11.3 jsonschema==3.1.1 leather==0.3.4 Logbook==1.5.3 MarkupSafe==2.0.1 mashumaro==2.9 minimal-snowplow-tracker==0.0.2 msgpack==1.0.3 networkx==2.6.3 packaging==21.3 parsedatetime==2.4 psycopg2-binary==2.9.2 pycparser==2.21 pyparsing==3.0.6 pyrsistent==0.18.0 python-dateutil==2.8.2 python-slugify==5.0.2 pytimeparse==1.1.8 pytz==2021.3 PyYAML==6.0 requests==2.26.0 six==1.16.0 sqlparse==0.4.2 text-unidecode==1.3 typing-extensions==3.10.0.2 urllib3==1.26.7 Werkzeug==2.0.2 zipp==3.6.0 

执行docker build,制作镜像:

讯享网docker build --tag dbt:v1.0.0 . 

docker compose启动:

version: '3' services: dbt: image: dbt:v1.0.0 container_name: dbt volumes: - /root/docker_vol/dbt/app:/usr/app # dbt project - /root/docker_vol/dbt/dbt-profiles:/root/.dbt # dbt profiles - /root/docker_vol/dbt/crontab-conf/crontab:/etc/crontab # crontab config tty: true stdin_open: true 

构建项目

创建一个新的 dbt 项目

讯享网[root@paratera128 dbt]# docker exec -it dbt /bin/bash root@08fb74a70adf:/usr/app# dbt init dbt_project 07:05:56 Running with dbt=1.0.0 Which database would you like to use? [1] mariadb [2] mysql [3] mysql5 (Don't see the one you want? https://docs.getdbt.com/docs/available-adapters) Enter a number: 3 07:06:08 Profile dbt_project written to /root/.dbt/profiles.yml using target's sample configuration. Once updated, you'll be able to start developing with dbt. 07:06:08 Your new dbt project "dbt_project" was created! For more information on how to configure the profiles.yml file, please consult the dbt documentation here: https://docs.getdbt.com/docs/configure-your-profile One more thing: Need help? Don't hesitate to reach out to us via GitHub issues or on Slack: https://community.getdbt.com/ Happy modeling! root@08fb74a70adf:/usr/app# ls dbt_project logs 

修改配置文件

进入挂载卷vol下profiles目录,修改profiles.yml文件,根据target修改对应环境配置即可:

dbt_project: target: dev outputs: dev: type: mysql server: 192.168.137.128 port: 3306 database: matomo schema: matomo username: root password: root driver: MySQL ODBC 8.0 ANSI Driver prod: type: mysql server: [server/host] port: [port] # optional database: [schema] # optional, should be same as schema schema: [schema] username: [username] password: [password] driver: MySQL ODBC 8.0 ANSI Driver 

调试测试

讯享网root@08fb74a70adf:/usr/app# dbt debug 07:35:34 Running with dbt=1.0.0 dbt version: 1.0.0 python version: 3.8.12 python path: /usr/local/bin/python os info: Linux-3.10.0-1160.el7.x86_64-x86_64-with-glibc2.2.5 Using profiles.yml file at /root/.dbt/profiles.yml Using dbt_project.yml file at /usr/app/dbt_project.yml Configuration: profiles.yml file [OK found and valid] dbt_project.yml file [OK found and valid] Required dependencies: - git [OK found] Connection: server: 192.168.137.128 port: 3306 database: None schema: matomo user: root Connection test: [OK connection ok] All checks passed! 

应用范例

有需求如下:

有原始表:matomo_log_visit,希望对数据进行如下处理

1、过滤数据,作为下一步的数据依据,因为只是过程数据,所以生成视图表:matomo_log_visit_material_view

2、从上一步视图数据中摘取有意义的字段插入物理表:matomo_log_visit_material

3、为了优化性能,希望对于产生的数据希望是以增量的形式生成


讯享网

从dbt_project.yml项目配置文件可以看出,dbt run执行的脚本路径如下:

models: dbt_project: # Config indicated by + and applies to all files under models/example/ example: +materialized: table #如果.sql脚本不指定类型,默认生成table 

dbt 执行models脚本,生成的视图或者物理表名称都跟脚本名称一样,所以两步处理脚本如下:

matomo_log_visit_material_view.sql

讯享网{ 
  
    
  { config(materialized='view')}} SELECT a.*,now() as create_time FROM matomo.matomo_log_visit a WHERE idsite = 1 and visit_total_time > 0 

matomo_log_visit_material.sql

select idvisit,idsite,user_id,visit_first_action_time,visit_total_time,visit_goal_buyer,referer_type,location_ip,create_time from { 
  
    
  { ref('matomo_log_visit_material_view') }} 

设置执行顺序:

schema.yml

讯享网version: 2 models: - name: matomo_log_visit_material_view description: "A starter dbt model" columns: - name: idvisit description: "The primary key for this table" tests: - unique - not_null - name: matomo_log_visit_material description: "A starter dbt model" columns: - name: idvisit description: "The primary key for this table" tests: - unique - not_null 

执行dbt run

root@1dc4f79d8882:/usr/app/dbt_project# dbt run 07:16:40 Running with dbt=1.0.0 07:16:41 Found 2 models, 4 tests, 0 snapshots, 0 analyses, 170 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics 07:16:41 07:16:41 Concurrency: 1 threads (target='dev') 07:16:41 07:16:41 1 of 2 START view model matomo.matomo_log_visit_material_view................... [RUN] 07:16:41 1 of 2 OK created view model matomo.matomo_log_visit_material_view.............. [SUCCESS 0 in 0.09s] 07:16:41 2 of 2 START table model matomo.matomo_log_visit_material....................... [RUN] 07:16:41 2 of 2 OK created table model matomo.matomo_log_visit_material.................. [SUCCESS 8176 in 0.17s] 07:16:41 07:16:41 Finished running 1 view model, 1 table model in 0.32s. 07:16:41 07:16:41 Completed successfully 07:16:41 07:16:41 Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2 

可以看到分别生成了视图和物理表:

在这里插入图片描述

到这里,其实已经完成了前两步,第三步还没有达到增量的效果,这一点可以通过两次执行dbt run看出来,数据量没变,create_time时间变了,说明是整体更新,不是增量更新!

怎么处理呢?

首先:对于视图表来说,他只是一个虚拟存在的表,不涉及到物理表譬如寻址那种性能损耗比较大的问题,所以我们明确的可以知道要进行修改的脚本是matomo_log_visit_material.sql

然后,对于增量更新来说,常用的参照物是有序id,时间等字段

1、如果使用id,那么我们取最大id跟新行数据id比较,我们修改matomo_log_visit_material.sql如下:

讯享网{ 
  
    
  { config(materialized='incremental', unique_key='visit_first_action_time') }} select idvisit,idsite,user_id,visit_first_action_time,visit_total_time,visit_goal_buyer,referer_type,location_ip,create_time from { 
  
    
  { ref('matomo_log_visit_material_view') }} {% if is_incremental() %} idvisit > (select max(idvisit) from { 
  
    
  { this }}) {% endif %} 

往原始表中新增一条数据,idvisit字段值比库中最大值还大,再次执行dbt run,你会发现并不会整体更新:

在这里插入图片描述

2、如果使用时间,我们可以取最大时间跟当前时间比较,我们修改matomo_log_visit_material.sql如下:

{ 
  
    
  { config(materialized='incremental', unique_key='visit_first_action_time') }} select idvisit,idsite,user_id,visit_first_action_time,visit_total_time,visit_goal_buyer,referer_type,location_ip,create_time from { 
  
    
  { ref('matomo_log_visit_material_view') }} where {% if is_incremental() %} visit_first_action_time >= (select max(visit_first_action_time) from { 
  
    
  { this }}) {% endif %} 

往原始表新增一条数据,visit_first_action_time字段时间设置成比当前库中最大时间还大,再次执行dbt run,你会发现并不会整体更新:

在这里插入图片描述

项目文档

  • 生成
讯享网root@1dc4f79d8882:/usr/app/dbt_project# dbt docs generate 09:11:41 Running with dbt=1.0.0 09:11:41 Found 2 models, 4 tests, 0 snapshots, 0 analyses, 170 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics 09:11:41 09:11:42 Concurrency: 1 threads (target='dev') 09:11:42 09:11:42 Done. 09:11:42 Building catalog 09:11:42 Catalog written to /usr/app/dbt_project/target/catalog.json 
  • 部署(静态服务)
root@1dc4f79d8882:/usr/app/dbt_project# dbt docs serve 09:13:45 Running with dbt=1.0.0 09:13:45 Serving docs at 0.0.0.0:8080 09:13:45 To access from your browser, navigate to: http://localhost:8080 09:13:45 09:13:45 09:13:45 Press Ctrl+C to exit. 
  • 访问
    在这里插入图片描述
小讯
上一篇 2025-04-06 18:12
下一篇 2025-01-16 22:38

相关推荐

版权声明:本文内容由互联网用户自发贡献,该文观点仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容,请联系我们,一经查实,本站将立刻删除。
如需转载请保留出处:https://51itzy.com/kjqy/117269.html