使用 Hadoop MapReduce 完成 JSON 数据的多维度统计与树状结构构建

在大数据平台中，原始数据往往以半结构化形式（如 JSON Lines）存储在分布式文件系统（如 HDFS）中，并按业务或时间维度组织为目录层级。当需要对这些数据进行内容分析并保留其物理存储结构的语义时，传统的扁平化聚合就不再足够——我们希望输出结果既能反映统计指标，又能体现原始路径的层次关系。

Hadoop MapReduce 提供了一种天然适合此类任务的编程模型：Mapper 负责逐条解析和初步计算，Reducer 则可基于键（如文件路径）进行聚合，并进一步构建结构化输出。本文将通过一个完整示例，展示如何用 Python 编写 MapReduce 作业，实现：

对 JSON 行数据的内容分类与 token 统计；
按文件路径聚合结果；
输出一个嵌套 JSON 对象，等价于目录树结构。

设计思路

整个流程分为两个阶段：

Map 阶段：
- 读取每行 JSON，提取关键字段；
- 根据业务规则判断类别并计算指标（如 token 数）；
- 以“文件路径”作为输出键，确保同一文件的所有记录被发送到同一个 Reducer。
Reduce 阶段：
- 聚合来自同一文件的所有记录，得到该文件的总指标；
- 将所有文件按路径拆解为目录层级；
- 自底向上构建嵌套字典，每个节点包含汇总指标；
- 最终输出为标准 JSON，天然表达树状结构。

这种设计充分利用了 MapReduce 的分组能力和路径信息的结构性，避免了额外的后处理步骤。

Mapper 实现

Mapper 的核心是利用 Hadoop 提供的环境变量 map_input_file 获取当前处理的文件路径，并对每行 JSON 进行轻量解析和分类。

#!/usr/bin/env python3
import sys
import os
import json
def count_tokens(s):
    return len(s.split()) if s.strip() else 0
file_path = os.getenv('map_input_file', 'unknown')
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        data = json.loads(line)
        text = data.get("text", "")
    except Exception:
        continue
    if text.startswith("[tag1][tag1]"):
        # fast 类型：标签后内容有效
        content = text[14:]  # len("[tag1][tag1]") == 14
        tokens = count_tokens(content)
        print(f"{file_path}\tfast:{tokens}")
    elif "[tag1]" in text:
        # slow 类型：提取两个 [tag1] 之间的内容
        i1 = text.find("[tag1]")
        i2 = text.find("[tag1]", i1 + 6)
        if i2 == -1:
            continue
        inner = text[i1 + 6:i2]
        tokens = count_tokens(inner)
        print(f"{file_path}\tslow:{tokens}")
    else:
        continue

注意：虽然分类逻辑依赖特定标签格式，但其本质是“基于内容规则的多维打标”，可轻松替换为其他业务判断（如日志级别、用户类型等）。

Reducer 实现

Reducer 不仅要聚合指标，还要重建路径的层级关系。我们采用“路径拆分 + 嵌套字典 + 递归汇总”的策略，代码简洁且高效。

#!/usr/bin/env python3
import sys
import json
from collections import defaultdict
# 聚合每个文件的总 token 数和行数
file_metrics = defaultdict(lambda: {"tokens": 0, "lines": 0})
for line in sys.stdin:
    try:
        path, stat = line.strip().split('\t')
        _, cnt = stat.split(':')
        count = int(cnt)
        file_metrics[path]["tokens"] += count
        file_metrics[path]["lines"] += 1
    except:
        continue
# 构建嵌套目录树
tree = {}
for path, metrics in file_metrics.items():
    parts = [p for p in path.split('/') if p]
    if not parts:
        continue
    node = tree
    # 创建中间目录
    for part in parts[:-1]:
        if part not in node:
            node[part] = {"tokens": 0, "lines": 0, "children": {}}
        node = node[part]["children"]
    # 添加文件叶节点
    filename = parts[-1]
    node[filename] = {"tokens": metrics["tokens"], "lines": metrics["lines"]}
# 递归汇总子节点指标
def aggregate(children):
    total_t, total_l = 0, 0
    for name, child in children.items():
        if "children" in child:
            t, l = aggregate(child["children"])
            child["tokens"] += t
            child["lines"] += l
        total_t += child["tokens"]
        total_l += child["lines"]
    return total_t, total_l
# 汇总顶层节点
for name, node in tree.items():
    if "children" in node:
        t, l = aggregate(node["children"])
        node["tokens"] += t
        node["lines"] += l
print(json.dumps(tree, indent=2))

该 Reducer 的输出是一个合法 JSON 对象，其嵌套结构与 HDFS 目录完全对应，每个节点包含两个核心指标：tokens（总 token 数）和 lines（总记录数）。

执行与集成

使用 Hadoop Streaming 提交作业：

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /data/jsonl_input \
  -output /data/output \
  -mapper "python3 mapper.py" \
  -reducer "python3 reducer.py" \
  -file mapper.py \
  -file reducer.py

Hadoop 会自动将 map_input_file 设置为当前输入分片对应的文件路径，无需额外配置。