Bilibili Light Novel (linovelib) Analysis Notes¶
Created: 2025/05/25
Last modified: 2025/10/06
Obfuscation logic: 2025/08 ~ ???¶
1. The regex replacement map in pctheme.js¶
pctheme.js contains roughly 100 pairs of bulk substitutions in the form of chained replace(new RegExp(...), '...') calls.
For example:
eval(function(p,a,c,k,e,r){e=String;if(!''.replace(/^/,String)){while(c--)r[c]=k[c]||c;k=[function(e){return r[e]}];e=function(){return'\\w+'};c=1};while(c--)if(k[c])p=p.replace(new RegExp('\\b'+e(c)+'\\b','g'),k[c]);return p}('4=4.0(1 2("","3"),"的")...',5,5,'replace|new|RegExp|gi|k'.split('|'),0,{}));
After deobfuscation, it boils down to:
- applying a series of str.replace(new RegExp(SRC, 'gi'), DST) calls to the chapter text, where SRC is usually a Private Use Area (PUA) or special-component character and DST is a common hanzi (such as 的).
Example (the SRC strings below look empty only because the PUA characters do not render):
k = k
.replace(new RegExp("", "gi"), "的")
.replace(new RegExp("", "gi"), "一")
.replace(new RegExp("", "gi"), "是")
.replace(new RegExp("", "gi"), "了")
.replace(new RegExp("", "gi"), "我")
.replace(new RegExp("", "gi"), "不")
.replace(new RegExp("", "gi"), "人")
.replace(new RegExp("", "gi"), "在")
.replace(new RegExp("", "gi"), "他")
.replace(new RegExp("", "gi"), "有")
.replace(new RegExp("", "gi"), "这");
Handling:
Parse the pctheme.js source, match new RegExp\(["'](.+?)["'],"gi"\)\s*,\s*["'](.+?)["'] to collect map[SRC] = DST pairs, and save them as JSON.
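A sketch of that extraction step. The regex is slightly loosened from the one above to tolerate whitespace after the comma, and the demo string stands in for the real deobfuscated pctheme.js source:

```python
import re

# Matches one link of the deobfuscated replace() chain:
#   .replace(new RegExp("SRC", "gi"), "DST")
_PAIR_RE = re.compile(
    r'new RegExp\(["\'](.+?)["\'],\s*["\']gi["\']\)\s*,\s*["\'](.+?)["\']'
)

def extract_pctheme_map(js_text: str) -> dict[str, str]:
    """Collect SRC -> DST pairs from the pctheme.js replace chain."""
    return {src: dst for src, dst in _PAIR_RE.findall(js_text)}

demo = '.replace(new RegExp("\ue000", "gi"), "的").replace(new RegExp("\ue001", "gi"), "一")'
print(extract_pctheme_map(demo))  # {'\ue000': '的', '\ue001': '一'}
```

Dumping the result with json.dumps(mapping, ensure_ascii=False) produces the map JSON consumed by the restoration code below.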
Reference restoration (Python):
import json
from pathlib import Path
LINOVELIB_MAP_PATH = Path("/path/to/map.json")
_PCTHEMA_MAP: dict[str, str] = json.loads(
LINOVELIB_MAP_PATH.read_text(encoding="utf-8")
)
def _map_subst(text: str) -> str:
"""
Apply PC theme character substitution to the input text.
"""
return "".join(_PCTHEMA_MAP.get(c, c) for c in text)
2. Paragraph shuffling and restoration in chapterlog.js (seeded Fisher–Yates)¶
chapterlog.js is obfuscated with javascript-obfuscator. Key mechanics:
- Only non-empty <p> paragraphs under #TextContent are processed.
- If the paragraph count is <= 20, the order is unchanged.
- If it is > 20, the first 20 paragraphs stay fixed and the rest are Fisher–Yates shuffled with a seed derived from the chapter ID.
- The PRNG is a linear congruential generator: s = (s*9302 + 49397) % 233280, with pick index j = floor(s/233280*(i+1)).
- The seed is seed = chapterId*127 + 235.
The core logic, restored:
var chapterId = ReadParams.chapterid;
if (!chapterId) return;
var textContainer = document.querySelector("#TextContent");
if (!textContainer) return;
var allNodes = Array.prototype.slice.call(textContainer.childNodes);
var paragraphs = []; // collect non-empty <p>
for (var i = 0; i < allNodes.length; i++) {
var node = allNodes[i];
if (node.nodeType === 1 && node.tagName.toLowerCase() === "p" && node.innerHTML.replace(/\s+/g, "").length > 0) {
paragraphs.push({ node: node, idx: i });
}
}
var paragraphCount = paragraphs.length;
if (!paragraphCount) return;
function shuffle(array, seed) {
var len = array.length;
seed = Number(seed);
for (var i = len - 1; i > 0; i--) {
seed = (seed * 9302 + 49397) % 233280; // linear congruential PRNG
var j = Math.floor(seed / 233280 * (i + 1)); // Fisher–Yates pick
var tmp = array[i]; array[i] = array[j]; array[j] = tmp;
}
return array;
}
var seed = Number(chapterId) * 127 + 235; // seed derivation
var order = [];
if (paragraphCount > 20) {
var fixed = [], rest = [];
for (var i = 0; i < paragraphCount; i++) (i < 20 ? fixed : rest).push(i);
shuffle(rest, seed);
order = fixed.concat(rest); // first 20 fixed, rest shuffled
} else {
for (var i = 0; i < paragraphCount; i++) order.push(i);
}
// apply the mapping
var reordered = [];
for (var i = 0; i < paragraphCount; i++) {
reordered[order[i]] = paragraphs[i].node;
}
Reference restoration (Python):
def _chapterlog_order(n: int, cid: int) -> list[int]:
"""
Compute the paragraph reordering index sequence used by /scripts/chapterlog.js.
:param n: Total number of non-empty paragraphs in the chapter.
:param cid: Chapter ID (used as the seed for the shuffle).
"""
if n <= 0:
return []
if n <= 20:
return list(range(n))
fixed = list(range(20))
rest = list(range(20, n))
# Seeded Fisher-Yates
m = 233_280
a = 9_302
c = 49_397
s = cid * 127 + 235 # seed
for i in range(len(rest) - 1, 0, -1):
s = (s * a + c) % m
j = (s * (i + 1)) // m
rest[i], rest[j] = rest[j], rest[i]
return fixed + rest
def restore_paragraphs(paragraphs: list[str], cid: int) -> list[str]:
order = _chapterlog_order(len(paragraphs), cid)
reordered_p = [""] * len(paragraphs)
for i, p in enumerate(paragraphs):
reordered_p[order[i]] = p
return reordered_p
Obfuscation logic: ??? ~ 2025/08¶
Note: the following is the pre-2025/08 obfuscation scheme, which has since been replaced.
1. Obfuscated script and font injection¶
Some chapter pages ship an obfuscated, minified script that dynamically injects an @font-face rule and applies a custom font to a specific paragraph. The restored core logic:
const sheet = new CSSStyleSheet();
sheet.replaceSync(`
@font-face {
font-family: read;
font-display: block;
src: url('/public/font/read.woff2') format('woff2'),
url('/public/font/read.ttf') format('truetype');
}
#TextContent p:nth-last-of-type(2) {
font-family: "read" !important;
}
`);
document.adoptedStyleSheets = [
...document.adoptedStyleSheets,
sheet
];
The page applies the custom font read to the second-to-last <p>; the last <p> is always an empty line. The approach resembles Qidian's, except that the font file here is fixed rather than generated per request.
The corresponding CSS:
@font-face {
font-family: read;
font-display: block;
src: url('/public/font/read.woff2') format('woff2'),
url('/public/font/read.ttf') format('truetype');
}
#TextContent p:nth-last-of-type(2) {
font-family: "read" !important;
}
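For offline scraping, the paragraph targeted by the CSS selector above can be located the same way. A sketch with lxml (the toy HTML is purely illustrative):

```python
from lxml import html

def obfuscated_paragraph(chapter_html: str) -> str:
    """Text of the <p> targeted by the CSS above:
    the second-to-last <p> inside #TextContent."""
    tree = html.fromstring(chapter_html)
    paras = tree.xpath('//*[@id="TextContent"]/p')
    return paras[-2].text_content()

doc = '<div id="TextContent"><p>a</p><p>b</p><p>secret</p><p></p></div>'
print(obfuscated_paragraph(doc))  # secret
```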
2. Font rendering test¶
To inspect the font's mapping, render a sample string with Pillow:
import textwrap
from pathlib import Path
from PIL import Image, ImageDraw, ImageFont
CELL_SIZE = 64
FONT_SIZE = 52
FONT_PATH = Path.cwd() / "read.woff2"
TEXT_SAMPLE = "「床瑰蛾妹」"  # decodes to: 即使与全世界为敌
def render_text(
text: str,
font: ImageFont.FreeTypeFont,
cell_size: int = CELL_SIZE,
chars_per_line: int = 16,
) -> Image.Image:
"""
Render a string into a image.
"""
lines = textwrap.wrap(text, width=chars_per_line) or [""]
img_w = cell_size * chars_per_line
img_h = cell_size * len(lines)
img = Image.new("L", (img_w, img_h), color=255)
draw = ImageDraw.Draw(img)
for row, line in enumerate(lines):
for col, ch in enumerate(line):
x = (col + 0.5) * cell_size
y = (row + 0.5) * cell_size
draw.text((x, y), ch, font=font, fill=0, anchor="mm")
return img
if __name__ == "__main__":
font = ImageFont.truetype(str(FONT_PATH), FONT_SIZE)
image = render_text(TEXT_SAMPLE, font)
image.show()

Observations:
Some characters render as blanks, as in the sample above.
The page text typically alternates in a "two PUA characters + one hanzi" pattern; the read font, however, embeds no glyphs for these common hanzi, so they render blank.
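The PUA interleaving can be checked programmatically; a minimal sketch (the sample string is a made-up illustration):

```python
def is_pua(ch: str) -> bool:
    """True for BMP Private Use Area codepoints, where the obfuscated
    characters live."""
    return 0xE000 <= ord(ch) <= 0xF8FF

sample = "\ue123\uf045床"  # two PUA characters followed by a real hanzi
print([is_pua(c) for c in sample])  # [True, True, False]
```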
3. Counting and identifying blank glyphs¶
read.ttf contains many entries that are mapped in cmap but have empty glyphs, visible in a ttx dump (ttx ./read.ttf):
- cmap maps the codepoint to a glyph such as glyph07404;
- the corresponding hmtx entry has width=0;
- the TTGlyph in glyf contains no outline.
Example:
<hmtx>
<mtx name="glyph07404" width="0" lsb="0"/>
<mtx name="glyph07405" width="0" lsb="0"/>
...
<cmap_format_4 ...>
<map code="0x4e00" name="glyph07404"/>
<map code="0x4e01" name="glyph07405"/>
...
<TTGlyph name="glyph07404"/><!-- contains no outline data -->
<TTGlyph name="glyph07405"/><!-- contains no outline data -->
So although 0x4e00 and 0x4e01 (i.e. "一" and "丁") exist in the mapping table, their glyph data is empty and their width is 0, so they render as blanks.
Detection criteria:
- no outline in glyf, or
- horizontal advance width of 0 in hmtx.
Either condition marks the glyph as blank.
Sample count (using fontTools):
from fontTools.ttLib import TTFont
def count_blank_glyphs(path: str) -> list[str]:
font = TTFont(path)
cmap = font.getBestCmap()
hmtx_table = font["hmtx"]
blank_chars: list[str] = []
for code, glyph_name in cmap.items():
width, _ = hmtx_table[glyph_name]
if width == 0:
blank_chars.append(chr(code))
return blank_chars
if __name__ == "__main__":
blanks = count_blank_glyphs("read.ttf")
print(f"Blank glyph count: {len(blanks)}")
Result: 3500 blank glyphs in total.
4. Font restoration approach and results¶
Since the read.ttf/read.woff2 URLs are fixed and every chapter reuses the same font, no per-chapter OCR is needed.
Font metadata sample:
Font name: MI LANTING
Version: Version 2.3.3;GB Outside YS Regular
TrueType Outlines
This pins down the original source font (MI LANTING).
Inspecting the cmap of read.ttx shows the obfuscation lives mainly in the PUA range 0xE000 - 0xF8FE.
First confirm by rendering that this is indeed the obfuscated range (omitted here);
then count the characters in that range:
def extract_font_charset(
font_path: str | Path,
lower_bound: int | None = None,
upper_bound: int | None = None,
) -> set[str]:
"""
Extract the set of Unicode characters encoded by a TrueType/OpenType font.
:param font_path: Path to a TTF/OTF font file.
:param lower_bound: Inclusive lower bound of code points (e.g. 0x4E00).
:param upper_bound: Inclusive upper bound of code points (e.g. 0x9FFF).
:return: A set of Unicode characters present in the font's cmap within the specified range.
"""
with TTFont(font_path) as font_ttf:
cmap = font_ttf.getBestCmap() or {}
charset: set[str] = set()
for cp in cmap:
if lower_bound is not None and cp < lower_bound:
continue
if upper_bound is not None and cp > upper_bound:
continue
charset.add(chr(cp))
return charset
ENCRYPTED_FROM = 0xE000
ENCRYPTED_TO = 0xF8FE
read_chars = extract_font_charset("read.ttf", ENCRYPTED_FROM, ENCRYPTED_TO)
print(f"Extracted {len(read_chars)} characters")
3606 characters in total.
Next, match read.ttx against MI LANTING.ttx at the glyph level to recover the mapping.
A direct O(n×m) full comparison, with roughly 27,000 glyphs on each side, is too costly, so the following pruning and speedups are applied:
- Match only within the PUA range (0xE000-0xF8FE).
- Use the blank-glyph prior: the ~3500 "blank glyphs" in read are known to have real glyphs in the original font and can be matched first.
- Build a normalized fingerprint from glyph components and contours and hash it for pre-screening.
- Fall back to a full strict comparison when the pre-screen misses.
- Record unmatched items separately for later manual review.
After these optimizations, the estimated runtime drops from roughly 8 hours to about 1 minute (environment-dependent).
Core script:
#!/usr/bin/env python3
from lxml import etree
from lxml.etree import _Element
import hashlib
import json
from pathlib import Path
from functools import lru_cache
from tqdm import tqdm
ENCRYPTED_FROM = 0xE000
ENCRYPTED_TO = 0xF8FE
def load_ttx_glyphs(ttx_path: str) -> dict[str, _Element]:
"""Load TTGlyph elements: glyphName -> TTGlyph Element"""
root = etree.parse(ttx_path).getroot()
glyphs: dict[str, _Element] = {}
for g in root.findall(".//TTGlyph"):
name = g.get("name")
if name:
glyphs[name] = g
return glyphs
def load_cmap_map(ttx_path: str) -> dict[int, str]:
"""Build cmap: codepoint(int) -> glyphName(str)"""
root = etree.parse(ttx_path).getroot()
cmap_map: dict[int, str] = {}
for map_node in root.findall(".//map"):
code = map_node.get("code")
name = map_node.get("name")
if not code or not name:
continue
try:
cp = int(code, 16) if code.startswith("0x") else int(code)
except Exception:
try:
cp = int(code)
except Exception:
continue
cmap_map[cp] = name
return cmap_map
def load_hmtx_widths(ttx_path: str) -> dict[str, int]:
"""Read <hmtx><mtx name=... width=.../> -> glyphName -> width(int)"""
root = etree.parse(ttx_path).getroot()
widths: dict[str, int] = {}
for mtx in root.findall(".//hmtx/mtx"):
name = mtx.get("name")
w = mtx.get("width")
if name is None or w is None:
continue
try:
widths[name] = int(w)
except Exception:
continue
return widths
def make_blank_checker(glyphs: dict[str, _Element], widths: dict[str, int]):
"""
Return a callable is_blank(glyph_name) that memoizes:
- True if no contour AND (no components OR all components are blank)
- OR width == 0 (hmtx)
"""
@lru_cache(maxsize=None)
def _is_blank(name: str) -> bool:
if widths.get(name) == 0:
return True
g = glyphs.get(name)
if g is None:
return True # missing -> treat as blank to be safe
contours = g.findall("contour")
if contours:
return False
comps = g.findall("component")
if not comps:
return True
for c in comps:
child = c.get("glyphName")
if not child:
continue
if not _is_blank(child):
return False
return True
return _is_blank
def _compare_components_xml(comp_nodes_a: list[_Element], comp_nodes_b: list[_Element]) -> bool:
"""Strict compare for <component> lists — exact equality on x,y,scalex,scaley."""
if len(comp_nodes_a) != len(comp_nodes_b):
return False
for ca, cb in zip(comp_nodes_a, comp_nodes_b):
ax = ca.get("x") or ca.get("dx")
ay = ca.get("y") or ca.get("dy")
bx = cb.get("x") or cb.get("dx")
by = cb.get("y") or cb.get("dy")
if ax != bx or ay != by:
return False
asx = ca.get("scalex") or ca.get("scale")
asy = ca.get("scaley") or ca.get("scaleY")
bsx = cb.get("scalex") or cb.get("scale")
bsy = cb.get("scaley") or cb.get("scaleY")
if (asx is not None) or (bsx is not None):
if asx != bsx:
return False
if (asy is not None) or (bsy is not None):
if asy != bsy:
return False
return True
def _contour_sig(c: _Element) -> str:
parts = []
for pt in c.iter("pt"):
x = pt.get("x") or ""
y = pt.get("y") or ""
on = pt.get("on") or ""
parts.append(x); parts.append(","); parts.append(y); parts.append(","); parts.append(on); parts.append(";")
return "".join(parts)
def _compare_contours_xml(glyph_a: _Element, glyph_b: _Element) -> bool:
"""Fast contour comparison by serialized point strings."""
contours_a = glyph_a.findall("contour")
contours_b = glyph_b.findall("contour")
if len(contours_a) != len(contours_b):
return False
for ca, cb in zip(contours_a, contours_b):
if _contour_sig(ca) != _contour_sig(cb):
return False
return True
def compare_glyphs_ttx(
glyphs_a: dict[str, _Element], name_a: str,
glyphs_b: dict[str, _Element], name_b: str,
) -> bool:
"""Component comparison first; otherwise contour comparison."""
ga = glyphs_a.get(name_a); gb = glyphs_b.get(name_b)
if ga is None or gb is None:
return False
comps_a = ga.findall("component")
comps_b = gb.findall("component")
if comps_a and comps_b:
return _compare_components_xml(comps_a, comps_b)
return _compare_contours_xml(ga, gb)
def glyph_fingerprint(elem: _Element) -> str:
"""Name-agnostic fingerprint based on components transforms or contour points."""
comps = elem.findall("component")
if comps:
parts = ["C|", str(len(comps)), "|"]
for c in comps:
x = c.get("x") or c.get("dx") or ""
y = c.get("y") or c.get("dy") or ""
sx = c.get("scalex") or c.get("scale") or ""
sy = c.get("scaley") or c.get("scaleY") or ""
parts.extend((x, ",", y, ",", sx, ",", sy, ";"))
return "".join(parts)
contours = elem.findall("contour")
parts = ["O|", str(len(contours)), "|"]
for cont in contours:
parts.append(_contour_sig(cont)); parts.append("|")
return "".join(parts)
def glyph_hash(elem: _Element) -> str:
return hashlib.md5(glyph_fingerprint(elem).encode("utf-8", "ignore")).hexdigest()
def is_disallowed_target(cp: int) -> bool:
ch = chr(cp)
if ch.isspace():
return True
# PUA (BMP)
if 0xE000 <= cp <= 0xF8FF:
return True
return False
def build_mapping_from_ttx(ttx_encrypted: str, ttx_normal: str) -> tuple[dict[str, str], list[list]]:
glyphs_enc = load_ttx_glyphs(ttx_encrypted)
glyphs_norm = load_ttx_glyphs(ttx_normal)
cmap_enc = load_cmap_map(ttx_encrypted)
cmap_norm = load_cmap_map(ttx_normal)
widths_enc = load_hmtx_widths(ttx_encrypted)
widths_norm = load_hmtx_widths(ttx_normal)
is_blank_enc = make_blank_checker(glyphs_enc, widths_enc)
is_blank_norm = make_blank_checker(glyphs_norm, widths_norm)
enc_all = [(cp, name) for cp, name in cmap_enc.items() if ENCRYPTED_FROM <= cp <= ENCRYPTED_TO]
encrypted_items = sorted(enc_all, key=lambda x: (0 if is_blank_enc(x[1]) else 1, x[0]))
hash_index: dict[str, list[tuple[int, str]]] = {}
for cp_n, gname_n in cmap_norm.items():
if is_disallowed_target(cp_n):
continue
if is_blank_norm(gname_n):
continue
g = glyphs_norm.get(gname_n)
if g is None:
continue
h = glyph_hash(g)
hash_index.setdefault(h, []).append((cp_n, gname_n))
mapping: dict[str, str] = {}
unmatched: list[list] = []
for cp_s, gname_s in tqdm(encrypted_items, desc="encrypted", unit="glyph"):
if is_blank_enc(gname_s):
unmatched.append([cp_s, gname_s])
continue
g_enc = glyphs_enc.get(gname_s)
if g_enc is None:
unmatched.append([cp_s, gname_s])
continue
h = glyph_hash(g_enc)
candidates = hash_index.get(h, [])
matched = False
for cp_n, gname_n in candidates:
if compare_glyphs_ttx(glyphs_enc, gname_s, glyphs_norm, gname_n):
mapping[f"\\u{cp_s:04x}"] = chr(cp_n)
matched = True
break
if not matched:
for cp_n, gname_n in cmap_norm.items():
if is_disallowed_target(cp_n):
continue
if is_blank_norm(gname_n):
continue
if compare_glyphs_ttx(glyphs_enc, gname_s, glyphs_norm, gname_n):
mapping[f"\\u{cp_s:04x}"] = chr(cp_n)
matched = True
break
if not matched:
unmatched.append([cp_s, gname_s])
return mapping, unmatched
if __name__ == "__main__":
ttx_enc = "read.ttx"
ttx_norm = "MI LANTING.ttx"
mapping, unmatched = build_mapping_from_ttx(ttx_enc, ttx_norm)
Path("mapping_from_ttx.json").write_text(json.dumps(mapping, ensure_ascii=False, indent=2), encoding="utf-8")
Path("unmatched_from_ttx.json").write_text(json.dumps(unmatched, ensure_ascii=False, indent=2), encoding="utf-8")
Matching results and verification
A full comparison initially yields 3513 mapping pairs, leaving 93 items with no matching glyph.
These 93 items were then rendered side by side in HTML with both read.ttf and the original font (MI LANTING.ttf) and compared manually:
- 91 of them are in fact normally visible characters (not obfuscation mapping targets);
- only 2 are confirmed genuine mapping gaps.
The two gaps were filled by hand:
{
"\uf63b": "啰",
"\ue8c0": "瞭"
}
With these added, the mapping totals 3515 pairs.
Cross-check against blank glyphs
The earlier count found 3500 blank glyphs in read.ttf (width=0 and no outline). Since the numbers are so close, the two sets were cross-compared:
- the blank-glyph set is a subset of the mapping's value set;
- only 15 extra characters remain, all of them symbol-class mismatches: glyphs of similar structure that are not common hanzi.
The mismatched characters:
{
"\ue7c7": "ḿ",
"\ue7c8": "ǹ",
"\ue7e7": "〾",
"\ue7e8": "⿰",
"\ue7e9": "⿱",
"\ue7ea": "⿲",
"\ue7eb": "⿳",
"\ue7ec": "⿴",
"\ue7ed": "⿵",
"\ue7ee": "⿶",
"\ue7ef": "⿷",
"\ue7f0": "⿸",
"\ue7f1": "⿹",
"\ue7f2": "⿺",
"\ue7f3": "⿻"
}
These characters come from Unicode's ideographic-description and phonetic-annotation ranges, not the CJK Unified Ideographs block.
The final stable result is a one-to-one mapping of 3500 pairs.
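The subset check described above can be sketched generically; the toy values below stand in for the real 3515-pair map and 3500-character blank set:

```python
def mapping_extras(mapping: dict[str, str], blank_chars: set[str]) -> set[str]:
    """Mapped-to characters that are NOT blank glyphs.

    Per the cross-check above, blank_chars should be a subset of the
    mapping's values; the difference is the symbol-class mismatches.
    """
    values = set(mapping.values())
    if not blank_chars <= values:
        raise ValueError("blank set is not a subset of mapping values")
    return values - blank_chars

# Toy illustration (hypothetical entries):
toy_map = {"\ue001": "的", "\ue002": "一", "\ue7e8": "⿰"}
print(mapping_extras(toy_map, {"的", "一"}))  # {'⿰'}
```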
5. Usage and restoration¶
Save the mapping table as JSON. At use time:
- load the mapping and take its first 3500 values as the blank set;
- for an obfuscated paragraph, first strip blank-set characters, then substitute via the mapping.
Example:
import json
from itertools import islice
from pathlib import Path
LINOVELIB_FONT_MAP_PATH = Path("/path/to/map.json")
_FONT_MAP: dict[str, str] = json.loads(
LINOVELIB_FONT_MAP_PATH.read_text(encoding="utf-8")
)
_BLANK_SET: set[str] = set(islice(_FONT_MAP.values(), 3500))
def _apply_font_map(text: str) -> str:
"""
Apply font mapping to the input text, skipping characters in blank set.
"""
return "".join(_FONT_MAP.get(c, c) for c in text if c not in _BLANK_SET)
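A toy end-to-end illustration of the two-step restoration; the map entries and blank set below are hypothetical, whereas the real data comes from the JSON above:

```python
# Hypothetical two-entry font map and blank (decoy) set.
font_map = {"\ue001": "床", "\ue002": "瑰"}
blank_set = {"的", "一"}

def apply_font_map(text: str) -> str:
    # Drop the blank-rendering decoy hanzi, then substitute PUA characters.
    return "".join(font_map.get(c, c) for c in text if c not in blank_set)

print(apply_font_map("\ue001的一\ue002"))  # → 床瑰
```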