OSDN Git Service

Implement a case-folding version of DJB hash
authorPavel Labath <labath@google.com>
Wed, 14 Feb 2018 10:05:09 +0000 (10:05 +0000)
committerPavel Labath <labath@google.com>
Wed, 14 Feb 2018 10:05:09 +0000 (10:05 +0000)
commitfd7c224b3894b5e869974106cbe56928152015f7
tree70f5f918ed5fa0c40fb088ab4822f020c45795dd
parent8e229ec0e5f36438adbf7324566e1a30dd649b63
Implement a case-folding version of DJB hash

Summary:
This patch implements a variant of the DJB hash function which folds the
input according to the algorithm in the Dwarf 5 specification (Section
6.1.1.4.5), which in turn references the Unicode Standard (Section 5.18,
"Case Mappings").

To achieve this, I have added a llvm::sys::unicode::foldCharSimple
function, which performs this mapping. The implementation of this
function was generated from the CaseMatching.txt file from the Unicode
spec using a python script (which is also included in this patch). The
script tries to optimize the function by coalescing adjecant mappings
with the same shift and stride (terms I made up). Theoretically, it
could be made a bit smarter and merge adjecant blocks that were
interrupted by only one or two characters with exceptional mapping, but
this would save only a couple of branches, while it would greatly
complicate the implementation, so I deemed it was not worth it.

Since we assume that the vast majority of the input characters will be
US-ASCII, the folding hash function has a fast-path for handling these,
and only whips out the full decode+fold+encode logic if we encounter a
character outside of this range. It might be possible to implement the
folding directly on utf8 sequences, but this would also bring a lot of
complexity for the few cases where we will actually need to process
non-ascii characters.

Reviewers: JDevlieghere, aprantl, probinson, dblaikie

Subscribers: mgorny, hintonda, echristo, clayborg, vleschuk, llvm-commits

Differential Revision: https://reviews.llvm.org/D42740

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@325107 91177308-0d34-0410-b5e6-96231b3b80d8
include/llvm/Support/DJB.h
include/llvm/Support/Unicode.h
lib/Support/CMakeLists.txt
lib/Support/DJB.cpp
lib/Support/UnicodeCaseFold.cpp [new file with mode: 0644]
unittests/Support/CMakeLists.txt
unittests/Support/DJBTest.cpp [new file with mode: 0644]
utils/unicode-case-fold.py [new file with mode: 0755]