The bytes package

Overview buffer.go

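This file holds the Buffer implementation of the bytes package.

A picture is worth a thousand words. If the diagram is not clear, read the walkthrough below.

(Figure: buffer.jpg)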

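Core functions

The Buffer struct

This is the internal layout of Buffer.

buf is the byte slice that stores the buffer's contents.

off is the offset at which the next read starts.

bootstrap is a fixed 64-byte array that backs buf while the buffer is small, so small buffers avoid repeated allocations.

lastRead records the last read operation so that the Unread* methods can work correctly.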

// A Buffer is a variable-sized buffer of bytes with Read and Write methods.
// The zero value for Buffer is an empty buffer ready to use.
// Note: the zero value of Buffer is an empty buffer.
type Buffer struct {
    buf       []byte   // contents are the bytes buf[off : len(buf)]
    off       int      // read at &buf[off], write at &buf[len(buf)]
    bootstrap [64]byte // memory to hold first slice; helps small buffers avoid allocation.
    lastRead  readOp   // last read operation, so that Unread* can work correctly.

    // FIXME: it would be advisable to align Buffer to cachelines to avoid false
    // sharing.
}
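To make the roles of buf and off concrete, here is a minimal sketch that uses only the public API. The helper name drain is made up for this illustration. Reading advances the read offset, so String afterwards returns only the unread part.

```go
package main

import (
	"bytes"
	"fmt"
)

// drain (a hypothetical helper) writes two strings to a zero-value Buffer,
// reads the first n bytes, and returns what was read plus what is still unread.
func drain(n int) (read, rest string) {
	var b bytes.Buffer // zero value: empty and ready to use
	b.WriteString("hello, ")
	b.WriteString("world")

	p := make([]byte, n)
	k, _ := b.Read(p) // consumes k bytes from the front
	return string(p[:k]), b.String()
}

func main() {
	read, rest := drain(7)
	fmt.Printf("%q %q\n", read, rest)
}
```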

Grow(n int)

Requests that the buffer's capacity be expanded.

// Grow grows the buffer's capacity, if necessary, to guarantee space for
// another n bytes. After Grow(n), at least n bytes can be written to the
// buffer without another allocation.
// If n is negative, Grow will panic.
// If the buffer can't grow it will panic with ErrTooLarge.
// Reserves room for n more bytes.
func (b *Buffer) Grow(n int) {
    if n < 0 {
        panic("bytes.Buffer.Grow: negative count")
    }
    m := b.grow(n)
    b.buf = b.buf[:m]
}
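A small sketch of the two behaviours documented above: Grow reserves capacity without changing the length, and a negative count panics. The helper names reserve and growPanics are assumptions of this example, not part of the package.

```go
package main

import (
	"bytes"
	"fmt"
)

// reserve grows a zero-value Buffer by n bytes and reports its length and capacity.
func reserve(n int) (length, capacity int) {
	var b bytes.Buffer
	b.Grow(n)
	return b.Len(), b.Cap()
}

// growPanics reports whether Grow(n) panicked.
func growPanics(n int) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	var b bytes.Buffer
	b.Grow(n)
	return false
}

func main() {
	l, c := reserve(1024)
	fmt.Println(l, c >= 1024)                  // length stays 0, capacity is at least 1024
	fmt.Println(growPanics(16), growPanics(-1)) // only the negative count panics
}
```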

WriteString(s string) (n int, err error)

Writes a string into the buffer.

// WriteString appends the contents of s to the buffer, growing the buffer as
// needed. The return value n is the length of s; err is always nil. If the
// buffer becomes too large, WriteString will panic with ErrTooLarge.
// Writes a string directly, growing the buffer automatically.
func (b *Buffer) WriteString(s string) (n int, err error) {
    b.lastRead = opInvalid
    // First try to make room by reslicing, without allocating.
    m, ok := b.tryGrowByReslice(len(s))
    if !ok {
        m = b.grow(len(s))
    }
    // copy accepts a string source and copies its bytes directly.
    return copy(b.buf[m:], s), nil
}

There is also a byte-slice form: Write(p []byte) (n int, err error)
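As an illustration, the sketch below appends with both WriteString and Write and adds up the returned byte counts. The helper name build is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

// build appends a string and a byte slice to the same Buffer and
// returns the contents together with the total number of bytes written.
func build() (string, int) {
	var b bytes.Buffer
	n1, _ := b.WriteString("abc")      // err is documented to always be nil
	n2, _ := b.Write([]byte{'d', 'e'}) // same contract for the []byte form
	return b.String(), n1 + n2
}

func main() {
	s, n := build()
	fmt.Println(s, n)
}
```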

ReadFrom(r io.Reader) (n int64, err error)

Reads data from an io.Reader into the buffer.

// ReadFrom reads data from r until EOF and appends it to the buffer, growing
// the buffer as needed. The return value n is the number of bytes read. Any
// error except io.EOF encountered during the read is also returned. If the
// buffer becomes too large, ReadFrom will panic with ErrTooLarge.
// Reads from r (anything implementing io.Reader) until EOF. If the contents would
// exceed maxInt bytes, a single byte slice cannot hold them and the call panics.
func (b *Buffer) ReadFrom(r io.Reader) (n int64, err error) {
    b.lastRead = opInvalid
    for {
        i := b.grow(MinRead)
        // grow(n) reslices buf to include the n new bytes, so set the length
        // back to i; the length should only ever cover valid data.
        b.buf = b.buf[:i]
        // Read from r into the spare capacity.
        m, e := r.Read(b.buf[i:cap(b.buf)])
        if m < 0 {
            panic(errNegativeRead)
        }

        // Extend the length to cover the bytes just read.
        b.buf = b.buf[:i+m]
        n += int64(m)
        if e == io.EOF {
            return n, nil // e is EOF, so return nil explicitly
        }
        if e != nil {
            return n, e
        }
    }
}
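A short usage sketch: ReadFrom appends to whatever is already in the buffer and swallows io.EOF. The helper name slurp and the "> " prefix are assumptions of this example.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// slurp writes a prefix, then appends everything from a strings.Reader.
// It returns the final contents, the number of bytes ReadFrom reported, and its error.
func slurp(s string) (string, int64, error) {
	var b bytes.Buffer
	b.WriteString("> ")
	n, err := b.ReadFrom(strings.NewReader(s)) // reads until EOF; EOF itself is not returned
	return b.String(), n, err
}

func main() {
	out, n, err := slurp("data")
	fmt.Printf("%q %d %v\n", out, n, err)
}
```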

WriteTo(w io.Writer) (n int64, err error)

Writes the buffer's data to an io.Writer.

// WriteTo writes data to w until the buffer is drained or an error occurs.
// The return value n is the number of bytes written; it always fits into an
// int, but it is int64 to match the io.WriterTo interface. Any error
// encountered during the write is also returned.
func (b *Buffer) WriteTo(w io.Writer) (n int64, err error) {
    b.lastRead = opInvalid
    if nBytes := b.Len(); nBytes > 0 {
        // Write everything from the read offset onward to the io.Writer.
        m, e := w.Write(b.buf[b.off:])
        // A Writer that claims to have written more than it was given is broken.
        if m > nBytes {
            panic("bytes.Buffer.WriteTo: invalid Write count")
        }
        // Advance the read offset by the number of bytes written.
        b.off += m

        n = int64(m)
        if e != nil {
            return n, e
        }
        // all bytes should have been written, by definition of
        // Write method in io.Writer
        // The "too many" case was ruled out above, so a mismatch here is a short write.
        if m != nBytes {
            return n, io.ErrShortWrite
        }
    }
    // Buffer is now empty; reset.
    // Reset once everything has been written.
    b.Reset()
    return n, nil
}
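The two exits of WriteTo can be observed from the outside. The sketch below drains a buffer into a strings.Builder, then uses a deliberately misbehaving writer (halfWriter, invented for this example) that accepts only half of its input without returning an error, which is the case the io.ErrShortWrite branch guards against.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// flush drains a Buffer into a strings.Builder and reports what arrived,
// how many bytes WriteTo counted, and how many bytes are left unread.
func flush(s string) (out string, n int64, left int) {
	src := bytes.NewBufferString(s)
	var dst strings.Builder
	n, _ = src.WriteTo(&dst)
	return dst.String(), n, src.Len()
}

// halfWriter is a hypothetical writer that silently accepts only half of p.
type halfWriter struct{}

func (halfWriter) Write(p []byte) (int, error) { return len(p) / 2, nil }

// shortWrite shows the io.ErrShortWrite path and the data that stays in the buffer.
func shortWrite(s string) (n int64, short bool, rest string) {
	src := bytes.NewBufferString(s)
	n, err := src.WriteTo(halfWriter{})
	return n, err == io.ErrShortWrite, src.String()
}

func main() {
	fmt.Println(flush("payload"))
	fmt.Println(shortWrite("abcdef"))
}
```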

ReadBytes(delim byte) (line []byte, err error)

Reads until the delimiter is found and returns line, a byte slice holding the data up to and including the delimiter.

// ReadBytes reads until the first occurrence of delim in the input,
// returning a slice containing the data up to and including the delimiter.
// If ReadBytes encounters an error before finding a delimiter,
// it returns the data read before the error and the error itself (often io.EOF).
// ReadBytes returns err != nil if and only if the returned data does not end in
// delim.
// Reads up to and including the delimiter, then stops.
func (b *Buffer) ReadBytes(delim byte) (line []byte, err error) {
    slice, err := b.readSlice(delim)
    // return a copy of slice. The buffer's backing array may
    // be overwritten by later calls.
    line = append(line, slice...)
    return line, err
}
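A sketch of the usual read loop, with the hypothetical helper lines: each returned chunk keeps its delimiter, and the final chunk arrives together with io.EOF when the input does not end in the delimiter.

```go
package main

import (
	"bytes"
	"fmt"
)

// lines splits s on '\n' using ReadBytes and keeps the delimiter on each chunk.
func lines(s string) []string {
	b := bytes.NewBufferString(s)
	var out []string
	for {
		line, err := b.ReadBytes('\n')
		if len(line) > 0 {
			out = append(out, string(line))
		}
		if err != nil { // io.EOF: the data returned did not end in the delimiter
			return out
		}
	}
}

func main() {
	fmt.Printf("%q\n", lines("a\nb\nc"))
}
```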

NewBuffer(buf []byte) *Buffer

Creates a new Buffer. Using new(Buffer) or a plain var declaration also works.

// NewBuffer creates and initializes a new Buffer using buf as its
// initial contents. The new Buffer takes ownership of buf, and the
// caller should not use buf after this call. NewBuffer is intended to
// prepare a Buffer to read existing data. It can also be used to size
// the internal buffer for writing. To do that, buf should have the
// desired capacity but a length of zero.
//
// In most cases, new(Buffer) (or just declaring a Buffer variable) is
// sufficient to initialize a Buffer.
// Creates a Buffer from a byte slice; the slice's contents become the initial contents.
// To pre-size the internal buffer for writing, pass a slice with the
// desired capacity and a length of zero.
func NewBuffer(buf []byte) *Buffer { return &Buffer{buf: buf} }

There is also a constructor that takes a string:

func NewBufferString(s string) *Buffer { return &Buffer{buf: []byte(s)} }
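Both constructors in use, as a rough sketch. The helper names presized and seeded are made up here. The first follows the "desired capacity but a length of zero" advice from the doc comment.

```go
package main

import (
	"bytes"
	"fmt"
)

// presized hands NewBuffer an empty slice with capacity 64 and writes into it.
func presized() (length, capacity int, contents string) {
	b := bytes.NewBuffer(make([]byte, 0, 64)) // length 0, capacity 64
	b.WriteString("hi")
	return b.Len(), b.Cap(), b.String()
}

// seeded starts from existing string data and appends to it.
func seeded() string {
	b := bytes.NewBufferString("seed")
	b.WriteString("+more")
	return b.String()
}

func main() {
	fmt.Println(presized())
	fmt.Println(seeded())
}
```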

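Summary

Buffer implements size management and reading and writing of both byte slices and strings, with optimizations for common cases such as the bootstrap array and the multiple checks inside grow. It rewards repeated, careful reading.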

Overview reader.go

This file is simple and has little core logic: it is a concrete type that implements the reader interfaces.

Structure

// A Reader implements the io.Reader, io.ReaderAt, io.WriterTo, io.Seeker,
// io.ByteScanner, and io.RuneScanner interfaces by reading from
// a byte slice.
// Unlike a Buffer, a Reader is read-only and supports seeking.
// Implements the various read methods; unlike Buffer, it is read-only and supports seeking.
type Reader struct {
    s        []byte
    i        int64 // current reading index
    prevRune int   // index of previous rune; or < 0
}

Summary

Note that bytes.Reader is read-only.
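To show the seeking that Buffer lacks, here is a minimal sketch. The helper name window is hypothetical. It positions the Reader at a byte offset and reads a fixed number of bytes from there.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// window seeks to offset (from the start) and reads up to n bytes.
func window(data string, offset int64, n int) (string, error) {
	r := bytes.NewReader([]byte(data))
	if _, err := r.Seek(offset, io.SeekStart); err != nil {
		return "", err
	}
	p := make([]byte, n)
	k, err := r.Read(p)
	return string(p[:k]), err
}

func main() {
	s, err := window("hello world", 6, 5)
	fmt.Printf("%q %v\n", s, err)
}
```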

Overview bytes.go

Functions for manipulating byte slices, analogous to the strings package.

Core functions

genSplit(s, sep []byte, sepSave, n int) [][]byte

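This is the core function behind slice splitting.

It takes four parameters: s is the slice to split; sep is the separator; sepSave is how many bytes of the separator to keep at the end of each subslice; n is the maximum number of subslices to return (n < 0 means no limit, n == 0 returns nil).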

// Generic split: splits after each instance of sep,
// including sepSave bytes of sep in the subslices.
// Splits s at each occurrence of sep into at most n subslices; on each match,
// sepSave bytes of sep are kept at the end of the preceding subslice.
func genSplit(s, sep []byte, sepSave, n int) [][]byte {
    if n == 0 {
        return nil
    }
    if len(sep) == 0 {
        return explode(s, n)
    }
    if n < 0 {
        n = Count(s, sep) + 1
    }

    a := make([][]byte, n)
    n--
    i := 0
    for i < n {
        m := Index(s, sep)
        if m < 0 {
            break
        }
        a[i] = s[: m+sepSave : m+sepSave]
        s = s[m+len(sep):]
        i++
    }
    a[i] = s
    return a[:i+1]
}
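genSplit is unexported, but its parameters surface through the public wrappers. In the version of the source shown here, Split passes sepSave = 0 and n = -1, SplitN passes the caller's n, and SplitAfter passes sepSave = len(sep). The sketch below shows the visible effect of each; the helper render is an assumption of this example and simply joins the parts with "|" for display.

```go
package main

import (
	"bytes"
	"fmt"
)

// render joins the parts with '|' so the split points are easy to see.
func render(parts [][]byte) string {
	return string(bytes.Join(parts, []byte("|")))
}

func splitAll(s string) string {
	return render(bytes.Split([]byte(s), []byte(",")))
}

func splitN(s string, n int) string {
	return render(bytes.SplitN([]byte(s), []byte(","), n))
}

func splitAfter(s string) string {
	return render(bytes.SplitAfter([]byte(s), []byte(",")))
}

func main() {
	fmt.Println(splitAll("a,b,c"))   // separator dropped, no limit
	fmt.Println(splitN("a,b,c", 2))  // at most two subslices
	fmt.Println(splitAfter("a,b,c")) // separator kept on each subslice
}
```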

Fields(s []byte) [][]byte

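Its main value is that runs of consecutive separators do not produce empty fields.

The clever part is a [256]uint8 lookup table that identifies ASCII whitespace, combined with OR-ing every byte into one value to detect whether any non-ASCII byte is present. When none is, a fast ASCII-only path does the splitting.

FieldsFunc is the variant that takes a custom rule.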

var asciiSpace = [256]uint8{'\t': 1, '\n': 1, '\v': 1, '\f': 1, '\r': 1, ' ': 1}

// Fields interprets s as a sequence of UTF-8-encoded code points.
// It splits the slice s around each instance of one or more consecutive white space
// characters, as defined by unicode.IsSpace, returning a slice of subslices of s or an
// empty slice if s contains only white space.
func Fields(s []byte) [][]byte {
    // First count the fields.
    // This is an exact count if s is ASCII, otherwise it is an approximation.
    n := 0
    wasSpace := 1
    // setBits is used to track which bits are set in the bytes of s.
    // In other words: OR all bytes together to tell whether every byte is
    // plain ASCII, in which case no UTF-8 decoding is needed.
    setBits := uint8(0)
    // wasSpace makes a run of consecutive spaces count only once.
    for i := 0; i < len(s); i++ {
        r := s[i]
        setBits |= r
        isSpace := int(asciiSpace[r])
        n += wasSpace & ^isSpace
        wasSpace = isSpace
    }
    // Not pure ASCII: fall back to the UTF-8 aware path.
    if setBits >= utf8.RuneSelf {
        // Some runes in the input slice are not ASCII.
        return FieldsFunc(s, unicode.IsSpace)
    }

    // ASCII fast path
    a := make([][]byte, n)
    na := 0
    fieldStart := 0
    i := 0
    // Skip spaces in the front of the input.
    for i < len(s) && asciiSpace[s[i]] != 0 {
        i++
    }
    fieldStart = i
    for i < len(s) {
        if asciiSpace[s[i]] == 0 {
            i++
            continue
        }
        a[na] = s[fieldStart:i:i]
        na++
        i++
        // Skip spaces in between fields.
        for i < len(s) && asciiSpace[s[i]] != 0 {
            i++
        }
        fieldStart = i
    }
    // The loop above only emits a field when it sees a space after it,
    // so a final field that runs to the end of s is added here.
    if fieldStart < len(s) { // Last field might end at EOF.
        a[na] = s[fieldStart:len(s):len(s)]
    }
    return a
}
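A usage sketch covering both paths plus the custom-rule variant. The helpers render, fields and fieldsBy are invented for this example; "\u3000" (ideographic space) is used as a non-ASCII whitespace character that unicode.IsSpace accepts.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// render joins the fields with '|' so the boundaries are visible.
func render(parts [][]byte) string {
	return string(bytes.Join(parts, []byte("|")))
}

// fields splits on runs of white space.
func fields(s string) string {
	return render(bytes.Fields([]byte(s)))
}

// fieldsBy splits on runs of any rune contained in seps (a custom rule).
func fieldsBy(s, seps string) string {
	isSep := func(r rune) bool { return strings.ContainsRune(seps, r) }
	return render(bytes.FieldsFunc([]byte(s), isSep))
}

func main() {
	fmt.Println(fields("  foo \t bar\nbaz ")) // ASCII-only input
	fmt.Println(fields("a\u3000b"))           // non-ASCII white space
	fmt.Println(fieldsBy("a,,b;c", ",;"))     // consecutive separators collapse
}
```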

Join(s [][]byte, sep []byte) []byte

Where there is splitting there is joining: the separator sep is inserted between the elements.

// Join concatenates the elements of s to create a new byte slice. The separator
// sep is placed between elements in the resulting slice.
func Join(s [][]byte, sep []byte) []byte {
    if len(s) == 0 {
        return []byte{}
    }
    if len(s) == 1 {
        // Just return a copy.
        return append([]byte(nil), s[0]...)
    }
    // Work out how long the result needs to be.
    n := len(sep) * (len(s) - 1)
    for _, v := range s {
        n += len(v)
    }

    b := make([]byte, n)
    bp := copy(b, s[0])
    for _, v := range s[1:] {
        bp += copy(b[bp:], sep)
        bp += copy(b[bp:], v)
    }
    return b
}
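A brief sketch of Join, including the single-element case, where the listing above returns a copy rather than the input slice itself. The helper names joinAll and singleIsCopy are assumptions of this example.

```go
package main

import (
	"bytes"
	"fmt"
)

// joinAll joins the given strings with ", " via bytes.Join.
func joinAll(items ...string) string {
	parts := make([][]byte, len(items))
	for i, it := range items {
		parts[i] = []byte(it)
	}
	return string(bytes.Join(parts, []byte(", ")))
}

// singleIsCopy reports whether mutating the result of a one-element Join
// leaves the original element untouched.
func singleIsCopy() bool {
	orig := []byte("x")
	out := bytes.Join([][]byte{orig}, []byte(","))
	out[0] = 'y'
	return orig[0] == 'x'
}

func main() {
	fmt.Println(joinAll("a", "b", "c"))
	fmt.Println(singleIsCopy())
}
```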

Map(mapping func(r rune) rune, s []byte) []byte

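Returns a copy of the slice in which every rune has been replaced according to the mapping function; a rune mapped to a negative value is dropped.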

// Map returns a copy of the byte slice s with all its characters modified
// according to the mapping function. If mapping returns a negative value, the character is
// dropped from the byte slice with no replacement. The characters in s and the
// output are interpreted as UTF-8-encoded code points.
func Map(mapping func(r rune) rune, s []byte) []byte {
    // In the worst case, the slice can grow when mapped, making
    // things unpleasant. But it's so rare we barge in assuming it's
    // fine. It could also shrink but that falls out naturally.
    maxbytes := len(s) // length of b
    nbytes := 0        // number of bytes encoded in b
    b := make([]byte, maxbytes)
    for i := 0; i < len(s); {
        wid := 1
        r := rune(s[i])
        if r >= utf8.RuneSelf {
            r, wid = utf8.DecodeRune(s[i:])
        }
        r = mapping(r)
        if r >= 0 {
            rl := utf8.RuneLen(r)
            if rl < 0 {
                rl = len(string(utf8.RuneError))
            }
            if nbytes+rl > maxbytes {
                // Grow the buffer.
                maxbytes = maxbytes*2 + utf8.UTFMax
                nb := make([]byte, maxbytes)
                copy(nb, b[0:nbytes])
                b = nb
            }
            nbytes += utf8.EncodeRune(b[nbytes:maxbytes], r)
        }
        i += wid
    }
    return b[0:nbytes]
}
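Two illustrative mappings, with hypothetical helper names: clean drops digits (by returning -1) and upper-cases everything else, while widen replaces a one-byte rune with a three-byte one, which exercises the "grow the buffer" branch.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode"
)

// clean drops digits and upper-cases the remaining runes.
func clean(s string) string {
	return string(bytes.Map(func(r rune) rune {
		if unicode.IsDigit(r) {
			return -1 // negative: drop the rune
		}
		return unicode.ToUpper(r)
	}, []byte(s)))
}

// widen maps every 'a' to '世', so the output is longer than the input.
func widen(s string) string {
	return string(bytes.Map(func(r rune) rune {
		if r == 'a' {
			return '世'
		}
		return r
	}, []byte(s)))
}

func main() {
	fmt.Println(clean("go2bytes"))
	fmt.Println(widen("aba"), len(widen("aba")))
}
```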

indexFunc(s []byte, f func(r rune) bool, truth bool) int

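Returns the byte index of the first rune that matches the predicate, or -1 if there is none.

The truth argument decides whether "matches" means the predicate returns true or returns false.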

// indexFunc is the same as IndexFunc except that if
// truth==false, the sense of the predicate function is
// inverted.
func indexFunc(s []byte, f func(r rune) bool, truth bool) int {
    start := 0
    for start < len(s) {
        wid := 1
        r := rune(s[start])
        // Not a single-byte (ASCII) rune: decode the full rune with utf8.DecodeRune.
        if r >= utf8.RuneSelf {
            r, wid = utf8.DecodeRune(s[start:])
        }
        if f(r) == truth {
            return start
        }
        start += wid
    }
    return -1
}
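Both senses of the predicate are reachable from the public API. As far as I can tell, in the version shown here IndexFunc calls indexFunc with truth = true and TrimLeftFunc calls it with truth = false (to find the first rune that should not be trimmed). The sketch uses only those public functions; the helper names are made up. Note that the result is a byte index, not a rune index.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode"
)

// firstSpace returns the byte index of the first white space rune, or -1.
func firstSpace(s string) int {
	return bytes.IndexFunc([]byte(s), unicode.IsSpace)
}

// skipSpace trims leading white space, i.e. it keeps everything from the
// first rune for which the predicate is false.
func skipSpace(s string) string {
	return string(bytes.TrimLeftFunc([]byte(s), unicode.IsSpace))
}

func main() {
	fmt.Println(firstSpace("héllo wörld")) // 'é' occupies two bytes
	fmt.Println(firstSpace("nospace"))
	fmt.Printf("%q\n", skipSpace("  42abc"))
}
```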

makeCutsetFunc(cutset string) func(r rune) bool

Builds a predicate function from the string cutset: the returned function reports true if the rune is contained in cutset and false otherwise.

func makeCutsetFunc(cutset string) func(r rune) bool {
    if len(cutset) == 1 && cutset[0] < utf8.RuneSelf {
        return func(r rune) bool {
            return r == rune(cutset[0])
        }
    }
    if as, isASCII := makeASCIISet(cutset); isASCII {
        return func(r rune) bool {
            return r < utf8.RuneSelf && as.contains(byte(r))
        }
    }
    return func(r rune) bool {
        for _, c := range cutset {
            if c == r {
                return true
            }
        }
        return false
    }
}
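The three branches above correspond to three kinds of cutset: a single ASCII character, several ASCII characters, and a set containing non-ASCII runes. My understanding is that in this version of the package Trim builds its predicate with makeCutsetFunc, so the sketch below exercises one cutset of each kind through bytes.Trim. The helper name trim is an assumption of this example.

```go
package main

import (
	"bytes"
	"fmt"
)

// trim removes leading and trailing runes that appear in cutset.
func trim(s, cutset string) string {
	return string(bytes.Trim([]byte(s), cutset))
}

func main() {
	fmt.Println(trim("xxhixx", "x"))   // single ASCII character
	fmt.Println(trim("-=hi=-", "-="))  // several ASCII characters
	fmt.Println(trim("「hi」", "「」")) // non-ASCII cutset
}
```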

Frequently used helper functions (DecodeRune comes from the unicode/utf8 package)

DecodeRune(p []byte) (r rune, size int)

// DecodeRune unpacks the first UTF-8 encoding in p and returns the rune and
// its width in bytes. If p is empty it returns (RuneError, 0). Otherwise, if
// the encoding is invalid, it returns (RuneError, 1). Both are impossible
// results for correct, non-empty UTF-8.
//
// An encoding is invalid if it is incorrect UTF-8, encodes a rune that is
// out of range, or is not the shortest possible UTF-8 encoding for the
// value. No other validation is performed.
func DecodeRune(p []byte) (r rune, size int) {
    n := len(p)
    if n < 1 {
        return RuneError, 0
    }
    p0 := p[0]
    x := first[p0]
    if x >= as {
        // The following code simulates an additional check for x == xx and
        // handling the ASCII and invalid cases accordingly. This mask-and-or
        // approach prevents an additional branch.
        mask := rune(x) << 31 >> 31 // Create 0x0000 or 0xFFFF.
        return rune(p[0])&^mask | RuneError&mask, 1
    }
    sz := x & 7
    accept := acceptRanges[x>>4]
    if n < int(sz) {
        return RuneError, 1
    }
    b1 := p[1]
    if b1 < accept.lo || accept.hi < b1 {
        return RuneError, 1
    }
    if sz == 2 {
        return rune(p0&mask2)<<6 | rune(b1&maskx), 2
    }
    b2 := p[2]
    if b2 < locb || hicb < b2 {
        return RuneError, 1
    }
    if sz == 3 {
        return rune(p0&mask3)<<12 | rune(b1&maskx)<<6 | rune(b2&maskx), 3
    }
    b3 := p[3]
    if b3 < locb || hicb < b3 {
        return RuneError, 1
    }
    return rune(p0&mask4)<<18 | rune(b1&maskx)<<12 | rune(b2&maskx)<<6 | rune(b3&maskx), 4
}
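The documented return values are easy to check from the outside. This sketch walks a byte slice rune by rune and records each width, then probes the two error cases from the doc comment; the helper names widths and probe are made up for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// widths decodes p rune by rune and returns the byte width of each rune.
func widths(p []byte) []int {
	var out []int
	for len(p) > 0 {
		_, size := utf8.DecodeRune(p)
		out = append(out, size)
		p = p[size:] // size is at least 1 for non-empty input
	}
	return out
}

// probe reports whether the first decode yields RuneError, and its size.
func probe(p []byte) (isError bool, size int) {
	r, size := utf8.DecodeRune(p)
	return r == utf8.RuneError, size
}

func main() {
	fmt.Println(widths([]byte("aé世")))
	fmt.Println(probe([]byte{0xff})) // invalid encoding
	fmt.Println(probe(nil))          // empty input
}
```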

Equal(a, b []byte) bool

//go:noescape

// Equal returns a boolean reporting whether a and b
// are the same length and contain the same bytes.
// A nil argument is equivalent to an empty slice.
func Equal(a, b []byte) bool // in internal/bytealg
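A quick check of the documented contract, in particular that a nil slice compares equal to an empty one. The helper name check is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

// check evaluates three comparisons: nil vs empty, identical contents,
// and different contents.
func check() (nilVsEmpty, same, differ bool) {
	nilVsEmpty = bytes.Equal(nil, []byte{})
	same = bytes.Equal([]byte("go"), []byte("go"))
	differ = bytes.Equal([]byte("go"), []byte("Go"))
	return
}

func main() {
	fmt.Println(check())
}
```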

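Summary

This file implements nearly every operation one could want on byte slices. Most functions treat the input as UTF-8 and switch to an ASCII fast path when they can. It covers:

  • Splitting: by separators under various rules (including custom rules)
  • Trimming: built-in left/right trimming (with custom rules) and whitespace trimming
  • Joining
  • Indexing
  • Replacing
    • replacement under various rules
    • built-in upper case, lower case and title case conversion

All of these are broken into small functions inside the package, which makes them easy to customize; there are built-in checks such as "has this prefix" or "contains this code point". It is like being handed a pistol and a few bullets: for more functionality, just make your own special bullets. If this modularity is what interests you, look at the functions whose names end in Func.

P.S. The package contains a Rabin-Karp search implementation, which is used by Index, the function that returns the position of a match.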
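For readers unfamiliar with Rabin-Karp, here is a simplified sketch of the idea: hash the pattern once, keep a rolling hash of the current window of s, and only compare bytes when the hashes agree. This is an illustration written for this note, not the package's code; the names hashOf and indexRK are made up, and the multiplier 16777619 is simply the one I recall the standard library using.

```go
package main

import (
	"bytes"
	"fmt"
)

// prime is the hash multiplier (an assumption of this sketch).
const prime = 16777619

// hashOf returns the hash of sep and prime^len(sep), both modulo 2^32.
func hashOf(sep []byte) (hash, pow uint32) {
	pow = 1
	for _, c := range sep {
		hash = hash*prime + uint32(c)
		pow *= prime
	}
	return hash, pow
}

// indexRK returns the index of the first occurrence of sep in s, or -1.
func indexRK(s, sep []byte) int {
	n := len(sep)
	if n > len(s) {
		return -1
	}
	want, pow := hashOf(sep)
	var h uint32
	for i := 0; i < n; i++ {
		h = h*prime + uint32(s[i]) // hash of the first window s[0:n]
	}
	if h == want && bytes.Equal(s[:n], sep) {
		return 0
	}
	for i := n; i < len(s); i++ {
		// Slide the window by one byte: add s[i], remove s[i-n].
		h = h*prime + uint32(s[i]) - pow*uint32(s[i-n])
		// Hashes can collide, so confirm with a real comparison.
		if h == want && bytes.Equal(s[i-n+1:i+1], sep) {
			return i - n + 1
		}
	}
	return -1
}

func main() {
	s, sep := []byte("hello world"), []byte("world")
	fmt.Println(indexRK(s, sep), bytes.Index(s, sep))
}
```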

Source: bytes包 · 语雀
