Building High-Performance Userspace eBPF VMs with LLVM

We are excited to introduce llvmbpf, a new project aimed at empowering developers with a high-performance, multi-architecture eBPF virtual machine (VM) that leverages the LLVM framework for Just-In-Time (JIT) and Ahead-Of-Time (AOT) compilation.

This component is part of the bpftime project but focuses solely on the core VM. It operates as a standalone eBPF VM library or a compiler tool. This library is optimized for performance, flexibility, and minimal dependencies, making it easy to integrate into various environments without unnecessary overhead.

Why llvmbpf?

Although there are several userspace eBPF runtimes available, we built llvmbpf to address specific needs that existing solutions may not fully satisfy:

  1. AOT Compiler: The ability to compile eBPF bytecode into native ELF object files allows developers to deploy pre-compiled eBPF programs, ensuring high performance and efficiency, especially in resource-constrained environments. Additionally, it can allow you to experiment with different optimization techniques based on LLVM IR, providing more flexibility and control over the compilation process.

  2. Standalone Deployment: With llvmbpf, you can build eBPF programs into standalone binaries that don’t require external dependencies. This feature is particularly useful for deploying eBPF programs on embedded systems, microcontrollers, or other environments where installing additional software is impractical. Compared to native C code development, this workflow keeps the benefit that the eBPF part has been checked by the verifier before integration.

  3. All-Architecture Support: llvmbpf is designed to be compatible across multiple architectures, making it versatile for a wide range of hardware platforms.

  4. Maps and Relocation Support: Unlike many other userspace eBPF solutions, llvmbpf provides robust support for maps, data relocation, and lddw helper functions, allowing for the creation of more complex and powerful eBPF programs.

  5. Extensible Optimization Approaches: Leveraging LLVM’s powerful optimization capabilities, llvmbpf allows for advanced optimizations such as inlining maps and helper functions, as well as using original LLVM IR for enhanced performance.

In this blog, we’ll walk through some practical examples of how to use llvmbpf, highlighting its core features and capabilities.

For a comprehensive userspace eBPF runtime that includes support for maps, helpers, and seamless execution of Uprobe, syscall trace, XDP, and other eBPF programs—similar to kernel functionality but in userspace—please refer to the bpftime project.

Getting Started with llvmbpf

Using llvmbpf as a Library

llvmbpf can be used as a library within your application to load and execute eBPF programs. Here’s a basic example:

void run_ebpf_prog(const void *code, size_t code_len) {
    uint64_t res = 0;
    llvmbpf_vm vm;

    res = vm.load_code(code, code_len);
    if (res) {
        return;
    }
    // Register helper id 2 so the eBPF program can invoke it via `call 2`.
    vm.register_external_function(2, "print", (void *)ffi_print_func);
    auto func = vm.compile();
    if (!func) {
        return;
    }
    // bpf_mem is the program's input buffer, defined elsewhere;
    // the program's return value is written into res.
    int err = vm.exec(&bpf_mem, sizeof(bpf_mem), res);
    if (err != 0) {
        return;
    }
    printf("res = %" PRIu64 "\n", res);
}

This snippet shows how you can load eBPF bytecode, register external functions, and execute the program within the VM.

Using llvmbpf as an AOT Compiler

One of the most powerful features of llvmbpf is its ability to function as an AOT compiler, converting eBPF bytecode into native ELF object files. This approach not only boosts performance but also simplifies the deployment of eBPF programs.

You can use the CLI to generate LLVM IR from eBPF bytecode:

# ./build/cli/bpftime-vm build .github/assets/sum.bpf.o -emit-llvm > test.bpf.ll
# opt -O3 -S test.bpf.ll -opaque-pointers -o test.opt.ll
# cat test.opt.ll

AOT Compile an eBPF program:

# ./build/cli/bpftime-vm build .github/assets/sum.bpf.o
[info] Processing program test
[info] Program test written to ./test.o

Load and run an AOT-compiled eBPF program:

# echo "AwAAAAEAAAACAAAAAwAAAA==" | base64 -d > test.bin
# ./build/cli/bpftime-vm run test.o test.bin
[info] LLVM-JIT: Loading aot object
[info] Program executed successfully. Return value: 6

The resulting ELF object file can be linked with other object files or loaded directly into the llvmbpf runtime, making it highly versatile for different use cases.

Loading eBPF Bytecode from ELF Files

llvmbpf supports loading eBPF bytecode directly from ELF files, which is a common format for storing compiled eBPF programs. This feature is particularly useful when working with existing eBPF toolchains.

bpf_object *obj = bpf_object__open(ebpf_elf.c_str());
if (!obj) {
    return 1;
}
std::unique_ptr<bpf_object, decltype(&bpf_object__close)> elf(
    obj, bpf_object__close);

bpf_program *prog;
for (prog = bpf_object__next_program(elf.get(), nullptr);
     prog != nullptr;
     prog = bpf_object__next_program(elf.get(), prog)) {
    llvmbpf_vm vm;
    // Each eBPF instruction is 8 bytes wide.
    vm.load_code((const void *)bpf_program__insns(prog),
                 (uint32_t)bpf_program__insn_cnt(prog) * 8);
}

However, a plain bpf.o ELF file carries no map or data-relocation information. We recommend using bpftime to load and relocate the eBPF bytecode from an ELF file. This includes:

  • Writing a loader similar to the kernel eBPF loader to load the eBPF bytecode (see an example here).
  • Using libbpf, which supports:
    • Relocation for maps, where the map ID is allocated by the loader and bpftime. You can use the map ID to access maps through the helpers.
    • Accessing data through the lddw helper function.
  • After loading the eBPF bytecode and completing relocation, you can use the bpftimetool to dump the map information and eBPF bytecode.

Maps and Data Relocation Support

llvmbpf offers extensive support for maps and data relocation, allowing developers to write more complex eBPF programs that interact with different data sources. For instance, you can use helper functions to access maps or define maps as global variables in your eBPF programs.

An eBPF program can work with maps in two ways:

  • Using helper functions to access the maps, like bpf_map_lookup_elem, bpf_map_update_elem, etc.
  • Using maps as global variables in the eBPF program and accessing the maps directly.
uint32_t ctl_array[2] = { 0, 0 };
uint64_t cntrs_array[2] = { 0, 0 };

void *bpf_map_lookup_elem(uint64_t map_fd, void *key) {
    if (map_fd == 5) {
        return &ctl_array[*(uint32_t *)key];
    } else if (map_fd == 6) {
        return &cntrs_array[*(uint32_t *)key];
    } else {
        return nullptr;
    }
}

Building into Standalone Binary for Deployment

One of the standout features of llvmbpf is the ability to compile eBPF programs into standalone binaries. This makes it possible to deploy eBPF applications in environments where installing dependencies is not feasible, such as microcontrollers or other embedded systems.

You can build the eBPF program into a standalone binary that does not rely on any external libraries and can be executed like normal C code with helper and map support.

This approach offers several benefits:

  • Easily deploy the eBPF program to any machine without needing to install dependencies.
  • Avoid the overhead of loading the eBPF bytecode and maps at runtime.
  • Make it suitable for microcontrollers or embedded systems that do not have an OS.

Here’s a basic example:

#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

// Entry point of the AOT-compiled eBPF program.
int bpf_main(void* ctx, uint64_t size);

uint32_t ctl_array[2] = { 0, 0 };
uint64_t cntrs_array[2] = { 0, 0 };

// Host-side implementation of helper id 1 (bpf_map_lookup_elem).
void *_bpf_helper_ext_0001(uint64_t map_fd, void *key) {
    printf("bpf_map_lookup_elem %lu\n", map_fd);
    if (map_fd == 5) {
        return &ctl_array[*(uint32_t *)key];
    } else if (map_fd == 6) {
        return &cntrs_array[*(uint32_t *)key];
    } else {
        return NULL;
    }
}

// Resolves map values referenced via lddw instructions.
void* __lddw_helper_map_val(uint64_t val) {
    printf("map_val %lu\n", val);
    if (val == 5) {
        return (void *)ctl_array;
    } else if (val == 6) {
        return (void *)cntrs_array;
    } else {
        return NULL;
    }
}

uint8_t bpf_mem[] = { 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88 };

int main() {
    printf("The value of cntrs_array[0] is %" PRIu64 "\n", cntrs_array[0]);
    printf("calling ebpf program...\n");
    bpf_main(bpf_mem, sizeof(bpf_mem));
    printf("The value of cntrs_array[0] is %" PRIu64 "\n", cntrs_array[0]);
    return 0;
}

Compile the C code with the LLVM IR:

clang -g main.c xdp-counter.ll -o standalone 

You can then run the standalone eBPF program directly. Compared to developing the same logic in native C, this workflow keeps the benefit that the eBPF part has been checked by the verifier before integration.

Optimization Techniques

llvmbpf provides several optimization techniques to enhance the performance of eBPF programs. Two notable methods include:

Inlining Maps and Helper Functions

By inlining maps and helper functions, llvmbpf reduces the overhead of function calls, enabling more efficient execution of eBPF programs.

clang -S -O3 -emit-llvm libmap.c -o libmap.ll
llvm-link -S -o xdp-counter-inline.ll xdp-counter.ll libmap.ll
opt --always-inline -S xdp-counter-inline.ll -o xdp-counter-inline.ll

Using Original LLVM IR from C Code

Instead of relying solely on eBPF instructions, llvmbpf allows developers to use original LLVM IR generated from C code. This flexibility opens the door for more advanced optimizations and higher performance.

int bpf_main(void* ctx, int size) {
    _bpf_helper_ext_0006("hello world: %d\n", size);
    return 0;
}

eBPF is an instruction set designed with verification in mind, which is not always ideal for performance. llvmbpf therefore also supports using the original LLVM IR from C code. See example/load-llvm-ir for an example. You can:

  • Compile the C code to eBPF for verification.
  • Compile the C code to LLVM IR and native code for execution in the VM.

Conclusion

llvmbpf is a powerful tool for developers looking to leverage eBPF outside the kernel. With features like AOT compilation, standalone deployment, and extensive support for maps and relocation, it offers a flexible and high-performance solution for a wide range of use cases. Whether you’re working on networking, security, or performance monitoring applications, llvmbpf provides the tools you need to build efficient and portable eBPF programs.

Review: Advances in Userspace-Kernel Interaction in Operating Systems

Introduction

Userspace-kernel interaction is a key performance determinant in operating systems, particularly when considering the needs of modern distributed systems and datacenter environments. Traditional methods, like system calls, introduce significant overhead, leading to inefficiencies in I/O processing, context switching, and resource management. As applications demand higher throughput, lower latency, and more scalable solutions, researchers have explored novel approaches to optimizing the boundary between userspace and kernelspace.

This literature review explores four key developments aimed at improving userspace-kernel interaction: (1) optimizing dataplane operations to reduce kernel involvement, (2) speculative execution to hide latency in distributed systems, (3) kernel-bypass techniques for microsecond-scale I/O, and (4) modularizing kernel functions to enhance flexibility and performance. These approaches reflect ongoing efforts to redesign how operating systems manage interactions between userspace and kernelspace, providing new pathways for improved performance in modern computing.

Optimizing Dataplane Operations for Reduced Kernel Involvement

The first major strategy focuses on minimizing kernel intervention in network-related operations, particularly in high-performance environments like datacenters. Belay et al.’s IX operating system (2014) introduced a novel approach by decoupling the control plane and dataplane, effectively limiting the need for userspace-kernel context switching during network packet processing. IX handles packets primarily in userspace through a zero-copy API, allowing efficient data movement without the typical latency penalties incurred by frequent transitions between userspace and kernelspace.

In comparison, conventional OS architectures like Linux are significantly less efficient in handling network I/O due to frequent kernel involvement. The performance of IX, which demonstrated a 10× improvement in throughput over Linux and a 1.9× improvement over mTCP, showcases the benefits of reducing kernel interaction in dataplane operations.

However, IX’s reliance on specialized APIs and architecture means that it requires changes in application design to fully utilize its benefits, which might not always be feasible in legacy systems or applications that rely on POSIX-compliant system calls. While it offers high performance in specific use cases, it lacks the broad applicability of traditional systems like Linux, which provides general-purpose versatility at the expense of optimal performance in specialized scenarios.

In conclusion, IX demonstrates the power of reducing kernel involvement in specific environments like datacenters but may not be a universal solution due to its reliance on custom implementations. In contrast, Linux remains dominant for its flexibility across a wide range of applications, though with higher latency and inefficiencies.

Speculative Execution to Hide Latency in Userspace-Kernel Communication

A different approach to optimizing userspace-kernel interaction focuses on improving I/O performance through speculative execution. Nightingale et al.’s Speculator framework (2005) introduced speculative execution to distributed file systems like NFS and AFS, allowing processes to continue while awaiting the results of kernel I/O operations. By checkpointing the system state and predicting the outcomes of I/O tasks, Speculator can reduce latency without sacrificing correctness. If a prediction proves incorrect, the system simply rolls back to the checkpoint and re-executes the task.

This speculative execution mechanism is highly effective in reducing the latency associated with remote I/O operations, especially in high-latency distributed environments. Compared to traditional synchronous I/O approaches, Speculator improves throughput by as much as 2.5× in NFS-based environments. The introduction of speculation also offers significant improvements in scenarios where network latency is a major bottleneck.

In contrast, traditional systems like NFS and AFS are inherently synchronous, waiting for confirmation from remote servers before proceeding. This design ensures data consistency but at the cost of introducing significant latency, particularly in distributed environments with long round-trip times.

In conclusion, while speculative execution offers a highly effective solution for reducing I/O latency, it is not without risks. The reliance on correct predictions may introduce complexity and potential performance penalties in the event of incorrect speculation, especially in environments where I/O patterns are less predictable. Additionally, speculative execution might not be suitable for all use cases, especially those requiring strict determinism or minimal state rollback.

Kernel-Bypass Techniques for Microsecond-Scale I/O Processing

The third area of innovation bypasses the kernel entirely for I/O processing in environments where microsecond-scale latency is crucial. Demikernel, a solution presented in 2021, introduces a portable datapath OS for microsecond-scale I/O, bypassing the kernel to directly interface with hardware devices. This approach is especially beneficial for time-sensitive applications, such as high-frequency trading, where the overhead of kernel intervention can make a significant difference in performance.

Kernel-bypass mechanisms like those in Demikernel allow userspace applications to communicate directly with hardware through devices like RDMA and DPDK. In contrast to traditional I/O processing, which incurs the overhead of system calls, Demikernel’s architecture minimizes this interaction, resulting in nanosecond-scale overheads. The system’s centralized coroutine scheduler also optimizes memory management and balances I/O and application work efficiently.

Compared to traditional kernel-mediated I/O processing, which can result in significant overhead due to context switching, Demikernel’s kernel-bypass techniques represent a marked improvement. However, one limitation is that kernel-bypass systems often require specific hardware support, making them less universally applicable than traditional methods. Furthermore, while bypassing the kernel can improve I/O performance, it may sacrifice certain levels of security and fault tolerance, as the kernel’s mediation is often crucial for these aspects.

Modular Microkernel Designs for Flexible Userspace-Kernel Interaction

Finally, we examine the potential of modular kernel designs to improve userspace-kernel interactions by reducing the performance overhead associated with traditional monolithic kernel architectures. The modular microkernel approach, exemplified by Baumann et al.’s multikernel (2009), advocates for the separation of core OS services into independent modules, which allows for more efficient communication between userspace and kernelspace by isolating performance-critical tasks.

In a traditional monolithic kernel, all core services—such as memory management, file systems, and device drivers—are bundled together, which increases the cost of I/O processing and userspace-kernel transitions. By contrast, microkernel architectures modularize these services, enabling more efficient handling of high-concurrency workloads. This approach reduces the need for frequent context switching and allows for dynamic scaling of kernel modules based on the needs of the application.

The evaluation of modular microkernel designs showed that they significantly reduced I/O processing times in high-concurrency environments, outperforming monolithic kernels in terms of scalability and flexibility. However, modular designs also introduce complexity in system management and debugging, as more components must be orchestrated in harmony.

In conclusion, while microkernel architectures offer improved flexibility and scalability, they are more complex to design, implement, and maintain compared to monolithic kernels. The trade-off between flexibility and complexity is a key consideration for system designers, depending on the intended application environment.

Conclusion

The interaction between userspace and kernelspace is a critical factor in determining operating system performance, especially in high-concurrency and low-latency environments. This literature review has explored four major advancements aimed at optimizing this interaction: (1) optimizing dataplane operations to reduce kernel involvement, (2) using speculative execution to hide latency, (3) implementing kernel-bypass techniques for microsecond-scale I/O, and (4) adopting modular kernel designs to improve flexibility and performance.

Each approach offers valuable insights into how the traditional boundaries between userspace and kernelspace can be redefined to meet the needs of modern computing environments. While these solutions offer significant performance improvements, they come with trade-offs in terms of complexity, hardware dependency, and application-specific applicability. As computing systems evolve, the need for tailored solutions that balance these trade-offs will remain a critical challenge for researchers and system architects.

References

  1. Belay, A., Prekas, G., Klimovic, A., Grossman, S., Kozyrakis, C., & Bugnion, E. (2014). IX: A protected dataplane operating system for high throughput and low latency. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) (pp. 49-65).
  2. Nightingale, E. B., Chen, P. M., & Flinn, J. (2005). Speculative execution in a distributed file system. ACM SIGOPS Operating Systems Review, 39(5), 191-205.
  3. Zhang, I., Raybuck, A., Patel, P., Olynyk, K., Nelson, J., Leija, O. S. N., … & Badam, A. (2021, October). The Demikernel datapath OS architecture for microsecond-scale datacenter systems. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (pp. 195-211).
  4. Baumann, A., Barham, P., Dagand, P. E., Harris, T., Isaacs, R., Peter, S., … & Singhania, A. (2009, October). The multikernel: A new OS architecture for scalable multicore systems. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (pp. 29-44).

Simplifying Kernel Programming: The LLM-Powered eBPF Tool

Kernel programming can be intimidating, requiring deep knowledge of operating system internals and programming constraints. Our latest tool, Kgent, aims to change that by making it easier than ever to create extended Berkeley Packet Filters (eBPF) programs. Kgent leverages the power of large language models (LLMs) to translate natural language prompts into eBPF code, opening up kernel programming to a wider audience.

Our paper, “Kgent: Kernel Extensions Large Language Model Agent,” was recently presented at eBPF ‘24: Proceedings of the ACM SIGCOMM 2024 Workshop on eBPF and Kernel Extensions. Let’s dive into what makes Kgent a game-changer for kernel programming.

The Key Idea Behind Kgent

Kgent simplifies the traditionally complex process of writing eBPF programs. By translating user prompts in natural language to eBPF code, it eliminates the need for deep OS kernel knowledge. This tool combines program comprehension, symbolic execution, and feedback loops to ensure the synthesized program is accurate and aligns with the user’s intent.

Highlights

  • Natural Language to eBPF: Kgent can take user prompts in plain English and convert them into functional eBPF programs.
  • Combination of Techniques: It employs a mix of program comprehension, symbolic execution, and feedback loops to ensure high accuracy.
  • Evaluation: Our tests show that Kgent achieves a 2.67x improvement over GPT-4 in producing correct eBPF programs, with a high accuracy rate and minimal false positives.

Potential Use Cases

Kgent can be utilized in various scenarios to facilitate kernel development and management:

  1. System Administrators: Helps junior sys admins create and maintain eBPF programs without needing extensive OS kernel knowledge.
  2. DevOps Personnel: Assists in writing and deploying kernel extensions for monitoring and tracing applications, enhancing system performance and security.
  3. Patch Makers: Simplifies the creation of patches by translating natural language descriptions of issues and fixes into eBPF programs.
  4. Kernel Developers: Speeds up the prototyping and validation of kernel extensions, saving time and reducing errors.
  5. Educational Purposes: Serves as a learning aid for students and new developers to understand eBPF programming through natural language interactions.
  6. Research and Experimentation: Provides a platform for researchers to explore new eBPF applications and test hypotheses without diving into complex coding.
  7. Network Tools Development: Eases the creation of custom network monitoring, security, and performance analysis tools by translating high-level requirements into efficient eBPF programs.

Why Do We Need Kgent Instead of Just Asking GPT?

While large language models (LLMs) like GPT-4 can suggest code, they often recommend incorrect helpers or non-existent APIs—a phenomenon known as hallucination. Given the small and limited set of helpers and kfuncs in eBPF, these issues can be fixed relatively easily. Another common issue is incorrect attach points. In eBPF, programs must attach to specific kernel events, such as kprobes, tracepoints, and perf events. Incorrect attach events can either be rejected by the kernel or, worse, pass the verifier and load incorrectly, leading to wrong results.

The eBPF verifier adds another layer of complexity. For instance, loop code often cannot pass the verifier due to safety checks. Although the verifier prevents harmful code, it cannot always prevent incorrect code. For example, when asked to write a program to trace TCP connect events, GPT-4’s generated code failed to read the port number correctly and didn’t consider IPv6.

To give the LLM new knowledge such as eBPF, common approaches include fine-tuning and Retrieval-Augmented Generation (RAG). However, publicly available eBPF examples are scarce, and eBPF capabilities change across kernel versions. RAG is a promising solution, as it allows the model to retrieve the most up-to-date and relevant information from external sources, combining language-model generation with retrieval of relevant information from a vector database.

The LLM Agent Framework

To address these issues, we built an LLM agent with three core components: planning, tools, and memory.


Plan Component
The agent follows a predefined workflow:

  1. Prompter: Retrieves related examples, attach points, and specs based on user input.
  2. Synthesis Engine: Creates eBPF candidates from the prompt.
  3. Comprehension Engine: Annotates the eBPF candidate, adding necessary assumptions and assertions for verification.
  4. Symbolic Verifier: Verifies the candidate’s behavior. If invalid, the process iterates until a valid program is produced, forming a feedback loop. In some cases, the agent can also use ReAct mode for decision-making.


Tools Component
The agent can use various tools like clang to compile eBPF programs, Seahorn for verification, and bpftrace for obtaining attach points and running eBPF programs.

Memory Component
The agent uses short-term in-context memory to remember past actions, errors, and decisions, ensuring the feedback loop is successful.

Example Workflow
Let’s take a simple bpftrace program as an example. Suppose a user requests: “Trace tcp_connect events for both IPv4 and IPv6 connection attempts, and display the source and destination IP addresses.” The agent forms a prompt based on a predefined template and asks the LLM to generate the program. We use in-context learning and few-shot techniques, including examples in the template’s context. The examples vector database contains samples from BCC, bpftrace, and our own collection. The agent searches for similar examples based on user input and includes these examples in the prompt.

We also built a pipeline to generate specifications and descriptions for each hook point and helper function from the kernel source code. For instance, when building the spec database, we generate the spec for the tcp_connect_init function in the kernel using the LLM. During the synthesis step, the agent can search for related function specs with user input in the vector database.

Limitations and Future Work

While Kgent is a significant step forward, it has some limitations. Currently, our implementation focuses on small programs under 100 lines due to the LLM’s context window limit. Additionally, our eBPF program dataset is relatively small, which restricts the tool’s ability to handle more complex and varied tasks. Right now, Kgent’s use cases are mostly limited to simple trace programs and network functions.

We are exploring ways to extend Kgent’s capabilities. For example, tools like ChatGPT can handle many tasks using a Python code interpreter. This raises exciting possibilities: can we automate larger tasks like auto-monitoring and auto-performance tuning? Could an LLM help analyze results from different tools and even find these tools automatically? Could it play a role in rapidly developing solutions for urgent problems?

To tackle these challenges, we are considering splitting larger tasks into smaller, manageable parts, similar to the approach used by AutoGPT. This would allow the LLM to plan the overall structure of the program, generate each component, and then merge them together. Additionally, involving users in the iteration process could provide interactive feedback, improving the quality of the generated programs.

We also acknowledge that writing correct Hoare contracts is challenging for LLMs, and current verification methods may not cover all behaviors of the generated eBPF programs. To improve this, we need better background descriptions and more robust Hoare expressions. Incorporating more software engineering practices, such as counterexample generation and test-driven development, could help ensure comprehensive verification.

Another critical concern is security. Since eBPF runs in the kernel, any flaws could lead to significant issues. We plan to involve users more in the review process to mitigate these risks and ensure the safety of the generated programs.

Conclusion

Kgent is revolutionizing the way we approach kernel programming by making eBPF program creation accessible to a broader audience. By translating natural language into functional eBPF code, it opens up kernel extension development to system administrators, DevOps personnel, patch makers, and more. Our paper, presented at eBPF ‘24, highlights the potential of this tool to democratize kernel programming and foster innovation.

We invite you to explore Kgent and see how it can transform your approach to kernel development. For more details, check out our eBPF’24 paper and visit our GitHub repository. For additional details, refer to the earlier Arxiv version: KEN: Kernel Extensions using Natural Language. For a more usable and simplified tool, check out GPTtrace. You can also try the GPTtrace simplified web demo here.

By lowering the barrier to entry for writing eBPF programs, Kgent is promoting innovation and enhancing system capabilities, one natural language prompt at a time.

Paper reading: Fast In-Kernel Distributed Transactions with eBPF

The paper titled “DINT: Fast In-Kernel Distributed Transactions with eBPF” by Yang Zhou and colleagues, presented at the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 2024), explores a significant challenge in distributed systems: achieving high-performance distributed transactions while leveraging the benefits of the kernel networking stack. The authors’ work primarily aims to address the performance issues inherent in traditional kernel-based systems without resorting to kernel-bypass techniques such as RDMA and DPDK, which, while fast, often compromise security, isolation, and maintainability. Given the audience’s mixed expertise in distributed systems and cloud computing, this paper is particularly interesting as it bridges the gap between high-performance computing needs and the security concerns in cloud environments.

In this paper, the authors undertake research that is particularly relevant for the domain of distributed systems and cloud computing. The primary goal of their work is to showcase how eBPF can be used to offload frequent-path transaction operations directly into the kernel, thus bypassing many of the overheads traditionally associated with the kernel networking stack. This is intended to achieve performance levels comparable to those of kernel-bypass techniques, while maintaining the inherent advantages of the kernel networking stack, such as security and ease of debugging.

The paper effectively introduces DINT, a novel distributed transaction system that offloads transaction operations into the kernel using eBPF. This is a compelling idea, as it attempts to combine the best of both worlds: the performance typically associated with kernel-bypass techniques and the robustness of kernel-based systems. The overall structure of the paper is logical and methodical, beginning with an overview of the challenges and gradually introducing the DINT system, followed by a detailed evaluation of its performance.

The authors summarize the main contributions of DINT clearly and concisely. The system offloads operations such as locking, key-value management, and logging into the kernel, which minimizes the overhead caused by user-kernel context switching and heavy-weight networking stack traversing. This design is particularly innovative because it leverages eBPF in ways that go beyond traditional monitoring or simple packet filtering, showcasing its potential in performance-critical applications.

However, while the concept of using eBPF for such tasks is innovative, the paper does present some limitations. The most significant of these is that DINT currently only supports UDP, a limitation that the authors themselves acknowledge. This choice simplifies packet processing but at the expense of broader applicability, particularly in environments where reliable transport protocols like TCP are necessary. The authors suggest that future work could explore reliable transport protocols that are friendly to offloading, but this remains a significant limitation of the current work.

Moreover, while the paper discusses the challenges of implementing complex transaction logic within the constraints of eBPF, it does not provide detailed solutions or alternatives for practitioners who might face these difficulties. The constrained programming model of eBPF, which does not support dynamic memory allocation and has limited support for synchronization primitives, is briefly mentioned, but the discussion lacks depth. A more thorough examination of how to overcome these challenges would have strengthened the paper considerably.

In terms of its contribution to the field, the paper makes an important impact. By demonstrating that high-performance distributed transactions can be achieved within the kernel networking stack, the authors challenge the prevailing assumption that kernel-bypass is necessary for such tasks. This is a valuable insight, particularly for cloud computing environments where the security and maintainability of the kernel stack are paramount. The practical implications of this work are considerable, as it opens the door to more secure and maintainable distributed transaction systems that do not compromise on performance.

In conclusion, the paper presents a compelling case for rethinking the role of the kernel networking stack in distributed systems. The research is well-suited to both specialists in networking and distributed systems and non-specialists interested in the broader implications of these technologies. The practical application of DINT extends beyond just performance optimization; it offers a useful solution for cloud computing environments where both security and high throughput are non-negotiable. For industry practitioners, DINT could serve as a blueprint for developing systems that do not sacrifice one for the other, potentially influencing future developments in both enterprise and academic settings. Future work that addresses the limitations identified in this paper could further enhance the applicability and impact of DINT, making it a valuable reference for researchers and practitioners alike.

Zhou, Y., Xiang, X., Kiley, M., Dharanipragada, S., & Yu, M. (2024). DINT: Fast In-Kernel Distributed Transactions with eBPF. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24) (pp. 401-417).

Local llama3 inference with llama.cpp on CPU

My 4090 has been sold, so this is all I have in the old computer:

$ sudo nvidia-smi
Sun Sep 1 06:13:48 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 ... Off | 00000000:01:00.0 Off | N/A |
| N/A 49C P8 10W / 40W | 6MiB / 6144MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+

It’s only 6GB VRAM…

Anyway, I have 32GB memory, let’s try CPU inference with llama3!

1. download model

Let’s download the model from https://huggingface.co/lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF?show_file_info=Meta-Llama-3.1-8B-Instruct-Q3_K_L.gguf

It’s Meta-Llama-3.1-8B with 3-bit quantization (Q3_K_L).

2. install llama.cpp

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

This will build the CPU version of llama.cpp. See https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md for more details.

3. run inference

$ ./llama.cpp/llama-simple -m Downloads/Meta-Llama-3.1-8B-Instruct-Q3_K_L.gguf -p "Can you write me a poem about santa cruz?" -n 300
llama_model_loader: loaded meta data with 33 key-value pairs and 292 tensors from Downloads/Meta-Llama-3.1-8B-Instruct-Q3_K_L.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
llama_model_loader: - kv 5: general.size_label str = 8B
llama_model_loader: - kv 6: general.license str = llama3.1
llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 9: llama.block_count u32 = 32
llama_model_loader: - kv 10: llama.context_length u32 = 131072
llama_model_loader: - kv 11: llama.embedding_length u32 = 4096
llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: general.file_type u32 = 13
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - kv 29: quantize.imatrix.file str = /models_out/Meta-Llama-3.1-8B-Instruc...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 224
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 125
llama_model_loader: - type f32: 66 tensors
llama_model_loader: - type q3_K: 129 tensors
llama_model_loader: - type q5_K: 96 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q3_K - Large
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.02 GiB (4.30 BPW)
llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: CPU buffer size = 4114.27 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 131072
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 16384.00 MiB
llama_new_context_with_model: KV self size = 16384.00 MiB, K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 8480.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1

main: n_predict = 300, n_ctx = 131072, n_kv_req = 300

<|begin_of_text|>Can you write me a poem about santa cruz??
Here is a poem about Santa Cruz:
Santa Cruz, a town by the sea
Where redwoods tower, and the ocean's glee
Meets the waves that crash on the shore
A place where wonder waits, and magic's in store

The boardwalk beckons, a colorful sight
Games and treats, a joyful delight
The smell of saltwater taffy fills the air
As laughter and excitement are everywhere

The mountains rise high, a verdant green
Where hikers roam, and nature's secrets are seen
The rivers flow, a winding stream
Where fish and wildlife thrive, and the wild things beam

Santa Cruz, a place of enchantment and play
Where the spirit of adventure comes out to stay
A town that's full of life, and a heart that's true
A place where dreams come alive, and magic shines through.

I hope you enjoy it! Let me know if you have any other requests.

Here is a revised version of the poem, with a few changes to make it more concise and flowing:

Santa Cruz, a town by the sea
Where redwoods tower, and the ocean's glee
Meets the waves that crash on the shore
A place where wonder waits, and magic's in store

The boardwalk's colorful lights shine bright
Games and treats, a joyful delight
Saltwater taffy scents the salty air


main: decoded 289 tokens in 34.22 s, speed: 8.44 t/s

llama_print_timings: load time = 5114.71 ms
llama_print_timings: sample time = 48.04 ms / 290 runs ( 0.17 ms per token, 6036.76 tokens per second)
llama_print_timings: prompt eval time = 536.32 ms / 11 tokens ( 48.76 ms per token, 20.51 tokens per second)
llama_print_timings: eval time = 33864.35 ms / 289 runs ( 117.18 ms per token, 8.53 tokens per second)
llama_print_timings: total time = 39337.08 ms / 300 tokens

Seems nice! The CPU inference is not that slow, and the poem is quite good!

Using `setjmp`, `longjmp`, and Signals to Catch Page Faults and Measure Performance

In C, catching a page fault using signals like SIGSEGV (Segmentation Fault) is a powerful technique, especially when combined with setjmp and longjmp for error recovery. These functions allow us to implement non-local jumps in the program, saving and restoring the state of the execution context.

This approach is particularly useful when a segmentation fault occurs, and you want to avoid a complete program crash. In this blog, I’ll explain how setjmp and longjmp work, how they can be used with signal handling to catch page faults, and I’ll include some benchmarking code to measure the performance of these functions.

How setjmp and longjmp Work

  • setjmp: Saves the current program state, including registers, program counter, and stack pointer, into a jmp_buf buffer. When called, it returns 0.
  • longjmp: Jumps back to the state saved by setjmp. When it jumps back, setjmp will return the value passed to longjmp instead of 0.

The combination of these functions allows a program to handle errors and recover from certain kinds of faults, such as illegal memory access or page faults.

Example Code: Using SIGSEGV to Catch Page Faults

Here’s a basic example of using SIGSEGV to catch a page fault and recover the program by using setjmp and longjmp.

#include <stdio.h>
#include <setjmp.h>
#include <signal.h>

jmp_buf buf;

void sigsegv_handler(int signum) {
    printf("Caught SIGSEGV, returning to safe point\n");
    longjmp(buf, 1); // Jump back to checkpoint
}

int main() {
    signal(SIGSEGV, sigsegv_handler);

    if (setjmp(buf) == 0) {
        // Set a checkpoint
        printf("Setting checkpoint\n");

        // Cause a segmentation fault
        int *p = (int *)0xdeadbeef;
        *p = 42;
    } else {
        // We are back here after the fault
        printf("Recovered from segmentation fault!\n");
    }

    return 0;
}

Explanation:

  • We use signal(SIGSEGV, sigsegv_handler) to register the signal handler.
  • In the signal handler, we use longjmp(buf, 1) to jump back to the saved state.
  • The program deliberately causes a segmentation fault by accessing invalid memory, but instead of crashing, the handler recovers by jumping back to the checkpoint set by setjmp.

Benchmarking the Performance of setjmp and longjmp

To understand the cost of using setjmp and longjmp, we can measure their performance over multiple iterations. Below is a benchmarking example that tests both setjmp/longjmp and setjmp alone.

#include <stdio.h>
#include <setjmp.h>
#include <time.h>
#include <signal.h>

jmp_buf buf;

void benchmark_setjmp(long iterations) {
    struct timespec start, end;
    double elapsed_time;

    // Benchmark setjmp/longjmp
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++) {
        if (setjmp(buf) == 0) {
            longjmp(buf, 1); // Return to checkpoint
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    elapsed_time = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("Benchmarking setjmp/longjmp over %ld iterations, each 1 took %.3f ns\n", iterations, elapsed_time / iterations);

    // Benchmark setjmp alone
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++) {
        setjmp(buf);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    elapsed_time = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("Benchmarking setjmp over %ld iterations, each 1 took %.3f ns\n", iterations, elapsed_time / iterations);
}

int main() {
    long iterations = 1000000;
    benchmark_setjmp(iterations);
    return 0;
}

Performance Results:
After running this code with 1 million iterations, here are the results:

Benchmarking setjmp/longjmp over 1000000 iterations, each 1 took 12.293 ns
Benchmarking setjmp over 1000000 iterations, each 1 took 4.478 ns

This shows that while longjmp adds overhead, the performance is still acceptable for many use cases where non-local jumps are required for error handling.

Practical Considerations

Using setjmp and longjmp comes with caveats:

  1. Signal Safety: These functions must be used carefully within signal handlers. POSIX only guarantees well-defined behavior for jumping out of a signal handler under narrow conditions, and non-local jumps can complicate the flow of the program and leave resources such as locks or heap allocations in an inconsistent state.
  2. Portability: Behavior can vary between platforms. On POSIX systems, sigsetjmp and siglongjmp are preferable because they give explicit control over whether the signal mask is saved and restored.

Conclusion

By combining setjmp, longjmp, and signals, you can catch and recover from page faults in your C programs. While setjmp/longjmp may add some overhead, they provide a powerful mechanism for handling errors gracefully without crashing the program. If you need to benchmark or optimize this approach, the above code can be a great starting point to understand the performance trade-offs.

Implementing an Inline Hook in C in 5 minutes

One of the fascinating aspects of programming comes when we try to alter the behavior of a program while it is running.

In this tutorial, we shed light on one method that makes this possible: an “Inline Hook”. We will delve into how you can manipulate the execution flow of a program in the C programming language. By implementing an inline hook, we divert the program’s execution flow to our own function and then return it to the normal flow.

What is an Inline Hook?

An Inline hook is a technique that inserts a piece of code into a running program, altering its control flow. In practice, this is achieved by replacing the first few instructions of a function with a jump to our inserted code (usually another function), which upon completion will jump back, continuing the execution of the original function. Frida is a popular tool that uses this technique to inject code into a running process. It is used for dynamic instrumentation, debugging, and reverse engineering.

In our userspace eBPF runtime bpftime (https://github.com/eunomia-bpf/bpftime), we use inline hooking to implement the uprobe feature. bpftime is a userspace eBPF runtime that allows existing eBPF applications to operate in unprivileged userspace using the same libraries and toolchains. It offers Uprobe and Syscall tracepoints for eBPF, with significant performance improvements over kernel uprobe and without requiring manual code instrumentation or process restarts.

Inline Hook Implementation

The Inline hook implementation primarily follows five crucial steps:

  1. Identifying the memory address of the function to be hooked.
  2. Backing up the initial instructions of the target function that will be overwritten.
  3. Writing a jump instruction at the beginning of the target function in the hooked process’s memory.
  4. Creating the hook function, which will replace the original one.
  5. Altering the memory permissions to enable the changes, and restoring them once modifications are complete.

On a side note, Inline Hooking could be limited by modern compiler optimization and certain memory protection procedures such as Data Execution Prevention (DEP) and Address Space Layout Randomization (ASLR).

Inline Hooking Example: how to use it

To make this more digestible, we will use an example scenario. In this example, we will hook a simple function my_function. This code is in main.c and it initially prints “Hello, world!”. But after applying our hook, it prints “Hello from hook!” instead.

// This is the original function to hook.
void my_function()
{
    printf("Hello, world!\n");
}

Next, we create a hooking function my_hook_function in hook.c. This function will replace my_function and is designed to print “Hello from hook!”

// This is the hook function.
void my_hook_function()
{
    printf("Hello from hook!\n");
}

The inline_hook function is the most critical part of our application. It uses mprotect to change the memory permissions of the target function, making it writable. It then replaces the first few instructions of my_function with a jump instruction to my_hook_function. The original bytes are saved for future restoration.

On the main function, we start by calling my_function, enabling the inline_hook, calling my_function again (which now executes my_hook_function), then removing the hook and calling my_function another time to see that it prints the original “Hello, world!” string.

int main()
{
    my_function();

    // Enabling the hook.
    inline_hook(my_function, my_hook_function);

    // Now calling the function will actually call the hook function.
    my_function();

    // Removing the hook.
    remove_hook(my_function);

    // Now calling the function will call the original function.
    my_function();

    return 0;
}

After compiling and running the main function, we can observe the output.

$ make
$ ./maps
Hello, world!
Hello from hook!
Hello, world!

You can find the complete example in the following repository: https://github.com/eunomia-bpf/inline-hook-demo

Implementing an Inline Hook

Let’s take a look at the implementation of the inline_hook function. This is a very basic implementation that works on x86_64, ARM64, and ARM32. It is not a complete implementation, but it should be enough to get you started.

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
#include <assert.h>
#include <string.h>
#include <stdlib.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(_M_X64)
#define SIZE_ORIG_BYTES 16
static void inline_hook_replace_inst(void *orig_func, void *hook_func) {
    // Write a JMP rel32 instruction at the start of the original function.
    // The displacement is a 4-byte signed offset relative to the end of
    // the 5-byte JMP instruction.
    unsigned char *p = (unsigned char *)orig_func;
    int32_t offset = (int32_t)((unsigned char *)hook_func -
                               (unsigned char *)orig_func - 5);
    p[0] = 0xE9; // JMP rel32 opcode
    memcpy(p + 1, &offset, sizeof(offset));
}
#elif defined(__aarch64__) || defined(_M_ARM64)
#define SIZE_ORIG_BYTES 32
static void inline_hook_replace_inst(void *orig_func, void *hook_func) {
    int offset = ((intptr_t)hook_func - (intptr_t)orig_func) / 4;
    if (offset < -0x2000000 || offset > 0x1ffffff) {
        printf("Offset %d out of range!\n", offset);
        exit(1);
    }
    uint32_t branch_instruction = 0x14000000 | (offset & 0x03ffffff);
    *((uint32_t *)orig_func) = branch_instruction;
}
#elif defined(__arm__) || defined(_M_ARM)
#define SIZE_ORIG_BYTES 20
static void inline_hook_replace_inst(void *orig_func, void *hook_func) {
    // Construct a branch instruction to the hook function.
    // The ARM branch encoding is 0xEA000000 | ((<offset> / 4) & 0x00FFFFFF).
    // The offset is divided by 4 because instructions are 4 bytes each,
    // and 8 is subtracted because the PC runs two instructions ahead.
    int offset = ((intptr_t)hook_func - (intptr_t)orig_func - 8) / 4;
    int branch_instruction = 0xEA000000 | (offset & 0x00FFFFFF);

    // Write the branch instruction to the start of the original function.
    *(int *)orig_func = branch_instruction;
}
#else
#error "Unsupported architecture"
#endif

void *get_page_addr(void *addr)
{
    return (void *)((uintptr_t)addr & ~(getpagesize() - 1));
}

unsigned char orig_bytes[SIZE_ORIG_BYTES];

void inline_hook(void *orig_func, void *hook_func)
{
    // Store the original bytes of the function.
    memcpy(orig_bytes, orig_func, SIZE_ORIG_BYTES);

    // Make the memory page writable.
    mprotect(get_page_addr(orig_func), getpagesize(),
             PROT_READ | PROT_WRITE | PROT_EXEC);

    inline_hook_replace_inst(orig_func, hook_func);

    // Restore the page to read + execute only.
    mprotect(get_page_addr(orig_func), getpagesize(),
             PROT_READ | PROT_EXEC);
}

void remove_hook(void *orig_func)
{
    // Make the memory page writable.
    mprotect(get_page_addr(orig_func), getpagesize(),
             PROT_READ | PROT_WRITE | PROT_EXEC);

    // Restore the original bytes of the function.
    memcpy(orig_func, orig_bytes, SIZE_ORIG_BYTES);

    // Restore the page to read + execute only.
    mprotect(get_page_addr(orig_func), getpagesize(),
             PROT_READ | PROT_EXEC);
}

We start by saving the original bytes of the target function in the orig_bytes array. We then make the memory page writable using mprotect. Next, we replace the first few instructions of the target function with a jump instruction to the hook function. Finally, we restore the memory page’s permissions to their original state. get_page_addr computes the page-aligned address. inline_hook sets up the hook by storing original bytes and modifying instructions. remove_hook reverses the changes.

The hook installation differs based on the processor architecture.

On x86_64, we replace the beginning of the target function with a JMP instruction that redirects to our hook function.

#define SIZE_ORIG_BYTES 16
static void inline_hook_replace_inst(void *orig_func, void *hook_func) {
    // Write a JMP rel32 instruction at the start of the original function.
    unsigned char *p = (unsigned char *)orig_func;
    int32_t offset = (int32_t)((unsigned char *)hook_func -
                               (unsigned char *)orig_func - 5);
    p[0] = 0xE9; // JMP rel32 opcode
    memcpy(p + 1, &offset, sizeof(offset));
}

Note that in ARM32, the Program Counter (PC) is usually 2 instructions ahead, which is why we subtract 8 (2 instructions * 4 bytes/instruction) when calculating the offset. This might differ between different ARM versions or modes (Thumb vs ARM, etc.) so please adjust accordingly to your target’s specifics.

Also, SIZE_ORIG_BYTES is increased from 16 to 20 on ARM32: instructions are 4 bytes each, and backing up 5 instructions (20 bytes) leaves room for a longer jump sequence, since the branch instruction uses a relative offset and you cannot be sure how far away your hook function will be. If your function and hook are within 32MB of each other, replacing just the first 4 bytes with a branch is enough and the rest need not be touched.

6
#define SIZE_ORIG_BYTES 20
static void inline_hook_replace_inst(void *orig_func, void *hook_func) {
    int offset = ((intptr_t)hook_func - (intptr_t)orig_func - 8) / 4;
    int branch_instruction = 0xEA000000 | (offset & 0x00FFFFFF);
    *(int *)orig_func = branch_instruction;
}

Like ARM32, ARM64 uses fixed-length 4-byte instructions, but its A64 encoding is different. For example, the branch instruction is encoded differently, and because of the larger address space you may need a trampoline for offsets that cannot be reached by a single branch instruction. The trampoline should be placed close to the original function so it can be reached by a branch; from there it loads the full 64-bit address of the hook function.

#define SIZE_ORIG_BYTES 32
static void inline_hook_replace_inst(void *orig_func, void *hook_func) {
    int offset = ((intptr_t)hook_func - (intptr_t)orig_func) / 4;
    // Check if the offset is within the allowable range for a branch instruction.
    if (offset < -0x2000000 || offset > 0x1ffffff) {
        printf("Offset %d out of range!\n", offset);
        exit(1);
    }
    // Construct and write the branch instruction.
    uint32_t branch_instruction = 0x14000000 | (offset & 0x03ffffff);
    *((uint32_t *)orig_func) = branch_instruction;
}

You can find the complete example in the following repository: https://github.com/eunomia-bpf/inline-hook-demo

Limitations

Understanding the Limitations of Inline Hooking

Inline hooking, while a powerful technique for intercepting and modifying function calls in software, has several inherent limitations, particularly in the context of modern operating systems and programming environments. Here, we examine these limitations in more detail to clarify the challenges and implications involved. Note that the demonstration code above is deliberately simple and is not suitable for production use.

1. Operating System Security Mechanisms

Modern operating systems deploy a variety of security mechanisms to prevent malicious or unintended modifications to executing code:

  • Data Execution Prevention (DEP): DEP is designed to prevent code from being run from data segments of a process, such as the stack or heap. Inline hooking often requires executing code that has been written to these segments, which can be blocked by DEP.

  • Address Space Layout Randomization (ASLR): ASLR randomizes the memory addresses used by system and application files. This complicates the process of inline hooking since the exact address of the target function may change every time the application or system is restarted.

  • Code Signing and Integrity Checks: Some operating systems and applications implement code signing and integrity checks. These mechanisms can detect modifications to code, including inline hooks, and may prevent the modified application from executing or flag it as malicious.

2. Compiler Optimizations

Modern compilers employ various optimizations that can interfere with inline hooking:

  • Function Inlining: Compilers may inline functions, which means the function’s code is directly inserted into each place it is called, rather than being kept as a separate function. This can eliminate the consistent function entry point that inline hooks rely on.
  • Instruction Reordering and Optimizations: Compilers might reorder instructions or optimize the function’s structure in a way that doesn’t align well with the assumptions made during the setup of an inline hook, potentially leading to crashes or undefined behavior.

3. Multi-threading and Concurrent Execution

  • Thread Safety: In multi-threaded applications, ensuring that the hook is correctly applied without interrupting currently executing threads can be challenging. There’s a risk of race conditions where one thread might be executing the function being hooked while another is applying the hook.
  • Re-entrancy Issues: If the hooked function or the hook itself is re-entrant (can be called simultaneously from multiple threads), it complicates the inline hooking process. Care must be taken to handle such cases properly to avoid deadlocks or inconsistent program states.

4. Hardware and Architecture Specifics

  • Instruction Set Differences: Different processors have different instruction sets and execution models. For instance, ARM and x86 processors have significantly different ways of handling instructions, making the process of writing a universal inline hook more complex.
  • Instruction Length Variations: The length of instructions can vary (especially in variable-length instruction sets like x86), making it difficult to determine how many bytes to overwrite safely without affecting subsequent instructions.

Wrapping Up

Understanding inline hooking can substantially aid in areas such as software security, testing, and debugging. It provides an avenue to alter and control program behavior on-the-fly. While it is powerful, it also comes with its drawbacks, which need to be handled with care.
In conclusion, while inline hooks are powerful tools, they should be used with caution, understanding, and a good knowledge of system architecture.

I hope you enjoyed the journey exploring Inline Hooks. Happy coding!

bpftime: Extending eBPF from Kernel to User Space

Yu Sheng Zheng, Yu Tong

eBPF is a revolutionary technology that originated in the Linux kernel, enabling sandboxed programs to run within the operating system’s kernel. It is used to safely and efficiently extend the kernel’s capabilities without altering its source code or loading kernel modules.

In this blog, we are excited to introduce a new open-source user-space eBPF runtime: https://github.com/eunomia-bpf/bpftime. bpftime further expands the capabilities of eBPF, allowing existing eBPF tools and applications, such as BCC tools, bpftrace, Deepflow, etc., to run in non-privileged user space without any code modifications, while using the same libraries and toolchains as kernel eBPF.

bpftime not only provides dynamic tracing or extension mechanisms like Uprobe and system call tracepoints, but also offers an order of magnitude performance improvement over kernel Uprobe. Moreover, like kernel eBPF, it requires no manual code instrumentation or process restarts. bpftime supports inter-process eBPF maps through user-space shared memory, while being compatible with kernel eBPF maps, enabling seamless operations with kernel eBPF infrastructure. Additionally, it includes high-performance LLVM JIT/AOT compilers for various architectures, as well as a lightweight JIT and interpreter for x86. Through performance data and real-world examples, we will demonstrate how bpftime can be effective in the real world and provide insights into its future development. We hope bpftime will bring unprecedented performance and flexibility to system monitoring, analysis, and extension. We also introduced the design and implementation of bpftime at the Linux Plumbers Conference 2023[2].

eBPF: System Extension from Kernel to User Space

eBPF (extended Berkeley Packet Filter) has evolved from a simple network packet filtering tool into a versatile system-level extension technology. Since the inception of BPF in the 1990s, eBPF has significantly enhanced its functionality through an expanded instruction set and direct interaction with kernel data structures. After joining the Linux kernel in 2014, eBPF became a powerful bytecode engine, widely used in performance analysis, security policies, and other areas. With the growing complexity of computing environments, eBPF’s real-time data collection and analysis capabilities have become crucial in modern computing, especially in traffic control, load balancing, and security policies.

Although eBPF was initially designed for the kernel, its tremendous potential in user space, coupled with the kernel’s GPL license restrictions, led to the development of early user-space eBPF runtimes like ubpf[3] and rbpf[4]. These runtimes allowed developers to execute eBPF bytecode outside the kernel, breaking free from GPL license restrictions and offering a more intuitive and convenient debugging environment. However, writing programs for ubpf and rbpf requires a specific toolchain that is not fully kernel-compatible, and they offered only limited, single-threaded hash map implementations, making it difficult to run real-world eBPF programs. Additionally, ubpf and rbpf are essentially eBPF bytecode virtual machines: practical use still requires glue code to compile and link them with other user-space programs, and they offer no dynamic tracing functionality.

In practice, user-space eBPF has been explored and applied in fields like network processing, blockchain, and security. For example, Oko and DPDK eBPF support demonstrate the flexibility and performance advantages of eBPF in network data processing. The Solana project utilized eBPF to implement a JIT compiler, supporting the execution of blockchain smart contracts. The eBPF for Windows project extended eBPF functionality beyond Linux, showcasing its potential for cross-platform compatibility. These applications not only demonstrate eBPF’s powerful system extension capabilities but also highlight its significance and wide applicability in the modern computing domain. For further discussion, refer to our previous blog: https://eunomia.dev/blogs/userspace-ebpf/.

Why We Need bpftime

Due to the core role of operating system kernels and the high demands for stability and security, innovation and evolution in operating system kernels tend to be slow. This is the original intention behind eBPF: to extend the kernel’s functionality without changing its source code, thereby bringing more innovative application scenarios[5]. This is also the impact we hope bpftime will have: exploring more development possibilities with the safety and ecosystem brought by eBPF, without changing user-space program code, and compensating for the potential shortcomings of current kernel-space eBPF and other user-space extension solutions.

Limitations of Kernel-Space Implementation of User-Space Tracing (Uprobe) and System Call Tracing

Uprobe is a powerful user-level dynamic tracing mechanism that allows developers to perform dynamic instrumentation in user-space programs, such as at function entry points, specific code offsets, and function return points. This technology is implemented by setting breakpoints at designated locations, such as using the int3 instruction on x86 architecture. When the execution flow reaches this point, the program traps into the kernel, triggering an event, then executing a predefined probe function, and finally returning to user-space to continue execution. This dynamic tracing method can trace and instrument all processes executing a specific file across the system, allowing for the collection of critical data for performance analysis and fault diagnosis without modifying code, recompiling, or restarting processes.

However, since the eBPF virtual machine executes in kernel mode, the current Uprobe implementation introduces two context switches in the kernel, causing significant performance overhead, which especially impacts latency-sensitive applications. As shown in the diagram, Uprobe’s overhead is nearly ten times that of Kprobe[5]. On the other hand, Uprobe is currently limited to tracing: it cannot modify the execution flow or return values of user-space functions, which rules out use cases such as code extension, hot patching, and fault injection. Despite this, Uprobe is still widely used in production environments for non-intrusive tracing of user-space functionality, such as tracing user-space protocols like SSL/TLS and HTTP/2, monitoring memory allocation and leaks, analyzing garbage collection and language runtimes, and tracking the creation and recycling of coroutines, among other scenarios.

Uprobe vs Kprobe

For system call tracepoints, since they are globally visible, additional filtering is required for specific process tracing, such as filtering based on pid, cgroup, etc., in eBPF[6], which also brings some additional overhead to other processes that do not need to be traced.

Limitations of Kernel-Space eBPF in Terms of Security and Extensibility

eBPF running in kernel mode has its limitations in terms of security and extensibility. On one hand, eBPF programs need to run in kernel mode, meaning they require root privileges, thereby increasing the attack surface and potential risks, such as container escape. Moreover, vulnerabilities in eBPF itself can lead to security issues at the kernel level. On the other hand, while the verifier restricts eBPF programs to ensure safety, this also limits the functionality expansion of eBPF; any new feature or improvement requires modifications to the kernel code. These limitations not only increase the maintenance difficulty of the system but also reduce the flexibility and universality of eBPF.

For kernels without eBPF support (e.g., older systems) or applications in non-privileged containers, user-space eBPF runtimes are a viable alternative, allowing the execution of eBPF programs for tracing, analysis, and extension operations without kernel eBPF support.

Shortcomings of Other User-Space Extension Solutions

Currently, there are other user-space tracing and extension solutions, such as gdb and other tools that use the ptrace mechanism for process tracing and analysis, Wasm, Lua virtual machines that can be used as plugin runtimes, and binary instrumentation tools like Frida for dynamic tracing in user space. However, these solutions have their own limitations.

  • High Performance Overhead: Traditional tools like gdb use the ptrace mechanism for process tracing. Although they are powerful, they introduce significant performance overhead when analyzing and interacting with other processes. This method frequently pauses and resumes the target process, leading to reduced efficiency. Additionally, ptrace limits the number of processes that can be traced simultaneously in the system, making large-scale distributed tracing infeasible. WebAssembly (Wasm) sandboxes, while offering good flexibility and cross-language support, require strict validation and runtime checks when executing external libraries or procedures, potentially introducing performance losses. In contrast, eBPF offers a more performance-centric strategy, using static analysis and a verifier to ensure safe execution of code on the host without additional runtime overhead. For bpftime, since it embeds the eBPF virtual machine in the function call context of the traced process without extra context switches, it has lower performance overhead.
  • Security Issues: Binary instrumentation tools like Frida provide dynamic tracing capabilities, but this can introduce security issues. The instrumentation code runs in the same process context and can be maliciously exploited. Additionally, code defects in the tracing tools or scripts themselves may cause the traced program to crash, such as accessing incorrect addresses or pointers. In contrast, eBPF can ensure the safety of code through its verifier.
  • Insufficient Visibility: Additionally, for other user-space tracing solutions, these tools typically only offer visibility into single processes and cannot provide system-wide insights. They struggle to capture a global view of kernel-level events or cross-process communications, limiting their analytical capabilities in complex systems. This is why eBPF and other solutions mainly perform tracing in kernel space, allowing for correlated analysis of kernel and user-space events, such as linking layer 7 network packets with kernel-level network events, or associating user-space function call behavior with kernel-level system calls, thus providing more comprehensive analytical capabilities. For bpftime, it can be more than just a user-space virtual machine solution. User-space eBPF can work in conjunction with kernel-space eBPF infrastructure to achieve boundary-crossing analysis and extension capabilities.

For existing other user-space eBPF runtimes, as mentioned earlier, they lack dynamic tracing or extension capabilities, require manual integration, and cannot directly utilize existing eBPF toolchains and applications, which greatly limits their use cases. On the other hand, they cannot work directly with kernel-space eBPF, only offering limited user-space extension capabilities.

bpftime: User-Space eBPF Runtime

User-Space eBPF Runtime Compatible with Existing eBPF Tools and Frameworks

bpftime aims to maintain good compatibility with existing kernel eBPF as a user-space alternative and improvement to kernel eBPF. It also seeks to maximize the use of the rich ecosystem and tools of existing eBPF. For example, bpftime allows the direct use of unmodified bpftrace tools to execute eBPF scripts in user space, tracing system calls or user-space functions:

bpftrace

At the same time, it can run user-space versions of BCC/libbpf-tools such as bashreadline, funclatency, gethostlatency, mountsnoop, opensnoop, sigsnoop, statsnoop, syscount, etc[7]. bpftime constructs eBPF map data structures in user-space shared memory, enabling the analysis and statistics of multiple processes, and supports reporting data to tracing tools through ring buffer, perf buffer, and other means.

bpftime also provides eBPF infrastructure compatible with the kernel in user-space. It can run without needing kernel eBPF and supports some of the kernel’s eBPF maps, helpers, dynamic tracing mechanisms, and almost all eBPF instruction sets:

bpftime

From a security perspective, bpftime provides an eBPF verifier to ensure the safety of eBPF bytecode, preventing malicious code injection or damaging the traced process. bpftime can use the kernel’s eBPF verifier or an independent user-space eBPF verifier as an alternative for environments without access to kernel eBPF.

High-Performance Uprobe and System Call Tracing

bpftime supports Uprobe and system call tracing by embedding eBPF programs into the function call context of the traced process through binary rewriting, thus achieving dynamic tracing and extension. This method not only avoids context switching between kernel and user spaces but also collects key data for performance analysis and fault diagnosis without modifying code, recompiling, or restarting processes. Compared to kernel Uprobe, bpftime’s Uprobe implementation is more performant and offers more functionalities, such as modifying function return values or altering function execution flows, enabling code extension, hot patching, and defect injection. The performance of user-space Uprobe implemented by bpftime can be an order of magnitude higher than that of kernel Uprobe:

Probe/Tracepoint Type      Kernel (ns)      Userspace (ns)
Uprobe                     3224.172760      314.569110
Uretprobe                  3996.799580      381.270270
Syscall Trace              151.82801        232.57691

Using dynamic library injection implemented via ptrace and technologies like LD_PRELOAD, bpftime’s eBPF runtime supports tracing during program startup and also allows mounting eBPF probes directly onto multiple running processes. We conducted a test where a probe monitoring the malloc function in libc was loaded using bpftime, and the loading latency was measured. The results showed that bpftime caused the running process to pause for about 48 milliseconds during loading. For comparison, we used the LD_PRELOAD method to load the same extension before the process started and observed a loading latency of 30 milliseconds.

We used the sslsniff tool[8] to trace and analyze SSL encrypted traffic of Nginx in bpftime’s user-space Uprobe and compared it with the kernel Uprobe approach, observing a significant performance improvement:

sslsniff

For modern eBPF observability tools, it may be necessary to collect and analyze the same event in both kernel and user-space functions. For instance, an HTTP request might require analyzing both kernel-level network events and user-space function calls to obtain a complete request chain. bpftime’s Uprobe implementation can work in conjunction with kernel eBPF kprobes, enabling this kind of cross-boundary analysis capability. Implementing and improving other dynamic tracing mechanisms are also part of our plan.

New eBPF JIT and AOT Compilers

bpftime includes a new LLVM-based eBPF JIT compiler that compiles eBPF bytecode into native machine code at runtime, thereby improving the execution efficiency of eBPF programs. Compared with the JIT compilers of other user-space eBPF runtimes such as ubpf and rbpf, as well as Wasm runtimes, the LLVM JIT compiler offers better performance, approaching the efficiency of native code execution. It also provides better cross-platform support, for example supporting architectures like RISC-V. We conducted a simple performance comparison and analysis[9]:

jit

In addition to JIT, bpftime also includes an AOT compiler, which allows eBPF bytecode to be pre-compiled into machine code files for specific architectures after verification. This can be particularly useful for deployment and use in embedded systems, significantly reducing the time for compilation at startup.

More Exploratory Use Cases and Future Developments

Beyond extending previous Uprobe and system call tracepoints, bpftime can also be used for other exploratory use cases, such as:

  • Fault Injection: Using the kernel-compatible bpf_override_return() helper[10], bpftime can modify the Syscall return values of processes, block specific Syscalls, or modify and replace specific function calls in certain types of eBPF programs. This enables fault injection capabilities. Kernel Uprobe itself does not support this functionality, and kernel’s bpf_override_return also requires enabling the CONFIG_BPF_KPROBE_OVERRIDE option at compile time for security reasons, which is not enabled by default in mainstream Linux distributions.
  • Hot Patching: As mentioned earlier, using the bpf_override_return helper mechanism, user-space eBPF can also replace or filter certain function calls, thus enabling hot patching capabilities.
  • eBPF-based Nginx Module: bpftime can be used as an Nginx Module to implement extensions in Nginx through eBPF, such as dynamic routing, load balancing, caching, security policies, etc., in Nginx.
  • Enhancing Fuse: There have been attempts to optimize Fuse using eBPF in the kernel. bpftime could also be used as part of a user-space filesystem, modifying the behavior of system calls in the corresponding user-space process through eBPF, enabling filesystem extensions such as dynamic routing, caching, security policies, etc., in user-space filesystems.

bpftime is currently an early-stage exploratory project. We are actively exploring more potential application scenarios, such as implementing eBPF-based network packet filtering in user space, optimizing packet forwarding performance for service meshes, bypassing the kernel’s network protocol stack, and more. We look forward to more ideas and suggestions from everyone, or working together to implement these functions. In the future, we also hope that bpftime can offer better compatibility support for the kernel and, with the help of LLVM’s JIT compiler, provide better performance optimization guidance and a more convenient testing and debugging experience.

Conclusion

bpftime opens up new possibilities for eBPF applications in user space and provides new options for extending user-space applications. It allows existing eBPF applications to run in non-privileged user space using the same libraries and toolchains, and offers tracing mechanisms like Uprobe and Syscall for user-space eBPF. Compared to kernel Uprobe, it significantly improves performance and does not require manual code instrumentation or process restarts. The runtime supports inter-process eBPF maps in user-space shared memory, and is also compatible with kernel eBPF maps, allowing seamless operation with the kernel eBPF infrastructure.

bpftime is now open source on GitHub, and everyone is welcome to try it out and provide feedback: https://github.com/eunomia-bpf/bpftime. If you have any suggestions or questions, feel free to raise an issue on GitHub or contact us by email at yunwei356@gmail.com.

References

  1. bpftime Git repo: https://github.com/eunomia-bpf/bpftime
  2. bpftime Linux Plumbers talk: https://lpc.events/event/17/contributions/1639/
  3. ubpf: https://github.com/iovisor/ubpf
  4. rbpf: https://github.com/qmonnet/rbpf
  5. Performance comparison of uprobe and kprobe: https://dl.acm.org/doi/10.1145/3603269.3604823
  6. Capturing Opening Files and Filter with Global Variables: https://eunomia.dev/tutorials/4-opensnoop/
  7. examples: https://github.com/eunomia-bpf/bpftime/tree/master/example
  8. sslsniff, based on the tool of the same name in bcc: https://github.com/eunomia-bpf/bpftime/tree/master/example/sslsniff
  9. bpf benchmark: https://github.com/eunomia-bpf/bpf-benchmark
  10. BPF-based error injection for the kernel: https://lwn.net/Articles/740146/
  11. FUSE BPF: A Stacked Filesystem Extension for FUSE: https://lwn.net/Articles/915717/

Hardening C Programs with Intel Control-flow Enforcement Technology (CET)

Introduction

Intel’s Control-flow Enforcement Technology (CET) is a hardware-level feature designed to improve software security by preventing control-flow hijacking attacks such as return-oriented programming (ROP) and jump-oriented programming (JOP). CET is implemented primarily through two mechanisms: Indirect Branch Tracking (IBT) and the Shadow Stack.

How CET Works

  1. Indirect Branch Tracking (IBT): requires every indirect branch (such as a call through a function pointer) to land on a target marked with an ENDBRANCH instruction. This blocks attempts to forge malicious jumps by tampering with memory.

  2. Shadow Stack: keeps a separate, per-thread copy of every return address. When a function returns, the processor compares the current return address with the one on the shadow stack; if they do not match, the processor blocks the return, defeating ROP attacks.

Example Code

The following example is a simple C program that uses a function pointer normally and then attempts an illegal control transfer.

#include <stdio.h>
#include <stdlib.h>

void safe_function() {
    printf("Safe function called.\n");
}

void unsafe_function() {
    printf("Unsafe function called.\n");
}

int main() {
    // Normal use of a function pointer
    void (*func_ptr)() = safe_function;
    func_ptr();

    // Attempt to tamper with the function pointer:
    // in a CET-enabled environment this is detected and blocked
    func_ptr = (void (*)())((char*)safe_function + 2);
    func_ptr(); // this call should fail under CET

    return 0;
}

In this example, func_ptr initially points to safe_function. The code then sets func_ptr to an offset into safe_function, simulating an illegal control-flow modification. In a CET-enabled environment, this attempt should cause the program to crash or terminate abnormally.

Compiling and Running

Compile the code above with a CET-capable GCC:

gcc -fcf-protection=full -o cet_demo cet_demo.c

Then run the compiled program:

./cet_demo

Testing CET

In a CET-enabled environment, the second func_ptr call (the illegal one) will fail because it tries to execute an address that is not marked with an ENDBRANCH instruction. This is exactly how CET prevents control-flow hijacking attacks.

Conclusion

CET is an important layer in a defense-in-depth strategy, but it is not a silver bullet. Good programming practices and other security measures remain necessary. CET adds a hardware-level protection layer that makes control-flow hijacking attacks considerably harder, improving the overall security of software.

The Secure Path Forward for eBPF runtime: Challenges and Innovations

Yusheng Zheng

Extended Berkeley Packet Filter (eBPF) represents a significant evolution in the way we interact with and extend the capabilities of modern operating systems. As a powerful technology that enables the Linux kernel to run sandboxed programs in response to events, eBPF has become a cornerstone for system observability, networking, and security features.

However, as with any system that interfaces closely with the kernel, the security of eBPF itself is paramount. In this blog, we delve into the often-overlooked aspect of eBPF security, exploring how the mechanisms intended to safeguard eBPF can themselves be fortified. We’ll dissect the role of the eBPF verifier, scrutinize the current access control model, and investigate potential improvements from ongoing research. Moreover, we’ll navigate through the complexities of securing eBPF, addressing open questions and the challenges they pose to system architects and developers alike.

Table of Contents

How eBPF Ensures Security with Verifier

The security framework of eBPF is largely predicated on the robustness of its verifier. This component acts as the gatekeeper, ensuring that only safe and compliant programs are allowed to run within the kernel space.

What the eBPF Verifier Is and What It Does

At its core, the eBPF verifier is a static code analyzer. Its primary function is to vet the BPF program instructions before they are executed. It scrutinizes a copy of the program within the kernel, operating with the following objectives:

  • Ensuring Program Termination

    The verifier uses depth-first search (DFS) algorithms to traverse the program’s control flow graph, which it ensures is a Directed Acyclic Graph (DAG). This is crucial for guaranteeing that the program cannot enter into an infinite loop, thereby ensuring its termination. It meticulously checks for any unbounded loops and malformed or out-of-bounds jumps that could disrupt the normal operation of the kernel or lead to a system hang.

  • Ensuring Memory Safety

    Memory safety is paramount in kernel operations. The verifier checks for potential out-of-bounds memory accesses that could lead to data corruption or security breaches. It also safeguards against use-after-free bugs and object leaks, which are common vulnerabilities that can be exploited. In addition to these, it takes into account hardware vulnerabilities like Spectre, enforcing mitigations to prevent such side-channel attacks.

  • Ensuring Type Safety

    Type safety is another critical aspect that the verifier ensures. By preventing type confusion bugs, it helps maintain the integrity of data within the kernel. The eBPF verifier utilizes BPF Type Format (BTF), which allows it to accurately understand and check the kernel’s complex data structures, ensuring that the program’s operations on these structures are valid and safe.

  • Preventing Hardware Exceptions

    Hardware exceptions, such as division by zero, can cause abrupt program terminations and kernel panics. To prevent this, the verifier includes checks for divisions by unknown scalars, ensuring that instructions are rewritten or handled in a manner consistent with aarch64 specifications, which dictate safe handling of such exceptions.

Through these mechanisms, the eBPF verifier plays a critical role in maintaining the security and stability of the kernel, making it an indispensable component of the eBPF infrastructure. It not only reinforces the system’s defenses but also upholds the integrity of operations that eBPF programs intend to perform, making it a quintessential part of the eBPF ecosystem.

How the eBPF Verifier Works

The eBPF verifier is essentially a sophisticated simulation engine that exhaustively tests every possible execution path of a given eBPF program. This simulation is not a mere theoretical exercise but a stringent enforcement of security and safety policies in kernel operations.

  • Follows control flow graph
    The verifier begins its analysis by constructing and following the control flow graph (CFG) of the eBPF program. It carefully computes the set of possible states for each instruction, considering the BPF register set and stack. Safety checks are then performed depending on the current instruction context.

    One of the critical aspects of this process is register spill/fill tracking for the program’s private BPF stack. This ensures that operations involving the stack do not lead to overflows or underflows, which could corrupt data or provide an attack vector.

  • Back-edges in control flow graph
    To effectively manage loops within the eBPF program, the verifier identifies back-edges in the CFG. Bounded loops are handled by simulating all iterations up to a predefined limit, thus guaranteeing that loops will not lead to indefinite execution.

  • Dealing with potentially large number of states
    The verifier must manage the complexity that comes with the large number of potential states in a program’s execution paths. It employs path pruning logic to compare the current state with prior states, assessing whether the current path is “equivalent” to prior paths and has a safe exit. This reduces the overall number of states that need to be considered.

  • Function-by-function verification for state reduction
    To streamline the verification process, the verifier conducts a function-by-function analysis. This modular approach allows for a reduction in the number of states that need to be analyzed at any given time, thereby improving the efficiency of the verification.

  • On-demand scalar precision (back-)tracking for state reduction
    The verifier uses on-demand scalar precision tracking to reduce the state space further. By back-tracking scalar values when necessary, the verifier can more accurately predict the program’s behavior, optimizing its analysis process.

  • Terminates with rejection upon surpassing “complexity” threshold
    To maintain practical performance, the verifier has a “complexity” threshold. If a program’s analysis surpasses this threshold, the verifier will terminate the process and reject the program. This ensures that only programs that are within the manageable complexity are allowed to execute, balancing security with system performance.

Challenges

Despite its thoroughness, the eBPF verifier faces significant challenges:

  • Attractive target for exploitation when exposed to non-root users
    As the verifier becomes more complex, it becomes an increasingly attractive target for exploitation. The programmability of eBPF, while powerful, also means that if an attacker were to bypass the verifier and gain execution within the OS kernel, the consequences could be severe.

  • Reasoning about verifier correctness is non-trivial
    Ensuring the verifier’s correctness, especially concerning Spectre mitigations, is not a straightforward task. While there is some formal verification in place, it is only partial. Areas such as the Just-In-Time (JIT) compilers and abstract interpretation models are particularly challenging.

  • Occasions where valid programs get rejected
    There is sometimes a disconnect between the optimizations performed by LLVM (the compiler infrastructure used to prepare eBPF programs) and the verifier’s ability to understand these optimizations, leading to valid programs being erroneously rejected.

  • “Stable ABI” for BPF program types
    A “stable ABI” is vital so that BPF programs running in production do not break upon an OS kernel upgrade. However, maintaining this stability while also evolving the verifier and the BPF ecosystem presents its own set of challenges.

  • Performance vs. security considerations
    Finally, the eternal trade-off between performance and security is pronounced in the verification of complex eBPF programs. While the verifier must be efficient to be practical, it also must not compromise on security, as the performance of the programs it is verifying is crucial for modern computing systems.

The eBPF verifier stands as a testament to the ingenuity in modern computing security, navigating the treacherous waters between maximum programmability and maintaining a fortress-like defense at the kernel level.

Other Works to Improve the Verifier

Together, these works signify a robust and multi-faceted research initiative aimed at bolstering the foundations of eBPF verification, ensuring that it remains a secure and performant tool for extending the capabilities of the Linux kernel.

Other references to learn more about the eBPF verifier:

Limitations in eBPF Access Control

Now that leading Linux distributions, such as Ubuntu and SUSE, have disallowed unprivileged usage of eBPF Socket Filter and CGroup programs, the current eBPF access control model supports only a single permission level. This level necessitates the CAP_SYS_ADMIN capability for all features. However, CAP_SYS_ADMIN carries inherent risks, particularly to containers, due to its extensive privileges.

Addressing this, Linux 5.8 introduced a more granular permission system by breaking down eBPF capabilities. Instead of relying solely on CAP_SYS_ADMIN, a new capability, CAP_BPF, is introduced for invoking the bpf syscall. Additionally, installing specific types of eBPF programs demands further capabilities, such as CAP_PERFMON for performance monitoring or CAP_NET_ADMIN for network administration tasks. This structure aims to mitigate certain types of attacks—like altering process memory or eBPF maps—that still require CAP_SYS_ADMIN.

Nevertheless, these segregated capabilities are not bulletproof against all eBPF-based attacks, such as Denial of Service (DoS) and information theft. Attackers may exploit these to craft eBPF-based malware specifically targeting containers. The emergence of eBPF in cloud-native applications exacerbates this threat, as users could inadvertently deploy containers that contain untrusted eBPF programs.

Compounding the issue, the risks associated with eBPF in containerized environments are not entirely understood. Some container services might unintentionally grant eBPF permissions, for reasons such as enabling filesystem mounting functionality. The existing permission model is inadequate in preventing misuse of these potentially harmful eBPF features within containers.

CAP_BPF

Traditionally, almost all BPF actions required CAP_SYS_ADMIN privileges, which also grant broad system access. Over time, there has been a push to separate BPF permissions from these root privileges. As a result, capabilities like CAP_PERFMON and CAP_BPF were introduced to allow more granular control over BPF operations, such as reading kernel memory and loading tracing or networking programs, without needing full system admin rights.

However, CAP_BPF’s scope is also ambiguous, which creates a perception problem. Unlike CAP_SYS_MODULE, which is well defined and used for loading kernel modules, CAP_BPF has no namespace constraints: a program granted it can access all kernel memory rather than being confined to its container. This broad access is problematic because verifier bugs in BPF programs can crash the kernel and are treated as security vulnerabilities, leading to an excessive number of CVEs (Common Vulnerabilities and Exposures) being filed, even for bugs that are already fixed. This response to verifier bugs creates undue alarm and pressure to patch older kernel versions that may not have been updated.

Additionally, some security startups have been criticized for exploiting the fears around BPF’s capabilities to market their products, paradoxically using BPF itself to safeguard against the issues they highlight. This has led to a contradictory narrative where BPF is both demonized and promoted as a solution.

bpf namespace

The current security model requires the CAP_SYS_ADMIN capability for iterating BPF object IDs and converting these IDs to file descriptors (FDs). This is to prevent non-privileged users from accessing BPF programs owned by others, but it also restricts them from inspecting their own BPF objects, posing a challenge in container environments.
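The ID-iteration interface mentioned above can be sketched with raw bpf() syscalls. Without CAP_SYS_ADMIN the kernel simply refuses the walk, which is the restriction the BPF namespace proposal targets. This is an illustrative sketch, not production code: the syscall numbers and the BPF_PROG_GET_NEXT_ID command value are taken from the uapi headers for x86_64 and aarch64 only.

```python
import ctypes
import errno
import os

# bpf() syscall numbers per architecture (illustrative; other
# architectures use different numbers).
SYS_bpf = {"x86_64": 321, "aarch64": 280}.get(os.uname().machine, 321)
BPF_PROG_GET_NEXT_ID = 11  # from include/uapi/linux/bpf.h

class BpfGetId(ctypes.Structure):
    # The relevant slice of union bpf_attr for *_GET_NEXT_ID commands.
    _fields_ = [("start_id", ctypes.c_uint32),
                ("next_id", ctypes.c_uint32),
                ("open_flags", ctypes.c_uint32)]

libc = ctypes.CDLL(None, use_errno=True)

def list_prog_ids():
    """Walk loaded BPF program IDs; returns [] when access is denied."""
    ids, attr = [], BpfGetId(start_id=0)
    while True:
        ret = libc.syscall(SYS_bpf, BPF_PROG_GET_NEXT_ID,
                           ctypes.byref(attr), ctypes.sizeof(attr))
        if ret != 0:
            err = ctypes.get_errno()
            # ENOENT: no more IDs; EPERM: not privileged enough.
            if err in (errno.ENOENT, errno.EPERM, errno.EINVAL, errno.ENOSYS):
                break
            raise OSError(err, os.strerror(err))
        ids.append(attr.next_id)
        attr.start_id = attr.next_id
    return ids

if __name__ == "__main__":
    print("visible BPF program IDs:", list_prog_ids())
```

An unprivileged caller gets an empty list back (EPERM), even for programs it loaded itself, which is the usability gap described above.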

Users can run BPF programs with CAP_BPF and other specific capabilities, yet they lack a generic method to inspect these programs, as tools like bpftool need CAP_SYS_ADMIN. The existing workaround without CAP_SYS_ADMIN, sharing BPF object FDs between processes via SCM_RIGHTS over Unix domain sockets, is considered inconvenient.
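The SCM_RIGHTS workaround looks roughly like the following sketch. A regular file descriptor stands in here for a BPF object FD; the passing mechanism is identical, which is exactly why it works as a workaround and also why it is cumbersome to deploy.

```python
import os
import socket
import tempfile

def send_fd(sock: socket.socket, fd: int) -> None:
    """Hand a file descriptor to the peer via SCM_RIGHTS."""
    socket.send_fds(sock, [b"fd"], [fd])

def recv_fd(sock: socket.socket) -> int:
    """Receive one file descriptor from the peer."""
    _, fds, _, _ = socket.recv_fds(sock, 16, 1)
    return fds[0]

if __name__ == "__main__":
    parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with tempfile.TemporaryFile() as f:
        f.write(b"shared")
        f.flush()
        f.seek(0)
        send_fd(parent, f.fileno())
        fd = recv_fd(child)
        # The receiver reads the data through the passed descriptor,
        # which duplicates the sender's open file description.
        print(os.read(fd, 6))
        os.close(fd)
    parent.close()
    child.close()
```

Every pair of cooperating processes needs a pre-arranged Unix socket and this fd-passing dance, whereas a BPF namespace would let a container inspect its own objects directly.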

To address these limitations, Yafang Shao proposes introducing a BPF namespace. This would allow users to create BPF maps, programs, and links within a specific namespace, isolating these objects from users in different namespaces. However, objects within a BPF namespace would still be visible to the parent namespace, enabling system administrators to maintain oversight.

The BPF namespace is conceptually similar to the PID namespace and is intended to be intuitive. The initial implementation focuses on BPF maps, programs, and links, with plans to extend this to other BPF objects like BTF and bpffs in the future. This could potentially enable container users to trace only the processes within their container without accessing data from other containers, enhancing security and usability in containerized environments.


Unprivileged eBPF

The concept of unprivileged eBPF refers to the ability for non-root users to load eBPF programs into the kernel. This feature is controversial due to its security implications and, as such, is currently turned off by default across all major Linux distributions. The concerns range from hardware vulnerabilities like Spectre to kernel bugs and exploits, all of which malicious eBPF programs can leverage to leak sensitive data or attack the system.

To combat this, mitigations have been put in place for the various Spectre variants (v1, v2, and v4). However, these mitigations come at a cost, often significantly reducing the flexibility and performance of eBPF programs. This trade-off makes the feature unattractive and impractical for many users and use cases.
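Whether a given system permits unprivileged eBPF at all is controlled by the kernel.unprivileged_bpf_disabled sysctl. A small sketch that reads and decodes it (value semantics as documented for the sysctl: 0 allows unprivileged bpf(), 1 disables it irreversibly until reboot, 2 disables it but lets an administrator re-enable it; most distributions default to 2):

```python
# Meanings of /proc/sys/kernel/unprivileged_bpf_disabled values.
MEANINGS = {
    0: "unprivileged bpf() calls are allowed",
    1: "disabled; cannot be re-enabled without a reboot",
    2: "disabled; an administrator may still set it back to 0",
}

def describe(value: int) -> str:
    """Human-readable meaning of the sysctl value."""
    return MEANINGS.get(value, f"unknown value {value}")

def current_setting(path="/proc/sys/kernel/unprivileged_bpf_disabled") -> int:
    """Read the current sysctl value from procfs."""
    with open(path) as f:
        return int(f.read().strip())

if __name__ == "__main__":
    v = current_setting()
    print(f"unprivileged_bpf_disabled = {v}: {describe(v)}")
```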

Trusted Unprivileged BPF

In light of these challenges, a middle ground known as “trusted unprivileged BPF” is being explored. This approach would involve an allowlist system, where specific eBPF programs that have been thoroughly vetted and deemed trustworthy could be loaded by unprivileged users. This vetting process would ensure that only secure, production-ready programs bypass the privilege requirement, maintaining a balance between security and functionality. It’s a step toward enabling more widespread use of eBPF without compromising the system’s integrity.

  • Permissive LSM hooks: rejected upstream, since LSM hooks are meant to enforce further restrictions, not relax them

    This proposal adds new Linux Security Module (LSM) hooks specifically for the BPF subsystem, with the intent of offering more granular control over BPF maps and BTF data objects, both of which are fundamental to the operation of modern BPF applications.

    The primary addition includes two LSM hooks: bpf_map_create_security and bpf_btf_load_security, which provide the ability to override the default permission checks that rely on capabilities like CAP_BPF and CAP_NET_ADMIN. This new mechanism allows for finer control, enabling policies to enforce restrictions or bypass checks for trusted applications, shifting the decision-making to custom LSM policy implementations.

    This approach allows for a safer default by not requiring applications to have BPF-related capabilities, which are typically required to interact with the kernel’s BPF subsystem. Instead, applications can run without such privileges, with only vetted and trusted cases being granted permission to operate as if they had elevated capabilities.

  • BPF token concept to delegate subset of BPF via token fd from trusted privileged daemon

    The BPF token is a new mechanism that allows privileged daemons to delegate a subset of BPF functionality to trusted unprivileged applications. This concept enables containerized BPF applications to operate safely within user namespaces, a feature previously unattainable due to security restrictions with CAP_BPF capabilities. The BPF token is created and managed via kernel APIs, and it can be pinned within the BPF filesystem for controlled access. The latest version of the patch ensures that a BPF token is confined to its creation instance in the BPF filesystem to prevent misuse. This addition to the BPF subsystem facilitates more secure and flexible unprivileged BPF operations.

  • BPF signing as gatekeeper: application vs BPF program (no one-size-fits-all)

    Song Liu has proposed a patch for unprivileged access to BPF functionality through a new device, /dev/bpf. This device controls access via two new ioctl commands that allow users with write permissions to the device to invoke sys_bpf(). These commands toggle the ability of the current task to call sys_bpf(), with the permission state being stored in the task_struct. This permission is also inheritable by new threads created by the task. A new helper function, bpf_capable(), is introduced to check if a task has obtained permission through /dev/bpf. The patch includes updates to documentation and header files.

  • RPC to privileged BPF daemon: Limitations depending on use cases/environment

    The RPC approach (e.g., bpfd) is similar to the BPF token concept, but it uses a privileged daemon to manage BPF programs. This daemon is responsible for loading and unloading BPF programs, managing BPF maps, and verifying programs before loading them. The approach is more flexible than the BPF token, allowing finer-grained control over BPF programs, but it is also more complex, bringing more maintenance burden and a potential single point of failure.


Other possible solutions

There is also research and discussion on improving the security of eBPF. Existing work can be roughly divided into three categories: virtualization, Software Fault Isolation (SFI), and formal methods. Deploying eBPF programs in a sandbox such as WebAssembly, or running them in userspace, is another possible approach.

MOAT: Towards Safe BPF Kernel Extension (Isolation)

The Linux kernel makes considerable use of Berkeley Packet Filter (BPF) to allow user-written BPF applications to execute in the kernel space. BPF employs a verifier to statically check the security of user-supplied BPF code. Recent attacks show that BPF programs can evade security checks and gain unauthorized access to kernel memory, indicating that the verification process is not flawless. In this paper, we present MOAT, a system that isolates potentially malicious BPF programs using Intel Memory Protection Keys (MPK). Enforcing BPF program isolation with MPK is not straightforward; MOAT is carefully designed to alleviate technical obstacles, such as limited hardware keys and supporting a wide variety of kernel BPF helper functions. We have implemented MOAT in a prototype kernel module, and our evaluation shows that MOAT delivers low-cost isolation of BPF programs under various real-world usage scenarios, such as the isolation of a packet-forwarding BPF program for the memcached database with an average throughput loss of 6%.

https://arxiv.org/abs/2301.13421

If we must resort to hardware protection mechanisms, is language safety or verification still necessary to protect the kernel and extensions from one another?

Unleashing Unprivileged eBPF Potential with Dynamic Sandboxing

For safety reasons, unprivileged users today have only limited ways to customize the kernel through the extended Berkeley Packet Filter (eBPF). This is unfortunate, especially since the eBPF framework itself has seen an increase in scope over the years. We propose SandBPF, a software-based kernel isolation technique that dynamically sandboxes eBPF programs to allow unprivileged users to safely extend the kernel, unleashing eBPF’s full potential. Our early proof-of-concept shows that SandBPF can effectively prevent exploits missed by eBPF’s native safety mechanism (i.e., static verification) while incurring 0%-10% overhead on web server benchmarks.

https://arxiv.org/abs/2308.01983

This may conflict with the original design of eBPF, which was not meant to rely on a sandbox for safety. Why not use WebAssembly in the kernel if SFI is the goal?

Kernel extension verification is untenable

The emergence of verified eBPF bytecode is ushering in a new era of safe kernel extensions. In this paper, we argue that eBPF’s verifier—the source of its safety guarantees—has become a liability. In addition to the well-known bugs and vulnerabilities stemming from the complexity and ad hoc nature of the in-kernel verifier, we highlight a concerning trend in which escape hatches to unsafe kernel functions (in the form of helper functions) are being introduced to bypass verifier-imposed limitations on expressiveness, unfortunately also bypassing its safety guarantees. We propose safe kernel extension frameworks using a balance of not just static but also lightweight runtime techniques. We describe a design centered around kernel extensions in safe Rust that will eliminate the need of the in-kernel verifier, improve expressiveness, allow for reduced escape hatches, and ultimately improve the safety of kernel extensions.

https://sigops.org/s/conferences/hotos/2023/papers/jia.pdf

This may limit the kernel to loading only eBPF programs signed by trusted third parties, since the kernel itself can no longer independently verify them. The Rust toolchain also has vulnerabilities of its own.

Wasm-bpf: WebAssembly eBPF library, toolchain and runtime

Wasm-bpf is a WebAssembly eBPF library, toolchain, and runtime that allows eBPF programs to be built into Wasm with little to no code changes and run across platforms inside a Wasm sandbox.

It provides a configurable environment with limited eBPF WASI behavior, enhancing security and control. This allows fine-grained permissions that restrict access to kernel resources. For instance, eBPF programs can be restricted to specific types of usage, such as network monitoring; the runtime can also control which kinds of eBPF programs may be loaded into the kernel and which attach events they may access, without modifying the kernel’s eBPF permission model.

Porting an application to WebAssembly requires additional effort, and the Wasm interface to kernel eBPF also needs ongoing maintenance, much as a BPF daemon does.

bpftime: Userspace eBPF runtime for uprobe & syscall hook & plugin

A userspace eBPF runtime that allows existing eBPF applications to operate in unprivileged userspace using the same libraries and toolchains. It offers Uprobe and syscall tracepoints for eBPF, with significant performance improvements over kernel uprobes and without requiring manual code instrumentation or process restarts. The runtime supports interprocess eBPF maps in userspace shared memory and is also compatible with kernel eBPF maps, allowing seamless operation with the kernel’s eBPF infrastructure. It includes a high-performance LLVM JIT for various architectures, alongside a lightweight JIT for x86 and an interpreter.

It may be limited to certain eBPF program types and use cases rather than being a general approach for kernel eBPF.

Conclusion

As we have traversed the multifaceted domain of eBPF security, it’s clear that while eBPF’s verifier provides a robust first line of defense, there are inherent limitations within the current access control model that require attention. We have considered potential solutions ranging from virtualization, software fault isolation, and formal methods to WebAssembly and userspace eBPF runtimes, each offering a unique approach to fortifying eBPF against vulnerabilities.

However, as with any complex system, new questions and challenges continue to surface. The gaps identified between the theoretical security models and their practical implementation invite continued research and experimentation. The future of eBPF security is not only promising but also demands a collective effort to ensure the technology can be adopted with confidence in its capacity to safeguard systems.

We are eunomia-bpf (github.com/eunomia-bpf), building open source projects to make eBPF easier to use and exploring new technologies, toolchains, and runtimes related to eBPF.
For those interested in eBPF technology, check out our tutorial code repository at https://github.com/eunomia-bpf/bpf-developer-tutorial and our tutorials at https://eunomia.dev/tutorials/ for practical understanding and practice.