Redis: under the hood---转载

http://pauladamsmith.com/articles/redis-under-the-hood.html#redis-under-the-hood

How does the Redis server work?

I was curious to learn more about Redis’s internals, so I’ve been familiarizing myself with the source, largely by reading and jumping around in Emacs. After I had peeled back enough of the onion’s layers, I realized I was trying to keep track of too many details in my head, and it wasn’t clear how it all hung together. I decided to write out in narrative form how an instance of the Redis server starts up and initializes itself, and how it handles the request/response cycle with a client, as a way of explaining it to myself, hopefully in a clear fashion. Luckily, Redis has a nice, clean code base that is easy to read and follow along. Armed with a TAGS file, my $EDITOR, and GDB, I set out to see how it all works under the hood. (Incidentally, I was working with the Redis code base as of commit b4f2e41. Of course, internals such as I outline below are subject to change. However, the broad architecture of the server is unlikely to change very much, and I tried to keep that in mind as I went along.)

This article examines server startup and takes a high-level view of the request/response processing cycle. In a subsequent article, I’ll dive in to greater detail and trace a simple SET/GET command pair as they make their way through Redis.

Redis: under the hood

Startup

Let’s begin with the main() function in redis.c.

Beginning global server state initialization

First, initServerConfig() is called. This partially initializes a variable server, which has the type struct redisServer, that serves as the global server state.

// redis.h:338
struct redisServer {
    pthread_t mainthread;
    int port;
    int fd;
    redisDb *db;
    // ...
};
 
// redis.c:69
struct redisServer server; /* server global state */

There are a huge number of members in this struct, but they generally fall into the following categories:

general server state
statistics
configuration from config file
replication
sort parameters
virtual memory config, state, I/O threads, & stats
zip structure
event loop helpers
pub/sub

For example, this struct includes members that map to options in the configuration file (usually named redis.conf) such as the port the server listens on and how verbose logging should be, pointers to linked lists of connected clients and slave Redis servers as well as the Redis database(s) itself, and counters for statistics like the number of commands processed since startup.

initServerConfig() provides default values for the members that can be configured by the user via the redis.conf config file.

Setting up command table

The next thing main() does is sort the table of Redis commands. These are defined in a global variable readonlyCommandTable which is an array of struct redisCommands.

// redis.c:70
struct redisCommand *commandTable;
struct redisCommand readonlyCommandTable[] = {
    {"get",getCommand,2,REDIS_CMD_INLINE,NULL,1,1,1},
    {"set",setCommand,3,REDIS_CMD_BULK|REDIS_CMD_DENYOOM,NULL,0,0,0},
    {"setnx",setnxCommand,3,REDIS_CMD_BULK|REDIS_CMD_DENYOOM,NULL,0,0,0},
    {"setex",setexCommand,4,REDIS_CMD_BULK|REDIS_CMD_DENYOOM,NULL,0,0,0},
    {"append",appendCommand,3,REDIS_CMD_BULK|REDIS_CMD_DENYOOM,NULL,1,1,1},
    // ...
};
 
// redis.h:458
typedef void redisCommandProc(redisClient *c);
// ...
struct redisCommand {
    char *name;
    redisCommandProc *proc;
    int arity;
    int flags;
    // ...
};

A redisCommand struct keeps track of its name — the mnemonic, i.e., “get” — a pointer to the actual C function that performs the command, the command’s arity, command flags such as whether it returns a bulk response, and a number of VM-specific members.)

The read-only table is ordered in source code so that commands are grouped by category, such as string commands, list commands, set commands, etc., to make it easier for a programmer to scan the table for similar commands. The sorted table of commands is pointed to by the global variable commandTable, and is used to lookup Redis commands with a standard binary search (lookupCommand(), which returns a pointer to a redisCommand).

Loading config file

main() moves on to processing command-line options given by the user starting up the redis-server executable. Currently, there is only a single argument that Redis takes — aside from the usual version -vand help -h flags — which is a path to a config file. If the path is given, Redis loads the config file and overrides any defaults already set by initServerConfig() by calling loadServerConfig(). This function is fairly straightforward, looping over each line in the config file and converting values that match directive names to appropriate types for the matching member in the serverstruct. At this point, Redis will daemonize and detach from the controlling terminal if it has been configured to do so.

`initServer()`

initServer() finishes the job of initializing the server struct that was begun by initServerConfig(). First, it sets up signal handling (SIGHUP and SIGPIPE signals are ignored — there is an opportunity to improve Redis by adding the ability to reload its config file when it receives a SIGHUP, in the fashion of other daemons), including printing a stacktrace if the server receives a SIGSEGV (and other related signals), see segvHandler().

A number of doubly-linked lists (see adlist.h) are created to keep track of clients, slaves, monitors (a client that has sent theMONITOR command), and an object free list.

Shared objects

One interesting thing Redis does is create a number of shared objects, which are accessible via the global shared struct. Common Redis objects that are required by many different commands, response strings and error messages, for example, can be shared without having to allocate them each time, saving memory, with the tradeoff of a bit more initialization effort at startup time.

// redis.c:662
shared.crlf = createObject(REDIS_STRING,sdsnew("\r\n"));
shared.ok = createObject(REDIS_STRING,sdsnew("+OK\r\n"));
shared.err = createObject(REDIS_STRING,sdsnew("-ERR\r\n"));
shared.emptybulk = createObject(REDIS_STRING,sdsnew("$0\r\n\r\n"));
// ...

Shared integers

The greatest impact in terms of memory savings with shared objects comes from a large pool of shared integers.

// redis.c:705
for (j = 0; j < REDIS_SHARED_INTEGERS; j++) {
    shared.integers[j] = createObject(REDIS_STRING,(void*)(long)j);
    shared.integers[j]->encoding = REDIS_ENCODING_INT;
}

createSharedObjects() creates an array of the first 10,000 non-negative integers as Redis objects (strings with an integer encoding). Various Redis objects like strings, sets, and lists often contain many small integers (for IDs or counters), and they can reuse the same objects already allocated in memory, for a large potential savings. One could imagine the constant that defines the number of shared integers to be created, REDIS_SHARED_INTEGERS, being exposed to configuration to give users the opportunity, based on knowledge of their applications and needs, to increase the number of shared integers for greater memory savings. The tradeoff is that Redis statically allocates slightly more memory at startup time, but this would likely be a small amount compared to the overall typical database sizes and potential savings.

Event loop

initServer() continues by creating the core event loop, calling aeCreateEventLoop() (see ae.c) and assigning the result to the elmember of server.

One key aspect of Redis’s implementation is the use of locally-provided wrappers to simplify and hide the complexity of common tasks, without adding dependecies at build-time. For example,zmalloc.h defines a number of wrappers for the family of*alloc() functions, which keep track of how much memory Redis has allocated. sds.hdefines an API to a dynamic strings library (basically, a string that keeps track of its length and whether its memory can be freed.

ae.h provides a platform-independent wrapper for setting up I/O event notification loops, which usesepoll on Linux, kqueue on BSD, and falls back to select if the respective first choice is not available.) Redis’s event loop polls for new connections and I/O events (reading requests from and writing responses to a socket), being triggered when a new event arrives. This is what makes Redis so responsive, it can serve thousands of clients simultaneously without blocking while individual requests are processed and responded to.

Databases

initServer() also initializes a number of redisDb objects, which are structs that encapsulate the details of a particular Redis database, including tracking expiring keys, keys that are blocking (either from a B{L,R}POPcommand or from I/O), and keys that are being watched for check-and-set. (By default there are 16 separate databases, which can be thought of as namespaces within a Redis server.)

TCP socket

initServer() is where the socket that Redis listens for connections (by default, bound to port 6379) is set up. Another Redis-local wrapper, anet.h, defines anetTcpServer() and a number of other functions that simplify the usual complexity of setting up a new socket, binding, and listening to a port.

// redis.c:791
server.fd = anetTcpServer(server.neterr, server.port, server.bindaddr);
if (server.fd == -1) {
    redisLog(REDIS_WARNING, "Opening TCP port: %s", server.neterr);
    exit(1);
}

Server cron

initServer() further allocates various dicts and lists for databases and for pub/sub, resets stats and various flags, and notes the UNIX timestamp of the server start time. It registers serverCron() with the event loop as a time event, executing that function once every 100 milliseconds. (This is a bit tricky, because initially, serverCron() is set to run in 1 millisecond by initServer(), in order to have the cron cycle start right away with the server startup, but then the return value of serverCron(), which is 100, is plugged in to the calculation of the next time the time event process should be handled.)

serverCron() performs a number of periodic tasks for Redis, including verbose logging of database size (# of keys and memory used) and connected clients, resizing hash tables, closing idle/timed-out client connections, performing any post-background save or AOF rewrite cleanup, kicking off a background save if the save conditions as configured have been met (so many keys changed in so many seconds), calculating LRU information and dealing with expired keys (Redis only expires a few timed-out keys per cron cycle, using a adaptive, statistical method to avoid tying up the server, but will get more aggressive if expiring keys can help avoid out-of-memory situations), swapping out values to disk if virtual memory is enabled, and syncing with a master if this server is a slave.

Registering connection handler with event loop

Crucially, initServer() hooks up the event loop with the server’s TCP socket by registering the socket’s descriptor, registering the acceptHandler() function to be called when a new connection is accepted. (More on this below in the “Processing a request” section.)

// redis.c:821
if (aeCreateFileEvent(server.el, server.fd, AE_READABLE,
    acceptHandler, NULL) == AE_ERR) oom("creating file event");

Opening the AOF

initServer() creates or opens the append-only file (AOF), it the server was configured to use it.

// redis.c:824
if (server.appendonly) {
    server.appendfd = open(server.appendfilename,O_WRONLY|O_APPEND|O_CREAT,0644);

Finally, initServer() initializes Redis’s virtual memory system, again, if the server had been configured for it.

Back up to `main()`

If the server was configured to daemonize, Redis will now try to write out a pid file (the path of which is configurable, but defaults to /var/run/redis.pid).

At this point, the server has started up, and Redis will log this fact to its log file. However, there’s still a bit more to do in main()before Redis is fully ready.

Restoring data

If there is an AOF or a database dump file (eg., dump.rdb), it will be loaded, restoring data to the server from a previous session. (If both exist, the AOF takes priority.)

// redis.c:1452
if (server.appendonly) {
    if (loadAppendOnlyFile(server.appendfilename) == REDIS_OK)
        redisLog(REDIS_NOTICE,"DB loaded from append only file: %ld seconds",time(NULL)-start);
} else {
    if (rdbLoad(server.dbfilename) == REDIS_OK)
        redisLog(REDIS_NOTICE,"DB loaded from disk: %ld seconds",time(NULL)-start);
}

The server is now ready to start accepting requests.

Event loop setup

To finish up, Redis registers a function to be called each time it enters the event loop, beforeSleep() (since the process essentially goes to sleep while it waits to be notified of events). beforeSleep() does two things: it deals with serving clients that have requested keys that were swapped to disk if the virtual memory system was enabled, and it flushes the AOF to disk. The writing of the AOF is handled by flushAppendOnlyFile(). The function encapsulates some tricky logic about flushing the buffer that holds pending AOF writes (the frequency of which is configurable by the user).

Entering the event loop

Redis now enters the main event loop by calling aeMain(), with the argument server.el (remember, this member contains a pointer to an aeEventLoop). If there are any time (i.e., server cron) or file events to process each time through the loop, their respective handler functions will be called. aeProcessEvents() encapsulates this logic — time events are handled by custom logic, whereas file events are handled by the underlying epoll or kqueue or select I/O event notification system.

Because of Redis’s need to respond to time events as well as file or I/O events, it implements a custom event/polling loop,aeMain(). By checking to see if any time events need processing, and utilizing file event notification, the event loop can efficiently sleep until there is work to be done, and not tie up the CPU in a tight while loop.

Processing a request & returning a response

We are now inside Redis’s main event polling loop, listening on a port and waiting for clients to connect. It’s time to look at how Redis processes a command request.

Handling a new connection

Back under initServer(), Redis registered acceptHandler() to be called when there was an I/O event associated with the file descriptor of the socket the server is listening to (i.e., the socket has data waiting to be read or written). acceptHandler() creates a client object — pointer to a redisClient, a struct defined in redis.h — to represent a new client connection.

// networking.c:347
cfd = anetAccept(server.neterr, fd, cip, &cport);
if (cfd == AE_ERR) {
    redisLog(REDIS_VERBOSE,"Accepting client connection: %s", server.neterr);
    return;
}
redisLog(REDIS_VERBOSE,"Accepted %s:%d", cip, cport);
if ((c = createClient(cfd)) == NULL) {
    redisLog(REDIS_WARNING,"Error allocating resoures for the client");
    close(cfd); /* May be already closed, just ingore errors */
    return;
}

createClient() is called to allocate and initialize the client object. It selects database 0 by default (since there has to be at least one Redis db per server), and associates the client file descriptor generated by accept(2) in acceptHandler() with the client object. Other flags and members are initialized, and finally the client is appended to the global list of clients being tracked byserver.clients. The key thing Redis does in createClient() is registering a handler with the event loop, the functionreadQueryFromClient(), for when there is data from the client connection to be read.

// networking.c:20
if (aeCreateFileEvent(server.el,fd,AE_READABLE, readQueryFromClient, c) == AE_ERR)
{
    close(fd);
    zfree(c);
    return NULL;
}

Reading a command from a client

readQueryFromClient() is called by the main event loop when the client makes a command request. (If you are debugging with GDB, this is a good function to set as a breakpoint.) It reads it as much as it can of the command — up to 1024 bytes — to a temporary buffer, then appends it to a client-specific query buffer. This allows Redis to process commands where the payload (command name plus arguments) is larger than 1024 bytes, or because of I/O reasons have been split up into multiple read events. It then calls processInputBuffer(), passing the client object as an argument.

// networking.c:754
void readQueryFromClient(aeEventLoop *el, int fd, void *privdata, int mask) {
    redisClient *c = (redisClient*) privdata;
    char buf[REDIS_IOBUF_LEN];
    int nread;
    // ...
 
    nread = read(fd, buf, REDIS_IOBUF_LEN);
    // ...
    if (nread) {
        size_t oldlen = sdslen(c->querybuf);
        c->querybuf = sdscatlen(c->querybuf, buf, nread);
        c->lastinteraction = time(NULL);
        /* Scan this new piece of the query for the newline. We do this
         * here in order to make sure we perform this scan just one time
         * per piece of buffer, leading to an O(N) scan instead of O(N*N) */
        if (c->bulklen == -1 && c->newline == NULL)
            c->newline = strchr(c->querybuf+oldlen,'\n');
    } else {
        return;
    }
    Processinputbuffer(c);
}

processInputBuffer() parses the raw query from the client into arguments for execution of a Redis command. It first has to contend with the possibility that the client is blocked in a B{L,R}POP command, and bails early if that is the case. The function then parses the raw query buffer into arguments, creating Redis string objects of each and storing them in an array on the client object. The query is in the form of the Redis protocol. processInputBuffer() is really a protocol parser, calling back toprocessCommand() to fully parse the query. Somewhat confusingly, the source code comments describe parsing the “multi bulk command type,” an alternative protocol originally meant for commands like MSET, but it actually is now the main Redis protocol for all commands. It is binary-safe and easy to parse and debug. (Note: this code has been refactored for the upcoming 2.2 release and is a bit easier to follow.) Now it’s time to actually execute the command the client has sent, by callingprocessCommand() on the client object.

processCommand() takes the arguments of a command from a client and executes it. Before it gets to the actual execution of the command, it performs a number of checks — if any check fails, it appends a error message to the client object’s reply list and returns to the caller, processInputBuffer(). After handling the QUIT command as a special case (in order to shut down the client safely), processCommand() looks up the command name in the commandTable that was previously set up during Redis’s start up cycle. If it’s an unknown command, or the client got the arity of the command wrong, it’s an error. While it’s not commonly used, Redis can be configured to require a password to authenticate a client before it will accept commands, and this is the stage where Redis will check if the client is authenticate, and will set an error if not. If Redis is configured to use up to a maximum amount of memory, it will at this point try to free up memory if it can (be freeing objects from the free list and removing expired keys), otherwise if the server is over the limit, it won’t process commands with the REDIS_CMD_DENYOOM flag set (mainly writes like SET,INCR, RPUSH, ZADD, etc.), again, an error. One final check Redis makes is that a client can only issue SUBSCRIBE or UNSUBSCRIBEcommands while there are outstanding channels subscribed to, otherwise, it’s an error. If all the checks have been passed, the command will be executed, by calling call() with the client object and the command object as arguments.

Executing the command and responding

call(), gets a pointer to function of type struct redisCommandProc, from the proc member of the command object, which takes a single argument, that of a client object. The Redis command procedure is called.

// redis.c:864
void call(redisClient *c, struct redisCommand *cmd) {
    long long dirty;
 
    dirty = server.dirty;
    cmd->proc(c);
    dirty = server.dirty-dirty;
}
// ...

Each Redis command is responsible for setting the reply for the client. This is possible because the signature of a Redis command procedure is of a single argument, which is the client object. Likewise, each command procedure is responsible for encoding, or deserializing, arguments from the command, and for decoding, or serializing, Redis objects in memory for response to the client.

Write commands, like SET and ZADD, make the server “dirty,” in other words, the server is marked as having pages in memory that have changed. This is important for the automatic save process, which keeps track of how many keys have changed in a certain period, or the writing to the AOF. The function callsfeedAppendOnlyFile() if use of the AOF has been enabled, which writes out the command buffer from the client to the AOF, so the command can be replayed. (It translates commands that set a relative key expiration to an absolute expiration, but otherwise it basically copies the command as it came in from the client, see catAppendOnlyGenericCommand().) If any slaves are connected, call() will send the command to each of them to be executed locally, see replicationFeedSlaves(). Likewise, if any clients are connected and have issued the MONITOR command, Redis will send a representation of the command, prefixed with a timestamp, see replicationFeedMonitors().

// redis.c:871 (call() cont.'d)
    // ...
    if (server.appendonly && dirty)
        feedAppendOnlyFile(cmd,c->db->id,c->argv,c->argc);
    if ((dirty || cmd->flags & REDIS_CMD_FORCE_REPLICATION) &&
        listLength(server.slaves))
        replicationFeedSlaves(server.slaves,c->db->id,c->argv,c->argc);
    if (listLength(server.monitors))
        replicationFeedMonitors(server.monitors,c->db->id,c->argv,c->argc);
    server.stat_numcommands++;
}

Control returns to the caller, processCommand(), which resets the client object for subsequent commands.

As mentioned, each Redis command procedure is itself responsible for setting the response to be sent to the client. AfterreadQueryFromClient() exits and Redis returns the to event loop in aeMain(), aeProcessEvents() will pick up the waiting response in the write buffer and will copy it to the socket the client is connected on.

And that’s it! The response has been sent, and both client and server are back to a state where they can respectively emit and process more Redis commands.

Summary

Redis starts up by initializing a global server state variable, and reading in an optional configuration file to override any defaults. It sets up a global command table that connects command names with the actual function that implements the command. It creates an event loop using the best available underlying system library for event/readiness notification, and registers a handler function for when there is a new client socket connection to accept. It also registers a periodic (i.e., time-based) event handler for dealing with cron-like tasks like key expiry that need to be addressed outside the regular client-handling path. Once a client has connected, a function is registered on the event loop for being notified when the client has data to be read (i.e., a query for a command). The client’s query is parsed and a command handler is called to execute the command and write the response back to the client (the writing of data to the client is also handled by the event-notification loop). The client object is reset and server is ready to process more queries.

Next time — tracing a `SET` and `GET`

I’ll follow-up this article with one that takes a close look at the processing of two commands, SET and GET, by stepping through the implementation of each command’s procedure and examining the data structures Redis uses to store and index data. March 15, 2011: here it is.

—Paul Smith (follow me on the Twitter)

Thanks to pietern on #redis for feedback on a draft of this article

October 18, 2010