Protocol Buffers is a widely used binary serialization format, with bindings available in many languages.
Is there a way to easily store protobuf messages in a Postgres database, so that they become queryable with SQL?
message Message {
  string name = 1;
  uint32 age = 2;
  Address address = 3;

  message Address {
    ...
  }
}
SELECT * FROM messages WHERE age > 30 AND address.country = 'CH';
Possible methods
One way is to mirror the protobuf schema (.proto) as a SQL schema, but this requires a lot of boilerplate for reading and writing, and backwards-compatible changes to the proto (one of the strengths of Protobuf) require performing migrations.
A simpler solution would be to combine:
- The ability to serialize Protobufs to/from JSON in a way that matches Protobuf semantics (see e.g. the Python API).
- Postgres’ support for JSONb
A blog post by Dataform describes a Go implementation of this idea with MongoDB. Field names are replaced by their field numbers, since the former are allowed to change freely.
A quick search also reveals:
- The work-in-progress ProfaneDB (http://profanedb.gitlab.io/), based on the RocksDB KV store.
- Work-in-progress Protobuf support for MySQL by Google.
- The support for protobufs in Apple’s FoundationDB.
- Support for Protobuf in Clickhouse.
An implementation in Rust using serde
The following describes a way to implement the idea above in Rust, leveraging serde and diesel’s support for custom types. The same could be achieved with the postgres crate in lieu of diesel (which would probably make more sense, given that we’re not really using the ORM features…).
JSONb in diesel via serde
Assume we have a protobuf message
message Message {
...
}
that we codegen in Rust to
struct Message {
...
}
using prost.
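As a reminder of the overall idea, storing a message boils down to round-tripping it through serde_json. A minimal sketch (the serde derives on the generated types are set up in a later section):

fn roundtrip(msg: proto::Message) -> serde_json::Result<proto::Message> {
    // Field names become JSON keys: {"name": ..., "age": ..., "address": ...}.
    let json = serde_json::to_value(&msg)?;
    serde_json::from_value(json)
}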
The following slightly generalizes diesel’s support for JSONb by handling arbitrary types T rather than only serde_json::Value.
use std::io::Write;

use diesel::pg::Pg;
use serde::{Deserialize, Serialize};

// Wrapper storing any serde-serializable type as JSONb.
#[derive(
    Debug, Serialize, Deserialize, diesel::expression::AsExpression, diesel::deserialize::FromSqlRow,
)]
#[diesel(sql_type = diesel::sql_types::Jsonb)]
pub struct Proto<T: std::fmt::Debug>(T);

// Postgres’ binary jsonb format starts with a version byte, currently 1.
const JSONB_HEADER: u8 = 1;

impl<T> diesel::serialize::ToSql<diesel::sql_types::Jsonb, Pg> for Proto<T>
where
    T: std::fmt::Debug + Serialize,
{
    fn to_sql<'b>(
        &'b self,
        out: &mut diesel::serialize::Output<'b, '_, Pg>,
    ) -> diesel::serialize::Result {
        // Version byte first, then the JSON serialization of the wrapped value.
        out.write_all(&[JSONB_HEADER])?;
        serde_json::to_writer(out, &self.0)
            .map(|_| diesel::serialize::IsNull::No)
            .map_err(Into::into)
    }
}
impl<T: std::fmt::Debug + serde::de::DeserializeOwned>
    diesel::deserialize::FromSql<diesel::sql_types::Jsonb, Pg> for Proto<T>
{
    fn from_sql(value: diesel::pg::PgValue<'_>) -> diesel::deserialize::Result<Self> {
        let bytes = value.as_bytes();
        // The payload is the jsonb version byte followed by plain JSON.
        if bytes.is_empty() || bytes[0] != JSONB_HEADER {
            return Err("Unsupported JSONB encoding version".into());
        }
        serde_json::from_slice(&bytes[1..]).map_err(|_| "Invalid Json".into())
    }
}
Regarding Protobuf semantics:
- We use the field names as keys, which violates Protobuf semantics (renaming a field in the proto would prevent deserializing the messages already in the database). It would not be too difficult to use the (stable) field number as key instead, but that makes queries harder to write; a translation layer would then likely be needed. Alternatively, we can create migrations that rename the keys.
- We should ensure that all fields of T have the #[serde(default)] field attribute, so that old messages can be deserialized (with the new fields set to their defaults). This is possible given that non-primitive types (submessages) are stored as Option values (see the sketch after this list).
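To illustrate the second point, here is a minimal sketch with a hypothetical struct: thanks to #[serde(default)], a document written before a field existed still deserializes, with the new field taking its default value.

#[derive(serde::Serialize, serde::Deserialize, Debug)]
struct Person {
    #[serde(default)]
    name: String,
    #[serde(default)]
    email: String, // Added in a later version of the proto.
}

#[test]
fn old_documents_still_deserialize() {
    // A document stored before `email` existed gets the default value.
    let old: Person = serde_json::from_str(r#"{"name": "Ada"}"#).unwrap();
    assert_eq!(old.email, "");
}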
Getting Serialize and Deserialize on the messages
One small issue is that prost does not derive serde::{Serialize, Deserialize} by default on the types it generates (see this issue). The type_attribute option in prost_build allows adding the #[derive(...)] macro, but adding #[serde(default)] on fields is a bit more tricky. Indeed, field_attribute applies to struct fields and enum variants indiscriminately, and #[serde(default)] is not valid on enum variants… Instead, we achieve this with a custom attribute macro:
use proc_macro::TokenStream;
use quote::quote;
use syn::parse::Parser;

// Adds #[serde(default)] to every struct field and derives the serde traits.
#[proc_macro_attribute]
pub fn serde_deserialize_default(_args: TokenStream, input: TokenStream) -> TokenStream {
    let mut input: syn::Item = syn::parse(input).unwrap();
    let attr = quote! { #[serde(default)] };
    let attr = syn::Attribute::parse_outer.parse(attr.into()).unwrap();
    // Only structs get the field attribute; enums are left untouched.
    if let syn::Item::Struct(input) = &mut input {
        for field in &mut input.fields {
            field.attrs.extend(attr.clone());
        }
    }
    quote! {
        #[derive(serde::Serialize, serde::Deserialize)]
        #input
    }
    .into()
}
Then we can set
config.type_attribute(".", "#[serde_deserialize_default]");
in the prost configuration, making sure the attribute macro is in scope where the generated code is included.
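Putting it together, a minimal build.rs sketch (the proto paths and the my_macros crate name are hypothetical):

// build.rs
fn main() -> std::io::Result<()> {
    let mut config = prost_build::Config::new();
    // Apply the attribute macro to every generated type.
    config.type_attribute(".", "#[my_macros::serde_deserialize_default]");
    config.compile_protos(&["proto/message.proto"], &["proto/"])?;
    Ok(())
}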
Table setup and example
Finally, we can set our table up as:
#[derive(Insertable, Debug, Queryable, Serialize, Deserialize)]
#[diesel(table_name = messages)]
struct Message {
data: Proto<proto::Message>,
}
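For reference, a sketch of the matching diesel schema, assuming the table holds a single jsonb column (onto which the struct above maps positionally):

// schema.rs (sketch)
diesel::table! {
    messages (data) {
        data -> Jsonb,
    }
}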
A Rust Message struct can be added and retrieved with:
// `interact` runs the blocking diesel call on a pool worker thread
// (as provided e.g. by deadpool-diesel).
conn.interact(move |conn| {
    diesel::insert_into(messages::table)
        .values(&[(
            messages::dsl::data.eq(Proto(message)),
        )])
        .execute(conn)
})
.await
.unwrap()?;
// Fetch the first row and unwrap the stored protobuf message.
let message: proto::Message = conn
    .interact(|conn| messages::dsl::messages.first(conn))
    .await
    .unwrap()?
    .data
    .0;
More complex queries are then achieved with raw SQL, following Postgres’ JSONb support.
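For instance, the query from the introduction could look roughly as follows (a sketch: ->> extracts a field as text, hence the cast for the numeric comparison):

SELECT * FROM messages
WHERE (data->>'age')::int > 30
  AND data->'address'->>'country' = 'CH';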
Indexes are supported, but only to some extent for sub-messages (see the operators that GIN indexes support). A solution for that case is to create a materialized view and define indexes there, as sketched below.
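As a sketch, a GIN index on the whole document accelerates containment queries, while a materialized view can expose sub-message fields to ordinary B-tree indexes (all names here are hypothetical):

CREATE INDEX messages_data_gin ON messages USING GIN (data);

CREATE MATERIALIZED VIEW messages_flat AS
  SELECT (data->>'age')::int AS age,
         data->'address'->>'country' AS country
  FROM messages;
CREATE INDEX messages_flat_country ON messages_flat (country);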
The same with MongoDB
The very same idea can be used with MongoDB and the official Rust driver.
A priori, the bson type already implements serde::{Serialize, Deserialize}. However, it requires that map keys be strings. We use a wrapper type again to pass through serde_json, which will transparently convert the keys to strings and back.
use serde::de::Error as _;
use serde::{Deserialize, Serialize};

// Wrapper (de)serializing T through serde_json, so that map keys become strings.
pub struct ProtoAsJson<T> {
    pub inner: T,
}

impl<T> ProtoAsJson<T> {
    pub fn from(inner: T) -> ProtoAsJson<T> {
        ProtoAsJson { inner }
    }
}

// BSON-serialized (with keys as strings) -> JSON -> Native
impl<'de, T: Deserialize<'de>> Deserialize<'de> for ProtoAsJson<T> {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: serde::Deserializer<'de>,
    {
        let v: Result<serde_json::Value, D::Error> = serde_json::Value::deserialize(deserializer);
        T::deserialize(v?)
            .map(|inner| ProtoAsJson { inner })
            .map_err(|e| D::Error::custom(format!("Failed to deserialize BSON: {}", e)))
    }
}

// Native -> JSON -> BSON-serialized (with keys as strings)
impl<T: Serialize> Serialize for ProtoAsJson<T> {
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: serde::Serializer,
    {
        // Convert to a serde_json::Value first, reporting failures through
        // serde rather than panicking.
        let v = serde_json::to_value(&self.inner).map_err(serde::ser::Error::custom)?;
        v.serialize(serializer)
    }
}
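Usage with the official driver then looks roughly like this (a sketch against the 2.x driver API, assuming a db: mongodb::Database handle and a message to store; the collection name is hypothetical):

// Insert a message and read the first one back through the wrapper.
let coll = db.collection::<ProtoAsJson<proto::Message>>("messages");
coll.insert_one(ProtoAsJson::from(message), None).await?;
let first: Option<ProtoAsJson<proto::Message>> = coll.find_one(None, None).await?;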
A word of warning
JSON and Protobuf both have complex semantics, and the above implementations should be taken with caution. For example, there is no NaN in JSON, so NaN values end up mapped to null in MongoDB.