Load Balancing


Past a certain point, web applications outgrow a single-server deployment. Companies want to increase their availability, their scalability, or both! To do this, they deploy their application across multiple servers with a load balancer in front to distribute incoming requests. Big companies may need thousands of servers running their web application to handle the load.

In this post we're going to focus on the ways that a single load balancer might distribute HTTP requests to a set of servers. We'll start from the bottom and work our way up to modern load balancing algorithms.

# Visualising the problem

Let's start at the beginning: a single load balancer sending requests to a single server. Requests are being sent at a rate of 1 request per second (RPS), and each request reduces in size as the server processes it.

For a lot of websites, this setup works just fine. Modern servers are powerful and can handle a lot of requests. But what happens when they can't keep up?

Here we see that a rate of 3 RPS causes some requests to get dropped. If a request arrives at the server while another request is being processed, the server will drop it. This will result in an error being shown to the user and is something we want to avoid. We can add another server to our load balancer to fix this.

No more dropped requests! The way our load balancer is behaving here, sending a request to each server in turn, is called "round robin" load balancing. It's one of the simplest forms of load balancing, and works well when your servers are all equally powerful and your requests are all equally expensive.
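
For a sense of how little code this takes, here is a minimal sketch of round robin in Go (not the code behind these simulations): keep a list of servers and a cursor, and advance the cursor on every request.

```go
package main

import "fmt"

// roundRobin cycles through a fixed list of servers, handing each
// incoming request to the next server in turn.
type roundRobin struct {
	servers []string
	next    int
}

// pick returns the server that should receive the next request.
func (rr *roundRobin) pick() string {
	s := rr.servers[rr.next]
	rr.next = (rr.next + 1) % len(rr.servers)
	return s
}

func main() {
	rr := &roundRobin{servers: []string{"server-a", "server-b", "server-c"}}
	for i := 0; i < 6; i++ {
		fmt.Println(rr.pick()) // a, b, c, a, b, c
	}
}
```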

# When round robin doesn't cut it

In the real world, it's rare for servers to be equally powerful and requests to be equally expensive. Even if you use the exact same server hardware, performance may differ. Applications may have to service many different types of requests, and these will likely have different performance characteristics.

Let's see what happens when we vary request cost. In the following simulation, requests aren't equally expensive. You'll see this as some requests take longer to shrink than others.

While most requests get served successfully, we do drop some. One of the ways we can mitigate this is to have a "request queue."

Request queues help us deal with uncertainty, but it's a trade-off. We will drop fewer requests, but at the cost of some requests having a higher latency. If you watch the above simulation long enough, you might notice the requests subtly changing colour. The longer they go without being served, the more their colour will change. You'll also notice that thanks to the request cost variance, servers start to exhibit an imbalance. Queues will get backed up on servers that get unlucky and have to serve multiple expensive requests in a row. If a queue is full, we will drop the request.

Everything said above applies equally to servers that vary in power. In the next simulation we also vary the power of each server, which is represented visually with a darker shade of grey.

The servers are given a random power value, but odds are some are less powerful than others and quickly start to drop requests. At the same time, the more powerful servers sit idle most of the time. This scenario shows the key weakness of round robin: variance.

Despite its flaws, round robin is still the default HTTP load balancing method for nginx.

# Improving on round robin

It's possible to tweak round robin to perform better with variance. There's an algorithm called "weighted round robin" which involves getting humans to tag each server with a weight that dictates how many requests to send to it.

In this simulation, we use each server's known power value as its weight, and we give more powerful servers more requests as we loop through them.
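
As a rough illustration of the idea (not the simulation's code), a weighted round robin picker can simply send each server a number of consecutive requests equal to its weight before moving on. The server names and weights below are made up:

```go
package main

import "fmt"

// server pairs a name with a human-assigned weight: how many consecutive
// requests it receives per pass through the list.
type server struct {
	name   string
	weight int
}

// weightedRoundRobin tracks which server we're on and how many requests
// it has received in the current pass.
type weightedRoundRobin struct {
	servers []server
	index   int
	sent    int
}

// pick returns the next server, moving on once the current server has
// received its full weight of requests.
func (w *weightedRoundRobin) pick() string {
	s := w.servers[w.index]
	w.sent++
	if w.sent >= s.weight {
		w.sent = 0
		w.index = (w.index + 1) % len(w.servers)
	}
	return s.name
}

func main() {
	lb := &weightedRoundRobin{servers: []server{
		{"small", 1}, {"medium", 2}, {"large", 3},
	}}
	for i := 0; i < 6; i++ {
		fmt.Print(lb.pick(), " ") // small medium medium large large large
	}
	fmt.Println()
}
```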

While this handles the variance of server power better than vanilla round robin, we still have request variance to contend with. In practice, getting humans to set the weight by hand falls apart quickly. Boiling server performance down to a single number is hard, and would require careful load testing with real workloads. This is rarely done, so another variant of weighted round robin calculates weights dynamically by using a proxy metric: latency.

It stands to reason that if one server serves requests 3 times faster than another, it's probably 3 times as powerful and should receive 3 times as many requests.

I've added text to each server this time that shows the average latency of the last 3 requests served. We then decide whether to send 1, 2, or 3 requests to each server based on the relative differences in the latencies. The result is very similar to the initial weighted round robin simulation, but there's no need to specify the weight of each server up front. This algorithm will also be able to adapt to changes in server performance over time. This is called "dynamic weighted round robin."
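
Here's a hedged sketch of how such dynamic weights might be derived in Go: each server remembers its last 3 latencies, and weights of 1 to 3 are assigned based on each server's average relative to the slowest. The rounding and clamping choices here are my assumptions, not necessarily what the simulation does:

```go
package main

import "fmt"

// dynamicServer remembers the latencies (in ms) of its last few requests.
type dynamicServer struct {
	name      string
	latencies []float64 // newest first, capped at 3 entries
}

// observe records the latency of a completed request.
func (s *dynamicServer) observe(ms float64) {
	s.latencies = append([]float64{ms}, s.latencies...)
	if len(s.latencies) > 3 {
		s.latencies = s.latencies[:3]
	}
}

// avg is the mean latency of the last 3 requests.
func (s *dynamicServer) avg() float64 {
	if len(s.latencies) == 0 {
		return 1 // optimistic default so brand-new servers still get traffic
	}
	sum := 0.0
	for _, l := range s.latencies {
		sum += l
	}
	return sum / float64(len(s.latencies))
}

// weights turns relative average latencies into 1-3 requests per cycle:
// the slowest server gets 1, and a server roughly 3x faster gets 3.
func weights(servers []*dynamicServer) map[string]int {
	slowest := 0.0
	for _, s := range servers {
		if s.avg() > slowest {
			slowest = s.avg()
		}
	}
	w := make(map[string]int)
	for _, s := range servers {
		n := int(slowest/s.avg() + 0.5) // round to the nearest whole number
		if n < 1 {
			n = 1
		}
		if n > 3 {
			n = 3
		}
		w[s.name] = n
	}
	return w
}

func main() {
	a := &dynamicServer{name: "a"}
	b := &dynamicServer{name: "b"}
	a.observe(30) // a has been slow lately
	a.observe(30)
	b.observe(10) // b is roughly 3x faster
	b.observe(10)
	fmt.Println(weights([]*dynamicServer{a, b})) // map[a:1 b:3]
}
```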

Let's see how it handles a complex situation, with high variance in both server power and request cost. The following simulation uses randomised values, so feel free to refresh the page a few times to see it adapt to new variants.

# Moving away from round robin

Dynamic weighted round robin seems to account well for variance in both server power and request cost. But what if I told you we could do even better, and with a simpler algorithm?

This is called "least connections" load balancing.

Because the load balancer sits between the server and the user, it can accurately keep track of how many outstanding requests each server has. Then when a new request comes in and it's time to determine where to send it, it knows which servers have the least work to do and prioritises those.

This algorithm performs extremely well regardless of how much variance exists. It cuts through uncertainty by maintaining an accurate understanding of what each server is doing. It also has the benefit of being very simple to implement.
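
A minimal sketch of least connections in Go, assuming the load balancer increments a counter when it forwards a request and decrements it when the response comes back:

```go
package main

import "fmt"

// leastConnections tracks how many requests are currently in flight on
// each server and always picks the least busy one.
type leastConnections struct {
	inflight map[string]int
}

// pick returns the server with the fewest outstanding requests and marks
// one more request as in flight on it.
func (lc *leastConnections) pick() string {
	best, bestCount := "", -1
	for name, count := range lc.inflight {
		if bestCount == -1 || count < bestCount {
			best, bestCount = name, count
		}
	}
	lc.inflight[best]++
	return best
}

// done is called when a server finishes serving a request.
func (lc *leastConnections) done(name string) {
	lc.inflight[name]--
}

func main() {
	lc := &leastConnections{inflight: map[string]int{"a": 0, "b": 0}}
	first := lc.pick()  // either server, both are idle
	second := lc.pick() // the other one
	lc.done(first)
	fmt.Println(first, second, lc.pick()) // the third pick goes back to `first`
}
```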

Let's see this in action in a similarly complex simulation, using the same parameters we gave the dynamic weighted round robin algorithm above. Again, these parameters are randomised within given ranges, so refresh the page to see new variants.

While this algorithm is a great balance between simplicity and performance, it's not immune to dropping requests. However, what you'll notice is that the only time this algorithm drops requests is when there is literally no more queue space available. It will make sure all available resources are in use, and that makes it a great default choice for most workloads.

# Optimizing for latency

Up until now I've been avoiding a crucial part of the discussion: what we're optimising for. Implicitly, I've been considering dropped requests to be really bad and seeking to avoid them. This is a nice goal, but it's not the metric we most want to optimise for in an HTTP load balancer.

What we're often more concerned about is latency. This is measured in milliseconds from the moment a request is created to the moment it has been served. When we're discussing latency in this context, it is common to talk about different "percentiles." For example, the 50th percentile (also called the "median") is the latency value below which 50% of requests fall and above which the other 50% fall.
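
As a concrete illustration (not part of the original post), here's one common way to compute a percentile from a slice of latency samples, using the nearest-rank method:

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the latency (ms) below which p percent of the
// samples fall, using the nearest-rank method.
func percentile(samples []float64, p float64) float64 {
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := int(p/100*float64(len(sorted))+0.5) - 1
	if rank < 0 {
		rank = 0
	}
	if rank >= len(sorted) {
		rank = len(sorted) - 1
	}
	return sorted[rank]
}

func main() {
	latencies := []float64{12, 15, 20, 22, 30, 35, 40, 80, 120, 400}
	fmt.Println(percentile(latencies, 50)) // median: 30
	fmt.Println(percentile(latencies, 99)) // tail latency: 400
}
```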

I ran 3 simulations with identical parameters for 60 seconds and took a variety of measurements every second. Each simulation varied only by the load balancing algorithm used. Let's compare the medians for each of the 3 simulations:

You might not have expected it, but round robin has the best median latency. If we weren't looking at any other data points, we'd miss the full story. Let's take a look at the 95th and 99th percentiles.

Note: there's no colour difference between the different percentiles for each load balancing algorithm. Higher percentiles will always be higher on the graph.

We see that round robin doesn't perform well in the higher percentiles. How can it be that round robin has a great median, but bad 95th and 99th percentiles?

In round robin, the state of each server isn't considered, so you'll get quite a lot of requests going to servers that are idle. This is how we get the low 50th percentile. On the flip side, we'll also happily send requests to servers that are overloaded, hence the bad 95th and 99th percentiles.

We can take a look at the full data in histogram form:

I chose the parameters for these simulations to avoid dropping any requests. This guarantees we compare the same number of data points for all 3 algorithms. Let's run the simulations again but with an increased RPS value, designed to push all of the algorithms past what they can handle. The following is a graph of cumulative requests dropped over time.

Least connections handles overload much better, but the cost of doing that is slightly higher 95th and 99th percentile latencies. Depending on your use-case, this might be a worthwhile trade-off.

# One last algorithm

If we really want to optimise for latency, we need an algorithm that takes latency into account. Wouldn't it be great if we could combine the dynamic weighted round robin algorithm with the least connections algorithm? The latency of weighted round robin and the resilience of least connections.

Turns out we're not the first people to have this thought. Below is a simulation using an algorithm called "peak exponentially weighted moving average" (or PEWMA). It's a long and complex name but hang in there, I'll break down how it works in a moment.

I've set specific parameters for this simulation that are guaranteed to exhibit an expected behaviour. If you watch closely, you'll notice that the algorithm just stops sending requests to the leftmost server after a while. It does this because it figures out that all of the other servers are faster, and there's no need to send requests to the slowest one. That will just result in requests with a higher latency.

So how does it do this? It combines techniques from dynamic weighted round robin with techniques from least connections, and sprinkles a little bit of its own magic on top.

For each server, the algorithm keeps track of the latency from the last N requests. Instead of using this to calculate an average, it sums the values but with an exponentially decreasing scale factor. This results in a value where the older a latency is, the less it contributes to the sum. Recent requests influence the calculation more than old ones.

That value is then taken and multiplied by the number of open connections to the server and the result is the value we use to choose which server to send the next request to. Lower is better.
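
Here's a hedged sketch of that scoring in Go. The decay factor of 0.5 and the +1 on open connections are assumptions made for illustration; real PEWMA implementations expose these as tunable parameters:

```go
package main

import "fmt"

// pewmaServer holds what the algorithm needs to score a server: its most
// recent latencies (newest first) and its current number of open connections.
type pewmaServer struct {
	name      string
	latencies []float64 // newest first
	open      int       // outstanding requests
}

// score sums recent latencies with an exponentially decreasing weight
// (each older sample counts half as much as the one before it; the 0.5
// decay is an assumed tuning value), then multiplies by open connections.
// The server with the lowest score gets the next request.
func (s *pewmaServer) score() float64 {
	weight, sum := 1.0, 0.0
	for _, l := range s.latencies {
		sum += l * weight
		weight *= 0.5
	}
	return sum * float64(s.open+1) // +1 so idle servers still compare on latency
}

// pick returns the server with the lowest score.
func pick(servers []*pewmaServer) string {
	best := servers[0]
	for _, s := range servers[1:] {
		if s.score() < best.score() {
			best = s
		}
	}
	return best.name
}

func main() {
	fast := &pewmaServer{name: "fast", latencies: []float64{10, 12, 11}, open: 2}
	slow := &pewmaServer{name: "slow", latencies: []float64{90, 95, 100}, open: 0}
	fmt.Println(pick([]*pewmaServer{fast, slow})) // fast: low latency beats being busier
}
```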

So how does it compare? First let's take a look at the 50th, 95th, and 99th percentiles when compared against the least connections data from earlier.

We see a marked improvement across the board! It's far more pronounced at the higher percentiles, but consistently present for the median as well. Here we can see the same data in histogram form.

How about dropped requests?

It starts out performing better, but over time performs worse than least connections. This makes sense. PEWMA is opportunistic in that it tries to get the best latency, and this means it may sometimes leave a server less than fully loaded.

I want to add here that PEWMA has a lot of parameters that can be tweaked. The implementation I wrote for this post uses a configuration that seemed to work well for the situations I tested it in, but further tweaking could get you better results vs least connections. This is one of the downsides of PEWMA vs least connections: extra complexity.

# Conclusion

I spent a long time on this post. It was difficult to balance realism against ease of understanding, but I feel good about where I landed. I'm hopeful that being able to see how these complex systems behave in practice, in ideal and less-than-ideal scenarios, helps you grow an intuitive understanding of when they would best apply to your workloads.

Obligatory disclaimer: you must always benchmark your own workloads rather than taking advice from the Internet as gospel. My simulations here ignore some real-life constraints (server slow start, network latency), and are set up to display specific properties of each algorithm. They aren't realistic benchmarks to be taken at face value.

To round this out, I leave you with a version of the simulation that lets you tweak most of the parameters in real time. Have fun!

EDIT: Thanks to everyone who participated in the discussions on Hacker News, Twitter and Lobste.rs!

You all had a tonne of great questions and I tried to answer all of them. Some of the common themes were about missing things, either algorithms (like "power of 2 choices") or downsides of algorithms covered (like how "least connections" handles errors from servers).

I tried to strike a balance between post length and complexity of the simulations. I'm quite happy with where I landed, but like you I also wish I could have covered more. I'd love to see people taking inspiration from this and covering more topics in this space in a visual way. Please ping me if you do!

The other common theme was "how did you make this?" I used PixiJS and I'm really happy with how it turned out. It's my first time using this library and it was quite easy to get to grips with. If writing visual explanations like this is something you're interested in, I recommend it!

# Playground


Former DCist staff launch the 51st, new local news site for Washington


A group of D.C. journalists who worked at a local news site that was abruptly shuttered by NPR affiliate WAMU earlier this year are launching their own nonprofit devoted to covering community news of Washington.

They are calling it the 51st — a nod to the District of Columbia’s lack of statehood — and say it will deliver hyperlocal Washington news relevant to District residents.

Initially, their coverage will focus on topics such as the cost of living in D.C. and how to navigate city services, as well as on accountability reporting.

The idea for the site first arose days after the unexpected closure of DCist, a beloved local news site that had been acquired by WAMU six years ago. When former staff gathered at a downtown restaurant to commiserate, the wake quickly turned into a business-development strategy session.

“Worker-run newsroom: When?” Maddie Poore asked her old co-workers that night, only half-joking. “When are we going to do this? We need this as a city.”

That pipe dream eventually led to a core group of six colleagues talking to local journalism start-up experts from around the country, drawing up business plans and filing legal paperwork.

On Tuesday, they will debut their project, along with a fundraising campaign that they hope will propel it into a sustainable future.

“All six of us have been working in a volunteer capacity putting just love, sweat and dreams into it,” Poore said. “We are spending our own money to stand this up, like we’re digging into our dwindling savings accounts.”

As local newspapers across the country have shriveled — victims of changing readership habits, an evaporating ad business and corporate cost-cutting — some enterprising journalists have attempted to keep coverage of their communities alive through independent start-ups.

“All these sites are trying to meet a community need,” said Amy Kovac-Ashley, executive director of the nonprofit Tiny News Collective, “and all these people raising their hands are seeing the needs in their own communities.”

Such an endeavor can be daunting, she said. “Local journalism is not something that is easy to scale.”

Tiny News Collective, which offers services to media start-ups such as guidance on how to establish a board and advice on how to navigate hiring and staffing, is serving as the 51st’s fiscal sponsor while it awaits the IRS nonprofit designation that will allow it to accept donations directly.

The 51st’s staffers hope to raise $250,000 in the next 30 days to fund their site for six months as they continue to apply for grants and seek donations.

It will be operated as a worker-run newsroom, a model followed by other media start-ups as well, such as New York’s Hell Gate and Defector, which are funded by subscriptions.

The 51st was also inspired by nonprofit newsrooms LA Public Press and the Outlier in Detroit — which, in addition to covering the ins and outs of local government, has also published practical guidance on getting your landlord to fix your toilet and other “how-to” pieces.

DCist covered the city budget, breaking news and other serious matters, as well as life-in-the-city amusements, such as overheard conversations on Metro trains. Founded as a blog in 2004, it was run by volunteers before it became a more professional operation with a staff editor and a fleet of freelance writers.

Along with sister site Gothamist, DCist was abruptly shut down in 2017 by its billionaire owner after unionization attempts. WAMU in Washington and New York’s WNYC, both nonprofit public radio stations, acquired the local sites, and WAMU eventually merged the DCist newsroom with its own.

At its peak, said former executive editor Teresa Frontado, DCist drew an average of 1.6 million readers a month — though that number dropped to 600,000 by the time it was shut down, which she attributed to earlier cutbacks. About 57,000 people subscribed to the DCist newsletter.

She pointed to those figures as evidence that there is an audience for the 51st. “There was a sadness [when DCist shut down]. We care deeply about what we do,” Frontado said. “But I saw an opportunity to contribute instead of just lamenting the end of a project.”

While the Beltway is home to several massive media organizations that cover federal institutions and national politics, local news operations in the area have experienced cutbacks, including the Metro staff of The Washington Post, news station WTOP and alt-weekly City Paper.

“We’re kind of in this unique position where it’s a city that is saturated with journalists — it has one of the highest concentrations of journalists of most cities — and yet there are very few resources to do local reporting,” said Abigail Higgins, a former DCist editor.

The 51st’s staffers say they aim to “co-create” journalism with residents by having regular, direct conversations about the kinds of coverage that locals want, Higgins said. “We want to make sure that D.C. residents are represented in our reporting, just like D.C. residents deserve representation in government.” They plan to host community listening sessions and hire freelancers in each of the city’s wards.

As former DCist managing editor Natalie Delgadillo put it: Instead of “reporting on D.C.,” they hope to think of themselves as “reporting for D.C., … and you can’t do that by sitting around and guessing or talking to your friends only. It takes a lot of effort and deep engagement.”

The worker-directed aspect, they said, will allow them to try experiments that may have been mired in corporate bureaucracy, such as distributing digital stories through printouts or sharing information via WhatsApp groups.

Although they say they aren’t trying to re-create DCist — and, in fact, passed on the idea of trying to obtain DCist’s archives from WAMU — they do want to preserve some of DCist’s irreverence, by publishing “pride of place” stories that highlight the quirkiness of D.C.’s neighborhoods.

“We were very intentional about not relaunching DCist,” said staffer Eric Falquero. “This is something new, of wanting to take the best pieces of what we did together there, and build upon that and expand.” He wants to reach DCist’s fans but also “look beyond the audiences we had there.”

“We’re in it for the long haul,” he said.

That is, if they get the support they need to pull it off.


Free George R. R. Martin from The Winds of Winter


At this point in my career, I’ve written many thousands of words and edited quite a few different writers. I know a lot about procrastination. That is why, without speaking to the man or knowing him personally at all, I am nonetheless prepared to make the case that George R. R. Martin simply does not want to finish writing The Winds of Winter.

He’s just not into it. If he continues to force himself to do it, the end result will probably be a pretty terrible book — and I think he knows that, and that’s why he can’t finish it, because he doesn’t want to publish a bad book. The alternative? We don’t get the book at all. And for me, that’s actually preferable.

I could be misreading the signs, of course. Like so many other unhinged fans of A Song of Ice and Fire, I’m basing this purely on vibes, and Martin does love to troll all of us by posting oblique productivity quotes and tagging them with the word “writing,” suggesting that he’s quite merrily chugging along on The Winds of Winter. I also realize that by writing this post, I may be invoking the wrath of fate itself in such a way that Martin will post on his blog tomorrow that The Winds of Winter is officially done and now in his editors’ hands. That would be great, actually! But I don’t think that’s gonna happen.

I realize this is a controversial opinion amongst fellow A Song of Ice and Fire fans. I read the original books all in a whack long before the TV show was even announced, and I waited along with everyone else for A Dance with Dragons, which was a day-one purchase for me. I’m one of those people who has always strongly preferred the books to the HBO series, which was one reason why I actually fell off watching, instead content to wait for the book series to conclude the story instead. Given all of that, you’d think I’d be one of the fans begging GRRM to finish The Winds of Winter already, lest I never get closure on the long-running story. Instead, I feel the complete opposite.

Part of my change in opinion is due to the unusual circumstances in which GRRM finds himself. There are very few other examples of a hit book series getting adapted into a show before it has concluded, but I can think of at least one other: Fullmetal Alchemist, for which the original manga had not concluded even as the anime adaptation sped past it and had to invent its own (widely disliked) ending to the story. The manga’s author, Hiromu Arakawa, was still busily writing the rest of her manga, one that ended up with a much stronger conclusion. A whole different anime called Fullmetal Alchemist: Brotherhood got released a few years later — a redo of the whole adaptation concept, this time more faithfully following Arakawa’s story and, perhaps most importantly, her intended ending.

Like Arakawa, George R. R. Martin had some involvement in the TV adaptation of his work, although he claims not to have been as involved in the show’s final seasons. Speaking to the New York Times in 2022, Martin said “by Season 5 and 6, and certainly 7 and 8, I was pretty much out of the loop.” As for why that happened, “I don’t know — you have to ask [showrunners] Dan [Weiss] and David [Benioff].” At that time, Martin had this to say about The Winds of Winter: “My ending will be very different.”

Yet unlike Arakawa, who kept on writing the Fullmetal Alchemist manga at a steady clip (while simultaneously lending an opinion or two towards the two anime adaptations), George R. R. Martin appears to have had several other priorities. These priorities haven’t all been divergent from the characters from A Song of Ice and Fire, of course; Martin’s 2018 novel Fire & Blood, which is set 300 years before the events of A Song of Ice and Fire, led to the TV show House of the Dragon, which Martin has also been working on and has seemed pretty excited about, if his personal blog is any indication. He’s clearly not tired of the world he built and he still has more to say.

It just doesn’t seem like The Winds of Winter fits into the category of things that Martin is excited about.

This is super weird to me, because if I were Martin, the shitshow that was Game of Thrones’ final season and the disappointed fan reactions would be enough to psychologically propel me into angrily and speedily writing a “very different” ending, like the one Martin apparently has in mind. Spite can be a powerful motivator, and if Martin is correct that he got slow-faded out of that Game of Thrones production room, wouldn’t he be even more motivated to correct the record? And yet, here we are, in a world where years and years go by and Martin seems far more interested in telling totally different stories.

This is the part of the article where we’re going to talk about procrastination and why it happens. In my experience, there are two major reasons why it can occur. Again, I’m only talking about myself here; mine is the writer’s brain I know the best, after all. Ever since I got an ADHD diagnosis at the age of 12, I have become extremely familiar with the two kinds of procrastinating that I do.

The first is the better kind, because I personally have found I can solve it with a Wellbutrin prescription. The way it works is simple: It’s just too damn hard to get started on a given task, especially a hard one. Even people without ADHD can understand this experience, but people who have ADHD may experience it in a far more acute way, to an extent that their brain may feel that it is impossible to get started at all. I’ve seen it called “ADHD paralysis.” Whatever you call it, it’s a huge pain in the ass if you want to get some writing done, especially difficult or complicated writing. Imagine this: You have writing you want to do, you have a deadline (maybe one you’ve already blown), you know exactly what you want to write, and you do want to write — you just cannot get yourself to start. Here’s the important part of that sentence: You want to write. The only thing holding you back is your own brain.

Again, I don’t know Martin personally, but based on his updates, I don’t think that’s his problem with The Winds of Winter. I think the problem is that he doesn’t actually want to write it — or even worse, he has no idea how to write it, due to the various plot entanglements the characters in the book now face. This would result in an entirely different kind of procrastination.

This happens to me, too. It also happens to other writers I know, including ones I’ve edited. Sometimes you have an idea and you’re really excited about it, and the pitch gets accepted. But then when you actually start writing, you realize that the idea doesn’t work. Or maybe it just doesn’t excite you anymore. Even though you’ve already started, you just can’t seem to continue, or finish, your original idea. Remember the other type of procrastination, where getting started was the hard part, and once you started, you were off to the races, writing and excited about what you had to say? This isn’t that. This is the opposite, where you’ve started but you’re realizing that you have absolutely no fucking clue what you even want to say, or even if you have anything to say at all. This is around when you go to your editor and you say, “This isn’t working.” Or maybe your editor comes to you and asks why your draft is so late, and you admit defeat. At that point, you can work together to turn the idea into something else. That doesn’t always work, though. Sometimes the only solution is to walk away from the idea entirely.

No one is sadder than I am about this situation. I want to read The Winds of Winter, too. I’ve wanted to read it for a very long time. But you know what makes me even more sad? The past decade of listening to the ways that George R. R. Martin talks about The Winds of Winter on his personal blog, in interviews, and at press events. There is no joy in this man’s eyes. During a live event in October 2023 with fellow author Cassandra Clare, who said her next book is due out in 2025, Martin said in a visibly defeated and frustrated tone, “Really depressing thing is, that still may beat The Winds of Winter. Who knows? […] I’m 12 years late with The Winds of Winter, as we know. I’m just gonna put it right out there. You guys don’t have to pester me about it.”

And yet, people have been pestering him, and they show no sign of stopping. It keeps on feeling like the book is almost ready. Two years ago, Martin told the public the book was “75% done.” But perhaps rather than talking about his recent estimated percentages, it would be easier to link to this extensive Esquire article outlining every single time that Martin has attempted to put a timeline on the book’s completion, ever since he started writing it circa 2010. There have been a lot of bad guesses on this man’s part about how soon he’s going to be able to finish this book. It’s giving Zeno’s Paradox.

Here’s what I can’t stop thinking about: The Winds of Winter is not even the last book in the series. So it’s not like fans are just impatiently waiting for the conclusion. This is actually the penultimate book. So let’s just say George does manage to knock this one out (which I don’t think he will, based on how much difficulty he’s had thus far). Do fans really think that A Dream of Spring is going to come easily to this man, based on how he’s been doing so far with The Winds of Winter?

Just look at the publish dates for every A Song of Ice and Fire book up to now. Starting with A Game of Thrones in 1996, A Clash of Kings in 1998, and A Storm of Swords in 2000, each book was two years apart (impressive!). Then there’s A Feast for Crows in 2005, and A Dance with Dragons in 2011 — five years in between, then six. Now we’re up to a staggering 13-year wait, and counting.

Meanwhile, Martin doesn’t seem to have a problem getting other projects done — like, say, contributing to Elden Ring’s lore — nor any problem with agreeing to do other projects. This only annoys the fans who want him to have a one-track mind for The Winds of Winter. But to those fans I can only say, put yourself in his shoes. You’re a creative person; you want to do projects that excite you. What does it say that he keeps on choosing other things to do? What does it say about The Winds of Winter that it’s always last on the to-do list? In my case, that would be a pretty strong indication that I simply didn’t have any interest in doing the task that I just kept on pushing off, year after year after year after year. And it might even indicate that I was not-so-secretly hoping that particular task would disappear entirely.

George R. R. Martin’s editors are probably not ever going to do this for him. After all, for them, it’s a huge financial boon if he manages to finish the book. Even if it sucks ass, it will sell! The very prospect of ending his contract would be absurd on their end. And yet, having seen so many years go by with no final draft, it very much appears to be torture for him. I can’t condone that. And I’m a little worried about what kind of book could even result from such a death march.

As fans, or just as humans, we need to accept this reality. Stop asking this man to write the book he clearly hates. After all, we did get an ending, in the form of a rushed television finale; several of the plot points in that finale did line up with where a lot of the books’ foreshadowing appeared to be heading. It’s not like we have no closure at all. It’s very sloppy closure, but it’s something. It’s probably about as good as the original Fullmetal Alchemist ending.

I don’t really know what it looks like for us as laypeople to free George R. R. Martin from this situation. Without his publisher actually forgiving him, he probably won’t ever experience the true relief that comes from an editor telling you that you don’t have to keep working on something that you despise and can’t seem to finish or make into a draft that’s any good. It’s sort of like the relief that happens when you get plans canceled that you never wanted to do in the first place, but way better. Since this will probably never happen for Martin, I can only hope that with this essay, I manage to convince just a few other people to stop pestering the man to finish a book that he seems to have no interest in completing. Imagine how bad he must already feel. He doesn’t need any more reminders of the fact that this series ended with a whimper instead of a bang.

At least Elden Ring was a really cool game, start to finish. We’ll always have that. And probably a hell of a lot of other really cool projects from George R. R. Martin that he actually wants to work on. The Winds of Winter just isn’t going to be one of them.

1 public comment

crhill1979 (9 days ago): I found A Dance with Dragons pretty dull, so it didn't seem like he was too into that one either.

hiddeninput (9 days ago): I think it's best that he not finish at this point. Kinda like George Lucas shouldn't have made episodes 1-3.

osopeligroso (9 days ago): I really doubt it’s that he doesn’t want to write it anymore or can’t put words to paper. I think he’s stuck trying to solve a puzzle that there very well may be no solution for. He’s like a chess player staring at the board contemplating innumerable branching and expanding possibilities, but every single one of them eventually ends in checkmate. But this is his life’s work, so he can’t tip over his king and congratulate his opponent—better luck next time! All he can do is keep exploring the ever expanding multiverse in his mind, in the dwindling hope that he will someday stumble upon the one perfect solution that can finally end his torment

Wall Street Journal fires Hong Kong reporter who headed embattled press club


A Hong Kong-based reporter for the Wall Street Journal was terminated by the newspaper soon after she was elected as chair of the Hong Kong Journalists Association.

The HKJA, a press advocacy association, has been accused in recent weeks by state-backed and state-run media outlets in Hong Kong and China of destabilizing the city.

Selina Cheng, the reporter, said in a news conference Wednesday that she believes the termination is related to her role as chair of the organization. She said she came under pressure from her employer to quit the association.

The day before the HKJA election, Cheng said, her supervisors directed her to withdraw her candidacy and to leave HKJA’s board, of which she has been a member since 2021. She declined their requests.

“[I] was immediately told it would be incompatible with my job,” said Cheng. “The editor said employees of the Journal should not be seen as advocating for press freedom in a place like Hong Kong, even though they can in Western countries, where it is already established.”

The HKJA is considered a trade union, and under Hong Kong law, it is legal to be an officer of a union, a right guaranteed by the Basic Law, the city’s mini-constitution.

In an emailed response, a spokesman for Dow Jones, the parent company of the Wall Street Journal, confirmed it made “personnel changes” on Wednesday but said it could not comment on specific individuals.

“The Wall Street Journal has been and continues to be a fierce and vocal advocate for press freedom in Hong Kong and around the world,” the spokesman added.

The termination, if linked to Cheng’s position at HKJA, would be the latest indication of how even large, well-resourced international media organizations are wary about the risks of operating in Hong Kong, a once-freewheeling city that has increasingly come to resemble mainland China in its suppression of civil liberties, including press freedom.

In the wake of mass protests in 2019, Beijing passed a national security law in Hong Kong that established punishments of up to life imprisonment for vaguely described crimes, such as subversion of state power and colluding with foreign forces.

These laws, alongside a new set of domestically focused national security laws passed this year, have had the effect of altering every institution in Hong Kong, from the courts to universities and newsrooms. After the passage of the national security law, the New York Times relocated its Hong Kong digital operation to Seoul, saying there was “a lot of uncertainty” about what the changes would mean for its operations and journalism.

Earlier this year, the Wall Street Journal said it was shifting its Asia headquarters from Hong Kong to Singapore and laid off a number of Hong Kong-based reporters. Cheng’s role was not affected at the time, and she continued to be based in and employed in the city. Cheng, 32, covers the Chinese auto industry, which the Journal has said is one of its priority coverage areas. In terminating her on Wednesday, editors cited restructuring, she said.

In a statement, the HKJA said the Journal is “not alone” in taking this stance and that other elected board members have been “pressured by their employers to stand down.” Previously, the Journal’s management in Hong Kong told one of its now-former reporters, technology reporter Dan Strumpf, not to run for president of the Foreign Correspondents’ Club of Hong Kong, citing risks to the company.

The HKJA remains a vocal group advocating for journalists in Hong Kong, both local and foreign. In a piece earlier this month, the Global Times, a Chinese state mouthpiece, said the association had a “spotty history of colluding with separatist politicians and instigating riots in Hong Kong” and was “by no means a professional organization representing the Hong Kong media.”

The Global Times highlighted Cheng’s reporting for the Journal, which it said attacked the national security law, and the reporting of two other board members: James Griffiths, a correspondent for the Canadian-based Globe and Mail, and Theodora Yu, a freelancer who was a former employee of The Washington Post.

Hong Kong security chief Chris Tang Ping-keung has also attacked the HKJA, saying it had stood with the “black-clad violent mob” during the 2019 protests.

In its statement, the HKJA called on all media outlets working in China “to allow their employees to freely advocate for press freedom and better working conditions in solidarity with fellow journalists in Hong Kong and China.”


Optimizing Large-Scale OpenStreetMap Data with SQLite


Over the past year or two, I’ve worked on a project to convert a massive dataset into an SQLite database. The original data was in a compressed binary format known as OSMPBF, which stands for OpenStreetMap Protocol Buffer Format. This format is highly compact and compressed, making it difficult to search. The goal of converting it into an SQLite database was to leverage SQLite’s search functionalities, such as full-text search, R-tree indexes, and traditional B-tree indexes on database table columns.

The OpenStreetMap (OSM) data is categorized into three main elements: nodes, ways, and relations. A node represents a single latitude-longitude point, akin to a point along a trail. A way is a series of nodes forming a path that can be a shape. A relation is an element that can include other relations, ways, or nodes, such as an entire trail system. Each component can have metadata associated with it, documented in a well-maintained OSM wiki.
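
As a rough mental model (these are hypothetical types for illustration, not the author's actual code), the three element kinds might look like this in Go:

```go
package main

import "fmt"

// Node is a single latitude/longitude point. Every element carries free-form tags.
type Node struct {
	ID   int64
	Lat  float64
	Lon  float64
	Tags map[string]string
}

// Way is an ordered series of node references forming a path or shape.
type Way struct {
	ID      int64
	NodeIDs []int64
	Tags    map[string]string
}

// Relation groups nodes, ways, and other relations (e.g. a whole trail system).
type Relation struct {
	ID      int64
	Members []Member
	Tags    map[string]string
}

// Member is one reference inside a relation.
type Member struct {
	Type string // "node", "way", or "relation"
	Ref  int64
	Role string
}

func main() {
	n := Node{ID: 1, Lat: 42.36, Lon: -71.06, Tags: map[string]string{"amenity": "cafe"}}
	fmt.Println(n.Tags["amenity"]) // cafe
}
```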

I’m using the Open Street Map data of the entire United States for project. The stats of the file are around 1.4 billion entries, 9GB in size, and tags and bounding boxes are deduplicated to save space.

My first task was to transfer this OSM data from its compressed file format into SQLite. Given the inconsistent tagging across different elements, I used a JSON data type for tags while keeping other consistent information, such as latitude, longitude, and element type, in regular columns. This initial SQLite database was enormous, around 100 gigabytes for the United States, which necessitated determining which data was essential and how to optimize searches.

CREATE TABLE entries (
	id       INTEGER PRIMARY KEY AUTOINCREMENT,
	osm_id   INTEGER NOT NULL,
	osm_type INTEGER NOT NULL,
	minLat   REAL,
	maxLat   REAL,
	minLon   REAL,
	maxLon   REAL,
	tags     BLOB, -- key-value pair of tags (JSON)
	refs     BLOB  -- array of nodes, ways, and relations
) STRICT;

For instance, a query like “Find all the Costcos” would be practical, but due to the vast dataset, running a query took over a minute. I realized I needed to process the data further. By filtering down to elements with specific tags like name, shop type, and amenity, I reduced the database size to about 40 gigabytes. Although searches became faster, they were still too slow for practical use, often taking tens of seconds.

To improve query performance, I explored SQLite’s indexing capabilities. While SQLite doesn’t support the same JSON indexing as Postgres, I could create indexes for individual tags within the JSON.

CREATE INDEX entries_name ON entries(tags->>'name');

However, this requires an index per tag, which won’t scale, especially for a dynamic list of tags. SQLite does offer full-text search for unstructured text, such as a document. I adapted this by concatenating JSON keys and values into a single string for full-text indexing, using the following SQL:

CREATE VIRTUAL TABLE search USING fts5(tags);

WITH tags AS (
	SELECT
		entries.id AS id,
		json_each.key || ' ' || json_each.value AS kv
	FROM
		entries,
		json_each(entries.tags)
)
INSERT INTO
	search(rowid, tags)
SELECT
	id,
	GROUP_CONCAT(kv, ' ')
FROM
	tags
GROUP BY
	id;

This approach, combined with the porter tokenizer, allowed me to write fast queries. For example, searching for “Costco” became incredibly fast, under a millisecond, though it sometimes returned partial matches like “Costco Mart.”

SELECT rowid FROM search WHERE search MATCH "Costco";

Queries with tag-specific values (i.e., amenity=cafe) can use text search:

SELECT rowid FROM search WHERE search MATCH "amenity cafe";

This will return results with the words amenity and cafe appearing in the full-text index. It does not ensure that the tag equals that specific value. At the moment, it is best effort, so there are false positives when returning results.

Despite these improvements, the 40-gigabyte file was still unwieldy. This is a read-only data set, so there may be ways to compress the data. There are commercial solutions for this, provided by the core maintainers. SQLite’s virtual file system (VFS) feature exposes an interface for all file operations, which allows different file-like systems to be used for storage, such as blob stores, other databases, etc.

Initially, I used GZIP compression via Go’s built-in functionality, but it proved too slow due to the need to decompress large portions of the file for random reads. It appears that the whole file has to be decompressed before reading parts of it.

Further research led me to Facebook’s Zstandard (ZSTD) compression, which supports a seekable format suitable for random access reads. This format maps well to SQLite’s page size for writing data to the file.
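
To illustrate the seekable idea (a simplified sketch, not the author's VFS code): the data is compressed as many small, independent zstd frames, and an index maps uncompressed offsets to frames, so a random read only decompresses the one frame it touches. The in-memory frame index below is an assumption; the real seekable format stores it in a skippable frame at the end of the file. Decompression here uses the klauspost/compress/zstd package:

```go
package main

import (
	"fmt"
	"sort"

	"github.com/klauspost/compress/zstd"
)

// frame records where one independently compressed zstd frame lives in the
// compressed file, and where its contents fall in the uncompressed stream.
type frame struct {
	compOff, compLen     int64
	uncompOff, uncompLen int64
}

// readAt decompresses only the frame containing the requested uncompressed
// offset, then slices out the n bytes starting at off. For simplicity the
// whole compressed file is held in memory here.
func readAt(compressed []byte, frames []frame, off, n int64) ([]byte, error) {
	// Binary search for the frame whose uncompressed range contains off.
	i := sort.Search(len(frames), func(i int) bool {
		return frames[i].uncompOff+frames[i].uncompLen > off
	})
	if i == len(frames) {
		return nil, fmt.Errorf("offset %d is past the end of the data", off)
	}
	f := frames[i]

	dec, err := zstd.NewReader(nil)
	if err != nil {
		return nil, err
	}
	defer dec.Close()

	plain, err := dec.DecodeAll(compressed[f.compOff:f.compOff+f.compLen], nil)
	if err != nil {
		return nil, err
	}
	start := off - f.uncompOff
	return plain[start : start+n], nil
}

func main() {
	enc, err := zstd.NewWriter(nil)
	if err != nil {
		panic(err)
	}
	defer enc.Close()

	// Compress two independent frames; in the real database each frame
	// would be aligned to SQLite's page size.
	pages := [][]byte{[]byte("page-one"), []byte("page-two")}
	var compressed []byte
	var frames []frame
	var uncompOff int64
	for _, p := range pages {
		c := enc.EncodeAll(p, nil)
		frames = append(frames, frame{
			compOff: int64(len(compressed)), compLen: int64(len(c)),
			uncompOff: uncompOff, uncompLen: int64(len(p)),
		})
		compressed = append(compressed, c...)
		uncompOff += int64(len(p))
	}

	b, err := readAt(compressed, frames, 10, 3)
	fmt.Println(string(b), err) // "ge-" <nil>: bytes 10..12 of the logical stream
}
```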

Compressing the SQLite file with ZSTD reduced its size to about 13 gigabytes. Below is a benchmark of the compressed and uncompressed SQLite databases, run against a test database of a million entries, each with a random string of text.

BenchmarkReadUncompressedSQLite-4              	  159717	      7459 ns/op	     473 B/op	      15 allocs/op
BenchmarkReadCompressedSQLite-4                	  266703	      3877 ns/op	    2635 B/op	      15 allocs/op

Note: The benchmark (via Go) shows that reading the ZSTD-compressed database is faster than reading the uncompressed one. My hypothesis is that this is because the compressed database is small enough to be decompressed once and held entirely in memory.

There was a performance hit with the entire database of OpenStreetMap data. I believe it has to do with how much data there is compared to the test benchmark above. However, having a compressed database, where a query cost is still sub-50 milliseconds, is helpful.

I’ve not further optimized the size of the file. I’m pretty happy with this. I have a TODO for myself to rewrite the ZSTD VFS in C instead of Go.

I want to reduce the number of false positives for the query amenity=cafe. Using the full-text index, it returns results containing the two words, as tags are not individually indexed.

When using the FTS5 virtual table, it turns out that constraints can be used on the original data. The index of the full-text search is used first (according to the query planner), so we can filter that subset down with a more familiar SQL constraint.

SELECT
	id
FROM
	entries e
	JOIN search s ON s.rowid = e.id
WHERE
  -- use FTS index to find subset of possible results
	search MATCH 'amenity cafe'
	-- use the subset to find exact matches
	AND tags->>'amenity' = 'cafe';

The equals constraint does not use an index, but since it is done on a subset of results, the operation cost is small. The query is still sub-50ms.

All this provides read-only, SQL-queryable data in a single file representing OpenStreetMap metadata. The project evolved from a mere format migration into optimizing the data for efficient search. This highlights the importance of iterative refinement and the power of combining different technologies to solve problems.


The Momentous Decision New York Almost Made
