Updates from the Grid
Quite a bit of progress on the grid stability front this past month. LLnet has finally been put into operation and we've done a great deal of work to analyze and distribute query load from the central database. Landon Linden talked about some of the benefits already being seen with LLnet, while Sardonyx Linden and Which Linden discussed some of our major architectural challenges in scaling our infrastructure.
Right now, we're at a critical stage in our internal architectural review. While the database query work created headroom in both CPU and I/O utilization, those efforts only buy us near- to mid-term database stability. What we are working hard to reach is a final decision on a scalable architecture that, by design, will provide stability. We've had an internal team conducting a comprehensive evaluation over the past 60 days, and a preliminary proposal is beginning to circulate within Linden. My expectation is that we will formalize our direction over the next 30 days and begin to communicate how we will scale while preserving stability, redundancy, and quality.
In the meantime, I wanted to talk about some other ongoing infrastructure efforts.
- LLnet: We continue to work to take further advantage of our private network. Our current focus is renumbering our server IP scheme so that we can direct public IP traffic over our private network (currently only our private Linden traffic uses LLnet). This change will improve redundancy within our infrastructure by giving us a way to route traffic to a data center that has been locally isolated by a Level 3 problem. That is exactly what happened late last year (December 2008) in the Phoenix area, when such an outage cut off all access to the simulators and databases in Phoenix. Once the renumbering project is complete, we will be able to route around any local market outage by moving traffic over LLnet from an unaffected data center (like San Francisco or Dallas); a rough sketch of this failover idea follows this list.
- Data Centers: Physical infrastructure, while not the most discussed part of the grid, is foundational to a stable, scalable grid. To that end, we intend to extend our physical infrastructure in a more geographically balanced way (right now all of our data centers are in the western US). We are targeting a data center presence in the eastern US (likely in northern Virginia) and will be moving out of our San Francisco facility. This has a couple of benefits for stability and performance. First, it pushes the grid closer to a large resident base (the eastern US and Europe). Second, we are going to reposition our central database (likely to our Dallas data center) to better address back-end latency between simulators and the central database, since the majority of our sims are already in the Dallas facility. Finally, we are also evaluating cloud computing and virtual environments in parallel, with the hope that we can rapidly extend the grid beyond US borders and better serve our residents, not only in Europe, but in Asia, Australia, and Latin America as well. I can tell you that we already have an instance of Second Life operating on Amazon's EC2 platform, and the results are promising.
- Assets: Our number one resident concern, and always at the top of my list. I blogged previously about a major project called Agent Inventory Services (AIS), which is intended to better manage messaging between viewer and simulator when fetching assets into inventory. It's been a very long project, but one with great promise to improve the reliability of asset retrieval. In many cases, residents believe assets have been lost when, in fact, they simply are not being presented in the viewer because of messaging errors between viewer and simulator. AIS is designed to fix this problem, and it is packaged for release with our server update in June; a simplified sketch of the failure mode AIS targets also appears below.
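To make the LLnet redundancy idea a bit more concrete, here is a minimal sketch of the failover logic: prefer the public path into a data center, and fall back to the private LLnet path when the public route is isolated. The data center names, hostnames, and health-check mechanism below are purely illustrative assumptions on my part, not our actual routing configuration.

```python
# Conceptual failover sketch: if the public route into a data center is down,
# reach it over the private LLnet path from an unaffected facility instead.
# All hostnames and the health-check approach are hypothetical placeholders.
import socket

# Hypothetical per-data-center routes: a public path and a private LLnet path.
DATACENTERS = {
    "phoenix": {"public": ("public.phx.example.net", 443),
                "llnet":  ("llnet.phx.example.net", 443)},
    "dallas":  {"public": ("public.dfw.example.net", 443),
                "llnet":  ("llnet.dfw.example.net", 443)},
}

def path_is_up(host, port, timeout=2.0):
    """Crude reachability check: can we open a TCP connection on this path?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def choose_route(dc_name):
    """Prefer the public path; fall back to LLnet if the public route is
    isolated (for example, by an upstream provider outage)."""
    routes = DATACENTERS[dc_name]
    if path_is_up(*routes["public"]):
        return "public", routes["public"]
    if path_is_up(*routes["llnet"]):
        return "llnet", routes["llnet"]
    raise RuntimeError("%s is unreachable on both paths" % dc_name)

if __name__ == "__main__":
    via, endpoint = choose_route("phoenix")
    print("Routing Phoenix traffic via the %s path: %s:%d"
          % (via, endpoint[0], endpoint[1]))
```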
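And to show the kind of messaging failure AIS is meant to address, here is a simplified sketch assuming a hypothetical fetch_batch() transport call: if a batch of inventory descriptors is lost in transit and never re-requested, the viewer presents an incomplete inventory and the items look "lost," even though they are safe on the back end. Re-requesting failed batches keeps transient errors from hiding assets. This illustrates the failure mode only; it is not AIS's actual protocol or message format.

```python
# Simplified illustration of the "missing inventory" failure mode: code that
# silently drops a failed batch shows an incomplete inventory, while retrying
# until the batch succeeds avoids hiding items behind transient messaging
# errors. fetch_batch() is a hypothetical stand-in for the real transport.
import time

class TransientMessagingError(Exception):
    """Stand-in for a dropped or garbled viewer/simulator message."""

def fetch_batch(folder_id, offset, limit):
    """Hypothetical transport call: returns a list of item descriptors, or
    raises TransientMessagingError when a message is lost in transit."""
    raise NotImplementedError("replace with a real transport")

def fetch_inventory(folder_id, total_items, batch_size=50, max_retries=3):
    items = []
    for offset in range(0, total_items, batch_size):
        for attempt in range(max_retries):
            try:
                items.extend(fetch_batch(folder_id, offset, batch_size))
                break
            except TransientMessagingError:
                # Back off and re-request the same batch rather than silently
                # presenting a partial inventory to the viewer.
                time.sleep(2 ** attempt)
        else:
            raise RuntimeError("batch at offset %d was never received" % offset)
    return items
```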
The one area I really wanted to talk about this month, but didn't have much to discuss, is object rez time. It is still appreciably slow for me, and I know for many of you. We have looked at the problem from a number of angles, and right now we are researching the impact of a server update that took place in late 2008 (when I started to notice issues). At this point, the only thing I can report is that our upload of some assets to bulk storage (S3) is not a component of this issue. Beyond that, I don't have much to report, other than that this remains a major issue for me on the performance/quality side.
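For those curious how we go about ruling components in or out, the general approach is simple instrumentation: time the same asset retrieval against different back ends and compare the results. The sketch below shows the idea; the endpoints are placeholders I've made up, not our actual asset URLs or test harness.

```python
# Rough timing comparison: fetch the same asset from different back ends and
# report how long each path takes. The URLs below are placeholders.
import time
import urllib.request

ASSET_SOURCES = {
    "asset_cluster": "https://assets.example.net/texture/abc123",
    "bulk_storage":  "https://bulk.example.net/texture/abc123",
}

def time_fetch(url, timeout=30):
    """Download the asset once and return the elapsed wall-clock time."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.monotonic() - start

if __name__ == "__main__":
    for name, url in ASSET_SOURCES.items():
        try:
            print("%-14s %.2fs" % (name, time_fetch(url)))
        except OSError as err:
            print("%-14s failed: %s" % (name, err))
```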