March 17 marked the first Sirius Madness, a new name for the always-awesome annual event Varrow Madness. I’m happy to report that this year’s Madness felt like the conference we’ve come to expect every year (with just a little less orange).
This time I presented an intro to OpenStack. Until mid-2015 I wasn’t super familiar with OpenStack, so I decided it was time to learn it and share that information.
Here’s the slide deck from that Sirius Madness session, with some extra content I added after OpenStack Summit in Austin.
Check it out, Cisco UCS has quietly gotten a little more flexible! Last week I was in a meeting with a Cisco SE who pointed out a feature of UCSM 2.2(4) that I had overlooked: “Server Pack”. Here are the release notes, and the excerpt:
Server Pack—This feature allows you to support new server platforms on existing infrastructure without requiring a complete firmware upgrade. In this model, new B/C server bundles enabling the new servers will be supported on the previous infrastructure A bundle. For example, Release 2.2(5) B/C server bundles are supported with Release 2.2(4) infrastructure A bundle as highlighted in Table 2. New features introduced in the B/C bundles may only become available after upgrading the A bundle to the respective version.
The Server Pack feature provides the additional flexibility of adding new server platforms to active UCS domains without incurring the operational overhead of upgrading firmware across the whole domain.
So, once they’re on 2.2(4), customers who purchase blades with newer firmware won’t necessarily have to immediately upgrade the entire UCS system.
Cisco has released a new model for the MDS product line, the 9396S. I dug into the 9396S datasheet and compared it to the 9148S datasheet to see where it really made sense. As the ‘S’ implies, both switches support 2, 4, 8, and 16 Gb FC. Both have 16 Gb dedicated per port. The only real per-port functional difference I see is in the way buffer credits are handled, and in most cases that won’t matter. Both approaches take the same amount of rack space: 2 RU for 96 ports.
So, what’s the point? I did some cost comparisons, and a 9396S fully populated with 16 Gb FC SFPs will cost more than two 9148S switches with the same configuration. The tradeoff is in the ISLs. ISLs seem harmless enough and are very useful when expanding an existing fabric. The danger is that ISLs can easily become a bottleneck, and one that is not always easy to see. If I chose to save a little money and use two 9148S switches in a new fabric, I might be able to locate the most demanding workloads on the same 9148S as the storage arrays, and put less demanding workloads on the other 9148S. This would reduce the amount of traffic that traverses the ISLs between the switches. This seems reasonable, but there are two problems with the strategy.
The first problem is that most environments won’t be able to isolate their storage workloads to just a few ports. Most companies are virtual-first, and the heavy workloads move from host to host. There’s a good chance, in an environment that needs more than 48 ports per fabric, that all those HBAs won’t fit on the same 9148S as the storage array they’re talking to. That means at least some of those connections will be traversing the ISLs.
The second problem is growth. If we’re designing and implementing a new pair of storage fabrics, we can optimally place all of our components. Everything will work great and we’ll all be happy. Then, in a few months or a year, we’ll have to add hosts to accommodate new workloads, and we won’t have ports adjacent to our storage arrays. Very few shops want to redesign their storage fabrics every time they add a host. Or, hosts will often get reassigned at the VMware level (or whatever hypervisor you happen to be running; chances are it’s vSphere). When that happens, the storage workload may change drastically without the storage team being aware. The end result is the same: a lot of I/O traversing ISLs that may or may not be sized to handle it.
The 9396S is there to prevent this. By using a single switch in each fabric instead of two switches with ISLs, we get a simpler design that can grow, and workloads can move around without creating new bottlenecks. That holds at least up to 96 ports; beyond that, it’s worth considering a move up to a director-class switch.
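To make the bottleneck concrete, here’s a back-of-the-envelope oversubscription calculation. The port counts, ISL counts, and function name are made-up numbers for illustration, not from any real design.

```python
# Back-of-the-envelope check of how oversubscribed the ISLs can get in a
# two-switch fabric. All numbers below are illustrative assumptions.

def isl_oversubscription(host_ports_crossing, port_speed_gb, isl_count, isl_speed_gb):
    """Worst-case ratio of host bandwidth crossing the ISLs to ISL capacity."""
    demand = host_ports_crossing * port_speed_gb   # every port busy at line rate
    capacity = isl_count * isl_speed_gb
    return demand / capacity

# Example: 24 host ports land on the opposite 9148S from their storage array,
# with 4 x 16 Gb ISLs between the switches.
ratio = isl_oversubscription(host_ports_crossing=24, port_speed_gb=16,
                             isl_count=4, isl_speed_gb=16)
print(f"{ratio:.1f}:1 oversubscription")  # prints "6.0:1 oversubscription"
```

Worst case is pessimistic, of course, but it shows how quickly the math turns against a two-switch fabric once workloads drift away from their arrays.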
One of the best things about my job is that I get to talk to customers every day about the challenges they’re facing. I get to act as a pollinator, collecting ideas and perspectives, adding my own into the mix, then sharing them and helping those customers determine what solutions are available and which might be a good fit for their environment.
Over the last year, I’ve seen the conversations turn more and more toward data protection. Customers are facing tighter service level objectives with much more data, and the industry offers more options, so traditional scheduled backups are no longer adequate to address every workload in the environment. Here’s what I think is the best approach to getting data protection under control. This is a simplified version; we can get a lot more sophisticated to address specific situations.
Business requirements and data classification
The first step is the hardest, and progress usually stalls here. Business requirements are crucial to having a *good* data protection strategy. Without business requirements, an IT organization will have to “give it their best shot”, and that may or may not be in line with what upper management, or the business unit, or the application owner expects. Another way of looking at it is that we need the business requirements to build a good design, which we need to have a successful implementation, which we need before we migrate or transition from the existing system, which we will then operate as our live environment. At Varrow, and now as Sirius, we designed our services around this concept, summarized as “Assess > Architect > Implement > Transition > Operate”. The output of each step feeds directly into the next, like a flow chart. As technical people, we like to start in the Architect (or Design) phase, or sometimes in the Implement phase, and figure it out as we go. The downside of that is that we (the IT organization) end up making up our own business requirements, using assumptions that may not be in line with reality. Based on the conversations we have on a weekly basis, those assumptions very likely *don’t* line up with reality, and no one finds out until things go sideways.
So back to the point: a lot of customers don’t know how to get the business requirements, or they don’t have time to do that while simultaneously running their environment. In these cases, I recommend a data classification assessment and data protection assessment. This is not just a tool that crawls your environment and spits out a report showing where your data lives. Instead, it should be interview driven so that the business units or application owners can give their perception of what service level they think they’re already getting, as well as what service level they believe their data requires. The results of this kind of assessment are usually eye-opening for an organization, and can really help IT justify the requests they make. If your IT organization already has the business requirements (not just “I’ve worked here for years and know what they expect”, but real requirements from the business units for each of the many types of data that exist in the environment), then it’s ready for designing a solid data protection strategy.
Designing for data protection
Here’s the part that customers usually jump to first, but that requires the groundwork in the section above: determining which technologies make the most sense in your environment. Most likely, the business requirements will show a small amount of data that’s very, very important, and large amounts of data that the company can function without for a reasonably long period of time. So, in most environments, if an IT org used a single method to address all of the data in their environment, they would end up either paying too much or failing to meet their most demanding RTOs/RPOs.
Because of this, the vast majority of customers will want to establish tiers with different service level objectives (RPO, RTO, uptime), plus a tier 0 for the basic infrastructure that everything else depends on. Just to reiterate: these are design decisions that should be driven by the business requirements. For example, if for whatever reason your organization only has a single type of data, then maybe a single tier and single approach would be just fine for you. (But that case is an exception.)
Once we have these tiers defined, then, finally, we can start talking about technologies to meet those SLOs. Here’s an example of what these tiers might look like. For simplicity, I’ll keep RPO and RTO equal, though there are many situations where they might be different.
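As a rough sketch, tiers like these can be written down as a simple lookup. The hour values and approaches below are illustrative assumptions for an example, not recommendations; real values must come from the business requirements.

```python
# Illustrative protection tiers (RPO = RTO for simplicity). The numbers here
# are assumptions made up for this sketch, not prescriptions.
TIERS = {
    0: {"rpo_hours": 0,  "approach": "HA / application-level failover"},
    1: {"rpo_hours": 1,  "approach": "CDP-style replication"},
    2: {"rpo_hours": 4,  "approach": "array or hypervisor snapshots"},
    3: {"rpo_hours": 24, "approach": "traditional scheduled backup"},
}

def tier_for(required_rpo_hours):
    """Pick the cheapest (highest-numbered) tier that still meets the RPO."""
    eligible = [t for t, slo in TIERS.items()
                if slo["rpo_hours"] <= required_rpo_hours]
    return max(eligible)

print(tier_for(8))  # prints 2: snapshots are enough for an 8-hour RPO
```

The point of `tier_for` is the design rule, not the code: always place a workload in the least expensive tier that still meets its requirement.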
Armed with this, we can determine technologies that will address each of the tiers. I think it’s best to start with the Tier 3 data and build upwards. Most organizations already have a backup application, so as long as it’s robust and reliable it should be able to accommodate Tier 3.
Keep in mind that speeds and feeds (read throughput, concurrent threads, etc.) become important when designing for a specific RTO. If your backup target can only restore half of your data in 24 hours, then you won’t meet a 24-hour RTO if your only storage array fails. In that scenario, you’ll also need something to restore that data onto, so the rabbit hole can get pretty deep pretty quickly, depending on what kinds of failures need to be protected against.
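The throughput side of that check is just data size divided by RTO. A quick sketch, with a hypothetical function name and made-up numbers:

```python
# If the backup target can't stream data back fast enough, the RTO fails on
# throughput alone, before anything else is considered.

def required_restore_rate_tb_per_hour(data_tb, rto_hours):
    """Minimum sustained restore throughput needed to hit the RTO."""
    return data_tb / rto_hours

# Restoring 100 TB inside a 24-hour RTO needs roughly 4.2 TB/hour sustained,
# around the clock, which is on the order of 1.2 GB/s.
rate = required_restore_rate_tb_per_hour(100, 24)
print(f"{rate:.1f} TB/hour")
```

Compare that number against what your backup target can actually sustain on restore (which is often lower than its backup ingest rate), and the gap shows up immediately.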
Another point is that recovery in this context is only one aspect of “backups”. Backups will actually be used most often for operational recovery (“I’ve lost a file or server due to corruption or misuse, please restore it for me”), and they’re also used for retention, compliance, and other purposes. So what we’re doing is making the most of our backup solution by leveraging it for our recovery plans, and squeezing more value from it.
Moving up to Tier 2 in our example, we would probably want to use array-based or hypervisor-based snapshots to meet a 4-hour RPO/RTO. Restoring from backup may be possible for a subset of data, but for a larger recovery it’s unlikely that 4 hours is enough time to restore from backup. (If it is, you may have oversized your backup solution.) Ideally, these snapshots would be managed from the backup software, and we’re seeing an industry trend toward that.
Tier 1 gets more demanding, and more expensive. We probably don’t want to take snapshots that often, for a couple of reasons. First, it’s demanding on the array or hypervisor. Second, it’s probably triggering VSS in the guest OS, and doing that in rapid succession can have performance implications. So, to address this tier, we would probably recommend some kind of CDP (Continuous Data Protection) solution. If the price makes sense, this could be the same technology we’d use for Tier 2 as well.
Finally, for Tier 0 apps the requirements delta may seem small, but there’s a big difference, and the price usually goes up pretty sharply. Basically, here all we have time for is a guest OS reboot or an application-level failover. That means this would have to be an HA event, not a restore or DR-type failover. We’ll need to remove any SPOFs all the way down the stack, including storage. For commodity servers, we would probably look at something like a VMware vSphere Metro Storage Cluster (vMSC).
Application Level Data Protection
Another industry trend that makes a lot of sense is the idea of moving data services up the stack, to the application layer. You can get a lot more intelligence built in by handling it there, at the cost of decentralizing management and reporting. For example, your Exchange admin may be running DAG, so she gets more intelligent failover, performance isn’t impacted by backups, and using local storage instead of a storage array removes the SPOF of a shared volume. In this example, the backup team probably doesn’t have visibility into the status of replication or the separate copies, leaving a blind spot when it comes to managing and reporting on the overall health of the enterprise as it pertains to data protection. To address that, all of the major backup products are moving to integrate application-level data protection with the overall data protection management platform. That would give you the benefits without losing visibility. This approach still requires the application owner to be responsible for managing the data protection. Some organizations may have historically kept that responsibility with the backup team, but as we get into these narrower SLOs, it becomes more expensive to build this level of protection into the infrastructure.
Okay, so what would this look like in real-world terms, instead of nebulous concepts?
I’ll start with commodity servers, like a Windows or Linux VM that runs an application service or daemon that doesn’t have native data protection built in. I’ll use EMC products in the example because I’ve spent most of my time working with EMC products over the last 5 years. There are other storage and replication products that can accomplish similar things.
The recovery automation for Tier 1 could be VMware Site Recovery Manager, a 3rd party product or it could be scripted. For virtual machines, it could also be hypervisor based replication plus SRM or replaced with something like Zerto that does the replication, CDP, and failover automation.
Note that these tier numbers are subjective and you might choose to make this Tier 1 through 4 and reserve Tier 0 for infrastructure only.
All of these require compute capacity to already be in place for recovery at the time of the outage. If your recovery plan can tolerate a 5-day RTO, for example, then you might be able to get by with procuring server hardware after the incident occurs. If you’ve priced out these products, you’ll know that the price goes up quickly from Tier 3 to Tier 0. These kinds of decisions, with their associated costs, are why business requirements are important.
What if it’s Microsoft Exchange, or SQL, or Oracle?
Because we don’t need infrastructure-level replication and HA, the cost is probably lower, and the tradeoff is that, operationally, the Exchange team has to manage all of their copies and local storage instead of leaving that to the infrastructure team.
The same approach can be used for Microsoft SQL with SQL Always On: perhaps a synchronous copy for the top-tier workloads and an asynchronous copy for longer SLOs and site resiliency. Oracle can do this with Data Guard. In each case, the application will probably handle recovery better, and the cost will probably (but not necessarily) be lower by handling data protection natively. Also in each case, that pushes the responsibility away from the infrastructure team and over to the application team, so the operational aspects have to be planned out and well understood. There are also reporting tools, like EMC Data Protection Advisor and Rocket Servergraph, that give management teams visibility into overall data protection health while different aspects are handled by different products that may not integrate directly.
While there are a lot of flashier new technologies out there that we’re all excited about, I’m having these kinds of data protection conversations often enough that I wanted to write this up.
Holy cow, it’s been a long time since I updated this blog. Hopefully I can avoid another dry spell like that, and have more frequent articles. Anyway, World Backup Day is today!
Since most of the readers here are familiar with backups in general, I thought I’d dive into a different aspect of backups: data retention, and where it should occur. As you probably know, data retention refers to how long we keep data. For backups, we usually set up retention policies and apply them to different backup sets. When a customer wants to talk about backups, I’ll ask what their retention policies are, and often I’ll get an answer something like “we keep a few things ‘indefinitely’ and the rest for 60 days.” Personally, I cringe when I hear “forever” or “indefinitely”, for several reasons.
First, it’s impractical. The LTO-2 tape on which you backed up that data 6 years ago *cannot be read* by your modern LTO-6 tape library. LTO as a rule will only go back 2 generations, so today’s LTO-6 can read back as far as LTO-4. Are you going to restore that LTO-4 data and re-back it up to LTO-6 tapes today, so it can be read by LTO-8 drives in a few years? Or do you want to keep several unopened LTO-4 tape drives on the shelf, with no support available, on the off chance you need to restore one of those tapes? Or do you keep that data on disk “forever”? Depending on the size of the data, that could get expensive.
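The generational rule described here is simple enough to write as a one-line check. The function name is mine, and the two-generations rule is the traditional one this post relies on; always confirm against the drive vendor’s compatibility matrix for current generations.

```python
# Read compatibility under the traditional LTO rule: a drive can read its own
# generation and the two before it.

def lto_drive_can_read(drive_gen, tape_gen):
    """True if an LTO drive of drive_gen can read a tape of tape_gen."""
    return 0 <= drive_gen - tape_gen <= 2

print(lto_drive_can_read(6, 4))  # prints True:  LTO-6 reads back to LTO-4
print(lto_drive_can_read(6, 2))  # prints False: LTO-2 is out of reach
```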
Second, it’s risky. I’ve had legal departments tell me that they don’t want to keep ANY data longer than strictly required by corporate-level policy and/or compliance. That data is discoverable if there’s a relevant lawsuit, and if something in there opens the company to liability but is not required to operate the business or meet compliance, they want it gone. From what I understand, if *any* data is available for a certain timeframe, everything is discoverable for that timeframe, so any solution that spans back dozens of years had better be tight.
Third, and here’s the big one that really started to make a lot of sense to me recently: if you’re required to keep that data for X number of years for compliance, there’s a very good chance the application is doing that internally. This could be done by archiving records to a separate table in the application, keeping them in the live table, dumping them out to a flat file on the app owner’s file share, or something else. As a backup admin, I may not know that. But what I *do* know is that I don’t want to keep those backups for 7 years, for example, if the application is already archiving records for 7 years internally. This is important, because if I have 7 years of data in the application, and I keep the latest backup for 7 years, then I’ll potentially have 14 years’ worth of data retained: a 7-year-old backup that contains data that’s 7 years older than that. That opens us up to the risk in the second point, AND it’s wasteful of backup resources: tapes, library time, backup admin time. For this reason, I highly recommend working with the owners of any data that has a long-term retention requirement like this, to determine how much of it they’re handling inside the app.
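The retention-stacking arithmetic in that point is worth writing down; a tiny sketch, with the function name assumed:

```python
# Retention stacking: a backup kept for backup_retention_years contains records
# that were already up to app_retention_years old when the backup ran, so the
# oldest recoverable record is the sum of the two.

def oldest_recoverable_record_years(app_retention_years, backup_retention_years):
    return app_retention_years + backup_retention_years

print(oldest_recoverable_record_years(7, 7))  # prints 14
```

Seven years in the app plus a 7-year backup retention quietly becomes 14 years of recoverable data, which is exactly the legal exposure described above.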
You *could* determine that you no longer need to keep those backups for years and years. You might only need 30 or 60 days retained in backups. And suddenly, your backup data might all fit into your existing backup-to-disk target (Data Domain for example) and not need to go to tape at all. There are shops with highly regulated data sets that do this today. The recovery time and recovery point get much better when all restores can be done from disk. Managing a tape solution also takes a lot of time and money that can be freed up.
So in closing, it’s worthwhile to really know where the retention is happening, and where it should be happening.
Quick shout out to my friend and colleague Jason Jones. I’ve learned a lot about backups over the last couple years as we worked together on solutions for multiple customers.
Kyle Quinby and I have submitted our “Virtualizing SQL” session to VMworld 2013. I’m really excited about the possibility of presenting there. If you had a chance to attend the session at Varrow Madness, if you’d like to see it at VMworld, or if you’d just like to help us get selected, please go over to the VMworld 2013 site and vote. It’s session 4809. Voting should only take a minute; it’s basically just a “thumbs up” button next to the session in the list, once you’ve registered a VMworld account.
Several Varrow teammates also have sessions; some are VMworld veterans, and for some (like me) this would be the first time presenting there. Here’s a full list of Varrow sessions (copied from Jason Nash’s blog):
Unlisted vSphere Distributed Switch Deep Dive
Jason Nash, Varrow
Jason says: “This will be a revised version of my highly rated session from last year. It is not in the voting list right now due to possible NDA content, but I wanted to send out a note that I hope to have a revised version for this year.”
4570 Ask the Expert VCDXs
Jason Nash, Varrow – Matt Cowger, EMC – Chris Colotti, VMware – Chris Wahl, Ahead – Rick Scherer, EMC
One of the highest rated sessions at VMworld is back for its sixth year.
Come get interactive with a panel of five VMware Certified Design Experts, who’ll answer your questions live.
Experts in Cloud, Application Modernization and End User Computing are here to help.
4969 Understanding vSphere Physical Connectivity – Deep Dive & Recommendations
Jason Nash, Varrow
This session will provide an in-depth look at your options for physically connecting vSphere hosts to the network. The discussion will focus on common question areas that come up during knowledge workshops and customer design sessions. Throughout the session, videos and animations will be used to help attendees easily see the expected result from many of these configuration options. The presentation will focus heavily on the different hashing types and traffic control, especially the more advanced options such as Load-Based Teaming and Network I/O Control. Other areas of focus include physical separation of traffic, networks of differing security requirements such as DMZs, and suggested NIC configurations for both 1Gb and 10Gb environments. Finally, recommendations for physical switch configurations will also be covered. Throughout the session, best practices, recommendations, and lessons learned from many production deployments will be shared. Attendees should expect to walk away with a deep understanding of the physical connectivity options available with vSphere, how they can be utilized in their environment, and the best methods for deploying them.
5038 Achieving High Availability with vSphere Storage Metro Clusters and EMC VPLEX
Joe Kelly and Martin Valencia, Varrow
High Availability in today’s datacenters stretches beyond a single- or multi-host failure. Increased visibility and mobility across datacenters is quickly becoming a must-have for all workloads. With the introduction of Active/Active storage solutions to the market, VMware has begun to support these SVDs, or Storage Virtualization Devices, for qualified products. vSphere Metro Storage Cluster (vMSC) is a configuration option that allows for stretched clusters between two synchronously joined datacenters. These distributed cluster capabilities allow for fluid VM mobility using vMotion between datacenters. This session will focus on the benefits and challenges of such a solution based on front-line field experience using VPLEX 5.X and vSphere 5.X. Multi-site vSphere cluster designs just got better.
NOTE: Chad Sakac of EMC lists this session in his favorite picks. Chad is a great friend of Varrow and says “Varrow has some of the best cross-domain experts I’ve seen at a VAR. They’ve had a lot of experiences with a lot of technology – so I would expect that this would be a very balanced and deep session.”.
4809 SQL Virtualization – How to bring Tier 1 SQL workloads into an optimized vSphere environment
Tony Pittman and Kyle Quinby, Varrow
Drawing from real-world experiences, Kyle Quinby and Tony Pittman will share accumulated knowledge, tips, and tricks to squeeze every last bit of performance out of your SQL VMs. They will focus on how to lay a solid foundation for SQL, for both performance and high availability.
Tier 1 Database workloads truly have a home in modern virtual infrastructure. Consolidation ratios are finally what they should be, and bookkeeping aspects such as licensing and backup/DR have finally become relatively simple to achieve.
This session will be appreciated by:
1. SQL admins or developers that are looking for insight into how their virtual admins can help them
2. Virtual infrastructure administrators, that are unsure what steps they should be taking for their SQL workloads
3. Long-term planners (architects/CIO) looking to get some “best practice” information to create a virtual SQL strategy
5053 Cloud Continuity: How does the Cloud fit into your Business Continuity Plan?
Tom Cornwell, Varrow
Today cloud is becoming ubiquitous. However, it can be utilized in many different ways. Whether you utilize cloud for your production environments or not, cloud can play a part in your business continuity plan. This session describes different business continuity strategies utilizing cloud. Topics include: cloud as a source, cloud as a target, and cloud to cloud DR. In addition, we will discuss mixed strategies and different types of cloud implementations including Platform-as-a-Service, Infrastructure-as-a-Service, Storage-as-a-Service, and DR-as-a-Service.
5158 How High Point Regional deployed a secure, feature-rich resilient desktop to their users
Dave Lawrence, Varrow. Adam Flowers, High Point Regional Health System
Come hear how High Point Regional Health System implemented Horizon View using the latest technologies, such as HTML Access, Persona Management, Imprivata OneSign, Trend Micro Deep Security, and zero and repurposed clients, all on a vSphere Stretched Cluster with EMC VPLEX. Come hear the lessons learned during this dynamic deployment and how to integrate all of these great solutions together.
5296 Hypervisor-Based Disaster Recovery: The Missing Link for Virtualizing Mission-Critical Applications
Jason Nash, Varrow. Shannon Snowden, Zerto.
Mission-critical, tier-1 applications such as database and transactional applications are often the last to be virtualized. Despite the many benefits of virtualizing these applications, some companies still question the ability to protect and recover them in virtualized environments. Most likely, each application is on a different type of storage, so having one solution across the environment is not possible, driving up complexity and costs. Traditional BC/DR technologies are built for physical environments, requiring manual and complex processes to utilize these systems for virtualized applications, and each type of hardware has its own type of replication, so there is no consistency.
New disaster recovery technologies are filling this gap for large and small enterprises alike, delivering all the flexibility customers expect from a virtualized environment, with the aggressive RPOs and RTOs that mission-critical applications require. In this session, we will show how hypervisor-based replication solves these critical DR issues, clearing all barriers to virtualize tier-1 applications.
5470 The seven things you need to look at in your environment BEFORE installing Horizon View!
Dave Lawrence, Varrow
Ready to install Horizon View? Have you checked your DHCP scope? Is your version of vSphere supported? Are you using Group Policy loopback processing mode? Should you? There are seven main components of your environment that you should review BEFORE installing Horizon View. Come to our session and find out what the seven components are and how to avoid the most common environmental pitfalls that our engineers have come across during hundreds of View deployments.
5553 Cloud Bursting: Strategies to overflow into the cloud
Tom Cornwell, Varrow
Cloud bursting is a concept that to many sounds like a pipe dream. However, by looking at IT from a service perspective rather than from an infrastructure perspective, the tools exist to adopt this radical idea. This session will discuss how tools from VMware can help you make the cloud bursting dream a reality.
I really hope you were at Varrow Madness with us in Durham, NC last week, because it was pretty incredible. We had lots of attendees, something like 40+ breakout sessions, plus hands-on labs, a convention floor and 2 really cool general sessions that included dancers with iLuminate suits.
Since then, a good many requests have come in for a copy of the sessions I presented with Kyle Quinby and Martin Valencia. We decided to go ahead and publish the slide decks on our blogs to make it easier to get that content.
In the morning, Martin Valencia and I presented “Mixed Workloads on EMC VNX Storage Arrays”. We did not go super-deep on performance optimization and tuning; David Gadwah with EMC presented on that in the afternoon and can go a lot deeper into the nuts and bolts than I can. Instead, we focused on how to simplify the storage layout and get the best performance out of the array for workloads that don’t necessarily justify dedicated drives. Here’s the slide deck:
In the afternoon, Kyle Quinby and I presented “Virtualizing Microsoft SQL”. We included specifics around virtualizing MSSQL 2012 and vSphere 5.1, but the content also applies to earlier versions of SQL and vSphere, and the concepts can apply to other hypervisors. The last slide in the deck has links to a lot of the supporting materials we have used as references. Kyle also has a video demo of SQL 2012 Always On, protecting a database and failing-over. It’s pretty cool stuff, much improved over MSCS, Failover Clustering, and SQL Mirroring.
If you came out and joined us, thank you for your time and participation, and I hope it was valuable and that you had as much fun as I did. If you couldn’t make it, I highly recommend penciling it onto your calendar for March of next year.