Oracle Database 23ai: Backup and Recovery - Part 1

In this two-part special, Lois Houston and Nikita Abraham delve into the critical topic of backup and recovery in Oracle Database 23ai. Together with Bill Millar, Senior Principal Database & MySQL Instructor, they discuss the role of database administrators, strategies for protecting data, and dealing with various types of data failure. Oracle MyLearn: https://mylearn.oracle.com/ou/course/oracle-database-23ai-backup-and-recovery/141127/ Oracle University Learning Community: https://education.oracle.com/ou-community LinkedIn: https://www.linkedin.com/showcase/oracle-university/ Twitter: https://twitter.com/Oracle_Edu Special thanks to Arijit Ghosh, David Wright, Radhika Banka, and the OU Studio Team for helping us create this episode. --------------------------------------------------------- Episode Transcript: 00:00 Welcome to the Oracle University Podcast, the first stop on your cloud journey. During this series of informative podcasts, we’ll bring you foundational training on the most popular Oracle technologies. Let’s get started! 00:26 Nikita: Welcome to the Oracle University Podcast! I’m Nikita Abraham, Team Lead of Editorial Services with Oracle University, and with me is Lois Houston, Director of Innovation Programs. Lois: Hi there! For the last two weeks, we’ve been having really exciting discussions on Oracle AI Vector Search. We covered the fundamentals, benefits, the vector workflow, and lots more. Today, we’re going to talk about backup and recovery in Oracle Database 23ai with Bill Millar. If you’ve been listening this season, you’ll know that Bill is a Senior Principal Database & MySQL Instructor with Oracle University. Nikita: In this two-part special, we’ll dive into some of the things you need to know about backup and recovery, especially if you’re a database and backup admin. So, if you're the person in charge of keeping data safe and handling disaster recovery, this is definitely worth your time. 01:20 Lois: That’s right, Niki. Hi Bill, thanks for joining us again. What’s the role of a Database Administrator, or DBA, when it comes to backup and recovery? Bill: The DBA is typically responsible for ensuring the database is open and available when needed and at times you need to work with system administrators and other people within your organization to achieve that. But we want to try to protect the database from failure wherever possible. We want to increase the mean time between failures. Hopefully, we don't have failures, and we have to increase that time. But it might mean that we need to ensure we have redundant hardware and that in place, again, maybe out of the realm of the DBA, but people within your organization can help with that. We want to protect those critical components by using the redundancy. And we want to decrease the mean time to recover. Failures happen, but how fast can we get access back to that data after that failure. The faster we can do it, the happier customers are. Minimize the loss of data. It's never good to lose data, especially in a critical environment, but maybe in test and development, maybe not so bad. 02:39 Nikita: How do we ensure a separation of duties for backup and recovery processes? Bill: For a separation of duties, we do have a user called SYSBACKUP. It has the privileges that's required to perform backup and recoveries, the privilege to connect and execute the commands in what we refer to as RMAN, our Recovery Manager. As I said, it has permissions for backup and recovery because you do need to shut down the database, start up the database, those type of things. We're able to connect to that closed database to try to troubleshoot it, to get it to the open state again. It does not include any privileges to access data. The SYSBACKUP user is created when we install the database, when we create the database. We can use it explicitly for privileged user connection. It allows us to connect to the database. So RMAN connects as SYSBACKUP. 03:37 Lois: Bill, what should people keep in mind when figuring out what’s considered critical data? Bill: You want to try to identify your critical data. Some data might be highly required to access and make sure we don't lose don't lose data, but then you might have some environments. OK, I don't need to have them up and running as fast. If we lose a little data, it may not hurt, but we want to identify the difference in the different data that we have on different environments. So we want to also prioritize that critical data, which data do we need access to first because how much will the company lose per hour of downtime because we can't do business. We want to make sure the access data protection requirements. Not everybody has access to everything. And there are different types of disaster that can happen that are going to be totally out of your control. There's the physical disaster, a hurricane or tornado, outages, power outages, component failures, failures within the building itself, corruption of data because of some of these failures. And then, the most dreaded one, the one that happens most often, usually those human errors, the logical errors, where the data is just bad, we are able to access and everything. It's just that something has changed that shouldn't have been changed. We want to make sure we access our recovery requirements. 05:04 Lois: So, what are they? What are those requirements? Bill: We want to base that requirement based on how critical is that data, how soon do we need to have access to that? What is our recovery point objective? Do we have a tolerance for any type of data loss? How frequently should we backups? How often they should be taken? What type of backups will be another thing we'll want to figure out? Is point-in-time recovery required? Are we able to or do we ever need to go back to a previous point in time to do something? It's not always just recovery for a database failure. We might need to do a recovery point in time to a different system so we can investigate something. What is my recovery time objective? Again, what is the tolerance for the downtime? How long can I be down? The downtime, the biggest part of when a system goes down is trying to identify what is the problem, then next is what is going to be my plan to recover, and then perform in the recovery. We might have a tiered required time objective based off of critical data, and then depending on the failure. Is that failure at the entire database? Is it just a tablespace? Is it just a table? Is it just a row? That also determines how long it takes to recover and what type of recovery we might try to perform. What is my backup retention policy? Do I have a requirement to where I have to have my backups off site? And it doesn't mean like back in the old days of mainframe computers, you'd back up to tape and you'd take those tapes off site. You might still do that today. Or, am I backing up to a cloud environment? So what do I need to have for that? What about long-term backups? We work with our day-to-day backups, but there's those backups that require for longer, archives like end of year backups. Some places require to keep their end of year backup for like 10 years. How are we going to handle that? So these are some of the things that we have to think about when we start talking about backup and recovery. 07:23 Did you know that the Oracle University Learning Community regularly holds live events hosted by Oracle expert instructors. Find out how to prepare for your certification exams. Learn about the latest technology advances and features. Ask questions in real time and learn from an Oracle subject matter expert. From Ask Me Anything about certification to Ask the Instructor coaching sessions, you’ll be able to achieve your learning goals for 2024 in no time. Join a live event today and witness firsthand the transformative power of the Oracle University Learning Community. Visit mylearn.oracle.com to get started. 08:04 Nikita: Welcome back! Bill, I want to talk about the different failures that can occur in an Oracle database. How would you categorize them? Bill: There are different category of failure. This is not an all-inclusive list by any means. It's just something that possibly can happen. So they can usually be divided into different categories like statement failure. All right. When doing a select and insert, update, delete, the statement itself fails. A user process fails. Single database session fails for some reason. Network failure, connectivity is lost. The user error, probably one of the most common ones we have to deal with. A user successfully completes an operation, but that operation was erroneous. They dropped the wrong table, updated the wrong row. Then there's the instance failure. The database itself shuts down unexpectedly. And then media failure, usually a hard failure of our disk. Something of memory, something failed and caused an error. 09:12 Lois: Ok. I want to dive a little deeper into each of these categories that you mentioned. Let’s start with statement failures. What are typical problems that one might face? Bill: Attempts to enter invalid data into a table. They're trying to put a numeric field in a date field, and usually just working with the user is going to correct that. Is that the DBA responsible? Yes, no, maybe. They attempt to form operations with insufficient privileges. Attempts to allocate space that fails, well, that depends on are they going-- do they have unlimited storage or do they have a limit? Logic errors in the application. Well, that's where we're going to have to work with those developers to try to correct those type of errors. 09:59 Nikita: What about user process failures? Bill: User performs an abnormal disconnect, doesn't close out properly. It can cause something to hang up or even possibly erroneous data to be updated. A user session is abnormally terminated. Well, usually, we don't have to try to resolve those user type errors, but something we might need to look into. A user experiences a program error that terminates the session. Again, usually it's the application developers, but it's something as a DBA, we might want to keep an eye on. Is it the same person? Is it from the same location? Is it the same module within that application? Maybe there's some things we can help to identify what the possible problem can be. 10:43 Nikita: Bill, tell us about common issues that can lead to network failures. What can we do to mitigate these problems and ensure network resilience? Bill: The listener fails. Well, we can connect a backup listener and configure how it can connect time failover can work. A network interface card fails. Well, again, we're not the hardware people, but can we work with our network, our server team, whatever, to possibly have redundant network cards? The network connection fails itself. Can we configure a backup network connection? 11:18 Lois: And what about user errors? How can we recover from those types of scenarios? Bill: The user inadvertently deletes or modifies data. Well, we have some things we'll look at as far as like rollback a transaction along with the dependent transactions. Rewind that table back to where it should have been. You're also can use LogMiner. You can look at our redo logs to try to figure out where that bad transaction was. User drops a table inadvertently. Well, we can recover the table from the recycle bin if we have the recycle bin on or we may need to recover from a backup. 11:56 Nikita: What are common causes of instance failures, Bill? Bill: The dreaded power outage. Well, hopefully, we have some type of up system to keep us running, even if it's not for continuous operation. Maybe if it's just to allow us to gracefully take a system down. The dreaded hardware failure. If you have a way to predict a hardware failure, you can make a lot of money. Always happens at the most inopportune times. But then again, do we have redundant hardware? Do we have something in place to help allow us to continue to operate in case of a hardware failure? Failure of one of the critical background processes. Why did it fail? We can go out. We can look at our alert log, we have trace files. And then we have, you have the Enterprise Manager Cloud Control. We can do the same thing as looking at the alert log and trace files. But the Enterprise Manager Cloud Control gives us a GUI interface to allow us to do that. 12:53 Lois: Before we let you go, Bill, can you tell us about media and data failures? Bill: Failure of a disk drive, failure of a disk controller, deletion or corruption of a file needed for database operation, well, this is the dreaded media failure. So we're going to restore from a backup. If we need to move, we can move a data file to a different location. We can notify, hey, here's that new location. And then recover by applying any of the incremental backups, any of the redo to get it back to where it should be. And then we have the data failures. We can't access the component, missing data files at OS level. And maybe our system administrators deleted something thinking it wasn't needed, or maybe even a developer on a development type system. Don't have the right permissions. Tablespace is offline. Well, why is it offline? Did somebody took the wrong tablespace offline? We have physical corruptions, block checksum failures. It's inconsistent between the header and footer. Invalid block header field values, like all of them are zeroed out. Then we have the logical corruptions, inconsistent dictionary, corrupt row piece, the inconsistencies, a control file not synchronized with the data files, usually because we recovered something and didn't do it the right way. I/O failures, maybe we just exceeded the number of open files that we're allowed to have. Maybe it's just a network or an I/O error itself. And these are different types of failures that you might experience. Again, it's not an all-inclusive list. It's just a few examples. 14:41 Nikita: I know you said it’s not an all-inclusive list and you were just giving us a few examples, but that seemed quite thorough! Thank you so much, Bill, for walking us through all of that today! Lois: Yeah, I totally agree! Thanks Bill! For more on what we discussed today, visit mylearn.oracle.com. Search for the Oracle Database 23ai: Backup and Recovery course. Next week, we’ll get into instance recovery and recovery strategies. Until then, this is Lois Houston… Nikita: And Nikita Abraham, signing off! 15:15 That’s all for this episode of the Oracle University Podcast. If you enjoyed listening, please click Subscribe to get all the latest episodes. We’d also love it if you would take a moment to rate and review us on your podcast app. See you again on the next episode of the Oracle University Podcast.

Om Podcasten