Site Reliability Engineer

Qualia
Summary
Join Qualia as a Site Reliability Engineer and safeguard our core Resware systems. You will keep services robust and improve operational processes by troubleshooting and resolving critical issues, particularly those related to SQL Server, Windows Server/IIS, and our core Resware application. You will maintain and enhance internal R&D tools, oversee installer stability, and lead incident management and root cause analysis. You will also proactively monitor and improve systems, meticulously document processes, and identify automation opportunities, collaborating with development teams to ensure stability and maintainability and communicating effectively with stakeholders. This role is pivotal in maintaining user trust and providing a stable foundation during Qualia's modernization initiative. The position offers a competitive salary and benefits package.
Requirements
- Demonstrated experience (typically 5+ years) in a Technical Support Engineer (Tier 3+), Systems Engineer, Site Reliability Engineer (SRE), or similar role focused on maintaining and troubleshooting complex software systems
- Solid understanding of Windows Server environments and IIS configuration/management
- Proficiency with .NET Framework applications (ability to understand code, debug, and diagnose issues, even if not actively developing new features)
- Familiarity with scripting and IaC languages (e.g., PowerShell, Terraform) for diagnostics and minor automation tasks
- Strong expertise in Microsoft SQL Server administration and troubleshooting (query optimization, indexing, upgrade scripts)
- Familiarity with Azure and cloud services
- You are driven to understand the "why" behind problems, not just apply surface-level fixes. You possess strong analytical and diagnostic skills, with an ability to think strategically about long-term solutions
- You believe in the power of good documentation and have experience creating and maintaining technical documentation
- You can articulate complex technical concepts clearly and concisely to diverse audiences
- You are dedicated to providing excellent service and support to internal teams and, by extension, Qualia's customers
Responsibilities
- Troubleshoot & Resolve Critical Issues: Investigate, diagnose, and resolve complex technical issues in production and internal environments, particularly those related to SQL Server (upgrade scripts, index performance), Windows Server/IIS configurations, and our core Resware application (including "mothership" centralized processes and services)
- Internal Tooling Stewardship: Own the maintenance, support, and enhancement of existing internal R&D tools. Research their purpose, understand their function within the larger development lifecycle, and ensure they remain effective and reliable
- Installer Management: Oversee the stability and functionality of the Resware installer, addressing any bugs or issues that arise during deployment or upgrades
- Incident Management & Root Cause Analysis (RCA): Lead or significantly contribute to the RCA process for incidents, ensuring that learnings are captured and preventative measures are identified and implemented
- Proactive Monitoring & Alerting: Collaborate with the team to refine monitoring solutions, ensuring early detection of potential issues before they impact users or internal operations
- Documentation & Knowledge Transfer: Meticulously document systems, processes, and troubleshooting guides to build a comprehensive knowledge base, reducing knowledge silos and enabling faster resolution of future issues
- Identify Automation Opportunities: While not primarily an automation role, keenly identify repetitive tasks, system inefficiencies, or areas prone to error that could be candidates for automation by the broader R&D team (e.g., DevOps engineers)
- Contribute to Stability by Design: Provide feedback from an operational perspective to development teams, helping to ensure new features or changes are designed for stability and maintainability
- Process Optimization: Analyze existing operational procedures, identify bottlenecks or areas for improvement, and propose practical solutions to enhance efficiency and reliability
- Understand Business Impact: Prioritize issues based on their potential impact on customers (internal and external) and business operations
- Collaborative Problem Solving: Work closely with other R&D team members, support teams, and IT to resolve issues and improve system health
- Effective Communication: Clearly communicate technical issues, proposed solutions, and project status to both technical and non-technical stakeholders
Preferred Qualifications
- Prior experience with Resware or within the title and escrow industry would be a significant advantage, allowing you to understand the context of our tools and customer needs more quickly
- You don't wait for issues to find you; you are proactive in identifying potential problems and taking ownership to drive them to resolution
- You thrive in a team environment and understand the importance of collaboration to achieve common goals
Benefits
- Comprehensive health plans
- A 401k program
- Commuter benefits
- Parental leave
- A flexible time off policy