
data.generation

The data.generation module provides synthetic data creation tools for the AFCCP system, designed to support testing, experimentation, and scenario modeling for cadet-to-AFSC assignment problems. It includes both simple and advanced generators that simulate cadet preferences, AFSC eligibility, base assignments, training courses, and other problem parameters.

This module allows users to rapidly produce valid input datasets for CadetCareerProblem that mimic either minimal input assumptions or realistic programmatic constraints.

Submodules

  • data.generation.basic Provides minimal synthetic data generation pipelines with:
    • Uniform or random utility scores
    • Simplified cadet and AFSC attributes
    • Basic constraint setups for quick prototyping
  • data.generation.realistic Generates highly realistic cadet datasets based on:
    • Real CIP distributions and merit-based AFSC interest
    • Custom samplers for AFSC/cadet utilities
    • Condition-based sampling to satisfy rare AFSC degree quotas

Functionality

The data.generation module supports:

  • Randomized dataset creation: from scratch or based on conditional rules
  • AFSC-specific quota balancing: especially for underrepresented AFSCs like 62EXE
  • Cadet utility modeling: using KDE-based samplers from empirical data
  • Preference shaping: structured base, training, and AFSC preference profiles
  • Training course simulation: schedules, capacities, and cadet matching windows

Typical Use Cases

  • Unit testing optimization models with diverse synthetic populations
  • Stress-testing sensitivity analysis pipelines on edge-case cadet distributions
  • Exploring rare AFSC scenarios using CIP-based data generation
  • Prototyping utility functions, preference matrices, and training alignment logic

See Also

  • data.processing: Tools to clean and restructure raw real-world data before generation

  • data.preferences: Utility score generation and AFSC–cadet preference logic

  • CadetCareerProblem: Primary object that consumes synthetic datasets generated here via .load_data() and .solve_*() methods

generate_random_instance(N=1600, M=32, P=6, S=6, generate_only_nrl=False, generate_extra=False)

This function takes the parameters defined below and simulates a new random set of "fixed" cadet/AFSC input parameters. These parameters are returned and can be used to solve the VFT model.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| N | int | number of cadets | 1600 |
| M | int | number of AFSCs | 32 |
| P | int | number of preferences allowed | 6 |
| S | int | number of bases | 6 |
| generate_only_nrl | bool | only generate NRL AFSCs | False |
| generate_extra | bool | whether to generate extra components (bases/IST) | False |

Returns:

| Type | Description |
| ---- | ----------- |
| dict | model fixed parameters |
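The PGL quota logic inside this generator can be sketched on its own. The following is a minimal, hypothetical NumPy reproduction of the target-generation steps (draw raw targets around an even split, scale to ~80% of the cadet population, floor at 1, sort largest-first); it mirrors the parameter names `N`, `M`, and `pgl` but is not the library's actual code:

```python
import numpy as np

np.random.seed(0)  # reproducible sketch
N, M = 100, 8  # cadets, AFSCs

# Draw raw PGL targets around an even split of ~1000 slots, floored at 10
pgl = np.maximum(10, np.random.normal(1000 / M, 100, size=M))

# Scale so targets cover roughly 80% of the cadet population, as integers >= 1
pgl = np.around(pgl / pgl.sum() * N * 0.8)
pgl[pgl == 0] = 1

# Sort largest-first, as the generator does
pgl = np.sort(pgl)[::-1]
```

Running this with a small `N` makes it easy to see how the scaling keeps total targets under the cadet count while preserving a minimum of one slot per AFSC.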

Source code in afccp/data/generation/basic.py
def generate_random_instance(N=1600, M=32, P=6, S=6, generate_only_nrl=False, generate_extra=False):
    """
    This procedure takes in the specified parameters (defined below) and then simulates new random "fixed" cadet/AFSC
    input parameters. These parameters are then returned and can be used to solve the VFT model.
    :param N: number of cadets
    :param M: number of AFSCs
    :param P: number of preferences allowed
    :param S: number of Bases
    :param generate_only_nrl: Only generate NRL AFSCs (default to False)
    :param generate_extra: Whether to generate extra components (bases/IST). Defaults to False.
    :return: model fixed parameters
    """

    # Initialize parameter dictionary
    # noinspection PyDictCreation
    p = {'N': N, 'P': P, 'M': M, 'num_util': P, 'cadets': np.arange(N), 'I': np.arange(N), 'J': np.arange(M),
         'usafa': np.random.choice([0, 1], size=N, p=[2 / 3, 1 / 3]), 'merit': np.random.rand(N)}

    # Generate various features of the cadets
    p['merit_all'] = p['merit']
    p['assigned'] = np.array(['' for _ in range(N)])
    p['soc'] = np.array(['USAFA' for _ in range(p['N'])])
    p['soc'][np.where(p['usafa'] == 0)[0]] = 'ROTC'

    # Calculate quotas for each AFSC
    p['pgl'], p['usafa_quota'], p['rotc_quota'] = np.zeros(M), np.zeros(M), np.zeros(M)
    p['quota_min'], p['quota_max'] = np.zeros(M), np.zeros(M)
    p['quota_e'], p['quota_d'] = np.zeros(M), np.zeros(M)
    for j in range(M):

        # Get PGL target
        p['pgl'][j] = max(10, np.random.normal(1000 / M, 100))

    # Scale PGL and force integer values and minimum of 1
    p['pgl'] = np.around((p['pgl'] / np.sum(p['pgl'])) * N * 0.8)
    indices = np.where(p['pgl'] == 0)[0]
    p['pgl'][indices] = 1

    # Sort PGL by size
    p['pgl'] = np.sort(p['pgl'])[::-1]

    # USAFA/ROTC Quotas
    p['usafa_quota'] = np.around(np.random.rand(M) * 0.3 + 0.1 * p['pgl'])
    p['rotc_quota'] = p['pgl'] - p['usafa_quota']

    # Min/Max
    p['quota_min'], p['quota_max'] = p['pgl'], np.around(p['pgl'] * (1 + np.random.rand(M) * 0.9))

    # Target is a random integer between the minimum and maximum targets
    target = np.around(p['quota_min'] + np.random.rand(M) * (p['quota_max'] - p['quota_min']))
    p['quota_e'], p['quota_d'] = target, target

    # Generate AFSCs
    p['afscs'] = np.array(['R' + str(j + 1) for j in range(M)])

    # Determine what "accessions group" each AFSC is in
    if generate_only_nrl:
        p['acc_grp'] = np.array(["NRL" for _ in range(M)])
    else:

        # If there are 3 or more AFSCs, we want all three accessions groups represented
        if M >= 3:
            invalid = True
            while invalid:

                # If we have 6 or fewer, limit USSF to just one AFSC
                if M <= 6:
                    p['acc_grp'] = ['USSF']
                    for _ in range(M - 1):
                        p['acc_grp'].append(np.random.choice(['NRL', 'Rated']))
                else:
                    p['acc_grp'] = [np.random.choice(['NRL', 'Rated', 'USSF']) for _ in range(M)]

                # Make sure we have at least one AFSC from each accession's group
                invalid = False  # "Innocent until proven guilty"
                for grp in ['NRL', 'Rated', 'USSF']:
                    if grp not in p['acc_grp']:
                        invalid = True
                        break

                # If we have 4 or more AFSCs, make sure we have at least two Rated
                if M >= 4:
                    if p['acc_grp'].count('Rated') < 2:
                        invalid = True
            p['acc_grp'] = np.array(p['acc_grp'])  # Convert to numpy array

        # If we only have one or two AFSCs, they'll all be NRL
        else:
            p['acc_grp'] = np.array(["NRL" for _ in range(M)])

    # Add an "*" to the list of AFSCs to be considered the "Unmatched AFSC"
    p["afscs"] = np.hstack((p["afscs"], "*"))

    # Add degree tier qualifications to the set of parameters
    def generate_degree_tier_qualifications():
        """
        I made this nested function, so I could have a designated section to generate degree qualifications and such
        """

        # Determine degree tiers and qualification information
        p['qual'] = np.array([['P1' for _ in range(M)] for _ in range(N)])
        p['Deg Tiers'] = np.array([[' ' * 10 for _ in range(4)] for _ in range(M)])
        for j in range(M):

            if p['acc_grp'][j] == 'Rated':  # All Degrees eligible for Rated
                p['qual'][:, j] = np.array(['P1' for _ in range(N)])
                p['Deg Tiers'][j, :] = ['P = 1', 'I = 0', '', '']

                # Pick 20% of the cadets at random to be ineligible for this Rated AFSC
                indices = random.sample(list(np.arange(N)), k=int(0.2 * N))
                p['qual'][indices, j] = 'I2'
            else:
                # Determine what tiers to use on this AFSC
                if N < 100:
                    random_number = np.random.rand()
                    if random_number < 0.2:
                        tiers = ['M1', 'I2']
                        p['Deg Tiers'][j, :] = ['M = 1', 'I = 0', '', '']
                    elif random_number < 0.4:
                        tiers = ['D1', 'P2']
                        target_num = round(np.random.rand(), 2)
                        p['Deg Tiers'][j, :] = ['D > ' + str(target_num), 'P < ' + str(1 - target_num), '', '']
                    elif random_number < 0.6:
                        tiers = ['P1']
                        p['Deg Tiers'][j, :] = ['P = 1', '', '', '']
                    else:
                        tiers = ['M1', 'P2']
                        target_num = round(np.random.rand(), 2)
                        p['Deg Tiers'][j, :] = ['M > ' + str(target_num), 'P < ' + str(1 - target_num), '', '']
                else:
                    random_number = np.random.rand()
                    if random_number < 0.1:
                        tiers = ['M1', 'I2']
                        p['Deg Tiers'][j, :] = ['M = 1', 'I = 0', '', '']
                    elif random_number < 0.2:
                        tiers = ['D1', 'P2']
                        target_num = round(np.random.rand(), 2)
                        p['Deg Tiers'][j, :] = ['D > ' + str(target_num), 'P < ' + str(1 - target_num), '', '']
                    elif random_number < 0.3:
                        tiers = ['P1']
                        p['Deg Tiers'][j, :] = ['P = 1', '', '', '']
                    elif random_number < 0.4:
                        tiers = ['M1', 'P2']
                        target_num = round(np.random.rand(), 2)
                        p['Deg Tiers'][j, :] = ['M > ' + str(target_num), 'P < ' + str(1 - target_num), '', '']
                    elif random_number < 0.5:
                        tiers = ['M1', 'D2', 'P3']
                        target_num_1 = round(np.random.rand() * 0.7, 2)
                        target_num_2 = round(np.random.rand() * (1 - target_num_1) * 0.8, 2)
                        target_num_3 = round(1 - target_num_1 - target_num_2, 2)
                        p['Deg Tiers'][j, :] = ['M > ' + str(target_num_1), 'D > ' + str(target_num_2),
                                                'P < ' + str(target_num_3), '']
                    elif random_number < 0.6:
                        tiers = ['D1', 'D2', 'P3']
                        target_num_1 = round(np.random.rand() * 0.7, 2)
                        target_num_2 = round(np.random.rand() * (1 - target_num_1) * 0.8, 2)
                        target_num_3 = round(1 - target_num_1 - target_num_2, 2)
                        p['Deg Tiers'][j, :] = ['D > ' + str(target_num_1), 'D > ' + str(target_num_2),
                                                'P < ' + str(target_num_3), '']
                    elif random_number < 0.7:
                        tiers = ['M1', 'D2', 'I3']
                        target_num = round(np.random.rand(), 2)
                        p['Deg Tiers'][j, :] = ['M > ' + str(target_num), 'D < ' + str(1 - target_num), 'I = 0', '']
                    elif random_number < 0.8:
                        tiers = ['M1', 'P2', 'I3']
                        target_num = round(np.random.rand(), 2)
                        p['Deg Tiers'][j, :] = ['M > ' + str(target_num), 'P < ' + str(1 - target_num), 'I = 0', '']
                    else:
                        tiers = ['M1', 'D2', 'P3', 'I4']
                        target_num_1 = round(np.random.rand() * 0.7, 2)
                        target_num_2 = round(np.random.rand() * (1 - target_num_1) * 0.8, 2)
                        target_num_3 = round(1 - target_num_1 - target_num_2, 2)
                        p['Deg Tiers'][j, :] = ['M > ' + str(target_num_1), 'D > ' + str(target_num_2),
                                                'P < ' + str(target_num_3), 'I = 0']

                # Generate the tiers for the cadets
                c_tiers = np.random.randint(0, len(tiers), N)
                p['qual'][:, j] = np.array([tiers[c_tiers[i]] for i in range(N)])

        # NxM qual matrices with various features
        p["ineligible"] = (np.core.defchararray.find(p['qual'], "I") != -1) * 1
        p["eligible"] = (p["ineligible"] == 0) * 1
        for t in [1, 2, 3, 4]:
            p["tier " + str(t)] = (np.core.defchararray.find(p['qual'], str(t)) != -1) * 1
        p["mandatory"] = (np.core.defchararray.find(p['qual'], "M") != -1) * 1
        p["desired"] = (np.core.defchararray.find(p['qual'], "D") != -1) * 1
        p["permitted"] = (np.core.defchararray.find(p['qual'], "P") != -1) * 1

        # NEW: Exception to degree qualification based on CFM ranks
        p["exception"] = (np.core.defchararray.find(p['qual'], "E") != -1) * 1

        # Initialize information for AFSC degree tiers
        p["t_count"] = np.zeros(p['M']).astype(int)
        p["t_proportion"] = np.zeros([p['M'], 4])
        p["t_leq"] = (np.core.defchararray.find(p["Deg Tiers"], "<") != -1) * 1
        p["t_geq"] = (np.core.defchararray.find(p["Deg Tiers"], ">") != -1) * 1
        p["t_eq"] = (np.core.defchararray.find(p["Deg Tiers"], "=") != -1) * 1
        p["t_mandatory"] = (np.core.defchararray.find(p["Deg Tiers"], "M") != -1) * 1
        p["t_desired"] = (np.core.defchararray.find(p["Deg Tiers"], "D") != -1) * 1
        p["t_permitted"] = (np.core.defchararray.find(p["Deg Tiers"], "P") != -1) * 1

        # Loop through each AFSC
        for j, afsc in enumerate(p["afscs"][:p['M']]):

            # Loop through each potential degree tier
            for t in range(4):
                val = p["Deg Tiers"][j, t]

                # Empty degree tier
                if 'M' not in val and 'D' not in val and 'P' not in val and 'I' not in val:
                # if val in ["nan", "", ''] or pd.isnull(val):
                    t -= 1
                    break

                # Degree Tier Proportion
                p["t_proportion"][j, t] = val.split(" ")[2]

            # Num tiers
            p["t_count"][j] = t + 1

        return p   # Return updated parameters
    p = generate_degree_tier_qualifications()

    # Cadet preferences
    utility = np.random.rand(N, M)  # random utility matrix
    max_util = np.max(utility, axis=1)
    p['utility'] = np.around(utility / np.array([[max_util[i]] for i in range(N)]), 2)

    # Get cadet preferences
    p["c_pref_matrix"] = np.zeros([p["N"], p["M"]]).astype(int)
    for i in range(p['N']):

        # Sort the utilities to get the preference list
        utilities = p["utility"][i, :p["M"]]
        sorted_indices = np.argsort(utilities)[::-1]
        preferences = np.argsort(
            sorted_indices) + 1  # Add 1 to change from python index (at 0) to rank (start at 1)
        p["c_pref_matrix"][i, :] = preferences

    # Create the "column data" preferences and utilities
    p['c_preferences'], p['c_utilities'] = afccp.data.preferences.update_cadet_columns_from_matrices(p)
    p['c_preferences'] = p['c_preferences'][:, :P]
    p['c_utilities'] = p['c_utilities'][:, :P]

    # If we want to generate extra components to match with, we do so here
    if generate_extra:
        p['S'] = S
        p = generate_extra_components(p)

    # Update set of parameters
    p = afccp.data.adjustments.parameter_sets_additions(p)

    return p  # Return updated parameters
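The double-argsort used above to convert cadet utilities into 1-indexed preference ranks is a common NumPy idiom worth isolating. A minimal standalone sketch:

```python
import numpy as np

utilities = np.array([0.2, 0.9, 0.5])

# argsort descending gives AFSC indices ordered by preference;
# argsort of that recovers each AFSC's rank position; +1 makes ranks 1-based
preferences = np.argsort(np.argsort(utilities)[::-1]) + 1
# -> [3, 1, 2]: the AFSC with utility 0.9 gets rank 1
```

The inner argsort orders AFSCs by utility; the outer argsort inverts that permutation, so each column holds the rank the cadet assigns to that AFSC.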

generate_random_value_parameters(parameters, num_breakpoints=24)

Generate Random Value Parameters for a Cadet-AFSC Assignment Problem.

This function constructs a randomized set of value-focused thinking (VFT) parameters for a given cadet-AFSC matching instance. These include AFSC weights, cadet weights, value function definitions, and constraint structures across defined objectives. It supports a mix of manually assigned logic and randomized components and can be used to simulate plausible input conditions for testing the assignment algorithm.

Parameters

parameters : dict
    The problem instance parameters, including cadet/AFSC info, merit scores, eligibility, quotas, and utilities.
num_breakpoints : int, optional
    Number of breakpoints to use in piecewise linear value functions, by default 24.

Returns

dict A dictionary vp containing generated value parameters, including objectives, weights, constraints, value functions, and breakpoints.

Examples

vp = generate_random_value_parameters(parameters, num_breakpoints=16)
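The breakpoint machinery itself lives in afccp.data.values; as a rough, hypothetical sketch of what "linearizing" a nonlinear value function over `num_breakpoints` means (the `linearize` helper and the example curve are illustrative, not the library's API):

```python
import numpy as np

def linearize(value_fn, x_max, num_breakpoints=24):
    """Sample a nonlinear value function at evenly spaced breakpoints."""
    a = np.linspace(0, x_max, num_breakpoints)      # breakpoint locations
    f_hat = np.array([value_fn(x) for x in a])      # value at each breakpoint
    return a, f_hat

# A concave, 'Min Increasing'-style curve that saturates toward 1
a, f_hat = linearize(lambda x: 1 - np.exp(-3 * x), 1.0, num_breakpoints=16)
```

The optimization model then interpolates between the `(a, f_hat)` pairs, which is why more breakpoints give a closer piecewise linear approximation of the underlying curve.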

See Also

  • generate_afocd_value_parameters: Adds tiered AFOCD objectives and fills in the default VFT structure for a given instance.
  • create_segment_dict_from_string: Parses string definitions into nonlinear segment dictionaries for value functions.
  • value_function_builder: Linearizes nonlinear value functions using a fixed number of breakpoints.
  • cadet_weight_function: Creates weights across cadets based on merit scores and function type.
  • afsc_weight_function: Creates weights across AFSCs based on projected gains/losses and selected function type.

Source code in afccp/data/generation/basic.py
def generate_random_value_parameters(parameters, num_breakpoints=24):
    """
    Generate Random Value Parameters for a Cadet-AFSC Assignment Problem.

    This function constructs a randomized set of value-focused thinking (VFT) parameters for a given cadet-AFSC
    matching instance. These include AFSC weights, cadet weights, value function definitions, and constraint structures
    across defined objectives. It supports a mix of manually assigned logic and randomized components and can be
    used to simulate plausible input conditions for testing the assignment algorithm.

    Parameters
    ----------
    parameters : dict
        The problem instance parameters, including cadet/AFSC info, merit scores, eligibility, quotas, and utilities.
    num_breakpoints : int, optional
        Number of breakpoints to use in piecewise linear value functions, by default 24.

    Returns
    -------
    dict
        A dictionary `vp` containing generated value parameters, including objectives, weights, constraints,
        value functions, and breakpoints.

    Examples
    --------
    ```python
    vp = generate_random_value_parameters(parameters, num_breakpoints=16)
    ```

    See Also
    --------
    - [`generate_afocd_value_parameters`](../../../afccp/reference/data/values/#data.values.generate_afocd_value_parameters):
      Adds tiered AFOCD objectives and fills in default VFT structure for a given instance.
    - [`create_segment_dict_from_string`](../../../afccp/reference/data/values/#data.values.create_segment_dict_from_string):
      Parses string definitions into nonlinear segment dictionaries for value functions.
    - [`value_function_builder`](../../../afccp/reference/data/values/#data.values.value_function_builder):
      Linearizes nonlinear value functions using a fixed number of breakpoints.
    - [`cadet_weight_function`](../../../afccp/reference/data/values/#data.values.cadet_weight_function):
      Creates weights across cadets based on merit scores and function type.
    - [`afsc_weight_function`](../../../afccp/reference/data/values/#data.values.afsc_weight_function):
      Creates weights across AFSCs based on projected gains/losses and selected function type.
    """

    # Shorthand
    p = parameters

    # Objective to parameters lookup dictionary (if the parameter is in "p", we include the objective)
    objective_lookups = {'Norm Score': 'a_pref_matrix', 'Merit': 'merit', 'USAFA Proportion': 'usafa',
                         'Combined Quota': 'quota_d', 'USAFA Quota': 'usafa_quota', 'ROTC Quota': 'rotc_quota',
                         'Utility': 'utility', 'Mandatory': 'mandatory',
                         'Desired': 'desired', 'Permitted': 'permitted'}
    for t in ["1", "2", "3", "4"]:  # Add in AFOCD Degree tiers
        objective_lookups["Tier " + t] = "tier " + t

    # Add the AFSC objectives that are included in this instance (check corresponding parameters using dict above)
    objectives = np.array([objective for objective in objective_lookups if objective_lookups[objective] in p])

    # Initialize set of value parameters
    vp = {'objectives': objectives, 'cadets_overall_weight': np.random.rand(), 'O': len(objectives),
          'K': np.arange(len(objectives)), 'num_breakpoints': num_breakpoints, 'cadets_overall_value_min': 0,
          'afscs_overall_value_min': 0}
    vp['afscs_overall_weight'] = 1 - vp['cadets_overall_weight']

    # Generate AFSC and cadet weights
    weight_functions = ['Linear', 'Direct', 'Curve_1', 'Curve_2', 'Equal']
    vp['cadet_weight_function'] = np.random.choice(weight_functions)
    vp['afsc_weight_function'] = np.random.choice(weight_functions)
    vp['cadet_weight'] = afccp.data.values.cadet_weight_function(p['merit_all'], func=vp['cadet_weight_function'])
    vp['afsc_weight'] = afccp.data.values.afsc_weight_function(p['pgl'], func=vp['afsc_weight_function'])

    # Fields not exercised in this random-generation context
    vp['cadet_value_min'], vp['afsc_value_min'] = np.zeros(p['N']), np.zeros(p['M'])
    vp['USAFA-Constrained AFSCs'], vp['Cadets Top 3 Constraint'] = '', ''
    vp['USSF OM'] = False

    # Initialize arrays
    vp['objective_weight'], vp['objective_target'] = np.zeros([p['M'], vp['O']]), np.zeros([p['M'], vp['O']])
    vp['constraint_type'] = np.zeros([p['M'], vp['O']])
    vp['objective_value_min'] = np.array([[' ' * 20 for _ in vp['K']] for _ in p['J']])
    vp['value_functions'] = np.array([[' ' * 200 for _ in vp['K']] for _ in p['J']])

    # Initialize breakpoints
    vp['a'] = [[[] for _ in vp['K']] for _ in p["J"]]
    vp['f^hat'] = [[[] for _ in vp['K']] for _ in p["J"]]

    # Initialize objective set
    vp['K^A'] = {}

    # Get AFOCD Tier objectives
    vp = afccp.data.values.generate_afocd_value_parameters(p, vp)
    vp['constraint_type'] = np.zeros([p['M'], vp['O']])  # Turn off all the constraints again

    # Loop through all AFSCs
    for j in p['J']:

        # Loop through all AFSC objectives
        for k, objective in enumerate(vp['objectives']):

            maximum, minimum, actual = None, None, None
            if objective == 'Norm Score':
                vp['objective_weight'][j, k] = (np.random.rand() * 0.2 + 0.3) * 100  # Scale up to 100
                vp['value_functions'][j, k] = 'Min Increasing|0.3'
                vp['objective_target'][j, k] = 1

            elif objective == 'Merit':
                vp['objective_weight'][j, k] = (np.random.rand() * 0.4 + 0.05) * 100
                vp['value_functions'][j, k] = 'Min Increasing|-0.3'
                vp['objective_target'][j, k] = p['sum_merit'] / p['N']
                actual = np.mean(p['merit'][p['I^E'][j]])

            elif objective == 'USAFA Proportion':
                vp['objective_weight'][j, k] = (np.random.rand() * 0.3 + 0.05) * 100
                vp['value_functions'][j, k] = 'Balance|0.15, 0.15, 0.1, 0.08, 0.08, 0.1, 0.6'
                vp['objective_target'][j, k] = p['usafa_proportion']
                actual = len(p['I^D'][objective][j]) / len(p['I^E'][j])

            elif objective == 'Combined Quota':
                vp['objective_weight'][j, k] = (np.random.rand() * 0.8 + 0.2) * 100
                vp['value_functions'][j, k] = 'Quota_Normal|0.2, 0.25, 0.2'
                vp['objective_target'][j, k] = p['quota_d'][j]

                # Get bounds and turn on this constraint
                minimum, maximum = p['quota_min'][j], p['quota_max'][j]
                vp['objective_value_min'][j, k] = str(int(minimum)) + ", " + str(int(maximum))
                vp['constraint_type'][j, k] = 2

            elif objective == 'USAFA Quota':
                vp['objective_weight'][j, k] = 0
                vp['value_functions'][j, k] = 'Min Increasing|0.3'
                vp['objective_target'][j, k] = p['usafa_quota'][j]

                # Bounds on this constraint (but leave it off)
                vp['objective_value_min'][j, k] = str(int(p['usafa_quota'][j])) + ", " + \
                                                  str(int(p['quota_max'][j]))

            elif objective == 'ROTC Quota':
                vp['objective_weight'][j, k] = 0
                vp['value_functions'][j, k] = 'Min Increasing|0.3'
                vp['objective_target'][j, k] = p['rotc_quota'][j]

                # Bounds on this constraint (but leave it off)
                vp['objective_value_min'][j, k] = str(int(p['rotc_quota'][j])) + ", " + \
                                                  str(int(p['quota_max'][j]))

            # If we care about this objective, we load in its value function breakpoints
            if vp['objective_weight'][j, k] != 0:

                # Create the non-linear piecewise exponential segment dictionary
                segment_dict = afccp.data.values.create_segment_dict_from_string(
                    vp['value_functions'][j, k], vp['objective_target'][j, k],
                    minimum=minimum, maximum=maximum, actual=actual)

                # Linearize the non-linear function using the specified number of breakpoints
                vp['a'][j][k], vp['f^hat'][j][k] = afccp.data.values.value_function_builder(
                    segment_dict, num_breakpoints=num_breakpoints)

        # Scale the objective weights for this AFSC, so they sum to 1
        vp['objective_weight'][j] = vp['objective_weight'][j] / sum(vp['objective_weight'][j])
        vp['K^A'][j] = np.where(vp['objective_weight'][j] != 0)[0]

    return vp
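The per-AFSC normalization at the end of the loop (scale the nonzero objective weights to sum to 1, then record the active objective set `K^A`) can be illustrated standalone; the weights below are hypothetical:

```python
import numpy as np

# Raw objective weights for one AFSC; zeros mark objectives that are turned off
weights = np.array([30.0, 0.0, 45.0, 25.0])

# Scale so the weights sum to 1, then record the indices of active objectives
weights = weights / weights.sum()
active = np.where(weights != 0)[0]
```

Keeping the active index set alongside the normalized weights lets the model skip inactive objectives entirely rather than multiplying through by zero.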

generate_extra_components(parameters)

Generate additional components (bases, training courses, and timing factors) for a CadetCareerProblem instance.

This function augments the problem parameters with synthetic bases (locations), base capacities, cadet base preferences, training courses, and training start distributions. It also assigns weights to AFSC, base, and training preferences, enabling richer downstream optimization scenarios.

Parameters

parameters : dict
    The problem parameter dictionary for a CadetCareerProblem instance. Must contain:

    - M : int
        Number of AFSCs.
    - N : int
        Number of cadets.
    - S : int
        Number of bases to generate.
    - pgl : np.ndarray
        PGL targets per AFSC.
    - acc_grp : np.ndarray
        Accession group labels per AFSC (e.g., "Rated", "USSF", "NRL").
    - usafa : np.ndarray
        Indicator for USAFA cadets.

Returns

dict
    Updated parameters dictionary with additional fields:

    - afsc_assign_base : np.ndarray
        Flags for AFSCs assigned to bases.
    - bases : np.ndarray
        Names of generated bases.
    - base_min, base_max : np.ndarray
        Min/max base capacities per AFSC.
    - base_preferences : dict
        Cadet-level base preference lists.
    - b_pref_matrix, base_utility : np.ndarray
        Matrices encoding cadet base preferences and utilities.
    - baseline_date : datetime.date
        Baseline date for training course scheduling.
    - training_preferences, training_threshold, base_threshold : np.ndarray
        Randomized cadet-level training/base thresholds and preferences.
    - weight_afsc, weight_base, weight_course : np.ndarray
        Weights for AFSC vs base vs course assignment importance.
    - training_start : np.ndarray
        Cadet training start dates (distribution differs for USAFA vs ROTC).
    - courses, course_start, course_min, course_max : dict
        Course identifiers, schedules, and capacities by AFSC.
    - T : np.ndarray
        Number of courses per AFSC.

Workflow

  1. Base Assignment

    • Randomly selects which AFSCs require base-level assignments.
    • Generates base names from Excel-style column naming (A, B, ..., AA, etc.).
    • Distributes base capacities (base_min, base_max) across AFSCs.
  2. Cadet Base Preferences

    • Randomly assigns each cadet preferences over bases.
    • Generates a preference matrix (b_pref_matrix) and base utilities (base_utility).
  3. Training Preferences

    • Creates training preference labels (Early vs Late) and thresholds.
    • Allocates random weights for AFSC, base, and training course priorities.
  4. Training Start Dates

    • USAFA cadets start late May.
    • ROTC cadets follow a spring/late graduation distribution.
  5. Training Courses

    • Generates course identifiers (random strings of letters).
    • Randomizes start dates and max capacities.
    • Computes T, the number of courses per AFSC.

Notes

  • baseline_date is set to Jan 1 of the year after the current system year.
  • Weights are normalized per cadet to sum to 1.
  • Utility values are randomized but ensure first-choice base has utility 1.0.

Examples

p = {'M': 5, 'N': 100, 'S': 3,
     'pgl': np.array([10, 20, 15, 30, 25]),
     'acc_grp': np.array(["NRL", "Rated", "NRL", "USSF", "NRL"]),
     'usafa': np.random.randint(0, 2, size=100)}
p = generate_extra_components(p)
p.keys()

Example Output:

dict_keys([... 'bases', 'base_preferences', 'training_start', 'courses', 'T' ...])
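The Excel-style base naming mentioned in the workflow (A, B, ..., Z, AA, AB, ...) can be sketched as follows; this `excel_style_name` helper is hypothetical, not the function used in basic.py:

```python
def excel_style_name(index):
    """Map 0 -> 'A', 25 -> 'Z', 26 -> 'AA', mirroring spreadsheet columns."""
    name = ""
    n = index + 1  # work in 1-based "bijective base-26"
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        name = chr(ord("A") + remainder) + name
    return name

# Generate names for S bases, e.g. the first 28: A ... Z, AA, AB
bases = [excel_style_name(s) for s in range(28)]
```

The `divmod(n - 1, 26)` step is what makes the scheme "bijective": there is no zero digit, so Z rolls over to AA rather than to A0.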

Source code in afccp/data/generation/basic.py
def generate_extra_components(parameters):
    """
    Generate additional components (bases, training courses, and timing factors)
    for a CadetCareerProblem instance.

    This function augments the problem parameters with synthetic **bases** (locations),
    **base capacities**, **cadet base preferences**, **training courses**, and
    **training start distributions**. It also assigns weights to AFSC, base, and
    training preferences, enabling richer downstream optimization scenarios.

    Parameters
    ----------
    parameters : dict
        The problem parameter dictionary for a `CadetCareerProblem` instance.
        Must contain:
        - `M` : int
            Number of AFSCs.
        - `N` : int
            Number of cadets.
        - `S` : int
            Number of bases to generate.
        - `pgl` : np.ndarray
            PGL targets per AFSC.
        - `acc_grp` : np.ndarray
            Accession group labels per AFSC (e.g., "Rated", "USSF", "NRL").
        - `usafa` : np.ndarray
            Indicator for USAFA cadets.

    Returns
    -------
    dict
        Updated parameters dictionary with additional fields:
        - `afsc_assign_base` : np.ndarray
            Flags for AFSCs assigned to bases.
        - `bases` : np.ndarray
            Names of generated bases.
        - `base_min`, `base_max` : np.ndarray
            Min/max base capacities per AFSC.
        - `base_preferences` : dict
            Cadet-level base preference lists.
        - `b_pref_matrix`, `base_utility` : np.ndarray
            Matrices encoding cadet base preferences and utilities.
        - `baseline_date` : datetime.date
            Baseline date for training course scheduling.
        - `training_preferences`, `training_threshold`, `base_threshold` : np.ndarray
            Randomized cadet-level training/base thresholds and preferences.
        - `weight_afsc`, `weight_base`, `weight_course` : np.ndarray
            Weights for AFSC vs base vs course assignment importance.
        - `training_start` : np.ndarray
            Cadet training start dates (distribution differs for USAFA vs ROTC).
        - `courses`, `course_start`, `course_min`, `course_max` : dict
            Course identifiers, schedules, and capacities by AFSC.
        - `T` : np.ndarray
            Number of courses per AFSC.

    Workflow
    --------
    1. **Base Assignment**
        - Randomly selects which AFSCs require base-level assignments.
        - Generates base names from Excel-style column naming (`A`, `B`, ..., `AA`, etc.).
        - Distributes base capacities (`base_min`, `base_max`) across AFSCs.

    2. **Cadet Base Preferences**
        - Randomly assigns each cadet preferences over bases.
        - Generates a preference matrix (`b_pref_matrix`) and base utilities (`base_utility`).

    3. **Training Preferences**
        - Creates training preference labels (`Early` vs `Late`) and thresholds.
        - Allocates random weights for AFSC, base, and training course priorities.

    4. **Training Start Dates**
        - USAFA cadets start late May.
        - ROTC cadets follow a spring/late graduation distribution.

    5. **Training Courses**
        - Generates course identifiers (random strings of letters).
        - Randomizes start dates and max capacities.
        - Computes `T`, the number of courses per AFSC.

    Notes
    -----
    - `baseline_date` is set to **Jan 1 of the year after the current system year**.
    - Weights are normalized per cadet to sum to 1.
    - Utility values are randomized but ensure first-choice base has utility 1.0.

    Examples
    --------
    ```python
    p = {'M': 5, 'N': 100, 'S': 3,
         'pgl': np.array([10, 20, 15, 30, 25]),
         'acc_grp': np.array(["NRL", "Rated", "NRL", "USSF", "NRL"]),
         'usafa': np.random.randint(0, 2, size=100)}
    p = generate_extra_components(p)
    p.keys()
    ```

    Example Output:
    ```
    dict_keys([... 'bases', 'base_preferences', 'training_start', 'courses', 'T' ...])
    ```
    """

    # Shorthand
    p = parameters

    # Get list of ordered letters (based on Excel column names)
    alphabet = list(string.ascii_uppercase)
    excel_columns = copy.deepcopy(alphabet)
    for letter in alphabet:
        for letter_2 in alphabet:
            excel_columns.append(letter + letter_2)

    # Determine which AFSCs we assign bases for
    p['afsc_assign_base'] = np.zeros(p['M']).astype(int)
    for j in range(p['M']):
        if p['acc_grp'][j] != "Rated" and np.random.rand() > 0.3:
            p['afsc_assign_base'][j] = 1

    # Name the bases according to the Excel columns (just a method of generating unique ordered letters)
    p['bases'] = np.array(["Base " + excel_columns[b] for b in range(p['S'])])

    # Get capacities for each AFSC at each base
    p['base_min'] = np.zeros((p['S'], p['M'])).astype(int)
    p['base_max'] = np.zeros((p['S'], p['M'])).astype(int)
    afscs_with_base_assignments = np.where(p['afsc_assign_base'])[0]
    for j in afscs_with_base_assignments:
        total_max = p['pgl'][j] * 1.5
        base_max = np.array([np.random.rand() for _ in range(p['S'])])
        base_max = (base_max / np.sum(base_max)) * total_max
        p['base_max'][:, j] = base_max.astype(int)
        p['base_min'][:, j] = (base_max * 0.4).astype(int)

    # Generate random cadet preferences for bases
    bases = copy.deepcopy(p['bases'])
    p['base_preferences'] = {}
    p['b_pref_matrix'] = np.zeros((p['N'], p['S'])).astype(int)
    p['base_utility'] = np.zeros((p['N'], p['S']))
    for i in range(p['N']):
        random.shuffle(bases)
        num_base_pref = np.random.choice(np.arange(2, p['S'] + 1))
        p['base_preferences'][i] = np.array([np.where(p['bases'] == base)[0][0] for base in bases[:num_base_pref]])

        # Convert to base preference matrix
        p['b_pref_matrix'][i, p['base_preferences'][i]] = np.arange(1, len(p['base_preferences'][i]) + 1)

        utilities = np.around(np.random.rand(num_base_pref), 2)
        p['base_utility'][i, p['base_preferences'][i]] = np.sort(utilities)[::-1]
        p['base_utility'][i, p['base_preferences'][i][0]] = 1.0  # First choice is always utility of 1!

    # Get the baseline starting date (January 1st of the year we're classifying)
    next_year = datetime.datetime.now().year + 1
    p['baseline_date'] = datetime.date(next_year, 1, 1)

    # Generate training preferences for each cadet
    p['training_preferences'] = np.array(
        [random.choices(['Early', 'Late'], weights=[0.9, 0.1])[0] for _ in range(p['N'])])

    # Generate base/training "thresholds" for when these preferences kick in (based on preferences for AFSCs)
    p['training_threshold'] = np.array([np.random.choice(np.arange(p['M'] + 1)) for _ in range(p['N'])])
    p['base_threshold'] = np.array([np.random.choice(np.arange(p['M'] + 1)) for _ in range(p['N'])])

    # Generate weights for AFSCs, bases, and courses
    p['weight_afsc'], p['weight_base'], p['weight_course'] = np.zeros(p['N']), np.zeros(p['N']), np.zeros(p['N'])
    for i in range(p['N']):

        # Force some percentage of cadets to make their threshold the last possible AFSC (this means these don't matter)
        if np.random.rand() > 0.8:
            p['base_threshold'][i] = p['M']
        if np.random.rand() > 0.7:
            p['training_threshold'][i] = p['M']

        # Generate weights for bases, training (courses), and AFSCs
        if p['base_threshold'][i] == p['M']:
            w_b = 0
        else:
            w_b = np.random.triangular(0, 50, 100)
        if p['training_threshold'][i] == p['M']:
            w_c = 0
        else:
            w_c = np.random.triangular(0, 20, 100)
        w_a = np.random.triangular(0, 90, 100)

        # Scale weights so that they sum to one and load into arrays
        p['weight_afsc'][i] = w_a / (w_a + w_b + w_c)
        p['weight_base'][i] = w_b / (w_a + w_b + w_c)
        p['weight_course'][i] = w_c / (w_a + w_b + w_c)

    # Generate training start dates for each cadet
    p['training_start'] = []
    for i in range(p['N']):

        # If this cadet is a USAFA cadet
        if p['usafa'][i]:

            # Make it May 28th of this year
            p['training_start'].append(datetime.date(next_year, 5, 28))

        # If it's an ROTC cadet, we sample from two different distributions (on-time and late grads)
        else:

            # 80% should be in spring
            if np.random.rand() < 0.8:
                dt = datetime.date(next_year, 4, 15) + datetime.timedelta(int(np.random.triangular(0, 30, 60)))
                p['training_start'].append(dt)

            # 20% should be after
            else:
                dt = datetime.date(next_year, 6, 1) + datetime.timedelta(int(np.random.triangular(0, 30*5, 30*6)))
                p['training_start'].append(dt)
    p['training_start'] = np.array(p['training_start'])

    # Generate training courses for each AFSC
    p['courses'], p['course_start'], p['course_min'], p['course_max'] = {}, {}, {}, {}
    p['course_count'] = np.zeros(p['M'])
    for j in range(p['M']):

        # Determine total number of training slots to divide up
        total_max = p['pgl'][j] * 1.5

        # Determine number of courses to generate
        if total_max <= 3:
            T = 1
        elif total_max <= 10:
            T = np.random.choice([1, 2])
        elif total_max < 25:
            T = np.random.choice([2, 3])
        elif total_max < 100:
            T = np.random.choice([3, 4, 5])
        else:
            T = np.random.choice([4, 5, 6, 7, 8, 9])

        # Course minimums and maximums
        random_nums = np.random.rand(T)
        p['course_max'][j] = np.around(total_max * (random_nums / np.sum(random_nums))).astype(int)
        p['course_min'][j] = np.zeros(T).astype(int)

        # Generate course specific information
        p['courses'][j], p['course_start'][j] = [], []
        current_date = p['baseline_date'] + datetime.timedelta(int(np.random.triangular(30*5, 30*9, 30*11)))
        for _ in range(T):

            # Course names (random strings of letters)
            num_letters = random.choice(np.arange(4, 10))
            p['courses'][j].append(''.join(random.choices(alphabet, k=num_letters)))

            # Course start date
            p['course_start'][j].append(current_date)

            # Get next course start date
            current_date += datetime.timedelta(int(np.random.triangular(30, 30*4, 30*6)))

        # Convert to numpy arrays
        for param in ['courses', 'course_start', 'course_max', 'course_min']:
            p[param][j] = np.array(p[param][j])

    # Number of training courses per AFSC
    p['T'] = np.array([len(p['courses'][j]) for j in range(p['M'])])

    # Return updated parameters
    return p
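The per-cadet weighting above (triangular draws, zeroed when the cadet's threshold makes a component irrelevant, then normalized to sum to one) can be isolated into a minimal sketch. The function name and seeding below are illustrative, not part of the module:

```python
import numpy as np

def sample_cadet_weights(base_opt_out=False, course_opt_out=False, seed=None):
    """Draw AFSC/base/course weights as in generate_extra_components:
    triangular draws, zeroed on opt-out, normalized to sum to 1."""
    rng = np.random.default_rng(seed)
    w_b = 0.0 if base_opt_out else rng.triangular(0, 50, 100)
    w_c = 0.0 if course_opt_out else rng.triangular(0, 20, 100)
    w_a = rng.triangular(0, 90, 100)
    total = w_a + w_b + w_c
    return w_a / total, w_b / total, w_c / total

w_afsc, w_base, w_course = sample_cadet_weights(seed=0)
```

Because the AFSC weight's mode (90) dominates the base (50) and course (20) modes, AFSC match quality tends to carry most of each cadet's objective weight.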

generate_concave_curve(num_points, max_x)

Generates x and y coordinates for a concave function.

Args:
    num_points (int): Number of points to generate.
    max_x (float): Maximum value along the x-axis.

Returns:
    tuple: (x_values, y_values) as numpy arrays.

Source code in afccp/data/generation/basic.py
def generate_concave_curve(num_points, max_x):
    """
    Generates x and y coordinates for a concave function.

    Args:
        num_points (int): Number of points to generate.
        max_x (float): Maximum value along the x-axis.

    Returns:
        tuple: (x_values, y_values) as numpy arrays.
    """
    x_values = np.linspace(0, max_x, num_points)
    y_values = 1 - np.exp(-x_values / (max_x / 6))  # Adjust curvature
    return x_values, y_values
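A quick sanity check of the diminishing-returns shape (the function is reproduced from above; the inspection at the end is illustrative):

```python
import numpy as np

def generate_concave_curve(num_points, max_x):
    x_values = np.linspace(0, max_x, num_points)
    y_values = 1 - np.exp(-x_values / (max_x / 6))  # Adjust curvature
    return x_values, y_values

x, y = generate_concave_curve(num_points=10, max_x=100)
# y starts at 0, rises steeply, then flattens as it approaches 1:
# each additional unit of x yields less additional y (concavity).
```

The divisor `max_x / 6` pins the curve's scale to the x-range, so `y` reaches roughly 1 - e^-6 (about 0.998) at `max_x` regardless of the range chosen.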

generate_realistic_castle_value_curves(parameters, num_breakpoints: int = 10)

Generate Concave Value Curves for CASTLE AFSCs.

Creates piecewise linear approximations of realistic concave value functions for each CASTLE-level AFSC. These curves are used to evaluate the marginal utility of inventory across AFSCs, enabling smooth optimization and modeling in the CASTLE simulation.

Parameters:
    parameters (dict): Problem instance parameters containing CASTLE AFSC groups and PGL values.
    num_breakpoints (int, optional): Number of breakpoints to use in the piecewise value curve. Defaults to 10.

Returns:
    dict: A dictionary q containing the following keys for each CASTLE AFSC:
    - 'a': Array of x-values (inventory levels).
    - 'f^hat': Array of corresponding y-values (utility).
    - 'r': Number of breakpoints.
    - 'L': Index array of breakpoints.

Example:

q = generate_realistic_castle_value_curves(parameters, num_breakpoints=12)
x_vals = q['a']['21A']       # x-values for AFSC 21A
y_vals = q['f^hat']['21A']   # corresponding utility values

See Also: - generate_concave_curve: Generates a concave (diminishing returns) curve with specified number of points and max range.

Source code in afccp/data/generation/basic.py
def generate_realistic_castle_value_curves(parameters, num_breakpoints: int = 10):
    """
    Generate Concave Value Curves for CASTLE AFSCs.

    Creates piecewise linear approximations of realistic concave value functions for each CASTLE-level AFSC.
    These curves are used to evaluate the marginal utility of inventory across AFSCs, enabling smooth
    optimization and modeling in the CASTLE simulation.

    Parameters:
        parameters (dict): Problem instance parameters containing CASTLE AFSC groups and PGL values.
        num_breakpoints (int, optional): Number of breakpoints to use in the piecewise value curve.
            Defaults to 10.

    Returns:
        dict: A dictionary `q` containing the following keys for each CASTLE AFSC:
            - `'a'`: Array of x-values (inventory levels).
            - `'f^hat'`: Array of corresponding y-values (utility).
            - `'r'`: Number of breakpoints.
            - `'L'`: Index array of breakpoints.

    Example:
        ```python
        q = generate_realistic_castle_value_curves(parameters, num_breakpoints=12)
        x_vals = q['a']['21A']       # x-values for AFSC 21A
        y_vals = q['f^hat']['21A']   # corresponding utility values
        ```

    See Also:
        - [`generate_concave_curve`](../../../afccp/reference/data/generation/#data.generation.generate_concave_curve):
          Generates a concave (diminishing returns) curve with specified number of points and max range.
    """
    # Shorthand
    p = parameters

    # Define "q" dictionary for value function components
    q = {'a': {}, 'f^hat': {}, 'r': {}, 'L': {}}
    for afsc in p['castle_afscs']:
        # Sum up the PGL targets for all "AFPC" AFSCs grouped for this "CASTLE" AFSC
        pgl = np.sum(p['pgl'][p['J^CASTLE'][afsc]])

        # Generate x and y coordinates for concave shape
        x, y = generate_concave_curve(num_points=num_breakpoints, max_x=pgl * 2)

        # Save breakpoint information to q dictionary
        q['a'][afsc], q['f^hat'][afsc] = x, y
        q['r'][afsc], q['L'][afsc] = len(x), np.arange(len(x))

    return q
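A toy walk-through of the breakpoint construction above, using a hypothetical one-AFSC parameters dict (`generate_concave_curve` is reproduced from earlier in this module; the AFSC name and PGL values are made up):

```python
import numpy as np

def generate_concave_curve(num_points, max_x):
    x = np.linspace(0, max_x, num_points)
    return x, 1 - np.exp(-x / (max_x / 6))

# Hypothetical parameters: one CASTLE AFSC grouping two AFPC AFSCs
p = {'castle_afscs': ['21A'],
     'J^CASTLE': {'21A': np.array([0, 1])},
     'pgl': np.array([30, 20])}

q = {'a': {}, 'f^hat': {}, 'r': {}, 'L': {}}
for afsc in p['castle_afscs']:
    # Combined PGL target for the grouped AFPC AFSCs: 30 + 20 = 50
    pgl = np.sum(p['pgl'][p['J^CASTLE'][afsc]])
    # Breakpoints span twice the PGL target, so over-manning has value too
    x, y = generate_concave_curve(num_points=10, max_x=pgl * 2)
    q['a'][afsc], q['f^hat'][afsc] = x, y
    q['r'][afsc], q['L'][afsc] = len(x), np.arange(len(x))
```

The resulting `(a, f^hat)` pairs are exactly the breakpoints a piecewise-linear value function needs in the CASTLE optimization model.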

train_ctgan(epochs=1000, printing=True, name='CTGAN_Full')

Train CTGAN to produce realistic data based on the current "ctgan_data" file in the support sub-folder, then save the trained synthesizer as a ".pkl" file back to the same folder.

Source code in afccp/data/generation/realistic.py
def train_ctgan(epochs=1000, printing=True, name='CTGAN_Full'):
    """
    Train CTGAN to produce realistic data based on the current "ctgan_data" file in the support
    sub-folder, then save the trained synthesizer as a ".pkl" file back to the same folder.
    """

    # Import data
    data = afccp.globals.import_csv_data(afccp.globals.paths['support'] + 'data/ctgan_data.csv')
    data = data[[col for col in data.columns if col not in ['YEAR']]]
    metadata = SingleTableMetadata()  # SDV requires this now
    metadata.detect_from_dataframe(data=data)  # get the metadata from dataframe

    # Create the synthesizer model
    model = CTGANSynthesizer(metadata, epochs=epochs, verbose=True)

    # List of constraints for CTGAN
    constraints = []

    # Get list of columns that must be between 0 and 1
    zero_to_one_columns = ["Merit"]
    for col in data.columns:
        if "_Cadet" in col or "_AFSC" in col:
            zero_to_one_columns.append(col)

    # Create the "zero to one" constraints and add them to our list of constraints
    for col in zero_to_one_columns:
        zero_to_one_constraint = {"constraint_class": "ScalarRange",
                                  "constraint_parameters": {
                                      'column_name': col,
                                      'low_value': 0,
                                      'high_value': 1,
                                      'strict_boundaries': False
                                  }}
        constraints.append(zero_to_one_constraint)

    # Add the constraints to the model
    model.add_constraints(constraints)

    # Train the model
    if printing:
        print("Training the model...")
    model.fit(data)

    # Save the model
    filepath = afccp.globals.paths["support"] + name + '.pkl'
    model.save(filepath)
    if printing:
        print("Model saved to", filepath)

generate_ctgan_instance(N=1600, name='CTGAN_Full', pilot_condition=False, degree_qual_type='Consistent')

This procedure takes in the specified number of cadets and then generates a representative problem instance using CTGAN that has been trained from a real class year of cadets

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `N` | int | Number of cadets. | `1600` |
| `name` | str | Name of the CTGAN model to import. | `'CTGAN_Full'` |
| `pilot_condition` | bool | If we want to sample cadets according to pilot preferences (make this more representative). | `False` |

Returns:

| Type | Description |
| --- | --- |
| dict | Model fixed parameters. |
Source code in afccp/data/generation/realistic.py
def generate_ctgan_instance(N=1600, name='CTGAN_Full', pilot_condition=False, degree_qual_type='Consistent'):
    """
    This procedure takes in the specified number of cadets and then generates a representative problem
    instance using CTGAN that has been trained from a real class year of cadets
    :param pilot_condition: If we want to sample cadets according to pilot preferences
    (make this more representative)
    :param name: Name of the CTGAN model to import
    :param N: number of cadets
    :return: model fixed parameters
    """

    # Load in the model
    filepath = afccp.globals.paths["support"] + name + '.pkl'
    model = CTGANSynthesizer.load(filepath)

    # Split up the number of ROTC/USAFA cadets
    N_usafa = round(np.random.triangular(0.25, 0.33, 0.4) * N)
    N_rotc = N - N_usafa

    # Pilot is by far the #1 desired career field, let's make sure this is represented here
    N_usafa_pilots = round(np.random.triangular(0.3, 0.4, 0.43) * N_usafa)
    N_usafa_generic = N_usafa - N_usafa_pilots
    N_rotc_pilots = round(np.random.triangular(0.25, 0.3, 0.33) * N_rotc)
    N_rotc_generic = N_rotc - N_rotc_pilots

    # Condition the data generated to produce the right composition of pilot first choice preferences
    usafa_pilot_first_choice = Condition(num_rows = N_usafa_pilots, column_values={'SOC': 'USAFA', '11XX_Cadet': 1})
    usafa_generic_cadets = Condition(num_rows=N_usafa_generic, column_values={'SOC': 'USAFA'})
    rotc_pilot_first_choice = Condition(num_rows=N_rotc_pilots, column_values={'SOC': 'ROTC', '11XX_Cadet': 1})
    rotc_generic_cadets = Condition(num_rows=N_rotc_generic, column_values={'SOC': 'ROTC'})

    # Sample data  (Sampling from conditions may take too long!)
    if pilot_condition:
        data = model.sample_from_conditions(conditions=[usafa_pilot_first_choice, usafa_generic_cadets,
                                                        rotc_pilot_first_choice, rotc_generic_cadets])
    else:
        data = model.sample(N)

    # Load in AFSCs data
    filepath = afccp.globals.paths["support"] + 'data/afscs_data.csv'
    afscs_data = afccp.globals.import_csv_data(filepath)

    # Get list of AFSCs
    afscs = np.array(afscs_data['AFSC'])

    # Initialize parameter dictionary
    p = {'afscs': afscs, 'N': N, 'P': len(afscs), 'M': len(afscs), 'merit': np.array(data['Merit']),
         'cadets': np.arange(N), 'usafa': np.array(data['SOC'] == 'USAFA') * 1,
         'cip1': np.array(data['CIP1']), 'cip2': np.array(data['CIP2']), 'num_util': 10,  # 10 utilities taken
         'rotc': np.array(data['SOC'] == 'ROTC'), 'I': np.arange(N), 'J': np.arange(len(afscs))}

    # Clean up degree columns (remove the leading "c" I put there if it's there)
    for i in p['I']:
        if p['cip1'][i][0] == 'c':
            p['cip1'][i] = p['cip1'][i][1:]
        if p['cip2'][i][0] == 'c':
            p['cip2'][i] = p['cip2'][i][1:]

    # Create "SOC" variable
    p['soc'] = np.array(['USAFA' if p['usafa'][i] == 1 else "ROTC" for i in p['I']])

    # Fix percentiles for USAFA and ROTC
    re_scaled_om = p['merit']
    for soc in ['usafa', 'rotc']:
        indices = np.where(p[soc])[0]  # Indices of these SOC-specific cadets
        percentiles = p['merit'][indices]  # The percentiles of these cadets
        N = len(percentiles)  # Number of cadets from this SOC
        sorted_indices = np.argsort(percentiles)[::-1]  # Sort these percentiles (descending)
        new_percentiles = (np.arange(N)) / (N - 1)  # New percentiles we want to replace these with
        magic_indices = np.argsort(sorted_indices)  # Indices that let us put the new percentiles in right place
        new_percentiles = new_percentiles[magic_indices]  # Put the new percentiles back in the right place
        np.put(re_scaled_om, indices, new_percentiles)  # Put these new percentiles in combined SOC OM spot

    # Replace merit
    p['merit'] = re_scaled_om

    # Add AFSC features to parameters
    p['acc_grp'] = np.array(afscs_data['Accessions Group'])
    p['Deg Tiers'] = np.array(afscs_data.loc[:, 'Deg Tier 1': 'Deg Tier 4'])
    p['Deg Tiers'][pd.isnull(p["Deg Tiers"])] = ''  # TODO

    # Determine AFSCs by Accessions Group
    p['afscs_acc_grp'] = {}
    if 'acc_grp' in p:
        for acc_grp in ['Rated', 'USSF', 'NRL']:
            p['J^' + acc_grp] = np.where(p['acc_grp'] == acc_grp)[0]
            p['afscs_acc_grp'][acc_grp] = p['afscs'][p['J^' + acc_grp]]

    # Useful data elements to help us generate PGL targets
    usafa_prop, rotc_prop, pgl_prop = np.array(afscs_data['USAFA Proportion']), \
                                      np.array(afscs_data['ROTC Proportion']), \
                                      np.array(afscs_data['PGL Proportion'])

    # Total targets needed to distribute
    total_targets = int(p['N'] * min(0.95, np.random.normal(0.93, 0.08)))

    # PGL targets
    p['pgl'] = np.zeros(p['M']).astype(int)
    p['usafa_quota'] = np.zeros(p['M']).astype(int)
    p['rotc_quota'] = np.zeros(p['M']).astype(int)
    for j in p['J']:

        # Create the PGL target by sampling from the PGL proportion triangular distribution
        p_min = max(0, 0.8 * pgl_prop[j])
        p_max = 1.2 * pgl_prop[j]
        prop = np.random.triangular(p_min, pgl_prop[j], p_max)
        p['pgl'][j] = int(max(1, prop * total_targets))

        # Get the ROTC proportion of this PGL target to allocate
        if rotc_prop[j] in [1, 0]:
            prop = rotc_prop[j]
        else:
            rotc_p_min = max(0, 0.8 * rotc_prop[j])
            rotc_p_max = min(1, 1.2 * rotc_prop[j])
            prop = np.random.triangular(rotc_p_min, rotc_prop[j], rotc_p_max)

        # Create the SOC-specific targets
        p['rotc_quota'][j] = int(prop * p['pgl'][j])
        p['usafa_quota'][j] = p['pgl'][j] - p['rotc_quota'][j]

    # Initialize the other pieces of information here
    for param in ['quota_e', 'quota_d', 'quota_min', 'quota_max']:
        p[param] = p['pgl']

    # Break up USSF and 11XX AFSC by SOC
    for afsc in ['USSF', '11XX']:
        for col in ['Cadet', 'AFSC']:
            for soc in ['USAFA', 'ROTC']:
                data[f'{afsc}_{soc[0]}_{col}'] = 0
                data.loc[data['SOC'] == soc, f'{afsc}_{soc[0]}_{col}'] = data.loc[data['SOC'] == soc, f'{afsc}_{col}']

    c_pref_cols = [f'{afsc}_Cadet' for afsc in afscs]
    util_original = np.around(np.array(data[c_pref_cols]), 2)

    # Initialize cadet preference information
    p['c_utilities'] = np.zeros((p['N'], 10))
    p['c_preferences'] = np.array([[' ' * 6 for _ in range(p['M'])] for _ in range(p['N'])])
    p['cadet_preferences'] = {}
    p['c_pref_matrix'] = np.zeros((p['N'], p['M'])).astype(int)
    p['utility'] = np.zeros((p['N'], p['M']))

    # Loop through each cadet to tweak their preferences
    for i in p['cadets']:

        # Manually fix 62EXE preferencing from eligible cadets
        ee_j = np.where(afscs == '62EXE')[0][0]
        if '1410' in data.loc[i, 'CIP1'] or '1447' in data.loc[i, 'CIP1']:
            if np.random.rand() > 0.6:
                util_original[i, ee_j] = np.around(max(util_original[i, ee_j], min(1, np.random.normal(0.8, 0.18))),
                                                   2)

        # Fix rated/USSF volunteer situation
        for acc_grp in ['Rated', 'USSF']:
            if data.loc[i, f'{acc_grp} Vol']:
                if np.max(util_original[i, p[f'J^{acc_grp}']]) < 0.6:
                    util_original[i, p[f'J^{acc_grp}']] = 0
                    data.loc[i, f'{acc_grp} Vol'] = False
            else:  # Not a volunteer

                # We have a higher preference for these kinds of AFSCs
                if np.max(util_original[i, p[f'J^{acc_grp}']]) >= 0.6:
                    data.loc[i, f'{acc_grp} Vol'] = True  # Make them a volunteer now

        # Was this the last choice AFSC? Remove from our lists
        ordered_list = np.argsort(util_original[i])[::-1]
        last_choice = data.loc[i, 'Last Choice']
        if last_choice in afscs:
            j = np.where(afscs == last_choice)[0][0]
            ordered_list = ordered_list[ordered_list != j]

        # Add the "2nd least desired AFSC" to list
        second_last_choice = data.loc[i, '2nd-Last Choice']
        bottom = []
        if second_last_choice in afscs and second_last_choice != last_choice:  # Check if valid and not the last choice
            j = np.where(afscs == second_last_choice)[0][0]  # Get index of AFSC
            ordered_list = ordered_list[ordered_list != j]  # Remove index from preferences
            bottom.append(second_last_choice)  # Add it to the list of bottom choices

        # If it's a valid AFSC that isn't already in the bottom choices
        third_last_choice = data.loc[i, '3rd-Last Choice']  # Add the "3rd least desired AFSC" to list
        if third_last_choice in afscs and third_last_choice not in [last_choice, second_last_choice]:
            j = np.where(afscs == third_last_choice)[0][0]  # Get index of AFSC
            ordered_list = ordered_list[ordered_list != j]  # Remove index from preferences
            bottom.append(third_last_choice)  # Add it to the list of bottom choices

        # If we have an AFSC in the bottom choices, but NOT the LAST choice, move one to the last choice
        if len(bottom) > 0 and pd.isnull(last_choice):
            afsc = bottom.pop(0)
            data.loc[i, 'Last Choice'] = afsc
        data.loc[i, 'Second Least Desired AFSCs'] = ', '.join(bottom)  # Put it in the dataframe

        # Save cadet preference information
        num_pref = 10 if np.random.rand() > 0.1 else int(np.random.triangular(11, 15, 26))
        p['c_utilities'][i] = util_original[i, ordered_list[:10]]
        p['cadet_preferences'][i] = ordered_list[:num_pref]
        p['c_preferences'][i, :num_pref] = afscs[p['cadet_preferences'][i]]
        p['c_pref_matrix'][i, p['cadet_preferences'][i]] = np.arange(1, len(p['cadet_preferences'][i]) + 1)
        p['utility'][i, p['cadet_preferences'][i][:10]] = p['c_utilities'][i]

    # Get qual matrix information
    p['Qual Type'] = degree_qual_type
    p = afccp.data.adjustments.gather_degree_tier_qual_matrix(cadets_df=None, parameters=p)

    # Get the qual matrix to know what people are eligible for
    ineligible = (np.core.defchararray.find(p['qual'], "I") != -1) * 1
    eligible = (ineligible == 0) * 1
    I_E = [np.where(eligible[:, j])[0] for j in p['J']]  # set of cadets that are eligible for AFSC j

    # Modify AFSC utilities based on eligibility
    a_pref_cols = [f'{afsc}_AFSC' for afsc in afscs]
    p['afsc_utility'] = np.around(np.array(data[a_pref_cols]), 2)
    for acc_grp in ['Rated', 'USSF']:
        for j in p['J^' + acc_grp]:
            volunteer_col = np.array(data[f'{acc_grp} Vol'])
            volunteers = np.where(volunteer_col)[0]
            not_volunteers = np.where(volunteer_col == False)[0]
            ranked = np.where(p['afsc_utility'][:, j] > 0)[0]
            unranked = np.where(p['afsc_utility'][:, j] == 0)[0]

            # Fill in utility values with OM for rated folks who don't have an AFSC score
            volunteer_unranked = np.intersect1d(volunteers, unranked)
            p['afsc_utility'][volunteer_unranked, j] = p['merit'][volunteer_unranked]

            # If the cadet didn't actually volunteer, they should have utility of 0
            non_volunteer_ranked = np.intersect1d(not_volunteers, ranked)
            p['afsc_utility'][non_volunteer_ranked, j] = 0

    # Remove cadets from this AFSC's preferences if the cadet is not eligible
    for j in p['J^NRL']:

        # Get appropriate sets of cadets
        eligible_cadets = I_E[j]
        ineligible_cadets = np.where(ineligible[:, j])[0]
        ranked_cadets = np.where(p['afsc_utility'][:, j] > 0)[0]
        unranked_cadets = np.where(p['afsc_utility'][:, j] == 0)[0]

        # Fill in utility values with OM for eligible folks who don't have an AFSC score
        eligible_unranked = np.intersect1d(eligible_cadets, unranked_cadets)
        p['afsc_utility'][eligible_unranked, j] = p['merit'][eligible_unranked]

        # If the cadet isn't actually eligible, they should have utility of 0
        ineligible_ranked = np.intersect1d(ineligible_cadets, ranked_cadets)
        p['afsc_utility'][ineligible_ranked, j] = 0

    # Collect AFSC preference information
    p['afsc_preferences'] = {}
    p['a_pref_matrix'] = np.zeros((p['N'], p['M'])).astype(int)
    for j in p['J']:

        # Sort the utilities to get the preference list
        utilities = p["afsc_utility"][:, j]
        ineligible_indices = np.where(utilities == 0)[0]
        sorted_indices = np.argsort(utilities)[::-1][:p['N'] - len(ineligible_indices)]
        p['afsc_preferences'][j] = sorted_indices

        # Since 'afsc_preferences' is an array of AFSC indices, we can do this
        p['a_pref_matrix'][p['afsc_preferences'][j], j] = np.arange(1, len(p['afsc_preferences'][j]) + 1)

    # Needed information for rated OM matrices
    dataset_dict = {'rotc': 'rr_om_matrix', 'usafa': 'ur_om_matrix'}
    cadets_dict = {'rotc': 'rr_om_cadets', 'usafa': 'ur_om_cadets'}
    p["Rated Cadets"] = {}

    # Create rated OM matrices for each SOC
    for soc in ['usafa', 'rotc']:

        # Rated AFSCs for this SOC
        if soc == 'rotc':
            rated_J_soc = np.array([j for j in p['J^Rated'] if '_U' not in p['afscs'][j]])
        else:  # usafa
            rated_J_soc = np.array([j for j in p['J^Rated'] if '_R' not in p['afscs'][j]])

        # Cadets from this SOC
        soc_cadets = np.where(p[soc])[0]

        # Determine which cadets are eligible for at least one rated AFSC
        p["Rated Cadets"][soc] = np.array([i for i in soc_cadets if np.sum(p['c_pref_matrix'][i, rated_J_soc]) > 0])
        p[cadets_dict[soc]] = p["Rated Cadets"][soc]

        # Initialize OM dataset
        p[dataset_dict[soc]] = np.zeros([len(p["Rated Cadets"][soc]), len(rated_J_soc)])

        # Create OM dataset
        for col, j in enumerate(rated_J_soc):

            # Get the maximum rank someone had
            max_rank = np.max(p['a_pref_matrix'][p["Rated Cadets"][soc], j])

            # Loop through each cadet to convert rank to percentile
            for row, i in enumerate(p["Rated Cadets"][soc]):
                rank = p['a_pref_matrix'][i, j]
                if rank == 0:
                    p[dataset_dict[soc]][row, col] = 0
                else:
                    p[dataset_dict[soc]][row, col] = (max_rank - rank + 1) / max_rank

    # Return parameters
    return p
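The rank-to-percentile conversion used for the rated OM matrices above can be sketched in isolation (the function name here is illustrative, not part of the afccp API): rank 1 maps to 1.0, the worst rank maps to `1/max_rank`, and rank 0 (unranked) stays 0.

```python
import numpy as np

def rank_to_percentile(ranks: np.ndarray) -> np.ndarray:
    """Convert 1-based preference ranks to OM percentiles; 0 means unranked."""
    max_rank = ranks.max()
    out = np.zeros(len(ranks), dtype=float)
    ranked = ranks > 0
    # Mirrors (max_rank - rank + 1) / max_rank from the loop above
    out[ranked] = (max_rank - ranks[ranked] + 1) / max_rank
    return out
```

For example, ranks `[1, 2, 3, 0]` yield percentiles `[1.0, 2/3, 1/3, 0.0]`.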

augment_2026_data_with_ots(N: int = 3000, import_name: str = '2026_0', export_name: str = '2026O')

Augment a base instance with a synthetic OTS cohort and export a new, fully wired instance.

This pipeline loads the trained CTGAN model and historical CTGAN training table, samples N realistic cadets (with extra emphasis on degree‑scarce AFSCs), converts them to OTS, re-computes OM and AFSC utilities under AFCCP rules (eligibility & volunteer logic), and stitches the new cohort into all downstream CSVs (Cadets, Preferences, Utilities, AFSC Preferences, CASTLE input, etc.). The result is written to instances/{export_name}/4. Model Input/.

Parameters

N : int, optional
    Number of OTS cadets to generate (default 3000).
import_name : str, optional
    Name of the source instance to copy/extend (e.g., '2026_0'). Reads input
    CSVs from instances/{import_name}/4. Model Input/.
export_name : str, optional
    Name of the destination instance to create (e.g., '2026O'). Writes outputs
    to instances/{export_name}/4. Model Input/.

Workflow

1) Load CTGAN training data (<support>/data/ctgan_data.csv) and AFSCs for the source instance.
2) Load CTGAN model (<support>/CTGAN_Full.pkl).
3) Targeted sampling for degree-scarce AFSCs via generate_data_with_degree_preference_fixes (with KDE utility bootstrapping), then sample the remainder from the CTGAN.
4) Force SOC to OTS, re-scale OM and blend AFSC utilities with OM / cadet utility using re_calculate_ots_om_and_afsc_rankings.
5) Align volunteers and degree fields for OTS with align_ots_preferences_and_degrees_somewhat (USSF turned off for OTS).
6) Build AFCCP parameter dict and eligibility-aware AFSC utilities with construct_parameter_dictionary_and_augment_data (zero for ineligible/non-volunteer; OM backfill where appropriate).
7) Rebuild AFSC preference rankings and matrices with construct_full_afsc_preferences_data, and cadet-side preferences/utilities with construct_full_cadets_data.
8) Merge everything with existing source CSVs via compile_new_dataframes and export.

Files Read

  • <support>/data/ctgan_data.csv
  • <support>/CTGAN_Full.pkl
  • instances/{import_name}/4. Model Input/{import_name} AFSCs.csv
  • instances/{import_name}/4. Model Input/{import_name} AFSCs Preferences.csv
  • instances/{import_name}/4. Model Input/{import_name} Cadets.csv
  • instances/{import_name}/4. Model Input/{import_name} Castle Input.csv

Files Written (to instances/{export_name}/4. Model Input/)

  • {export_name} Cadets.csv
  • {export_name} AFSCs Preferences.csv
  • {export_name} AFSCs.csv (copied base AFSCs, unchanged schema)
  • {export_name} Raw Data.csv (the assembled OTS sampling table)
  • {export_name} Castle Input.csv
  • Plus augmented matrices produced by compile_new_dataframes (e.g., Cadets Preferences, Cadets Utility, Cadets Selected, AFSCs Buckets, OTS Rated OM).

Returns

None
    Side-effects only. Progress is printed to stdout; artifacts are saved to disk.

Notes

  • Assumes the CTGAN model is saved as <support>/CTGAN_Full.pkl.
  • Assumes the source instance (import_name) contains the standard AFCCP CSVs under 4. Model Input/ with 2026‑style schemas.
  • OTS candidates are excluded from USSF by construction (USSF Vol = False, utilities set to 0).
  • Rated OM for OTS is derived from OM where needed, filtered by eligibility.

Examples

>>> augment_2026_data_with_ots(N=3000, import_name='2026_0', export_name='2026O')

Source code in afccp/data/generation/realistic.py
def augment_2026_data_with_ots(N: int = 3000, import_name: str = '2026_0', export_name: str = '2026O'):
    """
    Augment a base instance with a synthetic OTS cohort and export a new, fully wired instance.

    This pipeline loads the trained CTGAN model and historical CTGAN training table, samples **N**
    realistic cadets (with extra emphasis on degree‑scarce AFSCs), converts them to OTS,
    re-computes OM and AFSC utilities under AFCCP rules (eligibility & volunteer logic), and
    stitches the new cohort into all downstream CSVs (Cadets, Preferences, Utilities, AFSC
    Preferences, CASTLE input, etc.). The result is written to
    `instances/{export_name}/4. Model Input/`.

    Parameters
    ----------
    N : int, optional
        Number of OTS cadets to generate (default 3000).
    import_name : str, optional
        Name of the *source* instance to copy/extend (e.g., `'2026_0'`).
        Reads input CSVs from `instances/{import_name}/4. Model Input/`.
    export_name : str, optional
        Name of the *destination* instance to create (e.g., `'2026O'`).
        Writes outputs to `instances/{export_name}/4. Model Input/`.

    Workflow
    --------
    1) Load CTGAN training data (`<support>/data/ctgan_data.csv`) and AFSCs for the source instance.
    2) Load CTGAN model (`<support>/CTGAN_Full.pkl`).
    3) Targeted sampling for degree‑scarce AFSCs via
       `generate_data_with_degree_preference_fixes` (with KDE utility bootstrapping), then
       sample the remainder from the CTGAN.
    4) Force SOC to `OTS`, re‑scale OM and blend AFSC utilities with OM / cadet utility using
       `re_calculate_ots_om_and_afsc_rankings`.
    5) Align volunteers and degree fields for OTS with `align_ots_preferences_and_degrees_somewhat`
       (USSF turned off for OTS).
    6) Build AFCCP parameter dict and eligibility‑aware AFSC utilities with
       `construct_parameter_dictionary_and_augment_data` (zero for ineligible/non‑volunteer;
       OM backfill where appropriate).
    7) Rebuild AFSC preference rankings and matrices with
       `construct_full_afsc_preferences_data`, and cadet‑side preferences/utilities with
       `construct_full_cadets_data`.
    8) Merge everything with existing source CSVs via `compile_new_dataframes` and export.

    Files Read
    ----------
    - `<support>/data/ctgan_data.csv`
    - `<support>/CTGAN_Full.pkl`
    - `instances/{import_name}/4. Model Input/{import_name} AFSCs.csv`
    - `instances/{import_name}/4. Model Input/{import_name} AFSCs Preferences.csv`
    - `instances/{import_name}/4. Model Input/{import_name} Cadets.csv`
    - `instances/{import_name}/4. Model Input/{import_name} Castle Input.csv`

    Files Written (to `instances/{export_name}/4. Model Input/`)
    ------------------------------------------------------------
    - `{export_name} Cadets.csv`
    - `{export_name} AFSCs Preferences.csv`
    - `{export_name} AFSCs.csv` (copied base AFSCs, unchanged schema)
    - `{export_name} Raw Data.csv` (the assembled OTS sampling table)
    - `{export_name} Castle Input.csv`
    - Plus augmented matrices produced by `compile_new_dataframes`
      (e.g., Cadets Preferences, Cadets Utility, Cadets Selected, AFSCs Buckets, OTS Rated OM).

    Returns
    -------
    None
        Side‑effects only. Progress is printed to stdout; artifacts are saved to disk.

    Notes
    -----
    - Assumes the CTGAN model is saved as `<support>/CTGAN_Full.pkl`.
    - Assumes the source instance (`import_name`) contains the standard AFCCP CSVs under
      `4. Model Input/` with 2026‑style schemas.
    - OTS candidates are excluded from USSF by construction (`USSF Vol = False`, utilities set to 0).
    - Rated OM for OTS is derived from OM where needed, filtered by eligibility.

    Examples
    --------
    >>> augment_2026_data_with_ots(N=3000, import_name='2026_0', export_name='2026O')
    """

    # Load in original data
    print('Loading in data...')
    filepath = afccp.globals.paths["support"] + 'data/ctgan_data.csv'
    full_data = pd.read_csv(filepath)
    cadet_cols = np.array([col for col in full_data.columns if '_Cadet' in col])

    # Import 'AFSCs' data
    filepath = f'instances/{import_name}/4. Model Input/{import_name} AFSCs.csv'
    afscs_df = afccp.globals.import_csv_data(filepath)
    afscs = np.array([col.split('_')[0] for col in cadet_cols])

    # Load in the model
    print('Loading in model...')
    filepath = afccp.globals.paths["support"] + 'CTGAN_Full.pkl'
    model = CTGANSynthesizer.load(filepath)

    # Sample the data
    print('Sampling data...')
    data_degrees = generate_data_with_degree_preference_fixes(model, full_data, afscs_df)
    data_all_else = model.sample(N - len(data_degrees))
    data = pd.concat((data_degrees, data_all_else), ignore_index=True)

    # These are all OTS candidates now!
    data['SOC'] = 'OTS'

    # Determine AFSCs by accessions group
    rated = np.array([np.where(cadet_cols == f'{afsc}_Cadet')[0][0] for afsc in ['11XX', '12XX', '13B', '18X']])
    afscs_acc_grp = {'Rated': rated, 'USSF': np.array([0])}

    # Re-calculate OM/AFSC Rankings for OTS
    print('Modifying data...')
    data = re_calculate_ots_om_and_afsc_rankings(data)

    # OTS isn't going to USSF
    data['USSF Vol'], data['USSF_Cadet'], data['USSF_AFSC'] = False, 0, 0
    data = align_ots_preferences_and_degrees_somewhat(data, afscs_acc_grp)

    # Non-rated AFSC indices
    nrl_indices = np.array(
        [np.where(afscs == afsc)[0][0] for afsc in afscs if afsc not in ['USSF', '11XX', '12XX', '13B', '18X']])

    # Construct the parameter dictionary and adjust AFSC utilities
    data, p = construct_parameter_dictionary_and_augment_data(
        data, afscs, afscs_df, afscs_acc_grp, nrl_indices=nrl_indices)

    # Import AFSCs Preferences data
    filepath = f'instances/{import_name}/4. Model Input/{import_name} AFSCs Preferences.csv'
    a_pref_df = afccp.globals.import_csv_data(filepath)

    # Construct the full AFSC preference data
    full_a_pref_df = construct_full_afsc_preferences_data(p, a_pref_df, afscs, nrl_indices)

    # Import 'Cadets' dataframe
    filepath = f'instances/{import_name}/4. Model Input/{import_name} Cadets.csv'
    cadets_df = afccp.globals.import_csv_data(filepath)

    # Construct the cadets data
    full_cadets_df = construct_full_cadets_data(p, cadets_df, data, afscs)

    # Import CASTLE data
    filepath = f'instances/{import_name}/4. Model Input/{import_name} Castle Input.csv'
    castle_df = afccp.globals.import_csv_data(filepath)

    # Dictionary of dataframes to export with new OTS 2026 instance
    print('Compiling current 2026 data...')
    new_dfs = {'Cadets': full_cadets_df, 'AFSCs Preferences': full_a_pref_df, 'AFSCs': afscs_df, 'Raw Data': data,
               'Castle Input': castle_df}
    new_dfs = compile_new_dataframes(new_dfs, p, cadets_df, afscs, rated, data, import_name)

    # Export new dataframes for new instance
    print('Export new data instance...')
    folder_path = f'instances/{export_name}/4. Model Input/'
    os.makedirs(folder_path, exist_ok=True)
    for df_name, df in new_dfs.items():
        print(f'Data: "{df_name}", Shape: {np.shape(df)}')
        filepath = f'{folder_path}{export_name} {df_name}.csv'
        df.to_csv(filepath, index=False)

generate_data_with_degree_preference_fixes(model, full_data, afscs_df)

Generate synthetic cadet data for rare AFSCs, preserving degree distribution preferences and realistic cadet/AFSC utilities.

This function focuses on AFSCs that are difficult to fill ("rare" AFSCs), generating synthetic cadets in a way that matches observed degree patterns (CIP1) from historical data. It uses extract_afsc_cip_sampling_information to determine sampling quotas and conditions, and sample_cadets_for_degree_conditions to produce matching synthetic records. Cadet and AFSC utility values are then resampled for realism.

Parameters

model : object
    A generative model instance (e.g., CTGAN) implementing
    sample_from_conditions(conditions) to produce synthetic cadets.
full_data : pandas.DataFrame
    Full dataset containing historical cadet and AFSC information.
afscs_df : pandas.DataFrame
    DataFrame containing AFSC metadata, including 'OTS Target' values.

Returns

pandas.DataFrame
    Synthetic dataset containing cadets for rare AFSCs with realistic degree
    distributions and utility values.

Notes

  • Rare AFSC eligibility is hardcoded as a list of AFSC strings in this function.
  • Degree sampling is biased toward more common CIPs by cubic weighting (proportions ∝ frequency³).
  • Cadet and AFSC utilities are drawn from kernel density estimators (KDEs) fitted on historical data.
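The KDE-based utility samplers mentioned above can be sketched with `scipy.stats.gaussian_kde`; this is a minimal stand-in for the module's `fit_kde_sampler` helper (the name and clipping to [0, 1] here are assumptions, not the exact afccp implementation):

```python
import numpy as np
from scipy.stats import gaussian_kde

def fit_kde_sampler_sketch(values):
    """Fit a 1-D Gaussian KDE to historical utilities; return a sampler of n draws."""
    kde = gaussian_kde(np.asarray(values, dtype=float))

    def sampler(n: int) -> np.ndarray:
        # resample returns shape (1, n); flatten and clip to the utility range
        return np.clip(kde.resample(n).flatten(), 0.0, 1.0)

    return sampler
```

A sampler fitted this way draws new utility values that follow the empirical distribution of the historical cadet/AFSC pairs.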

See Also

  • extract_afsc_cip_sampling_information
  • sample_cadets_for_degree_conditions

Source code in afccp/data/generation/realistic.py
def generate_data_with_degree_preference_fixes(model, full_data, afscs_df):
    """
    Generate synthetic cadet data for rare AFSCs, preserving degree distribution
    preferences and realistic cadet/AFSC utilities.

    This function focuses on AFSCs that are difficult to fill ("rare" AFSCs),
    generating synthetic cadets in a way that matches observed degree patterns
    (CIP1) from historical data. It uses
    [`extract_afsc_cip_sampling_information`](../../../reference/data/generation/#data.generation.extract_afsc_cip_sampling_information)
    to determine sampling quotas and conditions, and
    [`sample_cadets_for_degree_conditions`](../../../reference/data/generation/#data.generation.sample_cadets_for_degree_conditions)
    to produce matching synthetic records. Cadet and AFSC utility values are
    then resampled for realism.

    Parameters
    ----------
    model : object
        A generative model instance (e.g., CTGAN) implementing
        `sample_from_conditions(conditions)` to produce synthetic cadets.
    full_data : pandas.DataFrame
        Full dataset containing historical cadet and AFSC information.
    afscs_df : pandas.DataFrame
        DataFrame containing AFSC metadata, including 'OTS Target' values.

    Returns
    -------
    pandas.DataFrame
        Synthetic dataset containing cadets for rare AFSCs with realistic
        degree distributions and utility values.

    Notes
    -----
    - Rare AFSC eligibility is hardcoded as a list of AFSC strings in this
      function.
    - Degree sampling is biased toward more common CIPs by cubic weighting
      (proportions ∝ frequency³).
    - Cadet and AFSC utilities are drawn from kernel density estimators
      (KDEs) fitted on historical data.

    See Also
    --------
    - [`extract_afsc_cip_sampling_information`](../../../afccp/reference/data/generation/#data.generation.extract_afsc_cip_sampling_information)
    - [`sample_cadets_for_degree_conditions`](../../../afccp/reference/data/generation/#data.generation.sample_cadets_for_degree_conditions)
    """

    # Filter dataframe to rare AFSCs (degree-wise)
    afscs_rare_eligible = ['13H', '32EXA', '32EXC', '32EXE', '32EXF',
                           '32EXJ', '61C', '61D', '62EXC', '62EXE', '62EXH', '62EXI']
    afscs_rare_df = afscs_df.set_index('AFSC')['OTS Target'].loc[afscs_rare_eligible]

    # Extract data generating parameters
    total_gen, afsc_cip_data, afsc_cip_conditions, afsc_util_samplers, cadet_util_samplers = \
        extract_afsc_cip_sampling_information(full_data, afscs_rare_eligible, afscs_rare_df)

    # Generate the data
    data = sample_cadets_for_degree_conditions(model, total_gen, afscs_rare_eligible, afsc_cip_data,
                                               afsc_cip_conditions)

    # Modify the utilities for the cadet/AFSC pairs
    i = 0
    for afsc in afscs_rare_eligible:
        for cip, count in afsc_cip_data[afsc].items():
            count = int(count)
            data.loc[i:i + count - 1, f'{afsc}_Cadet'] = cadet_util_samplers[afsc](count)
            data.loc[i:i + count - 1, f'{afsc}_AFSC'] = afsc_util_samplers[afsc](count)
            i += count

    return data

extract_afsc_cip_sampling_information(full_data, afscs_rare_eligible, afscs_rare_df)

Extract degree distribution and utility sampling information for rare AFSCs.

This function identifies cadets who have strong mutual preference with specific rare AFSCs (both the AFSC ranks the cadet highly and the cadet ranks the AFSC highly), determines the distribution of primary degrees (CIP1) for those cadets, and constructs constraints to ensure proportional representation in generated synthetic data. It also fits kernel density estimators (KDEs) to model cadet and AFSC utility scores for each AFSC-degree combination.

Parameters

full_data : pandas.DataFrame
    Full dataset containing cadet records with columns for degree codes (CIP1),
    AFSC utilities (<AFSC>_AFSC), and cadet preferences (<AFSC>_Cadet).
afscs_rare_eligible : list of str
    List of AFSC codes considered rare and eligible for targeted sampling.
afscs_rare_df : pandas.DataFrame or pandas.Series
    Data structure mapping each AFSC to the number of cadets needed to meet
    quotas for that AFSC.

Returns

total_gen : int
    Total number of synthetic cadets to generate across all rare AFSCs.
afsc_cip_data : dict
    Mapping of {afsc: pandas.Series} where the Series index is degree codes (CIP1)
    and values are the number of cadets to generate for each degree.
afsc_cip_conditions : dict
    Mapping {afsc: {cip: Condition}} specifying generation constraints for each
    AFSC-degree combination.
afsc_util_samplers : dict
    Mapping {afsc: callable} returning AFSC utility samples for a given AFSC.
cadet_util_samplers : dict
    Mapping {afsc: callable} returning cadet utility samples for a given AFSC.

Notes

  • Only cadets with mutual interest scores > 0.6 for a given AFSC are considered.
  • Degree frequencies are cubed to overweight common degrees, then scaled to match target generation counts using safe_round.
  • For AFSC 62EXE, target counts are halved due to quota filling difficulty.
  • Generation quotas are inflated by 40% or at least 3 extra cadets to ensure adequate representation.
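The cubic weighting in the notes above can be illustrated numerically; plain `np.round` stands in here for the module's sum-preserving `safe_round`, and the frequency counts are hypothetical:

```python
import numpy as np

# Hypothetical CIP1 frequency counts among matched cadets (illustrative only)
d = np.array([10, 5, 1], dtype=float)

# Cube the frequencies so common degrees dominate the sampling mix
weights = d ** 3
proportions = weights / weights.sum()   # approx. [0.888, 0.111, 0.001]

# Scale to a generation target; the real code uses safe_round to keep the sum exact
num_gen = 20
counts = np.round(proportions * num_gen).astype(int)   # e.g. [18, 2, 0]
```

Cubing sharpens the distribution: a degree twice as common as another becomes eight times as likely to be sampled.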
Source code in afccp/data/generation/realistic.py
def extract_afsc_cip_sampling_information(full_data, afscs_rare_eligible, afscs_rare_df):
    """
    Extract degree distribution and utility sampling information for rare AFSCs.

    This function identifies cadets who have strong mutual preference with specific
    rare AFSCs (both the AFSC ranks the cadet highly and the cadet ranks the AFSC
    highly), determines the distribution of primary degrees (CIP1) for those cadets,
    and constructs constraints to ensure proportional representation in generated
    synthetic data. It also fits kernel density estimators (KDEs) to model cadet and
    AFSC utility scores for each AFSC-degree combination.

    Parameters
    ----------
    full_data : pandas.DataFrame
        Full dataset containing cadet records with columns for degree codes (`CIP1`),
        AFSC utilities (`<AFSC>_AFSC`), and cadet preferences (`<AFSC>_Cadet`).
    afscs_rare_eligible : list of str
        List of AFSC codes considered rare and eligible for targeted sampling.
    afscs_rare_df : pandas.DataFrame or pandas.Series
        Data structure mapping each AFSC to the number of cadets needed to meet
        quotas for that AFSC.

    Returns
    -------
    total_gen : int
        Total number of synthetic cadets to generate across all rare AFSCs.
    afsc_cip_data : dict
        Mapping of `{afsc: pandas.Series}` where the Series index is degree codes
        (CIP1) and values are the number of cadets to generate for each degree.
    afsc_cip_conditions : dict
        Mapping `{afsc: {cip: Condition}}` specifying generation constraints for each
        AFSC-degree combination.
    afsc_util_samplers : dict
        Mapping `{afsc: callable}` returning AFSC utility samples for a given AFSC.
    cadet_util_samplers : dict
        Mapping `{afsc: callable}` returning cadet utility samples for a given AFSC.

    Notes
    -----
    - Only cadets with mutual interest scores > 0.6 for a given AFSC are considered.
    - Degree frequencies are cubed to overweight common degrees, then scaled to match
      target generation counts using [`safe_round`](../../../reference/data/processing/#data.processing.safe_round).
    - For AFSC `62EXE`, target counts are halved due to quota filling difficulty.
    - Generation quotas are inflated by 40% or at least 3 extra cadets to ensure
      adequate representation.
    """

    afsc_cip_data = {}
    afsc_util_samplers = {}
    cadet_util_samplers = {}
    afsc_cip_conditions = {}
    total_gen = 0
    for afsc in afscs_rare_eligible:

        # Filter the real data on people who wanted this AFSC, and the AFSC wanted them
        conditions = (full_data[f'{afsc}_AFSC'] > 0.6) & (full_data[f'{afsc}_Cadet'] > 0.6)
        columns = ['YEAR', 'CIP1', 'CIP2', 'Merit', 'SOC', f'{afsc}_Cadet', f'{afsc}_AFSC']

        # Get the degrees of these people
        d = full_data.loc[conditions][columns]['CIP1'].value_counts()
        degrees = np.array(d.index)

        # Figure out how many degrees we have to ensure are present in this newly created dataset
        val = int(afscs_rare_df.loc[afsc])
        if afsc == '62EXE':  # We struggle to fill this quota!!
            val = val / 2
        num_gen = np.ceil(max(val * 1.4, val + 3))
        proportions = np.array(d ** 3) / np.array(d ** 3).sum()  # Tip the scales in favor of the more common CIP
        counts = safe_round(proportions * num_gen)
        afsc_cip_data[afsc] = pd.Series(counts, index=degrees)  # Save the degree information for this AFSC
        afsc_cip_data[afsc] = afsc_cip_data[afsc][afsc_cip_data[afsc] > 0]

        # Save functions to sample cadet/AFSC utilities for the ones with these degrees
        afsc_util_samplers[afsc] = fit_kde_sampler(list(full_data.loc[conditions][columns][f'{afsc}_AFSC']))
        cadet_util_samplers[afsc] = fit_kde_sampler(list(full_data.loc[conditions][columns][f'{afsc}_Cadet']))

        afsc_cip_conditions[afsc] = {}
        for cip, count in afsc_cip_data[afsc].items():
            condition = Condition(num_rows=int(count), column_values={"CIP1": cip})
            afsc_cip_conditions[afsc][cip] = condition
            total_gen += count

    return total_gen, afsc_cip_data, afsc_cip_conditions, afsc_util_samplers, cadet_util_samplers

safe_round(data, decimals: int = 0, axis: int = -1)

Round values while preserving the sum along a given axis.

This function rounds data to decimals decimal places but adjusts a minimal subset of elements so that the rounded values sum to the same (rounded) total as the original, slice‑by‑slice along axis. It does this by distributing the leftover rounding “units” to the entries whose fractional parts are most favorable (largest magnitude residuals with the correct sign), using a stable tie‑break so results are deterministic.

Parameters

data : numpy.ndarray
    Input array to round. Must be numeric. (Other array-likes are coerced;
    behavior is only guaranteed for NumPy arrays.)
decimals : int, optional
    Number of decimal places to keep (default 0).
axis : int, optional
    Axis along which to preserve the slice sums (default -1). Each 1D slice
    along this axis will have its rounded sum equal to the original sum rounded
    to decimals.

Returns

numpy.ndarray or same type as data when feasible
    Rounded array with the same shape as data. If data is a NumPy array, a NumPy
    array is returned. For some other types, the function attempts to reconstruct
    the input type after rounding.

Notes

  • Let S = sum(data, axis) and S_r = round(S, decimals). The output y satisfies sum(y, axis) == S_r exactly (up to floating‑point representation).
  • Within each slice, the adjustment is minimal in the sense that only the elements with the largest compatible residuals are modified by ± one unit in the scaled space (10**decimals).
  • Time complexity is O(n log n) per slice due to sorting; memory usage is linear in the slice size.
  • This procedure does not enforce monotonicity or ordering of values.

Examples

>>> import numpy as np
>>> x = np.array([0.24, 0.24, 0.24, 0.24, 0.04])
>>> x.sum(), round(x.sum(), 2)
(1.0, 1.0)
>>> y = safe_round(x, decimals=1, axis=0)
>>> y
array([0.2, 0.2, 0.2, 0.2, 0.2])
>>> y.sum()
1.0

>>> X = np.array([[0.333, 0.333, 0.334],
...               [0.125, 0.125, 0.750]])
>>> Y = safe_round(X, decimals=2, axis=1)
>>> Y
array([[0.33, 0.33, 0.34],
       [0.12, 0.13, 0.75]])
>>> Y.sum(axis=1)
array([1.  , 1.  ])

Source code in afccp/data/generation/realistic.py
def safe_round(data, decimals: int = 0, axis: int = -1):
    """
    Round values while preserving the sum along a given axis.

    This function rounds `data` to `decimals` decimal places but adjusts a minimal
    subset of elements so that the rounded values sum to the same (rounded) total
    as the original, slice‑by‑slice along `axis`. It does this by distributing the
    leftover rounding “units” to the entries whose fractional parts are most
    favorable (largest magnitude residuals with the correct sign), using a stable
    tie‑break so results are deterministic.

    Parameters
    ----------
    data : numpy.ndarray
        Input array to round. Must be numeric. (Other array‑likes are coerced;
        behavior is only guaranteed for NumPy arrays.)
    decimals : int, optional
        Number of decimal places to keep (default 0).
    axis : int, optional
        Axis along which to preserve the slice sums (default -1). Each 1D slice
        along this axis will have its rounded sum equal to the original sum
        rounded to `decimals`.

    Returns
    -------
    numpy.ndarray or same type as `data` when feasible
        Rounded array with the same shape as `data`. If `data` is a NumPy array,
        a NumPy array is returned. For some other types, the function attempts to
        reconstruct the input type after rounding.

    Notes
    -----
    - Let `S = sum(data, axis)` and `S_r = round(S, decimals)`. The output `y`
      satisfies `sum(y, axis) == S_r` exactly (up to floating‑point representation).
    - Within each slice, the adjustment is minimal in the sense that only the
      elements with the largest compatible residuals are modified by ± one unit
      in the scaled space (10**decimals).
    - Time complexity is `O(n log n)` per slice due to sorting; memory usage is
      linear in the slice size.
    - This procedure does not enforce monotonicity or ordering of values.

    Examples
    --------
    >>> import numpy as np
    >>> x = np.array([0.24, 0.24, 0.24, 0.24, 0.04])
    >>> x.sum(), round(x.sum(), 2)
    (1.0, 1.0)
    >>> y = safe_round(x, decimals=1, axis=0)
    >>> y
    array([0.2, 0.2, 0.2, 0.2, 0.2])
    >>> y.sum()
    1.0

    >>> X = np.array([[0.333, 0.333, 0.334],
    ...               [0.125, 0.125, 0.750]])
    >>> Y = safe_round(X, decimals=2, axis=1)
    >>> Y
    array([[0.33, 0.33, 0.34],
           [0.12, 0.13, 0.75]])
    >>> Y.sum(axis=1)
    array([1.  , 1.  ])
    """
    data_type = type(data)
    constructor = {}

    # 1) Scale by 10^decimals
    scale = 10.0 ** decimals
    scaled = data * scale

    # 2) Naively round each element to the nearest integer
    rounded = np.rint(scaled)

    # 3) Compute how many integer "units" the sum *should* have in each slice
    sum_rounded = np.sum(rounded, axis=axis, keepdims=True)
    sum_desired = np.rint(np.sum(scaled, axis=axis, keepdims=True))
    difference = sum_desired - sum_rounded

    n = data.shape[axis]

    # 4) Distribute whole multiples of n evenly across each slice
    leftover_div = np.floor_divide(difference, n)
    leftover_mod = difference - leftover_div * n
    rounded += leftover_div

    # 5) Select the remaining elements to tweak by one unit each
    difference = scaled - rounded
    leftover_sign = np.sign(leftover_mod)
    difference_sign = np.sign(difference)
    candidate_mask = (difference_sign == leftover_sign) & (difference_sign != 0)
    sort_key = np.where(candidate_mask, -np.abs(difference), np.inf)
    sorted_idx = np.argsort(sort_key, axis=axis, kind='stable')

    ranks = np.empty_like(sorted_idx)
    shape_for_r = [1] * data.ndim
    shape_for_r[axis] = n
    r_array = np.arange(n, dtype=sorted_idx.dtype).reshape(shape_for_r)
    np.put_along_axis(ranks, sorted_idx, r_array, axis=axis)

    leftover_mod_int = np.abs(leftover_mod).astype(int)
    choose_mask = ranks < leftover_mod_int
    rounded += leftover_sign * choose_mask

    result = rounded / scale

    if data_type is np.ndarray:
        return result

    return data_type(result.squeeze(), **constructor)

sample_cadets_for_degree_conditions(model, total_gen, afscs_rare_eligible, afsc_cip_data, afsc_cip_conditions)

Generate synthetic cadets matching AFSC-degree sampling conditions.

Iterates over rare AFSCs and their associated degree quotas to generate synthetic cadets using the provided generative model. For each AFSC-degree combination, the function samples cadets that meet the degree condition constraints, appending them to a cumulative dataset.

Parameters

model : object
    A generative model instance (e.g., CTGAN) implementing
    sample_from_conditions(conditions) to produce synthetic cadets.
total_gen : int
    Total number of cadets to generate across all AFSC-degree combinations.
afscs_rare_eligible : list of str
    List of AFSC codes considered rare and eligible for targeted generation.
afsc_cip_data : dict
    Mapping {afsc: pandas.Series} where the Series index is degree codes (CIP1)
    and values are the number of cadets to generate for each degree.
afsc_cip_conditions : dict
    Mapping {afsc: {cip: Condition}} specifying generation constraints for each
    AFSC-degree combination.

Returns

pandas.DataFrame A concatenated dataset of synthetic cadets meeting all AFSC-degree constraints.

Notes

  • This function logs progress to the console, showing both the number and percentage of cadets generated so far.
  • Sampling order is AFSC-major, iterating over all degrees within each AFSC before moving to the next AFSC.
  • The count values in afsc_cip_data are expected to be integers or convertible to integers.
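The AFSC-major iteration order described above can be sketched with stub stand-ins for the SDV synthesizer and its Condition objects (the stub classes and the sample data below are illustrative; the real pipeline passes sdv Condition instances to model.sample_from_conditions):

```python
import pandas as pd

class StubCondition:
    """Stand-in for sdv's Condition: a row count plus fixed column values."""
    def __init__(self, num_rows, column_values):
        self.num_rows = num_rows
        self.column_values = column_values

class StubModel:
    """Stand-in for a CTGAN synthesizer exposing sample_from_conditions."""
    def sample_from_conditions(self, conditions):
        rows = []
        for cond in conditions:
            rows.extend([dict(cond.column_values)] * cond.num_rows)
        return pd.DataFrame(rows)

# AFSC-major loop: all degree conditions for one AFSC before the next AFSC
afsc_cip_conditions = {'62EXE': {'141001': StubCondition(2, {'CIP1': '141001'}),
                                 '140801': StubCondition(1, {'CIP1': '140801'})}}
model, frames = StubModel(), []
for afsc, cip_conditions in afsc_cip_conditions.items():
    for cip, condition in cip_conditions.items():
        frames.append(model.sample_from_conditions([condition]))
data = pd.concat(frames, ignore_index=True)   # one row per requested cadet
```

Each sampled frame satisfies its degree condition exactly, so the concatenated result matches the per-CIP quotas by construction.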
Source code in afccp/data/generation/realistic.py
def sample_cadets_for_degree_conditions(model, total_gen, afscs_rare_eligible, afsc_cip_data, afsc_cip_conditions):
    """
    Generate synthetic cadets matching AFSC-degree sampling conditions.

    Iterates over rare AFSCs and their associated degree quotas to generate
    synthetic cadets using the provided generative model. For each AFSC-degree
    combination, the function samples cadets that meet the degree condition
    constraints, appending them to a cumulative dataset.

    Parameters
    ----------
    model : object
        A generative model instance (e.g., CTGAN) implementing
        `sample_from_conditions(conditions)` to produce synthetic cadets.
    total_gen : int
        Total number of cadets to generate across all AFSC-degree combinations.
    afscs_rare_eligible : list of str
        List of AFSC codes considered rare and eligible for targeted generation.
    afsc_cip_data : dict
        Mapping `{afsc: pandas.Series}` where the Series index is degree codes (CIP1)
        and values are the number of cadets to generate for each degree.
    afsc_cip_conditions : dict
        Mapping `{afsc: {cip: Condition}}` specifying generation constraints for each
        AFSC-degree combination.

    Returns
    -------
    pandas.DataFrame
        A concatenated dataset of synthetic cadets meeting all AFSC-degree constraints.

    Notes
    -----
    - This function logs progress to the console, showing both the number and
      percentage of cadets generated so far.
    - Sampling order is AFSC-major, iterating over all degrees within each AFSC
      before moving to the next AFSC.
    - The `count` values in `afsc_cip_data` are expected to be integers or
      convertible to integers.
    """

    # Generate dataframe
    data = pd.DataFrame()
    i = 0
    for afsc in afscs_rare_eligible:
        for cip, count in afsc_cip_data[afsc].items():
            print(f'{afsc} {cip}: {int(count)}...')
            df_gen = model.sample_from_conditions([afsc_cip_conditions[afsc][cip]])
            data = pd.concat((data, df_gen), ignore_index=True)
            i += count
            print(f'{afsc} {cip}: ({int(i)}/{int(total_gen)}) {round((i / total_gen) * 100, 2)}% complete.')

    return data
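To illustrate how the inputs fit together, here is a hedged sketch that drives the same AFSC-major sampling loop with a stub in place of the generative model. `StubModel`, the tuple-shaped conditions, and the CIP codes are illustrative stand-ins, not the real SDV `Condition` API:

```python
import pandas as pd

class StubModel:
    # Hypothetical stand-in for a CTGAN-style generator: returns `n` rows,
    # all tagged with the requested CIP code.
    def sample_from_conditions(self, conditions):
        cip, n = conditions[0]
        return pd.DataFrame({"CIP1": [cip] * n})

# Illustrative quota inputs mirroring afsc_cip_data / afsc_cip_conditions
afsc_cip_data = {"62EXE": pd.Series({"1407": 2, "1410": 3})}
afsc_cip_conditions = {"62EXE": {"1407": ("1407", 2), "1410": ("1410", 3)}}

# The same loop structure as sample_cadets_for_degree_conditions
model = StubModel()
data = pd.DataFrame()
for afsc in ["62EXE"]:
    for cip, count in afsc_cip_data[afsc].items():
        df_gen = model.sample_from_conditions([afsc_cip_conditions[afsc][cip]])
        data = pd.concat((data, df_gen), ignore_index=True)

print(len(data))  # 5 rows: 2 for CIP 1407, 3 for CIP 1410
```

Because each degree quota is sampled independently and concatenated, the result grows one AFSC-degree combination at a time, which is what makes the per-combination progress logging in the real function possible.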